The PDF that knew too much

Marcus is a freelance accountant. Every January, he assembles a package of documents for each client’s tax return (payslips, P60s, bank statements, a covering letter) and combines them into a single PDF before emailing it to the client’s solicitor. The documents contain National Insurance numbers, bank account details, employer names, annual salaries, home addresses. He uploads them to a free online merger. The merged file comes back in a few seconds. Job done.

He’s done this for six years. He’s never thought about it.

Here is what most people don’t know about the inside of a PDF. A PDF is not a flat image. It’s a container format, closer to a ZIP file than a photograph. Inside a PDF you’ll typically find: the visible content (text, images, vector graphics), the fonts used to render that content, the document structure (page tree, cross-reference table), and metadata. Lots of metadata. Author name, creation date, modification date, the software that created the file, sometimes the scanner app or print driver that produced it, sometimes GPS coordinates carried in the EXIF data of a page photographed on a phone, sometimes comments or earlier revisions that are no longer visible on the page but are still present in the file structure.
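Some of that metadata is easy to see for yourself. The sketch below is a rough illustration rather than a real PDF parser: it searches the raw bytes of a file for the standard document information entries (Title, Author, Creator, Producer, CreationDate, ModDate) when they happen to be written as plain text. The function name `scanInfoEntries` is made up for this example, and it assumes the entries are stored as uncompressed literal strings, which is common but not guaranteed.

```typescript
// Standard keys of the PDF document information dictionary.
const INFO_KEYS = ["Title", "Author", "Creator", "Producer", "CreationDate", "ModDate"] as const;
type InfoKey = (typeof INFO_KEYS)[number];

// Hypothetical helper (not from any library): finds entries written as
// plain literal strings, e.g. `/Author (Jane Doe)`. It will miss values
// stored as hex strings, inside compressed object streams, or in XMP,
// and it returns values raw, without decoding backslash escapes.
function scanInfoEntries(bytes: Uint8Array): Partial<Record<InfoKey, string>> {
  // Map each byte to one character so binary sections cannot break decoding.
  let text = "";
  for (let i = 0; i < bytes.length; i++) {
    text += String.fromCharCode(bytes[i]);
  }

  const found: Partial<Record<InfoKey, string>> = {};
  for (const key of INFO_KEYS) {
    // "/Key", optional whitespace, then "(value)"; escaped characters are allowed inside.
    const pattern = new RegExp(`/${key}\\s*\\(((?:\\\\.|[^\\\\)])*)\\)`);
    const match = pattern.exec(text);
    if (match) {
      found[key] = match[1];
    }
  }
  return found;
}
```

Run against the bytes of a scanned payslip, a function like this will often return the name of the scanning app and the exact time of the scan. Anything that receives the file can read the same fields, and a proper parser can read far more.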

When Marcus uploads six documents to a PDF merger, he’s not just uploading text and images. He’s uploading a detailed record of how those documents were created, by whom, on what device, when, and sometimes where. The merger sees all of it.

What does it do with it? The honest answer is: it’s hard to know.

Free PDF tools operate under privacy policies that are written by lawyers to be defensible, not transparent. The standard formulation — “we do not sell your personal information to third parties” — is technically meaningful but practically limited. It says nothing about what happens to the document metadata that was extracted during processing. It says nothing about whether the documents are analysed for quality improvement purposes. It says nothing about who inside the company can access uploaded files, under what circumstances, or for how long. The phrase “we take your privacy seriously” appears in approximately every privacy policy ever written, including those of companies that have subsequently suffered data breaches involving millions of documents.

The retention question is the one that bothers security professionals most. “Files deleted within 24 hours” is a common promise. But 24 hours is a long time. It’s long enough for a backup to run. It’s long enough for a log entry to be created. It’s long enough for a file to end up in a place that wasn’t covered by the deletion policy. The gap between a company’s stated policy and its actual infrastructure behaviour is often significant, not because of malice, but because software is complicated and people make mistakes.

This is the specific risk of “free”: a free service still has to be paid for somehow. In many cases the payment is attention, in the form of advertising. In some cases it’s the user’s data. And in every case the processing runs on the provider’s infrastructure, which means the files are somewhere, on servers, for some period of time, accessible to some set of people and processes. The alternative, running the merge locally on the user’s own device, would make that business model impossible and the privacy question moot.

Merging PDFs is not a complicated operation. pdf-lib, a widely used open-source library, can load multiple PDF documents, copy their pages, and combine them into a new document in a browser tab. There is no step in this process that requires a server. The files never need to leave the device they’re already on. The tax documents Marcus is combining don’t need to visit a data centre to be merged. They just need a browser.
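To show how little is involved, here is a minimal sketch of a local merge using pdf-lib. It is an illustration under stated assumptions, not a description of how any particular tool is built: the function name `mergePdfs` is invented for this example, and it assumes the input files are unencrypted and have already been read into memory as byte arrays.

```typescript
import { PDFDocument } from "pdf-lib";

// Hypothetical helper: merges PDFs entirely in memory, in the order given.
// No network request is made at any point.
async function mergePdfs(inputs: Uint8Array[]): Promise<Uint8Array> {
  const merged = await PDFDocument.create();

  for (const bytes of inputs) {
    // Parse one source document from its raw bytes.
    const source = await PDFDocument.load(bytes);

    // Copy every page into the new document, then append the copies.
    const pages = await merged.copyPages(source, source.getPageIndices());
    for (const page of pages) {
      merged.addPage(page);
    }
  }

  // Serialise the combined document back to bytes.
  return merged.save();
}
```

In a browser, the inputs would typically come from a file picker, for example `new Uint8Array(await file.arrayBuffer())` for each selected file, and the result can be offered as a download by wrapping the returned bytes in a `Blob` with type `application/pdf`. Everything happens in the page’s own memory.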

The upload was always the unnecessary part.

fwip merges PDFs in your browser. Nothing leaves your device. Try it →