Compare PDFs Visually

Sometimes a problem seems hard, but the right insight can make it easy. If you were asked to write a program to compare two PDF files and show the differences, how hard do you think that would be? If you are [serhack], you’ll make it much easier than you might guess.

Of course, sometimes making something simple depends on making simplifying assumptions. If you are expecting a “diff-like” utility that shows insertions and deletions, that’s not what’s going on here. Instead, you’ll see an image of the PDF with the changes highlighted by red boxes. This is easy because the program uses available utilities to render the PDFs as images and then simply compares pixels in the resulting images, drawing red boxes over the parts that don’t match.
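
To make the idea concrete, here is a minimal Python sketch of the pixel-comparison step. It is not [serhack]’s code. It assumes both pages have already been rendered to the same size and stands in plain nested lists of RGB tuples for real image data. The function names and the fixed tile size are our own choices for illustration.

```python
RED = (255, 0, 0)

def diff_boxes(img_a, img_b, tile=8):
    """Return (left, top, right, bottom) boxes for each tile that differs.

    img_a and img_b are lists of rows; each row is a list of RGB tuples.
    The tile size is an arbitrary choice made for this sketch.
    """
    height, width = len(img_a), len(img_a[0])
    if len(img_b) != height or len(img_b[0]) != width:
        raise ValueError("pages must be rendered at the same size")
    boxes = []
    for top in range(0, height, tile):
        for left in range(0, width, tile):
            bottom = min(top + tile, height)
            right = min(left + tile, width)
            # A tile is flagged as soon as a single pixel inside it differs.
            if any(img_a[y][x] != img_b[y][x]
                   for y in range(top, bottom)
                   for x in range(left, right)):
                boxes.append((left, top, right, bottom))
    return boxes

def draw_boxes(img, boxes, color=RED):
    """Return a copy of img with a one-pixel outline around each box."""
    out = [row[:] for row in img]
    for left, top, right, bottom in boxes:
        for x in range(left, right):
            out[top][x] = color
            out[bottom - 1][x] = color
        for y in range(top, bottom):
            out[y][left] = color
            out[y][right - 1] = color
    return out

# Two blank 16x16 "pages" that differ in exactly one pixel.
white = (255, 255, 255)
page_a = [[white] * 16 for _ in range(16)]
page_b = [row[:] for row in page_a]
page_b[10][12] = (0, 0, 0)

boxes = diff_boxes(page_a, page_b)
marked = draw_boxes(page_b, boxes)
print(boxes)
```

Working tile by tile rather than pixel by pixel keeps the number of boxes manageable, at the cost of marking a slightly larger area than actually changed.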

Obviously, this is best for PDFs that just have a few changes. Inserting a paragraph, for example, makes the output pretty useless. For that, you might consider extracting the text from the PDF using something like pdftotext (which uses the same underlying library this uses to generate images) and comparing the text instead.
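
Once you have the text, an ordinary line-based diff handles the inserted-paragraph case just fine. The sketch below uses Python’s standard difflib module and assumes the text of each PDF has already been extracted into a string. The sample strings and the `text_diff` helper are made up for illustration.

```python
import difflib

def text_diff(old_text, new_text):
    """Return unified-diff lines for two blocks of extracted text."""
    return list(difflib.unified_diff(
        old_text.splitlines(),
        new_text.splitlines(),
        fromfile="old.pdf",   # labels only; no files are read here
        tofile="new.pdf",
        lineterm="",
    ))

# Stand-ins for text extracted from two versions of a document.
old = "Title\nFirst paragraph.\nLast paragraph."
new = "Title\nFirst paragraph.\nInserted paragraph.\nLast paragraph."

diff = text_diff(old, new)
for line in diff:
    print(line)
```

Unlike the pixel comparison, the inserted paragraph shows up as a single added line, and the text after it is not flagged.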

The program throws a lot of messages about missing files but seems to do the job anyway. Here is the result of comparing two versions of the Hackaday home page captured to PDF a few minutes apart:

You can see, though, that if a new article was posted and everything slid down by one, you’d have nothing but a giant red block.

It is still a clever idea. There are surprisingly few tools out there for this, although we did find a few others. There are, of course, plenty of Linux tools for manipulating PDFs. Many of them, like this one, are mashups of other tools.

12 thoughts on “Compare PDFs Visually”

  1. I’ve been using DiffPDF 2.1.3.1 by Mark Summerfield (open source, and prepackaged for Debian, Ubuntu, and others). Not perfect, but pretty good for tracking down changes in huge, multi-page PDFs.

    One of the problems these tools have in general is that as text changes early in a document, it may change where the page breaks, which shows up as a diff, which snowballs into a larger change at the next page break, and so on. I’ve heard one can sidestep some of that by converting PDFs into infinitely long single pages before comparing them, but I wish that were handled internally in the diff tool.

    1. I’ve handled this differently for the OpenXR spec, for which I have set up a PDF diff internally. I have a Python script split the PDF into sections before diffing it, which does help reduce the cascading diff problem.

  2. When I have two copies of the same page and want to ensure that there are no changes, I simply put them together and hold them against the light. Any difference will stand out. Printing single-sided helps, of course.

      1. Open both pdfs and switch repeatedly with Alt+Tab, the differences will blink; go to the next block with the key for “one screen height down” on both pdfs, this keeps the alignment. Much easier :)
        After an insertion or deletion adjust the following text with the scroll bar and proceed. May get tiring for long pdfs with different page breaks due to an insertion at the beginning, and may not work if the line width was changed. It helps to disable any window-switching animations for this.

  3. I wrote a script which launches the specified pair of files (any type – the associated app will open) at full screen. It then toggles between them every second. This makes it easy to spot differences on the viewed page. You then scroll each down as required. It’s not brilliant but it does work with any app which opens each document in a new window.
