Linux Fu: PDF For Penguins

PostScript started out as a programming language for printers. While PostScript printers are still a thing, there are many other ways to send data to a printer. But PostScript also spawned the Portable Document Format or PDF and that has been crazy successful. Hardly a day goes by that you don’t see some kind of PDF document come across your computer screen. Sure, there are other competing formats but they hold a sliver of market share compared to PDF. Viewing PDFs under Linux is no problem. But what about editing them? Turns out, that’s easy, too, if you know how.

GUI Tools

You can use lots of tools to edit PDF files, but the trick is how good the results will look. Anything will work for this: LibreOffice Draw, Inkscape, or even GIMP. If all you want to do is remove something with a white box or make an annotation, these tools are usually great, but for more complicated changes, or pixel-perfect output, they may not be the right tool.

The biggest problem is that most of these tools deal with the PDF as an image or, at least, a collection of objects. For example, columns of text will probably turn into a collection of discrete lines. Changing something that causes a line to wrap will require you to change all the other lines to match. Sometimes text isn’t even text at all, but images. It largely depends on how the creator made the PDF to begin with.If you don’t mind using a Web-based tool, PDFEscape is free and works very well. Other options include Scribus and Okular. Both of these tools can’t really edit the file but can import them as images that you can further manipulate. For example, Okular’s review mode can add annotations like highlights and freehand lines.

Unsurprisingly, emacs can display a PDF file if it is running under X. You can use Control+C Control+C to switch to view a text representation. After all, most of the PDF file format is text and emacs can even handle binary files. So if you don’t mind working inside the PDF format — very much like PostScript — you can do your editing in emacs or even another text editor.

There are a few dedicated non-free editors out there and at least one open-source PDF-specific editor. Of course, like most things in Linux, you can also use the command line.

Hiding Text

The problem with working with PDFs as text — even in emacs — is that they are often compressed and otherwise unreadable. For example, words may appear a character at a time separated by formatting code or other data. So searching for arbitrary text in the PDF may not work.

You can convert the file to use more uncompressed text, although that’s no panacea. For example, if you open up this segment from an article on ham radio and want to change the word “convention”, it is hard to tell exactly where that text is, but it is somewhere in this general area:

3 0 obj << /Length 14770 /Filter /FlateDecode >> stream
H�|Wɒ�8��+p$gJ,�c��v�cS�Ҍc��J�$���\ZV�����\0�� �CTR�������r��[�}�7}����|��������I5u���`M�>�/��?l�.8�@��gBzq�r!#�%� AE�� �˜ ᥉��x!$��X8^%$��A�D�B���(���b�[H �>����#��{a���e0$^H&|/����U1$^��#��/�G�Us��/"/��\ <i�'qC���$xe�"X�x22�������G��F�Lp]Mnm�$] #TI��G�q�l��'3;!���!+�ȷ�{䕀���
��b��Qja����Q i� GRn�\0�g;L����x�Zܿ㌳�n�2�R& :"x�r�ky�[JPK��/���S��i��������]r�F�p����k�� |���
QI�mx>1�\�1�Q��y)ХǺ�Z�U.^�](pN��dx����;�֬;d�_�{˪�cYa�\�.t�s�}�ْ{<\0ZW�:�Ȅ�Oɴ��cS�UzluP�֨o}ި��Uqf��o��V��bT%mj|��t����;v�{s�Rj˺���

Good luck finding it in that soup. You want to convert it to unpacked text.

qpdf -qdf input.pdf output.txt

The resulting file is actually a PDF even though I named it .txt. However, it has everything unpacked. That’s still not great, though, but at least you could find the part you need to change:

1.2632 -1.1242 TD
0.0739 Tc
0.1263 Tw
(One potentially confusing Stamp)Tj
-1.2632 -1.1368 TD
0.026 Tc
0.1248 Tw
[(con)38.6(v)20.7(ention is that the I/O pin numbers)]TJ
0 -1.1242 TD
0.0262 Tc
0.0072 Tw
[<646f6e90>13.6(t correspond to the IC pin numbers.)]TJ
T*

Again, good luck searching for the word “convention,” for example. But it is still better than the first example. You can also find metadata even in unprocessed files using things like /Author and /Title.

Command Line Magic

The qpdf tool can convert a PDF file to another PDF file. It can optimize the output for Web serving, text editing, and it can do simple things like remove pages or merge pieces of multiple files. You can read the documentation, but here we use the QDF mode to produce a legitimate PDF file with all the objects in numerical order and with normal Unix-style line endings. This allows you to more easily edit the file with a text editor, but as you’ve seen that doesn’t always make it simple. Removing entire objects is a headache, but if you get rid of all the mentions of an object, you can run fix-qdf to recreate the proper QDF file.

Another way to make common edits to PDF files is to use PDFtk server (PDFtk without the server moniker is a GUI toolkit for Windows). Using PDFtk you can merge or split documents, rotate pages, and do many other common tasks. For example, to join two files in order:

pdftk in1.pdf in2.pdf cat output output.pdf

You can omit, say, page 9:

pdftk in1.pdf in2.pdf cat 1-8 10-end output output.pdf

You can also shuffle merged pages in different orders:

pdftk A=in1.pdf B=in2.pdf shuffle A B output output.pdf

Text to PDF and Back

If you want to convert text into PDF from the command line you have several options. Pandoc is an amazing tool that converts markdown to almost anything. It will not only convert markdown to PDF but just about anything else.

You can also use various combinations of ps2pdf (along with a tool to generate PostScript), pdf2text (part of poppler-utils), or Ghostscript to create PDFs or strip text out of them. Ghostscript can do a lot, including convert a PDF to a number of image formats if you want to, say, display them on a Web page as an image.

Special Printing and Other Tools

Sometimes you want to modify a PDF file so it will print a certain way. We’ve already talked about how to merge odd and even pages, for example, but there are a few other commands you might want for this purpose:

  • pdfxup – Uses pdflatex and Ghostscript to put multiple pages on one printed page (e.g., 2-up)
  • pdfjam – Uses LaTeX to put documents on different size pages or produce multiple pages on one printed page
  • pdfposter – Create giant output on multiple pages from a single page

If you prefer a GUI you might check out PDFsam basic. If you are interested in Java software, there is Multivalent.

Wrap Up

As usual, there are many ways to do daily tasks in Linux. Sometimes the challenge isn’t doing the work, but rather finding the tool that best fits your style of working.

Oddly enough pandoc keeps coming up for different reasons. If you prefer your documents on paper, you need a printer and bookbinding clamp.

 

41 thoughts on “Linux Fu: PDF For Penguins

  1. This is really useful information; thanks, Al!

    As a follow-up: What Linux tools enable indexing or searching a collection of pdf’s?

    For example, I have a large collection of magazine pdf’s – all paid for! – and I’d love to index the articles for search. What tools are available either to search the pdf’s on demand, or to index their contents ahead of a search?

    1. > What tools are available either to search the pdf’s on demand, or to index their contents ahead of a search?

      Some evil person don’t allow you to use f.e. mentioned in artice pdf2text output piped through grep?

      1. What if PDF is protected? As in “you can’t copy text from this document without a password”?
        As for solution I just use my vast memory to know, where everything is. I also use meaningful file names…

        On related note this proves again, that the GUI for Linux exists only to display multiple text terminals on one screen…

          1. But i case of Windows or macOS (and Android too), user rarely has to operate via command line. On Linux this is not optional, it’s mandatory. It’s all about user experience. The only times I have to resort to command line or text file editing is when I’m dealing with open source tools that either originate from Linux or were developed on it, or with Python. Text file edits are reserved for some games, when developers didn’t include a way to scale up the graphics…

      2. Of course I could, but it’s a bit cumbersome on a library of ~300 documents. That’s why I wondered whether there was a tool to search/index pdf’s more succinctly.

    2. Not exactly indexing, but I have used pdfgrep several times to find stuff in collections of well behaved pdf files. Another route may be using pdftotext and indexing the output. Finally, a document management system like mayan edms or paperless-ng may be helpful – both will run your pdfs through tesseract for OCR (yes, even if it’s not a scan) and store everything in a database.

    3. On Macs, Spotlight automatically indexes the contents of PDFs (along with everything else), assuming they have a text layer — some do, some don’t. To find out, try to select some text. If it works like a text editor (word by word), that’s good. If the selection is just a rectangle, there’s no text to index; what you have is just an image of the page. A “good” PDF will have the text “hidden” behind the page image, so you see the page exactly as the creator intended (the image), but you can still select the text. (And that’s why sometimes when you select text, it doesn’t seem to line up exactly with the letters you’re looking at).

      1. mdfind for macOS terminal uses.

        User-Name:Desktop username$ mdfind pdf -onlyin ./books “acrylic”

        /Users/username/Desktop/books/9780071257633_shigleys_mechanical_engineering_design_f5c5.pdf
        /Users/username/Desktop/books/9781560109143_art_of_acrylic_painting_0842.pdf
        /Users/username/Desktop/books/Mechanisms_and_Mechanical devices sourcebook 5e NEIL SCLATER.pdf
        /Users/username/Desktop/books/9780071704427_Mechanisms_and_Mechanical_Devices_5th_24fb.pdf

    1. Pandoc is good. Understands a whole bunch of markup languages. Markdown is probably the least featureful of them, but if that’s all you need, then the glove fits.

      Asciidoc and DocBook are worth a look if you need something more, at least in the print-media direction.

      All of these convert to PDF / HTML / LaTeX, etc. That’s what’s cool about pandoc. It makes the choice of markup language and final output about as independent of each other as is possible.

      It’s going the other way that’s hard. PDF is a Frankenstein’s monster, but it’s what we’ve got.

  2. I was really struggling to find a decent PDF editor for Linux that gives all the usual editing options as the name brand one. Especially for business stuff, it’s been really a pain to fill and sign non-fillable forms in OpenOffice or GIMP. Also for just quickly editing text, pictures and export it back into a normal non-image PDF and be done with it.
    Went through all the open source options, but then settled with Master PDF Editor. Perpetual license for a very reasonable price, as it should be. Am not affiliated with them just a satisfied user.

  3. PDF is terribly overused. It is not convenient at all, well outlined in the first few paragraphs here. I see it as only really useful for transmitting finished works that are not meant to be edited. It really kind of sucks.

    1. This .. but i dont mind it so. its how I choose to send invoices by email, I use linux on all my computers in work so i know the recipient will be able to open it regardless of underlying OS, also its clumbersome to edit so can be somewhat assured what i send is what they get and theres no messing about. (For example I often list problems found on the invoice during works so as to cover my ass from the ” it wasnt like that before you touched it” crowd.) and as im usually asked to do the work by one party on behalf of another its important I show what I did, and what I didnt do, but was aware of.

    2. You are really late for this party, the PDF standard was made decades ago, you might as well complain about the weather. If you want to make your own file format to solve your perceived issues, feel free to invent one and get folks to go along with it, good luck with that.

    3. I agree it is overused in some situations, for example if you get sent a form you need to sign and email back but it’s a pdf and you can’t edit it or from worksheets or cover sheets at Uni.

      PDFs should only be used for final copies of documents to be sent or put on a website like data sheets or read only documents or documents that need password protected.

    1. PdfArranger is cross platform? Pffft, maybe if you live long enough to cobble together all the tools you’ll need to get it to run on baseline Windows. Mac-OS would likely be easier, it’s BSD at its core and doesn’t blink at swiping stuff from Linux. PDF Arranger is a Python-GTK fork of the venerable PDF Shuffler app which is itself a GUI wrapper for the pikepdf Python PDF library.

  4. Some other useful tools I use regularly:
    – pdfshuffler (for Debian/Ubuntu/Mint) or pdfslicer (for Arch Linux) to reorder or delete pages and mix documents)
    – krop to cut pdf to big to be printed on a single A4 page (useful for maps… just use some glue after :-)
    – pdfbook (alias to pdfjam) to save paper and made physical pretty booklets

  5. I had to generate PDF invoices. HTML2PDF tools produced sub-optimal output. I found “prawn” surprisingly easy to use and an elegant way to learn the structure of PDF. I have full control over the output. I went from hating PDF to enjoying it.

    Only peripherally related, I like eps2svg (from the “geg” package) to take PostScript output such as gEDA schematics and produce SVG files for resolution-independent consumption on the web.

  6. I prefer EPUB over PDF, since EPUB flows to fit the display device geometry. For conversion, I use Calibre to convert PDF to EPUB. For editing, Sigil edits what looks like a HTML document, and outputs EPUB. I kinda think Calibre lets you create PDFs and other formats from EPUB but I’ve not bothered.

  7. The article fails to mention the most useful feature of all, the one thing that actually redeems pdf from being a total steaming pile of crud to just a pile of crud, and that is pdfmarks. Pdfmarks are processed by ghostscript and can be effectively appended to an existing pdf without altering any of the exsiting pdf format and writing over the top or adding to it.

    Including really, really useful stuff like adding named destions and the like.

    Waring, once you start delving into pdf you will lose substantial portions of your life to get to any real deep understanding, I see one guy has spent 15 years of his life to get to the point he can publish a tool that works with pdf at the most fundemental levels, Adobe have a lot to answer for.

  8. While we are talking about PDF’s, I sometimes use ‘pdfimages’, part of the poppler-utils package. It will extract all the images from a PDF file, saving them in jpg format.

    pdfimages -all “mypdf.pdf” /media/Workspace/Temp/

  9. @Al Williams said: “If you don’t mind using a Web-based tool, PDFEscape is free and works very well.”

    Well not for me at 05:35 UTC on 08-July-2021. The link:

    https://www.pdfescape.com/

    in the article looks OK to me, but the result gets redirected to:

    https://www.pdfescape.com/500/?aspxerrorpath=/500/

    With the following creepy message:

    “The page isn’t redirecting properly. An error occurred during a connection to http://www.pdfescape.com. This problem can sometimes be caused by disabling or refusing to accept cookies.”

    So what? I have to accept abusive tracking cookies or something to use PDFEscape? If yes, that makes PDFEscape both NOT free and totally broken. Or maybe the HaD overload effect is temporarily causing problems at PDFEscape? Yawn, I’ll try PDFEscape again tomorrow and see how it goes.

  10. Agree. I had a look at PDFEscape. Here from their privacy policy:

    “We take your privacy seriously. Please read the following information to learn more about our privacy practices.

    This policy covers what information we collect and how it is treated.
    This policy only applies to Red Software Websites and does not extend to the practices of other websites that we may link to, companies that we do not own or control, or to people that we do not employ or manage.”

    So far so good. And now from their page source:

    “if( Cookiebot.consented ? Cookiebot.consent.statistics : Cookiebot.isOutsideEU )
    {
    (function(i,s,o,g,r,a,m){i[‘GoogleAnalyticsObject’]=r;i[r]=i[r]||function(){ [… the usual Google spyware crap …] } }”

    Thanks, folks. I’ll pass.

    1. Dont block cookies, use ad-blockers. Best is uBlock Origin. It is better to just not load content than break internet by blocking cookies. Else, delete all cookies on browser close, else, use an extension that autodeletes cookies.

      Hopefully anyone against GoogleAnalytics also doesnt use their mail, android or search services. Else, anyone again Microsoft, hopefully also doesnt use their mail, platform or search services.

  11. This article feels totally bizarre to me, as I have never felt a need to edit a pdf all my life. Aren’t they intended for final documents not to be edited, or forms that people want you to print out because they need a real signature?

  12. If you need to annotate/redact there is the mupdf – not the classic X11 version the GL one: mupdf-gl , the redaction is so good I totally abandoned the Adobe (under wine) , one notice: its on the version 1.18+

Leave a Reply

Please be kind and respectful to help make the comments section excellent. (Comment Policy)

This site uses Akismet to reduce spam. Learn how your comment data is processed.