Linux Fu: PDF For Penguins

July 6, 2021

PostScript started out as a programming language for printers. While PostScript printers are still a thing, there are many other ways to send data to a printer. But PostScript also spawned the Portable Document Format or PDF and that has been crazy successful. Hardly a day goes by that you don’t see some kind of PDF document come across your computer screen. Sure, there are other competing formats but they hold a sliver of market share compared to PDF. Viewing PDFs under Linux is no problem. But what about editing them? Turns out, that’s easy, too, if you know how.

GUI Tools

You can use lots of tools to edit PDF files, but the trick is how good the results will look. Anything will work for this: LibreOffice Draw, Inkscape, or even GIMP. If all you want to do is remove something with a white box or make an annotation, these tools are usually great, but for more complicated changes, or pixel-perfect output, they may not be the right tool.

The biggest problem is that most of these tools deal with the PDF as an image or, at least, a collection of objects. For example, columns of text will probably turn into a collection of discrete lines. Changing something that causes a line to wrap will require you to change all the other lines to match. Sometimes text isn’t even text at all, but images. It largely depends on how the creator made the PDF to begin with.If you don’t mind using a Web-based tool, PDFEscape is free and works very well. Other options include Scribus and Okular. Both of these tools can’t really edit the file but can import them as images that you can further manipulate. For example, Okular’s review mode can add annotations like highlights and freehand lines.

Unsurprisingly, emacs can display a PDF file if it is running under X. You can use Control+C Control+C to switch to view a text representation. After all, most of the PDF file format is text and emacs can even handle binary files. So if you don’t mind working inside the PDF format — very much like PostScript — you can do your editing in emacs or even another text editor.

There are a few dedicated non-free editors out there and at least one open-source PDF-specific editor. Of course, like most things in Linux, you can also use the command line.

Hiding Text

The problem with working with PDFs as text — even in emacs — is that they are often compressed and otherwise unreadable. For example, words may appear a character at a time separated by formatting code or other data. So searching for arbitrary text in the PDF may not work.

You can convert the file to use more uncompressed text, although that’s no panacea. For example, if you open up this segment from an article on ham radio and want to change the word “convention”, it is hard to tell exactly where that text is, but it is somewhere in this general area:

3 0 obj << /Length 14770 /Filter /FlateDecode >> stream
H�|Wɒ�8��+p$gJ,�c��v�cS�Ҍc��J�$���\ZV�����\0�� �CTR�������r��[�}�7}����|��������I5u���`M�>�/��?l�.8�@��gBzq�r!#�%� AE�� � ᥉��x!$��X8^%$��A�D�B���(���b�[H �>����#��{a���e0$^H&|/����U1$^��#��/�G�Us��/"/��\ <i�'qC���$xe�"X�x22�������G��F�Lp]Mnm�$] #TI��G�q�l��'3;!���!+�ȷ�{䕀���
��b��Qja����Q i� GRn�\0�g;L����x�Zܿ㌳�n�2�R& :"x�r�ky�[JPK��/���S��i��������]r�F�p����k�� |���
QI�mx>1�\�1�Q��y)ХǺ�Z�U.^�](pN��dx����;�֬;d�_�{˪�cYa�\�.t�s�}�ْ{<\0ZW�:�Ȅ�Oɴ��cS�UzluP�֨o}ި��Uqf��o��V��bT%mj|��t����;v�{s�Rj˺���

Good luck finding it in that soup. You want to convert it to unpacked text.

qpdf -qdf input.pdf output.txt

The resulting file is actually a PDF even though I named it .txt. However, it has everything unpacked. That’s still not great, though, but at least you could find the part you need to change:

1.2632 -1.1242 TD
0.0739 Tc
0.1263 Tw
(One potentially confusing Stamp)Tj
-1.2632 -1.1368 TD
0.026 Tc
0.1248 Tw
[(con)38.6(v)20.7(ention is that the I/O pin numbers)]TJ
0 -1.1242 TD
0.0262 Tc
0.0072 Tw
[<646f6e90>13.6(t correspond to the IC pin numbers.)]TJ
T*

Again, good luck searching for the word “convention,” for example. But it is still better than the first example. You can also find metadata even in unprocessed files using things like /Author and /Title.

Command Line Magic

The qpdf tool can convert a PDF file to another PDF file. It can optimize the output for Web serving, text editing, and it can do simple things like remove pages or merge pieces of multiple files. You can read the documentation, but here we use the QDF mode to produce a legitimate PDF file with all the objects in numerical order and with normal Unix-style line endings. This allows you to more easily edit the file with a text editor, but as you’ve seen that doesn’t always make it simple. Removing entire objects is a headache, but if you get rid of all the mentions of an object, you can run fix-qdf to recreate the proper QDF file.

Another way to make common edits to PDF files is to use PDFtk server (PDFtk without the server moniker is a GUI toolkit for Windows). Using PDFtk you can merge or split documents, rotate pages, and do many other common tasks. For example, to join two files in order:

pdftk in1.pdf in2.pdf cat output output.pdf

You can omit, say, page 9:

pdftk in1.pdf in2.pdf cat 1-8 10-end output output.pdf

You can also shuffle merged pages in different orders:

pdftk A=in1.pdf B=in2.pdf shuffle A B output output.pdf

Text to PDF and Back

If you want to convert text into PDF from the command line you have several options. Pandoc is an amazing tool that converts markdown to almost anything. It will not only convert markdown to PDF but just about anything else.

You can also use various combinations of ps2pdf (along with a tool to generate PostScript), pdf2text (part of poppler-utils), or Ghostscript to create PDFs or strip text out of them. Ghostscript can do a lot, including convert a PDF to a number of image formats if you want to, say, display them on a Web page as an image.

Special Printing and Other Tools

Sometimes you want to modify a PDF file so it will print a certain way. We’ve already talked about how to merge odd and even pages, for example, but there are a few other commands you might want for this purpose:

pdfxup – Uses pdflatex and Ghostscript to put multiple pages on one printed page (e.g., 2-up)
pdfjam – Uses LaTeX to put documents on different size pages or produce multiple pages on one printed page
pdfposter – Create giant output on multiple pages from a single page

If you prefer a GUI you might check out PDFsam basic. If you are interested in Java software, there is Multivalent.

Wrap Up

As usual, there are many ways to do daily tasks in Linux. Sometimes the challenge isn’t doing the work, but rather finding the tool that best fits your style of working.

Oddly enough pandoc keeps coming up for different reasons. If you prefer your documents on paper, you need a printer and bookbinding clamp.

44 thoughts on “Linux Fu: PDF For Penguins”

lthemick says:

July 6, 2021 at 10:19 am

This is really useful information; thanks, Al!

As a follow-up: What Linux tools enable indexing or searching a collection of pdf’s?

For example, I have a large collection of magazine pdf’s – all paid for! – and I’d love to index the articles for search. What tools are available either to search the pdf’s on demand, or to index their contents ahead of a search?

Report comment

Reply
1. Stanson says:
  
  July 6, 2021 at 10:32 am
  
  > What tools are available either to search the pdf’s on demand, or to index their contents ahead of a search?
  
  Some evil person don’t allow you to use f.e. mentioned in artice pdf2text output piped through grep?
  
  Report comment
  
  Reply
  1. Moryc says:
    
    July 6, 2021 at 1:05 pm
    
    What if PDF is protected? As in “you can’t copy text from this document without a password”?
    As for solution I just use my vast memory to know, where everything is. I also use meaningful file names…
    
    On related note this proves again, that the GUI for Linux exists only to display multiple text terminals on one screen…
    
    Report comment
    
    Reply
    1. X says:
      
      July 6, 2021 at 1:44 pm
      
      This is true of every operating system. Windows is a thin layer on the Windows text based kernel and macOS is also just a layer on top of BSD.
      
      Report comment
      
      Reply
      1. Moryc says:
        
        July 6, 2021 at 9:42 pm
        
        But i case of Windows or macOS (and Android too), user rarely has to operate via command line. On Linux this is not optional, it’s mandatory. It’s all about user experience. The only times I have to resort to command line or text file editing is when I’m dealing with open source tools that either originate from Linux or were developed on it, or with Python. Text file edits are reserved for some games, when developers didn’t include a way to scale up the graphics…
        
        Report comment
  2. lthemick says:
    
    July 6, 2021 at 2:02 pm
    
    Of course I could, but it’s a bit cumbersome on a library of ~300 documents. That’s why I wondered whether there was a tool to search/index pdf’s more succinctly.
    
    Report comment
    
    Reply
2. Phil says:
  
  July 6, 2021 at 12:31 pm
  
  Not exactly indexing, but I have used pdfgrep several times to find stuff in collections of well behaved pdf files. Another route may be using pdftotext and indexing the output. Finally, a document management system like mayan edms or paperless-ng may be helpful – both will run your pdfs through tesseract for OCR (yes, even if it’s not a scan) and store everything in a database.
  
  Report comment
  
  Reply
3. Isaac says:
  
  July 6, 2021 at 9:11 pm
  
  On Macs, Spotlight automatically indexes the contents of PDFs (along with everything else), assuming they have a text layer — some do, some don’t. To find out, try to select some text. If it works like a text editor (word by word), that’s good. If the selection is just a rectangle, there’s no text to index; what you have is just an image of the page. A “good” PDF will have the text “hidden” behind the page image, so you see the page exactly as the creator intended (the image), but you can still select the text. (And that’s why sometimes when you select text, it doesn’t seem to line up exactly with the letters you’re looking at).
  
  Report comment
  
  Reply
  1. Eric says:
    
    July 7, 2021 at 3:04 pm
    
    mdfind for macOS terminal uses.
    
    User-Name:Desktop username$ mdfind pdf -onlyin ./books “acrylic”
    
    /Users/username/Desktop/books/9780071257633_shigleys_mechanical_engineering_design_f5c5.pdf
    /Users/username/Desktop/books/9781560109143_art_of_acrylic_painting_0842.pdf
    /Users/username/Desktop/books/Mechanisms_and_Mechanical devices sourcebook 5e NEIL SCLATER.pdf
    /Users/username/Desktop/books/9780071704427_Mechanisms_and_Mechanical_Devices_5th_24fb.pdf
    
    Report comment
    
    Reply
4. Trn says:
  
  July 8, 2021 at 2:45 pm
  
  recoll does it
  
  Report comment
  
  Reply
BT says:

July 6, 2021 at 10:33 am

“there are many ways to do daily tasks in Linux. Sometimes the challenge isn’t doing the work, but rather finding the tool”

Ain’t that the truth!

Report comment

Reply
therogerv says:

July 6, 2021 at 10:48 am

when it started talking about pandoc and converting Markdown to pdf, now that is what got me excited

Report comment

Reply
1. Elliot Williams says:
  
  July 7, 2021 at 2:10 am
  
  Pandoc is good. Understands a whole bunch of markup languages. Markdown is probably the least featureful of them, but if that’s all you need, then the glove fits.
  
  Asciidoc and DocBook are worth a look if you need something more, at least in the print-media direction.
  
  All of these convert to PDF / HTML / LaTeX, etc. That’s what’s cool about pandoc. It makes the choice of markup language and final output about as independent of each other as is possible.
  
  It’s going the other way that’s hard. PDF is a Frankenstein’s monster, but it’s what we’ve got.
  
  Report comment
  
  Reply
therogerv says:

July 6, 2021 at 10:51 am

which makes me think, hmm, a Hackaday article like this one on PDF, on the subject of Markdown, would be a good read

Report comment

Reply
rnjacobs says:

July 6, 2021 at 11:12 am

My favorite tool for this, as limited as it is, is “flpsed”. Some times you only need to add some text on top of a pdf.

Report comment

Reply
z_e says:

July 6, 2021 at 12:16 pm

I was really struggling to find a decent PDF editor for Linux that gives all the usual editing options as the name brand one. Especially for business stuff, it’s been really a pain to fill and sign non-fillable forms in OpenOffice or GIMP. Also for just quickly editing text, pictures and export it back into a normal non-image PDF and be done with it.
Went through all the open source options, but then settled with Master PDF Editor. Perpetual license for a very reasonable price, as it should be. Am not affiliated with them just a satisfied user.

Report comment

Reply
sqelch says:

July 6, 2021 at 12:24 pm

PDF is terribly overused. It is not convenient at all, well outlined in the first few paragraphs here. I see it as only really useful for transmitting finished works that are not meant to be edited. It really kind of sucks.

Report comment

Reply
1. denis says:
  
  July 6, 2021 at 12:44 pm
  
  This .. but i dont mind it so. its how I choose to send invoices by email, I use linux on all my computers in work so i know the recipient will be able to open it regardless of underlying OS, also its clumbersome to edit so can be somewhat assured what i send is what they get and theres no messing about. (For example I often list problems found on the invoice during works so as to cover my ass from the ” it wasnt like that before you touched it” crowd.) and as im usually asked to do the work by one party on behalf of another its important I show what I did, and what I didnt do, but was aware of.
  
  Report comment
  
  Reply
2. X says:
  
  July 6, 2021 at 2:01 pm
  
  You are really late for this party, the PDF standard was made decades ago, you might as well complain about the weather. If you want to make your own file format to solve your perceived issues, feel free to invent one and get folks to go along with it, good luck with that.
  
  Report comment
  
  Reply
3. Conor Stewart says:
  
  July 6, 2021 at 4:18 pm
  
  I agree it is overused in some situations, for example if you get sent a form you need to sign and email back but it’s a pdf and you can’t edit it or from worksheets or cover sheets at Uni.
  
  PDFs should only be used for final copies of documents to be sent or put on a website like data sheets or read only documents or documents that need password protected.
  
  Report comment
  
  Reply
rpavlik says:

July 6, 2021 at 12:59 pm

PdfArranger is also a useful tool: open source, cross platform, for doing things like the pdftk gui appears to

Report comment

Reply
1. Drone says:
  
  July 7, 2021 at 11:09 pm
  
  PdfArranger is cross platform? Pffft, maybe if you live long enough to cobble together all the tools you’ll need to get it to run on baseline Windows. Mac-OS would likely be easier, it’s BSD at its core and doesn’t blink at swiping stuff from Linux. PDF Arranger is a Python-GTK fork of the venerable PDF Shuffler app which is itself a GUI wrapper for the pikepdf Python PDF library.
  
  Report comment
  
  Reply
Michel says:

July 6, 2021 at 1:48 pm

Some other useful tools I use regularly:
– pdfshuffler (for Debian/Ubuntu/Mint) or pdfslicer (for Arch Linux) to reorder or delete pages and mix documents)
– krop to cut pdf to big to be printed on a single A4 page (useful for maps… just use some glue after :-)
– pdfbook (alias to pdfjam) to save paper and made physical pretty booklets

Report comment

Reply
Neil says:

July 6, 2021 at 1:54 pm

I had to generate PDF invoices. HTML2PDF tools produced sub-optimal output. I found “prawn” surprisingly easy to use and an elegant way to learn the structure of PDF. I have full control over the output. I went from hating PDF to enjoying it.

Only peripherally related, I like eps2svg (from the “geg” package) to take PostScript output such as gEDA schematics and produce SVG files for resolution-independent consumption on the web.

Report comment

Reply
Walla B. Goode says:

July 6, 2021 at 3:16 pm

I prefer EPUB over PDF, since EPUB flows to fit the display device geometry. For conversion, I use Calibre to convert PDF to EPUB. For editing, Sigil edits what looks like a HTML document, and outputs EPUB. I kinda think Calibre lets you create PDFs and other formats from EPUB but I’ve not bothered.

Report comment

Reply
Zelea says:

July 6, 2021 at 5:11 pm

There are also commercial packages that can be used to process PDF in linux and can do much more than the tools listed above.
My favorites are Apago: https://www.apagoinc.com and 3-Heights https://www.pdf-tools.com

Report comment

Reply
shermelin says:

July 6, 2021 at 6:59 pm

I actually use diurnal (http://sourceforge.net/projects/xournal/) for masking or adding stiff (signature, filling forms which are not forms, adding text, drawing…)
It looks like a rewrite appeared : https://xournalpp.github.io/

Report comment

Reply
gareth says:

July 7, 2021 at 4:42 am

The article fails to mention the most useful feature of all, the one thing that actually redeems pdf from being a total steaming pile of crud to just a pile of crud, and that is pdfmarks. Pdfmarks are processed by ghostscript and can be effectively appended to an existing pdf without altering any of the exsiting pdf format and writing over the top or adding to it.

Including really, really useful stuff like adding named destions and the like.

Waring, once you start delving into pdf you will lose substantial portions of your life to get to any real deep understanding, I see one guy has spent 15 years of his life to get to the point he can publish a tool that works with pdf at the most fundemental levels, Adobe have a lot to answer for.

Report comment

Reply
1. danindenver says:
  
  July 10, 2021 at 8:14 pm
  
  can’t parse “destions”
  
  Report comment
  
  Reply
Rex says:

July 7, 2021 at 7:56 am

While we are talking about PDF’s, I sometimes use ‘pdfimages’, part of the poppler-utils package. It will extract all the images from a PDF file, saving them in jpg format.

pdfimages -all “mypdf.pdf” /media/Workspace/Temp/

Report comment

Reply
b1twise says:

July 7, 2021 at 11:49 am

I’d like to mention hexapdf, which has become my favorite tool for pdf manipulation.

Report comment

Reply
Bruce Richardson says:

July 7, 2021 at 11:10 pm

Shout out to Tabula https://tabula.technology/ if you ever need to grab a table of data from a PDF.

Report comment

Reply
Drone says:

July 7, 2021 at 11:12 pm

@Al Williams said: “If you don’t mind using a Web-based tool, PDFEscape is free and works very well.”

Well not for me at 05:35 UTC on 08-July-2021. The link:

https://www.pdfescape.com/

in the article looks OK to me, but the result gets redirected to:

https://www.pdfescape.com/500/?aspxerrorpath=/500/

With the following creepy message:

“The page isn’t redirecting properly. An error occurred during a connection to http://www.pdfescape.com. This problem can sometimes be caused by disabling or refusing to accept cookies.”

So what? I have to accept abusive tracking cookies or something to use PDFEscape? If yes, that makes PDFEscape both NOT free and totally broken. Or maybe the HaD overload effect is temporarily causing problems at PDFEscape? Yawn, I’ll try PDFEscape again tomorrow and see how it goes.

Report comment

Reply
nu2eivuB says:

July 8, 2021 at 1:12 am

Agree. I had a look at PDFEscape. Here from their privacy policy:

“We take your privacy seriously. Please read the following information to learn more about our privacy practices.

This policy covers what information we collect and how it is treated.
This policy only applies to Red Software Websites and does not extend to the practices of other websites that we may link to, companies that we do not own or control, or to people that we do not employ or manage.”

So far so good. And now from their page source:

“if( Cookiebot.consented ? Cookiebot.consent.statistics : Cookiebot.isOutsideEU )
{
(function(i,s,o,g,r,a,m){i[‘GoogleAnalyticsObject’]=r;i[r]=i[r]||function(){ [… the usual Google spyware crap …] } }”

Thanks, folks. I’ll pass.

Report comment

Reply
1. AndIfOr says:
  
  July 9, 2021 at 12:58 pm
  
  Dont block cookies, use ad-blockers. Best is uBlock Origin. It is better to just not load content than break internet by blocking cookies. Else, delete all cookies on browser close, else, use an extension that autodeletes cookies.
  
  Hopefully anyone against GoogleAnalytics also doesnt use their mail, android or search services. Else, anyone again Microsoft, hopefully also doesnt use their mail, platform or search services.
  
  Report comment
  
  Reply
Gloop 81 says:

July 9, 2021 at 2:24 am

This article feels totally bizarre to me, as I have never felt a need to edit a pdf all my life. Aren’t they intended for final documents not to be edited, or forms that people want you to print out because they need a real signature?

Report comment

Reply
Fosselius says:

July 9, 2021 at 11:03 pm

I use https://xournalpp.github.io/ for note taking and annotation of PDFs. its a great tool for us with pen-enabled computers.

Report comment

Reply
xchris says:

July 10, 2021 at 1:33 am

If you need to annotate/redact there is the mupdf – not the classic X11 version the GL one: mupdf-gl , the redaction is so good I totally abandoned the Adobe (under wine) , one notice: its on the version 1.18+

Report comment

Reply
Neo says:

July 11, 2021 at 4:33 pm

The program, “mat2” (available in lots of Linux repositories but also comes with TAILS Linux) is good at “scrubbing” information from files, including PDF.

Report comment

Reply
Fred Gray says:

August 25, 2021 at 11:15 am

You can convert pdf to word with free online converters like https://onlineconverters.org/convert-pdf-to-word.

Report comment

Reply
Lenclume says:

August 26, 2021 at 10:19 am

I’ve come back to consult this article a few times; really a great reference

Report comment

Reply
Supporting Ainain says:

October 24, 2022 at 1:45 am

You can convert pdf to word with free online converters like https://onlineconvertersww.com/convert-pdf-to-word.

Report comment

Reply
bit crack pc says:

October 26, 2022 at 12:58 am

You can download 64 bit crack pc software like https://up4pc.com/internet-download-manager-crack/

Report comment

Reply
John Border says:

February 16, 2023 at 9:29 pm

That was useful..

Report comment

Reply