Turn PDF Into Text

Today’s exercise is simple, though it relies on the terminal: we’re going to turn PDF into text. This isn’t complicated, it’s fairly effective, and it’s an exercise anyone can follow along with. So, if you want to turn PDF into text, read this article!

I’m sure this works in multiple distros, but I only have instructions for a few in my notes. Most everyone should be able to follow along with this article and turn PDF into text. You’ll see…

While I’m sure everyone is familiar with PDF, I’ll explain…

PDF stands for Portable Document Format and is one of many standards for documents. The format was created by Adobe and is standardized as ISO 32000. Adobe’s own software is proprietary but the standard is open, meaning you have your choice of PDF readers, editors, and creators.

On the other hand, PDF may not be as easily parsed as other file formats. You may also just want to extract some text from a PDF or turn it into something more information-dense, sans pictures and fluffy formatting. There are any number of reasons why you might want to turn PDF to text and it’s a simple operation that’s going to give you ‘acceptable’ results most of the time.

The tool we’ll be using…

pdftotext:

We’ll use a tool known as ‘pdftotext’ which does as its name implies. It’s a tool that lets you turn PDF into text, so from .pdf to .txt is the goal. Like many Linux tools, this is a terminal-based operation.

You can check to see if pdftotext is already installed with this command:
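
One way to run that check, as a minimal sketch assuming a POSIX-style shell, is ‘command -v’, which looks a program up in your PATH. The ‘status’ variable is just an illustrative name.

```shell
# Look for pdftotext in the PATH; 'status' is an illustrative variable name.
if command -v pdftotext >/dev/null 2>&1; then
    status="installed"
else
    status="missing"
fi
echo "pdftotext is $status"
```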

If the command reports that pdftotext is already present, you can skip the installation step.

If you want, you can check the man page and see that it is indeed the correct tool for the job if your job is to turn PDF into text. That’s this command:
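
On most systems that is the usual man invocation. Here is a small sketch. The fallback message is my own addition for systems where the manual pages aren’t installed.

```shell
# Open the manual page for pdftotext (press q to leave the pager).
# The fallback text is an addition for systems without man pages.
man pdftotext || echo "No manual page found; 'pdftotext -h' prints a short usage summary."
```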

That command will show you that the description is indeed what we want to accomplish in today’s article. That description is basically:

pdftotext – Portable Document Format (PDF) to text converter

(It may also tell you the version in that section, which is odd but is what it is.)

So, you can see that pdftotext is the correct tool for the job when you want to…

Turn PDF Into Text:

As I mentioned in the intro, if you want to turn PDF into text one of the ways to do so will require using the terminal. There are all sorts of GUI tools you can use to do this very same job, but we’ll do this in the terminal. So, you can usually get away with pressing CTRL + ALT + T to open your default terminal emulator. Otherwise, check your application menu and you’ll find a terminal option in there.

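With your terminal open, we’ll first install the package that provides pdftotext so that we can turn PDF into text. That tool comes from ‘poppler’, a PDF rendering library whose command-line utilities are packaged under slightly different names depending on the distro. You can pick from the following to match your package manager to install this.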

Debian/Ubuntu/etc:

Arch/Manjaro/etc:

RHEL/Fedora/etc:

The poppler package contains pdftotext which is the tool we’re after in our quest to turn PDF into text. It’s a noble quest!
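
As a rough sketch, the install commands for the three families above can be gathered into one small script. It only prints the matching command and doesn’t run it. The package names (poppler-utils for apt and dnf, poppler for pacman) are how I understand these distros to package the tools, so treat them as assumptions and check your own repositories. The ‘install_hint’ variable is an illustrative name.

```shell
# Print (not run) the install command that matches the package manager found.
# Package names are assumptions based on common distro packaging.
if command -v apt >/dev/null 2>&1; then
    install_hint="sudo apt install poppler-utils"      # Debian/Ubuntu/etc
elif command -v pacman >/dev/null 2>&1; then
    install_hint="sudo pacman -S poppler"              # Arch/Manjaro/etc
elif command -v dnf >/dev/null 2>&1; then
    install_hint="sudo dnf install poppler-utils"      # RHEL/Fedora/etc
else
    install_hint="search your package manager for 'poppler'"
fi
echo "To install pdftotext: $install_hint"
```

Printing the command instead of running it keeps the sketch harmless. Copy the line it prints and run it yourself when you’re ready.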

Now, the syntax is quite simple:
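
In its most basic form you hand pdftotext the name of a PDF and nothing else. A minimal sketch, assuming a hypothetical file called document.pdf in the current directory:

```shell
# 'document.pdf' is a hypothetical file name; substitute your own PDF.
pdf="document.pdf"
txt="${pdf%.pdf}.txt"   # with no output name given, pdftotext writes <file_name>.txt

if [ -f "$pdf" ]; then
    pdftotext "$pdf"
    echo "Wrote $txt"
else
    echo "No $pdf found here; the command would be: pdftotext $pdf"
fi
```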

That will create a <file_name>.txt file in the same directory.

Now, if you checked the man page above, you’d see that there’s not a whole lot to this application. You can largely ignore all the options (and we will), though there aren’t that many.

The two options we are most interested in let you convert only some of the pages into text. For that, you want the -f (first page) and -l (last page) flags. They do exactly what you’d expect and the syntax is as follows:
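
The general shape is both flags followed by the file name. As a sketch, it can be wrapped in a tiny shell function. The ‘convert_range’ helper is hypothetical and not part of pdftotext.

```shell
# convert_range is a hypothetical helper, not part of pdftotext itself.
# Usage: convert_range <first_page> <last_page> <file.pdf>
convert_range() {
    pdftotext -f "$1" -l "$2" "$3"
}

# Giving both flags the same number converts a single page, e.g. page 4:
# convert_range 4 4 document.pdf
```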

I’ll give you an example…

Let’s say you want to extract pages 1 through 3. The syntax would be:
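
Here is a sketch of that call, assuming a hypothetical document.pdf. The second file argument, which names the output file, is an optional extra that this article doesn’t otherwise cover. Leave it off and you get the usual <file_name>.txt.

```shell
# Hypothetical file names; the second file argument sets the output name.
pdf="document.pdf"
out="pages-1-3.txt"

if [ -f "$pdf" ]; then
    pdftotext -f 1 -l 3 "$pdf" "$out"   # extract pages 1 through 3 only
    echo "Wrote $out"
fi
```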

Sometimes this whole pdftotext thing doesn’t do a great job. If the PDF file is formatted in a fancy manner, it may just not come out in text all that well. Fortunately, you can help it along with the -layout flag.

The -layout flag is described like this:

Maintain (as best as possible) the original physical layout of the text. The default is to ‘undo’ physical layout (columns, hyphenation, etc.) and output the text in reading order.

So, that flag will do its best to keep the layout as it was in the original PDF. This is a handy flag for when the default output isn’t usable. Columns and other positional formatting can often be retained, meaning the text file output is more useful. You won’t always need this option, but it can come in handy. You can safely ignore the remainder of the man page for the vast majority of what folks are going to do with this command.
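
To see what the flag changes for a particular file, one approach is to convert it both ways and compare. This is a sketch only, and the file names are all hypothetical.

```shell
pdf="document.pdf"              # hypothetical input file
plain_out="reading-order.txt"   # default behaviour
layout_out="layout.txt"         # with -layout

if [ -f "$pdf" ]; then
    pdftotext "$pdf" "$plain_out"
    pdftotext -layout "$pdf" "$layout_out"
    # Show the first differences between the two conversions, if any.
    diff "$plain_out" "$layout_out" | head -n 20
fi
```

If the two files barely differ, the default output is probably fine for that document.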

That’s pretty much all you need to know about the pdftotext application. It does what you think it’d do. It’s the tool you use to turn PDF into text, just like it says on the tin! Pretty handy!

Closure:

So, that’s an article… 

If you’ve ever wanted to turn PDF into text, you now know how. You can use this to make a PDF easier to parse, easier to read, etc. It’s up to you how you use pdftotext. You now have the knowledge! You now have the power! Indeed, you have life by the horns. (Which is a rather silly place to grab onto.)

Man, this is a lot of articles… At this point, it’s almost habitual. Technically, I have published something every other day – for a long time. A couple of those articles weren’t really articles. They were placeholders because Mother Nature is a fickle beast and I live in a very remote location. We had a few major (deadly even) storms that took out our infrastructure. I think I can be forgiven for that – and I did upload articles saying that there’d be no article. 

The site has come a long way…

I haven’t done a meta article in a while…

Seriously, without you (my readers) I’d have never kept going this long. It’s obviously not a money-making operation, but it is an educational operation. That’s more important than money.

Thanks for reading!

This work is licensed under a Creative Commons Attribution 4.0 International License.