Command Line Archives • Page 6 of 77 • Linux Tips

Turn PDF Into Text

Today’s exercise is simple, though it will rely on the terminal because we’re just going to turn PDF into text. This isn’t something complicated and it’s fairly effective. It’s also an exercise anyone can follow along with. So, if you want to turn PDF into text, read this article!

I’m sure this works in multiple distros, but I only have instructions for a few in my notes. Most everyone should be able to follow along with this article and turn PDF into text. You’ll see…

While I’m sure everyone is familiar with PDF, I’ll explain…

PDF stands for Portable Document Format and is one of many standards for documents. Specifically, it’s ISO 32000, a file format originally developed by Adobe. Adobe’s own software is proprietary but the standard is open, meaning you have your choice of PDF readers, editors, and creators.

On the other hand, PDF may not be as easily parsed as other file formats. You may also just want to extract some text from a PDF or turn it into something more information-dense, sans pictures and fluffy formatting. There are any number of reasons why you might want to turn PDF to text and it’s a simple operation that’s going to give you ‘acceptable’ results most of the time.

The tool we’ll be using…

pdftotext:

We’ll use a tool known as ‘pdftotext’ which does as its name implies. It’s a tool that lets you turn PDF into text, so from .pdf to .txt is the goal. Like many Linux tools, this is a terminal-based operation.

You can check to see if pdftotext is already installed with this command:

which pdftotext

If the output matches this, you can skip the installation step:

$ which pdftotext
/usr/bin/pdftotext
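
If which itself isn’t available on a minimal system, the shell’s built-in command -v does the same job. This is just a small sketch; the messages it prints are made up for illustration and are not output from any tool.

```shell
# Check for pdftotext using the shell's built-in "command -v".
if command -v pdftotext >/dev/null 2>&1; then
    status="installed"
else
    status="not installed"
fi
echo "pdftotext is $status"
```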

If you want, you can check the man page and see that it is indeed the correct tool for the job if your job is to turn PDF into text. That’s this command:

man pdftotext

That command will show you that the description is indeed what we want to accomplish in today’s article. That description is basically:

pdftotext – Portable Document Format (PDF) to text converter

(It may also tell you the version in that section, which is odd but is what it is.)

So, you can see that pdftotext is the correct tool for the job when you want to…

Turn PDF Into Text:

As I mentioned in the intro, if you want to turn PDF into text one of the ways to do so will require using the terminal. There are all sorts of GUI tools you can use to do this very same job, but we’ll do this in the terminal. So, you can usually get away with pressing CTRL + ALT + T to open your default terminal emulator. Otherwise, check your application menu and you’ll find a terminal option in there.

With your terminal open, we first will install the package that provides pdftotext so that we can turn PDF into text. That package comes from the ‘poppler’ project, though its exact name varies by distro. You can pick from the following to match your package manager to install this.

Debian/Ubuntu/etc:

sudo apt install poppler-utils

Arch/Manjaro/etc:

sudo pacman -S poppler

RHEL/Fedora/etc:

sudo dnf install poppler-utils

These packages contain pdftotext, which is the tool we’re after in our quest to turn PDF into text. It’s a noble quest!

Now, the syntax is quite simple:

pdftotext <file_name>.pdf

That will create a <file_name>.txt file in the same directory.
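
By default the text file takes its name from the PDF, but pdftotext also accepts a second argument naming the output file. Here’s a minimal sketch of a wrapper around both forms; the function name pdf_to_text and its error messages are hypothetical, made up for illustration.

```shell
# Hypothetical helper: convert a PDF to text, optionally naming the output file.
pdf_to_text() {
    pdf=$1
    txt=${2:-}
    # Bail out early if the tool is not installed.
    command -v pdftotext >/dev/null 2>&1 || {
        echo "pdftotext not found; install it first" >&2
        return 127
    }
    # Refuse to run on a file that does not exist.
    [ -f "$pdf" ] || { echo "no such file: $pdf" >&2; return 1; }
    if [ -n "$txt" ]; then
        pdftotext "$pdf" "$txt"     # explicit output name
    else
        pdftotext "$pdf"            # writes <file_name>.txt next to the PDF
    fi
}
```

You’d call it as pdf_to_text <file_name>.pdf notes.txt, where notes.txt is whatever name you want for the text file.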

Now, if you checked the man page above, you’d see that there’s not a whole lot to this application. You can largely ignore all the options (and we will), though there aren’t that many.

The two options we are most interested in are for converting only some of the pages into text. For that, you want the -f (first page) and -l (last page) flags. They do exactly what you’d expect and the syntax is as follows:

pdftotext -f <page_number> -l <page_number> <file_name>.pdf

I’ll give you an example…

Let’s say you want to convert pages 1 through 3. The syntax would be:

pdftotext -f 1 -l 3 <file_name>.pdf

Sometimes this whole pdftotext thing doesn’t do a great job. If the PDF file is formatted in a fancy manner, it may just not come out in text all that well. Fortunately, you can help it along with the -layout flag.

The -layout flag is described like this:

Maintain (as best as possible) the original physical layout of the text. The default is to ‘undo’ physical layout (columns, hyphenation, etc.) and output the text in reading order.

So, that flag will do its best to turn the layout into what it was in the original PDF. This is a handy flag for when the output isn’t usable. It’s possible to retain columns, advanced formatting, and all of that stuff, meaning the text file output is more useful. You won’t always need this option, but it can come in handy. You can safely ignore the remainder of the man page for the vast majority of what folks are going to do with this command.
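
To see what -layout does to a tricky PDF before committing to a file, you can send the text to the terminal instead. Passing - as the output file name tells pdftotext to write to standard output. This is a sketch only; preview_pages is a hypothetical name and the error message is made up.

```shell
# Hypothetical helper: print a page range to the terminal with layout preserved.
preview_pages() {
    first=$1
    last=$2
    pdf=$3
    # Bail out early if the tool is not installed.
    command -v pdftotext >/dev/null 2>&1 || {
        echo "pdftotext not found" >&2
        return 127
    }
    # "-" as the output name sends the text to standard output.
    pdftotext -f "$first" -l "$last" -layout "$pdf" -
}
```

Something like preview_pages 1 3 <file_name>.pdf | less lets you page through the result and decide whether the layout is worth keeping.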

That’s pretty much all you need to know about the pdftotext application. It does what you think it’d do. It’s the tool you use to turn PDF into text, just like it says on the tin! Pretty handy!

Closure:

So, that’s an article…

If you’ve ever wanted to turn PDF into text, you now know how. You can use this to make a PDF easier to parse, easier to read, etc. It’s up to you how you use pdftotext. You now have the knowledge! You now have the power! Indeed, you have life by the horns. (Which is a rather silly place to grab onto.)

Man, this is a lot of articles… At this point, it’s almost habitual. Technically, I have published something every other day – for a long time. A couple of those articles weren’t really articles. They were placeholders because Mother Nature is a fickle beast and I live in a very remote location. We had a few major (deadly even) storms that took out our infrastructure. I think I can be forgiven for that – and I did upload articles saying that there’d be no article.

The site has come a long way…

I haven’t done a meta article in a while…

Seriously, without you (my readers) I’d have never kept going this long. It’s obviously not a money-making operation, but it is an educational operation. That’s more important than money.

Thanks for reading! If you want to help, or if the site has helped you, you can donate, register to help, write an article, or buy inexpensive hosting to start your site. If you scroll down, you can sign up for the newsletter, vote for the article, and comment.

Hide The Output From wget

This won’t be a very complicated article and will only apply to those who want to hide the output from wget. It’s just a matter of a simple flag so that it won’t be a very long article.

You can download from the terminal. You can transfer files from the terminal. One of the tools for this is wget. There’s also curl, but this article sticks to wget.

This could probably be called a short, but it’s something I wanted to cover.

wget:

You probably won’t need to install wget. It’s one of those tools that you’ll find installed by default. It’s a pretty handy tool. You can verify that wget is an available application with this command:

which wget

The output should match this:

$ which wget
/usr/bin/wget

If you want to see why I’d cover such a small piece of wget, check the man page with the following command:

man wget

First, you’ll see the description of wget, which is this:

Wget – The non-interactive network downloader.

Now scroll down…

Keep scrolling…

And keep going…

There’s a whole lot to the wget command. It’s a very complicated command. If you’re a new Linux user, you will be overwhelmed by this man page.

This is the sort of command that you can learn to use bit by bit. You don’t need to learn everything. You almost certainly don’t need everything. That doesn’t mean you can’t use it for useful tasks.

I often use the wget command. I use it not only with my Lubuntu testing but also with my regular activities. I’ll often find the URL for a file and then use wget to download the file. When I do that, it’s because I want to monitor the output.

Other times, I don’t want to monitor the output. So, for that, I use wget in quiet mode. That’s what this article is about.

Hide The Output From wget:

The wget application is an application used in the terminal. I believe there are download managers that are GUIs that use wget in the background. We’ll ignore those and use the terminal. So, press CTRL + ALT + T and let’s learn how to hide the output from wget.

The command you’re after is just the wget command with the -q flag. It would look something like this:

wget -q <URL>

The thing is, this now means that you no longer see the progress. If a download gets interrupted, you can tell wget to pick up where it left off instead of starting over. That’s the ‘continue’ flag (-c) and looks like this:

wget -qc <URL>

You can try this on your own with this command:

wget -qc https://linux-tips.us/files/sort.txt

That’s a pretty small file, so it won’t take a lot of time.

You won’t see any messages in your terminal; it will just download the file.

You can test this by running ls in your terminal after the fact. You’ll happily see that you’ve downloaded a file called ‘sort.txt’, all without a single line of output.
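
With -q there’s nothing on screen to tell you whether the download worked, so the exit status is the thing to check: wget exits with 0 on success and a non-zero value on failure. Here’s a minimal sketch, assuming you want a one-line report. The name quiet_fetch is hypothetical, and the --tries and --timeout values are arbitrary picks.

```shell
# Hypothetical helper: download quietly, then report based on wget's exit status.
quiet_fetch() {
    url=$1
    command -v wget >/dev/null 2>&1 || { echo "wget not found" >&2; return 127; }
    # -q hides all output; -c resumes a partial download if one exists.
    if wget -qc --tries=3 --timeout=10 "$url"; then
        echo "downloaded: $url"
    else
        rc=$?
        echo "failed (wget exit status $rc): $url" >&2
        return "$rc"
    fi
}
```

You could try it with quiet_fetch https://linux-tips.us/files/sort.txt and you’d get a single line telling you how it went.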

So, now you know how to hide the output from wget…

Closure:

So, yeah, this probably could have been labeled a ‘short’ article, but I didn’t do so. I try to use that title for things that aren’t as involved, just a simple command in other words. This is pretty simple, but it’s also something you might use regularly.

The wget command is this hulking command with a bunch of options. Not even I fully understand all of the options and I’ve been using the application for years. There’s just a lot to it and that’s far more than we’ll ever cover and far more than most of you will ever use. Still, it can be a pretty handy command and you’ll see more of it in the future.

Save A Web Page As Text

If you’re at all like me, you document all sorts of things and you too might find it handy to know how to save a web page as text. It’s not a complicated task; you can do it in the terminal easily enough. So, if you want to save a web page as text, read on!

This intro should be rather short. Imagine that!

I don’t have to explain what a web page is. It’s a page (just a page) on a website.

I don’t have to explain what text means. We’ll just be using .txt files.

While this isn’t something I’ve bothered with in a long time, you might find it interesting and helpful. If you’re into keeping notes of things you want to learn more about and remember, you may find saving a web page as text worthwhile.

You can organize the text files however you want and one of the best benefits is that you can perform searches on your local documents easily enough. This might be something that interests you, especially if you’re new and browsing around the web looking for things to learn.

We’ll only be using a couple of tools. We will be using the terminal.

curl:

The first application you’ll need to save a web page as a text file will be the curl application. The curl application is used to transfer data to or from a URL. By default, a curl command downloads a file and writes it to your standard output.

If you check the man page, you’ll see:

curl – transfer a URL

See? Exactly as I had said. It’s the correct tool for the job.

You can also see this article about curl:

Let’s Have a Limited Look at Linux’s cURL Application

html2text:

This should be obvious by the title. It should be made further obvious by the title of this article. This is an application that turns HTML (Hypertext Markup Language – what is used on web pages more often than not) into plain text.

If you check the man page, you’ll see:

html2text – an advanced HTML-to-text converter

Once again, a fine application for the task at hand. You’ll see!

Save A Web Page As Text:

As mentioned above, this is a terminal-based operation. We’re going to save a web page as text, but we’re going to do it in the Linux terminal. More often than not, a terminal can be opened by pressing CTRL + ALT + T on your keyboard.

I’ll give installation instructions for the apt-using distros out there. These packages will be available in your package manager if you’re using any of the major distros. Just adjust these commands to match your needs.

curl:

sudo apt install curl

html2text:

sudo apt install html2text

We’re interested only in the -o (output) flag for this application of html2text.

The Process:

The syntax to save a web page as text is simple. It looks like this:

curl <URL> | html2text -o <saved_filename>.txt

Simply put, we’re using the curl application to grab the data. We then send that data through the pipe, where it’s processed by the html2text application.

An example would look like this:

curl https://linux-tips.us | html2text -o linux-tips.txt

You can, of course, save individual pages as text. Here’s an example:

curl https://linux-tips.us/create-a-new-user/ | html2text -o create_a_new_user.txt

The terminal output is interesting:

$ curl https://linux-tips.us/create-a-new-user/ | html2text -o create_a_new_user.txt
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  195k    0  195k    0     0  66303      0 --:--:--  0:00:03 --:--:-- 66319
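
That table is curl’s progress meter, which shows up because the page itself is going into the pipe rather than to the terminal. curl’s -s flag silences it, -S still shows errors, and -L follows redirects. Here’s a small sketch that wraps the whole pipeline; save_page is a hypothetical name and the error message is made up.

```shell
# Hypothetical helper: fetch a page quietly and save it as plain text.
save_page() {
    url=$1
    out=$2
    # Both tools are required, so check for each one first.
    for tool in curl html2text; do
        command -v "$tool" >/dev/null 2>&1 || {
            echo "missing tool: $tool" >&2
            return 127
        }
    done
    # -s hides the progress meter, -S keeps error messages, -L follows redirects.
    curl -sSL "$url" | html2text -o "$out"
}
```

Used as save_page https://linux-tips.us/create-a-new-user/ create_a_new_user.txt, it should give you the same file without the extra terminal output.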

Then, you can use a plain text editor to read (and edit) the text file. You can view it in the terminal with just the cat command. That’d look like this:

cat <saved_filename>.txt

Though, it’s probably easier to read the saved file with a decent plain text editor that has a GUI. There’s an abundance of text editors available for Linux, so pick your favorite and use that to read the saved output.

Closure:

Well, if you have ever wanted to save a web page as text, you now know how to do that. This was an article that came not from my notes but from my memory. I used to do this with some regularity but I’ve stopped doing so as of late. I haven’t kept so many new notes lately, though I’m not sure why not.

Anyhow, this is a nice and simple exercise that anyone should be able to follow. If you’re using a different package manager it may take a bit more effort, but it’s not complicated. The packages should be available in all the major distros, or something similar. The curl application will certainly be available and might even be installed by default.

Mastering the Linux Terminal Pipe Command

Well, if it’s not obvious by the title, it soon will be obvious that I’ve once again leaned on AI to write an article, this time about the pipe command. I decided to stick (mostly) to the title AI gave this article, but it was longer than it should be.

AI tried to title this:

“Mastering the Linux Terminal Pipe Command: A Comprehensive Guide”

Anyhow, this is one of those articles that I just can’t write. No matter what I write, it will not be adequate – even though the pipe is a simple enough concept. Much like a recent grep article, this is just one of those articles I won’t write well.

Also, I’m not sure that I should call it a command. It’s more an operator than a command (the Bash manual calls it a control operator), but the references I see refer to it as a command more frequently than as an operator. For convenience and convention’s sake, I will call it the pipe command.

No, this isn’t something you install. This is a command that you use with other commands. It’s a lot like the operators I’ve already written about. If you’re unfamiliar with the concept, read this article:

How To: Write Text To A File From The Terminal with “>” and “>>”

The short of it is that the pipe takes the output from one command and inserts it into another command. This lets you take the output from one command and parse it with another command. That’s all there is to it – which is why I’m unable to write this article.

Which is why I leaned on my good buddy AI for this article…

Mastering the Linux Terminal Pipe Command:

In the world of Linux, the terminal pipe command stands as a quintessential tool, offering a powerful and flexible means to manipulate data streams. Understanding and mastering the pipe command can significantly enhance your efficiency and productivity in the Linux environment. In this comprehensive guide, we delve into the intricacies of the pipe command, exploring its functionalities, use cases, and advanced techniques.

What is the Pipe Command?

At its core, the pipe command, represented by the symbol |, allows you to redirect the output of one command as input to another command. This seamless connection between commands enables the creation of complex data processing pipelines, facilitating the manipulation and transformation of data with remarkable ease.

Basic Usage:

The basic syntax of the pipe command is straightforward:

command1 | command2

Here, the output generated by command1 is passed as input to command2. This chaining of commands enables the execution of multiple operations in a single line, streamlining workflows and reducing the need for intermediate files.

Practical Examples:

Let’s explore some practical examples to illustrate the utility of the pipe command:

Counting Words in a File:

cat file.txt | wc -w

This command displays the number of words in the file file.txt. The cat command outputs the contents of the file, which are then piped to wc -w, which counts the words.

Searching for a Pattern:

grep "pattern" file.txt | wc -l

Here, grep is used to search for the specified pattern in the file file.txt. The output, which consists of lines containing the pattern, is then piped to wc -l, which counts the number of matching lines.
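
For this particular job grep can do the counting itself with its -c flag, so the pipe is optional. The sketch below builds a small throwaway file (the file contents and the word ‘pattern’ are just sample data) and shows that both forms agree.

```shell
# Build a throwaway file with two matching lines and one non-matching line.
tmpfile=$(mktemp)
printf 'a pattern here\nnothing to see\nanother pattern\n' > "$tmpfile"

piped=$(grep "pattern" "$tmpfile" | wc -l)    # count lines via the pipe
direct=$(grep -c "pattern" "$tmpfile")        # let grep count by itself

# wc may pad its output with spaces on some systems, so compare as numbers.
[ "$piped" -eq "$direct" ] && echo "both report $direct matching lines"

rm -f "$tmpfile"
```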

Sorting Data:

sort file.txt | uniq

This command sorts the lines in the file file.txt alphabetically and removes duplicate lines using the uniq command.

Advanced Techniques:

While the basic usage of the pipe command is invaluable, mastering advanced techniques can unlock its full potential:

Chaining Multiple Commands:

command1 | command2 | command3 | ... | commandN

You can chain multiple commands together to create sophisticated data processing pipelines. Each command in the pipeline operates on the output of the preceding command, enabling complex transformations with minimal effort.
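
As a concrete sketch of a longer chain, here’s a pipeline that counts how often each line appears and keeps the most common one. The input is made-up sample data fed in with printf.

```shell
top=$(printf 'apple\nbanana\napple\ncherry\napple\nbanana\n' |
    sort |            # group identical lines together
    uniq -c |         # collapse each group and prefix it with a count
    sort -rn |        # order by count, largest first
    head -n 1)        # keep only the top entry
echo "$top"
```

Each stage only knows about the text it receives from the stage before it, which is why the same building blocks can be rearranged for other jobs.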

Combining Filters:

command1 | filter1 | filter2 | ... | filterN

Filters such as grep, sed, and awk can be combined to perform intricate text-processing tasks. By leveraging the unique capabilities of each filter, you can manipulate data in a myriad of ways, ranging from pattern matching to text substitution.

Redirecting Output:

In addition to chaining commands, you can redirect the output of a pipeline to a file using the > operator. This allows you to capture the results of your data processing pipeline for future reference or analysis.
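
Here’s a short sketch of that, plus one related tool that isn’t mentioned above: tee, which writes a file and still passes the data along to the next command. The file names and sample data are placeholders.

```shell
workdir=$(mktemp -d)
printf 'pear\napple\npear\n' > "$workdir/fruit.txt"

# Save the result of the pipeline to a file; nothing is shown on screen.
sort "$workdir/fruit.txt" | uniq > "$workdir/unique.txt"

# tee saves a copy and also passes the data on to the next command.
count=$(sort "$workdir/fruit.txt" | uniq | tee "$workdir/copy.txt" | wc -l)
echo "unique lines: $count"
```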

Common Pitfalls:

While the pipe command offers immense flexibility, it’s essential to be mindful of potential pitfalls:

Order of Operations:

The order in which commands are chained together matters. Ensure that you sequence commands appropriately to achieve the desired outcome. A misplaced command in the pipeline can lead to unexpected results or errors.

Input Compatibility:

Ensure that the output of one command is compatible with the input requirements of the next command in the pipeline. Mismatched data formats or incompatible options can result in errors or unintended behavior.

Performance Considerations:

Complex pipelines involving multiple commands may incur performance overhead. Be mindful of resource utilization, especially when working with large datasets or computationally intensive operations.
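
The first pitfall is easy to see with sort and uniq. uniq only collapses duplicate lines that sit next to each other, so it has to come after sort. A minimal sketch with made-up input:

```shell
# sort first: the duplicate "b" lines end up adjacent, so uniq can merge them.
right=$(printf 'b\na\nb\n' | sort | uniq | wc -l)

# uniq first: the two "b" lines are not adjacent, so nothing is removed.
wrong=$(printf 'b\na\nb\n' | uniq | sort | wc -l)

echo "sort | uniq leaves $right lines"
echo "uniq | sort leaves $wrong lines"
```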

Addendum:

The Linux terminal pipe command is a versatile tool that empowers users to streamline data processing tasks effectively. By mastering its usage and exploring advanced techniques, you can unlock the full potential of the Linux command-line environment. Whether you’re a novice user or a seasoned Linux enthusiast, the pipe command remains an indispensable asset in your toolkit, enabling you to easily conquer complex data manipulation challenges.

Closure:

So, that was AI doing my job for me. Like grep, I tried to write an article about pipe, which was a hot mess. I can usually salvage articles and publish something of my work, but I just didn’t do well with a couple of subjects. This is one of them.

The thing is, I refer to the pipe command with some regularity. I don’t have an article about pipes, so I can’t link to that. This leaves the reader with a search engine and I’d rather they have an excuse to open an additional link. It’s not just good SEO, it’s good hospitality. I’ll never explain everything, but I can explain some things and people won’t need to leave the site to learn those things.

Also, even AI had issues with this article. I told it to write 1200 words and it came up with maybe 600 words. I applaud those who can turn the pipe command into more than a blurb with a few examples that help people grasp the concept. Seriously, hats off to them. I don’t write nearly as well as my volume of articles would imply.

I don’t think I’ll need to use AI for any near-future articles. I’m doing two of them fairly close together because they’re things I feel need to be done. They are articles that need to be written. It is information that needs to be on the site. I did separate the two AI-written articles by some time, just to give folks a break between them. I know, they’re not preferred and they surely don’t match my writing style.

Thanks for indulging me, if nothing else. Amusingly, this isn’t much of a time-saver. The way ChatGPT formats stuff is not compatible with the editor used by my instance of WordPress. I spend a lot of time just formatting things.

Speaking of time invested…

Avoid Storing Commonly Used Commands In Your Bash History

This article won’t need to be all that long but it might be complicated as we discuss how to avoid storing commonly used commands in your Bash history. Yes, it’s a long title.

This is also a bit contrary. It is one of those things that is easier done than said. It’s a very wordy thing, after all. I’ll do my best to describe what’s going on and why you might want to do this.

In this case, Bash stands for Bourne Again Shell. This article only applies to those who are using Bash. Bash is not the only shell available and people may opt to use other shells. If you’re one of those people, I don’t think this is going to work for you.

When you’re using the terminal, you’re probably using Bash, as it’s the default shell on most distros. The commands you enter into the terminal are stored in ~/.bash_history, a hidden file in your home directory. We’ve discussed some of this before.

How To: Have Infinite Bash History
Playing With Your Bash History
How To: Not Save A Command To Bash History
How To: Reload Your .bash_profile

Well, you may type common commands, such as uptime. You may not want to store that command in your Bash history. Do you want to store every time you’ve typed the ls command?

You don’t have to. You have options!

What can you do? Well, you can tell Bash not to store certain commands in the ~/.bash_history file. This is actually a simple operation. To avoid storing commonly used commands in your Bash history, you need only to edit your ~/.bashrc file. I’ll show you how!

Man, this is going to impact the layout…

How To Avoid Storing Commonly Used Commands In Your Bash History:

Yeah, no amount of formatting is going to make that look good.

Anyhow, if this isn’t obvious, you’re going to learn how to do this in the terminal. You could edit your ~/.bashrc file with your favorite GUI editor but we’ll be doing this entire thing in the terminal.

As such, you should have an open terminal. More often than not, you can open your terminal by pressing CTRL + ALT + T. If that doesn’t work, you can find a shortcut to your terminal in your application menu. Should that not work, you’re probably already in the terminal!

So, first, we need to use Nano to edit the ~/.bashrc file. That’s an easy command:

nano ~/.bashrc

Use your arrow keys to navigate to the bottom of that file. Go to the absolute bottom and press ENTER to start a new line. You can press that key twice to provide some separation and to make it easier to read.

Now, let’s say we don’t want to store the ls, uptime, or touch commands in your Bash history file. We’ll use those as our examples. You should also probably leave a comment in your ~/.bashrc file so that you can easily identify what the code does and remember why you added it. That’s also useful if there are other users.

So, add the following lines:

# ignore commonly used terminal commands
export HISTIGNORE="ls:ls *:uptime:touch *"

Next, save that file. As we’re using Nano, you save the file by pressing CTRL + X, then Y, and then ENTER on your keyboard.

Next, you reload your ~/.bashrc file much like you reloaded your Bash profile (which was a link in the intro, should you wish to read it). You reload the ~/.bashrc file with this command:

source ~/.bashrc

That should reload the file. If it doesn’t, you can close all your terminal instances and open a new one. If that doesn’t work, you can log out and log back in again.

Anyhow…

Commands that match one of the colon-separated entries will not be stored in the ~/.bash_history file. Each entry is a pattern that has to match the entire command line, so a bare ls entry only covers ls typed by itself, while an entry like ls * also covers ls with arguments. Matching commands are ignored, meaning they won’t clutter up your ~/.bash_history file with commands you’re already familiar with or commands that don’t need to be stored for things like auditing or security reasons.
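
HISTIGNORE entries are shell patterns that must match the complete command line. The sketch below imitates that matching with a case statement so you can try out a list before putting it in ~/.bashrc. It’s only an approximation (it skips Bash’s special & pattern), and would_ignore is a hypothetical helper, not something built into Bash.

```shell
# Hypothetical helper: approximates HISTIGNORE matching with a case statement.
# Returns 0 if the command line matches any colon-separated pattern.
would_ignore() (
    line=$1
    patterns=$2
    set -f          # keep "*" in patterns from expanding to file names
    IFS=:           # split the pattern list on colons
    for pat in $patterns; do
        [ -n "$pat" ] || continue       # skip empty entries from stray colons
        case $line in
            $pat) exit 0 ;;             # the pattern matched the whole line
        esac
    done
    exit 1
)

# Try a few sample command lines against a sample pattern list.
patterns="ls:ls *:uptime:touch *"
for cmd in "ls" "ls -la" "lsblk" "touch notes.txt"; do
    if would_ignore "$cmd" "$patterns"; then
        echo "ignored: $cmd"
    else
        echo "stored:  $cmd"
    fi
done
```

Note how lsblk is still stored: neither ls nor ls * matches it, because the pattern has to cover the whole line.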

It’s pretty simple to do, though it’s a bit of a pain in the butt to explain. This is how you avoid storing commonly used commands in your Bash history – something nobody is going to search for. (If you did find your way here via a search engine, be sure to leave a comment. I want to know who you are!)

Closure:

I realize that this is an awkward article and I’m okay with that. This isn’t something everyone is going to bother with, especially those people who don’t do much in the terminal. Still, it’s possible to avoid storing commonly used commands in your Bash history and now you know how.

Then, someday, someone’s going to search for this exact string of characters and, hopefully, they’ll find this article. I hope this satisfies their curiosity and helps them reach their Linux goals! If you did read this and find it valuable, you can always leave a comment.