Save A Web Page As Text

If you’re at all like me, you document all sorts of things and you too might find it handy to know how to save a web page as text. It’s not a complicated task; you can do it in the terminal easily enough. So, if you want to save a web page as text, read on! 

This intro should be rather short. Imagine that!

I don’t have to explain what a web page is. It’s a page (just a page) on a website.

I don’t have to explain what text means. We’ll just be using .txt files.

While this isn’t something I’ve bothered with in a long time, you might find it interesting and helpful. If you’re into keeping notes of things you want to learn more about and remember, you may find saving a web page as text worthwhile.

You can organize the text files however you want and one of the best benefits is that you can perform searches on your local documents easily enough. This might be something that interests you, especially if you’re new and browsing around the web looking for things to learn.
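As a rough illustration of that kind of local search, here’s a minimal sketch using grep. The directory and the two sample files are stand-ins made up for this example (a temporary directory is used so it’s safe to run); you would point grep at wherever you actually keep your saved pages.

```shell
# Hypothetical notes directory; a temporary one is used so the sketch is safe to run.
notes_dir="$(mktemp -d)"

# Two stand-in text files, as if they had been saved from web pages.
printf 'How to create a new user with useradd\n' > "$notes_dir/create_a_new_user.txt"
printf 'How to check your NIC speed\n' > "$notes_dir/nic_speed.txt"

# -r: search recursively, -i: ignore case, -l: list only the names of matching files.
matches="$(grep -ril 'useradd' "$notes_dir")"
echo "$matches"
```

Only the file that mentions the search term is listed, which is handy once you have a pile of saved pages.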

We’ll only be using a couple of tools, and we’ll be using them in the terminal.

curl:

The first application you’ll need to save a web page as a text file is curl. The curl application transfers data to or from a server, given a URL. By default, a curl command downloads the URL’s contents and writes them to your standard output.

If you check the man page, you’ll see:

curl – transfer a URL

See? Exactly as I had said. It’s the correct tool for the job. 

You can also see this article about curl:

Let’s Have a Limited Look at Linux’s cURL Application

html2text:

What this one does should be obvious from its name, and further obvious from the title of this article. This is an application that turns HTML (Hypertext Markup Language – what is used on web pages more often than not) into plain text.

If you check the man page, you’ll see:

html2text – an advanced HTML-to-text converter

Once again, a fine application for the task at hand. You’ll see!

Save A Web Page As Text:

As mentioned above, this is a terminal-based operation. We’re going to save a web page as text, but we’re going to do it in the Linux terminal. More often than not, a terminal can be opened by pressing CTRL + ALT + T on your keyboard.

I’ll give installation instructions for the apt-using distros out there. These packages will be available in your package manager if you’re using any of the major distros. Just adjust these commands to match your needs.

curl:
sudo apt install curl
html2text:
sudo apt install html2text
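
If you’d like to confirm that both tools ended up on your PATH, something like the following should do. This is only a sketch; check_tools is a name made up for this example, not a real command.

```shell
# check_tools is a hypothetical helper name used only for this sketch.
# It reports whether each tool from this article can be found on the PATH.
check_tools() {
    for tool in curl html2text; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "$tool: found"
        else
            echo "$tool: not found (install it with your package manager)"
        fi
    done
}

check_tools
```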

For this use of html2text, we’re interested only in the -o (output) flag, which names the file the converted text is written to.

The Process:

The syntax to save a web page as text is simple. It looks like this:

curl <URL> | html2text -o <saved_filename>.txt

Simply, we’re using the curl application to grab the data. We then send that data through a pipe (the | character), where it’s processed by the html2text application and written to the file you named.

An example would look like this:

curl https://linux-tips.us | html2text -o linux-tips.txt

You can, of course, save individual pages as text. Here’s an example:

curl https://linux-tips.us/create-a-new-user/ | html2text -o create_a_new_user.txt
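
If you save a lot of individual pages, typing a file name each time gets old. Here’s a small sketch that builds a file name from the last part of the URL. The url_to_filename helper is made up for this example, and the naming scheme (dashes become underscores, .txt on the end) is just one assumption about how you might like your files named.

```shell
# url_to_filename is a hypothetical helper: it turns the last path
# segment of a URL into a .txt file name.
url_to_filename() {
    trimmed="${1%/}"            # drop one trailing slash, if present
    segment="${trimmed##*/}"    # keep what follows the last remaining slash
    printf '%s.txt\n' "$(printf '%s' "$segment" | tr '-' '_')"
}

# Prints: create_a_new_user.txt
url_to_filename https://linux-tips.us/create-a-new-user/

# It could then be combined with the command above (needs network access):
# url=https://linux-tips.us/create-a-new-user/
# curl "$url" | html2text -o "$(url_to_filename "$url")"
```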

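The terminal output is interesting. What you see is curl’s progress meter, which appears because the page itself is going into the pipe rather than onto your screen: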

$ curl https://linux-tips.us/create-a-new-user/ | html2text -o create_a_new_user.txt
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  195k    0  195k    0     0  66303      0 --:--:--  0:00:03 --:--:-- 66319
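
If you’d rather not see that progress meter, curl has flags for that. The sketch below wraps the same pipeline in a small function; save_page is a name invented for this example, and the choice of flags is a suggestion rather than anything the original command requires.

```shell
# save_page is a hypothetical helper name, not part of curl or html2text.
# Usage: save_page <URL> <saved_filename>.txt
save_page() {
    url="$1"
    outfile="$2"
    # -f: treat HTTP errors as failures instead of saving the error page,
    # -s: hide the progress meter, -S: still show error messages,
    # -L: follow redirects.
    curl -fsSL "$url" | html2text -o "$outfile"
}

# Example call (needs network access), commented out so nothing runs by accident:
# save_page https://linux-tips.us/create-a-new-user/ create_a_new_user.txt
```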

Then, you can use a plain text editor to read (and edit) the text file. You can view it in the terminal with just the cat command. That’d look like this:

cat <saved_filename>.txt

Though, it’s probably easier to read the saved file with a decent plain text editor that has a GUI. There’s an abundance of text editors available for Linux, so pick your favorite and use that to read the saved output.

Closure:

Well, if you have ever wanted to save a web page as text, you now know how to do that. This was an article that came not from my notes but from my memory. I used to do this with some regularity but I’ve stopped doing so as of late. I haven’t kept so many new notes lately, though I’m not sure why not.

Anyhow, this is a nice and simple exercise that anyone should be able to follow. If you’re using a different package manager it may take a bit more effort, but it’s not complicated. The packages should be available in all the major distros, or something similar. The curl application will certainly be available and might even be installed by default.

Thanks for reading!

KGIII

Retired mathematician, residing in the mountains of Maine. I may be old and wise, but I am not infallible. Please point out any errors. And, as always, thanks again for reading.
