If you’re at all like me, you document all sorts of things and you too might find it handy to know how to save a web page as text. It’s not a complicated task; you can do it in the terminal easily enough. So, if you want to save a web page as text, read on!
This intro should be rather short. Imagine that!
I don’t have to explain what a web page is. It’s a page (just a page) on a website.
I don’t have to explain what text means. We’ll just be using .txt files.
While this isn’t something I’ve bothered with in a long time, you might find it interesting and helpful. If you’re into keeping notes of things you want to learn more about and remember, you may find saving a web page as text worthwhile.
You can organize the text files however you want and one of the best benefits is that you can perform searches on your local documents easily enough. This might be something that interests you, especially if you’re new and browsing around the web looking for things to learn.
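To give an idea of what searching those local documents could look like, here’s a minimal sketch. The function name search_notes and the ~/notes directory are my own inventions for illustration; it only assumes your saved pages end in .txt and that find and grep are available.

```shell
# Hypothetical helper: list saved .txt files that mention a term.
# Usage: search_notes <term> [directory]
search_notes() (
    term=$1
    dir=${2:-.}
    # find collects the .txt files; grep then checks them:
    #   -i ignores case, -l prints only the names of matching files,
    #   -F treats the term as a literal string rather than a pattern
    find "$dir" -type f -name '*.txt' -exec grep -ilF -- "$term" {} +
)
```

Something like search_notes "package manager" ~/notes would then print the names of the saved pages that contain that phrase.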
We’ll only be using a couple of tools, and we’ll be using them in the terminal.
curl:
The first application you’ll need to save a web page as a text file is curl. The curl application transfers data to or from a server that you identify with a URL. By default, a curl command downloads the resource and writes it to your standard output.
If you check the man page, you’ll see:
curl – transfer a URL
See? Exactly as I had said. It’s the correct tool for the job.
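Because curl writes what it downloads to standard output, the shell can pipe that output into other commands. Here’s a small sketch of that idea. The function name peek_page is made up for this example, and -s is the curl flag that hides the progress meter.

```shell
# Hypothetical helper: print the first few lines of a page's raw HTML.
# Usage: peek_page <URL> [line_count]
peek_page() (
    url=$1
    lines=${2:-5}
    # curl sends the response body to standard output, so it can be piped.
    # -s keeps curl from printing its progress meter.
    curl -s "$url" | head -n "$lines"
)
```

Running peek_page https://linux-tips.us 10 should show the first ten lines of the page’s HTML, which makes it clear why a converter is needed to get readable text.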
You can also see this article about curl:
Let’s Have a Limited Look at Linux’s cURL Application
html2text:
What this one does should be obvious from its name, and even more obvious from the title of this article. It’s an application that turns HTML (Hypertext Markup Language, the markup that web pages are written in) into plain text.
If you check the man page, you’ll see:
html2text – an advanced HTML-to-text converter
Once again, a fine application for the task at hand. You’ll see!
Save A Web Page As Text:
As mentioned above, this is a terminal-based operation. We’re going to save a web page as text, but we’re going to do it in the Linux terminal. More often than not, a terminal can be opened by pressing
I’ll give installation instructions for the apt-using distros out there. These packages will be available in your package manager if you’re using any of the major distros. Just adjust these commands to match your needs.
curl:
    sudo apt install curl
html2text:
    sudo apt install html2text
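If you’re not sure whether the installs worked (or whether the tools were already there), a quick check like the following sketch can tell you. The function name check_tools is hypothetical; it relies only on the shell’s built-in command -v.

```shell
# Hypothetical helper: report which of the named tools are missing.
# Prints one line per missing tool and returns non-zero if any are absent.
check_tools() {
    missing=0
    for tool in "$@"; do
        # command -v succeeds only when the shell can find the command
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}
```

You’d call it as check_tools curl html2text. No output means both tools were found.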
We’re interested only in the -o (output) flag for this application of html2text.
The Process:
The syntax to save a web page as text is simple. It looks like this:
    curl <URL> | html2text -o <saved_filename>.txt
Simply put, we’re using the curl application to grab the data. We then send that data through a pipe (the | character) to the html2text application, which processes it and writes the result to the file named after -o.
An example would look like this:
    curl https://linux-tips.us | html2text -o linux-tips.txt
You can, of course, save individual pages as text. Here’s an example:
    curl https://linux-tips.us/create-a-new-user/ | html2text -o create_a_new_user.txt
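If you save pages often, typing a file name every time gets old. Here’s one possible sketch that derives the name from the last part of the URL and then runs the same curl and html2text pipeline. Both function names (url_to_filename and save_page_as_text) and the naming rule itself are assumptions of mine, not something either tool provides.

```shell
# Hypothetical helper: turn a URL into a file name ending in .txt.
url_to_filename() (
    name=${1#*://}         # drop the scheme (https://)
    name=${name%%"?"*}     # drop any query string
    name=${name%%"#"*}     # drop any fragment
    name=${name%/}         # drop one trailing slash
    name=${name##*/}       # keep only the last path segment
    # replace every character that is not a letter or digit with "_"
    name=$(printf '%s' "$name" | tr -c 'A-Za-z0-9' '_')
    printf '%s.txt\n' "$name"
)

# Hypothetical wrapper: save a page as text in the current directory
# and print the name of the file that was written.
save_page_as_text() (
    url=$1
    out=$(url_to_filename "$url")
    curl "$url" | html2text -o "$out"
    echo "$out"
)
```

With this naming rule, save_page_as_text https://linux-tips.us/create-a-new-user/ writes to create_a_new_user.txt, the same name used in the example above.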
The terminal output is interesting:
    $ curl https://linux-tips.us/create-a-new-user/ | html2text -o create_a_new_user.txt
      % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                     Dload  Upload   Total   Spent    Left  Speed
    100  195k    0  195k    0     0  66303      0 --:--:--  0:00:03 --:--:-- 66319
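That progress meter is printed to standard error, which is why it shows up in the terminal and not in your text file. If you’d rather not see it, and you’d like to skip the conversion when the download fails, something like this sketch may help. The function name save_page_quietly is hypothetical. The curl flags used are -f (treat HTTP errors as failures), -s (silent), -S (still show errors), and -L (follow redirects).

```shell
# Hypothetical helper: fetch quietly, and only convert when the download worked.
# Usage: save_page_quietly <URL> <output.txt>
save_page_quietly() (
    url=$1
    out=$2
    tmp=$(mktemp) || exit 1
    # Download to a temporary file first so a failed download
    # never reaches html2text.
    if curl -fsSL "$url" -o "$tmp"; then
        html2text -o "$out" < "$tmp"
        status=$?
    else
        status=1
        echo "download failed: $url" >&2
    fi
    rm -f "$tmp"
    # The function body runs in a subshell, so exit only ends the function.
    exit "$status"
)
```

The temporary file is there because a plain pipe would still run html2text, and create an output file, even when curl had nothing to give it.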
Then, you can use a plain text editor to read (and edit) the text file. You can view it in the terminal with just the cat command. That’d look like this:
    cat <saved_filename>.txt
Though, it’s probably easier to read the saved file with a decent plain text editor that has a GUI. There’s an abundance of text editors available for Linux, so pick your favorite and use that to read the saved output.
Closure:
Well, if you have ever wanted to save a web page as text, you now know how to do that. This was an article that came not from my notes but from my memory. I used to do this with some regularity but I’ve stopped doing so as of late. I haven’t kept so many new notes lately, though I’m not sure why not.
Anyhow, this is a nice and simple exercise that anyone should be able to follow. If you’re using a different package manager it may take a bit more effort, but it’s not complicated. The packages should be available in all the major distros, or something similar. The curl application will certainly be available and might even be installed by default.
Thanks for reading!