Software

Extract Text From Multiple File Types

Today we will have a fairly simple exercise as we’re going to just use a Python application to extract text from multiple file types. This is a pretty standard operation but will require some preparation.

Fortunately, I’m ahead of the game! You’re good to go if you follow along on the site and have already enabled PIP. Otherwise…

You will need to install PIP for this article. This is not complicated.

First, read this article:

Install Python’s PIP Part One

Technically, you could just do that. However, you should add the path so that you don’t have to specify the location of your Python applications and can easily use them from the terminal.

So, read this article:

Install Python’s PIP Part Two

Now that you’ve done those two things, you’re good to proceed. See? It was worth the time to write those articles! They’re useful and save a lot of time.

The tool we’re going to use is known as “Textract“. Don’t quote me on this, but I believe this could also apply to Windows users, though installing the dependencies for this would be a different process. I’m not a Windows user. If you are, feel free to comment and let us know how things work on your side of life.

Textract:

While there is no built-in man page, the Textract application is described like this:

While several packages exist for extracting content from each of these formats on their own, this package provides a single interface for extracting content from any type of file, without any irrelevant markup.

It is a pretty handy application and claims to extract the text from more file types than I could reasonably expect to test. Here’s a list of files that you should be able to extract text from.

.csv via python builtins
.doc via antiword
.docx via python-docx2txt
.eml via python builtins
.epub via ebooklib
.gif via tesseract-ocr
.jpg and .jpeg via tesseract-ocr
.json via python builtins
.html and .htm via beautifulsoup4
.mp3 via sox, SpeechRecognition, and pocketsphinx
.msg via msg-extractor
.odt via python builtins
.ogg via sox, SpeechRecognition, and pocketsphinx
.pdf via pdftotext (default) or pdfminer.six
.png via tesseract-ocr
.pptx via python-pptx
.ps via ps2text
.rtf via unrtf
.tiff and .tif via tesseract-ocr
.txt via python builtins
.wav via SpeechRecognition and pocketsphinx
.xlsx via xlrd
.xls via xlrd

You may need to install specific packages for some of these file formats. Those packages can usually be found in your default repositories. It otherwise comes with quite a lot of functionality out of the box.

I did test some of those formats and it seemed to work okay. Your mileage may vary, of course. However, Textract was able to extract text from multiple file types.

Extract Text From Multiple File Types:

If you want to extract text from multiple file types with Textract (a fantastic name for an application) then you’ll first need to install it. I’ve yet to find a working GUI PIP installation tool, so that means you’re going to need an open terminal.

More often than not, you can open your terminal by simply pressing CTRL + ALT + T on your keyboard. If your distro doesn’t adhere to the norms, you can find a terminal in your application menu. If you don’t use an application menu, you already know how to open a terminal and you don’t need any help from me.

First, let’s install Textract:

pip3 install textract

Note the lack of sudo. You’re installing this for your user account and do not need elevated permissions for this. Python packages go right into your ~/ directory. See below, as you’ll want to install some dependencies for full functionality.

You may see an error or two during installation but that doesn’t seem to matter. It will take a minute to install and watching the installation chug along is good fun.

Using Textract:

With Textract installed, you can now extract text from a whole variety of file types. The syntax is as follows:

textract /path/to/file

That sends the output to the standard output (your terminal). I suspect that most folks are going to want to save the output to a file. For that, you just need to add the -o flag and a file name. So, something like this:

textract /path/to/file -o <new_file_name>.txt

That’s going to extract the text from some file types but not all of them.

Now, this is from a Lubuntu installation…

This isn’t going to work with all the listed file types at this time. You need some dependencies to be installed. For me, and it’s a long one, the command was:

sudo apt install python2-dev libxml2-dev libxslt1-dev antiword unrtf poppler-utils pstotext tesseract-ocr flac ffmpeg lame libmad0 libsox-fmt-mp3 sox libjpeg-dev swig libpulse-dev

That’s slightly different from the command they include on their page, but it appears to do the trick. You’ll have some of those installed by default but running the command will sort itself out. You’ll have to modify the command to suit your distro, but that should work with Debian, Ubuntu, Linux Mint, and other Debian-based distros.

With that installed, I can even grab the text from image files.

Here’s an example:

This is some simple text to test how well Textract really works.

Here’s the command:

textract sample_text.png

Here’s the output:

$ textract simple_text.png
This is some simple text.
Let's see Textract in operation!

I dare say that’s pretty good. I tried other pictures and it was good enough to get the gist of things. Complicated image files with many columns appear to be a bit of a stumbling block. But it’s not terrible.

It has no trouble at all with other file formats.

It can be a bit fussy to get Textract properly installed but it seems to do the trick once installed. If you want to extract text from multiple file types, Textract is a pretty good piece of software.

Closure:

If you want to extract text from multiple file types, this is definitely a good tool for the job. It certainly handles a lot of files and does a good job with them. It’s not perfect. None of these tools are. Complicated image files threw it off a bit, but Textract lives up to its name.

There was a reason I wrote those articles about PIP. Being able to install Python packages via a repository is a great thing. There’s some great Python software out there and we’ve barely touched the surface. Linux is great like that, that is offering great Python support.

Do you have a use for this in your daily activities? If so, leave a comment letting us know how you use Textract and what makes you pick it over other applications. You can even use a real email address. I never send spam. I never sell your information.

Thanks for reading! If you want to help, or if the site has helped you, you can donate, register to help, write an article, or buy inexpensive hosting to start your site. If you scroll down, you can sign up for the newsletter, vote for the article, and comment.

KGIII

Retired mathematician, residing in the mountains of Maine. I may be old and wise, but I am not infallible. Please point out any errors. And, as always, thanks again for reading.