Optical character recognition for PDF files.

Pypdfocr is very nice.   The input is a PDF file, for example the output of scanner.  The output is another PDF, which looks like the original but now has the words recognized in it.  That lets you can search it, and if you index all your documents then that’s very useful.  Spotlight on my Mac sees into these.

You can extract the raw text using pdftotxt, which is nice for reading on the train.

I was delighted the it understands columns pretty well.  It is not so good at paragraph breaks though.

I gather that a some of people use this to scan all the paper, receipts, et. al. that comes into their home.  It has some clever switches to help with that usecase.

It is a bit of a pain to install, lots of homebrew packages and pip packages are required; and then – at least in my case – it works but it complains that I didn’t get it right.  There are pages that talk about these things; but I’m happy enough now.

Leave a Reply

Your email address will not be published. Required fields are marked *