Question
Is there any freeware OCR software (for Linux and/or Windows) that can take a scanned PDF document as input and output a searchable PDF, like Adobe Acrobat does?
By searchable PDF I mean one where the OCRed text is laid invisibly over the scanned page image, so that it can be selected with the mouse and copied.
I know that gscan2pdf on Linux can do something like this, but it places the text in the top left corner of the page, far too small and not aligned with the text on the scanned page behind it. This is because gscan2pdf feeds the whole page to the OCR engine. It should instead split the image into smaller images, each holding a single line of text or a short paragraph, and send those to the OCR software.
Explanation / Answer
A recent version of Tesseract (3.03 RC at the time of writing) can do this:
free, open source and cross-platform
PDF output is available starting from version 3.03
command-line (CLI) software
supports multiple languages
unfortunately it takes a single image as input, so to build a complete document you need a batch script that converts each page image to a searchable PDF. The resulting one-page PDFs then have to be combined into a single PDF with a tool such as pdftk.
This is the command:
tesseract -l <lang> input.tif output pdf
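The batch step described above could be sketched roughly as follows. This is only an illustration, not a tested workflow: the helper names run, ocr_pages and merge_pages, the DRY_RUN switch and the page-*.tif naming are all assumptions made for the example. It relies on tesseract adding the .pdf extension to the output base name, and on pdftk's cat operation to join the pages.

```shell
#!/bin/sh
# Sketch only: helper names, DRY_RUN and file names are hypothetical.

# Run a command, or just print it when DRY_RUN is set to a non-empty value.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# OCR each page image into a one-page searchable PDF.
# Usage: ocr_pages <lang> <image>...
ocr_pages() {
    lang="$1"
    shift
    for img in "$@"; do
        # Output base is the image name without its extension;
        # tesseract appends ".pdf" itself.
        run tesseract -l "$lang" "$img" "${img%.*}" pdf
    done
}

# Combine the one-page PDFs into a single document with pdftk.
# Usage: merge_pages <output.pdf> <page.pdf>...
merge_pages() {
    out="$1"
    shift
    run pdftk "$@" cat output "$out"
}

# Example calls (assumed file names):
#   ocr_pages eng page-*.tif
#   merge_pages combined.pdf page-*.pdf
```

Setting DRY_RUN=1 before calling the functions prints the commands instead of running them, which is a cheap way to check the file names and page order before OCRing a long document. The pages are merged in the order the shell expands the glob, so zero-padded page numbers (page-001, page-002, ...) keep them in sequence.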