difference between content and OCR of a document? [ANSWERED]

bwakkie
Posts: 18
Joined: Fri Feb 14, 2020 8:28 pm

difference between content and OCR of a document? [ANSWERED]

Post by bwakkie »

Hi,

I am a bit confused about the difference between content and OCR. As I see it, if a document already has content, it should not need OCR, so having both seems strange to me.

regards,
Bastiaan
Last edited by bwakkie on Thu Oct 01, 2020 8:08 am, edited 1 time in total.
oohlaf
Posts: 4
Joined: Tue Jul 07, 2020 9:03 am

Re: difference between content and OCR of a document?

Post by oohlaf »

Content is present when a native parser can extract text from a document without OCR.

For example, my multifunction printer (MFP) can run OCR on the device while scanning a document. The PDF that the scanner produces contains plain-text content alongside the scanned images, and Mayan can read that text.

However, my MFP produces very poor OCR results for certain types of documents, and for some typefaces it confuses letters and digits (such as O and 0).

When I upload such a document to Mayan, the content attribute contains the plain text found in the PDF (the OCR result of my MFP). When I then schedule the document for OCR using Tesseract, the ocr_content attribute holds the (usually higher-quality) OCR result.
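
To make the two attributes concrete, here is a minimal Python sketch. It does not use Mayan's API: DocumentText is a hypothetical stand-in that only borrows the attribute names content and ocr_content from the description above, and the sample strings are invented for illustration. It lists the words where the text embedded by the scanner and the later OCR result disagree.

```python
import difflib
from dataclasses import dataclass


@dataclass
class DocumentText:
    """Hypothetical stand-in for a document's two text sources (not Mayan's API)."""
    content: str      # text embedded in the file, e.g. the scanner's own OCR layer
    ocr_content: str  # text produced later by a separate OCR pass


def differing_words(doc: DocumentText) -> list[tuple[str, str]]:
    """Return (embedded, ocr) word pairs where the two text sources disagree."""
    embedded = doc.content.split()
    ocr = doc.ocr_content.split()
    matcher = difflib.SequenceMatcher(a=embedded, b=ocr, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            pairs.append((" ".join(embedded[i1:i2]), " ".join(ocr[j1:j2])))
    return pairs


# Invented sample: the embedded layer confuses the digit 0 with the letter O.
sample = DocumentText(
    content="Invoice 2O2O-OO17 total 150 EUR",
    ocr_content="Invoice 2020-0017 total 150 EUR",
)
print(differing_words(sample))
```

A comparison like this only shows where the two sources differ; it cannot tell which one is right, so the judgement about quality still has to come from looking at the scanned page.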
michael
Developer
Posts: 61
Joined: Sun Apr 19, 2020 6:21 am

Re: difference between content and OCR of a document?

Post by michael »

Thanks oohlaf for the reply.

To paraphrase: the content tab shows the text embedded in the document's file; for an office or text file, that is the actual text of the document. The OCR tab shows the text that resulted from Mayan's OCR processing.

Even for documents that already include text, such as office documents, it is a good idea to keep OCR enabled, because it can pick up text from diagrams and illustrations.
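
As a rough illustration of why keeping both text sources is useful, the Python sketch below checks a query against each of them. It is not how Mayan's search is implemented; find_in_document is a hypothetical helper, and the sample strings are invented. The OCR text here contains a figure label that the embedded text lacks.

```python
def find_in_document(content: str, ocr_content: str, query: str) -> list[str]:
    """Return the names of the text sources that contain the query (case-insensitive)."""
    needle = query.casefold()
    sources = {"content": content, "ocr_content": ocr_content}
    return [name for name, text in sources.items() if needle in text.casefold()]


# Invented sample: the body text is embedded, the figure label only shows up via OCR.
embedded_text = "Quarterly report for 2020"
ocr_text = "Quarterly report for 2020 Figure 1: Revenue by region"

print(find_in_document(embedded_text, ocr_text, "quarterly"))
print(find_in_document(embedded_text, ocr_text, "revenue"))
```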
bwakkie
Posts: 18
Joined: Fri Feb 14, 2020 8:28 pm

Re: difference between content and OCR of a document?

Post by bwakkie »

Thanks for the explanation. I see the advantage of running OCR even when there is content. So when a document has no content but does have OCR text, do I just leave it like that, or should a workflow at some point move the OCR data to the content section?