New document properties for OCR and parsing status

Posts: 21
Joined: Tue Aug 21, 2018 2:32 pm

Post by daniel1113 » Thu Oct 04, 2018 2:14 pm

I am struggling with identifying documents that have been parsed and OCRd. As far as I can tell, there is no way to quickly search for, index, or distinguish documents based on whether or not they have been OCRd or parsed. To get around this, I created two simple workflows. One tracks OCR status, the other parsing status. Both rely on automatic transition triggers when a document version is queued for OCRing or parsing, and when it is complete.
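For anyone wanting to picture the setup, here is a rough sketch of how a status workflow like this behaves. The state names, event names, and the `apply_event` helper are made up for illustration; they are not Mayan's own identifiers or API.

```python
# Hypothetical states and event names, chosen only for this sketch.
TRANSITIONS = {
    ("pending", "ocr_queued"): "queued",
    ("queued", "ocr_finished"): "done",
    ("done", "ocr_queued"): "queued",  # document version resubmitted for OCR
}


def apply_event(state, event):
    """Return the next workflow state, or the same state if no transition matches."""
    return TRANSITIONS.get((state, event), state)


def run(events, state="pending"):
    """Feed a sequence of events through the workflow and return the final state."""
    for event in events:
        state = apply_event(state, event)
    return state
```

Because an event with no matching transition leaves the state unchanged, a repeated event is harmless in this sketch. The parsing workflow would look the same with its own pair of events.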


This works okay, but I don't have a lot of confidence that it will always be correct. I can foresee a situation where, for whatever reason, a workflow transition doesn't get triggered, so the workflow status will not accurately reflect whether a document has been OCRd or parsed.

Would it be possible to expose a document property for the OCR and parsing status? I'm thinking simple booleans like "has_ocr" and "has_content." These properties would be set as TRUE if all of the pages in a given document have been OCRd or parsed, and otherwise, FALSE.

It would also be very useful if there were visual identifiers in document lists showing which documents have OCR text or parsed content, such as small icons or tags that appear only for those documents.

Posts: 284
Joined: Tue Aug 21, 2018 3:28 am

Re: New document properties for OCR and parsing status

Post by rosarior » Thu Oct 04, 2018 8:21 pm

Thanks for the feedback.

We added a custom signal and then an event that fires when a document has finished OCR. The workflow you described should work 100% of the time, since workflow triggering is guaranteed to occur. The workflow trigger binding is done at the event system level and events are not dropped. In some cases they could actually be duplicated, for example when there are concurrency issues, but they never get dropped. We wrap the event code in a database transaction so that an event is recorded only if the action it describes actually happens too.
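To illustrate the transaction point in general terms, here is a minimal sketch using Python's built-in `sqlite3` module. It is not Mayan's actual code: the table names, the `store_ocr_result` helper, and the `simulate_failure` flag are all assumptions made up for this example. The idea is only that the OCR result and its event are written in one transaction, so they are committed together or not at all.

```python
import sqlite3


def make_db():
    """Create an in-memory database with hypothetical tables for this sketch."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE page_text (page_id INTEGER PRIMARY KEY, text TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE events ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "name TEXT NOT NULL, page_id INTEGER NOT NULL)"
    )
    conn.commit()
    return conn


def store_ocr_result(conn, page_id, text, simulate_failure=False):
    """Write the OCR text and its 'finished' event in a single transaction."""
    # `with conn` commits if the block succeeds and rolls back if it raises.
    with conn:
        conn.execute(
            "INSERT INTO page_text (page_id, text) VALUES (?, ?)", (page_id, text)
        )
        conn.execute(
            "INSERT INTO events (name, page_id) VALUES (?, ?)",
            ("ocr_finished", page_id),
        )
        if simulate_failure:
            # Demonstration only: fail after both writes, before the commit.
            raise RuntimeError("simulated failure before commit")


def count(conn, table):
    """Count rows in one of the two known tables."""
    if table not in ("page_text", "events"):
        raise ValueError("unknown table")
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
```

If the block fails partway through, neither the text nor the event survives, so there is never an event for work that did not happen, and never finished work without its event.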

The problem with OCR indicators is knowing when to declare that a document has OCR or not. If we use the logic that all pages must have text, then a single blank page with no text is enough to make the indicator say there is no OCR. We tried something similar in the past and it caused confusion, so we removed it and started making the switch to events instead of trying to analyze text content. OCR indicators are a good idea; we just haven't found reliable logic to implement them.
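A small sketch of the blank-page problem, using a made-up `Page` class and helper names rather than Mayan's models. One check infers OCR status from text content, the other from a per-page flag that would be set when an OCR-finished event arrives.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """Hypothetical page record used only for this illustration."""
    text: str = ""
    ocr_finished: bool = False  # assumed to be set from an OCR-finished event


def has_ocr_by_text(pages):
    """Content-based check: every page must contain some non-whitespace text."""
    return bool(pages) and all(page.text.strip() for page in pages)


def has_ocr_by_event(pages):
    """Event-based check: every page must have been processed by OCR."""
    return bool(pages) and all(page.ocr_finished for page in pages)


# A two-page document whose second page is genuinely blank.
pages = [
    Page(text="Invoice 42", ocr_finished=True),
    Page(text="", ocr_finished=True),
]
```

Here `has_ocr_by_text(pages)` returns `False` even though OCR ran on both pages, while `has_ocr_by_event(pages)` returns `True`.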

We added support for indexing based on OCR and parsing signals about two days ago :) ... abf2169d44 ... 7ef9786d45

These changes are planned for version 3.2 but I'll see if we can push them for the next 3.1 (3.1.4) bug fix release. This way at least you can list which documents have finished doing text capture while we find a way to implement text indicators.
