New document properties for OCR and parsing status

daniel1113
Posts: 24
Joined: Tue Aug 21, 2018 2:32 pm

New document properties for OCR and parsing status

Post by daniel1113 » Thu Oct 04, 2018 2:14 pm

I am struggling with identifying documents that have been parsed and OCRd. As far as I can tell, there is no way to quickly search for, index, or distinguish documents based on whether or not they have been OCRd or parsed. To get around this, I created two simple workflows. One tracks OCR status, the other parsing status. Both rely on automatic transition triggers when a document version is queued for OCRing or parsing, and when it is complete.

[Attachment: image.png]

This works okay, but I don't have a lot of confidence that it will always be correct. I can foresee a situation where, for whatever reason, a workflow transition doesn't get triggered, so the workflow status will not accurately reflect whether a document has been OCRd or parsed.

Would it be possible to expose a document property for the OCR and parsing status? I'm thinking simple booleans like "has_ocr" and "has_content." These properties would be set as TRUE if all of the pages in a given document have been OCRd or parsed, and otherwise, FALSE.

It would also be very useful if there were visual identifiers in document lists showing which documents have OCR text or parsed content, for example small icons or tags that appear only for those documents.

rosarior
Posts: 440
Joined: Tue Aug 21, 2018 3:28 am

Re: New document properties for OCR and parsing status

Post by rosarior » Thu Oct 04, 2018 8:21 pm

Thanks for the feedback.

We added a custom signal and then an event that shows a document has finished doing OCR. The workflow you described should work 100% of the time since workflow triggering is guaranteed to occur. The workflow trigger binding is done at the event system level and events are not dropped. In some cases they could actually be duplicated, for example when there are concurrency issues, but they never get dropped. We wrap the event code in a database transaction so that an event is committed only if the action it describes is committed too.
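
Since events may be duplicated but are not dropped, a workflow built on them stays correct as long as a repeated event does not cause a second transition. The following is a minimal sketch of that idea in plain Python. It is not Mayan's workflow code; the state names, event names, and the `OcrWorkflow` class are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Hypothetical state and event names, for illustration only.
TRANSITIONS: Dict[Tuple[str, str], str] = {
    ("not_processed", "ocr_queued"): "ocr_pending",
    ("ocr_pending", "ocr_finished"): "ocr_done",
}


@dataclass
class OcrWorkflow:
    state: str = "not_processed"
    history: List[Tuple[str, str]] = field(default_factory=list)

    def handle(self, event: str) -> bool:
        """Apply the transition defined for (current state, event), if any."""
        target = TRANSITIONS.get((self.state, event))
        if target is None:
            # Duplicate or out-of-order event: leave the state untouched.
            return False
        self.history.append((self.state, target))
        self.state = target
        return True


wf = OcrWorkflow()
# The last event is a duplicate delivery of "ocr_finished".
results = [wf.handle(e) for e in ["ocr_queued", "ocr_finished", "ocr_finished"]]
```

Because each transition is only valid from one origin state, the duplicated event is ignored and the workflow still ends in the "done" state exactly once.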

The problem with OCR indicators is knowing when to declare that a document has OCR or not. If we use the logic that all pages must have text, then if one page is blank and therefore has no text, the indicator will say there is no OCR. We tried something similar in the past and it caused confusion, so we removed it and started making the switch to events instead of trying to analyze text content. OCR indicators are a good idea; we just haven't found reliable logic to implement them.
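
To make the blank-page problem concrete, here is a small sketch of the "all pages must have text" rule. The function name `has_ocr_all_pages` and the sample page texts are made up for this example and do not correspond to anything in Mayan.

```python
from typing import List


def has_ocr_all_pages(page_texts: List[str]) -> bool:
    """Hypothetical rule: a document 'has OCR' only if every page has text."""
    return len(page_texts) > 0 and all(text.strip() for text in page_texts)


# A three-page scan that was fully processed, but whose second page is blank.
scanned = ["Invoice 1001", "", "Total due: 40.00"]
flag = has_ocr_all_pages(scanned)
```

Here `flag` comes out false even though OCR ran on every page, which is the kind of misleading result described above. Text content alone cannot tell a blank page apart from a page that was never processed.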

We added support for indexing based on OCR and parsing signals about two days ago :)
https://gitlab.com/mayan-edms/mayan-edm ... abf2169d44
https://gitlab.com/mayan-edms/mayan-edm ... 7ef9786d45

These changes are planned for version 3.2 but I'll see if we can push them for the next 3.1 (3.1.4) bug fix release. This way at least you can list which documents have finished doing text capture while we find a way to implement text indicators.

daniel1113
Posts: 24
Joined: Tue Aug 21, 2018 2:32 pm

Re: New document properties for OCR and parsing status

Post by daniel1113 » Thu Sep 05, 2019 5:40 pm

Roberto:

On a somewhat related topic, could you add the option to remove parsing data or OCR data from a document? This can be done from the Django admin panel, but it would be nice to do it from within Mayan. I'm imagining two options in the document Actions dropdown menu like "Remove OCR" and "Remove parsed content."

And assuming that is possible, could you add transition triggers for when parsing data or OCR data is completely removed from a document using either of these menu options? That is, when the "Remove OCR" or "Remove parsed content" processes are done, could a custom signal and event be fired for workflow purposes?

rosarior
Posts: 440
Joined: Tue Aug 21, 2018 3:28 am

Re: New document properties for OCR and parsing status

Post by rosarior » Fri Sep 06, 2019 10:15 pm

The workflow looks good and it should work pretty much always. Workflow state changes are tied to the event system. It is not possible for the workflow to miss the event after it fires.

The request for a flag for OCR content has been made before. The problem is that a flag is a state, but the OCR system works in the background as a task. Adding a flag that way would open the door to a race condition, so the flag would be unreliable. This is why we use, and recommend using, events, which are guaranteed to fire in a very specific and repeatable way.

daniel1113
Posts: 24
Joined: Tue Aug 21, 2018 2:32 pm

Re: New document properties for OCR and parsing status

Post by daniel1113 » Sat Sep 07, 2019 3:07 am

Roberto:

Can you read my post again? I think you answered a different question than the one I asked. I understand the limitations/problems with flagging OCR content, which is why I'm not asking for that. I'd like to trigger a workflow transition when a process to remove ALL OCR content (not just a single page) is executed. Shouldn't that work the same way as setting a trigger upon OCR completion, just in reverse? Thanks.

rosarior
Posts: 440
Joined: Tue Aug 21, 2018 3:28 am

Re: New document properties for OCR and parsing status

Post by rosarior » Tue Sep 10, 2019 5:31 am

You are correct, apologies.

Yes, adding two document actions to remove the OCR and parsed content should be a straightforward matter. Instead of saving empty content, deleting the related field on the OCR and parsed model (the default when no OCR or parsed content is available) should reset the recognized content in a repeatable manner.
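
The difference between the two approaches can be sketched with a plain dictionary standing in for the per-page content records. This is only an illustration under that assumption; the helper names are hypothetical and this is not how the Mayan models are actually written.

```python
from typing import Dict

# Hypothetical stand-in for stored page content: page id -> recognized text.
PageContentStore = Dict[int, str]


def clear_by_saving_empty(store: PageContentStore, page_id: int) -> None:
    """Overwrite the content with an empty string; the record still exists."""
    store[page_id] = ""


def clear_by_deleting(store: PageContentStore, page_id: int) -> None:
    """Remove the record entirely; safe to call more than once."""
    store.pop(page_id, None)


never_processed: PageContentStore = {}

emptied: PageContentStore = {1: "recognized text"}
clear_by_saving_empty(emptied, 1)

deleted: PageContentStore = {1: "recognized text"}
clear_by_deleting(deleted, 1)
clear_by_deleting(deleted, 1)  # repeating the action changes nothing
```

After deletion the store is identical to one that was never processed, while the emptied store still holds a record that looks like a blank page.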

Thanks for the report, added to the version 3.2.8 work list.

daniel1113
Posts: 24
Joined: Tue Aug 21, 2018 2:32 pm

Re: New document properties for OCR and parsing status

Post by daniel1113 » Tue Sep 10, 2019 3:18 pm

That's great. Thanks so much!

rosarior
Posts: 440
Joined: Tue Aug 21, 2018 3:28 am

Re: New document properties for OCR and parsing status

Post by rosarior » Wed Oct 02, 2019 3:53 am

Version 3.2.8 is out and includes support for deleting the OCR and parsed content of documents. When the deletion of each is finished, an event is committed that can be used as a workflow transition trigger.
