Set a document to a cabinet based on OCR

yesinH · March 4, 2023, 5:39pm

Hello everyone,

I have tested indexing by OCR content and it worked perfectly. However, I am wondering whether documents can also be assigned to a specific cabinet based on their OCR content. If that is not possible, I would also consider assigning documents to a cabinet based on their indexing as a workaround (using the API). Does anyone have experience with this?

Thank you in advance!

Best Regards,
Yassine

roberto.rosario · March 5, 2023, 1:45am

Hi,

The best way to do this is with a workflow. In this example I’ll place an image in the “2023 memes” cabinet if it has the words ‘eggs’ and ‘gas’ in its OCR text.

1. Create the cabinet first.
2. Create the workflow as follows:

a. Two states: created and OCR complete.
b. Add a single transition from created to OCR complete.
c. Set the transition trigger to be the OCR complete event.

d. Add an action to the OCR complete state to add the document to the cabinet.

e. Set the action condition to be the OCR content.

f. Associate the workflow with a document type.
g. Upload a document and test.
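
The condition in step e boils down to a text check. As a rough sketch in plain Python (not Mayan EDMS code; the function name `mentions_all` and the case-insensitive comparison are assumptions made for illustration), the logic could look like this:

```python
def mentions_all(ocr_pages, required_words):
    """Return True when every required word occurs somewhere in the OCR text."""
    # Join the per-page text into one string and compare case-insensitively.
    text = " ".join(ocr_pages).lower()
    return all(word.lower() in text for word in required_words)


pages = ["Eggs are expensive this year", "and so is gas"]
print(mentions_all(pages, ["eggs", "gas"]))
print(mentions_all(pages, ["eggs", "diesel"]))
```

This is a plain substring test, so ‘gas’ would also match inside ‘gasoline’. A regular expression with word boundaries would be stricter if that matters.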

yesinH · March 5, 2023, 9:40am

Hi Roberto,

Thank you so much for your quick reply!

Your solution looks very promising since it does not require the use of the API.

I wish you a nice day!

Best regards,
Yassine

WhizzWr · April 6, 2023, 7:53am

@roberto.rosario semi-related question: is it possible to use the Django templating engine to find the first date in OCR-ed content?

Some combination of for and parse date maybe?

roberto.rosario · April 6, 2023, 6:38pm

Yes it can be done.

{% spaceless %}
{% set document.ocr_content|join:"" as ocr_text %}
{% regex_search "\d*/\d*/\d{4}" ocr_text as matches %}
{{ matches.0 }}
{% endspaceless %}

This template will match a regular expression against the OCR content and produce all matches. You can then select the first date (or any other, or all of them) and use it for indexing, or store it as metadata using a workflow action.

By changing the regular expression this same snippet can be reused for many scenarios.
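
For trying out a pattern outside of a template, here is a small sketch using Python's standard `re` module. It only mirrors the idea of the template above; the helper name `find_dates` and the sample pages are made up for illustration, and it is not how the `regex_search` tag is implemented.

```python
import re

# Same pattern as in the template above.
DATE_PATTERN = re.compile(r"\d*/\d*/\d{4}")


def find_dates(ocr_pages):
    """Join the per-page OCR text and return every substring matching the pattern."""
    text = "".join(ocr_pages)
    return DATE_PATTERN.findall(text)


pages = ["Invoice\nDate: 03/04/2023\n", "Payment due 04/05/2023\n"]
dates = find_dates(pages)
# Pick the first match, or an empty string when nothing matched.
print(dates[0] if dates else "")
```

Note that `\d*` also accepts zero digits, so a string such as `//2023` matches too. Using `\d{1,2}` for the day and month parts is stricter.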

WhizzWr · April 6, 2023, 7:17pm

Thanks, in the meantime I came up with practically the same templating.

{% regex_search "[0-9]{1,2}(\.|\-|\/)[0-9]{1,2}(\.|\-|\/)[0-9]{4}" document.content|join:"" as date %}
{{ date.group | date_parse | date:"d-m-Y" }}

I’ve put this into a workflow with “parsing completed” as the trigger, and now I have a date filler that works at least 70% of the time, since letters and receipts usually have the date somewhere near the top.
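
The same search, parse, and reformat chain can be sketched in plain Python for testing patterns offline. This is only an approximation: `datetime.strptime` stands in for the `date_parse` filter, whose exact parsing rules are not shown in this thread, and day-first ordering is an assumption. The function name `first_valid_date` is hypothetical.

```python
import re
from datetime import datetime

# Same pattern as in the template above: day, month, 4-digit year,
# separated by ".", "-" or "/".
DATE_PATTERN = re.compile(r"[0-9]{1,2}(\.|\-|\/)[0-9]{1,2}(\.|\-|\/)[0-9]{4}")


def first_valid_date(text):
    """Return the first match that is a real calendar date, as dd-mm-YYYY, or None."""
    for match in DATE_PATTERN.finditer(text):
        # Normalize the separators so a single strptime format is enough.
        normalized = re.sub(r"[./]", "-", match.group())
        try:
            # Assumption: dates are written day first.
            parsed = datetime.strptime(normalized, "%d-%m-%Y")
        except ValueError:
            # Looks like a date but is not one (for example 31/02/2023).
            continue
        return parsed.strftime("%d-%m-%Y")
    return None


print(first_valid_date("Berlin, 4.3.2023\nDear customer"))
```

Skipping invalid candidates instead of stopping at the first regex hit means a reference number that happens to look like a date does not hide a real date further down the page.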

roberto.rosario · April 8, 2023, 7:51pm

This is a very good solution. I like the date parsing and validation.

jens.schaerer · April 14, 2023, 10:48am

I didn’t want to start a new topic because I think this is related to this one. I have a really simple workflow which adds some metadata to the document based on the filename. This works well with the action (which doesn’t have a condition) that adds the value from the document label:

{% spaceless %}
{% regex_search "\d{4}-\d{2}-\d{2}" document.label as matches %}
{{ matches.0 }}
{% endspaceless %}
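
The pattern above accepts any digits in the `YYYY-MM-DD` shape, including impossible dates. As a hedged sketch in plain Python (the helper name `date_from_label` and the sample filenames are assumptions, not part of the workflow), the matched text can be validated before it is stored:

```python
import re
from datetime import date

# Same pattern as in the template above.
ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


def date_from_label(label):
    """Return the first YYYY-MM-DD date in the label, or "" if none is valid."""
    match = ISO_DATE.search(label)
    if match is None:
        return ""
    try:
        # fromisoformat raises ValueError for things like month 13.
        date.fromisoformat(match.group())
    except ValueError:
        return ""
    return match.group()


print(date_from_label("2023-04-14_invoice.pdf"))
```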

But in the same state I also try to add the document to a cabinet when the document label (the filename) contains a specific string:

This doesn’t work. If I remove the condition, the document is added, so I think something is wrong with the condition. In the playground, my if statement returns “True” for this specific document.

What is wrong here?

system · September 13, 2023, 11:25am

This topic was automatically closed 12 hours after the last reply. New replies are no longer allowed.