Invoice Number Detection in Documents Using OCR and Parsing

Hi everyone,

I’ve been working on a way to detect invoice numbers (pattern: INV\d{6}, for example INV123456) in documents processed through Mayan EDMS. Both OCR and parsing are enabled in my setup, and I’m handling four possible outcomes:

  1. OCR fails but parsing succeeds.
  2. Parsing fails but OCR succeeds.
  3. Both OCR and parsing fail.
  4. Both OCR and parsing succeed.

I want to make sure that if either OCR or parsing succeeds, I can extract the INV number from whichever source produced it. Here’s the Django template code I came up with:

{% spaceless %}

{% set document.content|join:"" as content_text %}
{% set document.ocr_content|join:"" as ocr_text %}
{% regex_search "INV\d{6}" content_text as matches1 %}
{% regex_search "INV\d{6}" ocr_text as matches2 %}

{% if matches1 and matches2 %}
  Both parsing and OCR succeeded. INV from parsing: {{ matches1.0 }}, INV from OCR: {{ matches2.0 }}
{% elif matches1 %}
  Parsing succeeded. INV Number: {{ matches1.0 }}
{% elif matches2 %}
  OCR succeeded. INV Number: {{ matches2.0 }}
{% else %}
  Both parsing and OCR failed. No INV number detected.
{% endif %}

{% endspaceless %}
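
To check the pattern and the four branches outside Mayan, the same logic can be sketched in plain Python. This is only a minimal sketch under assumptions: the function name detect_invoice_number is hypothetical, and it expects the parsed text and the OCR text as already-joined plain strings rather than Mayan document objects.

```python
import re

# Same pattern as in the template above.
INV_PATTERN = re.compile(r"INV\d{6}")


def detect_invoice_number(content_text, ocr_text):
    """Return (source, number_from_parsing, number_from_ocr).

    Hypothetical helper: mirrors the if/elif/else branches of the template.
    Empty or missing text is treated as "that source failed".
    """
    parsed = INV_PATTERN.search(content_text or "")
    ocr = INV_PATTERN.search(ocr_text or "")

    if parsed and ocr:
        return ("both", parsed.group(0), ocr.group(0))
    if parsed:
        return ("parsing", parsed.group(0), None)
    if ocr:
        return ("ocr", None, ocr.group(0))
    return ("none", None, None)


if __name__ == "__main__":
    # One sample input per outcome from the list above.
    print(detect_invoice_number("Ref INV123456", ""))
    print(detect_invoice_number("", "Scan INV654321"))
    print(detect_invoice_number("", ""))
    print(detect_invoice_number("Ref INV123456", "Scan INV123456"))
```

Running it against a few sample strings is a quick way to confirm that the regular expression itself behaves as expected before looking at the template.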

The solution works well, and I’m able to capture the INV number correctly in all cases. However, I’m curious if anyone has ideas on how to optimize or improve this code. Is there a more efficient way to handle the OCR and parsing conditions?

Looking forward to your thoughts and suggestions! Thank you in advance.
