Question: Automatic metadata extraction and mapping


My use case is that I want to automatically extract metadata from the OCR’ed text of a document.

Initially I need the two values:

  • gross price (related to invoices)
  • and the document date

For the gross price, I thought about extracting the largest decimal value matching a regular expression like \b\d+[\.,]\d{2,2}\b. That means extracting a list of matches from the OCR’ed text, converting them to floating point values, sorting them, and finally picking the largest value, i.e. the last element of the sorted list.
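To make the idea concrete, here is a minimal sketch of the price step using only the standard library. The function name `largest_amount` is made up for illustration, and `max()` stands in for "sort and take the last element". It assumes amounts have no thousands separators: a value like 1.234,56 would be picked up only as 234,56.

```python
import re

# Pattern from the post (simplified spelling): digits, a '.' or ',' separator,
# then exactly two decimal digits.
AMOUNT_RE = re.compile(r"\b\d+[.,]\d{2}\b")


def largest_amount(text):
    """Return the largest decimal amount found in text, or None if there is none.

    Hypothetical helper for illustration; thousands separators are not handled.
    """
    # Normalise the decimal comma to a dot so float() can parse the value.
    values = [float(match.replace(",", ".")) for match in AMOUNT_RE.findall(text)]
    # max() gives the same result as sorting and taking the last element.
    return max(values) if values else None


sample = "Net 100,00 VAT 19,00 Total 119,00"
print(largest_amount(sample))
```

If exact cents matter, `decimal.Decimal` could replace `float` here without changing the structure.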

For the document date I need to do the same, but with a more complex regular expression, potentially even taking the document’s language into consideration in order to parse the date correctly. I also need to convert each date to a sortable date type so that I can sort the results properly and pick either the earliest or the latest date in the document.
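A rough sketch of the date step could look like the following, again with only the standard library. The candidate pattern, the `FORMATS` mapping and the function names are assumptions made for illustration, not an established approach. It covers purely numeric dates only; dates with written-out month names would need locale-aware parsing on top of this.

```python
import re
from datetime import datetime

# Hypothetical candidate pattern: numeric dates such as 31.12.2023,
# 12/31/2023 or 2023-12-31.
DATE_RE = re.compile(r"\b(?:\d{4}-\d{2}-\d{2}|\d{1,2}[./-]\d{1,2}[./-]\d{4})\b")

# Hypothetical mapping from a document language to the formats to try, in order.
FORMATS = {
    "de": ["%d.%m.%Y", "%Y-%m-%d"],
    "en_US": ["%m/%d/%Y", "%Y-%m-%d"],
}


def parse_date(candidate, lang):
    """Try each format configured for lang; return a date or None."""
    for fmt in FORMATS[lang]:
        try:
            return datetime.strptime(candidate, fmt).date()
        except ValueError:
            continue  # Wrong format or impossible date, try the next one.
    return None


def extract_dates(text, lang="de"):
    """Return all parseable dates found in text, sorted ascending."""
    parsed = (parse_date(c, lang) for c in DATE_RE.findall(text))
    return sorted(d for d in parsed if d is not None)


sample = "Invoice dated 05.03.2023, due 19.03.2023, order 2023-02-28"
dates = extract_dates(sample, lang="de")
if dates:
    print("earliest:", dates[0], "latest:", dates[-1])
```

Because `datetime.date` objects compare chronologically, the earliest date is the first element of the sorted list and the latest is the last.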

My question: coming from a programming background, but not a Python one (basic Python scripts yes, enterprise software no), where do I start, and roughly what do I need to do to reach this goal?