How to implement a custom post-process for de-hyphenation after OCR has taken place?

bwakkie
Posts: 18
Joined: Fri Feb 14, 2020 8:28 pm


Post by bwakkie »

Hi,

To improve text quality after OCR has finished, how does Mayan EDMS process hyphenated words?

How would I implement a custom automated post-process that would simply run the following regex (vim)...
:%s:\v([a-z])-\n([a-z]):\1\2:
... to remove the hyphen wherever a line ends with a lowercase letter followed by a hyphen and the next line starts with a lowercase letter.
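
For use outside vim, the same substitution can be sketched in Python. This is only a minimal illustration of the regex itself, not Mayan EDMS code; the function name `dehyphenate` is hypothetical, and how the OCR text would be read from and written back to Mayan EDMS is left open.

```python
import re

# Same pattern as the vim command: a lowercase letter, a hyphen,
# a newline, and a lowercase letter at the start of the next line.
DEHYPHENATE = re.compile(r"([a-z])-\n([a-z])")


def dehyphenate(text: str) -> str:
    """Join words that were split by a hyphen at a line break.

    Hypothetical helper for illustration; it only transforms a string.
    """
    # Keep the two captured letters and drop the hyphen and the newline.
    return DEHYPHENATE.sub(r"\1\2", text)


if __name__ == "__main__":
    sample = "The scanned docu-\nment was processed."
    print(dehyphenate(sample))
```

Because the pattern only matches lowercase letters on both sides, a break such as "Self-\nContained" is left untouched, which matches the behaviour of the vim command.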

Another regex (also in vim form), which removes single newlines inside paragraphs (paragraphs being separated by two consecutive newlines), might be included in this post-process.
:%s:\v([a-zA-Z0-9,\:;\.\(\)]+)\n([a-zA-Z0-9\(\)]+):\1 \2
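
A rough Python counterpart of this second step is sketched below. It is not a literal translation: it uses lookarounds instead of capture groups, so that several short lines in a row are all joined, and the function name `join_paragraph_lines` is hypothetical. The character classes are taken from the vim pattern above.

```python
import re

# A single newline that follows one of the characters allowed at the end
# of a line and precedes one allowed at the start of the next line.
# A blank line (two newlines in a row) never matches, so paragraph
# breaks are preserved.
JOIN_LINES = re.compile(r"(?<=[a-zA-Z0-9,:;.()])\n(?=[a-zA-Z0-9()])")


def join_paragraph_lines(text: str) -> str:
    """Replace line breaks inside a paragraph with a single space.

    Hypothetical helper for illustration; it only transforms a string.
    """
    return JOIN_LINES.sub(" ", text)


if __name__ == "__main__":
    sample = "first line\nsecond line\n\nnew paragraph"
    print(join_paragraph_lines(sample))
```

If both steps are combined, running the de-hyphenation first seems the safer order, since a line ending in a hyphen is not matched by this pattern anyway.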

The end result is the text I would like to insert into the database for searching.

I also thought about creating a PostgreSQL trigger, perhaps on insert into the document_parsing_documentpage table, to perform this. Is that a good option?

Kind regards,
Bastiaan
michael
Developer
Posts: 61
Joined: Sun Apr 19, 2020 6:21 am

Re: How to implement a custom post-process for de-hyphenation after OCR has taken place?

Post by michael »

Hi,

This was added recently to the upcoming 3.5 series branch as a workflow action. Template tags were also added to allow matching, searching and replacing text using regular expressions.

We are close to bumping the stage of the current 3.5 code from alpha to beta and will be looking for feedback on this feature.