Post by ue4hobbyist »

Many non-English documents contain names or words in English, but OCR currently works only for the selected document language, with no way to add a second one. If I am not mistaken, Tesseract OCR supports multi-language OCR.
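For reference, the Tesseract command line takes several languages in one run when their codes are joined with "+" in the -l option. Below is a minimal sketch of building such a call. The helper name build_tesseract_command and the file names are made up for illustration, and this is not how Mayan itself invokes the OCR engine.

```python
from typing import List, Sequence


def build_tesseract_command(image_path: str, output_base: str,
                            languages: Sequence[str]) -> List[str]:
    """Build a tesseract CLI call that loads several languages at once.

    Hypothetical helper for illustration only.
    """
    if not languages:
        raise ValueError("at least one language code is required")
    # Tesseract accepts multiple traineddata names joined with '+'.
    return ["tesseract", image_path, output_base, "-l", "+".join(languages)]
```

Calling build_tesseract_command("page.png", "page", ["rus", "eng"]) gives a list that can be passed to subprocess.run. Each code must match an installed traineddata file (for example eng, rus, chi_sim).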
Re: Dual language OCR

Post by rosarior »

What we have been talking about is making the document language field optional. That way it will be up to the OCR engine to use language detection. This should not break anything, but we will need to test it before we add the database migration and code changes. Thanks for the feedback!
Re: Dual language OCR

Post by bwakkie »

My question was not about the default language but about the possibility of passing extra languages to the Tesseract engine, possibly even three. Sometimes a document contains multiple languages, even on a single page. In scientific documents, the sources are often in multiple languages too.

Tesseract does not detect *any* language and already defaults to English, if I am not mistaken. Tesseract can detect the 'script' (e.g. Latin, Arabic, Han) used in the document, though. Based on that, I wrote a Python script that detects the language of each document from its middle pages.

It's a three-step Tesseract process:
1) Detect the 'script'.
2) Run OCR with the candidate languages belonging to that 'script' and detect the language of the resulting text.
3) Perform the REAL OCR over the whole document within Mayan-EDMS.
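
To make the three steps concrete, here is a rough sketch of the glue logic only. It is not the actual language_detection.py. The script-to-language table, the function names and the score_language callback are all assumptions made for illustration, and step 2 is simplified to "score each candidate and keep the best ones", with the scoring left to whatever language detector is plugged in. Step 1 parses the "Script:" line that Tesseract prints in its orientation and script detection report (--psm 0).

```python
import re
from typing import Callable, Dict, List, Optional, Sequence

# Hypothetical mapping from a detected script to candidate Tesseract
# language codes; adjust it to the traineddata files that are installed.
SCRIPT_CANDIDATES: Dict[str, List[str]] = {
    "Latin": ["eng", "deu", "fra", "nld"],
    "Cyrillic": ["rus", "ukr"],
    "Han": ["chi_sim", "chi_tra"],
}


def parse_osd_script(osd_text: str) -> Optional[str]:
    """Step 1: pull the script name out of Tesseract's OSD report."""
    match = re.search(r"^Script:[ \t]*(\S+)", osd_text, flags=re.MULTILINE)
    return match.group(1) if match else None


def rank_languages(script: str,
                   score_language: Callable[[str], float],
                   min_score: float = 0.5,
                   max_languages: int = 3) -> List[str]:
    """Step 2: score each candidate language and keep the best ones.

    score_language is supplied by the caller (assumption): it should run a
    trial OCR with that language and return a score between 0 and 1.
    """
    candidates = SCRIPT_CANDIDATES.get(script, [])
    scored = [(score_language(lang), lang) for lang in candidates]
    kept = [pair for pair in scored if pair[0] >= min_score]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [lang for _, lang in kept[:max_languages]]


def ocr_language_argument(languages: Sequence[str],
                          fallback: str = "eng") -> str:
    """Step 3: build the value for tesseract's -l option for the real OCR."""
    return "+".join(languages) if languages else fallback
```

The OCR calls themselves are deliberately kept out of the sketch, so the selection logic can be tried without Tesseract installed. Note that candidates are drawn from a single script, so a mostly Chinese document with a little English would still need the Latin candidates added by other means.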

language_detection.py Chineese+a_littlebit_of_english_test.pdf
['chn' 'gbr']
language_detection.py Russian_test.pdf

So I need to add those languages to the document in the Mayan application.
What is the best way (in a simple, noob-friendly explanation, please) to include a local script in a workflow so that the language (which defaulted to 'English' for everything after the bulk upload) is altered and the documents are re-OCRed if the language has changed?
