Add languages to OCR

Gasur
Posts: 4
Joined: Sun Jan 13, 2019 12:34 am

Post by Gasur » Sun Jan 13, 2019 12:45 am

Hello.

I've been trying to add Danish to my list of languages for OCR support, so far without success. By default it supports unusual languages such as Ancient Greek, but not Danish.
I am using the Docker version, set up manually with MySQL instead of PostgreSQL. I also spun up a completely fresh instance using the one-line installer on your website, with no luck.
Danish is supported by Tesseract, as shown here: https://github.com/tesseract-ocr/tesser ... Data-Files.

Code:

root@db34fa5eaea5:/opt/mayan-edms# apt-cache search tesseract-ocr
tesseract-ocr-dan-frak - tesseract-ocr language files for Danish (Fraktur)
tesseract-ocr-dan - tesseract-ocr language files for Danish
tesseract-ocr-osd - tesseract-ocr language files for script and orientation
tesseract-ocr-eng - tesseract-ocr language files for English
tesseract-ocr - Tesseract command line OCR tool
tesseract-ocr-equ - tesseract-ocr language files for equations

What I've tried so far:
  • Reinstalled the Docker containers with different databases (although that should not make a difference).
  • Installed OCR packages using the -e MAYA_APT_INSTALL parameter.
  • Installed the packages manually inside the container, using apt install tesseract-ocr-dan tesseract-ocr-dan-frak.
  • Tried changing the OCR backend from the default one to ocr.backends.tesseract.Tesseract, but the container crashed, stating that no such module exists.

Any ideas?

Post by Gasur » Wed Jan 16, 2019 10:31 pm

I did some more investigating.

Code:

tesseract --list-langs

shows that dan and dan_frak are in fact installed, along with a lot of other languages that Mayan apparently does not support.
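
To confirm this from a script rather than by eye, the code can be matched against the list as a whole line, so that dan is not confused with dan_frak. The sketch below is only an illustration: has_lang is a made-up helper name, and the 2>&1 redirect is there because, as far as I know, some Tesseract versions print the list to standard error rather than standard output.

```shell
# has_lang is a hypothetical helper, not part of Tesseract or Mayan.
# It reads a language list on standard input (one code per line) and
# succeeds only if the requested code matches a whole line exactly.
has_lang() {
    grep -qxF -- "$1"
}

# Inside the container, feed it the real list. The header line that
# tesseract prints before the codes never equals a language code.
if command -v tesseract >/dev/null 2>&1; then
    if tesseract --list-langs 2>&1 | has_lang dan; then
        echo "dan is installed"
    else
        echo "dan is missing"
    fi
fi
```

If this reports that dan is installed while Mayan still does not offer Danish, the problem is on the Mayan side rather than in the Tesseract packages.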

Looking at the source files and searching for Danish, I can find Danish in the language translations in 0001_initial.py on line 660.
I can also find the language in 0029_auto_20160122_0755.py on line 17, clearly listing Danish. However, it is still not supported.

I do not understand why Mayan cannot find Danish when Tesseract supports it just fine. Since pyocr is a wrapper for Tesseract, it should work out of the box.

Post by Gasur » Thu Jan 17, 2019 6:17 am

After endless hours, I finally figured out a way to add languages for OCR. This should work with both the Docker and the direct installation; however, I have not yet tested the direct installation.

After following the guide on how to set up Docker here https://docs.mayan-edms.com/chapters/do ... er-install, do the following:

First, enter the Docker container:

Code:

docker exec -it mayan-edms bash

Then enter the folder that Mayan is installed in:

Code:

cd /var/lib/mayan

Next, copy the default config file:

Code:

cp config_backup.yml config.yml

We need to edit the new config.yml, but to do this we first need to install a text editor. You can install any editor you like; I'll install nano.

Code:

apt-get update && apt-get install nano

Now the fun part starts: we have to edit the file, but be *extremely* careful. YAML files are very picky about spacing; one wrong space and the config file stops working.

Code:

nano config.yml

You'll see a lot of settings that are not relevant here, but about 40 lines down you'll see a setting called "DOCUMENT_LANGUAGE_CODES:"; this is the one we want. Go to the very end of the list of languages that follows it and add the language you want in the same manner. It HAS to be an ISO 639-3 code, so English is not en or en_gb but eng. If the code is not typed correctly, it will not work. Read more about languages here https://docs.mayan-edms.com/chapters/languages.html. Please note that, as seen in my first post, the command-line parameters described there do not work anymore.
To exit nano, hold Ctrl and press X, then press Y and Enter.
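
Two of the pitfalls above can be checked mechanically before restarting: whether the code at least has the three-lowercase-letter shape of an ISO 639-3 code, and whether a tab character has slipped into the indentation (YAML does not allow tabs for indentation). This is a minimal sketch under assumptions: looks_like_iso639_3 and find_tab_indent are made-up helper names, the shape check does not prove that the code is really assigned in ISO 639-3, and neither check knows anything about how Mayan itself reads the file.

```shell
# Hypothetical helpers for illustration; neither ships with Mayan.

# Succeeds if the argument is exactly three lowercase ASCII letters,
# the shape of an ISO 639-3 code (eng, dan), unlike en or en_gb.
# It does not check that the code is actually assigned to a language.
# The body runs in a subshell so the LC_ALL change stays local.
looks_like_iso639_3() (
    LC_ALL=C
    case "$1" in
        [a-z][a-z][a-z]) true ;;
        *) false ;;
    esac
)

# Prints "line-number:content" for every line of the given file whose
# leading whitespace contains a tab. Succeeds if any such line exists.
find_tab_indent() {
    grep -n "^[[:space:]]*$(printf '\t')" "$1"
}

looks_like_iso639_3 dan && echo "dan has the right shape"
looks_like_iso639_3 en_gb || echo "en_gb does not"

# Run from /var/lib/mayan after saving the file in nano.
if [ -f config.yml ] && find_tab_indent config.yml; then
    echo "replace the tabs on the lines above with spaces"
fi
```

Passing both checks does not guarantee a valid file, since a misplaced space can still break the YAML, but it catches two of the easiest mistakes to make.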

After editing the file, exit the container and restart it. If you get "bash: docker: command not found", you are not outside the container yet; type exit again until the restart command works.

Code:

exit
docker container restart mayan-edms

I hope this helps anyone who wants languages that are not in the list by default.


Read more here: https://docs.mayan-edms.com/topics/sett ... ation-file
