(OCR error) Failed loading language / Tesseract couldn't load any languages / Could not initialize tesseract (possible solution)

If you are having language problems with OCR like the one above, try this step-by-step guide.
As the error states, you must install the language you need; in my case it is the “Portuguese-BR” language.
Open “Docker Desktop” and click the three dots next to the “app-1” container, then click “open in terminal”:

A terminal inside the container will open. Enter the following commands in order:

sudo apt update
apt list --upgradable
sudo apt upgrade
tesseract --version
sudo apt install tesseract-ocr-por

You can replace “por” (Portuguese) with the language you want.
I believe there is no need to restart any container in Docker; just upload a new file to test the OCR. Hope this helps.
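To confirm that the language pack is actually visible to Tesseract before uploading a test file, you can check the output of `tesseract --list-langs`. A minimal sketch, assuming `tesseract` is on the PATH inside the container; `has_lang` is a hypothetical helper name, not something Mayan or Tesseract provides:

```shell
#!/bin/sh
# has_lang CODE: read `tesseract --list-langs` output on stdin and succeed
# only if CODE appears as a whole line. The match is case-sensitive, so
# "Eng" does not match "eng".
has_lang() {
    grep -qxF -- "$1"
}

# Usage inside the container (2>&1 covers older builds that print the
# list on stderr):
#   tesseract --list-langs 2>&1 | has_lang por && echo "por is installed"
```

If the check fails right after the install, the package name or the language code is the first thing to double-check.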

You can also use the MAYAN_APT_INSTALLS environment variable in your compose file to make the containers install the additional packages automatically. This is especially useful in a multi-container setup:

MAYAN_APT_INSTALLS="tesseract-ocr-deu tesseract-ocr-eng"


It seems that the MAYAN_APT_INSTALLS variable in the .env file is no longer functioning as expected. I am currently using version 4.8.2 of Mayan EDMS, and the workaround I’ve been using is:

docker exec -it your-container-name /bin/bash
apt-get update && apt-get install -y tesseract-ocr-ara

Could you advise if there is a more efficient or recommended approach for handling this?

I’m still on 4.7.2, but I cannot find anything in the release notes about the removal of the MAYAN_APT_INSTALLS variable.

Your approach seems OK, but you have to make sure to run the install every time you recreate the container. With MAYAN_APT_INSTALLS, the container will always install the missing packages automatically on startup.
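If you handle several OCR languages, the package list for MAYAN_APT_INSTALLS can be generated instead of typed by hand. A rough sketch, assuming the packages follow the tesseract-ocr-<code> naming seen in this thread and that underscores in a Tesseract code become hyphens in the package name; `langs_to_packages` is a hypothetical helper, not part of Mayan:

```shell
#!/bin/sh
# langs_to_packages CODE...: print the matching package names on one line,
# separated by spaces, assuming the tesseract-ocr-<code> naming pattern.
langs_to_packages() {
    out=""
    for code in "$@"; do
        # Assumption: package names use hyphens where language codes
        # use underscores (for example chi_sim).
        pkg="tesseract-ocr-$(printf '%s' "$code" | tr '_' '-')"
        # Append with a single space separator, no leading space.
        out="${out:+$out }$pkg"
    done
    printf '%s\n' "$out"
}

# Usage: build the value for the .env file.
#   echo "MAYAN_APT_INSTALLS=\"$(langs_to_packages deu eng)\""
```

It is worth confirming each generated name with `apt-cache policy <package>` inside the container, since not every language code necessarily maps to a package this way.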


It’s definitely a bug somewhere.

Mayan is looking for /usr/share/tesseract-ocr/5/tessdata/Eng.traineddata, but on the filesystem it’s /usr/share/tesseract-ocr/5/tessdata/eng.traineddata (note the difference in case: Eng.traineddata instead of eng.traineddata).

I did “patch” the problem in the container but it’s not a great solution: ln -s eng.traineddata Eng.traineddata

How can I run that command when the container is created?

Exception calling Tesseract with language option: Eng; RAN: /usr/bin/tesseract - - -l Eng STDOUT: STDERR: Error opening data file /usr/share/tesseract-ocr/5/tessdata/Eng.traineddata Please make sure the TESSDATA_PREFIX environment variable is set to your “tessdata” directory. Failed loading language ‘Eng’ Tesseract couldn’t load any languages! Could not initialize tesseract. /nThe requested OCR language “Eng” is not available and needs to be installed.
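An error like this can be narrowed down by checking whether the data file is missing entirely or only differs in letter case from the requested language. A minimal sketch, using the tessdata path from the error message as the example argument; `check_traineddata` is a hypothetical helper name:

```shell
#!/bin/sh
# check_traineddata DIR CODE: report whether DIR/CODE.traineddata exists
# exactly, exists only under a different letter case, or is missing.
check_traineddata() {
    dir=$1
    code=$2
    if [ -e "$dir/$code.traineddata" ]; then
        echo "exact"
        return 0
    fi
    # Look for a file whose name differs from the request only by case.
    if ls "$dir" 2>/dev/null | grep -qixF -- "$code.traineddata"; then
        echo "case-mismatch"
        return 1
    fi
    echo "missing"
    return 2
}

# Usage inside the container:
#   check_traineddata /usr/share/tesseract-ocr/5/tessdata Eng
```

A "case-mismatch" result suggests the language pack is installed and the language code being passed to Tesseract is the thing to investigate, while "missing" points to the package not being installed at all.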

Hello @germain, I had the same issue last week and the Mayan support team solved it for me. If you want quick help, I suggest you contact their support team via @Rose_Stevie. And please beware of scammers; don’t fall into their trap.

Oh boy! There is a lot to unravel here and people’s feelings are going to be hurt, but this is the kind of ignorant, entitled behavior and disinformation that is hurting open source projects everywhere.

I can’t count how many times the devs have stated (and it is basic common sense) that if you are going to report a “bug”, you should use a clean install and do so using the approved, well-tested methods. Don’t go setting up a Frankenstein install on a fake Linux environment on Windows and then blame it on Mayan.

If you get caught lying or omitting info, don’t double down. The community is pretty smart at catching people like you, and the devs are even smarter and can smell the same BS even sooner.

Read the documentation, the help texts, and the forum. Don’t go inventing stuff or changing the code. Use the tools provided to install new packages. Don’t go changing files and then complain that something broke; that’s your doing.

That’s the problem with the first post: not using Docker Compose, using Windows, installing files directly in the container. Then posting it as if it were a tutorial, when it is the worst possible advice and directly contradicts what the devs recommend.

The second post correctly advises the OP. The OP never acknowledges the correction. Another common issue with open source projects: new users just want help and, as soon as they get it, leave without ever helping others.

Drive-by support whining. If you do this, you are not a community member; you are just a parasite user installing open source to get free stuff or because a “techie Youtuber” told you it was a cool thing for your toaster “homelab”.

The third post does the same thing: gets creative and tries to fix the problem by inventing a solution that only creates more issues. If you delete the container, you lose the changes.

The same poster as #2 offers good advice. The previous poster never acknowledges the correction. Another drive-by whiner.

Last poster, oh boy… Another random drive-by “bug” reporter who does not do the basic due diligence and just likes to shit on open source projects. Clumps all the previous unrelated posts under the same “issue”. I read this user’s other posts, and it is clear from the “errors” he is reporting that he is doing something funky with his install.

I’m not a coder but can read enough Python to be dangerous.

The language that is passed to Tesseract is always lowercase because Mayan uses ISO language codes. That means Mayan is passing “eng” to Tesseract, not “Eng”.

That your Mayan copy is passing the wrong language to Tesseract (a value generated by Python code, not a hard-coded list typed in by humans) means you are either modding the code or purposely creating misinformation and sending the devs on wild goose chases.