OCR backend arguments to change tesseract defaults?

ammonite · July 7, 2024, 10:57pm

Would it be possible to have a tutorial or knowledge base article that goes into more detail about changing the default Tesseract behaviours?

I’m aware information has been posted regarding adding languages - however I wish to be able to try modifying the settings of tesseract to hopefully improve the OCR results on my documents.

I’m also aware that Tesseract has different settings for binarization and thresholding, for example, and that changing these may improve the results on my documents (scanned PDFs of old typewritten letters with ink bleed and non-white backgrounds), although there may well be other settings that I need to enable or change.

Am I right in thinking that these should be set via the OCR_BACKEND_ARGUMENTS setting and, if so, how can I interact with this feature? A tutorial or article about how Mayan interacts with Tesseract and how to change its settings would be very helpful, as I could test various results and then implement them for all documents.

If this is possible, is it also possible to apply these setting changes within a workflow or other process, so that documents of one type (“letter”) get one set of Tesseract settings and documents of another type (“invoice”) get a different set? I do not believe the various types of documents I wish to use Mayan for will all fit one generic ruleset.

many thanks

roberto.rosario · July 8, 2024, 12:38am

That is correct. The purpose of the setting OCR_BACKEND_ARGUMENTS is to fine tune Tesseract by passing it command line options.

By default it is passed:

{'OMP_THREAD_LIMIT': '1'}

This limits the number of threads. Other Tesseract options might be possible.

binarization, threshold

What are the Tesseract command line options to change these settings?
Do you have some example Tesseract settings to try out?

ammonite · July 8, 2024, 1:33am

Hi Roberto, thanks for the fast response.

I have been reading around online for the best advice. For example, the rough guide “Improving the quality of the output” (tessdoc) describes a number of potential ways to improve OCR output quality.

I opened a terminal in the app-1 Docker container and ran the command mentioned in that guide:

tesseract --print-parameters | grep thresholding_

I can see that Tesseract 5.3 has built-in options to switch thresholding_method between three methods, as well as some tunable parameters for those methods. Other thresholding options are also available via the scripts mentioned in the article.

The output is:

thresholding_method 0 Thresholding method: 0 = Otsu, 1 = LeptonicaOtsu, 2 = Sauvola
thresholding_debug 0 Debug the thresholding process
thresholding_window_size 0.33 Window size for measuring local statistics (to be multiplied by image DPI). This parameter is used by the Sauvola thresholding method
thresholding_kfactor 0.34 Factor for reducing threshold due to variance. This parameter is used by the Sauvola thresholding method. Normal range: 0.2-0.5
thresholding_tile_size 0.33 Desired tile size (to be multiplied by image DPI). This parameter is used by the LeptonicaOtsu thresholding method
thresholding_smooth_kernel_size 0 Size of convolution kernel applied to threshold array (to be multiplied by image DPI). Use 0 for no smoothing. This parameter is used by the LeptonicaOtsu thresholding method
thresholding_score_fraction 0.1 Fraction of the max Otsu score. This parameter is used by the LeptonicaOtsu thresholding method. For standard Otsu use 0.0, otherwise 0.1 is recommended
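As a small aid for comparing defaults, the listing above can be turned into a lookup table. The following is a minimal Python sketch, assuming each line has the shape "name value description" separated by whitespace, as in the output shown here. The function name parse_parameters is hypothetical, and parameters whose default value is empty would need extra handling.

```python
def parse_parameters(text):
    """Map each parameter name to a (default value, description) tuple.

    Assumes every line looks like "name value description", which matches
    the thresholding_ lines quoted above; this is not a general parser.
    """
    parameters = {}
    for line in text.splitlines():
        # Split into at most three fields: name, value, rest of the line.
        parts = line.split(None, 2)
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        description = parts[2] if len(parts) > 2 else ""
        parameters[parts[0]] = (parts[1], description)
    return parameters


sample = (
    "thresholding_method 0 Thresholding method: 0 = Otsu, 1 = LeptonicaOtsu, 2 = Sauvola\n"
    "thresholding_kfactor 0.34 Factor for reducing threshold due to variance.\n"
)
params = parse_parameters(sample)
print(params["thresholding_method"][0])
print(params["thresholding_kfactor"][0])
```

In practice the text could come from the output of tesseract --print-parameters captured to a file, rather than the inline sample used here.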

Looking further online, I found a GitHub discussion (RFC: allow flexible or better binarization · Issue #3083 · tesseract-ocr/tesseract · GitHub) suggesting that this is the correct command for changing the main thresholding method from the command line. I would assume the other parameters are simply added to this command:

tesseract in.png out -c thresholding_method=2
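For testing outside Mayan, several parameters can be combined by repeating the -c option, once per variable. The following Python sketch only builds the command line, under the assumption that tesseract is on the PATH when the command is eventually run. The helper name build_tesseract_command is hypothetical, and the kfactor value is just the default from the listing above, not a recommendation.

```python
import shlex


def build_tesseract_command(image_path, output_base, parameters):
    """Return the argument list for a tesseract run with -c overrides."""
    command = ["tesseract", image_path, output_base]
    for name, value in parameters.items():
        # Each config variable gets its own "-c name=value" pair.
        command.extend(["-c", f"{name}={value}"])
    return command


command = build_tesseract_command(
    "in.png",
    "out",
    {"thresholding_method": 2, "thresholding_kfactor": 0.34},
)
# Show the command as it would be typed in a shell.
print(shlex.join(command))
```

The resulting list can be handed to subprocess.run(command, check=True) on a machine that has Tesseract installed and a real test image in place of in.png.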

However, as I am running Tesseract from Mayan inside the Docker container, I have no “in.png” to test this with at the moment. For fine-tuning and testing I would just run this elsewhere, but I wanted to see how to interact with Tesseract through the Mayan backend interface, as I am unsure how, for example, to refer to the document or document type that Mayan is currently passing to Tesseract for OCR. I tried just adding the command, but that did not work, so I must have done it wrong. A guide to passing arguments to the OCR backend would be helpful.

I hope this makes sense

many thanks

ammonite · July 22, 2024, 11:39am

Hi Roberto, I have tried passing the command line argument 'thresholding_method': '2' in my OCR backend arguments, but I am still having issues. A follow-up would be very much appreciated.

Looking in the app1 container, I’m getting the error:

TypeError: mayan.apps.ocr.backends.tesseract.Tesseract() argument after ** must be a mapping, not str

So I’m still unsure how to pass options to Tesseract from Mayan; a guide would be most helpful.
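The wording of the TypeError above is Python's generic message for unpacking a non-dictionary with **. It suggests the value of the setting reached the backend as a plain string rather than as a mapping. The sketch below reproduces only that mechanism with a hypothetical stand-in function, fake_backend. It does not show which keys Mayan's Tesseract backend actually accepts, which remains an open question in this thread.

```python
def fake_backend(**kwargs):
    """Hypothetical stand-in for a backend that takes keyword arguments."""
    return kwargs


# A value that is plain text, not a dictionary (missing the outer braces).
arguments_as_text = "'thresholding_method': '2'"
try:
    fake_backend(**arguments_as_text)
except TypeError as error:
    message = str(error)
    print(message)  # ends with "must be a mapping, not str"

# The same content as a real mapping unpacks without error.
arguments_as_mapping = {"thresholding_method": "2"}
result = fake_backend(**arguments_as_mapping)
print(result)
```

By analogy with the default shown earlier in the thread, {'OMP_THREAD_LIMIT': '1'}, the setting value probably needs the outer braces so that it is read as a dictionary. Whether a key such as thresholding_method is then honoured by the backend is a separate matter that this sketch cannot confirm.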