Document processing process

Errol · January 31, 2024, 4:06pm

What exactly happens when I upload a document? I mean, in what order is it processed? I am asking because when I upload a couple of hundred PDF files (just technical articles), OCR processing in particular doesn’t start or isn’t carried out. I found that I needed to restart the Mayan container to initiate the OCR processing. Even after that, the process seems intermittent.
Does this make any sense?
What is the best practice to make sure the uploaded documents go through OCR?
I am running the Mayan container on Docker Desktop on Windows 10.

roberto.rosario · January 31, 2024, 4:28pm

What exactly happens when I upload a document? I mean, in what order is it processed?

To allow scaling, almost everything in Mayan is a background task. The exact order is not deterministic: each task listens for events from a previous one before launching the next step. The OCR task waits until the document version is fully created and populated with pages before launching. Then the OCR task launches an individual OCR task for each page image to allow parallel OCR.
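As a rough illustration of that fan-out pattern (not Mayan's actual code), the sketch below waits for a populated page list and then runs one OCR job per page in parallel. The names `ocr_page` and `ocr_document_version` are hypothetical, and a thread pool stands in for the real background task workers.

```python
from concurrent.futures import ThreadPoolExecutor


def ocr_page(page_image):
    # Hypothetical placeholder for a real OCR engine call on one page image.
    return f"text of {page_image}"


def ocr_document_version(page_images, max_workers=4):
    # The parent task only proceeds once the version has been populated with pages.
    if not page_images:
        raise ValueError("document version has no pages yet")
    # Fan out: one job per page, executed in parallel by the pool.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in page order even if jobs finish out of order.
        return list(pool.map(ocr_page, page_images))
```

With few workers and many pages, the per-page jobs simply queue up, which is why a large upload can look stalled even though it is still being processed.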

Article on the background task system:

Background tasks

Article on the distributed locking system that orchestrates the tasks’ access to shared objects:

The lock manager app

I am asking because when I upload a couple of hundred PDF files (just technical articles), OCR processing in particular doesn’t start or isn’t carried out. I found that I needed to restart the Mayan container to initiate the OCR processing. Even after that, the process seems intermittent.

By default, and to allow a simple installation of Mayan with just two commands, the deployment uses a single container for the entire stack. This is very similar to the Omnibus approach of GitLab, Sentry, and others. However, this does not scale well, at least during usage spikes. It is an ease-of-use vs. performance trade-off. The OCR process is very CPU and memory intensive. When using a single container, the OCR task is executed at a lower scheduling priority (a higher Linux nice value) to keep the OCR code from blocking the UI or starving the entire stack of memory.
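To make the nice-level idea concrete, here is a minimal sketch, assuming a Unix-like host, of starting a child process at a lower priority from Python. It is not how Mayan configures its workers; `run_with_lower_priority` and `child_niceness` are hypothetical helper names used only for illustration.

```python
import os
import subprocess
import sys


def run_with_lower_priority(command, increment=10):
    # Raise the child's nice value by `increment` just before it starts,
    # so the kernel schedules it behind normal-priority processes.
    def raise_niceness():
        os.nice(increment)

    return subprocess.run(
        command,
        preexec_fn=raise_niceness,  # Unix only
        capture_output=True,
        text=True,
        check=True,
    )


def child_niceness(increment=10):
    # Ask a child interpreter to report the nice value it actually runs with.
    result = run_with_lower_priority(
        [sys.executable, "-c", "import os; print(os.nice(0))"], increment
    )
    return int(result.stdout.strip())
```

A lower priority keeps the rest of the stack responsive, at the cost of OCR getting through slowly when the machine is busy.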

If your use case starts to reach this stage, Mayan includes a multi-container profile. Unlike in other open source projects, this Docker Compose profile is also available for free.

Please read the following article on how to switch your installation to a multi container deployment.

How to use the multi container Docker Compose profile

And the article on how to scale up different parts of the stack for your use case.

Scaling up a multi container installation

You can enable the RabbitMQ administrative portal to examine the creation and ingestion rate of the different queues and determine which workers/containers need to be scaled up.

How to enable RabbitMQ’s administrative portal
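Once the portal is enabled, comparing how fast messages enter and leave each queue can be sketched as follows. This assumes queue data shaped like the JSON that RabbitMQ's management API returns for its queue listing (`messages`, `message_stats.publish_details.rate`, `message_stats.ack_details.rate`). The queue names and numbers are made up for illustration, and `rank_queues_by_backlog_growth` is a hypothetical helper.

```python
def rank_queues_by_backlog_growth(queues):
    # For each queue, compare the incoming (publish) rate with the consumed (ack) rate.
    ranked = []
    for queue in queues:
        stats = queue.get("message_stats", {})
        publish_rate = stats.get("publish_details", {}).get("rate", 0.0)
        ack_rate = stats.get("ack_details", {}).get("rate", 0.0)
        growth = publish_rate - ack_rate  # messages per second piling up
        ranked.append((queue["name"], growth, queue.get("messages", 0)))
    # Queues whose backlog grows fastest come first.
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked


# Made-up sample data; real queue names depend on the installation.
sample_queues = [
    {"name": "uploads", "messages": 4,
     "message_stats": {"publish_details": {"rate": 3.0}, "ack_details": {"rate": 3.0}}},
    {"name": "ocr", "messages": 850,
     "message_stats": {"publish_details": {"rate": 12.0}, "ack_details": {"rate": 2.0}}},
]

for name, growth, backlog in rank_queues_by_backlog_growth(sample_queues):
    print(f"{name}: backlog={backlog}, growth={growth:+.1f} msg/s")
```

A queue with a sustained positive growth rate is the one whose workers are most likely to need scaling up.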