Document Processing Times

So I migrated 1000+ PDFs into Mayan (v4.3) last week, moving away from the old file-and-folder storage. I noticed the server was cranking hard at 90% CPU. I figured this was normal since it's doing OCR and other things. What I did not expect was that it's still pegged at 70% after almost a week. The fact that Mayan can have a dedicated processing backend tells me this process can take some time, but I did not think it would take this long. Server specs are below. Is there any way I can peek under the hood and see if it's almost done, or stuck and broken? For reference, the PDFs are mostly multi-page color scans at 150-300 dpi, and each PDF averages 10-15 pages.

Server is a VM running in Proxmox 7
Built with Turnkey’s Mayan install, upgraded to 4.3
vCPU: 6 x (Xeon(R) CPU E3-1245 v5 @ 3.50GHz)
vRAM: 10GB (DDR4)
vDisk: 500GB (backed by NVMe SSD, PCIe 3.0)

All processes involved in initial document processing are very CPU heavy. Some of these are: MIME type inspection, extracting text content, extracting file metadata, extracting individual images from each PDF page, resizing each image, and then running OCR on each separate image, all while coordinating the caching of images and evicting images from the cache to avoid runaway storage usage. Mayan hides all of these complexities from the user, but their impact on resources cannot be avoided.

Narrowing down the cause of the CPU load takes time, and it could be one (or several) of multiple things. For example, your system can have enough CPU and storage speed, but if your image cache is too small for your type of workload, Mayan is going to spend most of its time balancing between apps wanting to cache things and evicting the cache. If an image is evicted and the OCR for that image kicks in, the image needs to be generated and cached again first, which will trigger another eviction to make space. If another process was using that image, a lock error is raised and the task is sent back to be retried.

Keep in mind that Mayan is not a monolithic program but a collection of apps providing services to each other, including its own synchronization and orchestration. When debugging Mayan, think of it more like an operating system or a group of containers in a Kubernetes cluster. One seemingly innocuous thing, like a small swap file, can bring the performance of an entire OS down. The same concept applies to Mayan.

On distributed systems, hardware specifications don't always have a direct correlation to performance. Parallelization and tuning of the system for your workload have the most effect. Without access to your document workload and the system, and the ability to modify, restart, and test it, it is impossible to provide a concrete solution.

General recommendations are:

  • Enable the RabbitMQ administration interface. Check which queues remain filled with tasks, or are receiving more tasks than they are able to consume (see the example commands after this list).
  • Use top or a similar tool to see which parts of the stack or which workers are using the most CPU.
  • Gather information by installing Sentry (or opening a Sentry.io account).
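
As a rough sketch (paths, the user name, and the vhost are assumptions; adjust them to your install), enabling the RabbitMQ management interface and inspecting queue depths from the shell looks something like this:

# enable the management plugin and create an admin user for the web UI (port 15672)
rabbitmq-plugins enable rabbitmq_management
rabbitmqctl add_user monitor 'choose-a-password'
rabbitmqctl set_user_tags monitor administrator
rabbitmqctl set_permissions -p / monitor ".*" ".*" ".*"

# quick look at queue depths and consumer counts without the web UI
# (replace "/" with whatever vhost your Mayan install uses)
rabbitmqctl list_queues -p / name messages consumers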

Mayan divides all the task queues among 4 basic workers based on the expected latency needs of the tasks. Tasks are grouped based on latency compatibility, from worker A, which consumes interactive tasks with the lowest latency needs such as image previews, to worker D, which consumes long lived tasks such as OCR.

Queue membership of each worker can be obtained with the command:

MAYAN_WORKER_NAME=worker_a ./manage.py platform_template worker_queues

This will return something like:

>> converter,sources_fast

These memberships are defaults that work in most situations but need to be tweaked for each workload. You can spin up more worker A instances if you require fast interactive images for high user concurrency or for documents with high page counts, or you can spin up more worker D instances when uploading a bulk of new documents.

You can also spin up workers that consume a single queue for maximum task processing efficiency.
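
For illustration only (the exact celery invocation, settings module, paths, and queue names vary by version and install type, so treat these values as assumptions and compare them against your own supervisor or Compose configuration), an extra worker bound to a single queue could be started along these lines:

# find out which queues worker D consumes on this install
MAYAN_WORKER_NAME=worker_d ./manage.py platform_template worker_queues

# start an additional celery worker consuming only one queue (assumed here to be "ocr")
MAYAN_SETTINGS_MODULE=mayan.settings.production \
    /opt/mayan-edms/bin/celery -A mayan worker \
    -Q ocr --concurrency=2 --max-tasks-per-child=10 -l INFO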


@roberto.rosario fantastic post! This is exactly what I needed to know and then some info I needed to know that I did not know I needed lol! But really, this is great stuff. Another reason I picked Mayan over the other open source solutions. Thanks again sir! As I start getting better with Mayan, I will keep trying to help out here on the forums and give back in the spirit of the open source community. Cheers!


Wanted to post my findings to share with the community.

Friendly reminder: before you start mucking around with configuration, make a backup of the file(s) first!

Armed with the knowledge that @roberto.rosario posted, I started with the RabbitMQ admin. By default the admin panel is turned off, so I had to enable it and create a user to access it. My google-fu found this page describing how to do it.

Once I enabled that and logged in, I was able to see the queues and what was causing the high (50%) CPU rate. It was 90%, but now I see all the OCR is done, and it looks like (I assume) it's indexing the OCR text, since the queue is called “search”. Correct me if I'm wrong here.

I did try to add more “B workers” by adjusting my supervisor config located at /etc/supervisor/conf.d/mayan.conf. Granted, I am using a “manual install” that as of now is not supported, so I would guess that on Docker the config is done with environment variables. I set mine to 15; in my case that was the point where adding more did not make a difference. Also note that the concurrency setting's default value is the number of CPUs on the system.

While direct documentation of these settings can be hard to find, I searched the official documentation for MAYAN_WORKER_B_CONCURRENCY and found it here on the docs site, with links to other documentation specifically for Celery here. I did find that if you want to specify more workers (concurrency), you have to add --concurrency=xx to the variable, as it's passed as a command line option. You can see part of my supervisor config below. Another note, which probably only matters for manual installs: I had to reload supervisor, not just stop/start it, so it re-reads the config file. I used htop to monitor the processes; see the screenshot below as well.

I am not sure what the best ratio of workers to tasks would be. I assume that each worker (concurrency) can have 100 sub-tasks? Just guessing here. I would also assume these values depend on the type of work being done.

# my config file /etc/supervisor/conf.d/mayan.conf on a manual install
...
[supervisord]
environment=
    PYTHONPATH="/opt/mayan-edms/media/user_settings",
    MAYAN_ALLOWED_HOSTS='["*"]',
    MAYAN_MEDIA_ROOT="/opt/mayan-edms/media",
    MAYAN_PYTHON_BIN_DIR=/opt/mayan-edms/bin/,
    MAYAN_GUNICORN_BIN=/opt/mayan-edms/bin/gunicorn,
    MAYAN_GUNICORN_LIMIT_REQUEST_LINE=4094,
    MAYAN_GUNICORN_MAX_REQUESTS=500,
    MAYAN_GUNICORN_REQUESTS_JITTER=50,
    MAYAN_GUNICORN_TEMPORARY_DIRECTORY="",
    MAYAN_GUNICORN_TIMEOUT=120,
    MAYAN_GUNICORN_WORKER_CLASS=sync,
    MAYAN_GUNICORN_WORKERS=3,
    MAYAN_SETTINGS_MODULE=mayan.settings.production,
    MAYAN_WORKER_A_CONCURRENCY="",
    MAYAN_WORKER_A_MAX_MEMORY_PER_CHILD="--max-memory-per-child=300000",
    MAYAN_WORKER_A_MAX_TASKS_PER_CHILD="--max-tasks-per-child=100",
    MAYAN_WORKER_B_CONCURRENCY="--concurrency=15",
    MAYAN_WORKER_B_MAX_MEMORY_PER_CHILD="--max-memory-per-child=300000",
    MAYAN_WORKER_B_MAX_TASKS_PER_CHILD="--max-tasks-per-child=100",
    MAYAN_WORKER_C_CONCURRENCY="",
    MAYAN_WORKER_C_MAX_MEMORY_PER_CHILD="--max-memory-per-child=300000",
    MAYAN_WORKER_C_MAX_TASKS_PER_CHILD="--max-tasks-per-child=100",
    MAYAN_WORKER_D_CONCURRENCY="--concurrency=1",
    MAYAN_WORKER_D_MAX_MEMORY_PER_CHILD="--max-memory-per-child=300000",
    MAYAN_WORKER_D_MAX_TASKS_PER_CHILD="--max-tasks-per-child=5",
    _LAST_LINE=""
...
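
To apply a change like this on a manual install, reloading supervisor so it re-reads the config should look something like the following (assuming a systemd-based system; the service name may differ on your distro):

# pick up config changes and restart only the programs that changed
supervisorctl reread
supervisorctl update

# or restart the whole supervisor service
systemctl restart supervisor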

With this knowledge, I figure it will take another 24-48 hours before it's all done.

Thanks again @roberto.rosario for the explanation, and if I misspoke above in any of my assumptions, let me know.

Hello @DocCyblade,

I’m facing similar issues. All details here in a gitlab issue recently reopened:

Just sharing my experience; maybe by joining our efforts we can help find where the problem is.

@Danynad - I ended up moving Mayan to a server with a crazy amount of resources (48 CPUs and 64GB of RAM) and it still took 24 hours. It seems that whenever a document is added, once the OCR is done and the file data is scanned, it is submitted to the search queue. I don't know what that worker is doing that takes so long, but that was my issue.

I just submitted a 9-page cell phone statement (1MB PDF) and RabbitMQ is showing 45,000 messages.


I started it at 1pm and it was done by 4:30pm, so about 3.5 hours to process a single 9-page PDF. Just for those browsing. That's no issue for me since we already did the mass import; now it's 1-2 document uploads a day, if that.

@roberto.rosario - What is Mayan doing in that “search” queue that's so CPU intensive?

Resources and performance

If the installation is not parallelized and properly tuned, adding more resources won't have much impact on real-life performance. This is why Docker Compose became the default installation method: it allows tuning several aspects of the stack better than a direct install. A direct install inside a VM can't make the most of the available resources. The single resource domain causes one worker to take resources from the other workers. Performance isolation happens only at the process level of a single OS host. For example, the OCR tasks will slow down the entire VM. There is also no swap isolation.
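
As a rough illustration of what the Compose-based install makes possible (worker service names vary between Mayan releases, so the name below is a placeholder; check the docker-compose.yml of your version for the real ones), a single worker can be scaled independently of the rest of the stack:

# list the services defined in the Compose file to find the worker service names
docker compose ps --services

# run three replicas of the (hypothetically named) worker_d service
docker compose up -d --scale worker_d=3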

Search engines

One of the tasks worker B takes care of is indexing for the search engine. Search engines are fast when retrieving data but slow to update, and they use a lot of storage; that's the trade-off. Some search engines/libraries don't support partial updates, so as a compatibility measure the entire object is reindexed when a change occurs, even for a single field. We've looked into deduplication of tasks, and that's one of the next improvements we plan to explore for the search system.

Another challenge is that search engines are not the same as a database manager; they tend to be flat, and support for object referencing, if any, is custom. This means that if a tag is attached to a document, Mayan needs to refresh the search index of the document as well as the index of the tag, completely, as a single operation. If a tag that is attached to a document is renamed, the same thing happens. Now imagine renaming a single tag attached to 100 documents: that's at least 101 search index refreshes. The same problem applies to every single object in a Mayan installation. One modification to a single field can result in several hundred search refresh jobs, which is what you are seeing. This is the challenge when working with search engines, and the reason we held off on this feature for as long as possible, using the database as the search source instead. Search engines are fast, but many changes need to be made to integrate them into a single seamless experience.

By default Mayan uses the Whoosh library as its search engine. Whoosh is easy to set up and get started with, but for better performance you can consider ElasticSearch. ElasticSearch runs as a separate service, isolated from Mayan. However, the memory, CPU, and maintenance requirements of ElasticSearch rival those of Mayan itself. That's the trade-off: simple/easy/low resources vs. fast/complicated/high resources.
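
Only as a sketch (the backend dotted path below is from memory and may not match your version, so verify it and the argument format against the search chapter of the documentation before using it), switching backends is done through settings and followed by a full reindex:

# environment variables for a manual install (Compose installs use the same setting names)
MAYAN_SEARCH_BACKEND="mayan.apps.dynamic_search.backends.elasticsearch.ElasticSearchBackend"
# MAYAN_SEARCH_BACKEND_ARGUMENTS=...   # connection details for your ElasticSearch host; see the docs

# rebuild the search index from scratch afterwards
./manage.py search_reindex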

Performance tuning

It can take several days to fine-tune a single installation to match your document patterns, usage patterns, and hardware specs. There is no single set of suggestions that will yield the same results for everybody. Even if we doubled or tripled the current team, it would still be impossible to take time away from project tasks (this reply alone has taken about 30 minutes) to help fine-tune installations. That is why support and fine-tuning of installations is a paid service. It is the only way to be able to devote the time and resources (even external consultants for specialized topics) to improve an installation so it works at its best. Besides the time and effort, technical support opens up a can of legal worms. There are legal and liability implications which require putting an agreement on paper.

Possibilities

Taking the time and effort to fine-tune Mayan can yield incredible results. As an example, at any given time I have 10 or more Mayan installations running as development deployments: Docker Compose in VMs and Kubernetes deployments. Some of these installs have millions of pages and get stress tested around the clock. All of this is hosted, along with my personal VMs, on a single R620 from 2012. This server uses low-voltage E5-2648L v2 CPUs and remote NFS disks, both of which are slow. Total server CPU load hovers at 20%. Power usage is in the 200-300 watt range. This setup has taken months and is still a work in progress.


@roberto.rosario thank you for taking the time to explain this. I figured as much about why you're moving to a containerized approach, and it makes sense. I like Mayan so much I am venturing out into the Docker space. Like any technology, it has its own vernacular and learning curve, and it will take a lot of time to get comfortable with Docker. I spun up a VM with Docker to dip my feet in the water, and I have a lot to learn. Any good reading you recommend for those of us diving into containerization/Docker land?


Docker (or more precisely, containers) is another isolation technology like VMs. But unlike VMs, where the entire machine is emulated, in containers only the operating system is “emulated”. Containers can be thought of as similar to FreeBSD jails: a virtual filesystem running normal processes in the same context as the host OS, but with those processes treated as a special group. They are isolated from the host OS in terms of resources, networking, and storage.

The storage is also isolated and divided into two types: “frozen” and persistent. The frozen part is all the files the container image includes when you download the image. The persistent part is the volume you define when the container image is actually deployed and run as a container. This is where all your data is stored, even after you stop and delete the container.
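
A quick sketch of that split (the Mayan image tag below is only an example; any image behaves the same way):

# the "frozen" part: the read-only layers that come with the image (pull it first)
docker image inspect mayanedms/mayanedms:v4.3 --format '{{.RootFS.Layers}}'

# the persistent part: a named volume that outlives any container using it
docker volume create demo_data
docker run --rm -v demo_data:/data alpine sh -c 'echo hello > /data/test.txt'
docker run --rm -v demo_data:/data alpine cat /data/test.txt   # prints "hello"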

While not technically correct, the elevator pitch for containers would be something like: containers can be thought of as lightweight virtualization where only the OS, storage, and network are virtualized.

I've written a bit and given talks about containers, mostly from a Python/Django/Mayan point of view.


Thanks @DocCyblade for your feedback. Unfortunately I can't afford to give Mayan such an amount of resources right now; it seems really overkill tbh.

And thanks a lot @roberto.rosario for all your explanations, it's very appreciated. I've seen the talk too. It would have been interesting to see it live; I'm from Italy, but not from Pisa. :grin:

I did try ElasticSearch as the search backend before, but didn't notice much difference in search indexing time, and the problems were the same if I recall correctly.
Should I assume that with ElasticSearch instead of Whoosh, the initial search indexing when adding documents would be faster? Heavier in resources is fine as long as it's (way) faster; I'd go for it.

@Danynad - As @roberto.rosario said before, fine tuning is an art form and can take a lot of time depending on the setup. My use case for Mayan is just my family's needs, and I opened this thread because I wanted to make sure it was working correctly, and it was.

I would recommend opening a new topic and being specific about your goals/needs and your use case. Maybe someone else is looking at doing the same thing, or has already crossed that bridge and knows how to do what you're wanting to do. That way it does not get buried in an old thread. Or, if your budget allows, get some paid one-on-one time to get where you need to be.