RabbitMQ growing too much in size - search queue

All other performance and functionality is great and I love Mayan. But having queues running for weeks at 100% CPU usage is a server-killer.

This is not caused by Mayan, it is caused by running multiple OCR engines to do parallel OCR processing. It can be disabled or fine-tuned to your specific use case and needs. You can lower parallel OCR tasks (or almost any task) to lower resource usage, but that will increase wait times for the data to be available. It is impossible to accomplish both. The configuration out of the box is a one-size-fits-all because it is impossible for Mayan to predict how you want to balance wait time vs. CPU usage.
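As a hedged sketch of what "lowering parallel OCR tasks" means in practice (this is not the exact Mayan command line; the app name and queue name here are illustrative and vary by deployment), it usually comes down to reducing the Celery worker's concurrency for the queue that carries OCR tasks:

```shell
# Illustrative only: run the worker that consumes the OCR queue with a
# single process, so OCR jobs execute one at a time instead of in parallel.
# Replace "mayan" and "ocr" with the app and queue names your deployment uses.
celery -A mayan worker --queues=ocr --concurrency=1
```

The trade-off is exactly the one described above: one OCR process keeps CPU usage flat, but documents wait longer in the queue before their text is available.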

More details are discussed here:

So it seems my Mayan server is very CPU hungry with regard to the search workers and the search message queue. As was pointed out in the post referenced above, one change can require a lot of search updates.

Any change to a document causes an exponential cascade of search index updates. This means that working with 10 documents could cause 1,000 search update tasks to be scheduled.

One project in early stages is adding a custom implementation of task deduplication. Mayan will check if a document is already in the search queue for an index update and, if so, avoid queueing it again. Deduplication is almost like caching in terms of the dangers of false positives and edge cases.
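The idea can be sketched in a few lines. This is a minimal illustration of the concept, not Mayan's implementation; the class and method names are hypothetical. Note the edge case mentioned above: the pending marker must be cleared before indexing starts, otherwise an edit made while indexing is in progress would be silently dropped (a false positive).

```python
class DeduplicatingQueue:
    """Hypothetical queue that skips documents already pending a search
    index update. Illustrative only, not Mayan's API."""

    def __init__(self):
        self._pending = set()   # document IDs with an update already queued
        self._tasks = []        # the actual task queue

    def enqueue(self, document_id):
        # If an update for this document is already pending, queueing
        # another one is redundant: the pending task will index the
        # document's latest state anyway.
        if document_id in self._pending:
            return False
        self._pending.add(document_id)
        self._tasks.append(document_id)
        return True

    def process_next(self):
        document_id = self._tasks.pop(0)
        # Clear the pending marker *before* indexing, so an edit made
        # during indexing re-queues the document instead of being lost.
        self._pending.discard(document_id)
        return document_id


queue = DeduplicatingQueue()
for doc_id in [1, 2, 1, 3, 2, 1]:   # repeated edits to the same documents
    queue.enqueue(doc_id)
print(len(queue._tasks))  # 3 unique tasks queued instead of 6
```

In a distributed setup the pending set would have to live in shared storage (e.g. Redis) with atomic check-and-set semantics, which is where most of the scarce expertise mentioned later comes in.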

Full text search was a common request, this is the downside.

Search engines do not operate like a database engine. They have no, or very little, concept of related objects. If a tag is modified, then all documents that have that tag attached also need to be updated in the search engine. If 10,000 documents have the tag, editing a single character in the tag will generate 10,000 document search engine update events.
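The fan-out is easy to see in code. The sketch below is purely illustrative (the function and event shape are made up, not Mayan internals); it just shows how one tag edit multiplies into one update event per tagged document:

```python
def tag_edit_events(tag_name, tagged_document_ids):
    """Hypothetical: return the per-document search-engine update events
    that a single tag edit triggers. Because the search engine stores
    denormalized documents, each one must be re-indexed individually."""
    return [
        {'action': 'index_update', 'document_id': doc_id, 'reason': f'tag:{tag_name}'}
        for doc_id in tagged_document_ids
    ]


# One single-character rename of a tag attached to 10,000 documents:
events = tag_edit_events('invoices', range(10_000))
print(len(events))  # 10,000 update events from one tiny edit
```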

Creating a system that is able to transparently translate database transactions into search engine index update events took more than a year and several versions to perfect. No other open source document management system (or Django project) has such an advanced search synchronization system (mayan/apps/dynamic_search/tasks.py · master · Mayan EDMS / Mayan EDMS · GitLab). On top of that, our system is able to work exactly the same regardless of the search engine being used (database, Whoosh, ElasticSearch, others). Flexibility and performance are usually at opposite ends of the spectrum, and it is up to the user to move the needle and adjust the knobs to their preference and needs.

The complete search syntax and functionality is found here: https://docs.mayan-edms.com/chapters/search.html#

ElasticSearch is very powerful, and with power come resource demands. ElasticSearch's resource requirements alone match, and in some cases exceed, the resource requirements of Mayan. Keep this in mind when enabling it and manage expectations. For lower resource usage use Whoosh, which still requires updates when changes are made. If you have no indexing requirements, use the database backend. The downside is that databases are not meant for searching, so basic searches will be very slow. Use advanced search for the database backend.
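Since Mayan settings are Django settings, switching backends is a configuration change. The sketch below is hedged: the setting name and backend dotted paths shown are what recent Mayan versions use to the best of my knowledge, but verify them against the settings chapter of the documentation for your version before relying on them.

```python
# Hypothetical local settings override; confirm the setting name and
# backend paths in your Mayan version's documentation.

# Lightweight file-based index (lower resource usage than ElasticSearch):
SEARCH_BACKEND = 'mayan.apps.dynamic_search.backends.whoosh.WhooshSearchBackend'

# Or, no separate index at all, at the cost of slow searches:
# SEARCH_BACKEND = 'mayan.apps.dynamic_search.backends.django.DjangoSearchBackend'
```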

When adding features, the first step is getting the new feature working correctly. Once it is debugged and tested over a few releases, the next step is to refactor based on lessons learned and to optimize, while ensuring it continues to work for all use cases. In every version we either add, improve, or optimize.

However, like almost everything in Mayan, our search system is so far ahead of anything else available that no existing solutions work, and we need to once again solve this with custom implementations. It will take time but it is being addressed. We already have a group working on our own custom background task implementation to add features like task deduplication. Knowledge and experience in asynchronous task deduplication on distributed systems is very scarce, and people willing to help the project with this are even more scarce.