RabbitMQ growing too much in size - search queue

talbottech · March 31, 2023, 1:24pm

I’m having a problem with the RabbitMQ volume growing too much and filling the disk.
I opened the web console to look inside RabbitMQ and found that the search queue is out of control.

After we send several documents via the API (be it thousands or even just a few hundred), the documents seem to be processed normally and we see them indexed.

But the search queue shows it receives over 1k or 2k messages per second, and it can’t process that many. So it grows until the disk is full.

What could be the problem? What could be feeding that queue so much volume? We are sending a couple of GB of files and it is generating hundreds of GB in the message queue.

And we are seeing this behavior in different installations. One is an upgrade from v3, another is fresh v4. Both are docker-compose.

talbottech · March 31, 2023, 10:37pm

Hi,

Follow-up.

I ran some tests uploading single documents with the web form.

Even a single document created a peak of over 300k messages that took around 40 minutes to clear.
First I tried disabling OCR and text extraction, but nothing changed.
I tried different types of documents and found that the ones with smaller indexes take a lot less.
Then I tried one type that has bigger indexes than my first tries, and the search queue got up to 1.5 million messages with a single new document.

What I find odd is that uploaded documents are actually indexed quite fast in 4.X (this comes from a 3.X install that was painfully slow to index new documents on big indexes). If the index build/update is already done, what is generating so many messages/tasks?

I captured some messages: mayan.apps.dynamic_search.tasks.task_index_instance

We have tried reset/rebuild and in 4.X it is really fast. What could take us days or weeks in 3.X takes hours in 4.X.

While I understand that a new document generates many messages/tasks, as each metadata value is processed independently, I still find hundreds of thousands or even millions of messages/tasks for a single new document to be too much. But maybe I’m wrong. Is there an equation (docs/meta/nodes/instance) to approximate how many messages are to be expected?

Thanks

Danynad · April 3, 2023, 8:43am

Hi,
I’ve been facing exactly the same problem as yours lately. Actually it’s been quite some time, since I was testing early 4.x releases.
But so far I can’t tell what causes this huge amount of messages.
I’ve discussed a related issue here:

All other performances and functionalities are great and I love Mayan. But having queues taking weeks at 100% CPU usage is a server-kill.

talbottech · April 11, 2023, 1:57am

It is indeed very frustrating.

I keep testing and still haven’t even found a workaround to make things at least tolerable.

I tried changing the search backend to ElasticSearch; it was worse, as ElasticSearch managed to choke my CPUs and the queue consumer was 2 to 4 times slower than with the Django backend.
I tried disabling everything that could require processing tasks: disabled all index triggers, disabled OCR, disabled text extraction, disabled file metadata, disabled indexes…
Nothing helped, nothing at all; a single new document (a test page in PDF) generates two massive spikes of search tasks.
Since disabling everything I could did not change the number of messages generated, I think the number has less to do with big indexes/nodes and more to do with the actual number of documents of the same type.
To be as precise as I can, I would say around 50 or maybe even 60 messages per existing document of the same type.
I tried adding a new custom worker just for the search queue. Having two consumers does help, but it is a small benefit that maxes out my CPU and leaves no room for OCR. Before moving to Docker and 4.X I used to give OCR multiple workers to speed things up. Now 8 vCPUs are forced to the max without even doing OCR, just because the search queue is out of control.
I started looking at the code, but while I do find the tasks in the dynamic search app, I cannot yet find what could trigger them.

So far, purging the search queue is the only thing to do for the time being to avoid filling the disk. I guess I will have to look at the RabbitMQ API to do it automatically every day in off hours, and cross my fingers that I don’t lose useful tasks in the middle of the massive chunks.
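
A scheduled purge along those lines could be sketched as follows. This is only a minimal sketch, assuming the RabbitMQ management plugin is enabled and reachable on its default port 15672. The host, virtual host, queue name (`search`) and credentials are placeholders, not values taken from a real Mayan installation, and the function names are hypothetical. The endpoint used is the management API's `DELETE /api/queues/{vhost}/{name}/contents`, which drops every message in the queue, useful tasks included.

```python
import base64
import urllib.parse
import urllib.request


def build_purge_request(host, vhost, queue, user, password, port=15672):
    """Build (but do not send) a request that purges one RabbitMQ queue."""
    # Path segments must be percent-encoded; the default vhost "/" becomes %2F.
    path = '/api/queues/{}/{}/contents'.format(
        urllib.parse.quote(vhost, safe=''),
        urllib.parse.quote(queue, safe=''),
    )
    url = 'http://{}:{}{}'.format(host, port, path)
    token = base64.b64encode('{}:{}'.format(user, password).encode()).decode()
    request = urllib.request.Request(url, method='DELETE')
    request.add_header('Authorization', 'Basic ' + token)
    return request


def purge_queue(request, timeout=30):
    """Send the purge request and return the HTTP status code."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


if __name__ == '__main__':
    # Placeholder values (assumptions); adjust to the actual broker settings.
    req = build_purge_request('localhost', '/', 'search', 'guest', 'guest')
    print(req.get_method(), req.full_url)
    # purge_queue(req)  # uncomment to actually purge the queue
```

Run from cron in off hours, something like this would keep the disk from filling, at the cost of discarding whatever legitimate search updates are queued at that moment.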

roberto.rosario · April 11, 2023, 9:06pm

All other performances and functionalities are great and I love Mayan. But having queues taking weeks at 100% CPU usage is a server-kill.

This is not caused by Mayan, it is caused by running multiple OCR engines to do parallel OCR processing. It can be disabled or fine tuned to your specific use case and needs. You can lower parallel OCR tasks (or almost any tasks) to lower resource usage, but that will increase wait times to have the data available. It is impossible to accomplish both. The configuration out of the box is a one-size-fits-all because it is impossible for Mayan to predict how you want to balance out wait time vs. CPU usage.

More details are discussed here:

So it seems my Mayan server is very CPU hungry with regard to the search workers and search message queue. As was pointed out in the above referenced post, one change can require a lot of search updates.

Any change to a document causes an exponential cascade of search index updates. This means that working with 10 documents could cause 1000 search update tasks to be scheduled.

One project in its early stages is adding a custom implementation of task deduplication. Mayan will check if a document is already in the search queue for an index update and, if so, avoid queueing it again. Deduplication is almost like caching in terms of the dangers of false positives and edge cases.
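
The deduplication idea described above can be illustrated with a small in-memory sketch. This is not Mayan's implementation; the class and method names are hypothetical, and a real version for distributed workers would need shared state and careful handling of exactly the edge cases mentioned.

```python
from collections import deque


class DedupQueue:
    """FIFO queue that skips objects already waiting for a search update."""

    def __init__(self):
        self._pending = set()   # keys currently waiting in the queue
        self._items = deque()   # keys in arrival order

    def enqueue(self, model_name, object_id):
        """Queue an update; return False if the same object is already queued."""
        key = (model_name, object_id)
        if key in self._pending:
            return False
        self._pending.add(key)
        self._items.append(key)
        return True

    def dequeue(self):
        """Remove and return the oldest key."""
        key = self._items.popleft()
        # Release the key before the update runs, so a change that arrives
        # while the update is in progress is queued again instead of lost.
        self._pending.discard(key)
        return key

    def __len__(self):
        return len(self._items)


queue = DedupQueue()
queue.enqueue('documentfilepage', 42)
queue.enqueue('documentfilepage', 42)  # duplicate, skipped
print(len(queue))
```

Releasing the key at dequeue time rather than after the update finishes is one way to avoid the false positive where a later change is silently dropped because an older update for the same object is still running.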

Full text search was a common request, this is the downside.

Search engines do not operate like a database engine. They have little or no concept of related objects. If a tag is modified, then all documents that have that tag attached also need to be updated in the search engine. If 10,000 documents have the tag, editing a single character in the tag will generate 10,000 document search engine update events.
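
The fan-out in the tag example can be pictured with a toy sketch. The function name and the data structure are hypothetical and only illustrate the one-event-per-related-document behavior described above, not how Mayan tracks relationships internally.

```python
def search_updates_for_tag_edit(tag_id, documents_by_tag):
    """Return one (model_name, object_id) update event per tagged document."""
    return [
        ('document', document_id)
        for document_id in documents_by_tag.get(tag_id, ())
    ]


# Hypothetical data: one tag attached to 10,000 documents.
documents_by_tag = {'invoice': range(10_000)}
events = search_updates_for_tag_edit('invoice', documents_by_tag)
print(len(events))
```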

Creating a system that was able to transparently translate between database transactions and search engine index update events took more than a year and several versions to perfect. No other open source document management system (or Django project) has such an advanced search synchronization system (mayan/apps/dynamic_search/tasks.py · master · Mayan EDMS / Mayan EDMS · GitLab). On top of that, our system is able to work exactly the same regardless of the search engine being used (database, Whoosh, ElasticSearch, others). Flexibility and performance are usually at opposite ends of the spectrum, and it is up to the user to move the needle and adjust the knobs to their preferences and needs.

The complete search syntax and functionality is found here: https://docs.mayan-edms.com/chapters/search.html#

ElasticSearch is very powerful, and with that power come resource demands. ElasticSearch resource requirements alone match and in some cases exceed the resource requirements of Mayan. Keep this in mind when enabling it and manage expectations. For lower resource usage use Whoosh, which still requires updates when changes are made. For no indexing requirements use the database backends. The downside is that databases are not meant for searching and they will be very slow for basic searches. Use advanced search for the database backend.

When adding features, the first step is getting a new feature working correctly. Once it is debugged and tested over a few releases, the next step is to refactor based on lessons learned and optimize while ensuring it continues to work for all use cases. In every version we either add, improve, or optimize.

Version 4.0 added the abstracted search system and search scopes: Version 4.0 — Mayan EDMS 4.5.8 documentation
Version 4.1 improved reindexing and abstraction: Version 4.1 — Mayan EDMS 4.5.8 documentation
Version 4.2 included feature complete Whoosh support, command line command to manage the search indexes, bulk indexing support, initial support for ElasticSearch, and a separate task queue for search indexing (Version 4.2 — Mayan EDMS 4.5.8 documentation).
Version 4.3 expanded the search system to work to filter user interface and API list of objects, improved sanitation of text for search indexing, first round of optimizations. (Version 4.3 — Mayan EDMS 4.5.8 documentation)
Version 4.4 standardized the search syntax across all backends, added new search types, search operands, virtual search fields, and made search a first citizen feature by including it in the main menu bar (Version 4.4 — Mayan EDMS 4.5.8 documentation).

However, like almost everything in Mayan, our search system is so far ahead of anything else available, that no existing solutions work and we need to once again solve this with custom implementations. It will take time but it is getting addressed. We already have a group working on our own custom background task implementation to add features like task deduplication. Knowledge and experience on asynchronous task deduplication on distributed systems is very scarce and people willing to help the project on this are even more scarce.

talbottech · April 11, 2023, 10:06pm

Hi Roberto,

I get that if I update something that affects 10k documents, then 10k tasks could be generated.
What I don’t get is how adding a SINGLE document, which happens to be of the same type as another 10k documents, generates 600k tasks.

Don’t get me wrong, I see the improvements in Mayan with 4.X and we love some of them. We actually want to move a Mayan v3 install to v4 because we currently have problems with big indexes. That piece actually looks solved pretty well in 4.x; but this amount of tasks in the search queue… I think there must be something wrong to generate that many.
We are also experiencing the same thing on a brand new 4.x implementation. At first we did not notice, as it had few documents. But after surpassing 1500 total documents, adding even 100 documents per day is a problem for the search queue.

Both instances are Docker Compose, and my testing shows that neither OCR, document metadata, indexes/nodes, nor file metadata has any significant impact on the number of messages generated in the search queue; only the number of documents of the same type does.

talbottech · April 18, 2023, 6:29pm

@roberto.rosario

More info after testing in three environments and capturing queue messages: two with the database search backend, one with Whoosh.

The queue messages always correspond to this task: mayan.apps.dynamic_search.tasks.task_index_instance

In particular: it is ‘model_name’: ‘documentversionpage’ and ‘model_name’: ‘documentfilepage’
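
A tally like that can be reproduced from captured messages with a few lines of Python. This is a minimal sketch that assumes each captured message has already been decoded into a dictionary of task keyword arguments containing a `model_name` key. The sample data and the `object_id` key are made up for illustration.

```python
from collections import Counter


def tally_by_model(messages):
    """Count captured task messages per 'model_name' value."""
    return Counter(
        message.get('model_name', 'unknown') for message in messages
    )


# Hypothetical captured messages (assumed shape, not real queue output).
captured = [
    {'model_name': 'documentversionpage', 'object_id': 101},
    {'model_name': 'documentfilepage', 'object_id': 101},
    {'model_name': 'documentversionpage', 'object_id': 102},
]
for model_name, count in tally_by_model(captured).most_common():
    print(model_name, count)
```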

So it looks like each time a new document is uploaded, 2 tasks are created for every existing page of that document type.

That number correlates with the number of tasks accumulated, and there being two models would explain the two spikes I see.
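
As a rough back-of-the-envelope check, the two-tasks-per-existing-page observation can be written down as arithmetic. This is only an estimate inferred from the numbers reported in this thread, not a formula from the Mayan code base, and the function names are hypothetical.

```python
def estimate_search_tasks(existing_pages, tasks_per_page=2):
    """Tasks queued by one upload, if every existing page is reindexed."""
    return existing_pages * tasks_per_page


def implied_pages_per_document(tasks_per_document, tasks_per_page=2):
    """Average pages per document implied by an observed task count."""
    return tasks_per_document / tasks_per_page


# Around 60 tasks per existing document would imply about 30 pages each.
print(implied_pages_per_document(60))
# 10,000 such documents would then account for the 600k tasks seen.
print(estimate_search_tasks(10_000 * 30))
```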

In any case, I think adding a new document while not touching the others should not trigger a search reindex of each page of that particular document type. It is like reindexing (the search piece, not the actual indexes) the whole document type with each new document, which is absolutely insane if we have not modified anything.

I have checked the ID numbers of the pages and they are indeed for old pages of old documents.

talbottech · May 12, 2023, 5:15pm

@roberto.rosario any possibility of this issue being addressed?

I think I have provided more than enough information to prove that the problem is well beyond the scope of what you have responded to. But if you think anything else is missing from my tests and report, please let me know.

vintager · May 25, 2023, 9:35am

Thank you for this thread, @talbottech! I’ve got exactly the same problem, down to the number of documents (1.5k with ~17k pages). Just to be clear, for us, 1.5k is simply a testing run; the real number of documents in production is expected to be orders of magnitude higher.

I’ve run into this problem on an updated installation (4.3 → 4.4), and just to be on the safe side, 3 days ago I did a clean 4.4.6 Docker install (no cabinets, no indices, no tags, no workflows, one default document type), and the search queue is still sitting at 600.000+ messages.

I’ve tried increasing the search chunk size, I’ve tried “spinning up” (starting, really) additional B and A workers just for the search queue. The servers we’re testing on both have a 100GB SSD; one is 6 cores, 12 GB RAM, the other is 12 cores, 24 GB RAM. In both cases, Mayan’s shitting into the search queue like there’s no tomorrow.

Adding resources doesn’t seem to help–Mayan simply starts spamming the queue faster.

To make matters worse, this seems to be an old problem:

(notice how the thread starts with the high CPU load, but then moves to “just like in the above thread I have the search queue eating up all my resources endless”)

Some other guy having pretty much the same problem:

@roberto.rosario, you’re saying “it is getting addressed,” but the reality is, current version of the system is completely unusable. Well, maybe it is–for a hundred documents or so. Are you doing any QA? How could this slip through?

Honestly, I’m completely flabbergasted and after months of working with Mayan will probably be recommending to my organization that we try something else.