Scaling up concurrency

Questions, comments, discussions. Over time certain topics might be moved to their own category.
Post Reply
rssfed23
Posts: 7
Joined: Mon Oct 14, 2019 1:18 pm

Scaling up concurrency

Post by rssfed23 » Wed Oct 16, 2019 2:07 pm

:D Hey guys.

I recently deployed a standalone instance with RabbitMQ as the backend.

I added 3,000 documents as a test and noticed it was taking absolutely ages to process the indexes (and initially the OCR as well).

The documentation says to remove or increase concurrency for the fast worker, but doesn't mention medium or slow.
I'm on a 32-core machine, so I increased fast to 10 and the other two to 15 or 20, but things are actually slower than when I have medium/slow at concurrency=1.
Why is this?
I'd have thought that if I increased the medium/slow workers to 10, they'd be able to run 10 different OCR recognitions at once and 10 different index updates at once. Is that not the case?
Judging by the number of outstanding messages in the queue (9,000-odd, and not decreasing fast at all), is it possible that what I've actually done is make the system index/OCR the same thing/message 10 times, rather than the concurrency I'm thinking of?
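(For what it's worth, with AMQP-style delivery each message goes to exactly one consumer, so extra concurrency shouldn't duplicate work. A minimal Python sketch — not Mayan code, just a simulation of several workers draining one shared queue — shows each message processed exactly once:)

```python
import queue
import threading

# Simulate an AMQP queue: each message is delivered to exactly one
# consumer, regardless of how many consumers (concurrency) are attached.
tasks = queue.Queue()
for doc_id in range(30):
    tasks.put(doc_id)

processed = []
results_lock = threading.Lock()

def worker():
    while True:
        try:
            doc_id = tasks.get_nowait()
        except queue.Empty:
            return
        with results_lock:
            processed.append(doc_id)

threads = [threading.Thread(target=worker) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Every document was handled exactly once, not ten times.
print(len(processed), len(set(processed)))  # 30 30
```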

When I put medium back to 1, I notice the messages are clearing a LOT faster than when I had 15 medium threads (see picture). The drop at the end is when I switched from 15 back to 1, and the number of pending messages awaiting ACK drops significantly as well.

Is there any documentation on this so I can understand things better? I'm going to be ingesting a few hundred thousand documents soon and want to make things as multi-threaded as possible. Perhaps I don't understand messaging, though :)

Any other tips to improve speed? This is on a fairly powerful dedicated box, if that helps.

(This is all running in LXC, so effectively the same as running on the bare-metal box.)

Many thanks!
2019-10-16 15_03_13-RabbitMQ-Overview - Grafana.png
Last edited by rssfed23 on Thu Oct 17, 2019 12:07 am, edited 1 time in total.

rssfed23
Posts: 7
Joined: Mon Oct 14, 2019 1:18 pm

Re: Scaling up concurrency

Post by rssfed23 » Wed Oct 16, 2019 10:56 pm

Okay, I *think* I know what's going on.

When scaling up the concurrency on medium during the indexing phase (which is where I was in the process when that screenshot was taken), it puts loads of entries like this in the Celery log:
Task mayan.apps.document_indexing.tasks.task_index_document[0da22810-8dfe-441f-9846-a334aa228f82] retry: Retry in 5s: LockError()

So I'm going to assume that we can't add concurrency to the medium or slow workers because the indexing tasks can't get a lock on the DB table or something(??), and it effectively ends up in a huge retry loop!
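(If that guess is right, the pattern would look something like this sketch — an assumption about the mechanism, not Mayan's actual code: every concurrent task tries a non-blocking acquire of one shared lock, and the losers turn into retries instead of useful work, so extra concurrency just inflates the retry count:)

```python
import threading

# Assumed mechanism: indexing serialises on a single global lock, and a
# failed non-blocking acquire becomes a "Retry in 5s: LockError()" entry.
index_lock = threading.Lock()
counter_lock = threading.Lock()
completed = 0
retries = 0

def task_index_document():
    global completed, retries
    while True:
        if index_lock.acquire(blocking=False):
            try:
                with counter_lock:
                    completed += 1
                return
            finally:
                index_lock.release()
        # Lost the race: in Celery this would be logged as a LockError retry.
        with counter_lock:
            retries += 1

threads = [threading.Thread(target=task_index_document) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(completed)  # all 10 tasks finish, but only ever one at a time
```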

My question to the awesome devs is: are there processes which can run concurrently (e.g. OCR)? If so, I can create separate medium and slow queues for concurrent-capable processes and for those that can't handle concurrency (as indexing seems not to). I think that's the best way to optimise performance? If you have a list of that kind of thing, let me know :)
I also ask because the e-book talks about the queues a bit, but says that if you remove the concurrency option it defaults to the number of CPUs for medium and slow, which would cause users to run into this problem, I think. I did notice it was able to add messages (from the fast worker, I assume) a lot quicker during the initial unzip and upload, but when processing them all, some tasks seem to dislike concurrency.

In the interim I think I can just try it out task by task and see what happens. I'd really love to make use of these 32 cores where I can :)

rssfed23
Posts: 7
Joined: Mon Oct 14, 2019 1:18 pm

Re: Scaling up concurrency

Post by rssfed23 » Wed Oct 16, 2019 11:12 pm

On a related note: after a bulk load, indexing messages one by one as they come in seems to take significantly longer than purging the queue (10,000 documents) and doing a manual index rebuild. I guess this makes sense (invoking it once for everything vs. document by document), but I wonder if there's a way (in the future; feature request :D?) to optimise this — maybe a tickbox when uploading a document, or a custom document type, that kicks off indexing at the end rather than per document.

rssfed23
Posts: 7
Joined: Mon Oct 14, 2019 1:18 pm

[SOLVED] Re: Scaling up concurrency

Post by rssfed23 » Wed Oct 16, 2019 11:44 pm

Okay, final note. Sorry about the spam, but the thought process may be useful for others:
I enabled info logging on all workers and, going queue by queue, found that it's only the indexer that has concurrency issues; everything else (and importantly OCR) can run concurrently.
So when I fire in 200,000 docs, the server grinds to a halt because of the indexer queue retries. I purge that queue and voilà — all the other workers I can see (during a document upload) are working as I'd expect them to.
I'm glad my idea of concurrency and messaging was largely correct :).
Not sure if it's a bug that indexing doesn't work that way; I imagine it's a DB issue.

So I'll separate indexing out into its own worker with only one concurrent thread, and problem solved. What took 28 hours yesterday just completed in under an hour.
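(For anyone copying this setup, the split looks roughly like this. This is a sketch using the standard Celery CLI — the `mayan` app name and the queue names are assumptions based on this thread, not the project's documented commands, so adjust them to your install:)

```shell
# Dedicated single-process worker for the lock-sensitive indexing queue
celery -A mayan worker --queues=indexing --concurrency=1 \
    --hostname=worker_indexing@%h

# Everything else (OCR etc.) gets the spare cores
celery -A mayan worker --queues=medium,slow --concurrency=10 \
    --hostname=worker_bulk@%h
```

With this split, a storm of indexing retries can no longer starve the OCR and upload tasks, which is what made the 200,000-document run usable.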

I'm not sure if any of the other queues/worker processes have similar concurrency restrictions, as this was only a quick test with a massive zip upload — so if you do know of any others, that would be very helpful :)

User avatar
rosarior
Posts: 406
Joined: Tue Aug 21, 2018 3:28 am

Re: Scaling up concurrency

Post by rosarior » Mon Oct 28, 2019 12:37 am

Thank you for documenting your process! I'm still going through the posts to understand all the details. Thanks!

Post Reply