150k pages taking 2+ minutes to search

Hybred
Posts: 8
Joined: Fri Jan 03, 2020 3:12 pm

150k pages taking 2+ minutes to search

Post by Hybred »

Hello --

I recently OCR'ed about 150,000 documents into Mayan. Searches are now taking over 2 minutes and ultimately time out (due to the default Docker timeout setting).

I'm going to work towards increasing the number of Gunicorn workers to 4+1 to match one of the host's sockets, but are there any other major optimizations I can make to increase search speed? For the most part no new data will be added, so query speed is what will ultimately have the largest impact.

Currently I'm considering recreating the container with increased values for the following settings (a rough sketch of what I have in mind follows the list):

MAYAN_GUNICORN_WORKERS
MAYAN_WORKER_FAST_CONCURRENCY
MAYAN_GUNICORN_TIMEOUT
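
Something like this is what I'm picturing for the docker run command (the values are guesses on my part, and the image name, port and volume are just what the Mayan docker guide uses - the database options and anything else the existing container was started with would be carried over too):

docker run -d --name mayan-edms \
  -e MAYAN_GUNICORN_WORKERS=5 \
  -e MAYAN_GUNICORN_TIMEOUT=600 \
  -e MAYAN_WORKER_FAST_CONCURRENCY=2 \
  -v mayan_data:/var/lib/mayan \
  -p 80:8000 \
  mayanedms/mayanedms:latest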

Are there any other settings I should be aware of to improve search speeds?

Thank you!

rssfed23
Moderator
Posts: 185
Joined: Mon Oct 14, 2019 1:18 pm
Location: United Kingdom

Re: 150k pages taking 2+ minutes to search

Post by rssfed23 »

When it comes to search, I'd have to check the code but I don't think increasing the FAST concurrency will have any effect.
Searches aren't done as a background task; they're run as a direct query against your database from the gunicorn process.
I don't think increasing gunicorn workers will help either, as I *think* the query is run once by a single worker, but don't quote me on that.
Gunicorn timeout *might* help. It defaults to 120 seconds, and since you mention 2 minutes in your title I suspect the timeout is being hit and Mayan is then displaying whatever results were returned so far rather than everything in the database.
Are you getting any timeout errors in the log?

What database backend are you using? Hopefully postgres as that will give the best performance.

This really boils down to a database performance issue more than anything else. As you start scaling up the database can quickly become a bottleneck on a single node as you'd expect. If there's a lot of OCR text then there's a lot of data to search through.

Your focus should be on increasing the performance of postgres. That usually means running it on a dedicated host rather than inside a container (dockerised postgres/DBs will generally be slower than a bare-metal approach, depending on how the data is being accessed).

Could you answer for me the following please:
Is the postgres DB running on shared/remote storage of some kind (like NFS)?
Are you using a docker volume, a bind mount (-v /docker-volume/...:/var/lib/postgres) or some kind of remote shared storage for the postgres database?
Have you checked the postgres slow query log? - https://www.cybertec-postgresql.com/en/ ... ostgresql/
What are the specs of the host running postgres?
When the query is running and you do a top, is it postgres taking up most of the CPU time, and is there any % value next to "wa" in the 2nd row of top? That's disk IO wait, and a high value shows the disks can't keep up with what postgres is asking of them.
Have you investigated tuning postgres performance? - https://stackify.com/postgresql-performance-tutorial/ has some tips.
Have you allocated any huge pages? These can help significantly with postgres performance. I don't know if containerised postgres can utilise them, but a direct postgres deployment can, and they can increase speed by up to 40% immediately. https://medium.com/@FranckPachot/did-yo ... 97e7727b03
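
For the slow query log, a rough sketch of what to set in postgresql.conf (the exact file location depends on your deployment; for the official postgres container it's typically under /var/lib/postgresql/data):

log_min_duration_statement = 1000   # log any statement taking longer than 1 second
logging_collector = on
log_directory = 'pg_log'

Then reload the config (SELECT pg_reload_conf(); from psql) and re-run the search to see which query is slow.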

If you're able to provide slow query logs that can help the devs identify if there's a bug here. But please don't be surprised if what you're experiencing is expected behaviour due to resource constraints on the postgres database.

When environments get into the hundreds of thousands of documents, that's when you need to start investigating multi-node setups and considering direct installs over docker. The first step is splitting postgres off from the Mayan node and running it as a dedicated database server (not in a container, using local SSDs, etc.).

I should also mention that performance/scaling help, especially in an enterprise environment, is where you might want to consider a Mayan support plan or professional services/consulting (https://www.mayan-edms.com/support/). Once a setup gets over a certain size it will need to scale beyond one node, and designing an enterprise, production-ready, scalable environment is something the consultants are experienced with. There's only so much help we can give over the forum if the issue is postgres performance/tuning rather than Mayan itself.
But in the meantime answering all the above will help point us in some direction.

Rob
Please don't PM for general support; start a new thread with your issue instead.

Hybred
Posts: 8
Joined: Fri Jan 03, 2020 3:12 pm

Re: 150k pages taking 2+ minutes to search

Post by Hybred »

Thanks for the quick response!

I followed this guide with almost no changes, so yes, it's a PostgreSQL database running in a container on the same host as the EDMS:

https://docs.mayan-edms.com/topics/docker.html

The host specs I've allocated are 4 cores and 8 GB of memory, but both can be doubled fairly easily if needed. As it stands, when a search is running it's only using at most 35% of the allocated host CPU. The drives are 7200 RPM SAS.

In terms of large pages, I'm not following 100%. But the entire reason we've had to move to an EDMS is some poor storage decisions made over the last couple of years, where documents were pooled into massive 200+ page PDFs instead of being spread across many smaller documents. So on average each PDF is 30 MB+ and takes quite a bit of time to comb through.

I'll be grabbing an enterprise support plan as soon as I can get this stood up properly, and if it takes one to get there, I can get it approved. Sadly it worked fairly well (if slowly) up until I added 70,000 more documents; now it's hitting the 120 s timeout threshold and failing to finish.

Are there query logs I can grab and point to?

rssfed23
Moderator
Posts: 185
Joined: Mon Oct 14, 2019 1:18 pm
Location: United Kingdom

Re: 150k pages taking 2+ minutes to search

Post by rssfed23 »

No worries :)

That makes a lot of sense. I'm certain it's a postgres bottleneck you're experiencing.

When I said huge pages I was referring to a Linux kernel feature, not your actual documents (the article I linked, https://medium.com/@FranckPachot/did-yo ... 97e7727b03, explains it further). But of course, having lots of documents with lots of pages only adds support to my theory that postgres is the bottleneck.

In terms of logs, they can be obtained with "docker logs <containername>", where containername is the name of your Mayan container. The postgres container *might* have something useful in it, but if it's just running slowly that's unlikely. If Gunicorn is timing out while running a query it should appear in the Mayan container log.
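
For example (container names vary depending on how you started them, so check with docker ps first):

docker ps
docker logs --tail 200 <mayan-container-name>
docker logs --tail 200 <postgres-container-name>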

Given your use case I can see why search within OCR'd documents is so important to you - it's the only way to find the actual page range a document resides in. Splitting documents is on the Mayan roadmap (not relevant to this discussion, just mentioning it's coming as it will help you further).

Right, back to postgres:
It's possible it's waiting on disk. If you run top while the query is running, how much % disk wait ("wa") is showing?

Do you have another machine/VM you could set up the database on? You can try increasing CPU, but as you say it's not even using 100% of what it has. Some more RAM might help (check with "free -h" to see how much memory is in use), but I suspect the issue is more disk related.
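
For example, while a search is running, in another terminal on the host:

free -h      # how much memory is in use and available
top          # look at the "wa" value in the CPU row and which process is busiest
vmstat 1     # the "wa" column is CPU time spent waiting on disk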

When running in the enterprise with lots of documents you really want at least to separate the DB from the Mayan application server for the reason you're experiencing now: performance.
The container provides a good balance between convenience and performance for many use cases and for people trying Mayan out. What you're experiencing isn't entirely a Mayan problem (the search does work, just slowly) but a postgres one, and there's not much we can do aside from having you split postgres off to another machine or apply the performance tweaks in the articles I linked to (most of which you can't apply with postgres in a container).

If you are considering support it may be possible for the team to provide some more specific guidance. They can also help with general document management best practice and how best to import the documents you've got into Mayan to avoid these bottlenecks or to improve overall structure and fix the problems of the past.
What I suggest you do is log a GitLab issue: https://gitlab.com/mayan-edms/mayan-edms/issues. There's a "mark issue as confidential" button at the top of the new issue page; tick that when filling out the form and only the project team can see it.
We can discuss this further in the GitLab issue (and potentially have a call); it's also where the consulting team can help out, and if it leads to it, a support agreement can be discussed there in confidence as well.
It also gives us a chance to understand your requirements further so we can help Mayan meet your needs.
(To be clear: I'll still try to help in the GitLab issue before any purchase. It's just a better place for this discussion as the consulting team can see it there too.)
Please don't PM for general support; start a new thread with your issue instead.

Hybred
Posts: 8
Joined: Fri Jan 03, 2020 3:12 pm

Re: 150k pages taking 2+ minutes to search

Post by Hybred »

After doing a top I'm seeing the following:

https://i.imgur.com/aT03HZy.png

So it seems like postgres is also having trouble using the full 4 cores allocated to the host. I've tried running the following to let the container use more of the host's resources, but it doesn't seem to help.

docker update --cpus 4 mayan-edms-postgres

How would I achieve that desired result?

If I were to migrate to a full Windows VM, is there a simple way to transfer the data?

Thank you!

rssfed23
Moderator
Posts: 185
Joined: Mon Oct 14, 2019 1:18 pm
Location: United Kingdom

Re: 150k pages taking 2+ minutes to search

Post by rssfed23 »

Hybred wrote:
Mon Jan 06, 2020 2:20 pm

If I were to migrate to a full Windows VM, is there a simple way to transfer the data?
What do you mean by a "full windows VM"? - How are you running Mayan today?
Are you not using a Linux VM or physical Linux machine or are you using docker for windows, WSL or something?
Please don't PM for general support; start a new thread with your issue instead.

Hybred
Posts: 8
Joined: Fri Jan 03, 2020 3:12 pm

Re: 150k pages taking 2+ minutes to search

Post by Hybred »

Right now the Mayan DB is running on PostgreSQL within Docker on Alpine Linux, per the Mayan instructions.

I have a PostgreSQL instance on Server 2012 I could migrate to.

Edit: though I would prefer not to, if possible.

rssfed23
Moderator
Posts: 185
Joined: Mon Oct 14, 2019 1:18 pm
Location: United Kingdom

Re: 150k pages taking 2+ minutes to search

Post by rssfed23 »

I see what you mean now, thanks for clarifying.

The host you're running the docker containers on, is that windows also or is that Ubuntu or another Linux?
(I ask as if it's docker for windows then you can change the number of cores and memory the docker VM has access to in the docker for windows settings)


To export the data from Mayan you can run:
"docker exec -ti PostgresContainerName bash" to get inside the container running postgres

Then run
"pg_dump mayan -c -U mayan > dump.sql"
This will dump the data to dump.sql

However this file is currently inside the container. To get it to your local desktop run:
docker cp containername:/dump.sql .

Which will transfer the dump.sql file to your local machine.

From there you can get the dump file over to the other server and load it in with pgAdmin or however you normally do database management on that server.
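
For example, to load it with psql on the destination server (assuming the database and user are both called "mayan", as in the default Mayan install - adjust to whatever yours are named):

psql -U mayan -d mayan -f dump.sql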
Please don't PM for general support; start a new thread with your issue instead.

rssfed23
Moderator
Posts: 185
Joined: Mon Oct 14, 2019 1:18 pm
Location: United Kingdom

Re: 150k pages taking 2+ minutes to search

Post by rssfed23 »

I found why postgres isn't using more than 1 CPU: https://stackoverflow.com/questions/182 ... tion-query

It's an inherent limitation of postgres 9.6 (the default version in the Mayan docker instructions) and there's nothing you can really do to fix it. In that case your only option to speed up postgres is to move it to a machine with better single-core performance.

You could try setting up a separate postgres 10/11 instance (12 is known to have issues), as version 10 introduced some native parallelisation: https://dba.stackexchange.com/questions ... iple-cores
But Mayan hasn't been as extensively tested on 10/11 and there's no guarantee that it will actually speed things up. If you do upgrade to a newer version you have to exec into the container and run "pg_upgrade" to migrate the data files from 9.6 to 10/11 (make a backup before you attempt any of that though).
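
If you do try 10/11, the parallel query settings live in postgresql.conf; something along these lines is a starting point (the values need tuning for your hardware, and there's no guarantee the planner will actually parallelise Mayan's search queries):

max_worker_processes = 8
max_parallel_workers = 4
max_parallel_workers_per_gather = 2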

Ultimately the only guaranteed way to speed up the searches is to run the postgres container on a more powerful machine.
Please don't PM for general support; start a new thread with your issue instead.

Hybred
Posts: 8
Joined: Fri Jan 03, 2020 3:12 pm

Re: 150k pages taking 2+ minutes to search

Post by Hybred »

Interesting. This is the first time I've seen PostgreSQL be the bottleneck in a deployment, but it's also the first time I've run one within Docker, which it doesn't seem to be a huge fan of.
