Expected performance

vdkeybus
Posts: 2
Joined: Sat Jan 18, 2020 10:49 am

Expected performance

Post by vdkeybus »

I have added approx. 5k documents to Mayan.

It is a mix of .jpg, .png and .pdf. The pdfs are both 'native' pdf (no bitmaps) and generated from scanned pages. Sometimes also a combination of scanned pages and 'native' pdf. Created by either gscanpdf or imagemagick (convert).

The documents have all been processed by mayan for text content, file metadata and OCR.

There is only one document class. I intend to move to multiple classes as I get along with the system.

Typing a keyword in the dashboard always takes 2+ minutes to start getting results, and then easily an additional 2 minutes to get thumbnails in view. It is actually faster to just look for the document in the traditional way (i.e. in paper).

A couple of questions:
- are there any recommendations to speed this up ? The usual process of adjusting a search query to better match the desired result is next to impossible to use (in addition to the fact that you don't get to see your initial query with the results);
- is there any way to sort the results on e.g. date ?

Thanks,

rssfed23
Moderator
Posts: 185
Joined: Mon Oct 14, 2019 1:18 pm
Location: United Kingdom

Re: Expected performance

Post by rssfed23 »

Thumbnail generation and searching are 2 different performance areas to talk about. I'll start with the database:

The current searching method isn't optimal for full text queries and there are improvements planned in this area (switching to Whoosh will take those 2 minutes down to seconds).
Read through viewtopic.php?f=7&t=1560&p=2987 and you'll see this is a known issue of sorts, partly due to the backend Mayan uses and partly due to the way PostgreSQL works and its lack of concurrency.
The way to speed things up in the interim is to speed up Postgres, since it is the bottleneck here because of how the queries are run. You can do this either by putting it on more powerful hardware or by applying standard Postgres optimisations (setting up huge pages, setting up caching, etc.; items that are really out of scope for us to advise on in the forum).
If searches are a problem, one of the reasons we provide indexes and other organisation features is to avoid having to search for documents manually.

For thumbnails:
In terms of thumbnail generation, it would help to know a little bit about the environment. Is it a docker install or direct install? What kind of hardware is it running on? It's not on AWS or another cloud is it?
I ask as there are some tweaks we can make to docker, depending on platform and install method, that speed up the impression of generating thumbnails. One of them addresses a particularly nasty slowdown in thumbnail generation under docker, especially when running on a public cloud.

We can also increase the concurrency - how many are generated at once.
For that we switch to RabbitMQ as the task queue and then you can have as many converter (what processes the thumbnails) threads as you want (recommendation is no of cores+1).
This is detailed in the direct deployment docs at https://docs.mayan-edms.com/chapters/deploying.html# (Scroll down to advanced deployment).
If you're using docker, then using the docker compose file at https://gitlab.com/mayan-edms/mayan-edm ... ompose.yml you can enable RabbitMQ for task queues instead of Redis and then increase the number of fast worker threads.
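As a rough illustration of the "no of cores+1" rule of thumb above, here is a minimal Python sketch. The function name is made up for this example and is not part of Mayan; it only computes the number, it does not configure any worker.

```python
import os


def recommended_fast_workers(cores=None):
    """Return cores + 1, the rule of thumb quoted in this post.

    `recommended_fast_workers` is a hypothetical helper, not a Mayan API.
    """
    if cores is None:
        # os.cpu_count() can return None if the count is undetermined.
        cores = os.cpu_count() or 1
    return cores + 1


if __name__ == "__main__":
    print(recommended_fast_workers())
```

Treat the result as a starting point only; a machine that is also busy with OCR or other services may do better with fewer threads.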

You also definitely want to increase the document cache value. It can go up to 2 GB and will make browsing the document view pages much faster in most cases, because thumbnails can be served from the cache instead of being generated on the fly. Setting DOCUMENTS_CACHE_MAXIMUM_SIZE: 2000000000 will set the cache to just under 2GB.
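For anyone wondering why 2000000000 is "just under 2GB": assuming the setting is given in bytes (which is what the figure above implies), a quick conversion to binary gigabytes shows the gap. The helper name below is made up for this example.

```python
def bytes_to_gib(num_bytes):
    """Convert a byte count to GiB (1 GiB = 2**30 bytes)."""
    return num_bytes / 2**30


cache_size = 2_000_000_000  # value quoted for DOCUMENTS_CACHE_MAXIMUM_SIZE
print(f"{bytes_to_gib(cache_size):.2f} GiB")  # prints "1.86 GiB"
```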

You're at the size now where a simple single-machine docker install will start to show its limitations. The next step is increasing the number of workers as described above and seeing how that performs for thumbnails. Eventually, if this is a production installation, you'll want to look into multiple worker VMs/nodes to distribute the load and speed things up, but you've likely got a bit of growing room before needing to do that.

Also, migrating off docker to a direct install can provide immediate performance benefits for a number of reasons, especially if your postgres is containerised.

"is there any way to sort the results on e.g. date ?"
Not currently, no; however, work on this has already begun. The code for a list mode instead of the thumbnail view has already been done and is planned for version 4. User selectable columns will come after that (viewtopic.php?f=8&t=1528).

Mayan can comfortably scale to millions of documents with tens of millions of pages, and the team has experience with deployments of this size. If following the above isn't working out for you, I strongly recommend you consider a Mayan EDMS support plan, where more personalised/directed guidance can be given. If this is a production installation and will eventually need to scale even further, then a custom consulting agreement can be made to design a suitable architecture for the scale you need.

So in summary:
For searching:
- Slow searches are on the roadmap to be improved
- In the interim, speeding up postgres using industry standard postgresql tweaks can help significantly
- Split postgres off onto a separate dedicated server/VM
- Don't run postgres containerised, so you can take advantage of huge pages for postgres, which potentially brings an immediate 40-60% performance improvement in queries
- Consider implementing some kind of DB cache for postgres
- Searching for documents should ideally be a last resort in a Mayan configured for optimal organisation. Using indexes in combination with metadata, tags and cabinets will reduce the dependency on searching

For thumbnail generation:
- Convert to RabbitMQ for the task queue so you can safely increase the number of fast worker threads (to the no of cores+1)
- Converting to RabbitMQ also paves the way for running Mayan on multiple nodes with some dedicated to thumbnail generation if required
- We need to know if you're running in docker, and especially if you're running Mayan on the public cloud, as there are some special optimisations we can make. AWS is particularly nasty! You're welcome to make the change linked in the issue if you're running dockerised Mayan, as it should immediately improve the responsiveness "feel" of thumbnail generation. This issue (among others) is one reason a direct install can be more performant. The docker image is a good balance between no of documents/speed/performance. As you increase no of documents, the other two will naturally decrease in proportion.
- Increase your documents cache size so more thumbnails can be cached
- If Mayan is on a spinning disk, move to an SSD (thumbnail generation can be quite heavy on I/O)
- Run mayan on faster hardware
- Ultimately clustering mayan with multiple worker nodes using shared storage is used when you get into the 100,000 document+ range

For both:
- Consider a Mayan EDMS support plan so we can provide tailored advice to your environment and needs. If this is a production company Mayan instance then an on site consulting engagement can be utilised to deliver an install that meets your speed and budget requirements. sales@mayan-edms.com is the address to contact for that.


I might add a few more things covering other areas bar thumbnail generation and make this a sticky post for future reference. Thanks for asking.
Please don't PM for general support; start a new thread with your issue instead.

vdkeybus
Posts: 2
Joined: Sat Jan 18, 2020 10:49 am

Re: Expected performance

Post by vdkeybus »

Thanks for the really elaborate answer.

It tells me that the lower performance isn't due to e.g. a configuration issue. This is for a private system where performance is not critical, but I would of course fix any glaring problems in my setup if they were there.

As for the thumbnail generation, my Mayan EDMS runs on an AMD E450 with only 2 cores and 8GB of RAM, and it's also doing other things (5 days for OCR'ing 5k pages is indicative of that).

FYI, this is a bare-metal setup running Debian with native PostgreSQL and Mayan in a venv. I will eventually also have it served by the Apache web server that's also on there.

rssfed23
Moderator
Posts: 185
Joined: Mon Oct 14, 2019 1:18 pm
Location: United Kingdom

Re: Expected performance

Post by rssfed23 »

That’s good to know. At least with a direct install you can apply all the standard Postgres optimisations out there in the world. Putting 2gig of that 8gig into a dedicated hugepage for Postgres will help with DB performance immediately (and in theory also search). Mayan doesn’t use a huge amount of ram anyway.
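To put a number on the "2gig" suggestion, here is a small Python sketch of how many huge pages that would be. It assumes the common 2 MiB huge page size on x86-64 Linux (check the Hugepagesize line in /proc/meminfo on your machine), and the function name is made up for this example.

```python
import math


def huge_pages_needed(memory_bytes, huge_page_bytes=2 * 1024**2):
    """Number of huge pages required to cover memory_bytes, rounded up.

    The 2 MiB default is an assumption; the real page size is system
    dependent.
    """
    return math.ceil(memory_bytes / huge_page_bytes)


two_gib = 2 * 1024**3
print(huge_pages_needed(two_gib))  # prints 1024
```

On Linux a count like this is what would go into the vm.nr_hugepages kernel setting, with huge_pages enabled on the PostgreSQL side. PostgreSQL's total shared memory is somewhat larger than shared_buffers alone, so take this as a rough estimate and follow the PostgreSQL documentation for the exact sizing.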

With it still OCRing in the background that’s going to also take up available slots for the converter (each page is turned into a jpg first before tesseract OCRs it!), so once your OCR is done that’ll give immediate improvement to thumbnail generation time. It’s still worth increasing the document cache size and having 2 fast workers (so both cores can generate thumbnails at once). That’ll give you the best out of thumbnails aside from HW changes :)
