Mayan performance and scalability for 70+ million documents

Questions, comments, discussions. Over time certain topics might be moved to their own category.
Post Reply
inam1567
Posts: 3
Joined: Tue Nov 26, 2019 11:40 am

Mayan performance and scalability for 70+ million documents

Post by inam1567 »

I am currently working on a project with a workload of around 70+ million documents, and it appears that Mayan may be suitable for our needs. However, I was not able to find in the documentation whether Mayan is capable of handling millions of documents.

I also have the following questions:
  • How many documents can be handled in the community version? Are there any other limitations?
  • Can Mayan scale horizontally? If yes, how? Is there any documentation or guide?
  • What is the performance of Mayan on a standalone instance?

User avatar
rosarior
Developer
Posts: 494
Joined: Tue Aug 21, 2018 3:28 am
Location: Puerto Rico
Contact:

Re: Mayan performance and scalability for 70+ million documents

Post by rosarior »

How many documents can be handled in the community version? Are there any other limitations?
There are no inherent limits for any of the objects (users, groups, documents, etc.); it will depend on your hardware and how you customize the installation for your specific workload. As of version 3.2 there is only one edition, the community one; the professional edition was retired and its improvements were merged into the normal version. All project revenue now comes from services (https://www.mayan-edms.com/support/) and book sales (https://www.mayan-edms.com/book/).
Can Mayan scale horizontally? If yes, how? Is there any documentation or guide?
There is a documentation chapter named "Scaling up" which covers the basics: https://docs.mayan-edms.com/topics/admi ... scaling-up

However most of these recommendations will be obsolete by the time the next version comes out because of significant changes in the Docker image and other aspects. The best approach is to contact us for a consultation contract so that we can provide a scaling up plan customized for your specific workload.
What is the performance of Mayan on a standalone instance?
What does "standalone instance" mean in this context?

Finally, we stopped publishing use case studies to protect our clients' privacy. The default Docker image is built as a one-size-fits-all approach and focuses on ease of installation. Still, we've had users upload several million documents with the default image. For 70+ million documents I recommend you contact us directly for a quote on a custom deployment plan for your specific workload.

Thanks.

inam1567
Posts: 3
Joined: Tue Nov 26, 2019 11:40 am

Re: Mayan performance and scalability for 70+ million documents

Post by inam1567 »

There is a documentation chapter named "Scaling up" which covers the basics: https://docs.mayan-edms.com/topics/admi ... scaling-up
Thanks for your kind reply.
I've gone through the above-mentioned documentation and found that multiple application hosts can be used, but can the database and *file storage scale in the same way as the application hosts?

*i don't want to go with object storage AWS S3.

User avatar
rssfed23
Moderator
Posts: 191
Joined: Mon Oct 14, 2019 1:18 pm
Location: United Kingdom
Contact:

Re: Mayan performance and scalability for 70+ million documents

Post by rssfed23 »

inam1567 wrote:
Thu Nov 28, 2019 11:12 am
I've gone through the above-mentioned documentation and found that multiple application hosts can be used, but can the database and *file storage scale in the same way as the application hosts?
*i don't want to go with object storage AWS S3.
I hope your evaluation went well Inam1567. I wanted to add some thoughts on the 2 points you raise so others that come across this know:
We bundle PostgreSQL with the docker-compose image and it's our recommended database, but Mayan is not tightly coupled to it (there's no database engine within Mayan itself; it relies on whatever you provide).
Meaning: you scale up the database in exactly the same way you would scale any other PostgreSQL instance separate from Mayan. Ultimately, any method that doesn't change the underlying schema and presents a single entry point to the database for Django can give good results. https://www.postgresql.org/docs/9.6/hig ... ility.html is a good starting point. One thing to be mindful of when using technologies like streaming replication is tweaking the "COMMON_DB_SYNC_TASK_DELAY" value so that Mayan waits a bit longer for writes to be committed to all replicas before continuing. In my testing, the Redis lock backend (for file locking) scales more effectively than using PostgreSQL for locking, and you will have to use a locking backend other than the default file-based one to scale to the number of nodes you need.
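As a rough sketch of the lock-backend and delay tweaks mentioned above (the dotted backend path, argument key, and values here are my assumptions based on Mayan's `MAYAN_<SETTING>` environment-variable convention; verify the exact names against the settings reference for your Mayan version):

```shell
# Hypothetical overrides for the Mayan app/worker containers.
# Switch file locking from the default file backend to Redis so that
# multiple application hosts can coordinate:
export MAYAN_LOCK_MANAGER_BACKEND="mayan.apps.lock_manager.backends.redis_lock.RedisLock"
export MAYAN_LOCK_MANAGER_BACKEND_ARGUMENTS="{'redis_url': 'redis://redis:6379/2'}"

# With streaming replication, give replicas time to catch up before
# dependent tasks run (seconds; tune for your replication lag):
export MAYAN_COMMON_DB_SYNC_TASK_DELAY=5
```

The same values can be set in the `environment:` section of a docker-compose file instead of exported in the shell.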

When it comes to file storage, similar to the database, you can use any technology that can present object, file, or block storage to your Mayan nodes. For millions of documents, object storage is definitely preferred. Although you mention AWS S3 is a no-go, there are many other S3-compatible object stores that let you host your data on-premise; Minio springs to mind. It's also massively scalable, so not only can you host it on-premise, it scales far more easily than a traditional NetApp or NFS storage would. I strongly recommend you look into on-premise object storage, as it provides significant benefits beyond performance: compression, deduplication and, most importantly, sharding and erasure coding, so you can lose up to a configurable percentage of nodes and still retain all data (think of it as RAID spanned across nodes, at whatever level you desire). Ceph is another solution that provides an S3-compatible API; it is more complicated than Minio, but it may also be considered more "enterprise friendly", and there are vendors offering support for it.
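To illustrate pointing Mayan's document storage at an on-premise S3-compatible store such as Minio, here is a hedged sketch: the endpoint, bucket, and credentials are placeholders, and the argument keys are based on django-storages' S3Boto3Storage backend, so check them against the storage documentation for your versions.

```shell
# Hypothetical: route document storage through an S3-compatible API
# served by an on-premise Minio (or Ceph RGW) cluster instead of AWS.
export MAYAN_DOCUMENTS_STORAGE_BACKEND="storages.backends.s3boto3.S3Boto3Storage"
export MAYAN_DOCUMENTS_STORAGE_BACKEND_ARGUMENTS="{
    'access_key': 'minio-access-key',
    'secret_key': 'minio-secret-key',
    'bucket_name': 'mayan-documents',
    'endpoint_url': 'http://minio.internal:9000'}"
```

Because the store speaks the S3 protocol, nothing else in the Mayan configuration needs to know the data never leaves your premises.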

As you rightly say; multiple app nodes are going to be a must here. You'll want to launch multiple workers on multiple nodes to process tasks. The slower tasks like OCR are going to be a bottleneck when ingesting 70 million documents but the great thing about Mayan and its distributed setup is that you can launch 20 (or however many you need) worker nodes just to do OCR when initially ingesting and then scale those back to a couple for day-to-day operations once initial ingest is complete.
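The burst-then-shrink worker pattern described above might look like this with docker-compose; note that the service name is a placeholder, since which worker service handles the OCR queue depends on your compose file and Mayan version:

```shell
# Hypothetical: scale the OCR-handling worker service out for the
# initial ingest of 70M documents...
docker-compose up --detach --scale ocr_worker=20

# ...then shrink back for day-to-day operations once ingest completes.
docker-compose up --detach --scale ocr_worker=2
```

The same elasticity applies to any slow queue (OCR, parsing, indexing): add workers for the backlog, remove them when the queue drains.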

As Rosarior says, a consulting/support contract is the best way to get an environment like this up and running, especially for enterprise-critical documents. They've done this type of deployment many times in the past and know all the tips and gotchas that can significantly reduce your time to value.
Please don't PM for general support; start a new thread with your issue instead.

Post Reply