inam1567 wrote: ↑
Thu Nov 28, 2019 11:12 am
I've go through the above mentioned documents and found that multiple application hosts can be used but can database and *file storage scalable same like application host.?
*i don't want to go with object storage AWS S3.
I hope your evaluation went well, inam1567. I wanted to add some thoughts on the two points you raise, for the benefit of others who come across this thread:
We bundle Postgres with the docker-compose image and it is our recommended database, but Mayan is not tightly coupled to it: there is no database engine inside Mayan itself, so it relies on whatever database you provide.
This means you scale the database exactly as you would any other Postgres instance, independently of Mayan. Ultimately, any method that doesn't change the underlying schema and gives Django a single entry point to the database can give good results. https://www.postgresql.org/docs/9.6/hig ... ility.html is a good starting point.
One thing to be mindful of when using technologies like streaming replication is tuning the "COMMON_DB_SYNC_TASK_DELAY" value so that Mayan waits a bit longer, giving writes time to be committed to all replicas before it continues. In my testing, the Redis lock backend (for file locking) scales more effectively than using Postgres for locking. Either way, you will have to use a locking backend other than the default file-based one to scale to the number of nodes you need.
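To illustrate why the lock has to live somewhere every node can reach, here is a minimal sketch of the "acquire only if free, with an expiry" behaviour that a shared lock store provides. It is an in-memory stand-in written for this post, not Mayan's lock manager and not a Redis client: the class name SharedLockStore, its methods, and the lock and node names are all hypothetical.

```python
import time
from typing import Callable, Dict, Tuple


class SharedLockStore:
    """In-memory stand-in for a lock store that every node can reach."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        # lock name -> (owner, expiry timestamp)
        self._locks: Dict[str, Tuple[str, float]] = {}

    def acquire(self, name: str, owner: str, ttl: float) -> bool:
        """Take the lock only if it is free or the previous holder's TTL ran out."""
        now = self._clock()
        held = self._locks.get(name)
        if held is not None and held[1] > now:
            return False  # another node still holds an unexpired lock
        self._locks[name] = (owner, now + ttl)
        return True

    def release(self, name: str, owner: str) -> bool:
        """Remove the lock, but only if this owner is the recorded holder."""
        held = self._locks.get(name)
        if held is None or held[0] != owner:
            return False
        del self._locks[name]
        return True


if __name__ == "__main__":
    store = SharedLockStore()
    print(store.acquire("document-42", "node-a", ttl=30.0))  # first node takes the lock
    print(store.acquire("document-42", "node-b", ttl=30.0))  # second node is refused
```

A file-based lock only protects processes on the same host, which is why it stops being enough once tasks run on several nodes. The expiry matters too: if a node dies while holding a lock, the other nodes are not blocked forever.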
When it comes to file storage, as with the database, you can use any technology that can present objects, folders, or block devices to your Mayan nodes. For millions of documents, object storage is definitely preferred. Although you rule out AWS S3, there are many other S3-compatible object stores that let you host your data on-premise; Minio springs to mind. It is also massively scalable, so not only can you host it on-premise, it also scales much more easily than a traditional NetApp or NFS storage setup. I strongly recommend you look into on-premise object storage, as it provides significant benefits beyond performance: compression, deduplication and, most importantly, sharding and erasure coding, so you can lose up to whatever percentage of nodes you configure and still retain all data. Think of it as a RAID spanned across nodes, where you can configure whatever level of redundancy you desire. Ceph is another solution that can provide an S3-compatible API. It is more complicated than Minio, but it may also be considered more "Enterprise Friendly", and there are vendors offering support for it.
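The erasure coding trade-off can be made concrete with a little arithmetic. The sketch below assumes a Reed-Solomon-style scheme in which each object is split into data shards plus parity shards, one shard per node or drive, and any shards up to the parity count can be lost. The ErasureLayout class and the 12 + 4 layout are illustrative choices of mine, not settings taken from Minio or Ceph.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErasureLayout:
    """Hypothetical helper describing a data + parity shard layout."""

    data_shards: int
    parity_shards: int

    def __post_init__(self) -> None:
        if self.data_shards < 1 or self.parity_shards < 0:
            raise ValueError("need at least one data shard and non-negative parity")

    @property
    def total_shards(self) -> int:
        return self.data_shards + self.parity_shards

    @property
    def tolerated_losses(self) -> int:
        # Up to `parity_shards` shards can be lost and the object is still recoverable.
        return self.parity_shards

    @property
    def usable_fraction(self) -> float:
        # Share of raw capacity that stores data rather than parity.
        return self.data_shards / self.total_shards


if __name__ == "__main__":
    layout = ErasureLayout(data_shards=12, parity_shards=4)  # illustrative layout
    print(layout.total_shards, layout.tolerated_losses, layout.usable_fraction)
```

Raising the parity count buys tolerance for more failed nodes at the cost of usable capacity, which is the "configure whatever level of redundancy you desire" knob described above.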
As you rightly say, multiple app nodes are going to be a must here. You'll want to launch multiple workers on multiple nodes to process tasks. The slower tasks like OCR are going to be a bottleneck when ingesting 70 million documents, but the great thing about Mayan and its distributed setup is that you can launch 20 (or however many you need) worker nodes just to do OCR during the initial ingest, then scale back to a couple for day-to-day operations once it is complete.
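A rough way to size the initial OCR fleet is a back-of-the-envelope estimate like the one below. It assumes perfect parallelism (no queueing, database, or storage contention), and the per-document OCR time is a placeholder I made up; measure it on a sample of your own documents before trusting any result. The function names are hypothetical.

```python
import math

SECONDS_PER_DAY = 86_400


def ingest_days(documents: int, seconds_per_document: float, workers: int) -> float:
    """Estimated wall-clock days to OCR a backlog, assuming perfect parallelism."""
    if workers < 1:
        raise ValueError("need at least one worker")
    return documents * seconds_per_document / workers / SECONDS_PER_DAY


def workers_needed(documents: int, seconds_per_document: float, target_days: float) -> int:
    """Smallest worker count that finishes within target_days under the same assumption."""
    if target_days <= 0:
        raise ValueError("target_days must be positive")
    return math.ceil(documents * seconds_per_document / (target_days * SECONDS_PER_DAY))


if __name__ == "__main__":
    backlog = 70_000_000   # figure from this thread
    ocr_seconds = 5.0      # placeholder assumption, not a measured value
    print(round(ingest_days(backlog, ocr_seconds, workers=20), 1))
    print(workers_needed(backlog, ocr_seconds, target_days=30))
```

Real throughput will be lower than this ideal figure, so treat the result as a floor on the number of workers rather than a plan.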
As Rosarior says, a consulting/support contract is the best way to get an environment like this up and running, especially for Enterprise-critical documents. They've done this type of deployment many times and know all the tips and gotchas that can significantly reduce your time to value.