Search Backends with Small Servers

roberto.rosario · February 22, 2023, 10:11pm

Please bear in mind that while Mayan can work for home and personal use, and run on small hardware (https://magazine.odroid.com/article/solar-powered-microserver/ and "Self hosted enterprise document server using Mayan EDMS 3.0 and an ODROID HC1" by Roberto Rosario, on Medium), that has never been its primary target. Mayan works the way it does because it favors automation and scalability, targeting millions of documents.

Over time, features and optimizations have been added to allow more adjustments, but there will always be a minimum resource requirement that is higher than that of systems designed for small-scale use.

For the absolute best indexing performance, use the database as the search provider. The database does not require index updates. The penalty, however, is slow search results. The query generated for basic search is quite complex (it targets all fields) and will most likely time out in almost any use case. It is best to disable basic search by passing the environment variable MAYAN_SEARCH_DISABLE_SIMPLE_SEARCH=true and to use the advanced search instead, which produces smaller query statements since it targets individual fields.

During Hurricane Fiona, I spent a month without power. Having to work only on battery power led me to start adding support for Zinc (GitHub: zincsearch/zincsearch, "A lightweight alternative to elasticsearch that requires minimal resources, written in Go"). It is a less resource-intensive alternative to ElasticSearch. It uses the same API and is packaged as a single Go binary. This would fit well between Whoosh (file based) and ElasticSearch (HTTP Java server). The backend is unfinished, but if there is interest it could be added as a SIG project.

So it seems my Mayan server is very CPU hungry with regard to the search workers and search message queue. As was pointed out in the above-referenced post, one change can require a lot of search updates.

Any change to a document causes an exponential cascade of search index updates. This means that working with 10 documents could cause 1000 search update tasks to be scheduled.

One project in its early stages is adding a custom implementation of task deduplication. Mayan will check if a document is already in the search queue for an index update and, if so, avoid queueing it again. Deduplication is almost like caching in terms of the dangers of false positives and edge cases.
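
To make the idea concrete, here is a minimal sketch of queue deduplication in plain Python. It is not Mayan's implementation; the class name DedupQueue and its methods are hypothetical, and it uses a single-process in-memory set where a real deployment would need state shared between workers. It replays the "10 documents, 1000 update requests" case and shows one way to handle the main edge case: the key is released before indexing starts, so a change that arrives while a document is being indexed is queued again rather than dropped.

```python
from collections import deque


class DedupQueue:
    """Hypothetical in-memory queue that skips documents already waiting."""

    def __init__(self):
        self._queue = deque()
        self._pending = set()

    def enqueue(self, document_id):
        """Queue an index update; return False if one is already pending."""
        if document_id in self._pending:
            return False
        self._pending.add(document_id)
        self._queue.append(document_id)
        return True

    def pop(self):
        """Take the next document and release its key before indexing starts."""
        document_id = self._queue.popleft()
        # Release the key before the work runs, so a change that arrives
        # during indexing is queued again instead of being silently dropped.
        self._pending.discard(document_id)
        return document_id

    def __len__(self):
        return len(self._queue)


queue = DedupQueue()
# 10 documents, each touched 100 times: 1000 update requests in total.
accepted = sum(
    queue.enqueue(doc_id) for _ in range(100) for doc_id in range(10)
)
print(accepted, len(queue))
```

Releasing the key only after indexing finishes would be the false-positive case: an edit made during indexing would be treated as a duplicate and the index would go stale.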