Search Backends with Small Servers

Wanted to start another topic. It stems from the post “Document Processing Times” but focuses on search backends in the context of running Mayan at home for family use. (See my use case and example below.)

Thought I would get the ball rolling and ask whether anyone in forum-landia has experience with the different search backends (I know of two that Mayan supports: Whoosh, the default, and ElasticSearch).

Looking for examples and pointers to documentation on how to change the default search backend, or maybe tweak the default to better fit the home-use case. I am sure there are others out there like me who just want a digital file cabinet.

My Use Case

I wanted to put my use case out there. As with any software, there is no one size fits all. My use case is just my family’s personal documents, and the goal is to be able to search a document’s OCR data. I will use indexes for documents commonly looked up by year/type and other added metadata. I share this to give context to the topic. I am not looking to process 5,000 documents a day; more like 1-15, as quickly and with as little energy usage as possible.

So it seems my Mayan server is very CPU hungry with regard to the search workers and search message queue. As was pointed out in the post referenced above, one change can require a lot of search updates.

Example document processing times

For example, as part of my email submission workflow, I need to change the document type and update the metadata fields for that type. Just updating one two-page document and three metadata fields, then moving the workflow to completed (it also changes tags and file cabinets, so more changes I presume), took two hours to complete and showed 50,000+ messages in the search queue. That feels like a bit much, but maybe I am off base and have unreal expectations. My specific setup is below to provide context for this example.

My Server Setup for Context

  • Manual install (Turnkey Linux v17), upgraded to Mayan v4.4.4
  • Running as a VM in Proxmox v7.3
  • 6 x vCPUs (Xeon(R) CPU E3-1245 v5 @ 3.50GHz )
  • 10GB vRAM (DDR4)
  • 500GB vHDD (NVME backed)
    Note: this is a manual install and is not technically supported. I hope to move to the Docker install once I get my mind around the Docker environment, and that may play a role in this issue/question as well.

Please bear in mind that while Mayan can work for home and personal use, and can run on small hardware (https://magazine.odroid.com/article/solar-powered-microserver/), that has never been its primary target. Mayan works the way it does because it favors automation and scalability, targeting millions of documents.

Over time, features and optimizations have been added to allow more adjustments, but there will always be a resource floor that is higher than that of systems designed for small-scale use.

For the absolute best indexing performance, use the database as the search provider. The database backend does not require indexing updates. The penalty, however, is slow search results. The query generated for basic search is quite complex (it targets all fields) and will most likely time out for almost any use case. It is best to disable basic search by passing the environment variable MAYAN_SEARCH_DISABLE_SIMPLE_SEARCH=true and use the advanced search, which produces smaller query statements since it targets individual fields.

During hurricane Fiona, I spent a month without power. Having to work only on battery power led me to start adding support for Zinc (https://github.com/zincsearch/zincsearch), a less resource-intensive alternative to ElasticSearch. It uses the same API and is packaged as a single Go binary, which would fit well between Whoosh (file based) and ElasticSearch (HTTP Java server). The backend is unfinished, but if there is interest it could be added as a SIG project.

So it seems my Mayan server is very CPU hungry with regard to the search workers and search message queue. As was pointed out in the above referenced post, one change can require a lot of search updates.

Any change to a document causes an exponential cascade of search index updates. This means that working with 10 documents could cause 1,000 search update tasks to be scheduled.

One project in the early stages is a custom implementation of task deduplication: Mayan will check if a document is already in the search queue for an index update and, if so, avoid queueing it again. Deduplication is almost like caching in terms of the dangers of false positives and edge cases.
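The deduplication idea described above can be sketched roughly like this (illustrative Python only; the class and method names are made up and this is not Mayan’s actual implementation):

```python
from collections import deque


class DedupQueue:
    """Toy queue that drops index-update tasks already pending for a document."""

    def __init__(self):
        self._pending = set()    # document IDs that already have an update queued
        self._tasks = deque()

    def enqueue_index_update(self, document_id):
        """Queue a search index update unless one is already pending."""
        if document_id in self._pending:
            return False         # duplicate: skip the redundant task
        self._pending.add(document_id)
        self._tasks.append(document_id)
        return True

    def process_next(self):
        """Pop the next task; later changes to the document can re-queue it."""
        document_id = self._tasks.popleft()
        self._pending.discard(document_id)
        return document_id


if __name__ == "__main__":
    queue = DedupQueue()
    # Five edits touching three documents schedule only three update tasks.
    for doc_id in [1, 2, 1, 3, 1]:
        queue.enqueue_index_update(doc_id)
    print(len(queue._tasks))  # prints 3
```

The edge cases mentioned above show up in `process_next`: the document must be removed from the pending set at the right moment, or a change made while the task is running could be silently dropped (the false-positive danger).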


Understood! Kind of like Mayan is an F1 race car: while you could drive it to work, you’re better off in something better suited. That said, with a few tweaks and some understanding, how cool would it be to pull up in front of your co-workers in an F1… :cowboy_hat_face: Joking aside, I do think that as long as expectations are set, and with some tweaking and documentation, it can work as a home-use server. That’s my mission, and I will try to document as I go on my Mayan@Home journey.

So with that in mind, I wanted to try it, and I changed the search backend inside my environment file (.env):

...
# inside your .env file
MAYAN_SEARCH_BACKEND=mayan.apps.dynamic_search.backends.django.DjangoSearchBackend
MAYAN_SEARCH_DISABLE_SIMPLE_SEARCH=true
...

I did not see an option to “Reindex search backend” under Tools. I did verify under Settings that the changes were effective.

Total newbie to Mayan, but part of this thread drew a parallel that may be worth sharing, @roberto.rosario and @DocCyblade

In the parallel universe of self-hosted / small company email servers, indexing emails for quick searching is a challenge, and the “solution” was, for a decade, the unwieldy and resource-intensive Apache Solr. F1? No, a Mack Truck that people try to shoehorn onto Pinewood Derby hardware.

During Covid, a team put together an alternative that blew beta testers away and is now being integrated as the default into Dovecot (the Linux email inbox core / standard for everyone). It is called FTS Flatcurve, a Dovecot FTS plugin backed by Xapian: https://github.com/slusarz/dovecot-fts-flatcurve

If the work of indexing lies in searching extant files for keywords, it may be worth a look. If the indexing challenge in Mayan is more DB-intensive (I’m not yet into the weeds of the underlying Mayan engine), then it may not be worth the time. Best, ZX

The F1 analogy is pretty accurate :smiley:

There are many settings that can be tweaked to lower the requirements, but at the cost of performance and of disabling features. Some examples:

  • Not all document types need OCR, parsing, and file metadata extraction. Disable these for the document types that don’t need them and background tasks will be reduced.
  • Disable apps that are not being used. Things like workflows, document linking, web links, and commenting can be disabled, which removes their event triggers and permission checks.
  • Lower the thumbnail and preview resolutions.
  • Increase the file size of the caches.
  • Place the caches on their own dedicated solid state drive.
  • Spin up the class A worker instances on dedicated hosts with fast CPUs.
  • Try different browsers. For some users Firefox works best, for others Chromium works best.
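As a concrete sketch of the app-disabling bullet: assuming the COMMON_DISABLED_APPS setting and the app paths below are correct for your version (these names are my assumption; verify them against your instance’s Settings view before relying on them), the .env entry might look something like this:

...
# inside your .env file -- setting name and app paths are assumptions,
# verify them in your Mayan version's Settings view
MAYAN_COMMON_DISABLED_APPS="['mayan.apps.web_links', 'mayan.apps.document_comments']"
...

As with any settings change, restart the stack afterward and confirm the value took effect under Settings.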

I did not see an option to “Reindex search backend” under Tools. I did verify under Settings that the changes were effective.

Yes, that is the correct expectation. The database backend does not need indexing.

Thanks for the recommendation.

Looking at the repository, this seems to be a plugin for Dovecot that uses Xapian as the backend, not a search system itself.

Xapian is by nature file based, and its usage in Mayan would have the same artifacts that Whoosh currently does.

Xapian does have a server project called xapian-omega, which would allow it to work like ElasticSearch. None of the team members have used Xapian or xapian-omega. We would need to learn the search and indexing syntax, test for edge cases, and try different deployment scenarios. This means that, if scheduled for research, it may take several months to determine whether a Xapian/xapian-omega backend can be added.