How to use Elasticsearch Backend?

MrSaltman · March 17, 2023, 9:23am

Dear Mayan Users and Creators!
(Note: I am no expert regarding Docker or Elasticsearch in general, so please forgive me, if I am overlooking something obvious)

I am currently running MayanEDMS version 4.4 via docker-compose on a remote server (with traefik set up). After browsing for some time, I realised that the current Search backend (Whoosh by default since 4.2, I reckon) does not seem to work properly for me.

Since I actually want to use Elasticsearch, that was not that big of a deal, but the real problem is that I cannot get Elasticsearch to work. From what I can tell, it is included in the docker-compose file by default, which should mean that there is at least a running instance of the Elasticsearch service. The actual errors occur when I change the Search backend to Elasticsearch and try to search. I am getting the following error (as a pop-up in MayanEDMS):

Search backend error. Verify that the search service is available and that the search syntax is valid for the active search backend; ConnectionError(<urllib3.connection.HTTPConnection object at 0x7fe226ec0160>: Failed to establish a new connection: [Errno 111] Connection refused) caused by: NewConnectionError(<urllib3.connection.HTTPConnection object at 0x7fe226ec0160>: Failed to establish a new connection: [Errno 111] Connection refused)

Regarding this error, I tried to fix it with the SEARCH_BACKEND_ARGUMENTS setting as mentioned in the following issue on GitLab (Add documentation for Elasticsearch and Docker Compose (#1092) · Issues · Mayan EDMS / Mayan EDMS · GitLab), but I cannot figure out what exactly I have to enter. Especially client_hosts is unclear to me, since I have traefik running. On the client_http_auth side, I believe that I used the correct credentials (just the default ones, since I did not change them: ‘client_http_auth’:[‘elastic’,‘mayanespasswords’]).

If someone would be so kind as to guide me through the correct setup of Elasticsearch, I would be very thankful. I am looking forward to your ideas/answers.

Thank you in advance,
Alex

DrRSatzteil · March 18, 2023, 8:26am

It also took me a while to figure that out. The following configuration works for me:

MAYAN_SEARCH_BACKEND_ARGUMENTS={'client_hosts':['http://elasticsearch:9200'],'client_http_auth':['USER','PASS']}

Elasticsearch is deployed via the default docker-compose file. Traefik shouldn’t be an issue here since mayan and elastic communicate directly via the docker network.

MrSaltman · March 18, 2023, 9:52pm

First of all, thank you for contributing to my question!

So I changed my MAYAN_SEARCH_BACKEND_ARGUMENTS to:

MAYAN_SEARCH_BACKEND_ARGUMENTS: {‘client_hosts’:[‘http://elasticsearch:9200’],‘client_http_auth’:[‘elastic’,‘mayanespassword’]}

After that, I restarted the docker-compose and went to rebuild the Search Backend via the Mayan Frontend but received the following error (in my console):

elasticsearch.exceptions.ConnectionError: ConnectionError(<urllib3.connection.HTTPConnection object at 0x7f8f19d639a0>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution) caused by: NewConnectionError(<urllib3.connection.HTTPConnection object at 0x7f8f19d639a0>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution)

Do you have any idea what could cause that? I suppose it is the address to elasticsearch, but I did not change anything regarding that, so I do not know what is causing this error.

Thank you in advance

DrRSatzteil · March 18, 2023, 10:31pm

First of all: Is this a 100% copy & paste? My config uses ’ and not ’ or ‘ characters. → EDIT: I see that the forum changes these unless pasted in code fences:

' and not ’ or ‘
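To rule out the quote problem before touching the real configuration, the pasted value can be checked offline. The following is only a sketch: `ast.literal_eval` is used here as a stand-in parser (an assumption, Mayan may parse the setting differently), and the helper names are made up for illustration.

```python
import ast

# Typographic quotes that forums and word processors substitute for ' and ".
SMART_QUOTES = {
    "\u2018": "'", "\u2019": "'", "\u201a": "'",
    "\u201c": '"', "\u201d": '"', "\u201e": '"',
}


def find_smart_quotes(value):
    """Return (index, character) pairs for every typographic quote in value."""
    return [(i, ch) for i, ch in enumerate(value) if ch in SMART_QUOTES]


def normalize_quotes(value):
    """Replace typographic quotes with their plain ASCII equivalents."""
    return "".join(SMART_QUOTES.get(ch, ch) for ch in value)


def parse_arguments(value):
    """Parse a dict literal; raises SyntaxError or ValueError if malformed."""
    parsed = ast.literal_eval(value)
    if not isinstance(parsed, dict):
        raise ValueError("expected a mapping of backend arguments")
    return parsed
```

If `find_smart_quotes` returns anything for the value in the configuration file itself (not just for the forum copy), replacing those characters with plain `'` is the first thing to try.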

If we can rule this out as the source of the problem, could you please confirm that your elasticsearch service is called ‘elasticsearch’ in your docker-compose file? It should be, unless you changed the default service name.

If this is the case, this error seems to be related to Docker itself (though this seems odd given that the service is hosted within the same virtual Docker network). Do you have a dns entry in your /etc/docker/daemon.json file (something like "dns": ["10.9.0.1"])? If not, could you try again after adding a valid dns entry? Don’t forget to restart the Docker daemon after adding the dns entry.
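The two errors in this thread point at different layers: [Errno 111] means an address was reached but nothing was listening on the port, while [Errno -3] means the hostname could not be resolved at all. A small standard-library sketch to tell them apart from inside the Mayan container might look like this (the function name and the returned strings are illustrative, not part of Mayan):

```python
import socket


def diagnose(host, port, timeout=3.0):
    """Classify a TCP connection attempt to host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"
    except socket.gaierror:
        # The name lookup failed, e.g. "[Errno -3] Temporary failure in name resolution".
        return "name resolution failed"
    except ConnectionRefusedError:
        # The host exists but nothing accepts connections on that port ([Errno 111]).
        return "connection refused"
    except OSError as exc:
        # Timeouts, unreachable networks and similar problems.
        return f"other error: {exc}"
```

Calling `diagnose("elasticsearch", 9200)` from a Python shell inside the Mayan app container should return "ok" once the service is reachable under that name.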

DrRSatzteil · March 18, 2023, 10:45pm

Oh and something else I just realized:

I set this setting in the .env file.

In the UI this setting is shown as:

client_hosts:
- http://elasticsearch:9200
client_http_auth:
- USER
- PASS

So it might need a different syntax when you set this in the UI.
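Both notations appear to describe the same data: a mapping with two list values. As a rough illustration (assuming every value is a list of plain strings, and ignoring the quoting rules a real YAML emitter would apply), the one-line form from the .env file can be rendered in the block form the UI shows:

```python
import ast


def to_block_yaml(arguments):
    """Render a dict of string lists in the block style shown in the UI."""
    lines = []
    for key, values in arguments.items():
        lines.append(f"{key}:")
        for item in values:
            lines.append(f"- {item}")
    return "\n".join(lines)


# The value as set in the .env file (USER and PASS are placeholders).
env_value = "{'client_hosts':['http://elasticsearch:9200'],'client_http_auth':['USER','PASS']}"
print(to_block_yaml(ast.literal_eval(env_value)))
```

The printed result has the same five lines as the UI listing above.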

MrSaltman · March 19, 2023, 11:46am

@DrRSatzteil
Thank you for the fast replies!
1) So, yes, this was more or less a copy paste (I copied beginning from “{” ) but it seems, that the setting itself should have been set correctly, as this is how it looks in the UI:

It might also be noteworthy, that I set these variables directly in the config.yml located in the volume of my “mayan_app” container. After setting that I restart the docker-compose, which should do the trick (by confirming in the UI).

2) Yes, the service is called “elasticsearch”. I do not remember changing anything regarding Elasticsearch in the compose file, but I might have tried something a few weeks ago. However, this is how elasticsearch is represented in my docker-compose file:

3) I was looking for the “daemon.json” file but it seems that I do not have that file on the given path, which seems to be the default one though.

DrRSatzteil · March 19, 2023, 12:38pm

This all looks good. If the daemon.json file does not exist, you can simply create it. The simplest config would be:


{
	"dns": ["8.8.8.8"]
}

Or any other DNS provider (or more than one)
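If the file already exists and contains other settings, the dns key should be added without discarding them. A minimal sketch of that merge, using only the standard library (the function name is made up; writing the result to /etc/docker/daemon.json still has to be done with root permissions):

```python
import json


def with_dns(existing_text, servers):
    """Return daemon.json content with the "dns" key set, keeping other keys."""
    config = json.loads(existing_text) if existing_text.strip() else {}
    if not isinstance(config, dict):
        raise ValueError("daemon.json must contain a JSON object")
    config["dns"] = list(servers)
    return json.dumps(config, indent=2) + "\n"


# An empty or missing file yields the minimal config shown above.
print(with_dns("", ["8.8.8.8"]))
```

Parsing the file with `json.loads` first also catches syntax mistakes before the Docker daemon is restarted with a broken config.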

MrSaltman · March 19, 2023, 2:17pm

Okay, so I created the “daemon.js” file and just added the dns config, but sadly this does not seem to resolve the issue either. After I created the file, I restarted the docker daemon, docker and the docker-compose itself & went to reindex the Search Backend, but in my console it throws the following exception(s):

DrRSatzteil · March 19, 2023, 3:05pm

Are you sure that elasticsearch is running? Do you use the elasticsearch profile in the .env file? You can check if the container is running with 'docker ps'.
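The same check can be scripted. This sketch assumes the docker CLI is on the PATH and that the container name contains the service name, which is how Compose usually names containers; the helper names are illustrative.

```python
import subprocess


def running_container_names():
    """Return the names reported by `docker ps` (requires the docker CLI)."""
    result = subprocess.run(
        ["docker", "ps", "--format", "{{.Names}}"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.split()


def is_running(service, names):
    """True if any running container name contains the service name."""
    return any(service in name for name in names)
```

On the Docker host, `is_running("elasticsearch", running_container_names())` returning False means there is no Elasticsearch container for Mayan to connect to.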

MrSaltman · March 19, 2023, 3:22pm

Honestly, I am not sure. I was just assuming it is, since the service is in the docker-compose per default.

My .env profiles look like this:

So would I just have to add the elasticsearch profile? What exactly would that do?

“docker ps” shows that only mayan, postgres, rabbitmq and redis are running, so I suppose you are on to something.

Please forgive me for my apparently quite obvious mistakes, I am really still quite a newbie in docker.

So what would be my next steps? Simply adding the “elasticsearch” profile to my “.env” file and restarting again? Will that create the elasticsearch container I am looking for?

DrRSatzteil · March 19, 2023, 3:24pm

Exactly!

Don’t worry, I think we’re getting closer
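To double-check the .env file, the active profiles can be listed with a few lines of Python. This is a sketch under two assumptions: that the profiles are set through Docker Compose's COMPOSE_PROFILES variable as a comma-separated list, and that the profile names in the example are representative (only "elasticsearch" is taken from this thread).

```python
def compose_profiles(env_text):
    """Return the profiles from the last uncommented COMPOSE_PROFILES line."""
    profiles = []
    for raw in env_text.splitlines():
        line = raw.strip()
        if line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key.strip() == "COMPOSE_PROFILES":
            value = value.strip().strip("'\"")
            profiles = [p.strip() for p in value.split(",") if p.strip()]
    return profiles


# Hypothetical .env content before and after adding the profile.
before = "COMPOSE_PROFILES=postgresql,rabbitmq,redis\n"
after = "COMPOSE_PROFILES=elasticsearch,postgresql,rabbitmq,redis\n"
print("elasticsearch" in compose_profiles(before))
print("elasticsearch" in compose_profiles(after))
```

A service that belongs to a profile is only created when that profile is active, so the container should appear in 'docker ps' after the stack is brought up again with the profile added.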

MrSaltman · March 19, 2023, 4:12pm

Alright!

So as far as I can see, that was it: I missed the profile (or the daemon.js, or both together). It seems to be working just fine now, from what I can tell from a few test searches.

Thank you very much for the help, fast answers and patience!

DrRSatzteil · March 19, 2023, 4:47pm

Great that you got it working!

One other thing I would recommend changing in the standard docker-compose file: make sure to add a hostname property to the rabbitmq container (e.g. hostname: rabbit).

If you don’t set the hostname explicitly the container will use a random one every time it is created. The problem is that rabbitmq saves its data in a folder named after the hostname and thus is not able to restore the persistent storage when the container is recreated.
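The effect can be illustrated with a toy model. It assumes the default layout of the official RabbitMQ image, where the node is named rabbit@<hostname> and stores its data in a directory of that name, and it simulates Docker's default hostname with a fresh 12-character hex string per container.

```python
import secrets

# Default data location in the official image (assumption).
MNESIA_BASE = "/var/lib/rabbitmq/mnesia"


def data_dir(hostname):
    """Directory a RabbitMQ node with this hostname would use for its data."""
    return f"{MNESIA_BASE}/rabbit@{hostname}"


def docker_default_hostname():
    """Stand-in for Docker's default: a new short container ID each time."""
    return secrets.token_hex(6)


# Without a fixed hostname, every recreated container looks in a new directory.
print(data_dir(docker_default_hostname()))
print(data_dir(docker_default_hostname()))
# With "hostname: rabbit" the path stays the same across recreations.
print(data_dir("rabbit"))
```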

MrSaltman · March 20, 2023, 9:00am

@DrRSatzteil

Sorry that I have to get back to you this fast, but I am having another concern.

So with Elasticsearch running, I reckon that I need certain indices that cover my documents, so I will be able to find them. Sadly, my indexing knowledge is rather minimal as well.

The default index, which is “creation_date” (but only Year & Month) does not really cover my case, since I initially used a tool to create a cabinet structure and upload the documents to Mayan via the REST API. That worked great, but it also means, that basically every document (except those uploaded since the initial creation) has more or less the same Creation_date (Year, month & day - only the time would differ, but probably only by minutes or seconds).

I am currently looking at ~30k+ documents which all use the same document_type & do not use any Metadata types or tags (for the most part).

I tried using an index on the document.label (and sliced it for the five first characters), but it does not provide me with the result I was hoping for. Still not able to find many documents with the search.
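For illustration, this is roughly what grouping by the first five characters of the label does to a set of documents. The labels are made up; the point is only that such an index sorts documents into buckets by prefix and says nothing about the rest of the label.

```python
from collections import Counter


def prefix_buckets(labels, length=5):
    """Count how many labels fall into each prefix bucket."""
    return Counter(label[:length] for label in labels)


# Hypothetical document labels.
labels = ["INV-2021-0001.pdf", "INV-2021-0002.pdf", "CONTRACT-A.pdf"]
print(prefix_buckets(labels))
```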

Have you got any ideas or tips? You might have had some similar issues in the past.

Yours,
Alex

DrRSatzteil · March 20, 2023, 9:51am

This is where it starts to get tricky and probably also beyond the point of what Mayan can do for you. If your file names don’t give you enough information for your indexes you might need to use some external tools to add some metadata to your documents.

You may want to have a look at mayan-automatic-metadata: GitHub - m42e/mayan-automatic-metadata: A (to be) framework for automatic, external processing of mayan documents for assigning tags and metadata

It’s a simple but very helpful framework to add some metadata and/or tags to your documents based on the OCR contents of your files and regular expressions.

MrSaltman · March 20, 2023, 10:07am

Looks interesting. A little frustrating that it won't be as simple as I hoped. Although my indexes cover all my documents, the results are just not it. Especially since the only real value I want to base the search result on would be the labels of the documents, nothing else.

It seems that for my use case, the old (but gold) DjangoSearchBackend does the job best. It definitely is slower, but it seems to be the only backend to actually give me the results I am looking for (documents with the given label, and OCR as well).

It seems that Mayan is especially having problems with the long names/full names of (my) documents. I might get the document I am looking for by searching for a part of its name, but searching for the document by its whole name will not return anything at all.

Still, thanks for your help

DrRSatzteil · March 20, 2023, 10:35am

I’m not sure I really understand what you’re trying to achieve. Just note that the index function in mayan is (afaik) not at all related to the search backend.

Also make sure that the elasticsearch indexing is completed before you make any assumptions about the search quality (it helps to have a look at the rabbitmq frontend to see how many tasks are still queued). It may take some time to index 30k+ documents.
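The queue lengths shown in the RabbitMQ frontend are also available as JSON from the management API, which makes it easy to wait until indexing has drained. A sketch, assuming the management plugin is enabled and its /api/queues endpoint is reachable (typically on port 15672); the queue names in the sample are hypothetical.

```python
import json


def pending_tasks(queues_json):
    """Map queue name to message count for every non-empty queue."""
    queues = json.loads(queues_json)
    return {
        queue["name"]: queue.get("messages", 0)
        for queue in queues
        if queue.get("messages", 0)
    }


# Hypothetical excerpt of a /api/queues response.
sample = '[{"name": "search", "messages": 1200}, {"name": "ocr", "messages": 0}]'
print(pending_tasks(sample))
```

An empty result means no tasks are waiting, which is a reasonable moment to judge search quality.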