App not starting because of LockError

My Mayan instance only serves requests for a very short time before the web server shuts down with the following message:

gunicorn.errors.HaltServer: <HaltServer 'Worker failed to boot.' 3>
2024-01-09 10:40:09,519 WARN exited: mayan-edms-gunicorn (exit status 1; not expected)

This seems to be caused by the workers not being able to initialize the locking backend.

When I configure the application to use the FileLock backend, this is the error message I’m getting from each of the workers:

  File "/opt/mayan-edms/lib/python3.11/site-packages/mayan/apps/lock_manager/backends/file_lock.py", line 87, in _init
    raise LockError
mayan.apps.lock_manager.exceptions.LockError

The following is the relevant part of file_lock.py, around line 87:

                # Someone already got this lock, check to see if it is expired.
                if file_locks[name]['expiration'] and time.time() > file_locks[name]['expiration']:
                    # It expires and has expired, we re-acquired it.
                    file_locks[name] = self._get_lock_dictionary()
                else:
                    lock.release()
                    raise LockError

So there is a lock with the given name, and either it has no expiration at all (the expiration is None) or the expiration lies in the future.
Since this happens in all the worker processes, I’m assuming that this is either something in the file system or a race condition, but that’s really hard to debug without knowing the code base very well.
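
To make my reading of that check concrete, here is a small self-contained sketch; the lock table below is made up for illustration, and only the condition mirrors the quoted code:

import time

# Hypothetical in-memory lock table mirroring the structure the quoted
# file_lock.py code appears to use: {name: {'expiration': float or None}}.
file_locks = {
    'stale': {'expiration': time.time() - 60},   # expired a minute ago
    'held': {'expiration': time.time() + 60},    # still valid for a minute
    'forever': {'expiration': None},             # no expiration at all
}

def can_reacquire(name):
    # Re-acquisition succeeds only if the entry exists, has an expiration,
    # and that expiration already lies in the past.
    entry = file_locks[name]
    return bool(entry['expiration']) and time.time() > entry['expiration']

print(can_reacquire('stale'))    # True  -> the lock is taken over
print(can_reacquire('held'))     # False -> LockError in the quoted code
print(can_reacquire('forever'))  # False -> LockError in the quoted code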

If I activate the redis locking backend, this is what I get instead of the above error:

  File "/opt/mayan-edms/lib/python3.11/site-packages/mayan/apps/lock_manager/backends/redis_lock.py", line 73, in _init
    raise LockError

The relevant part of redis_lock.py reads:

        if _redis_lock_instance.acquire(blocking=False):
            self._redis_lock_instance = _redis_lock_instance
        else:
            raise LockError

So, again, a lock cannot be acquired.
Since this is logically the same for both backends, with completely different locking mechanisms, I’m assuming the root cause is indeed a race condition.
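
For reference, non-blocking acquisition of a named Redis lock behaves like this with plain redis-py; the connection details are made up and only serve to illustrate the branch that raises LockError:

import redis

# Assumes a Redis server on localhost:6379; these details are for
# illustration only and are not Mayan's actual settings.
client = redis.Redis(host='localhost', port=6379)

first = client.lock('example-lock', timeout=30)
second = client.lock('example-lock', timeout=30)

print(first.acquire(blocking=False))   # True: nobody holds the lock yet
print(second.acquire(blocking=False))  # False: the name is already held;
                                       # this is the branch that raises
                                       # LockError in the quoted backend
first.release()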

I’m starting Mayan using docker-compose (docker compose up -d) with the following .env file:

COMPOSE_PROJECT_NAME=mayan
COMPOSE_PROFILES=all_in_one,postgresql,rabbitmq,redis,elasticsearch
MAYAN_DOCKER_IMAGE_NAME=mayanedms
MAYAN_DOCKER_IMAGE_TAG=v4.5.5
MAYAN_FRONTEND_HTTP_PORT=8002
MAYAN_WORKER_CUSTOM_QUEUE_LIST=
MAYAN_DOCKER_WAIT="postgresql:5432 rabbitmq:5672 redis:6379"
MAYAN_TRAEFIK_LETS_ENCRYPT_EMAIL=
MAYAN_TRAEFIK_EXTERNAL_DOMAIN=
MAYAN_TRAEFIK_DASHBOARD_ENABLE=false
MAYAN_TRAEFIK_DASHBOARD_AUTHENTICATION=''
MAYAN_TRAEFIK_FRONTEND_ENABLE=false
MAYAN_TRAEFIK_RABBITMQ_ENABLE=false
MAYAN_TRAEFIK_LETS_ENCRYPT_DNS_CHALLENGE_PROVIDER=
MAYAN_LOCK_MANAGER_BACKEND=mayan.apps.lock_manager.backends.file_lock.FileLock

The locking behavior you are describing is secondary. The main issue to resolve first is why gunicorn workers are failing to boot with an exit status of 1.

Resolve that issue first before attempting to work on the locking system.

Ah, OK, thank you.
I thought the gunicorn workers were failing because the locking backend couldn’t be initialized, since the gunicorn error appears much later in the log than the locking failure.
I’ll see what I can find out about the gunicorn worker failure.

OK, I did find another problem. This is the first thing in the log that looks like anything is amiss:

mayan.apps.backends.model_mixins <20> [ERROR] "get_backend_class() line 50 ImportError while importing backend: mayan.apps.sources.source_backends.SourceBackendWebForm; Module "mayan.apps.sources.source_backends" does not define a "SourceBackendWebForm" attribute/class"

This looks similar to issue 1153, but there, the entire package source_backends was missing.

It’s also an error Obelix1981 had in their log in this forum thread, but I can’t tell whether that was what broke the deployment for them.
I also can’t delete any sources, as suggested by ssf later in that thread, possibly as a fix for the above error(?), because the application won’t start.
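
For what it’s worth, the wording of the error reads like the usual dotted-path import pattern failing at the final attribute lookup; here is a generic sketch of that pattern (an assumption for illustration, not Mayan’s actual code):

from importlib import import_module

def get_backend_class(dotted_path):
    # Generic dotted-path loader: import the module part, then look up the
    # class attribute. A missing attribute produces exactly the kind of
    # message seen in the log above.
    module_path, _, class_name = dotted_path.rpartition('.')
    module = import_module(module_path)
    try:
        return getattr(module, class_name)
    except AttributeError:
        raise ImportError(
            'Module "{}" does not define a "{}" attribute/class'.format(
                module_path, class_name
            )
        )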

I thought the gunicorn workers were failing because the locking backend couldn’t be initialized, since the gunicorn error appears much later in the log than the locking failure.

Yes, that means that the issue is not the Mayan LockManager. Mayan uses its own lock system, separate from gunicorn’s lock system for frontend workers.

Mayan’s lock system runs a test at startup, and if it fails the startup is interrupted because the system won’t be able to work properly. If your system is able to start up, then the issue lies in another component.
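
Roughly speaking, such a self-test amounts to acquiring and releasing a throwaway lock and aborting startup if that fails; the toy backend below exists only to keep the sketch self-contained and is not Mayan’s actual lock manager API:

import uuid

class InMemoryLockBackend:
    # Toy stand-in for a locking backend, used only to make this sketch
    # runnable; not Mayan's actual API.
    def __init__(self):
        self._held = set()

    def acquire_lock(self, name):
        if name in self._held:
            raise RuntimeError('lock already held')
        self._held.add(name)

    def release_lock(self, name):
        self._held.discard(name)

def check_locking_backend(backend):
    # Startup self-test: acquire a throwaway lock and release it again.
    # If this raises, the backend is unusable and startup should abort.
    probe = 'startup-probe-{}'.format(uuid.uuid4())
    backend.acquire_lock(probe)
    backend.release_lock(probe)

check_locking_backend(InMemoryLockBackend())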

The issue with the source paths was fixed in the bugfix releases 4.5.1 and 4.5.2. Issues with the source paths don’t stop the startup process.

https://docs.mayan-edms.com/releases/4.5.1.html

gunicorn.errors.HaltServer: <HaltServer 'Worker failed to boot.' 3>

gunicorn is not part of Mayan but a third-party library. This means that something in your setup is affecting Mayan as well as gunicorn.

Please use the default installation, configured as instructed in the documentation, to replicate the issue before making configuration changes that may themselves be causing it.

Thank you.

I think I’ve got it. I didn’t add elasticsearch to the list of services to wait for in the .env file. Instead of

MAYAN_DOCKER_WAIT="postgresql:5432 rabbitmq:5672 redis:6379 elasticsearch:9200"

I only had

MAYAN_DOCKER_WAIT="postgresql:5432 rabbitmq:5672 redis:6379"

I’m not sure why Mayan crashes (without an error message specific to elasticsearch) if ES isn’t online before Mayan starts, but it’s repeatable on my system.
Might be an actual dependency or some sort of competition for resources (either between components or for compute resources).

Mayan is starting now and everything seems stable so far.

Quick follow-up, though: I still have that source path error in my logs, although I’m on Mayan 4.5.6. Could this have something to do with having migrated my system from a (much) earlier version?

I think I’ve got it. I didn’t add elasticsearch to the list of services to wait for in the .env file. Instead of

MAYAN_DOCKER_WAIT is a convenience feature to avoid extended retries, but not using it, or not listing a service there, does not stop the system from booting up.

By default ElasticSearch is not enabled. There is no need to add it to MAYAN_DOCKER_WAIT unless you updated your search backend to use ElasticSearch, which you did not mention.
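
For context, a wait step like this typically just retries TCP connections to each host:port pair until they answer. The Python sketch below illustrates the idea; the parsing and timing here are assumptions for illustration and are not Mayan’s actual wait mechanism:

import socket
import time

def wait_for_services(spec, retry_delay=1.0):
    # spec is a space-separated list such as
    # "postgresql:5432 rabbitmq:5672 redis:6379".
    for item in spec.split():
        host, port = item.rsplit(':', 1)
        while True:
            try:
                with socket.create_connection((host, int(port)), timeout=2):
                    break  # service is reachable, move on to the next one
            except OSError:
                time.sleep(retry_delay)

# Example: wait_for_services("postgresql:5432 rabbitmq:5672 redis:6379")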

Might be an actual dependency or some sort of competition for resources (either between components or for compute resources).

All dependencies are included in the Docker image and tested to work properly as part of the CI/CD release pipeline. Any error in the code tests or dependencies stops the build and blocks the release of a version.

System resource contention could happen if you don’t have the necessary minimum of 4 GB of memory and at least 2 CPU cores available.

Object resource contention is the actual reason a specialized distributed lock manager was created for Mayan. It can orchestrate resources even across different hosts, networks, or geographic regions.

I’m not sure why Mayan crashes (without an error message specific to elasticsearch) if ES isn’t online before Mayan starts, but it’s repeatable on my system.

This happens because you made changes which you did not disclose (ElasticSearch, frontend settings, possibly custom networking).

For example, a custom internal Docker network port for the HTTP frontend:

MAYAN_FRONTEND_HTTP_PORT=8002

Please use the default installation with default settings to replicate issues. If you make changes, please disclose them fully so that we can better help you.

Well, I did post my .env file in my original post, and it says:

COMPOSE_PROFILES=all_in_one,postgresql,rabbitmq,redis,elasticsearch

So yeah, I did disclose this.

And, what can I say, the application starts when I ask it to wait for elasticsearch, and it doesn’t when I don’t. Repeatedly.

And, as I wrote in my last post, I did start from the vanilla deployment configuration to replicate the problem, but since the vanilla setup isn’t what I need (I need elasticsearch), I changed the configuration one step at a time until the problem reappeared.

Case closed, as far as I’m concerned.

Thank you.
