Version 3.5 with RabbitMQ = Hundreds of Processes, OOM Killer

bernroth
Posts: 9
Joined: Mon Oct 05, 2020 4:10 am

Version 3.5 with RabbitMQ = Hundreds of Processes, OOM Killer

Post by bernroth »

Dear Mayan EDMS Team,

For my first post, I would like to thank everyone for the great work and the continuous development effort that goes into Mayan EDMS.

I'm using docker-compose and upgraded my installation from 3.4.17 to 3.5 a few days ago.
The server is a VM running Ubuntu 20.04 LTS (8 CPU cores, 16GB RAM).

We have around 17,000 documents with 445,000 pages in the system.

After the upgrade, I enabled the Whoosh search engine (THANKS, finally a quick full-text search :D ) and recreated the index.
The Docker image was restarted.

While continuously observing the system, I noticed that many LibreOffice processes (soffice.bin) were being started.
The system load increased heavily to > 150 and eventually the kernel OOM killer was invoked.
Even after upgrading the VM to 64 CPU cores and 128GB RAM, the system went bad again (OOM killer, very slow) after some time.
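
For reference, this is roughly how I watched it on the Docker host; nothing Mayan-specific, just standard tooling:

Code: Select all

# Count the soffice.bin processes and list the biggest memory consumers
pgrep -c soffice.bin
ps -C soffice.bin -o pid,rss,etime --sort=-rss | head -n 20
# Current system load
uptime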

I wonder what could possibly be going wrong here.

I'd expect the number of processes to be limited, e.g. by the CPU count.

To temporarily fix the issue, I disabled RabbitMQ as the broker and reverted to Redis.

Is this a known issue? If not, how can I contribute to help fix the problem?

Best regards,
Bernhard
rosarior
Developer
Posts: 624
Joined: Tue Aug 21, 2018 3:28 am
Location: Puerto Rico

Re: Version 3.5 with RabbitMQ = Hundreds of Processes, OOM Killer

Post by rosarior »

Thanks for the report.

This is not a known issue. We have not seen this behavior on any other installation so far.
bernroth wrote: I'd expect the number of processes to be limited, e.g. by the CPU count.
In a direct deployment, supervisord sets the concurrency of every worker to 1.

Under Docker, supervisord leaves the concurrency at the default and only limits the slow worker. This worker handles OCR and other heavy, long-lived tasks, and runs only one process at a time.

A worker left at the default concurrency launches as many processes as there are CPU cores. This means that as many as 8 LibreOffice instances could be launched at a time: 4 for the preview engine handled by the fast worker and 4 for content extraction handled by the medium worker.
From the Celery documentation:

-c, --concurrency <concurrency>

Number of child processes processing the queue. The default is the number of CPUs available on your system.
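
To illustrate the difference, here are two invocations. These are illustrative only, not the exact supervisord entries the image generates; the app module and queue names are examples:

Code: Select all

# Illustrative only; app module and queue names are examples.
# Default: Celery forks one child process per CPU core.
celery worker -A mayan -Q converter
# Slow-worker style: pinned to a single child process.
celery worker -A mayan -Q ocr --concurrency=1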
This logic has remained the same since version 3.4, which means the issue is somewhere else.

For version 3.5, Celery and LibreOffice were updated, and the headless variant of LibreOffice is now used. This could be a case where LibreOffice instances are not being killed after they finish, or where the new LibreOffice version and variant consumes more memory than the previous one.
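
A quick way to test the first hypothesis: once the queues are idle, check the process lifetimes; long-lived leftovers would point to instances that were never cleaned up. A sketch, assuming the compose service is named app and procps is available in the image:

Code: Select all

# After the queues drain, any old soffice.bin still running is a leftover
docker-compose exec app ps -C soffice.bin -o pid,etime,rss,args --sort=-etime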

This is where we will start the diagnosis.
bernroth wrote: To temporarily fix the issue, I disabled RabbitMQ as the broker and reverted to Redis.
This is the part that is confusing. Changing the message broker should have no impact on the system load or on the number of LibreOffice instances launched. The background processes are handled by Celery, and the concurrency logic is not tied to the broker being used.
bernroth wrote: Is this a known issue? If not, how can I contribute to help fix the problem?
Yes, please.

1. How many instances of LibreOffice were launched when the VM had 8 CPU cores assigned?
2. Was the spike in LibreOffice instances caused by the search engine reindexing, or did they build up gradually from normal usage?

Thanks for the report. I'm glad version 3.5 has features you were looking forward to!
bernroth
Posts: 9
Joined: Mon Oct 05, 2020 4:10 am

Re: Version 3.5 with RabbitMQ = Hundreds of Processes, OOM Killer

Post by bernroth »

Thanks for your quick response!

Attached is a screenshot of the process list, now with 8 cores and 16GB RAM.

Interesting: the OOM is to be expected, as each LibreOffice process consumes 1-3GB of RAM.
Update: the OOM just happened :( Trying again now with 32GB of RAM.

Below is my docker-compose configuration.

When Redis is enabled, I don't see those LibreOffice processes.
Once RabbitMQ is enabled, the system starts to spin up the processes and gets very busy.

With Redis, the system load was around 0.5 when I arrived at work this morning.

Code: Select all

version: '3.7'

networks:
  bridge:
    driver: bridge

services:
  app:
    depends_on:
      - postgresql
      - redis
      # Enable to use RabbitMQ
      - rabbitmq
    environment: &mayan_env
      # Enable to use RabbitMQ
      MAYAN_CELERY_BROKER_URL: amqp://${MAYAN_RABBITMQ_USER:-mayan}:${MAYAN_RABBITMQ_PASSWORD:-mayanrabbitpass}@rabbitmq:5672/${MAYAN_RABBITMQ_VHOST:-mayan}
      # To use RabbitMQ as broker, disable Redis as broker
#      MAYAN_CELERY_BROKER_URL: redis://:${MAYAN_REDIS_PASSWORD:-mayanredispassword}@redis:6379/0
      MAYAN_CELERY_RESULT_BACKEND: redis://:${MAYAN_REDIS_PASSWORD:-mayanredispassword}@redis:6379/1
      MAYAN_DATABASES: "{'default':{'ENGINE':'django.db.backends.postgresql','NAME':'${MAYAN_DATABASE_DB:-mayan}','PASSWORD':'${MAYAN_DATABASE_PASSWORD:-mayandbpass}','USER':'${MAYAN_DATABASE_USER:-mayan}','HOST':'postgresql'}}"
#      MAYAN_DOCKER_WAIT: "postgresql:5432 redis:6379"
      # Replace with the line below when using RabbitMQ
      MAYAN_DOCKER_WAIT: "postgresql:5432 redis:6379 rabbitmq:5672"
      # To add operating system packages, like additional OCR language
      # packages, put them in the variable below.
      # MAYAN_APT_INSTALLS: "tesseract-ocr-deu tesseract-ocr-nld"
      MAYAN_APT_INSTALLS: "tesseract-ocr-eng tesseract-ocr-deu tesseract-ocr-spa"
      # To add Python libraries, like LDAP, put them in the variable below.
      # MAYAN_PIP_INSTALLS: "python-ldap"
    image: mayanedms/mayanedms:3.5
    networks:
      - bridge
    ports:
      - "8080:8000"
    restart: unless-stopped
    volumes:
      - ${MAYAN_APP_VOLUME:-app}:/var/lib/mayan
      # Optional volumes to access external data like staging or watch folders
      # - /opt/staging_files:/staging_files
      # - /opt/watch_folder:/watch_folder
      - /data/scan/dms/itk-security:/dms_scan_itks
      - /data/scan/dms/roth-itk:/dms_scan_ritk
  postgresql:
    environment:
      POSTGRES_DB: ${MAYAN_DATABASE_DB:-mayan}
      POSTGRES_PASSWORD: ${MAYAN_DATABASE_PASSWORD:-mayandbpass}
      POSTGRES_USER: ${MAYAN_DATABASE_USER:-mayan}
    image: postgres:9.6-alpine
    networks:
      - bridge
    restart: unless-stopped
    volumes:
      - ${MAYAN_POSTGRES_VOLUME:-postgres}:/var/lib/postgresql/data

  redis:
    command:
      - redis-server
      - --appendonly
      - "no"
      - --databases
      - "2"
      - --maxmemory
      - "100mb"
      - --maxclients
      - "500"
      - --maxmemory-policy
      - "allkeys-lru"
      - --save
      - ""
      - --tcp-backlog
      - "256"
      - --requirepass
      - "${MAYAN_REDIS_PASSWORD:-mayanredispassword}"
    image: redis:5.0-alpine
    networks:
      - bridge
    restart: unless-stopped
    volumes:
      - ${MAYAN_REDIS_VOLUME:-redis}:/data

  # Optional services

  # celery_flower:
  #   command:
  #     - run_celery
  #     - flower
  #   depends_on:
  #     - postgresql
  #     - redis
  #     # Enable to use RabbitMQ
  #     # - rabbitmq
  #   environment:
  #     <<: *mayan_env
  #   image: mayanedms/mayanedms:3
  #   networks:
  #     - bridge
  #   ports:
  #     - "5555:5555"
  #   restart: unless-stopped

  # Enable to use RabbitMQ
  rabbitmq:
    image: rabbitmq:3.8-alpine
    environment:
      RABBITMQ_DEFAULT_USER: ${MAYAN_RABBITMQ_USER:-mayan}
      RABBITMQ_DEFAULT_PASS: ${MAYAN_RABBITMQ_PASSWORD:-mayanrabbitpass}
      RABBITMQ_DEFAULT_VHOST: ${MAYAN_RABBITMQ_VHOST:-mayan}
    networks:
      - bridge
    restart: unless-stopped
    volumes:
       - ${MAYAN_RABBITMQ_VOLUME:-rabbitmq}:/var/lib/rabbitmq

  # Enable to run standalone workers
  # worker_fast:
  #   command:
  #     - run_worker
  #     - fast
  #   depends_on:
  #     - postgresql
  #     - redis
  #     # Enable to use RabbitMQ
  #     # - rabbitmq
  #   environment:
  #     <<: *mayan_env
  #   image: mayanedms/mayanedms:3
  #   networks:
  #     - bridge
  #   restart: unless-stopped
  #   volumes:
  #     - ${MAYAN_APP_VOLUME:-app}:/var/lib/mayan

  # Enable to run frontend gunicorn
  # frontend:
  #   command:
  #     - run_frontend
  #   depends_on:
  #     - postgresql
  #     - redis
  #     # Enable to use RabbitMQ
  #     # - rabbitmq
  #   environment:
  #     <<: *mayan_env
  #   image: mayanedms/mayanedms:3
  #   networks:
  #     - bridge
  #   ports:
  #     - "81:8000"
  #   restart: unless-stopped
  #   volumes:
  #     - ${MAYAN_APP_VOLUME:-app}:/var/lib/mayan

volumes:
  app:
  postgres:
  rabbitmq:
  redis:
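
To make sure the app container really picks up the intended broker after switching, I check its environment like this (a sketch; the service name app is from the file above):

Code: Select all

# Show which broker URL the app container is actually running with
docker-compose exec app env | grep MAYAN_CELERY_BROKER_URL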
Attachment: Screenshot at 2020-10-05 10-05-27.png (453.76 KiB)
bernroth
Posts: 9
Joined: Mon Oct 05, 2020 4:10 am

Re: Version 3.5 with RabbitMQ = Hundreds of Processes, OOM Killer

Post by bernroth »

This is the process list now.
3.4GB per LibreOffice process is surprisingly high.
Attachment: Screenshot at 2020-10-05 10-12-40.png (145.12 KiB)
bernroth
Posts: 9
Joined: Mon Oct 05, 2020 4:10 am

Re: Version 3.5 with RabbitMQ = Hundreds of Processes, OOM Killer

Post by bernroth »

I had to revert to Redis, as some documents did not show up in the web GUI.

Now everything looks fine; the load is only 0.4.

No soffice.bin processes.

FYI
michael
Developer
Posts: 78
Joined: Sun Apr 19, 2020 6:21 am

Re: Version 3.5 with RabbitMQ = Hundreds of Processes, OOM Killer

Post by michael »

We've been scratching our heads over this one. Not being able to reproduce it slows things down. We also don't have a theory yet about why, in your case, it only manifests when using RabbitMQ.

We'll continue to investigate.
bernroth
Posts: 9
Joined: Mon Oct 05, 2020 4:10 am

Re: Version 3.5 with RabbitMQ = Hundreds of Processes, OOM Killer

Post by bernroth »

Thanks for your feedback.

I performed a test with 3.5.1 and RabbitMQ.
A few seconds after starting the docker-compose file, I see many of the following processes appear:

Code: Select all

/usr/lib/libreoffice/program/soffice.bin -env:UserInstallation=file:///tmp/tmp4h3xt86o/LibreOffice_Conversion --headless --convert-to pdf:writer_pdf_Export /tmp/tmp0valrl99 --outdir /tmp --infilter=Text (encoded):UTF8,LF,,,
Do we know what this process is doing? Which part of the software might trigger these?

Could RabbitMQ, when enabled with Mayan, be processing some kind of outstanding queue?
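
In the meantime, to see what a single conversion like this costs in isolation, one can be run and timed by hand (a sketch; it assumes GNU time is installed at /usr/bin/time, and /tmp/sample.txt stands in for any text file):

Code: Select all

# Run one headless conversion manually and report its peak memory (GNU time -v)
/usr/bin/time -v /usr/lib/libreoffice/program/soffice.bin \
  -env:UserInstallation=file:///tmp/lo_test --headless \
  --convert-to pdf /tmp/sample.txt --outdir /tmp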

Please let me know if I can provide more details, thanks!
bernroth
Posts: 9
Joined: Mon Oct 05, 2020 4:10 am

Re: Version 3.5 with RabbitMQ = Hundreds of Processes, OOM Killer

Post by bernroth »

Dear all,

just a quick update on that issue:

With the newly released version 3.5.4, the problem still occurs when using docker-compose with RabbitMQ.

On the other hand, with Redis we are running the system in production with no problems.

Does RabbitMQ maybe have some kind of persistent processing queue? Do I need to flush it?

In any case, soffice.bin consumes huge amounts of memory when RabbitMQ is active.
When using Redis, I see soffice.bin processes as well, but their memory footprint is roughly ten times smaller.

Well, just to share some observations.

Is there anything I can do to help identify the root cause?

Best regards,
Bernhard
rosarior
Developer
Posts: 624
Joined: Tue Aug 21, 2018 3:28 am
Location: Puerto Rico

Re: Version 3.5 with RabbitMQ = Hundreds of Processes, OOM Killer

Post by rosarior »

Thanks for the update.

We assigned this to ticket #891 (https://gitlab.com/mayan-edms/mayan-edms/-/issues/891).

We have not made any changes regarding this, because the code that calls LibreOffice is the same regardless of the broker used. The broker only changes how messages are sent to coordinate the tasks; the task execution itself is unchanged in terms of code and calling convention at run time. From the point of view of the task scheduler code, I can only hypothesize that such an artifact could be caused by Celery taking a different subprocess-spawning code path when RabbitMQ is used. Confirming this would require auditing the Celery code, which is time-consuming and not likely to happen during our current development cycle.
bernroth wrote: Does RabbitMQ maybe have some kind of persistent processing queue? Do I need to flush it?
The queues should spike during use, but as tasks are completed the number of pending messages should shrink until the queues are empty. Queues that remain full indicate either that more messages are arriving than the workers can take on, which is a capacity issue that calls for scaling up, or that an infrastructure problem causing network latency or slow disk I/O is preventing RabbitMQ from working efficiently. Unlike Redis, RabbitMQ is persistent by default, so disk I/O factors into its performance. It uses the disk effectively, but it is still a connection worth examining.
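
A sketch for checking whether your queues are actually draining, using the service and vhost names from your compose file:

Code: Select all

# Pending, ready, and unacknowledged messages per queue in the 'mayan' vhost
docker-compose exec rabbitmq rabbitmqctl list_queues -p mayan \
  name messages messages_ready messages_unacknowledged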

There are also a couple of observations in the GitLab ticket:

RabbitMQ uses distinctive memory management based on a percentage of total system memory. If the VM software has distinctive memory sharing and allocation behavior (like shared host RAM), the memory reported to the guest OS could be incorrect, or dynamic enough to confuse RabbitMQ.
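
If the hypervisor is misreporting memory, pinning RabbitMQ's memory threshold to an absolute value instead of the default fraction of detected RAM is one thing worth testing (a sketch; the 2GB figure is only an example, and this runtime override resets when the broker restarts):

Code: Select all

# Override the detected-RAM-based threshold (default is 0.4 of system memory)
docker-compose exec rabbitmq rabbitmqctl set_vm_memory_high_watermark absolute "2GB"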

As mentioned in the ticket, we have deployed Mayan EDMS on IBM Cloud's Kubernetes using the IBM-provided RabbitMQ service; it powers installations with millions of pages on just a shared vCPU, 2GB of RAM, and 1GB of disk for RabbitMQ.

One thing we did notice during one installation is that when the hosted RabbitMQ service ran out of disk space, the service became unresponsive and messages accumulated to the point that not even the API was accessible to delete them; we had to shut down and reprovision the service. It seems RabbitMQ uses its own form of virtual memory.
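
Whether the broker has tripped one of its memory or disk alarms can be checked directly (a sketch; rabbitmq-diagnostics ships with the 3.8 image):

Code: Select all

# Report any memory or disk alarms currently in effect on the broker node
docker-compose exec rabbitmq rabbitmq-diagnostics alarms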

RabbitMQ behaves like a relational database in terms of memory management. It has its own memory metrics, knobs, and garbage collection. There might be something in your installation (VM host, VM software, VM setup, hardware, workload, document composition) that requires specialized adjustments for RabbitMQ to work as expected.

In order to find the cause, we need to be able to replicate this in a test environment that we control, one we can break and automate so we can iterate over all scenarios many times very quickly. If you can think of anything that makes your installation unique and that could help us replicate it in a VM or a container, that would open a way to find the root cause.
bernroth
Posts: 9
Joined: Mon Oct 05, 2020 4:10 am

Re: Version 3.5 with RabbitMQ = Hundreds of Processes, OOM Killer

Post by bernroth »

Thanks a lot for your quick and extensive reply!

The more I think about the issue and observe the system's behavior, the more I think this could be less related to the broker and more to the way the processes are started.

With Redis I often see N soffice.bin processes (N = number of CPU cores) with a reasonable, interestingly low memory footprint.
With RabbitMQ the behavior is the same, but each process consumes far more memory than with Redis.

Are the processes started in a different manner? I remember that Linux has some memory-consumption optimizations, e.g. the memory for shared libraries is loaded only once and then shared between processes.
Is this mechanism disabled when RabbitMQ is used, with each process started in a separate environment?
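
If it helps, here is how I would compare the two views of memory: RSS counts shared pages once per process, while PSS divides them among the processes sharing them, so a large RSS/PSS gap would mean the memory is mostly shared. A sketch; it needs a reasonably recent kernel for smaps_rollup and root to read other users' processes:

Code: Select all

# RSS vs PSS per soffice.bin process; PSS apportions shared pages
for pid in $(pgrep soffice.bin); do
  awk -v pid="$pid" '/^Rss:/ {r+=$2} /^Pss:/ {p+=$2}
      END {printf "pid %s: RSS %d KiB, PSS %d KiB\n", pid, r, p}' \
      "/proc/$pid/smaps_rollup"
done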