Document content empty with no errors

pintariching
Posts: 3
Joined: Mon Jan 24, 2022 11:48 am

Post by pintariching »

I have problems getting OCR to work. I'm using a fresh docker-compose installation with Portainer; however, the same problem existed without Portainer.
My docker-compose file:

services:
  mayan-app:
    image: mayanedms/mayanedms:v4.1.4
    container_name: mayan-app
    restart: unless-stopped
    depends_on:
      - mayan-postgres
      - mayan-redis
    environment:
      MAYAN_CELERY_BROKER_URL: redis://mayan-redis:6379
      MAYAN_CELERY_RESULT_BACKEND: redis://mayan-redis:6379
      MAYAN_DATABASES: "{'default':{'ENGINE':'django.db.backends.postgresql','NAME':'mayan','PASSWORD':'mayan','USER':'mayan','HOST':'mayan-postgres'}}"
      MAYAN_DOCKER_WAIT: "mayan-postgres:5432 mayan-redis:6379"
      MAYAN_LOCK_MANAGER_BACKEND: mayan.apps.lock_manager.backends.redis_lock.RedisLock
      MAYAN_LOCK_MANAGER_BACKEND_ARGUMENTS: "{'redis_url':'redis://mayan-redis:6379'}"
    volumes:
      - /mnt/storage/mayan/mayan:/var/lib/mayan
    ports:
      - 80:8000
    networks:
      - mayan-net

  mayan-postgres:
    image: postgres
    restart: unless-stopped
    container_name: mayan-postgres
    volumes:
      - /mnt/storage/mayan/postgres:/var/lib/postgresql/data
    environment:
      POSTGRES_DB: mayan
      POSTGRES_USER: mayan
      POSTGRES_PASSWORD: mayan
    ports:
      - 5432:5432
    networks:
      - mayan-net

  mayan-redis:
    image: redis
    container_name: mayan-redis
    restart: unless-stopped
    networks:
      - mayan-net
    command:
      - redis-server
      - --appendonly
      - "no"
      - --databases
      - "3"
      - --maxmemory
      - "100mb"
      - --maxclients
      - "500"
      - --maxmemory-policy
      - "allkeys-lru"
      - --save
      - ""
      - --tcp-backlog
      - "256"


networks:
  mayan-net:
    name: mayan-net
    driver: bridge

After starting the stack and uploading a PDF, the document content is empty and I get no errors in the UI or in the docker logs. If I upload a PNG, the story is the same.
But running tesseract manually inside the container on a PNG image gives me content without a problem.

I'm not sure where to look.

EDIT:
After looking around I have found that certain PDFs get content and some don't.
Files I've downloaded from the internet work fine, but scanned documents from my printer don't for some reason. I think it has to do with the printer doing OCR on its own, so I've disabled that option and will see what happens.
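
The split described above (downloaded PDFs get content, scanned ones don't) may come down to whether the file already carries an embedded text layer. Below is a minimal Python sketch for checking that. It is only a heuristic and not something Mayan itself ships: it assumes that a PDF with extractable text references at least one `/Font` resource, while an image-only scan usually does not. The function names are hypothetical, and an encrypted or unusually encoded PDF can fool it.

```python
import re
import zlib
from pathlib import Path

# Assumption: a PDF with a text layer references a /Font resource somewhere.
# \b keeps /FontDescriptor and /FontFile from counting as matches.
FONT_MARKER = re.compile(rb"/Font\b")
STREAM = re.compile(rb"stream\r?\n(.*?)\r?\nendstream", re.DOTALL)


def pdf_probably_has_text_layer(data: bytes) -> bool:
    """Heuristically report whether raw PDF bytes reference a font."""
    if FONT_MARKER.search(data):
        return True
    # Newer PDFs may keep object dictionaries inside Flate-compressed
    # object streams, so try to inflate every stream and look again.
    for match in STREAM.finditer(data):
        try:
            inflated = zlib.decompress(match.group(1))
        except zlib.error:
            continue  # not Flate data (for example a JPEG image stream)
        if FONT_MARKER.search(inflated):
            return True
    return False


def file_probably_has_text_layer(path: str) -> bool:
    """Convenience wrapper that reads the PDF from disk."""
    return pdf_probably_has_text_layer(Path(path).read_bytes())
```

Calling `file_probably_has_text_layer("scan.pdf")` on a file that works and on one that doesn't (the file name is a placeholder) shows whether the two differ in this respect. If the failing scans report no text layer, their content can only come from Mayan's OCR step, which narrows down where to look.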
pintariching
Posts: 3
Joined: Mon Jan 24, 2022 11:48 am

Re: Document content empty with no errors

Post by pintariching »

So I have found the problem.
Previously I wasn't scanning the documents with OCR enabled, but I thought I was, and I assumed the printer wasn't doing something right.
It turns out I simply wasn't running OCR on the printer.
Still, the OCR that comes with Mayan doesn't work properly. The events say that the document was submitted for OCR and that OCR finished, but there is still no content if the file is scanned from a printer without the printer doing the OCR beforehand.
bebef
Posts: 27
Joined: Fri Aug 21, 2020 6:00 am

Re: Document content empty with no errors

Post by bebef »

I'm running Mayan 4.2.3 in Docker and I'm facing the same issue. OCR always used to work, and at some point it just stopped. I only found out because search would no longer find documents that I expected to find. :(

No error messages; parsing seemingly works without any errors. When the container starts, the appropriate OCR packages are installed as well.

I wonder whether it is the parsing itself or if parsing works but no content is stored with the document afterwards.
bebef
Posts: 27
Joined: Fri Aug 21, 2020 6:00 am

Re: Document content empty with no errors

Post by bebef »

Still the same issue in 4.2.4.

This time I even deleted the document's contents and then had it re-parsed. The result for a 7-page PDF scan was 7 empty pages. :cry:

It basically affects all scanned PDF documents. PDF documents that were generated rather than scanned still have their content.

Would be really cool if someone could have a look at this. :|
bebef
Posts: 27
Joined: Fri Aug 21, 2020 6:00 am

Re: Document content empty with no errors

Post by bebef »

By the way, I've tested command line tesseract on images in /var/lib/mayan/document_file_page_image_cache and it works just fine!
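
For anyone who wants to repeat that check over several cached pages at once, here is a rough Python sketch. It assumes the `tesseract` binary is on the PATH inside the container and uses the cache path mentioned above. The helper names are made up for illustration, and the `eng` language default is an assumption, so adjust it to whatever language packs are installed.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List

# Cache directory as named in the post above.
CACHE_DIR = Path("/var/lib/mayan/document_file_page_image_cache")


def tesseract_command(image: Path, language: str = "eng") -> List[str]:
    """Build the CLI call; the output base 'stdout' prints the text."""
    return ["tesseract", str(image), "stdout", "-l", language]


def ocr_text(image: Path, language: str = "eng") -> str:
    """Run tesseract on one image and return the recognised text."""
    result = subprocess.run(
        tesseract_command(image, language),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def report(cache_dir: Path = CACHE_DIR, limit: int = 5) -> None:
    """Print how many characters tesseract finds in the first few files."""
    if shutil.which("tesseract") is None:
        print("tesseract not found on PATH")
        return
    if not cache_dir.is_dir():
        print(f"{cache_dir} does not exist")
        return
    images = sorted(p for p in cache_dir.iterdir() if p.is_file())[:limit]
    for image in images:
        text = ocr_text(image)
        print(f"{image.name}: {len(text.strip())} characters recognised")
```

Running `report()` from a Python shell inside the container gives a quick per-file count. If tesseract recognises text here while the document pages in Mayan stay empty, the problem is more likely in how the OCR result is handed over or stored than in tesseract itself.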
bebef
Posts: 27
Joined: Fri Aug 21, 2020 6:00 am

Re: Document content empty with no errors

Post by bebef »

Just tested with a fresh setup. Same issue, so I suppose this is a bug in Mayan 4.2.4 (Docker)?
bebef
Posts: 27
Joined: Fri Aug 21, 2020 6:00 am

Re: Document content empty with no errors

Post by bebef »
