Docker-deployment and content parsing of non-pdf files

Posted: Tue Sep 14, 2021 10:01 am
by thierryA
Hi,

I'm trying to set up a demo based on the 4.0.x version of Mayan-EDMS with a Docker / docker-compose deployment.

In this demo I need to show how to add and search a collection of documents in several formats: .pdf, .docx, .pptx, .txt...

I've tried version 4.0.15, which lets me add and "ingest" PDF documents without problems (they are correctly indexed and can be found, as expected, with the search tools).

But I have problems with all other formats: the parsed content is always empty when I look at document / file / content. The JPEG representation of the file, however, is generated correctly.

The process I'm following for my tests is:

1/ Run docker-compose using https://gitlab.com/mayan-edms/mayan-edm ... ompose.yml

2/ Connect to the web site

3/ Documents > New document > Default document type and add a pdf / docx / txt document


In the Docker logs, after the upload of each document, I see this:

[2021-09-14 09:28:31,141: ERROR/ForkPoolWorker-2] Task mayan.apps.dynamic_search.tasks.task_deindex_instance[bfb696d3-182c-40b5-8ea9-a70796b946be] raised unexpected: DoesNotExist('IndexInstanceNode matching query does not exist.')
Traceback (most recent call last):
  File "/opt/mayan-edms/lib/python3.7/site-packages/celery/app/trace.py", line 412, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/opt/mayan-edms/lib/python3.7/site-packages/celery/app/trace.py", line 704, in __protected_call__
    return self.run(*args, **kwargs)
  File "/opt/mayan-edms/lib/python3.7/site-packages/mayan/apps/dynamic_search/tasks.py", line 22, in task_deindex_instance
    instance = Model._meta.default_manager.get(pk=object_id)
  File "/opt/mayan-edms/lib/python3.7/site-packages/django/db/models/manager.py", line 82, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
  File "/opt/mayan-edms/lib/python3.7/site-packages/django/db/models/query.py", line 408, in get
    self.model._meta.object_name
mayan.apps.document_indexing.models.IndexInstanceNode.DoesNotExist: IndexInstanceNode matching query does not exist.

But this does not affect the content parsing of PDF files (their content is parsed as expected).
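
For what it's worth, my reading of the traceback is that the deindex task calls `.get(pk=...)` for an `IndexInstanceNode` row that is no longer in the database. The sketch below reproduces that pattern in plain Python. It is only an illustration: `FakeManager`, `deindex_instance` and the in-memory dict are hypothetical stand-ins, not Mayan or Django code, and the idea that the row was deleted before the task ran is my assumption.

```python
class DoesNotExist(Exception):
    """Hypothetical stand-in for Django's Model.DoesNotExist."""


class FakeManager:
    """Hypothetical in-memory stand-in for a Django model manager."""

    def __init__(self):
        self.rows = {}

    def get(self, pk):
        # Like a manager's get(): raise if no row matches the key.
        try:
            return self.rows[pk]
        except KeyError:
            raise DoesNotExist(
                'IndexInstanceNode matching query does not exist.'
            )


def deindex_instance(manager, object_id):
    """Mimic the lookup shown in the traceback (tasks.py, line 22)."""
    return manager.get(pk=object_id)


manager = FakeManager()
manager.rows[1] = 'node-1'
# Assumption: the row is removed before the queued task gets to run.
del manager.rows[1]

try:
    deindex_instance(manager, 1)
    outcome = 'found'
except DoesNotExist as exc:
    outcome = str(exc)

print(outcome)
```

If that reading is right, the error would be about index nodes rather than about the parsing itself, but I have not confirmed this.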

I've also tried older versions of Mayan-EDMS (4.0.14, 4.0.11, 4.0.1...) and still have this problem (content not parsed for non-PDF files).

Do you have any idea what causes this problem? Do you know if there is a 4.0.x version which works as expected for the ingestion of non-PDF documents with the docker-compose deployment? Or should I change something in the docker-compose configuration?

Is a direct deployment a better solution for setting up a demo and, if so, do you know of any contexts / versions for which content parsing works as expected?