Lost documents in file system, database + preview still existent

bernroth
Posts: 12
Joined: Mon Oct 05, 2020 4:10 am

Lost documents in file system, database + preview still existent

Post by bernroth »

Dear all,

We have been using Mayan EDMS for quite some time now.
Our version is 4.0.15 which was upgraded from 3.x a while ago.

Just recently we noticed that some documents cannot be opened.
The logs show confusing messages:

Code:

django.http.response.Http404: No Document found matching the query
mayan.apps.logging.middleware.error_logging <6825> [ERROR] "process_exception() line 17 Exception caught by request middleware; <WSGIRequest: GET '/documents/documents/22672/preview/'>, No Document found matching the query"
Traceback (most recent call last):
  File "/opt/mayan-edms/lib/python3.7/site-packages/django/views/generic/detail.py", line 52, in get_object
    obj = queryset.get()
  File "/opt/mayan-edms/lib/python3.7/site-packages/django/db/models/query.py", line 408, in get
    self.model._meta.object_name
mayan.apps.documents.models.document_models.Document.DoesNotExist: Document matching query does not exist.

In the SQL dump I took when upgrading, I can find a UUID for this document in the documents_document table:

b050fdec-c1e3-4420-928b-1fecb59807a8

Searching the file system for that UUID, I can find preview images in "document_file_page_image_cache" but not the actual file itself.

What could have gone wrong here?

Please help me get the file UUID with psql in a docker-compose installation.
Somehow I cannot run a SQL query to get the file UUID to perform other searches.
Something like "select UUID from public.documents_document where id=22672".
This would allow me to search for the missing file in the file system. Unfortunately, the error messages do not make clear exactly which file was not found.
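
The file system search mentioned above could be scripted roughly as follows. This is only a minimal sketch, assuming Python 3 on the host; the function name `find_paths_containing` is hypothetical, and it matches the search string against file and directory names only, not file contents. Whether the stored document file is actually named after this UUID is not confirmed here.

```python
import os
from typing import Iterator


def find_paths_containing(root: str, needle: str) -> Iterator[str]:
    """Yield every path under root whose file or directory name contains needle."""
    needle = needle.lower()
    for dirpath, dirnames, filenames in os.walk(root):
        # Check directory names as well, since cache entries may be folders.
        for name in dirnames + filenames:
            if needle in name.lower():
                yield os.path.join(dirpath, name)
```

It would be called with the media directory of the installation (a placeholder here) and the UUID from the dump, for example `list(find_paths_containing("/path/to/media", "b050fdec-c1e3-4420-928b-1fecb59807a8"))`.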

This problem affects multiple files and I have no idea what happened.

The system backup from 35 days ago does not contain the file either (at that time everything was still running Mayan EDMS 3.xx).

I don't think this is a problem caused by the update.

Thanks for your help and hints!

Best regards,
Bernhard
michael
Developer
Posts: 187
Joined: Sun Apr 19, 2020 6:21 am

Re: Lost documents in file system, database + preview still existent

Post by michael »

Hello Bernhard,

There are some bits to unpack here before diving in.

> mayan.apps.logging.middleware.error_logging <6825> [ERROR] "process_exception() line 17 Exception caught by request middleware; <WSGIRequest: GET '/documents/documents/22672/preview/'>, No Document found matching the query"

What action were you trying to perform when you got this error?

The document list is not cached. If the document thumbnail is shown in the list then it means it is found in the database.

Use a superuser account to check the missing documents. As a strict security measure, Mayan returns a 404 error rather than a 403 error if the user does not have sufficient permissions => https://docs.mayan-edms.com/mercs/0006- ... close.html

> Searching the file system for that UUID, I can find preview images in "document_file_page_image_cache" but not the actual file itself.

When a document is deleted, the previews are deleted too. In normal operation it is not possible (or at least not probable, given the many unit tests that cover the code path) for a document to be deleted and its previews to be left behind. This behavior leads me to believe the database was modified directly with another program.

> Somehow I cannot run a SQL query to get the file UUID to perform other searches.

What is the error message shown when attempting to run the SQL query?

> The system backup from 35 days ago does not contain the file either (at that time everything was still running Mayan EDMS 3.xx).
>
> I don't think this is a problem caused by the update.

Can you run a separate test install with that backup restored?

The event log for the document type will show if and when the documents were deleted.

Another thing to check is the existence of a deletion policy for the document type. These are designed to trash and then delete documents after a period of time to comply with privacy policies.
bernroth
Posts: 12
Joined: Mon Oct 05, 2020 4:10 am

Re: Lost documents in file system, database + preview still existent

Post by bernroth »

Thanks for your reply!

My answers to your questions are below.

I have some new findings which might partly explain the issues I've seen:

Back in September we noticed that many duplicated documents existed (around 5000).
This is because documents put into the watch folder are often imported twice.

To avoid duplicates in the search results, I spent some time removing all the detected duplicates in the GUI.

Afterwards, those "Page not found" errors were reported to me, but I did not connect them to my duplicate cleanup.

Slowly I began to realize several issues with the "Duplicated documents" feature:

- The list shows not just the extra copy but both copies of the same document.
- Cleaning the "Duplicated documents" list will remove all occurrences of those documents.
- Duplicates can be there for a purpose. My workmate uses the Cabinet feature extensively. For some projects, at a given time, all documents of that step are placed in a dedicated cabinet. Specifications which do not change often might get "duplicated". This is a desired behavior.
File names differ but the file content will be exactly the same.
-> I will have to create a new forum topic to discuss possible solutions to that problem. I think some kind of deduplication is needed.


Question: Is it possible that trashing documents from the "Duplicated document" panel causes the document to be removed but not properly cleaned up in the database, search index and thumbnail cache?


I have now restored all trashed documents, and the "Page not found" errors are greatly reduced.

> What action were you trying to perform when you got this error?

I searched for a document and clicked on the preview to open it.
We are using the Whoosh search engine.

> When a document is deleted, the previews are deleted too. In normal operation it is not possible (or at least not probable, given the many unit tests that cover the code path) for a document to be deleted and its previews to be left behind. This behavior leads me to believe the database was modified directly with another program.

I don't mess with the database tables except for upgrades, e.g. 3.5 to 4.0 :)

> What is the error message shown when attempting to run the SQL query?

Would you be so kind as to give me an example of how to execute a SQL query on the host using the database from the postgresql container?

> Another thing to check is the existence of a deletion policy for the document type. These are designed to trash and then delete documents after a period of time to comply with privacy policies.

Checked: 10 years. IMHO this is not the root cause.
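
For the SQL question above, one possible approach is to call psql inside the database container through docker-compose. The following is a minimal sketch and not an official Mayan procedure: the service name `postgresql`, the database user `mayan` and the database name `mayan` are assumptions that have to be adjusted to the values in the local docker-compose file, and the helper names are hypothetical. The table and column names are the ones from the query quoted in this thread.

```python
import subprocess
from typing import List


def build_psql_command(
    document_id: int,
    service: str = "postgresql",  # assumed compose service name
    user: str = "mayan",          # assumed database user
    database: str = "mayan",      # assumed database name
) -> List[str]:
    """Build a docker-compose command that prints the UUID of one document."""
    # int() raises on non-numeric input, so arbitrary SQL cannot be injected.
    sql = f"SELECT uuid FROM documents_document WHERE id = {int(document_id)};"
    return [
        # -T disables the pseudo-TTY so the output can be captured.
        "docker-compose", "exec", "-T", service,
        # -A: unaligned output, -t: rows only, -c: run a single command.
        "psql", "-U", user, "-d", database, "-A", "-t", "-c", sql,
    ]


def fetch_document_uuid(document_id: int) -> str:
    """Run the query on the Docker host; returns an empty string if no row matches."""
    result = subprocess.run(
        build_psql_command(document_id),
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()
```

Calling `fetch_document_uuid(22672)` would need to happen on the Docker host, from the directory that contains the docker-compose file, while the database container is running. The list returned by `build_psql_command(22672)` can also be typed by hand as a single shell command.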