Logging upload errors and identifying duplicates on upload

daniel1113
Posts: 21
Joined: Tue Aug 21, 2018 2:32 pm

Logging upload errors and identifying duplicates on upload

Post by daniel1113 » Fri May 10, 2019 2:53 pm

We're using Mayan to store large quantities of PDFs and need to upload batches of them at a time (~50 to 150 PDFs is not uncommon). Mayan is pretty good at importing the files using the web upload interface, but occasionally some of the files in a batch will error out. Errors get triggered for various reasons: network transfer errors, hitting the upload size limit, etc. When errors happen, it is very difficult to distinguish exactly which files were uploaded and which were not in a large batch. This leads me to two questions.

First, does Mayan log document upload errors? If so, where are these logs located? If not, can this feature be added? A simple log would make it very easy to identify successful and unsuccessful uploads.

Second, another way to resolve this would be to simply re-upload the entire batch, which includes both files that were previously uploaded and those that errored out. Duplicates could then be culled, hopefully using Mayan's duplicate documents feature. But uploading duplicates just to delete them is very wasteful, especially in the quantities we are dealing with. Would it be possible for Mayan to check for duplicates at upload time, and then only import and process the non-duplicates?

rosarior
Posts: 303
Joined: Tue Aug 21, 2018 3:28 am

Re: Logging upload errors and identifying duplicates on upload

Post by rosarior » Sun May 19, 2019 7:54 pm

There is a feature to enable out-of-channel error logging for these cases using the settings

Code: Select all

COMMON_PRODUCTION_ERROR_LOGGING
and

Code: Select all

COMMON_PRODUCTION_ERROR_LOG_PATH
(which defaults to 'media/error.log'). This feature is disabled by default because it conflicts with Docker logging best practices (which call for logging to STDOUT) and with Supervisord logging (which logs to /var/log/supervisor).
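As a sketch of how these could be turned on, assuming your deployment reads Mayan settings from `MAYAN_`-prefixed environment variables (the usual convention for Mayan's config system; adjust names and paths to your own installation):

```shell
# Hypothetical example: enable out-of-channel error logging via environment
# variables before starting Mayan. The log path below is an assumed location,
# not a documented default.
export MAYAN_COMMON_PRODUCTION_ERROR_LOGGING=true
export MAYAN_COMMON_PRODUCTION_ERROR_LOG_PATH=/var/lib/mayan/error.log
```

In a Docker deployment these would instead be passed with `-e` flags or an env file, keeping in mind the STDOUT caveat above.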

Detecting upload errors while running distributed background tasks is very hard because during uploads the files are in a sort of "limbo" state: they have left the browser but are not yet Mayan documents. The upload code includes a lot of logic to recover from upload errors (such as broker retries and database transactions), but some situations are simply impossible to recover from, like the ones you mention (network errors, etc.). Here is a talk I presented at PyCon Italy discussing these problems and the solutions we came up with for Mayan: https://www.youtube.com/watch?v=0UJTG5QU7Ss

A feature to reject duplicate documents at upload time, and another to add a Tools menu entry to delete all duplicated documents, are planned and their implementations are being discussed. They might land as soon as version 3.3.
