Documents don't get sent to Tesseract

Hi,

I moved my direct installation of Mayan over to the Docker-based installation by following this guide: Direct deployment to Docker Compose migration

After migrating I checked a couple of documents and noticed that some are missing their OCR content. I then went into Tools, clicked the “queue all documents for OCR processing” button and sent all my documents through. It is a small system without that many documents.

After processing finished - which I checked by looking for ‘tesseract’ processes - I checked again and the only content they had was “- Page 1 - - Page 2 -” and so on but no actual content.

I checked the logs with the docker compose logs command and found this:

mayan-rabbitmq-1    | 2023-05-10 17:37:22.827265+00:00 [warning] <0.15033.0> closing AMQP connection <0.15033.0> (172.18.0.2:55364 -> 172.18.0.4:5672, vhost: 'mayan', user: 'mayan'):
mayan-rabbitmq-1    | 2023-05-10 17:37:22.827265+00:00 [warning] <0.15033.0> client unexpectedly closed TCP connection
mayan-rabbitmq-1    | 2023-05-10 17:37:22.946018+00:00 [warning] <0.14937.0> closing AMQP connection <0.14937.0> (172.18.0.2:46928 -> 172.18.0.4:5672, vhost: 'mayan', user: 'mayan'):
mayan-rabbitmq-1    | 2023-05-10 17:37:22.946018+00:00 [warning] <0.14937.0> client unexpectedly closed TCP connection
mayan-rabbitmq-1    | 2023-05-10 17:37:22.946557+00:00 [warning] <0.15021.0> closing AMQP connection <0.15021.0> (172.18.0.2:55354 -> 172.18.0.4:5672, vhost: 'mayan', user: 'mayan'):
mayan-rabbitmq-1    | 2023-05-10 17:37:22.946557+00:00 [warning] <0.15021.0> client unexpectedly closed TCP connection
mayan-rabbitmq-1    | 2023-05-10 17:37:22.946850+00:00 [warning] <0.15001.0> closing AMQP connection <0.15001.0> (172.18.0.2:55346 -> 172.18.0.4:5672, vhost: 'mayan', user: 'mayan'):
mayan-rabbitmq-1    | 2023-05-10 17:37:22.946850+00:00 [warning] <0.15001.0> client unexpectedly closed TCP connection
mayan-rabbitmq-1    | 2023-05-10 17:37:22.946727+00:00 [error] <0.15025.0>     supervisor: {<0.15025.0>,rabbit_channel_sup}
mayan-rabbitmq-1    | 2023-05-10 17:37:22.946727+00:00 [error] <0.15025.0>     errorContext: shutdown_error
mayan-rabbitmq-1    | 2023-05-10 17:37:22.946727+00:00 [error] <0.15025.0>     reason: noproc
mayan-rabbitmq-1    | 2023-05-10 17:37:22.946727+00:00 [error] <0.15025.0>     offender: [{pid,<0.15028.0>},
mayan-rabbitmq-1    | 2023-05-10 17:37:22.946727+00:00 [error] <0.15025.0>                {id,channel},
mayan-rabbitmq-1    | 2023-05-10 17:37:22.946727+00:00 [error] <0.15025.0>                {mfargs,
mayan-rabbitmq-1    | 2023-05-10 17:37:22.946727+00:00 [error] <0.15025.0>                    {rabbit_channel,start_link,
mayan-rabbitmq-1    | 2023-05-10 17:37:22.946727+00:00 [error] <0.15025.0>                        [1,<0.15021.0>,<0.15026.0>,<0.15021.0>,
mayan-rabbitmq-1    | 2023-05-10 17:37:22.946727+00:00 [error] <0.15025.0>                         <<"172.18.0.2:55354 -> 172.18.0.4:5672">>,
mayan-rabbitmq-1    | 2023-05-10 17:37:22.946727+00:00 [error] <0.15025.0>                         rabbit_framing_amqp_0_9_1,
mayan-rabbitmq-1    | 2023-05-10 17:37:22.946727+00:00 [error] <0.15025.0>                         {user,<<"mayan">>,
mayan-rabbitmq-1    | 2023-05-10 17:37:22.946727+00:00 [error] <0.15025.0>                             [administrator],
mayan-rabbitmq-1    | 2023-05-10 17:37:22.946727+00:00 [error] <0.15025.0>                             [{rabbit_auth_backend_internal,none}]},
mayan-rabbitmq-1    | 2023-05-10 17:37:22.946727+00:00 [error] <0.15025.0>                         <<"mayan">>,
mayan-rabbitmq-1    | 2023-05-10 17:37:22.946727+00:00 [error] <0.15025.0>                         [{<<"consumer_cancel_notify">>,bool,true},
mayan-rabbitmq-1    | 2023-05-10 17:37:22.946727+00:00 [error] <0.15025.0>                          {<<"connection.blocked">>,bool,true},
mayan-rabbitmq-1    | 2023-05-10 17:37:22.946727+00:00 [error] <0.15025.0>                          {<<"authentication_failure_close">>,bool,true}],
mayan-rabbitmq-1    | 2023-05-10 17:37:22.946727+00:00 [error] <0.15025.0>                         <0.15022.0>,<0.15027.0>]}},
mayan-rabbitmq-1    | 2023-05-10 17:37:22.946727+00:00 [error] <0.15025.0>                {restart_type,intrinsic},
mayan-rabbitmq-1    | 2023-05-10 17:37:22.946727+00:00 [error] <0.15025.0>                {shutdown,70000},
mayan-rabbitmq-1    | 2023-05-10 17:37:22.946727+00:00 [error] <0.15025.0>                {child_type,worker}]
mayan-rabbitmq-1    | 2023-05-10 17:37:22.947006+00:00 [error] <0.15005.0>     supervisor: {<0.15005.0>,rabbit_channel_sup}
mayan-rabbitmq-1    | 2023-05-10 17:37:22.947006+00:00 [error] <0.15005.0>     errorContext: shutdown_error
mayan-rabbitmq-1    | 2023-05-10 17:37:22.947006+00:00 [error] <0.15005.0>     reason: noproc
mayan-rabbitmq-1    | 2023-05-10 17:37:22.947006+00:00 [error] <0.15005.0>     offender: [{pid,<0.15008.0>},
mayan-rabbitmq-1    | 2023-05-10 17:37:22.947006+00:00 [error] <0.15005.0>                {id,channel},
mayan-rabbitmq-1    | 2023-05-10 17:37:22.947006+00:00 [error] <0.15005.0>                {mfargs,
mayan-rabbitmq-1    | 2023-05-10 17:37:22.947006+00:00 [error] <0.15005.0>                    {rabbit_channel,start_link,
mayan-rabbitmq-1    | 2023-05-10 17:37:22.947006+00:00 [error] <0.15005.0>                        [1,<0.15001.0>,<0.15006.0>,<0.15001.0>,
mayan-rabbitmq-1    | 2023-05-10 17:37:22.947006+00:00 [error] <0.15005.0>                         <<"172.18.0.2:55346 -> 172.18.0.4:5672">>,
mayan-rabbitmq-1    | 2023-05-10 17:37:22.947006+00:00 [error] <0.15005.0>                         rabbit_framing_amqp_0_9_1,
mayan-rabbitmq-1    | 2023-05-10 17:37:22.947006+00:00 [error] <0.15005.0>                         {user,<<"mayan">>,
mayan-rabbitmq-1    | 2023-05-10 17:37:22.947006+00:00 [error] <0.15005.0>                             [administrator],
mayan-rabbitmq-1    | 2023-05-10 17:37:22.947006+00:00 [error] <0.15005.0>                             [{rabbit_auth_backend_internal,none}]},
mayan-rabbitmq-1    | 2023-05-10 17:37:22.947006+00:00 [error] <0.15005.0>                         <<"mayan">>,
mayan-rabbitmq-1    | 2023-05-10 17:37:22.947006+00:00 [error] <0.15005.0>                         [{<<"consumer_cancel_notify">>,bool,true},
mayan-rabbitmq-1    | 2023-05-10 17:37:22.947006+00:00 [error] <0.15005.0>                          {<<"connection.blocked">>,bool,true},
mayan-rabbitmq-1    | 2023-05-10 17:37:22.947006+00:00 [error] <0.15005.0>                          {<<"authentication_failure_close">>,bool,true}],
mayan-rabbitmq-1    | 2023-05-10 17:37:22.947006+00:00 [error] <0.15005.0>                         <0.15002.0>,<0.15007.0>]}},
mayan-rabbitmq-1    | 2023-05-10 17:37:22.947006+00:00 [error] <0.15005.0>                {restart_type,intrinsic},
mayan-rabbitmq-1    | 2023-05-10 17:37:22.947006+00:00 [error] <0.15005.0>                {shutdown,70000},
mayan-rabbitmq-1    | 2023-05-10 17:37:22.947006+00:00 [error] <0.15005.0>                {child_type,worker}]

I’m not sure if that has anything to do with the OCR process but that’s the only error I found so far.
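For anyone sifting through similar output, here is a rough sketch of how the log lines could be tallied by service and level to see at a glance which container reports errors. It assumes the line layout shown above (service name, a pipe, a timestamp, then the level in square brackets); the names `LINE_RE` and `count_levels` are made up for this example and are not part of Mayan or Docker.

```python
import re
from collections import Counter

# Assumed layout, taken from the log excerpt above:
# mayan-rabbitmq-1    | 2023-05-10 17:37:22.827265+00:00 [warning] <0.15033.0> message
# The first bracketed word after the pipe is treated as the log level.
LINE_RE = re.compile(r"^(?P<service>\S+)\s*\|\s.*?\[(?P<level>\w+)\]")


def count_levels(lines):
    """Count (service, level) pairs in docker compose log lines.

    Lines that do not match the assumed layout are ignored.
    """
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            counts[(match.group("service"), match.group("level"))] += 1
    return counts
```

The function takes any iterable of lines, so the output of the docker compose logs command could be saved to a file and passed in via `open(...)`. Note that a multi-line RabbitMQ report like the one above counts as one error per line, so the totals overstate the number of distinct events.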

I also checked on my direct installation which is in a virtual machine and did the same thing - with the same result: The same documents aren’t being processed for some reason.

I think what you are looking at is the ‘File’ content. That is the text extracted from the file itself, which was already there previously; it includes the page separators you describe.

OCR done by tesseract can be found under ‘Versions’.


Hi,

yes, you are correct. I apparently had forgotten where the actual OCR-Text menu entry was hidden and just assumed it was removed.

I have now looked at what shows up under ‘Versions’ in the ‘OCR-Text’ entry. A few documents have no content there, but most contain only the header or the footer of the document and none of the text in between. Sending such a document to the OCR queue again doesn’t change that; it just gives the same result. The quality of the scan is good and no different from other documents. In any case, this seems to be an issue between tesseract and my documents.
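One way to narrow this down outside of Mayan might be to run tesseract directly on a single page image and compare a few page segmentation modes. The sketch below is only that: it assumes a page has been exported as an image (the file name `page.png` is a placeholder), that the `tesseract` command-line tool is on the PATH, and that the documents are in English (`eng`). The helper names are invented for this example, and whether segmentation is actually the cause here is unconfirmed.

```python
import subprocess


def build_tesseract_command(image_path, lang="eng", psm=3):
    """Build a tesseract CLI call that prints the recognized text to stdout.

    psm is the page segmentation mode: 3 is fully automatic (the default),
    4 assumes a single column of text, 6 assumes a single uniform block.
    Older tesseract 3.x releases spell the option -psm instead of --psm.
    """
    return ["tesseract", image_path, "stdout", "-l", lang, "--psm", str(psm)]


def ocr_with_modes(image_path, lang="eng", modes=(3, 4, 6)):
    """Run tesseract once per segmentation mode and collect the text output."""
    results = {}
    for psm in modes:
        completed = subprocess.run(
            build_tesseract_command(image_path, lang, psm),
            capture_output=True,
            text=True,
            check=False,  # keep going even if one mode fails
        )
        results[psm] = completed.stdout
    return results
```

Calling `ocr_with_modes("page.png")` returns a dictionary keyed by mode. If one mode returns the body text while the default only returns the header or footer, that would point towards page layout detection rather than scan quality.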

Thank you for your help!

However, that still leaves the issue with that rabbitmq exception. Obviously that has nothing to do with tesseract, though, and I think another user has opened a thread about those issues as well, so I’ll just post here again once I’ve figured out why tesseract isn’t completely analyzing those documents.