Mayan Mindee metadata/tag provider plugin

Hi there,

I created a “plugin” for Mayan that automatically attaches metadata and tags to your documents using the Mindee APIs (https://mindee.com/), and I want to share it with you.

If you are interested please have a look at the README of the project:

Use Mayan workflows with HTTP actions to trigger the Mayan Mindee web service.

Any questions/feedback or contributions are welcome :slight_smile:

Love this! Thanks for sharing your work!

Hello,

Thank you for making this plugin. It looks like a perfect fit for our project, scanning in our old sales orders.

Is there a way to get the raw OCR data as well as the data fields from the Mindee API?

Thanks,
Daniel

Hi Daniel,

Unfortunately, you cannot get the complete OCR data from the API. You only get some raw data (bounding boxes and confidence levels) for the data that was mapped to an API field. You can have this data stored in a Mayan metadata field if you specify the metadata name as “storeocr” in the config file, but that still does not give you the whole OCR data.

I have another repo here: GitHub - DrRSatzteil/metadatamagic: Automatic metadata for mayan documents

This repo uses the models provided by Mindee to do OCR on premises. The whole project contains quite a bit more than just the OCR part, but it is unfinished and not usable as is. The OCR part works fine, however. You can find it here, though this is probably not something you want to do: metadatamagic/metadatamagic/analysis/documentanalyser.py at main · DrRSatzteil/metadatamagic · GitHub

Regards
Thomas

Hi Thomas,

Thanks for the answer. It looks like you’ve put quite a bit of time into this project.

As it happens, I asked someone at Mindee whether they will be providing the raw OCR data. He said:

"We’re actually in the process of building a Raw Text API that should be available soon; we haven’t confirmed the exact timeline for its availability yet."

Thanks again,
Daniel

Hi Thomas,

What type of transition trigger do you use to trigger the web service?

For example, I would like to set up a Watch Folder, and when a file is sent to the folder, I want to call Mindee. Then, once Mindee has filled the metadata fields, I want the user to review them for accuracy.

I don’t see a trigger in Mayan that will do that, except maybe “Document file submitted for file metadata processing”, and that doesn’t seem like a perfect fit.

If I use a Staging Folder, or the Web Interface to add a new Document, the user has to enter the Meta Data Fields first, so I don’t know if there’s a way to use those and have the Plugin run before the user sees the Meta Data Field Entry Screen.

Thanks,
Daniel

Hi Daniel,

I don’t see a good way to use my plugin like that unfortunately. The reason is that the plugin gets triggered with a document id and will download the file from Mayan as a first step. So the document (including the file) must exist in Mayan before you can trigger it.

However, there might be a workaround that lets you keep your workflow, although I’m not sure it would work. This might also be interesting for me, so I may look into it myself if I find some time.

You could upload the files right away (no staging folder, just a direct import) and have Mindee run over them, using an HTTP request action on entering your first workflow step. Configure the Mindee plugin to assign a tag to every document so that you can use the tag as a trigger to transition the document to the next workflow step. You can remove the tag on exiting the previous workflow step so your users don’t have to see it.

Here comes the part I’m not very familiar with: allow your users to trigger the next workflow transition manually (e.g. “Check Metadata”). You can add some fields to this workflow transition that your users get to see when they trigger it, and these could be used to enter corrections for the detected metadata. However, I don’t think it is possible to populate these fields with the metadata already attached to the document at that point in time. If it is not, your users would need to keep the metadata fields open in a separate tab and correct them there if needed, which is not very convenient. On the other hand, the users would need to keep a tab open to see the document anyway while making corrections, so it might not be so bad after all.

Another idea: just let them enter their corrections on the regular metadata mask and offer them a workflow transition they can use to mark the documents as checked. Actually, I would rather recommend letting them add a tag (e.g. “Checked”) and triggering the workflow transition automatically, since this is a lot more convenient than triggering it manually.

Regards
Thomas

Thank you for the quick response.

This is very helpful, if for no other reason than that I now know not to pursue this approach.

I think your idea of using tags might be a good approach, because it gives the user confidence that the metadata has been checked.

I’ll have to think about this some more. I want to make an easy, understandable workflow for my users.

Kind Regards,
Daniel

Hi Thomas,

I’m setting up the plugins in my docker-compose.yml right now, and I believe I have almost everything set up. But in reading your readme.md on GitHub, I came across this paragraph:

Please note that right now it is not enough to create a new config in the api.json file but you also need to add a new endpoint in the mayanmindee/service.py file. This might be changed in a future release so that you can add new apis only by adjusting the config file.

As I’m pulling your plugins via docker and not git, I’m not quite sure how I can modify the service.py file.

Thanks,
Daniel

First of all you only need to adjust the file if you want to use an API that is not yet supported (currently TypeProofOfAddressV1 and TypeInvoiceV4).

The easiest way would be to download the file (mayan-mindee/mayanmindee/service.py at main · DrRSatzteil/mayan-mindee · GitHub), make your adjustments and then mount the changed file in your docker container at the path “/app/mayanmindee/service.py”.

Great! Thank you, I’ll follow your directions for mounting.

We have a custom API that we are using.

Hi Thomas,

The plug-in does not seem to be contacting Mindee when I run a workflow using it. I see no traffic at all to Mindee.

When I run docker compose logs, I’m seeing this error message.

mayan-mayan-mindee-worker-1  |   File "/usr/local/lib/python3.11/site-packages/redis/connection.py", line 524, in read_response
mayan-mayan-mindee-worker-1  |     raise response
mayan-mayan-mindee-worker-1  | redis.exceptions.ResponseError: DB index is out of range

That error shows at initial startup, and keeps showing up at regular intervals.

I’m wondering if I have the wrong REDIS URL, I noticed that you have a /4 at the end of the URL, is that intended?

REDIS_URL: redis://:${MAYAN_REDIS_PASSWORD:-mayanredispassword}@redis:6379/4

Thanks

I use a separate redis database for my application to make sure that it does not interfere with mayan. In the standard installation the number of redis databases is lower. This is why the plugin cannot connect to redis. Increase the number of databases in your docker-compose.yml and you should be good. You may not need 4 databases though, this all depends on your installation. Just check how many databases you have right now and use the next higher number.

I see it’s set to 3, but I can only see 2 being used. I’ll increase to 4 for now, just for testing.

Thanks

If I remember correctly, the database indices start counting from 0, so index 4 is the fifth database and you would need at least 5 databases.
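To make the off-by-one concrete, here is a minimal sketch that reads the database index from a Redis URL and derives the smallest database count that makes that index valid. The helper names are made up for illustration and are not part of the plugin; the password in the example URL is the default from the compose line quoted above.

```python
from urllib.parse import urlparse


def redis_db_index(redis_url: str) -> int:
    """Return the database index named in a redis:// URL (0 if none is given)."""
    path = urlparse(redis_url).path.lstrip("/")
    return int(path) if path else 0


def required_database_count(redis_url: str) -> int:
    """Redis numbers databases from 0, so index N needs at least N + 1 databases."""
    return redis_db_index(redis_url) + 1


url = "redis://:mayanredispassword@redis:6379/4"
print(redis_db_index(url))           # database index the URL points at
print(required_database_count(url))  # minimum number of databases Redis must provide
```

With the `/4` URL from this thread, the index is 4 and the minimum database count is 5.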

It was, I had to change the count to 5.

I’m not getting the error message any more, but I can’t seem to get any documents to send over to Mindee.

Is there an error log I can look at?

docker compose logs isn’t showing any issues with either plugin.

I just used the docker exec -it command to get into the mindee-web container.

The /log/mayanmindee.log is empty.

Are you sure that you triggered the plugin? You should see logs starting with “Loading document…” when you trigger the service.

You can also try triggering it via curl to make sure that it’s not a problem with the trigger from Mayan. Just send a GET request with the document id to your defined endpoint.
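If curl is not at hand, the same manual trigger can be sketched in Python with only the standard library. Everything concrete here is an assumption for illustration: the host, port, endpoint name and document id are placeholders, and passing the document id as the last path segment is a guess, so check the README for the actual URL layout of your endpoint.

```python
import urllib.request


def build_trigger_url(base_url: str, endpoint: str, document_id: int) -> str:
    """Join base URL, endpoint path and document id into one request URL.

    Assumption: the document id is passed as the last path segment.
    """
    return f"{base_url.rstrip('/')}/{endpoint.strip('/')}/{document_id}"


def trigger(url: str, timeout: float = 10.0) -> int:
    """Send a GET request and return the HTTP status code."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status


if __name__ == "__main__":
    # Hypothetical values: replace host, port, endpoint and id with your own.
    url = build_trigger_url("http://localhost:8000", "my-endpoint", 42)
    print(url)
    # print(trigger(url))  # uncomment once the URL matches your setup
```

If the request reaches the service, the worker log should then show the “Loading document…” line mentioned above.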

Edit: please check the logs in the worker container, I actually think the web container does not log anything at all…

There’s no log file at all in the /logs directory of the worker container.

I’ll do the curl command next.

I ran curl in the web container with a fake document id. The log there said the request was sent to the worker container. Below is the log from the worker container. It appears that it can’t find the Mayan app.

2024-01-03 18:29:22,094 worker       INFO     Retrieve initial mayan configuration from environment
2024-01-03 18:29:22,095 mayanapi     DEBUG    endpoint api call http://app:8000/api/v4/auth/token/obtain/?format=json
2024-01-03 18:29:22,097 urllib3.connectionpool DEBUG    Starting new HTTP connection (1): app:8000
2024-01-03 18:29:30,103 rq.worker    DEBUG    Job d226c82e-54ac-4ace-9d8e-61f850e5aec6 raised an exception.
2024-01-03 18:29:30,111 rq.worker    DEBUG    Handling failed execution of job d226c82e-54ac-4ace-9d8e-61f850e5aec6
2024-01-03 18:29:30,132 rq.worker    DEBUG    Handling exception for d226c82e-54ac-4ace-9d8e-61f850e5aec6.
2024-01-03 18:29:30,134 rq.worker    ERROR    [Job d226c82e-54ac-4ace-9d8e-61f850e5aec6]: exception raised while executing (worker.process_standard)
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/urllib3/connection.py", line 203, in _new_conn
    sock = connection.create_connection(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/urllib3/util/connection.py", line 60, in create_connection
    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/socket.py", line 962, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
socket.gaierror: [Errno -3] Temporary failure in name resolution

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/urllib3/connectionpool.py", line 790, in urlopen
    response = self._make_request(
               ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/urllib3/connectionpool.py", line 496, in _make_request
    conn.request(
  File "/usr/local/lib/python3.11/site-packages/urllib3/connection.py", line 395, in request
    self.endheaders()
  File "/usr/local/lib/python3.11/http/client.py", line 1281, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/local/lib/python3.11/http/client.py", line 1041, in _send_output
    self.send(msg)
  File "/usr/local/lib/python3.11/http/client.py", line 979, in send
    self.connect()
  File "/usr/local/lib/python3.11/site-packages/urllib3/connection.py", line 243, in connect
    self.sock = self._new_conn()
                ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/urllib3/connection.py", line 210, in _new_conn
    raise NameResolutionError(self.host, self, e) from e
urllib3.exceptions.NameResolutionError: <urllib3.connection.HTTPConnection object at 0x7f09a0f85210>: Failed to resolve 'app' ([Errno -3] Temporary failure in name resolution)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/requests/adapters.py", line 486, in send
    resp = conn.urlopen(
           ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/urllib3/connectionpool.py", line 844, in urlopen
    retries = retries.increment(
              ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/urllib3/util/retry.py", line 515, in increment
    raise MaxRetryError(_pool, url, reason) from reason  # type: ignore[arg-type]
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='app', port=8000): Max retries exceeded with url: /api/v4/auth/token/obtain/?format=json (Caused by NameResolutionError("<urllib3.connection.HTTPConnection object at 0x7f09a0f85210>: Failed to resolve 'app' ([Errno -3] Temporary failure in name resolution)"))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/rq/worker.py", line 1428, in perform_job
    rv = job.perform()
         ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/rq/job.py", line 1278, in perform
    self._result = self._execute()
                   ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/rq/job.py", line 1315, in _execute
    result = self.func(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/mayanmindee/worker.py", line 177, in process_standard
    m = get_mayan()
        ^^^^^^^^^^^
  File "/app/mayanmindee/worker.py", line 37, in get_mayan
    m.login(options["username"], options["password"])
  File "/app/mayanmindee/mayanapi.py", line 78, in login
    token_response = self.session.post(
                     ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/requests/sessions.py", line 637, in post
    return self.request("POST", url, data=data, json=json, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/requests/adapters.py", line 519, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='app', port=8000): Max retries exceeded with url: /api/v4/auth/token/obtain/?format=json (Caused by NameResolutionError("<urllib3.connection.HTTPConnection object at 0x7f09a0f85210>: Failed to resolve 'app' ([Errno -3] Temporary failure in name resolution)"))
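
The traceback above comes down to the host name “app” not resolving from inside the worker container. A minimal sketch for checking this directly, for example from a Python shell inside that container, could look as follows. The helper name is made up for illustration; the host “app” and port 8000 are taken from the log, so swap in the service name your compose file actually uses.

```python
import socket


def can_resolve(host: str, port: int) -> bool:
    """Return True if the host name resolves to at least one address."""
    try:
        return len(socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)) > 0
    except socket.gaierror:
        # Same error class that appears at the bottom of the first traceback.
        return False


# "app" and 8000 are taken from the log above; adjust to your compose service name.
print(can_resolve("app", 8000))
```

If this prints False inside the worker container, the Mayan URL configured for the plugin most likely points at a service name that does not exist on the container’s network.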