where to add code for auto assignment of metadata

Posted: Sat Nov 16, 2019 7:23 pm
by bradyhurst
Hello - I'm new to Mayan and just purchased the book to get started, but haven't found a way to do what I want to do yet.
I did get up and running with the docker container and was able to set up a watch folder source and some document types.

The initial goal of my project is:
1 - Various document types are scanned to dedicated shared folders from different scanners around the building. For this example, I'll use the work order document type.
2 - The work order has a meaningless but unique file name and comes in as a clear PDF with a key value barcode in the upper left corner of the cover page as well as various reliable key fields that can be OCR'd.
3 - I'd like the metadata assignment and indexing process to be automatic, such that the system reads the file, OCRs it, decodes a barcode (if the document type has one), and then assigns metadata based on a regex or grep run against the OCR text, or on some custom Python that uses the barcode to look up other data via ODBC.
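
To make step 3 concrete, here is a minimal sketch of the regex part only, in plain Python. The field names, the patterns, and the sample text are all made up for illustration; nothing here is Mayan's API, and the barcode and ODBC lookups are left out.

```python
import re

# Hypothetical patterns for a work order; the real ones depend on the form layout.
FIELD_PATTERNS = {
    "work_order_number": re.compile(
        r"Work\s*Order\s*(?:No\.?|#)?\s*:?\s*(\d{4,})", re.IGNORECASE
    ),
    "customer": re.compile(r"Customer\s*:\s*(.+)", re.IGNORECASE),
}


def extract_metadata(ocr_text, patterns=FIELD_PATTERNS):
    """Return {field: value} for every pattern that matches the OCR text."""
    found = {}
    for field, pattern in patterns.items():
        match = pattern.search(ocr_text)
        if match:
            # Keep only the captured value, without surrounding whitespace.
            found[field] = match.group(1).strip()
    return found


# Invented OCR output, just to exercise the function.
sample = "ACME Corp\nWork Order No: 004512\nCustomer: Smith Fabrication\n"
print(extract_metadata(sample))
```

Fields whose pattern does not match are simply left out of the result, so the caller can decide whether a missing field should block indexing or just be flagged for review.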

I've been doing this for years with one document type, "CofC", where the documents are all just dumped in one big folder, OCR'd, and then Windows file indexing and search is used to find what we want. This actually works pretty reliably, but obviously the interface leaves something to be desired. Windows also makes indexed file search over a network on a domain pretty unpleasant to set up and maintain these days.

I'm envisioning something like a place where I can add a .py script that runs after OCR and applies my own logic to parse the data and assign metadata. I couldn't tell from reading past posts whether barcode support is in there yet. If it is, that would probably be more reliable than OCR for some document types. That said, if the OCR is comparable to what software like Acrobat and Paperport do, then I would trust it for my needs.

I don't mind buying a support subscription if I need to, but I need to figure out if the software is going to meet our needs before putting much more time in it.

I don't mind doing some Python programming, but I'm not looking to get into a huge project here. Can I do what I described above in Mayan without needing to make my own fork from the source?

Posted: Sun Nov 24, 2019 5:19 am
by m42e
I wanted to do the same. The least intrusive way I found, and the one I chose, was using the API. I add a tag to every file once it has been processed so that it is skipped next time, and process all the files that don't have the tag. The easiest would be to let it run on the same server, but that is not required. It works, but takes some time depending on the number of documents you have.

If you are interested I’m willing to clean up my code a bit and push it to GitHub.
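
For anyone skimming the thread, the tag-and-skip idea above can be sketched roughly like this. This is not the poster's code: the `client` object with `list_documents()` and `add_tag()` is a hypothetical wrapper around whatever API calls you end up using, the tag label is invented, and `FakeClient` is an in-memory stand-in so the loop can be tried without a server.

```python
PROCESSED_TAG = "auto-indexed"  # hypothetical tag label


def needs_processing(document, processed_tag=PROCESSED_TAG):
    """A document is pending if it does not carry the 'processed' tag yet."""
    return processed_tag not in document.get("tags", [])


def process_pending(client, assign_metadata, processed_tag=PROCESSED_TAG):
    """Run assign_metadata on every untagged document, then tag it.

    `client` is an assumed wrapper exposing list_documents() and add_tag().
    Returns the ids that were handled in this pass.
    """
    handled = []
    for document in client.list_documents():
        if not needs_processing(document, processed_tag):
            continue
        assign_metadata(document)
        # Tag only after the metadata step succeeded, so failures are retried.
        client.add_tag(document["id"], processed_tag)
        handled.append(document["id"])
    return handled


class FakeClient:
    """In-memory stand-in for a real API client (illustration only)."""

    def __init__(self, documents):
        self.documents = documents

    def list_documents(self):
        return list(self.documents)

    def add_tag(self, document_id, tag):
        for document in self.documents:
            if document["id"] == document_id:
                document.setdefault("tags", []).append(tag)
```

Running `process_pending` twice against the same client handles each document once; the second pass finds nothing left to do, which is what makes the loop safe to run from cron or a timer.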

Posted: Mon Nov 25, 2019 2:46 pm
by bradyhurst
Sure, I'd like to see it, but I'm not sure I follow the workflow. Did you write a separate process that polls the API, does the OCR, tags, then hands back to Mayan? Thanks for the reply.

Posted: Fri Nov 29, 2019 8:12 am
by m42e
So here's the code:

OCR is done by Mayan. The script polls the API (in the example above it currently only processes the document with the id 1206), but you should be able to adapt it to listen for a callback triggered by a workflow in Mayan. A setup with another script offering a web callback endpoint and using Celery might not be a bad idea.
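
A rough sketch of what such a callback endpoint could look like, using only the Python standard library. The assumptions are loud ones: the workflow is taken to send a JSON POST body containing a `document_id` key (that payload shape is invented here, check what your workflow action actually sends), and a plain in-process queue stands in for Celery.

```python
import json
import queue
from http.server import BaseHTTPRequestHandler, HTTPServer

pending = queue.Queue()  # document ids waiting for a worker to pick them up


def parse_callback(body):
    """Return the document id from a JSON body, or None if it is unusable.

    The "document_id" key is an assumed payload field, not a documented one.
    """
    try:
        return int(json.loads(body)["document_id"])
    except (ValueError, KeyError, TypeError):
        return None


class CallbackHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        document_id = parse_callback(self.rfile.read(length))
        if document_id is None:
            self.send_response(400)  # malformed or incomplete payload
        else:
            pending.put(document_id)
            self.send_response(202)  # accepted, processed later by a worker
        self.end_headers()

    def log_message(self, format, *args):
        pass  # keep the sketch quiet


def make_server(host="127.0.0.1", port=8099):
    """Build the server; call serve_forever() on the result to start it."""
    return HTTPServer((host, port), CallbackHandler)
```

To try it, call `make_server().serve_forever()` in one process and have a worker drain `pending` in another thread. Answering with 202 and queueing the id keeps the HTTP request short, which matters if the metadata lookup is slow.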

It uses the API to fetch metadata, document types, etc. Caching may be useful there but is not implemented yet.
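
One cheap way to add that caching would be `functools.lru_cache` around the lookup function. In the sketch below, `fetch_document_type` is a hypothetical placeholder that returns canned data and counts its calls; in a real script it would be the API request.

```python
from functools import lru_cache


def fetch_document_type(type_id):
    """Placeholder for a real API call (hypothetical, returns canned data)."""
    fetch_document_type.calls += 1
    return {"id": type_id, "label": "type-%d" % type_id}


fetch_document_type.calls = 0  # call counter, only here to show the effect


@lru_cache(maxsize=128)
def cached_document_type(type_id):
    """Same lookup, but repeated ids are answered from memory."""
    return fetch_document_type(type_id)
```

The cache hands back the same dict object on every hit, so callers should treat it as read-only, and `cached_document_type.cache_clear()` is needed if document types change while the script is running.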

Posted: Thu Dec 05, 2019 7:14 am
by rosarior
Thank you very much for sharing your code!