Hello,
in addition to https://gitlab.com/mayan-edms/mayan-edms/-/issues/860, I have a feature request: automatic processing of documents by learning from their content and/or layout.
The idea for Mayan EDMS came to me while I was playing around with PySS3 (https://pyss3.readthedocs.io/en/latest/) to automatically recognize a large number of documents.
This is especially relevant if you want to move to a modern EDMS from a very old storage/archiving application with a large number of different files (PDF, DOC, EML, MSG, ...) and document information locked away in a separate proprietary database.
It would be a huge benefit to just drop/scan a document into a folder and have it auto-hydrated with information/tags/metadata.
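To make the idea more concrete, here is a rough sketch of content-based tagging: a tiny multinomial naive Bayes classifier in plain Python that learns tags from example texts. This is only an illustration under my own assumptions. It does not use the PySS3 or Mayan EDMS APIs, the class name `NaiveBayesTagger`, the tags, and the sample texts are all made up, and it assumes the text has already been extracted from the file (e.g. by OCR).

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesTagger:
    """Hypothetical helper: multinomial naive Bayes with add-one smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # tag -> word -> count
        self.doc_counts = Counter()              # tag -> number of training docs
        self.vocabulary = set()                  # all words seen in training

    def train(self, text, tag):
        """Add one labelled example document to the model."""
        words = tokenize(text)
        self.word_counts[tag].update(words)
        self.doc_counts[tag] += 1
        self.vocabulary.update(words)

    def predict(self, text):
        """Return the tag with the highest log-probability for the text."""
        words = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_tag, best_score = None, -math.inf
        for tag in self.doc_counts:
            # Log prior: share of training documents carrying this tag.
            score = math.log(self.doc_counts[tag] / total_docs)
            # Add-one smoothing keeps unseen words from zeroing the score.
            denom = sum(self.word_counts[tag].values()) + len(self.vocabulary)
            for word in words:
                score += math.log((self.word_counts[tag][word] + 1) / denom)
            if score > best_score:
                best_tag, best_score = tag, score
        return best_tag


# Made-up training data standing in for already-tagged documents.
tagger = NaiveBayesTagger()
tagger.train("Invoice number 1001, total amount due, payment terms 30 days", "invoice")
tagger.train("Invoice for services, amount payable, VAT included", "invoice")
tagger.train("Employment contract between employer and employee, salary, notice period", "contract")
tagger.train("This agreement is a contract between the parties, termination clause", "contract")

# Text of a newly dropped/scanned document.
predicted_tag = tagger.predict("Please pay the amount due on this invoice")
print(predicted_tag)
```

In a real setup the training examples would come from documents that are already tagged in the EDMS, and the predicted tag would be applied to the new document through whatever mechanism the EDMS offers.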
On Stack Overflow, some people mention approaches that recognize documents visually by their layout, using deep learning with Keras or TensorFlow.
Additional resources I found while researching my case:
Text classification: https://github.com/kk7nc/Text_Classification
Document classification: https://github.com/MITESHPUTHRANNEU/Doc ... sification
What's your opinion on this?
Thanks in advance.
Machine Learning/NLP for document classification and metadata
Re: Machine Learning/NLP for document classification and metadata
Those are exciting thoughts!
I am working on a solution that uses http://cermine.ceon.pl/index.html, as I only work with scientific PDF articles.
Cermine looks at the composition of the text to determine the function of each part, e.g. title, authors, year of publication, etc.
The metadata embedded in PDFs is mostly useless data from the scanner and so on. Very unreliable.
Cheers