Machine Learning/NLP for document classification and metadata

Andro
Posts: 1
Joined: Tue May 17, 2022 9:11 pm

Post by Andro »

Hello,

in addition to https://gitlab.com/mayan-edms/mayan-edms/-/issues/860, I have a feature request: automatic processing of documents by learning from their content and/or layout.
The idea for Mayan EDMS came to me while I was playing around with PySS3 (https://pyss3.readthedocs.io/en/latest/) to automatically classify a large number of documents.
This is especially useful if you want to migrate to a modern EDMS from a very old storage/archiving application that holds a great number of different files (PDF, DOC, EML, MSG, ...) and keeps the document information locked in a separate proprietary database.

It would be a huge benefit to just drop or scan a document into a folder and have it auto-hydrated with information, tags, and metadata.
On Stack Overflow, some people mention approaches that recognize documents visually by their layout, using deep learning with Keras or TensorFlow.
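
To make the "drop a document in and get a tag suggested" idea a bit more concrete, here is a rough sketch of the text-based part. It is not PySS3 and not anything Mayan EDMS provides: it is a plain multinomial Naive Bayes classifier written with the Python standard library only, used as a stand-in. The class name TinyNaiveBayes, the labels "invoice" and "contract", and the sample texts are all made up for illustration, and the text is assumed to be already extracted (e.g. by OCR).

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and keep only alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class TinyNaiveBayes:
    """Hypothetical stand-in classifier (multinomial Naive Bayes)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.doc_counts = Counter()              # label -> number of documents
        self.vocabulary = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.doc_counts[label] += 1
            self.vocabulary.update(tokens)
        return self

    def predict(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            # Log prior of the label plus log likelihood of every token,
            # with add-one (Laplace) smoothing for unseen words.
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            for token in tokens:
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up training data; in practice this would come from documents
# that were already tagged by hand.
training_texts = [
    "Invoice number 4711 total amount due payment within 30 days",
    "Invoice for consulting services amount payable VAT included",
    "This agreement is made between the parties and governed by law",
    "The contract term begins on signature by both parties",
]
training_labels = ["invoice", "invoice", "contract", "contract"]

classifier = TinyNaiveBayes().fit(training_texts, training_labels)
suggested_tag = classifier.predict("Please pay the invoice amount within 14 days")
print(suggested_tag)
```

The predicted label would then be used as a suggested tag or document type; how that gets written back into the EDMS is left open here. A real setup would need far more training data and probably a proper library such as PySS3 instead of this toy class.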

Additional resources I found while researching my case:
Text classification: https://github.com/kk7nc/Text_Classification
Document classification: https://github.com/MITESHPUTHRANNEU/Doc ... sification


What's your opinion on this?


Thanks in advance.
bwakkie
Posts: 70
Joined: Fri Feb 14, 2020 8:28 pm

Re: Machine Learning/NLP for document classification and metadata

Post by bwakkie »

Those are exciting thoughts!

I am working on a solution that uses http://cermine.ceon.pl/index.html, as I only work with scientific PDF articles.
CERMINE looks at the composition of the text to determine the function of each part, e.g. title, authors, year of publication, etc.

The metadata embedded in PDFs is, most of the time, useless data from the scanner and so on. Very unreliable.

Cheers