Document Content Empty

amphetamine
Posts: 14
Joined: Sat Jul 04, 2020 8:43 am

Re: Document Content Empty

Post by amphetamine »

The same problem happens on both the direct install and the Docker install.
After document parsing of a Word file (.doc or .docx extension), the document content is still empty.
I also tried an .odt file (saved from Word), with the same result (content empty).
--
lsmoker
Posts: 24
Joined: Wed Sep 05, 2018 3:52 pm

Re: Document Content Empty

Post by lsmoker »

I noticed this too. The short answer is that this is the way it is coded. See https://gitlab.com/mayan-edms/mayan-edm ... rs.py#L164. Only the 'application/pdf' mimetype is registered with a parser class, so anything other than PDF would need to be converted first.

Of course, it looks like other parsers can be written and registered...
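
To show what I mean by "registered", here is a toy version of a mimetype-to-parser registry. This is only a sketch of the idea, not Mayan's actual Parser class; ToyParser, ToyPdfParser and parsers_for are names I made up for illustration.

```python
# Illustrative only: a toy mimetype -> parser registry, NOT Mayan's real code.
class ToyParser:
    # Maps a mimetype string to the list of parser classes registered for it.
    _registry = {}

    @classmethod
    def register(cls, mimetypes, parser_classes):
        for mimetype in mimetypes:
            cls._registry.setdefault(mimetype, []).extend(parser_classes)

    @classmethod
    def parsers_for(cls, mimetype):
        # Unregistered mimetypes simply get an empty list.
        return list(cls._registry.get(mimetype, []))


class ToyPdfParser(ToyParser):
    """Stand-in for a PDF text extractor."""


# Only PDF is registered, mirroring what the linked file appears to do.
ToyParser.register(
    mimetypes=('application/pdf',), parser_classes=(ToyPdfParser,)
)

DOCX = 'application/vnd.openxmlformats-officedocument.wordprocessingml.document'
print(ToyParser.parsers_for('application/pdf'))  # one parser class
print(ToyParser.parsers_for(DOCX))  # no parser registered, so an empty list
```

With nothing registered for the docx mimetype, no parser runs and the content stays empty.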
---
LeVon Smoker
lsmoker

Re: Document Content Empty

Post by lsmoker »

Found an easy way to make the "Content" appear for more mimetypes (as opposed to only PDF). If you have your own custom app installed, add the code below to your /opt/mayan-edms/mycustomapp/apps.py file:

        # adding this makes the "Content" appear for these mimetypes/formats
        from mayan.apps.document_parsing.parsers import Parser, PopplerParser
        Parser.register(
            mimetypes=(
                'application/pdf',
                'application/msword',
                'application/mswrite',
                'application/mspowerpoint',
                'application/msexcel',
                'application/vnd.ms-excel',
                'application/vnd.ms-excel.addin.macroEnabled.12',
                'application/vnd.ms-excel.sheet.binary.macroEnabled.12',
                'application/vnd.ms-powerpoint',
                'application/vnd.oasis.opendocument.chart',
                'application/vnd.oasis.opendocument.chart-template',
                'application/vnd.oasis.opendocument.formula',
                'application/vnd.oasis.opendocument.formula-template',
                'application/vnd.oasis.opendocument.graphics',
                'application/vnd.oasis.opendocument.graphics-template',
                'application/vnd.oasis.opendocument.image',
                'application/vnd.oasis.opendocument.image-template',
                'application/vnd.oasis.opendocument.presentation',
                'application/vnd.oasis.opendocument.presentation-template',
                'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
                'application/vnd.openxmlformats-officedocument.spreadsheetml.template',
                'application/vnd.openxmlformats-officedocument.presentationml.template',
                'application/vnd.openxmlformats-officedocument.presentationml.slideshow',
                'application/vnd.openxmlformats-officedocument.presentationml.presentation',
                'application/vnd.openxmlformats-officedocument.presentationml.slide',
                'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
                'application/vnd.openxmlformats-officedocument.wordprocessingml.template',
                'application/vnd.oasis.opendocument.spreadsheet',
                'application/vnd.oasis.opendocument.spreadsheet-template',
                'application/vnd.oasis.opendocument.text',
                'application/vnd.oasis.opendocument.text-master',
                'application/vnd.oasis.opendocument.text-template',
                'application/vnd.oasis.opendocument.text-web',
            ),
            parser_classes=(PopplerParser,)
        )
This works for the few documents I tested, so YMMV.

Not sure why these mimetypes are not in the parsers.py file (see previous reply). Devs?
---
LeVon Smoker
rosarior
Developer
Posts: 649
Joined: Tue Aug 21, 2018 3:28 am
Location: Puerto Rico

Re: Document Content Empty

Post by rosarior »

The MIME types for office files were not in the Poppler backend registration because internally, all office documents are converted to PDF before processing (https://gitlab.com/mayan-edms/mayan-edm ... es.py#L178).

Issue #957 (https://gitlab.com/mayan-edms/mayan-edms/-/issues/957) was created to add a test and double check the behavior.

Thanks!
lsmoker

Re: Document Content Empty

Post by lsmoker »

It looks like the mimetype check is being done with the mimetype on the docx document's properties rather than the internal PDF (but I haven't dug through the code yet...)
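
In toy form, this is what I suspect is going on. It is a guess, not Mayan code; REGISTERED, pick_parser and use_converted are all made-up names for illustration.

```python
# Hypothetical illustration of the suspected behavior; none of these names come from Mayan.
REGISTERED = {'application/pdf': 'pdf text extractor'}

DOCX = 'application/vnd.openxmlformats-officedocument.wordprocessingml.document'


def pick_parser(original_mimetype, converted_mimetype, use_converted):
    """Return the registered parser for whichever mimetype the lookup is keyed on."""
    key = converted_mimetype if use_converted else original_mimetype
    return REGISTERED.get(key)


# Lookup keyed on the uploaded file's mimetype: nothing matches.
print(pick_parser(DOCX, 'application/pdf', use_converted=False))

# Lookup keyed on the internal PDF's mimetype: the PDF parser is found.
print(pick_parser(DOCX, 'application/pdf', use_converted=True))
```

If the first case is what actually happens, that would explain why adding the office mimetypes to the registration makes the content show up.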
---
LeVon Smoker