Document Content Empty

FlightdeckRob
Posts: 15
Joined: Wed Oct 03, 2018 3:54 pm

Document Content Empty

Post by FlightdeckRob » Tue Dec 11, 2018 8:30 pm

Hi all,

I was wondering if Mayan is supposed to show "Content" for non-PDF files? My PDFs all have nice, complete "Content" sections, but docx, xls, xlsx, etc. don't have any "Content" and are only searchable using the OCR text (which is sketchy on some of the Excel files).

Page previews, OCR, etc are all working fine. No parsing errors showing in the web interface either.

If this is supposed to work, can anyone give me a starting point for checking things? I went looking for the error.log file and it doesn't exist where the settings say it should be, so either I have no errors or the web interface settings are lying to me about where the error.log file is located.

Any help would be appreciated.

Thanks,

Rob

FlightdeckRob
Posts: 15
Joined: Wed Oct 03, 2018 3:54 pm

Re: Document Content Empty

Post by FlightdeckRob » Wed Dec 19, 2018 3:57 pm

Nobody else running Mayan in Docker is having this issue? It happens every time I install with Docker.

I just did a clean install on a new VM with Docker and found that the document_cache folder was not automatically created so I wasn't even getting the preview images or the OCR. I was also seeing this on the Docker demo in PWD. Added the folder to the Docker volume and that stuff started working.
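A quick way to spot a missing or unwritable subfolder before uploading anything is sketched below. It is only an illustration: the folder names are the two mentioned in this thread, the helper `check_media_root` is hypothetical, and the path in the usage line is an assumed host-side volume location that should be replaced with your own.

```python
import os

# Folder names taken from this thread; the real set may differ by Mayan version.
EXPECTED_SUBFOLDERS = ("document_cache", "document_storage")


def check_media_root(media_root, subfolders=EXPECTED_SUBFOLDERS):
    """Return {name: status}, where status is 'ok', 'missing' or 'not writable'."""
    report = {}
    for name in subfolders:
        path = os.path.join(media_root, name)
        if not os.path.isdir(path):
            report[name] = "missing"
        elif not os.access(path, os.W_OK | os.X_OK):
            # Checked for the user running this script, not the container user.
            report[name] = "not writable"
        else:
            report[name] = "ok"
    return report


# Assumed host path of the media volume; adjust to your own setup.
print(check_media_root("/docker-volumes/mayan-edms/media"))
```

Note that the writability check applies to whoever runs the script, so it is most meaningful when run as the same user as the container process.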

Is there another folder that stores the "Content" information that might be missing?

Any help would be really appreciated.

Thanks,

Rob

KevinPawsey
Posts: 85
Joined: Wed Aug 22, 2018 2:52 pm

Re: Document Content Empty

Post by KevinPawsey » Thu Dec 20, 2018 10:33 am

Hi Rob,

I run Mayan on an x86 docker install... all seems to be working for me.

Could it possibly be a permissions issue with where the folders are being created? Make sure that the user running the Docker container can write to the disk root of wherever the folders are being created. Also, are there any errors in the Docker logs?

docker logs -f [docker_container]
Hope that helps.


Kevin
Running Mayan-EDMS on: OpenMediaVault, (Docker plugin), on x86 dual-core

FlightdeckRob
Posts: 15
Joined: Wed Oct 03, 2018 3:54 pm

Re: Document Content Empty

Post by FlightdeckRob » Thu Dec 20, 2018 2:53 pm

Kevin,

I'm running docker-compose according to the compose yml in the gitlab repo.

I'm continuously getting errors like these in the db container:

2018-12-20 11:00:00.089 UTC [86] ERROR:  column "document_version__document__date_added" does not exist at character 29

2018-12-20 11:00:00.089 UTC [86] STATEMENT:  SELECT (date_trunc('month', document_version__document__date_added)) AS "d", COUNT("documents_documentpage"."id") AS "agg" FROM "documents_documentpage" INNER JOIN "documents_documentversion" ON ("documents_documentpage"."document_version_id" = "documents_documentversion"."id") INNER JOIN "documents_document" ON ("documents_documentversion"."document_id" = "documents_document"."id") WHERE "documents_document"."date_added" BETWEEN '2018-01-01T00:00:00+00:00'::timestamptz AND '2018-12-31T23:59:59.999999+00:00'::timestamptz GROUP BY (date_trunc('month', document_version__document__date_added))

2018-12-20 12:00:00.051 UTC [86] ERROR:  column "document__date_added" does not exist at character 29

2018-12-20 12:00:00.051 UTC [86] STATEMENT:  SELECT (date_trunc('month', document__date_added)) AS "d", COUNT("documents_documentversion"."id") AS "agg" FROM "documents_documentversion" INNER JOIN "documents_document" ON ("documents_documentversion"."document_id" = "documents_document"."id") WHERE "documents_document"."date_added" BETWEEN '2018-01-01T00:00:00+00:00'::timestamptz AND '2018-12-31T23:59:59.999999+00:00'::timestamptz GROUP BY (date_trunc('month', document__date_added))

2018-12-20 12:00:00.070 UTC [86] ERROR:  column "document_version__document__date_added" does not exist at character 29

2018-12-20 12:00:00.070 UTC [86] STATEMENT:  SELECT (date_trunc('month', document_version__document__date_added)) AS "d", COUNT("documents_documentpage"."id") AS "agg" FROM "documents_documentpage" INNER JOIN "documents_documentversion" ON ("documents_documentpage"."document_version_id" = "documents_documentversion"."id") INNER JOIN "documents_document" ON ("documents_documentversion"."document_id" = "documents_document"."id") WHERE "documents_document"."date_added" BETWEEN '2018-01-01T00:00:00+00:00'::timestamptz AND '2018-12-31T23:59:59.999999+00:00'::timestamptz GROUP BY (date_trunc('month', document_version__document__date_added))

2018-12-20 13:00:00.057 UTC [86] ERROR:  column "document__date_added" does not exist at character 29

2018-12-20 13:00:00.057 UTC [86] STATEMENT:  SELECT (date_trunc('month', document__date_added)) AS "d", COUNT("documents_documentversion"."id") AS "agg" FROM "documents_documentversion" INNER JOIN "documents_document" ON ("documents_documentversion"."document_id" = "documents_document"."id") WHERE "documents_document"."date_added" BETWEEN '2018-01-01T00:00:00+00:00'::timestamptz AND '2018-12-31T23:59:59.999999+00:00'::timestamptz GROUP BY (date_trunc('month', document__date_added))

2018-12-20 13:00:00.076 UTC [86] ERROR:  column "document_version__document__date_added" does not exist at character 29

2018-12-20 13:00:00.076 UTC [86] STATEMENT:  SELECT (date_trunc('month', document_version__document__date_added)) AS "d", COUNT("documents_documentpage"."id") AS "agg" FROM "documents_documentpage" INNER JOIN "documents_documentversion" ON ("documents_documentpage"."document_version_id" = "documents_documentversion"."id") INNER JOIN "documents_document" ON ("documents_documentversion"."document_id" = "documents_document"."id") WHERE "documents_document"."date_added" BETWEEN '2018-01-01T00:00:00+00:00'::timestamptz AND '2018-12-31T23:59:59.999999+00:00'::timestamptz GROUP BY (date_trunc('month', document_version__document__date_added))

2018-12-20 14:00:00.068 UTC [86] ERROR:  column "document__date_added" does not exist at character 29

2018-12-20 14:00:00.068 UTC [86] STATEMENT:  SELECT (date_trunc('month', document__date_added)) AS "d", COUNT("documents_documentversion"."id") AS "agg" FROM "documents_documentversion" INNER JOIN "documents_document" ON ("documents_documentversion"."document_id" = "documents_document"."id") WHERE "documents_document"."date_added" BETWEEN '2018-01-01T00:00:00+00:00'::timestamptz AND '2018-12-31T23:59:59.999999+00:00'::timestamptz GROUP BY (date_trunc('month', document__date_added))

2018-12-20 14:00:00.090 UTC [86] ERROR:  column "document_version__document__date_added" does not exist at character 29

2018-12-20 14:00:00.090 UTC [86] STATEMENT:  SELECT (date_trunc('month', document_version__document__date_added)) AS "d", COUNT("documents_documentpage"."id") AS "agg" FROM "documents_documentpage" INNER JOIN "documents_documentversion" ON ("documents_documentpage"."document_version_id" = "documents_documentversion"."id") INNER JOIN "documents_document" ON ("documents_documentversion"."document_id" = "documents_document"."id") WHERE "documents_document"."date_added" BETWEEN '2018-01-01T00:00:00+00:00'::timestamptz AND '2018-12-31T23:59:59.999999+00:00'::timestamptz GROUP BY (date_trunc('month', document_version__document__date_added))
And these errors in the app container:

[2018-12-20 14:00:00,039: ERROR/MainProcess] Task mayan_statistics.tasks.task_execute_statistic[f7a7f683-2eb3-4bf8-ac82-849319c2084e] raised unexpected: IndexError('list index out of range',)
Traceback (most recent call last):
  File "/opt/mayan-edms/local/lib/python2.7/site-packages/celery/app/trace.py", line 240, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/opt/mayan-edms/local/lib/python2.7/site-packages/celery/app/trace.py", line 438, in __protected_call__
    return self.run(*args, **kwargs)
  File "/opt/mayan-edms/local/lib/python2.7/site-packages/mayan/apps/mayan_statistics/tasks.py", line 16, in task_execute_statistic
    Statistic.get(slug=slug).execute()
  File "/opt/mayan-edms/local/lib/python2.7/site-packages/mayan/apps/mayan_statistics/classes.py", line 120, in execute
    self.store_results(results=self.func())
  File "/opt/mayan-edms/local/lib/python2.7/site-packages/mayan/apps/documents/statistics.py", line 146, in total_document_per_month
    ): qss.until(datetime.date(year, next_month, 1))
IndexError: list index out of range
[2018-12-20 14:00:00,059: ERROR/MainProcess] Task mayan_statistics.tasks.task_execute_statistic[86b76329-5d32-4301-af09-eadd05349e00] raised unexpected: IndexError('list index out of range',)
Traceback (most recent call last):
  File "/opt/mayan-edms/local/lib/python2.7/site-packages/celery/app/trace.py", line 240, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/opt/mayan-edms/local/lib/python2.7/site-packages/celery/app/trace.py", line 438, in __protected_call__
    return self.run(*args, **kwargs)
  File "/opt/mayan-edms/local/lib/python2.7/site-packages/mayan/apps/mayan_statistics/tasks.py", line 16, in task_execute_statistic
    Statistic.get(slug=slug).execute()
  File "/opt/mayan-edms/local/lib/python2.7/site-packages/mayan/apps/mayan_statistics/classes.py", line 120, in execute
    self.store_results(results=self.func())
  File "/opt/mayan-edms/local/lib/python2.7/site-packages/mayan/apps/documents/statistics.py", line 183, in total_document_version_per_month
    ): qss.until(datetime.date(year, next_month, 1))
IndexError: list index out of range
[2018-12-20 14:00:00,066: ERROR/MainProcess] Task mayan_statistics.tasks.task_execute_statistic[c2703399-8017-467b-b1e1-973abd1aab38] raised unexpected: IndexError('list index out of range',)
Traceback (most recent call last):
  File "/opt/mayan-edms/local/lib/python2.7/site-packages/celery/app/trace.py", line 240, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/opt/mayan-edms/local/lib/python2.7/site-packages/celery/app/trace.py", line 438, in __protected_call__
    return self.run(*args, **kwargs)
  File "/opt/mayan-edms/local/lib/python2.7/site-packages/mayan/apps/mayan_statistics/tasks.py", line 16, in task_execute_statistic
    Statistic.get(slug=slug).execute()
  File "/opt/mayan-edms/local/lib/python2.7/site-packages/mayan/apps/mayan_statistics/classes.py", line 120, in execute
    self.store_results(results=self.func())
  File "/opt/mayan-edms/local/lib/python2.7/site-packages/mayan/apps/documents/statistics.py", line 34, in new_documents_per_month
    qss.time_series(start=this_year, end=today, interval='months')
  File "/opt/mayan-edms/local/lib/python2.7/site-packages/mayan/apps/documents/statistics.py", line 33, in <lambda>
    lambda x: {force_text(MONTH_NAMES[x[0].month]): x[1]},
IndexError: list index out of range
[2018-12-20 14:00:00,087: ERROR/MainProcess] Task mayan_statistics.tasks.task_execute_statistic[236d3ff4-3517-4fa9-8af5-ca0746d3d05f] raised unexpected: IndexError('list index out of range',)
Traceback (most recent call last):
  File "/opt/mayan-edms/local/lib/python2.7/site-packages/celery/app/trace.py", line 240, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/opt/mayan-edms/local/lib/python2.7/site-packages/celery/app/trace.py", line 438, in __protected_call__
    return self.run(*args, **kwargs)
  File "/opt/mayan-edms/local/lib/python2.7/site-packages/mayan/apps/mayan_statistics/tasks.py", line 16, in task_execute_statistic
    Statistic.get(slug=slug).execute()
  File "/opt/mayan-edms/local/lib/python2.7/site-packages/mayan/apps/mayan_statistics/classes.py", line 120, in execute
    self.store_results(results=self.func())
  File "/opt/mayan-edms/local/lib/python2.7/site-packages/mayan/apps/documents/statistics.py", line 96, in new_document_versions_per_month
    qss.time_series(start=this_year, end=today, interval='months')
  File "/opt/mayan-edms/local/lib/python2.7/site-packages/mayan/apps/documents/statistics.py", line 95, in <lambda>
    lambda x: {force_text(MONTH_NAMES[x[0].month]): x[1]},
IndexError: list index out of range
[2018-12-20 14:00:00,114: ERROR/MainProcess] Task mayan_statistics.tasks.task_execute_statistic[c7ebc09c-0dbd-48b6-aa0c-9f654bdb5d93] raised unexpected: IndexError('list index out of range',)
Traceback (most recent call last):
  File "/opt/mayan-edms/local/lib/python2.7/site-packages/celery/app/trace.py", line 240, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/opt/mayan-edms/local/lib/python2.7/site-packages/celery/app/trace.py", line 438, in __protected_call__
    return self.run(*args, **kwargs)
  File "/opt/mayan-edms/local/lib/python2.7/site-packages/mayan/apps/mayan_statistics/tasks.py", line 16, in task_execute_statistic
    Statistic.get(slug=slug).execute()
  File "/opt/mayan-edms/local/lib/python2.7/site-packages/mayan/apps/mayan_statistics/classes.py", line 120, in execute
    self.store_results(results=self.func())
  File "/opt/mayan-edms/local/lib/python2.7/site-packages/mayan/apps/documents/statistics.py", line 56, in new_document_pages_per_month
    qss.time_series(start=this_year, end=today, interval='months')
  File "/opt/mayan-edms/local/lib/python2.7/site-packages/mayan/apps/documents/statistics.py", line 55, in <lambda>
    lambda x: {force_text(MONTH_NAMES[x[0].month]): x[1]},
IndexError: list index out of range
[2018-12-20 14:00:00,139: ERROR/MainProcess] Task mayan_statistics.tasks.task_execute_statistic[6d41810c-def5-4b7f-8135-463906f662c1] raised unexpected: IndexError('list index out of range',)
Traceback (most recent call last):
  File "/opt/mayan-edms/local/lib/python2.7/site-packages/celery/app/trace.py", line 240, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/opt/mayan-edms/local/lib/python2.7/site-packages/celery/app/trace.py", line 438, in __protected_call__
    return self.run(*args, **kwargs)
  File "/opt/mayan-edms/local/lib/python2.7/site-packages/mayan/apps/mayan_statistics/tasks.py", line 16, in task_execute_statistic
    Statistic.get(slug=slug).execute()
  File "/opt/mayan-edms/local/lib/python2.7/site-packages/mayan/apps/mayan_statistics/classes.py", line 120, in execute
    self.store_results(results=self.func())
  File "/opt/mayan-edms/local/lib/python2.7/site-packages/mayan/apps/documents/statistics.py", line 220, in total_document_page_per_month
    ): qss.until(datetime.date(year, next_month, 1))
IndexError: list index out of range
[2018-12-20 14:40:18 +0000] [154] [INFO] Autorestarting worker after current request.
[2018-12-20 14:40:19 +0000] [154] [INFO] Worker exiting (pid: 154)
[2018-12-20 14:40:20 +0000] [158] [INFO] Booting worker with pid: 158
I really have no idea how to interpret these, other than that maybe the DB is corrupted and that's causing the app errors. There don't seem to be any permission errors, since Docker is installed as root and the document_cache and document_storage folders are both being written to.
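One plausible reading of the app-container tracebacks, offered as a hypothesis rather than a confirmed diagnosis: they come from the statistics task, not from document parsing, and several fail on `MONTH_NAMES[x[0].month]` while the log dates are in December. If `MONTH_NAMES` is a 12-element, zero-indexed list (an assumption; its contents are not shown in the thread), month 12 would be out of range. The sketch below reproduces that pattern with a stand-in list and hypothetical helper names.

```python
import datetime

# Hypothetical stand-in for the MONTH_NAMES list named in the traceback.
MONTH_NAMES = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]


def month_label_buggy(day):
    # Mirrors MONTH_NAMES[x[0].month]: months run 1-12, list indexes run 0-11,
    # so December (12) raises IndexError.
    return MONTH_NAMES[day.month]


def month_label_fixed(day):
    # Shift the 1-based month to a 0-based list index.
    return MONTH_NAMES[day.month - 1]


december = datetime.date(2018, 12, 20)
try:
    month_label_buggy(december)
except IndexError as exc:
    print("buggy lookup failed:", exc)
print("fixed lookup:", month_label_fixed(december))
```

If this reading is right, the statistics errors would be a separate problem from the empty "Content" field.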

Thanks for your suggestions!

Rob

FlightdeckRob
Posts: 15
Joined: Wed Oct 03, 2018 3:54 pm

Re: Document Content Empty

Post by FlightdeckRob » Thu Dec 20, 2018 4:57 pm

Did some more digging on a new install of Mayan. Test files used to check parsing were docx and txt files.
-Created new Debian 9 VM with root access.
-Loaded docker and docker-compose under root user
-Installed Mayan-EDMS via docker-compose.yml available through gitlab (4 containers).

This process gave the same issues: Mayan created all of the subfolders EXCEPT document_cache. After I created that folder manually it was able to preview and OCR the documents, but still no document "Content" was parsed.

Removed all volumes, images, containers, etc

Then I tried following the 2 container instructions from the documentation:

Using a dedicated Docker network
Use this method to avoid having to expose the PostgreSQL port to the host’s network, or if you have other PostgreSQL instances but still want to use the default port of 5432 for this installation.

Create the network:

docker network create mayan
Launch the PostgreSQL container with the network option and remove the port binding (-p 5432:5432):

docker run -d \
--name mayan-edms-postgres \
--network=mayan \
--restart=always \
-e POSTGRES_USER=mayan \
-e POSTGRES_DB=mayan \
-e POSTGRES_PASSWORD=mayanuserpass \
-v /docker-volumes/mayan-edms/postgres:/var/lib/postgresql/data \
-d postgres:9.5
Launch the Mayan EDMS container with the network option and change the database hostname to the PostgreSQL container name (mayan-edms-postgres) instead of the IP address of the Docker host (172.17.0.1):

docker run -d \
--name mayan-edms \
--network=mayan \
--restart=always \
-p 80:8000 \
-e MAYAN_DATABASE_ENGINE=django.db.backends.postgresql \
-e MAYAN_DATABASE_HOST=mayan-edms-postgres \
-e MAYAN_DATABASE_NAME=mayan \
-e MAYAN_DATABASE_PASSWORD=mayanuserpass \
-e MAYAN_DATABASE_USER=mayan \
-e MAYAN_DATABASE_CONN_MAX_AGE=60 \
-v /docker-volumes/mayan-edms/media:/var/lib/mayan \
mayanedms/mayanedms:latest
To be sure there weren't any permission issues, I set the Docker volume folder to 777 permissions. This gave the same result: document_storage was created to store the first file on upload, but the document_cache folder was not created, so there was no OCR and no previews on startup. After manually creating the folder, previews and OCR worked. Still no parsed document content.

Update:
I checked the logs of the DB container and found this, with only 2 files uploaded and no upload errors:

LOG:  database system was shut down at 2018-12-20 16:26:28 UTC
LOG:  MultiXact member wraparound protections are now enabled
LOG:  autovacuum launcher started
LOG:  database system is ready to accept connections
ERROR:  column "document__date_added" does not exist at character 29
STATEMENT:  SELECT (date_trunc('month', document__date_added)) AS "d", COUNT("documents_documentversion"."id") AS "agg" FROM "documents_documentversion" INNER JOIN "documents_document" ON ("documents_documentversion"."document_id" = "documents_document"."id") WHERE "documents_document"."date_added" BETWEEN '2018-01-01T00:00:00+00:00'::timestamptz AND '2018-12-31T23:59:59.999999+00:00'::timestamptz GROUP BY (date_trunc('month', document__date_added))
ERROR:  column "document_version__document__date_added" does not exist at character 29
STATEMENT:  SELECT (date_trunc('month', document_version__document__date_added)) AS "d", COUNT("documents_documentpage"."id") AS "agg" FROM "documents_documentpage" INNER JOIN "documents_documentversion" ON ("documents_documentpage"."document_version_id" = "documents_documentversion"."id") INNER JOIN "documents_document" ON ("documents_documentversion"."document_id" = "documents_document"."id") WHERE "documents_document"."date_added" BETWEEN '2018-01-01T00:00:00+00:00'::timestamptz AND '2018-12-31T23:59:59.999999+00:00'::timestamptz GROUP BY (date_trunc('month', document_version__document__date_added))
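The database errors above have a recognisable shape: the failing statements pass an ORM-style lookup path (`document__date_added`) to `date_trunc`, while the same statements filter on the real column, `"documents_document"."date_added"`. The sketch below reproduces that mismatch under loud assumptions: SQLite stands in for PostgreSQL, `strftime` stands in for `date_trunc`, the tables are cut down to the columns the query touches, and `run` is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE documents_document (id INTEGER PRIMARY KEY, date_added TEXT);
    CREATE TABLE documents_documentversion (
        id INTEGER PRIMARY KEY,
        document_id INTEGER REFERENCES documents_document (id)
    );
    INSERT INTO documents_document VALUES (1, '2018-12-11 20:30:00');
    INSERT INTO documents_documentversion VALUES (1, 1), (2, 1);
""")

# Same query shape as the logged statement, with the grouping column left open.
QUERY = """
    SELECT strftime('%Y-%m', {column}) AS d, COUNT(v.id) AS agg
    FROM documents_documentversion AS v
    INNER JOIN documents_document AS doc ON v.document_id = doc.id
    GROUP BY strftime('%Y-%m', {column})
"""


def run(column):
    """Run the monthly count, grouping on the given column expression."""
    return conn.execute(QUERY.format(column=column)).fetchall()


try:
    run("document__date_added")  # lookup path, not a real column
except sqlite3.OperationalError as exc:
    print("lookup path fails:", exc)

print("real column works:", run("doc.date_added"))
```

This only illustrates why the database rejects the statement; where the lookup path leaks into the SQL in Mayan's own code is not established by the thread.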

rosarior
Posts: 393
Joined: Tue Aug 21, 2018 3:28 am

Re: Document Content Empty

Post by rosarior » Tue Jan 29, 2019 7:37 am

Thanks for finding the source of the issue and opening the tickets; that will make these easier to solve.

Tickets:
https://gitlab.com/mayan-edms/mayan-edms/issues/549
https://gitlab.com/mayan-edms/mayan-edms/issues/550

riopangeran
Posts: 10
Joined: Thu Sep 12, 2019 8:46 am

Re: Document Content Empty

Post by riopangeran » Thu Sep 12, 2019 8:53 am

Hi Rob,

Sorry to bump this thread again, but would you mind sharing how you solved the problem of empty content for non-PDF docs? I have the same problem: preview and OCR text are all good, only the content is missing.

Documents I have tried to upload so far: txt, docx, and xlsx. The resulting content is always empty.

I use Mayan v3.2.7 with Docker.

rosarior
Posts: 393
Joined: Tue Aug 21, 2018 3:28 am

Re: Document Content Empty

Post by rosarior » Thu Sep 19, 2019 3:13 pm

Hi,

No problem with bumping old topics, that's what the forum is for! :)

Can you share a document that exhibits the problem so that we can test it locally? Anything without confidential information. If you can trigger it using a public document from the web, even better. Thanks.

riopangeran
Posts: 10
Joined: Thu Sep 12, 2019 8:46 am

Re: Document Content Empty

Post by riopangeran » Thu Oct 03, 2019 8:35 am

Hi, sorry for my delay in responding; I just came back from duty.

Unfortunately, I just removed the Docker installation of Mayan and am now trying to reinstall Mayan using direct deployment.

But as far as I remember, the document I uploaded was just a simple, newly created MS Word document with one line of random words, saved with the docx extension.

When I upload it to Mayan, the OCR works well, but the document parsing does not.

Any advice for this?

Thank you.

riopangeran
Posts: 10
Joined: Thu Sep 12, 2019 8:46 am

Re: Document Content Empty

Post by riopangeran » Fri Oct 11, 2019 8:10 am

Hi Rosario,

I just finished deploying this great app using direct deployment.

I tested the parsing function again but still couldn't find the answer; the content is still empty.

These are the sample files I uploaded to Mayan:
https://1drv.ms/u/s!ApaK9u60Bn-xhdlM8LN ... w?e=8FAoi4

I also found this error from PostgreSQL:
https://prnt.sc/pht3ub

Thank you..
