Document Content Empty

FlightdeckRob
Posts: 14
Joined: Wed Oct 03, 2018 3:54 pm

Post by FlightdeckRob » Tue Dec 11, 2018 8:30 pm

Hi all,

I was wondering whether Mayan is supposed to show "Content" for non-PDF files? My PDFs all have nice and complete "Content" sections, but docx, xls, xlsx, etc. don't have any "Content" and are only searchable using the OCR text (which is sketchy on some of the Excel files).

Page previews, OCR, etc. are all working fine. No parsing errors are showing in the web interface either.

If this is supposed to work, can anyone give me a starting point for checking things? I went looking for the error.log file and it doesn't exist where the settings say it should be, so either I have no errors or the web interface settings are lying to me about where the error.log file is located.

Any help would be appreciated.

Thanks,

Rob

FlightdeckRob
Posts: 14
Joined: Wed Oct 03, 2018 3:54 pm

Re: Document Content Empty

Post by FlightdeckRob » Wed Dec 19, 2018 3:57 pm

Nobody else running Mayan in Docker is having this issue? It happens every time I install with Docker.

I just did a clean install on a new VM with Docker and found that the document_cache folder was not automatically created so I wasn't even getting the preview images or the OCR. I was also seeing this on the Docker demo in PWD. Added the folder to the Docker volume and that stuff started working.

Is there another folder that stores the "Content" information that might be missing?

Any help would be really appreciated.

Thanks,

Rob

KevinPawsey
Posts: 62
Joined: Wed Aug 22, 2018 2:52 pm

Re: Document Content Empty

Post by KevinPawsey » Thu Dec 20, 2018 10:33 am

Hi Rob,

I run Mayan on an x86 docker install... all seems to be working for me.

Could it possibly be a permissions issue with where the folders are being created? Make sure that the user running the Docker container can write to the root directory of wherever the folders are being created. Also, are there any errors in the Docker logs?

docker logs -f [docker_container]
Hope that helps.


Kevin
Running Mayan-EDMS on: OpenMediaVault, (Docker plugin), on x86 dual-core

FlightdeckRob
Posts: 14
Joined: Wed Oct 03, 2018 3:54 pm

Re: Document Content Empty

Post by FlightdeckRob » Thu Dec 20, 2018 2:53 pm

Kevin,

I'm running docker-compose according to the docker-compose.yml in the GitLab repo.

I'm continuously getting errors like these in the db container:

2018-12-20 11:00:00.089 UTC [86] ERROR:  column "document_version__document__date_added" does not exist at character 29

2018-12-20 11:00:00.089 UTC [86] STATEMENT:  SELECT (date_trunc('month', document_version__document__date_added)) AS "d", COUNT("documents_documentpage"."id") AS "agg" FROM "documents_documentpage" INNER JOIN "documents_documentversion" ON ("documents_documentpage"."document_version_id" = "documents_documentversion"."id") INNER JOIN "documents_document" ON ("documents_documentversion"."document_id" = "documents_document"."id") WHERE "documents_document"."date_added" BETWEEN '2018-01-01T00:00:00+00:00'::timestamptz AND '2018-12-31T23:59:59.999999+00:00'::timestamptz GROUP BY (date_trunc('month', document_version__document__date_added))

2018-12-20 12:00:00.051 UTC [86] ERROR:  column "document__date_added" does not exist at character 29

2018-12-20 12:00:00.051 UTC [86] STATEMENT:  SELECT (date_trunc('month', document__date_added)) AS "d", COUNT("documents_documentversion"."id") AS "agg" FROM "documents_documentversion" INNER JOIN "documents_document" ON ("documents_documentversion"."document_id" = "documents_document"."id") WHERE "documents_document"."date_added" BETWEEN '2018-01-01T00:00:00+00:00'::timestamptz AND '2018-12-31T23:59:59.999999+00:00'::timestamptz GROUP BY (date_trunc('month', document__date_added))

2018-12-20 12:00:00.070 UTC [86] ERROR:  column "document_version__document__date_added" does not exist at character 29

2018-12-20 12:00:00.070 UTC [86] STATEMENT:  SELECT (date_trunc('month', document_version__document__date_added)) AS "d", COUNT("documents_documentpage"."id") AS "agg" FROM "documents_documentpage" INNER JOIN "documents_documentversion" ON ("documents_documentpage"."document_version_id" = "documents_documentversion"."id") INNER JOIN "documents_document" ON ("documents_documentversion"."document_id" = "documents_document"."id") WHERE "documents_document"."date_added" BETWEEN '2018-01-01T00:00:00+00:00'::timestamptz AND '2018-12-31T23:59:59.999999+00:00'::timestamptz GROUP BY (date_trunc('month', document_version__document__date_added))

2018-12-20 13:00:00.057 UTC [86] ERROR:  column "document__date_added" does not exist at character 29

2018-12-20 13:00:00.057 UTC [86] STATEMENT:  SELECT (date_trunc('month', document__date_added)) AS "d", COUNT("documents_documentversion"."id") AS "agg" FROM "documents_documentversion" INNER JOIN "documents_document" ON ("documents_documentversion"."document_id" = "documents_document"."id") WHERE "documents_document"."date_added" BETWEEN '2018-01-01T00:00:00+00:00'::timestamptz AND '2018-12-31T23:59:59.999999+00:00'::timestamptz GROUP BY (date_trunc('month', document__date_added))

2018-12-20 13:00:00.076 UTC [86] ERROR:  column "document_version__document__date_added" does not exist at character 29

2018-12-20 13:00:00.076 UTC [86] STATEMENT:  SELECT (date_trunc('month', document_version__document__date_added)) AS "d", COUNT("documents_documentpage"."id") AS "agg" FROM "documents_documentpage" INNER JOIN "documents_documentversion" ON ("documents_documentpage"."document_version_id" = "documents_documentversion"."id") INNER JOIN "documents_document" ON ("documents_documentversion"."document_id" = "documents_document"."id") WHERE "documents_document"."date_added" BETWEEN '2018-01-01T00:00:00+00:00'::timestamptz AND '2018-12-31T23:59:59.999999+00:00'::timestamptz GROUP BY (date_trunc('month', document_version__document__date_added))

2018-12-20 14:00:00.068 UTC [86] ERROR:  column "document__date_added" does not exist at character 29

2018-12-20 14:00:00.068 UTC [86] STATEMENT:  SELECT (date_trunc('month', document__date_added)) AS "d", COUNT("documents_documentversion"."id") AS "agg" FROM "documents_documentversion" INNER JOIN "documents_document" ON ("documents_documentversion"."document_id" = "documents_document"."id") WHERE "documents_document"."date_added" BETWEEN '2018-01-01T00:00:00+00:00'::timestamptz AND '2018-12-31T23:59:59.999999+00:00'::timestamptz GROUP BY (date_trunc('month', document__date_added))

2018-12-20 14:00:00.090 UTC [86] ERROR:  column "document_version__document__date_added" does not exist at character 29

2018-12-20 14:00:00.090 UTC [86] STATEMENT:  SELECT (date_trunc('month', document_version__document__date_added)) AS "d", COUNT("documents_documentpage"."id") AS "agg" FROM "documents_documentpage" INNER JOIN "documents_documentversion" ON ("documents_documentpage"."document_version_id" = "documents_documentversion"."id") INNER JOIN "documents_document" ON ("documents_documentversion"."document_id" = "documents_document"."id") WHERE "documents_document"."date_added" BETWEEN '2018-01-01T00:00:00+00:00'::timestamptz AND '2018-12-31T23:59:59.999999+00:00'::timestamptz GROUP BY (date_trunc('month', document_version__document__date_added))
And these errors in the app container:

[2018-12-20 14:00:00,039: ERROR/MainProcess] Task mayan_statistics.tasks.task_execute_statistic[f7a7f683-2eb3-4bf8-ac82-849319c2084e] raised unexpected: IndexError('list index out of range',)
Traceback (most recent call last):
  File "/opt/mayan-edms/local/lib/python2.7/site-packages/celery/app/trace.py", line 240, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/opt/mayan-edms/local/lib/python2.7/site-packages/celery/app/trace.py", line 438, in __protected_call__
    return self.run(*args, **kwargs)
  File "/opt/mayan-edms/local/lib/python2.7/site-packages/mayan/apps/mayan_statistics/tasks.py", line 16, in task_execute_statistic
    Statistic.get(slug=slug).execute()
  File "/opt/mayan-edms/local/lib/python2.7/site-packages/mayan/apps/mayan_statistics/classes.py", line 120, in execute
    self.store_results(results=self.func())
  File "/opt/mayan-edms/local/lib/python2.7/site-packages/mayan/apps/documents/statistics.py", line 146, in total_document_per_month
    ): qss.until(datetime.date(year, next_month, 1))
IndexError: list index out of range
[2018-12-20 14:00:00,059: ERROR/MainProcess] Task mayan_statistics.tasks.task_execute_statistic[86b76329-5d32-4301-af09-eadd05349e00] raised unexpected: IndexError('list index out of range',)
Traceback (most recent call last):
  File "/opt/mayan-edms/local/lib/python2.7/site-packages/celery/app/trace.py", line 240, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/opt/mayan-edms/local/lib/python2.7/site-packages/celery/app/trace.py", line 438, in __protected_call__
    return self.run(*args, **kwargs)
  File "/opt/mayan-edms/local/lib/python2.7/site-packages/mayan/apps/mayan_statistics/tasks.py", line 16, in task_execute_statistic
    Statistic.get(slug=slug).execute()
  File "/opt/mayan-edms/local/lib/python2.7/site-packages/mayan/apps/mayan_statistics/classes.py", line 120, in execute
    self.store_results(results=self.func())
  File "/opt/mayan-edms/local/lib/python2.7/site-packages/mayan/apps/documents/statistics.py", line 183, in total_document_version_per_month
    ): qss.until(datetime.date(year, next_month, 1))
IndexError: list index out of range
[2018-12-20 14:00:00,066: ERROR/MainProcess] Task mayan_statistics.tasks.task_execute_statistic[c2703399-8017-467b-b1e1-973abd1aab38] raised unexpected: IndexError('list index out of range',)
Traceback (most recent call last):
  File "/opt/mayan-edms/local/lib/python2.7/site-packages/celery/app/trace.py", line 240, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/opt/mayan-edms/local/lib/python2.7/site-packages/celery/app/trace.py", line 438, in __protected_call__
    return self.run(*args, **kwargs)
  File "/opt/mayan-edms/local/lib/python2.7/site-packages/mayan/apps/mayan_statistics/tasks.py", line 16, in task_execute_statistic
    Statistic.get(slug=slug).execute()
  File "/opt/mayan-edms/local/lib/python2.7/site-packages/mayan/apps/mayan_statistics/classes.py", line 120, in execute
    self.store_results(results=self.func())
  File "/opt/mayan-edms/local/lib/python2.7/site-packages/mayan/apps/documents/statistics.py", line 34, in new_documents_per_month
    qss.time_series(start=this_year, end=today, interval='months')
  File "/opt/mayan-edms/local/lib/python2.7/site-packages/mayan/apps/documents/statistics.py", line 33, in <lambda>
    lambda x: {force_text(MONTH_NAMES[x[0].month]): x[1]},
IndexError: list index out of range
[2018-12-20 14:00:00,087: ERROR/MainProcess] Task mayan_statistics.tasks.task_execute_statistic[236d3ff4-3517-4fa9-8af5-ca0746d3d05f] raised unexpected: IndexError('list index out of range',)
Traceback (most recent call last):
  File "/opt/mayan-edms/local/lib/python2.7/site-packages/celery/app/trace.py", line 240, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/opt/mayan-edms/local/lib/python2.7/site-packages/celery/app/trace.py", line 438, in __protected_call__
    return self.run(*args, **kwargs)
  File "/opt/mayan-edms/local/lib/python2.7/site-packages/mayan/apps/mayan_statistics/tasks.py", line 16, in task_execute_statistic
    Statistic.get(slug=slug).execute()
  File "/opt/mayan-edms/local/lib/python2.7/site-packages/mayan/apps/mayan_statistics/classes.py", line 120, in execute
    self.store_results(results=self.func())
  File "/opt/mayan-edms/local/lib/python2.7/site-packages/mayan/apps/documents/statistics.py", line 96, in new_document_versions_per_month
    qss.time_series(start=this_year, end=today, interval='months')
  File "/opt/mayan-edms/local/lib/python2.7/site-packages/mayan/apps/documents/statistics.py", line 95, in <lambda>
    lambda x: {force_text(MONTH_NAMES[x[0].month]): x[1]},
IndexError: list index out of range
[2018-12-20 14:00:00,114: ERROR/MainProcess] Task mayan_statistics.tasks.task_execute_statistic[c7ebc09c-0dbd-48b6-aa0c-9f654bdb5d93] raised unexpected: IndexError('list index out of range',)
Traceback (most recent call last):
  File "/opt/mayan-edms/local/lib/python2.7/site-packages/celery/app/trace.py", line 240, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/opt/mayan-edms/local/lib/python2.7/site-packages/celery/app/trace.py", line 438, in __protected_call__
    return self.run(*args, **kwargs)
  File "/opt/mayan-edms/local/lib/python2.7/site-packages/mayan/apps/mayan_statistics/tasks.py", line 16, in task_execute_statistic
    Statistic.get(slug=slug).execute()
  File "/opt/mayan-edms/local/lib/python2.7/site-packages/mayan/apps/mayan_statistics/classes.py", line 120, in execute
    self.store_results(results=self.func())
  File "/opt/mayan-edms/local/lib/python2.7/site-packages/mayan/apps/documents/statistics.py", line 56, in new_document_pages_per_month
    qss.time_series(start=this_year, end=today, interval='months')
  File "/opt/mayan-edms/local/lib/python2.7/site-packages/mayan/apps/documents/statistics.py", line 55, in <lambda>
    lambda x: {force_text(MONTH_NAMES[x[0].month]): x[1]},
IndexError: list index out of range
[2018-12-20 14:00:00,139: ERROR/MainProcess] Task mayan_statistics.tasks.task_execute_statistic[6d41810c-def5-4b7f-8135-463906f662c1] raised unexpected: IndexError('list index out of range',)
Traceback (most recent call last):
  File "/opt/mayan-edms/local/lib/python2.7/site-packages/celery/app/trace.py", line 240, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/opt/mayan-edms/local/lib/python2.7/site-packages/celery/app/trace.py", line 438, in __protected_call__
    return self.run(*args, **kwargs)
  File "/opt/mayan-edms/local/lib/python2.7/site-packages/mayan/apps/mayan_statistics/tasks.py", line 16, in task_execute_statistic
    Statistic.get(slug=slug).execute()
  File "/opt/mayan-edms/local/lib/python2.7/site-packages/mayan/apps/mayan_statistics/classes.py", line 120, in execute
    self.store_results(results=self.func())
  File "/opt/mayan-edms/local/lib/python2.7/site-packages/mayan/apps/documents/statistics.py", line 220, in total_document_page_per_month
    ): qss.until(datetime.date(year, next_month, 1))
IndexError: list index out of range
[2018-12-20 14:40:18 +0000] [154] [INFO] Autorestarting worker after current request.
[2018-12-20 14:40:19 +0000] [154] [INFO] Worker exiting (pid: 154)
[2018-12-20 14:40:20 +0000] [158] [INFO] Booting worker with pid: 158
I really have no idea how to interpret these, other than that maybe the DB is corrupted and maybe that's causing the app errors. There don't seem to be any permission errors, since Docker is installed as root and the document_cache folder and the document_storage folder are both being written to.
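
One possible reading of the app-container tracebacks, offered as a sketch rather than a confirmed diagnosis: the failing line indexes `MONTH_NAMES` with `x[0].month`, which runs from 1 to 12. If `MONTH_NAMES` is a plain zero-indexed list of twelve names (an assumption; the list below is a hypothetical stand-in, not copied from the Mayan source), then any lookup for December would raise this same `IndexError`, which would also fit the logs being dated in December. The helper names are made up for illustration.

```python
import datetime

# Hypothetical stand-in for the MONTH_NAMES list named in the traceback.
# Assumption: it is a zero-indexed list with exactly twelve entries.
MONTH_NAMES = [
    'January', 'February', 'March', 'April', 'May', 'June',
    'July', 'August', 'September', 'October', 'November', 'December',
]


def month_label_direct(date):
    """Index with the 1-based month number, as the traceback line appears to do."""
    return MONTH_NAMES[date.month]


def month_label_shifted(date):
    """Shift the month to a 0-based index before the lookup."""
    return MONTH_NAMES[date.month - 1]


def direct_lookup_fails(date):
    """Return True if the direct lookup raises IndexError for this date."""
    try:
        month_label_direct(date)
    except IndexError:
        return True
    return False


if __name__ == '__main__':
    december = datetime.date(2018, 12, 20)
    print(direct_lookup_fails(december))   # direct lookup for December
    print(month_label_shifted(december))   # shifted lookup for December
```

If that assumption holds, the exception would come from the statistics code itself rather than from a corrupted database. The tickets linked at the end of this thread are the place to confirm it.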

Thanks for your suggestions!

Rob

FlightdeckRob
Posts: 14
Joined: Wed Oct 03, 2018 3:54 pm

Re: Document Content Empty

Post by FlightdeckRob » Thu Dec 20, 2018 4:57 pm

Did some more digging on a new install of Mayan. The test files used to check parsing were docx and txt files.
-Created a new Debian 9 VM with root access.
-Loaded docker and docker-compose under the root user.
-Installed Mayan-EDMS via the docker-compose.yml available through GitLab (4 containers).

This process gave the same issues. Mayan created all of the subfolders EXCEPT document_cache. After I created that folder manually, it was able to preview and OCR the documents, but still no document "Content" was parsed.

Removed all volumes, images, containers, etc.

Then I tried following the two-container instructions from the documentation:

Using a dedicated Docker network
Use this method to avoid having to expose the PostgreSQL port to the host’s network or if you have other PostgreSQL instances but still want to use the default port of 5432 for this installation.

Create the network:

docker network create mayan
Launch the PostgreSQL container with the network option and remove the port binding (-p 5432:5432):

docker run -d \
--name mayan-edms-postgres \
--network=mayan \
--restart=always \
-e POSTGRES_USER=mayan \
-e POSTGRES_DB=mayan \
-e POSTGRES_PASSWORD=mayanuserpass \
-v /docker-volumes/mayan-edms/postgres:/var/lib/postgresql/data \
-d postgres:9.5
Launch the Mayan EDMS container with the network option and change the database hostname to the PostgreSQL container name (mayan-edms-postgres) instead of the IP address of the Docker host (172.17.0.1):

docker run -d \
--name mayan-edms \
--network=mayan \
--restart=always \
-p 80:8000 \
-e MAYAN_DATABASE_ENGINE=django.db.backends.postgresql \
-e MAYAN_DATABASE_HOST=mayan-edms-postgres \
-e MAYAN_DATABASE_NAME=mayan \
-e MAYAN_DATABASE_PASSWORD=mayanuserpass \
-e MAYAN_DATABASE_USER=mayan \
-e MAYAN_DATABASE_CONN_MAX_AGE=60 \
-v /docker-volumes/mayan-edms/media:/var/lib/mayan \
mayanedms/mayanedms:latest
To be sure there weren't any permission issues, I set the Docker volume folder to have 777 permissions. This gave the same result: document_storage was created to store the first file on upload, but the document_cache folder was not created, so there was no OCR and no previews on startup. After I created the folder manually, previews and OCR worked. Still no parsed document content.
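
The manual folder fix described above can be sketched as a small check, assuming Python 3 is available on the Docker host. The function name `check_media_folders` is hypothetical, the folder names are the ones mentioned in this thread, and the media path in the comment is the volume path from the `docker run` command quoted above. None of this is part of Mayan itself.

```python
import os
import tempfile


def check_media_folders(media_root, names=('document_cache', 'document_storage')):
    """Create any missing subfolder and report whether each one is writable."""
    report = {}
    for name in names:
        path = os.path.join(media_root, name)
        # exist_ok=True leaves folders that already exist untouched.
        os.makedirs(path, exist_ok=True)
        report[name] = os.access(path, os.W_OK)
    return report


if __name__ == '__main__':
    # Demo against a throwaway directory. On a real host the argument would be
    # the media volume, e.g. /docker-volumes/mayan-edms/media from this thread.
    with tempfile.TemporaryDirectory() as demo_root:
        for folder, writable in check_media_folders(demo_root).items():
            print(folder, 'writable' if writable else 'NOT writable')
```

When run as root, the writability check will normally report True regardless of the folder mode, so it is mostly useful when the container runs as a different user.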

Update:
I checked the logs of the DB container and found the following, after uploading only 2 files (both uploaded without errors):

LOG:  database system was shut down at 2018-12-20 16:26:28 UTC
LOG:  MultiXact member wraparound protections are now enabled
LOG:  autovacuum launcher started
LOG:  database system is ready to accept connections
ERROR:  column "document__date_added" does not exist at character 29
STATEMENT:  SELECT (date_trunc('month', document__date_added)) AS "d", COUNT("documents_documentversion"."id") AS "agg" FROM "documents_documentversion" INNER JOIN "documents_document" ON ("documents_documentversion"."document_id" = "documents_document"."id") WHERE "documents_document"."date_added" BETWEEN '2018-01-01T00:00:00+00:00'::timestamptz AND '2018-12-31T23:59:59.999999+00:00'::timestamptz GROUP BY (date_trunc('month', document__date_added))
ERROR:  column "document_version__document__date_added" does not exist at character 29
STATEMENT:  SELECT (date_trunc('month', document_version__document__date_added)) AS "d", COUNT("documents_documentpage"."id") AS "agg" FROM "documents_documentpage" INNER JOIN "documents_documentversion" ON ("documents_documentpage"."document_version_id" = "documents_documentversion"."id") INNER JOIN "documents_document" ON ("documents_documentversion"."document_id" = "documents_document"."id") WHERE "documents_document"."date_added" BETWEEN '2018-01-01T00:00:00+00:00'::timestamptz AND '2018-12-31T23:59:59.999999+00:00'::timestamptz GROUP BY (date_trunc('month', document_version__document__date_added))
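
Since the same two statements repeat every hour in the earlier log, it may help to collapse the DB log into a count per missing column. A minimal sketch, assuming the log has been saved as text (for example from `docker logs`). The function name and regular expression are illustrative, and the sample lines are copied from the logs in this thread.

```python
import re
from collections import Counter

# Matches PostgreSQL lines of the form:
#   ERROR:  column "<name>" does not exist at character 29
MISSING_COLUMN = re.compile(r'ERROR:\s+column "([^"]+)" does not exist')


def count_missing_columns(log_text):
    """Count 'column ... does not exist' errors per column name."""
    return Counter(MISSING_COLUMN.findall(log_text))


SAMPLE_LOG = '''\
2018-12-20 11:00:00.089 UTC [86] ERROR:  column "document_version__document__date_added" does not exist at character 29
2018-12-20 12:00:00.051 UTC [86] ERROR:  column "document__date_added" does not exist at character 29
2018-12-20 12:00:00.070 UTC [86] ERROR:  column "document_version__document__date_added" does not exist at character 29
'''

if __name__ == '__main__':
    for column, hits in count_missing_columns(SAMPLE_LOG).most_common():
        print(column, hits)
```

Only two distinct column names appear in the logs posted here, which suggests a small number of failing queries rather than widespread damage to the database.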

rosarior
Posts: 181
Joined: Tue Aug 21, 2018 3:28 am

Re: Document Content Empty

Post by rosarior » Tue Jan 29, 2019 7:37 am

Thanks for finding the source of the issue and opening the tickets; that will make these easier to solve.

Tickets:
https://gitlab.com/mayan-edms/mayan-edms/issues/549
https://gitlab.com/mayan-edms/mayan-edms/issues/550
