150k pages taking 2+ minutes to search

Questions, comments, discussions. Over time certain topics might be moved to their own category.
User avatar
rssfed23
Moderator
Moderator
Posts: 213
Joined: Mon Oct 14, 2019 1:18 pm
Location: United Kingdom
Contact:

Re: 150k pages taking 2+ minutes to search

Post by rssfed23 »

Docker will always add some overhead yes. It also can't take advantage of some of the optimisations available for bare metal.

I just ran some test queries on version 11 (upgraded from 9.6) and it did not add any significant improvement to speed on search.

It should really only be searching that gives you trouble. Postgres has to look through every 150k page and search through for the line of text in order to find the result.

So In summary:

Postgres is running the search query with only 1 thread. To speed things up any or all of the below can be tried and each item alone will help slightly:
- Don't run postgres containerised
- Ensure you've set up Postgres Hugepages and other Optimisation parameters (the pgtune utility can help here)
- Apply other postgres optimisations such as adding in a memcached cache (https://dzone.com/articles/three-steps- ... erformance)
- Run postgres on a faster server

In the background, we'll investigate what we can do as far as tweaking the search. I've logged https://gitlab.com/mayan-edms/mayan-edms/issues/741 to see what investigations can be done on this
As mentioned before, enterprise customers get priority support and feature requests so don't hesitate to get in touch.
Please bear with us during the current global situation. The team all have families and local communities to look after as well as the community here. Responses may be delayed during this time, but rest assured we will get to your query eventually.

User avatar
rssfed23
Moderator
Moderator
Posts: 213
Joined: Mon Oct 14, 2019 1:18 pm
Location: United Kingdom
Contact:

Re: 150k pages taking 2+ minutes to search

Post by rssfed23 »

Hybred.

As per the github issue (https://gitlab.com/mayan-edms/mayan-edm ... _268258869), we won't be making any immediate changes in terms of the current search setup/postgres configuration as there are broader changes planned for 4.0

Initially that will be to add backend support to the Mayan search engine. That will enable pluggable backends using drivers. Initially the Target will be woosh, but in theory that could be extended to others in the future (e.g. Elasticsearch). Based on my own experience in the past switching to woosh (which is designed for full text search) will significantly improve search speed and likely solve all your performance issues.
I should highlight that this is a two step process: adding support for backends and then adding our first additional backend (woosh). It is hoped that backend will appear in 4.0 this year and then woosh will be added soon after. Of course there is no guarantee it will end up in 4.0 as priorities can change and release dates move, but it is on the plan.

You can see the gitlab issue for all the details but that's the current plan. I hope that gives you some confidence. Prioritising this functionality can definitely be a part of any enterprise consulting agreement discussion.
Please bear with us during the current global situation. The team all have families and local communities to look after as well as the community here. Responses may be delayed during this time, but rest assured we will get to your query eventually.

Hybred
Posts: 8
Joined: Fri Jan 03, 2020 3:12 pm

Re: 150k pages taking 2+ minutes to search

Post by Hybred »

Thanks for making it an official concern and feature improvement.

I ran the database optimization and it still continues to crash after 120 seconds, so it seems like I need to pull the database out of the docker container and into it's own home.

I'll shoot for that early next week and see what comes of it.

Thanks for your help! I just submitted a PR for enterprise support, I'm sure it will be needed.

User avatar
rssfed23
Moderator
Moderator
Posts: 213
Joined: Mon Oct 14, 2019 1:18 pm
Location: United Kingdom
Contact:

Re: 150k pages taking 2+ minutes to search

Post by rssfed23 »

You’re more than welcome!
We’ll be happy to help support you :)
Please bear with us during the current global situation. The team all have families and local communities to look after as well as the community here. Responses may be delayed during this time, but rest assured we will get to your query eventually.

Hybred
Posts: 8
Joined: Fri Jan 03, 2020 3:12 pm

Re: 150k pages taking 2+ minutes to search

Post by Hybred »

Hello again!

After a project was completed I'm able to jump back into this. I was able to dump the database and import it into a separate PostGRE instance. What settings would I need to update to point it to the new DB? Or would it be better to just start a new install and point the data to the db?

Hybred
Posts: 8
Joined: Fri Jan 03, 2020 3:12 pm

Re: 150k pages taking 2+ minutes to search

Post by Hybred »

Also signed up for support - let me know if the base tier is enough, can easily upgrade if not.

ID: I-KCN9BW1Y93YN

m42e
Posts: 12
Joined: Fri Oct 11, 2019 8:21 am

Re: 150k pages taking 2+ minutes to search

Post by m42e »

I recently stumbled across the same issue. Just using the main search on the front-page did not work out. I analysed the requests done. Using the advanced search, I had no issues doing full text search.

I share with you what Postgres explain showed me. I picked the longest query

FrontPage Search

Code: Select all

SELECT COUNT(*) FROM (SELECT DISTINCT "documents_document"."id" AS Col1, "documents_document"."uuid" AS Col2, "documents_document"."document_type_id" AS Col3, "documents_document"."label" AS Col4, "documents_document"."description" AS Col5, "documents_document"."date_added" AS Col6, "documents_document"."language" AS Col7, "documents_document"."in_trash" AS Col8, "documents_document"."deleted_date_time" AS Col9, "documents_document"."is_stub" AS Col10 FROM "documents_document" INNER JOIN "documents_documenttype" ON ("documents_document"."document_type_id" = "documents_documenttype"."id") LEFT OUTER JOIN "documents_documentversion" ON ("documents_document"."id" = "documents_documentversion"."document_id") LEFT OUTER JOIN "cabinets_cabinet_documents" ON ("documents_document"."id" = "cabinets_cabinet_documents"."document_id") LEFT OUTER JOIN "cabinets_cabinet" ON ("cabinets_cabinet_documents"."cabinet_id" = "cabinets_cabinet"."id") LEFT OUTER JOIN "document_comments_comment" ON ("documents_document"."id" = "document_comments_comment"."document_id") LEFT OUTER JOIN "documents_documentpage" ON ("documents_documentversion"."id" = "documents_documentpage"."document_version_id") LEFT OUTER JOIN "document_parsing_documentpagecontent" ON ("documents_documentpage"."id" = "document_parsing_documentpagecontent"."document_page_id") LEFT OUTER JOIN "file_metadata_documentversiondriverentry" ON ("documents_documentversion"."id" = "file_metadata_documentversiondriverentry"."document_version_id") LEFT OUTER JOIN "file_metadata_filemetadataentry" ON ("file_metadata_documentversiondriverentry"."id" = "file_metadata_filemetadataentry"."document_version_driver_entry_id") LEFT OUTER JOIN "metadata_documentmetadata" ON ("documents_document"."id" = "metadata_documentmetadata"."document_id") LEFT OUTER JOIN "metadata_metadatatype" ON ("metadata_documentmetadata"."metadata_type_id" = "metadata_metadatatype"."id") LEFT OUTER JOIN "ocr_documentpageocrcontent" ON ("documents_documentpage"."id" = "ocr_documentpageocrcontent"."document_page_id") LEFT OUTER JOIN "tags_tag_documents" ON ("documents_document"."id" = "tags_tag_documents"."document_id") LEFT OUTER JOIN "tags_tag" ON ("tags_tag_documents"."tag_id" = "tags_tag"."id") WHERE ("documents_document"."in_trash" = false AND "documents_document"."is_stub" = false AND (UPPER("documents_documenttype"."label"::text) LIKE UPPER('%test%') OR UPPER("documents_documentversion"."mimetype"::text) LIKE UPPER('%test%') OR UPPER("documents_document"."label"::text) LIKE UPPER('%test%') OR UPPER("documents_document"."description"::text) LIKE UPPER('%test%') OR UPPER("documents_document"."uuid"::text) LIKE UPPER('%test%') OR UPPER("documents_documentversion"."checksum"::text) LIKE UPPER('%test%') OR UPPER("cabinets_cabinet"."label"::text) LIKE UPPER('%test%') OR UPPER("document_comments_comment"."comment"::text) LIKE UPPER('%test%') OR UPPER("document_parsing_documentpagecontent"."content"::text) LIKE UPPER('%test%') OR UPPER("file_metadata_filemetadataentry"."key"::text) LIKE UPPER('%test%') OR UPPER("file_metadata_filemetadataentry"."value"::text) LIKE UPPER('%test%') OR UPPER("metadata_metadatatype"."name"::text) LIKE UPPER('%test%') OR UPPER("metadata_documentmetadata"."value"::text) LIKE UPPER('%test%') OR UPPER("ocr_documentpageocrcontent"."content"::text) LIKE UPPER('%test%') OR UPPER("tags_tag"."label"::text) LIKE UPPER('%test%')) AND ("documents_document"."id" IN (SELECT U0."object_id" FROM "acls_accesscontrollist" U0 INNER JOIN "acls_accesscontrollist_permissions" U2 ON (U0."id" = U2."accesscontrollist_id") INNER JOIN "permissions_role" U4 ON (U0."role_id" = U4."id") INNER JOIN "permissions_role_groups" U5 ON (U4."id" = U5."role_id") INNER JOIN "auth_group" U6 ON (U5."group_id" = U6."id") INNER JOIN "auth_user_groups" U7 ON (U6."id" = U7."group_id") WHERE (U0."content_type_id" = 68 AND U2."storedpermission_id" = 9 AND U7."user_id" = 2)) OR "documents_document"."document_type_id" IN (SELECT U0."object_id" FROM "acls_accesscontrollist" U0 INNER JOIN "acls_accesscontrollist_permissions" U2 ON (U0."id" = U2."accesscontrollist_id") INNER JOIN "permissions_role" U4 ON (U0."role_id" = U4."id") INNER JOIN "permissions_role_groups" U5 ON (U4."id" = U5."role_id") INNER JOIN "auth_group" U6 ON (U5."group_id" = U6."id") INNER JOIN "auth_user_groups" U7 ON (U6."id" = U7."group_id") WHERE (U0."content_type_id" = 62 AND U2."storedpermission_id" = 9 AND U7."user_id" = 2))))) subquery
Postgres SQL Explain

Code: Select all

Aggregate  (cost=8309.37..8309.38 rows=1 width=8)
  ->  HashAggregate  (cost=8236.92..8269.12 rows=3220 width=83)
        Group Key: documents_document.id, documents_document.uuid, documents_document.document_type_id, documents_document.label, documents_document.description, documents_document.date_added, documents_document.language, documents_document.in_trash, documents_document.deleted_date_time, documents_document.is_stub
        ->  Hash Left Join  (cost=2265.41..7435.35 rows=32063 width=83)
              Hash Cond: (documents_documentpage.id = ocr_documentpageocrcontent.document_page_id)
              Filter: ((upper((documents_documenttype.label)::text) ~~ '%TEST%'::text) OR (upper((documents_documentversion.mimetype)::text) ~~ '%TEST%'::text) OR (upper((documents_document.label)::text) ~~ '%TEST%'::text) OR (upper(documents_document.description) ~~ '%TEST%'::text) OR (upper((documents_document.uuid)::text) ~~ '%TEST%'::text) OR (upper((documents_documentversion.checksum)::text) ~~ '%TEST%'::text) OR (upper((cabinets_cabinet.label)::text) ~~ '%TEST%'::text) OR (upper(document_comments_comment.comment) ~~ '%TEST%'::text) OR (upper(document_parsing_documentpagecontent.content) ~~ '%TEST%'::text) OR (upper((file_metadata_filemetadataentry.key)::text) ~~ '%TEST%'::text) OR (upper((file_metadata_filemetadataentry.value)::text) ~~ '%TEST%'::text) OR (upper((metadata_metadatatype.name)::text) ~~ '%TEST%'::text) OR (upper((metadata_documentmetadata.value)::text) ~~ '%TEST%'::text) OR (upper(ocr_documentpageocrcontent.content) ~~ '%TEST%'::text) OR (upper((tags_tag.label)::text) ~~ '%TEST%'::text))
              ->  Hash Left Join  (cost=1467.19..5892.90 rows=282472 width=1658)
                    Hash Cond: (documents_documentversion.id = file_metadata_documentversiondriverentry.document_version_id)
                    ->  Hash Left Join  (cost=516.77..1751.29 rows=11165 width=1634)
                          Hash Cond: (documents_document.id = tags_tag_documents.document_id)
                          ->  Hash Left Join  (cost=460.05..1514.17 rows=6875 width=1360)
                                Hash Cond: (documents_document.id = metadata_documentmetadata.document_id)
                                ->  Hash Right Join  (cost=430.68..1353.59 rows=6875 width=1233)
                                      Hash Cond: (documents_documentpage.document_version_id = documents_documentversion.id)
                                      ->  Hash Right Join  (cost=259.44..1078.87 rows=9264 width=553)
                                            Hash Cond: (document_parsing_documentpagecontent.document_page_id = documents_documentpage.id)
                                            ->  Seq Scan on document_parsing_documentpagecontent  (cost=0.00..795.21 rows=9221 width=549)
                                            ->  Hash  (cost=143.64..143.64 rows=9264 width=8)
                                                  ->  Seq Scan on documents_documentpage  (cost=0.00..143.64 rows=9264 width=8)
                                      ->  Hash  (cost=161.16..161.16 rows=806 width=684)
                                            ->  Hash Left Join  (cost=105.16..161.16 rows=806 width=684)
                                                  Hash Cond: (documents_document.id = document_comments_comment.document_id)
                                                  ->  Hash Left Join  (cost=104.14..157.11 rows=806 width=652)
                                                        Hash Cond: (cabinets_cabinet_documents.cabinet_id = cabinets_cabinet.id)
                                                        ->  Hash Join  (cost=102.42..152.99 rows=806 width=382)
                                                              Hash Cond: (documents_document.document_type_id = documents_documenttype.id)
                                                              ->  Hash Right Join  (cost=100.99..148.98 rows=806 width=172)
                                                                    Hash Cond: (documents_documentversion.document_id = documents_document.id)
                                                                    ->  Seq Scan on documents_documentversion  (cost=0.00..35.86 rows=1086 width=89)
                                                                    ->  Hash  (cost=90.93..90.93 rows=805 width=87)
                                                                          ->  Hash Right Join  (cost=71.84..90.93 rows=805 width=87)
                                                                                Hash Cond: (cabinets_cabinet_documents.document_id = documents_document.id)
                                                                                ->  Seq Scan on cabinets_cabinet_documents  (cost=0.00..16.36 rows=1036 width=8)
                                                                                ->  Hash  (cost=61.77..61.77 rows=805 width=83)
                                                                                      ->  Seq Scan on documents_document  (cost=29.50..61.77 rows=805 width=83)
                                                                                            Filter: ((NOT in_trash) AND (NOT is_stub) AND ((hashed SubPlan 1) OR (hashed SubPlan 2)))
                                                                                            SubPlan 1
                                                                                              ->  Nested Loop  (cost=0.00..14.75 rows=1 width=4)
                                                                                                    Join Filter: (u5.group_id = u6.id)
                                                                                                    ->  Nested Loop  (cost=0.00..13.68 rows=1 width=12)
                                                                                                          Join Filter: (u0.role_id = u4.id)
                                                                                                          ->  Nested Loop  (cost=0.00..12.57 rows=1 width=20)
                                                                                                                Join Filter: (u5.group_id = u7.group_id)
                                                                                                                ->  Nested Loop  (cost=0.00..11.47 rows=1 width=16)
                                                                                                                      Join Filter: (u0.role_id = u5.role_id)
                                                                                                                      ->  Nested Loop  (cost=0.00..10.29 rows=1 width=8)
                                                                                                                            Join Filter: (u0.id = u2.accesscontrollist_id)
                                                                                                                            ->  Seq Scan on acls_accesscontrollist_permissions u2  (cost=0.00..8.65 rows=19 width=4)
                                                                                                                                  Filter: (storedpermission_id = 9)
                                                                                                                            ->  Materialize  (cost=0.00..1.35 rows=1 width=12)
                                                                                                                                  ->  Seq Scan on acls_accesscontrollist u0  (cost=0.00..1.35 rows=1 width=12)
                                                                                                                                        Filter: (content_type_id = 68)
                                                                                                                      ->  Seq Scan on permissions_role_groups u5  (cost=0.00..1.08 rows=8 width=8)
                                                                                                                ->  Seq Scan on auth_user_groups u7  (cost=0.00..1.09 rows=1 width=4)
                                                                                                                      Filter: (user_id = 2)
                                                                                                          ->  Seq Scan on permissions_role u4  (cost=0.00..1.05 rows=5 width=4)
                                                                                                    ->  Seq Scan on auth_group u6  (cost=0.00..1.03 rows=3 width=4)
                                                                                            SubPlan 2
                                                                                              ->  Nested Loop  (cost=0.00..14.75 rows=1 width=4)
                                                                                                    Join Filter: (u5_1.group_id = u6_1.id)
                                                                                                    ->  Nested Loop  (cost=0.00..13.68 rows=1 width=12)
                                                                                                          Join Filter: (u0_1.role_id = u4_1.id)
                                                                                                          ->  Nested Loop  (cost=0.00..12.57 rows=1 width=20)
                                                                                                                Join Filter: (u5_1.group_id = u7_1.group_id)
                                                                                                                ->  Nested Loop  (cost=0.00..11.47 rows=1 width=16)
                                                                                                                      Join Filter: (u0_1.role_id = u5_1.role_id)
                                                                                                                      ->  Nested Loop  (cost=0.00..10.29 rows=1 width=8)
                                                                                                                            Join Filter: (u0_1.id = u2_1.accesscontrollist_id)
                                                                                                                            ->  Seq Scan on acls_accesscontrollist_permissions u2_1  (cost=0.00..8.65 rows=19 width=4)
                                                                                                                                  Filter: (storedpermission_id = 9)
                                                                                                                            ->  Materialize  (cost=0.00..1.35 rows=1 width=12)
                                                                                                                                  ->  Seq Scan on acls_accesscontrollist u0_1  (cost=0.00..1.35 rows=1 width=12)
                                                                                                                                        Filter: (content_type_id = 62)
                                                                                                                      ->  Seq Scan on permissions_role_groups u5_1  (cost=0.00..1.08 rows=8 width=8)
                                                                                                                ->  Seq Scan on auth_user_groups u7_1  (cost=0.00..1.09 rows=1 width=4)
                                                                                                                      Filter: (user_id = 2)
                                                                                                          ->  Seq Scan on permissions_role u4_1  (cost=0.00..1.05 rows=5 width=4)
                                                                                                    ->  Seq Scan on auth_group u6_1  (cost=0.00..1.03 rows=3 width=4)
                                                              ->  Hash  (cost=1.19..1.19 rows=19 width=214)
                                                                    ->  Seq Scan on documents_documenttype  (cost=0.00..1.19 rows=19 width=214)
                                                        ->  Hash  (cost=1.32..1.32 rows=32 width=278)
                                                              ->  Seq Scan on cabinets_cabinet  (cost=0.00..1.32 rows=32 width=278)
                                                  ->  Hash  (cost=1.01..1.01 rows=1 width=36)
                                                        ->  Seq Scan on document_comments_comment  (cost=0.00..1.01 rows=1 width=36)
                                ->  Hash  (cost=18.75..18.75 rows=850 width=131)
                                      ->  Hash Left Join  (cost=1.29..18.75 rows=850 width=131)
                                            Hash Cond: (metadata_documentmetadata.metadata_type_id = metadata_metadatatype.id)
                                            ->  Seq Scan on metadata_documentmetadata  (cost=0.00..14.50 rows=850 width=21)
                                            ->  Hash  (cost=1.13..1.13 rows=13 width=118)
                                                  ->  Seq Scan on metadata_metadatatype  (cost=0.00..1.13 rows=13 width=118)
                          ->  Hash  (cost=34.69..34.69 rows=1762 width=278)
                                ->  Hash Left Join  (cost=2.01..34.69 rows=1762 width=278)
                                      Hash Cond: (tags_tag_documents.tag_id = tags_tag.id)
                                      ->  Seq Scan on tags_tag_documents  (cost=0.00..27.62 rows=1762 width=8)
                                      ->  Hash  (cost=1.45..1.45 rows=45 width=278)
                                            ->  Seq Scan on tags_tag  (cost=0.00..1.45 rows=45 width=278)
                    ->  Hash  (cost=608.23..608.23 rows=27375 width=32)
                          ->  Hash Right Join  (cost=30.34..608.23 rows=27375 width=32)
                                Hash Cond: (file_metadata_filemetadataentry.document_version_driver_entry_id = file_metadata_documentversiondriverentry.id)
                                ->  Seq Scan on file_metadata_filemetadataentry  (cost=0.00..505.75 rows=27375 width=32)
                                ->  Hash  (cost=16.82..16.82 rows=1082 width=8)
                                      ->  Seq Scan on file_metadata_documentversiondriverentry  (cost=0.00..16.82 rows=1082 width=8)
              ->  Hash  (cost=715.32..715.32 rows=6632 width=726)
                    ->  Seq Scan on ocr_documentpageocrcontent  (cost=0.00..715.32 rows=6632 width=726)
This is one of the requests timing out.

For the advanced search tab I did not catch the request in time to copy it. Of Cours this will be doable but not with my onboard tools.


I am not sure, if this actual helps somebody, but I wanted to share with you.

Post Reply