Whoosh in version 4.3 does not reindex all documents

fulkron
Posts: 20
Joined: Fri Mar 26, 2021 5:25 pm

Post by fulkron »

After upgrading to version 4.3 (Docker), Whoosh no longer reindexes all the documents in the database during a bulk reindex; it stops at about 50%.
No error is logged; the search_status output simply stops progressing.
Is it possible to resume the bulk reindex while keeping the portion already indexed?
Thanks
Dario
fulkron

Re: Whoosh in version 4.3 does not reindex all documents

Post by fulkron »

This is the result of search_status.
The database contains more than 4000 documents, but after a bulk reindex Whoosh reports only 1950.
If a search query does not return the correct results because Whoosh was unable to reindex all the documents, the user gets the impression that documents have been lost! That undermines confidence in an EDMS project built on Mayan EDMS,
so this becomes a critical issue to manage.

Please help

root@642a71537d85:/# /opt/mayan-edms/bin/mayan-edms.py search_status
/opt/mayan-edms/lib/python3.9/site-packages/requests/__init__.py:102: RequestsDependencyWarning: urllib3 (1.26.11) or chardet (5.0.0)/charset_normalizer (2.0.12) doesn't match a supported version!
  warnings.warn("urllib3 ({}) or chardet ({})/charset_normalizer ({}) doesn't match a supported "
Whoosh search model indexing status
===================================
Cabinet: 338
Document: 1950
Document file: 1375
Document file page: 7588
Document type: 16
Document version: 2375
Document version page: 10275
Group: 0
Index instance node: 0
Message: 0
Metadata type: 0
Role: 0
Signature capture: 0
Tag: 0
User: 0
root@642a71537d85:/# 
fulkron

Re: Whoosh in version 4.3 does not reindex all documents

Post by fulkron »

After several unsuccessful attempts at a bulk reindex, I was finally able to get out of the situation anyway:
1) Recreate a completely new Mayan EDMS instance, copying all the documents and the database from the failing machine.
2) Start a bulk reindex on the newly created instance and check that the “new” Whoosh index covers all the documents.
3) Copy the “new” Whoosh directory back to the failing machine and restart the original (failing) Mayan EDMS instance.

… but this remains a workaround, not a solution.

The open questions are:
Why does Whoosh stop the bulk reindex without logging anything?
Is it possible to restart the Whoosh bulk reindex from the point already reached, instead of always starting from the beginning?
What do you think about implementing a check that compares the number of documents indexed by Whoosh with the number of documents present in the database?
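
On the last question, here is a minimal sketch in Python of what such a check could look like. It only parses the text printed by search_status (the "Name: count" lines posted earlier in this thread) and compares it with totals supplied by hand. The helper names parse_search_status and find_shortfalls and the expected totals are illustrative assumptions, not part of Mayan EDMS; the real totals would have to be taken from the database.

```python
import re


def parse_search_status(text):
    """Collect 'Model name: count' lines from search_status output into a dict."""
    counts = {}
    for line in text.splitlines():
        # Only lines made of a plain name, a colon and an integer are counted;
        # warnings, the title and the '====' rule do not match and are skipped.
        match = re.fullmatch(r"\s*([A-Za-z][A-Za-z ]*?):\s*(\d+)\s*", line)
        if match:
            counts[match.group(1)] = int(match.group(2))
    return counts


def find_shortfalls(indexed, expected):
    """Return {model: (indexed, expected)} where the index holds fewer entries."""
    return {
        name: (indexed.get(name, 0), total)
        for name, total in expected.items()
        if indexed.get(name, 0) < total
    }


# A shortened copy of the output posted in this thread.
SAMPLE = """\
Whoosh search model indexing status
===================================
Cabinet: 338
Document: 1950
Document file: 1375
"""

indexed = parse_search_status(SAMPLE)

# Placeholder totals (assumption): in practice these come from the database.
expected = {"Cabinet": 338, "Document": 4000}

shortfalls = find_shortfalls(indexed, expected)
for name, (have, want) in shortfalls.items():
    print(f"{name}: {have} indexed, {want} expected")
```

A check like this would not explain why the reindex stops, but it would at least make an incomplete index visible instead of leaving users to discover it through missing search results.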
Any reply is welcome.
Dario