Whoosh in version 4.3 does not reindex all documents

Posted: Thu Aug 04, 2022 6:46 pm
by fulkron
After upgrading to version 4.3 (Docker), Whoosh no longer reindexes all the documents in the database during a bulk reindex; it stops at about 50%.
No error is logged; the search_status output simply stops progressing.
Is it possible to resume the bulk reindex while keeping the portion that is already indexed?

Re: Whoosh in version 4.3 does not reindex all documents

Posted: Sat Aug 06, 2022 2:27 pm
by fulkron
This is the output of search_status.
The database contains more than 4000 documents, but after a bulk reindex Whoosh reports only 1950.
If a search query returns incomplete results because Whoosh is unable to reindex all the documents, the user gets the impression that documents have been lost! That undermines confidence in any EDMS project built on Mayan EDMS...
so this becomes a critical issue to manage.

Please help

Code:

root@642a71537d85:/# /opt/mayan-edms/bin/ search_status
/opt/mayan-edms/lib/python3.9/site-packages/requests/ RequestsDependencyWarning: urllib3 (1.26.11) or chardet (5.0.0)/charset_normalizer (2.0.12) doesn't match a supported version!
  warnings.warn("urllib3 ({}) or chardet ({})/charset_normalizer ({}) doesn't match a supported "
Whoosh search model indexing status
Cabinet: 338
Document: 1950
Document file: 1375
Document file page: 7588
Document type: 16
Document version: 2375
Document version page: 10275
Group: 0
Index instance node: 0
Message: 0
Metadata type: 0
Role: 0
Signature capture: 0
Tag: 0
User: 0

Re: Whoosh in version 4.3 does not reindex all documents

Posted: Tue Aug 09, 2022 6:13 am
by fulkron
After several unsuccessful attempts at a bulk reindex, I finally managed to get out of this situation anyway.
In case somebody else experiences the same issue, here are the steps I took:
1) Recreate a completely new Mayan EDMS instance, copying all the documents and the database from the failing machine.
2) Start a bulk reindex on the newly created instance and check that the “new” Whoosh index covers all the documents.
3) Copy the “new” Whoosh directory back to the failing machine and restart the original (failing) Mayan EDMS instance.

… but this remains a workaround, not a solution.
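For anyone repeating step 3, here is a minimal Python sketch of the copy-back that keeps the old index as a backup so the swap can be undone. It is only an illustration: the function name `replace_index_dir` and both paths are hypothetical, the actual location of the Whoosh directory depends on your installation, and Mayan EDMS should be stopped before running anything like this.

```python
import shutil
import time
from pathlib import Path


def replace_index_dir(rebuilt_index, live_index):
    """Copy a rebuilt index directory over the live one, keeping a backup.

    Both arguments are hypothetical paths chosen by the caller.
    Returns the backup path, or None if there was no live index to back up.
    """
    rebuilt_index = Path(rebuilt_index)
    live_index = Path(live_index)
    if not rebuilt_index.is_dir():
        raise FileNotFoundError(f"No rebuilt index at {rebuilt_index}")

    backup = None
    if live_index.exists():
        # Rename rather than delete, so the old index can be restored.
        backup = live_index.with_name(f"{live_index.name}.bak-{int(time.time())}")
        live_index.rename(backup)

    shutil.copytree(rebuilt_index, live_index)
    return backup
```

If the new index turns out to be incomplete too, the returned backup directory can simply be renamed back into place.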

The open questions are:
Why does Whoosh stop the bulk reindex without logging anything?
Is it possible to restart the Whoosh bulk reindex from the partially indexed state instead of always starting from the beginning?
What do you think about implementing a check that compares the number of documents indexed by Whoosh with the number of documents present in the database?
Any answer is welcome.
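
To illustrate the kind of check meant in the last question, here is a rough Python sketch that works from the outside only: it parses the text printed by search_status (in the format shown earlier in this thread) and compares it with counts taken from the database. The function names `parse_search_status` and `find_shortfalls` are made up for this example, and the expected counts have to be supplied by you, for instance from a count query against the database; nothing here is an existing Mayan EDMS feature.

```python
import re


def parse_search_status(text):
    """Parse 'Model name: count' lines from search_status output into a dict."""
    counts = {}
    for line in text.splitlines():
        # Only accept lines that are exactly '<letters and spaces>: <digits>',
        # which skips the shell prompt, warnings and the title line.
        match = re.fullmatch(r"\s*([A-Za-z][A-Za-z ]*?):\s*(\d+)\s*", line)
        if match:
            counts[match.group(1)] = int(match.group(2))
    return counts


def find_shortfalls(indexed, expected):
    """Return {model: (indexed, expected)} where fewer items are indexed than expected."""
    return {
        model: (indexed.get(model, 0), total)
        for model, total in expected.items()
        if indexed.get(model, 0) < total
    }


if __name__ == "__main__":
    status_output = "Whoosh search model indexing status\nCabinet: 338\nDocument: 1950\n"
    # 4000 is a placeholder for the real count taken from the database.
    print(find_shortfalls(parse_search_status(status_output), {"Document": 4000}))
```

Run after a bulk reindex, a non-empty result would at least make an incomplete index visible instead of leaving it to be discovered through missing search results.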