I have some documents with metadata attached to them and I would like to use the regular expression feature to find them. Everything is working well as long as there is no whitespace in the metadata value.
The search backend is left at its default value (the Whoosh engine).
So this is what I have tried:
| Doc | Metadata Value |
|-----|----------------|
| Document 1 | Word A |
| Document 2 | Word B |
| Document 3 | Word C |
I need an expression which finds all documents matching exactly Word A or Word B (regex `Word A|Word B`). When using the advanced search, I tried the following expressions:
| No | Expression | Translates to |
|----|------------|---------------|
| 1 | `%^Word` | `metadata__value REGULAREXPRESSION ^Word` |
| 2 | `%^Word A` | `metadata__value REGULAREXPRESSION ^Word AND metadata__value PARTIAL A` |
| 3 | `` `%^Word A` `` | `` metadata__value REGULAREXPRESSION `^Word A` `` |
| 4 | `"%^Word A"` | `metadata__value REGULAREXPRESSION "^Word A"` |
| 5 | `%^Word\sA` | `metadata__value REGULAREXPRESSION ^Word\sA` |
| 6 | `^Word.A` | `metadata__value REGULAREXPRESSION ^Word\sA` |
So #1 gives me all 3 documents, which is correct. All other approaches return no results. What would be the correct expression to match exactly “Word A”? And, additionally, Word A or Word B (`Word A|Word B`)?
Do I have to escape whitespace in a certain way?
I haven’t tested this in Mayan, only in a regex tester (https://regex101.com/):
Give it a try.
Thanks for your response. I also use the same regex tester for building and testing regex terms.
Your expression also gives no results. It correctly translates to
metadata__value REGULAREXPRESSION ^Word\s(A|B)
but the result list remains empty, even when using wildcards like `^Word.*`
Any other idea?
I played around a lot with regex search, and everything works well as long as there is no whitespace in the metadata value. I also cannot simply replace the whitespace in the metadata source value, because that would not fit our company's notation for sequence numbers (for drawings, items, orders, …).
From a technical point of view, I assume that the search backend splits the metadata value into two separate words and handles them separately. Can you confirm this?
Is there any chance to work around this behaviour?
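To make the splitting assumption concrete, here is a minimal sketch in plain Python. It does not use Mayan or Whoosh at all; the `match_whole_value` and `match_per_token` helpers are hypothetical and only mimic the difference between matching a regex against the complete value and matching it against each word separately.

```python
import re

# Metadata values from the example in the first post.
VALUES = {
    "Document 1": "Word A",
    "Document 2": "Word B",
    "Document 3": "Word C",
}


def match_whole_value(pattern):
    """Apply the regex to each complete, unsplit metadata value."""
    rx = re.compile(pattern)
    return sorted(doc for doc, value in VALUES.items() if rx.search(value))


def match_per_token(pattern):
    """Apply the regex to each whitespace-separated word on its own.

    Assumption: this only mimics a backend that indexes single words;
    it is not how Mayan or Whoosh is actually implemented.
    """
    rx = re.compile(pattern)
    return sorted(
        doc
        for doc, value in VALUES.items()
        if any(rx.search(token) for token in value.split())
    )


print(match_whole_value(r"^Word\s(A|B)"))  # matched against "Word A", "Word B", ...
print(match_per_token(r"^Word\s(A|B)"))    # matched against "Word", "A", "B", ...
print(match_per_token(r"^Word"))
```

Against the whole value, `^Word\s(A|B)` finds Document 1 and Document 2. Against single words it finds nothing, because no individual word contains whitespace, while `^Word` still finds all three documents. That would be consistent with expressions #1 and #5 above, if the assumption holds.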
Besides the regular expression issue, there is another strange behaviour:
For the same example as in my initial post, I tried to use the EXACT search mode to find “Word A” in the metadata value. So I entered `="Word A"` in the advanced search, which translates to
metadata__value EXACT "Word A"
This term finds all 3 documents with “Word” in the metadata value, but
metadata__value EXACT "Word A"
metadata__value EXACT "Word B"
metadata__value EXACT "Word C"
metadata__value EXACT "Word and"
metadata__value EXACT "Word or"
do as well, whereas
metadata__value EXACT "Word ABC"
does not. It seems that single-letter words are ignored, as are conjunctions like “and”, “or”, …
Is there a way to really get the EXACT quoted term?
Could anyone confirm this behaviour when searching for a word followed by a space and a single letter, like Word A, in metadata values? Is there actually a way to get the exact search result (without pre-splitting or removing single characters from the raw data)?
Here is my search example:
If there is no way to get it to work with the search interpreter, is there maybe a chance to make it work with the raw search mode?
How does this work exactly?
`Word` -> returns the document shown in the image
`Word A` -> returns empty
PS: Sorry for returning to this topic so often, but the alternative to a technical solution would be the hard way: redesigning our logical numbering processes, whose pattern has a whitespace in it.
Have not been able to come up with a solution. Still looking into this.
Thanks, Roberto, for looking into this.
I studied the Whoosh documentation (I have not worked with it before) and came across the default settings of Tokenizers and Analyzers. I could not yet figure out which one Mayan uses by default. However, many of them have a `minsize` param set to 2 by default, meaning that words with only one letter are filtered out.
Could this be a starting point for me? Is there a way of using `SEARCH_BACKEND_ARGUMENTS` to specify Whoosh options?
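To check whether a `minsize` of 2 plus a stop word list would explain the EXACT results from my earlier post, I wrote a small emulation in plain Python. It does not call Whoosh; `analyze` and `exact_search` are hypothetical helpers, the stop list is a small illustrative subset that I made up, and treating EXACT as "all remaining terms must be present" is a simplification.

```python
import re

# Assumption: a short illustrative stop list, not the backend's real one.
STOP_WORDS = frozenset({"a", "an", "and", "or", "the"})

# Metadata values from the example in the first post.
VALUES = {
    "Document 1": "Word A",
    "Document 2": "Word B",
    "Document 3": "Word C",
}


def analyze(text, minsize=2, stop_words=STOP_WORDS):
    """Emulate: split into words, lowercase, drop short words and stop words."""
    tokens = [token.lower() for token in re.findall(r"\w+", text)]
    return [t for t in tokens if len(t) >= minsize and t not in stop_words]


def exact_search(query, minsize=2, stop_words=STOP_WORDS):
    """Return documents whose analyzed value contains every analyzed query term."""
    terms = analyze(query, minsize, stop_words)
    hits = []
    for doc, value in VALUES.items():
        indexed = analyze(value, minsize, stop_words)
        if all(term in indexed for term in terms):
            hits.append(doc)
    return sorted(hits)


for query in ("Word A", "Word and", "Word ABC"):
    print(query, "->", analyze(query), exact_search(query))

# With minsize=1 and no stop words the single letter survives analysis.
print(exact_search("Word A", minsize=1, stop_words=frozenset()))
```

In this emulation “Word A” and “Word and” both shrink to the single term `word` and therefore hit all three documents, while “Word ABC” keeps `abc` and hits none, which is the same pattern I see in Mayan. With `minsize=1` and an empty stop list only Document 1 is returned for “Word A”, so being able to change those two settings looks like the right direction, provided Mayan really uses such an analyzer.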