I have some documents with metadata attached to them and I would like to use the regular expression feature to find them. Everything is working well as long as there is no whitespace in the metadata value.
The search backend is left at its default (the Whoosh engine).
This is what I have tried:
Doc          Metadata Value
------------------------------
Document 1   Word A
Document 2   Word B
Document 3   Word C
I need an expression which finds all documents matching exactly Word A or Word B (regex Word A|Word B). In the advanced search, I tried the following expressions:
No   Expression    Translates to
---------------------------------------------------------------------------------------
1    %^Word        metadata__value REGULAREXPRESSION ^Word
2    %^Word A      metadata__value REGULAREXPRESSION ^Word AND metadata__value PARTIAL A
3    `%^Word A`    metadata__value REGULAREXPRESSION `^Word A`
4    "%^Word A"    metadata__value REGULAREXPRESSION "^Word A"
5    %^Word\sA     metadata__value REGULAREXPRESSION ^Word\sA
6    ^Word.A       metadata__value REGULAREXPRESSION ^Word\sA
So #1 gives me all 3 documents, which is correct. All other approaches return no results. What would be the correct expression to match exactly “Word A”? And, additionally, Word A or Word B (Word A|Word B)?
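To make the intended result concrete, here is a minimal sketch in plain Python (the standard `re` module, not Mayan's search) of what the expression should do when it is applied to the complete, unsplit metadata value. The `values` dictionary is only a stand-in for the three example documents.

```python
import re

# Stand-in for the three example documents and their metadata values.
values = {
    "Document 1": "Word A",
    "Document 2": "Word B",
    "Document 3": "Word C",
}

# Anchored at both ends so that only the complete value matches.
pattern = re.compile(r"^Word (A|B)$")

# Apply the pattern to each whole value, whitespace included.
matches = [doc for doc, value in values.items() if pattern.search(value)]
print(matches)  # ['Document 1', 'Document 2']
```

This only shows the desired outcome. Whether the search backend ever sees the value as one string is exactly the open question.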
I played around a lot with the regex search, and everything works well as long as there is no whitespace in the metadata value. I also cannot simply replace the whitespace in the metadata source value, because this would not fit our company's notation of sequence numbers (for drawings, items, orders, …).
From a technical point of view, I assume that the search backend splits the metadata value into two separate words and handles them separately. Can you confirm this?
Is there any chance to work around this behaviour?
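The splitting assumption above would explain the results in the table. The following sketch emulates it in plain Python: `tokenize` is a hypothetical stand-in for whatever the backend does at index time (it is not Mayan or Whoosh code), and the regular expression is then tested against each term separately instead of against the whole value.

```python
import re

values = {
    "Document 1": "Word A",
    "Document 2": "Word B",
    "Document 3": "Word C",
}

def tokenize(value):
    # Hypothetical tokenizer: one term per run of word characters.
    return re.findall(r"\w+", value)

def regex_hits(pattern, per_term):
    """Return the documents whose value (or any single term of it) matches."""
    compiled = re.compile(pattern)
    hits = []
    for doc, value in values.items():
        candidates = tokenize(value) if per_term else [value]
        if any(compiled.search(candidate) for candidate in candidates):
            hits.append(doc)
    return hits

# Matching per term: "^Word" hits every document, like expression #1.
print(regex_hits(r"^Word", per_term=True))
# No single term contains a space, so this cannot match, like #3 to #5.
print(regex_hits(r"^Word A", per_term=True))
# Matching the whole value would give the result I am after.
print(regex_hits(r"^Word A", per_term=False))
```

If the backend really matches per term, no expression containing a space or `\s` can succeed, no matter how it is quoted.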
Besides the regular expression, there is another strange behaviour:
For the same example as in my initial post, I tried to use the EXACT search mode to find “Word A” in the metadata value. So I entered =“Word A” in the advanced search, which translates to:
metadata__value EXACT "Word A"
This term finds all 3 documents with “Word” in the metadata value.
Could anyone confirm this behaviour when searching for a word followed by a space and a single letter, like Word A, in metadata values? Is there actually a way to get the exact search result (without pre-splitting or removing single characters from the raw data)?
If there is no way to get it to work with the search interpreter, is there maybe a chance to make it work with the raw search mode?
How does this work exactly?
`Word` -> returns the document shown in the image
`Word A` -> returns empty
Torsten
PS: Sorry for returning to this topic so often, but the only alternative to a technical solution would be the hard way: redesigning our logical numbering process, whose pattern contains a whitespace (XXX-XXXX X).
I studied the Whoosh documentation (I have not worked with it before) and came across the default settings of the tokenizers and analyzers. I could not yet figure out which one Mayan uses by default. However, many of them have a minsize parameter set to 2 by default, meaning that words consisting of only one letter are filtered out.
Could this be a starting point for me? Is there a way of using SEARCH_BACKEND_ARGUMENTS to specify Whoosh options?
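To illustrate the minsize suspicion, here is a rough emulation in plain Python of an analyzer that tokenizes, lowercases and then drops short tokens. It does not import Whoosh, the stop-word list of the real analyzers is left out, and the token pattern is my assumption of what Whoosh's default looks like, so treat it as a sketch only.

```python
import re

# Assumed to resemble Whoosh's default token pattern; not taken from Mayan.
TOKEN_PATTERN = re.compile(r"\w+(\.?\w+)*")

def analyze(value, minsize=2):
    """Tokenize, lowercase, then drop tokens shorter than minsize."""
    tokens = [m.group(0).lower() for m in TOKEN_PATTERN.finditer(value)]
    return [token for token in tokens if len(token) >= minsize]

# With minsize=2 the single letter disappears from the index.
print(analyze("Word A"))             # ['word']
# With minsize=1 it would be kept as a separate term.
print(analyze("Word A", minsize=1))  # ['word', 'a']
# The same would happen to the trailing character of our numbering pattern.
print(analyze("XXX-XXXX X"))         # ['xxx', 'xxxx']
```

If this is what happens, “Word A”, “Word B” and “Word C” would all be indexed as the single term `word`, which would also explain why the EXACT search for "Word A" returns all 3 documents.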