regex explained

bwakkie
Posts: 33
Joined: Fri Feb 14, 2020 8:28 pm

regex explained

Post by bwakkie »

I'm trying to find documentation on how to use the regex functionality in templates.

In the sandbox, I would like to list the individual keywords found in a document.
What is wrong with my template below, and where is the documentation for regex usage in templates?

example content:

Code: Select all

Received June 16, 1999

Abstract Based on the sequence analysis of 5.8S subunit and internal transcribed spacers (ITS )
of ribosomal RNA gene (rDNA), the molecular phylogenetic tree of representative species of Pipizini and three groups of Syrphidae with different feeding habits (seven species belong to six genera)
was constructed. Meanwhile, the phylogenetic tree of tribes (including Pipizini and other 17 tribes of
Syrphidae) was constructed using morphological characteristics of adults and larvae and the number of chromosomes. Both the results show that the relationship between Pipizini and predatory
groups is closer than that between Pipizini and saprophagous groups. So it is suggested that
Pipizini be transferred from Milesiinae to Syrphinae.
Keywords: rDNA, molecular phylogeny, cladistics, Syrphidae, Pipizini, phylogenetic position.

bla bla next section etc...

Code: Select all

{{ keywords =  {% regex_search "Keywords\:\(,\? \(\w\+\)\)\+" "{{ document.content }} %}" }}
{% for keyword in keywords: %}
    <li>{{ keyword }}</li>
{% endfor %}
Ultimately, I am looking for a way to automatically create tags from the extracted keywords and assign them to the document.
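For reference, here is a minimal sketch in plain Python of the extraction the post is after, assuming the template tags follow the syntax of Python's `re` module. In that syntax, `(`, `+` and `?` are grouping and quantifier characters when left unescaped, while `\(` matches a literal parenthesis. A repeated capture group also keeps only its last repetition, so capturing the whole line and splitting it is usually simpler. The `extract_keywords` helper and the shortened `content` sample are hypothetical and are not part of any template API.

```python
import re

# Shortened sample of the document text quoted in the post.
content = """Pipizini be transferred from Milesiinae to Syrphinae.
Keywords: rDNA, molecular phylogeny, cladistics, Syrphidae, Pipizini, phylogenetic position.

bla bla next section etc...
"""


def extract_keywords(text):
    """Return the comma-separated keywords that follow 'Keywords:' on one line."""
    # Capture everything after "Keywords:" up to the end of that line.
    match = re.search(r"Keywords:\s*(.+)", text)
    if match is None:
        return []
    # Split on commas, then trim whitespace and the trailing period.
    parts = match.group(1).split(",")
    return [p.strip().rstrip(".").strip() for p in parts if p.strip()]


keywords = extract_keywords(content)
print(keywords)
```

Each entry of the resulting list could then be turned into a tag; how that is done inside a workflow is not covered by this sketch.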
Arwis
Posts: 2
Joined: Wed Dec 23, 2020 12:45 pm

Re: regex explained

Post by Arwis »

I tried to use regex_search to find a date in my documents.
After a lot of trial and error, I got this:

Code: Select all

{% regex_search "((0[1-9]|[12]\d|3[01]).(0[1-9]|1[0-2]).[12]\d{3})" document.ocr_content|join:"" ignorecase=True %}
It works in the sandbox and also in a workflow. The problem is that I do not get only the matched text back.
This is what I get back:

Code: Select all

<re.Match object; span=(474, 484), match='20.07.2020'>
Maybe the code can help with your problem, and maybe you have an idea of how to get only the matched text back.
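One detail worth knowing about the pattern above: an unescaped `.` matches any character, not just a literal dot. The sketch below, in plain Python and assuming the dates are always dot-separated, compares the posted pattern with a variant that escapes the separators. The names `loose`, `strict` and `finds_date` are illustrative only.

```python
import re

# Pattern as posted: "." matches any single character.
loose = r"((0[1-9]|[12]\d|3[01]).(0[1-9]|1[0-2]).[12]\d{3})"
# Variant with escaped dots: only a literal "." is accepted as separator.
strict = r"((0[1-9]|[12]\d|3[01])\.(0[1-9]|1[0-2])\.[12]\d{3})"


def finds_date(pattern, text):
    """Return True if the pattern matches anywhere in the text."""
    return re.search(pattern, text) is not None


print(finds_date(loose, "20.07.2020"))
print(finds_date(loose, "20x07y2020"))   # also matches, which may be unintended
print(finds_date(strict, "20.07.2020"))
print(finds_date(strict, "20x07y2020"))
```

The looser pattern can be an advantage on OCR text, where a dot is sometimes misread, so which variant fits depends on the documents.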
Planktom
Posts: 1
Joined: Fri Jan 01, 2021 5:15 pm

Re: regex explained

Post by Planktom »

Hi Arwis,
thanks for the regex_search line. That helped me a lot.
In return, I figured out how to get the results out of the Match object:

Code: Select all

{% regex_search "((0[1-9]|[12]\d|3[01]).(0[1-9]|1[0-2]).[12]\d{3})" document.ocr_content|join:"" ignorecase=True as m %}
{{ m.0 }}
m.0 returns the whole match. If you want the individual capture groups, you can address them via m.1, m.2, and so on.
These Python docs helped me: https://docs.python.org/3/library/re.html#match-objects

Thanks again. I hope I can help you as well.
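The same behaviour can be reproduced in plain Python, which may make the group numbering easier to see. This is a sketch under two assumptions: that the `ignorecase=True` argument corresponds to `re.IGNORECASE`, and that `m.0` in the template ends up as the index lookup `m[0]`. The sample `text` is invented for illustration.

```python
import re

pattern = r"((0[1-9]|[12]\d|3[01]).(0[1-9]|1[0-2]).[12]\d{3})"
text = "Invoice dated 20.07.2020, due in 30 days"  # hypothetical OCR text

m = re.search(pattern, text, re.IGNORECASE)

# Printing the object itself shows the repr from the earlier post.
print(repr(m))

whole = m[0]          # entire match, same as m.group(0)
outer = m.group(1)    # outer parentheses wrap the whole pattern
day = m.group(2)      # first inner group
month = m.group(3)    # second inner group
print(whole, outer, day, month)
```

Because the outer parentheses enclose the entire pattern, group 1 is identical to group 0 here, and the day and month are groups 2 and 3.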
Arwis
Posts: 2
Joined: Wed Dec 23, 2020 12:45 pm

Re: regex explained

Post by Arwis »

Hi Planktom,

Wow, many, many thanks. That helped me a lot.
Tim
Posts: 3
Joined: Wed Feb 03, 2021 12:08 pm

Re: regex explained

Post by Tim »

This post has been most helpful for me in utilizing the power of the regex search. I have created a workflow that searches newly added documents for a series of unique identifiers: account numbers, unique order number formats, etc. The workflow runs a series of regex searches, each followed by the metadata value to assign if that search matches.

For example, this would decide what the metadata value for “Company” would be:

Code: Select all

{% regex_search "[search parameters]" document.latest_version.ocr_content|join:" " as amazon %}
{% regex_search "[different search]" document.latest_version.ocr_content|join:" " as comcast %}
{% if amazon.0 is not None %}Amazon
{% elif comcast.0 is not None %}Comcast
{% endif %}
So far, so good, except that the line breaks following the regex_search tags are preserved in the output, so the result actually starts with two blank lines, followed by the value. Think “\r\r”.

This means that if you build an index on {{document.metadata_value_of_Company}}, the entries added by the workflow are not grouped with those entered manually. It would look like this:

Amazon
Comcast
Amazon
Comcast

If the workflow entry is all one line, it seems to work, e.g.:

Code: Select all

{% regex_search "[search parameters]" document.latest_version.ocr_content|join:" " as amazon %}{% regex_search "[different search]" document.latest_version.ocr_content|join:" " as comcast %}{% if amazon.0 is not None %}Amazon{% elif comcast.0 is not None %}Comcast{% endif %}
But with even a small number of searches, this becomes pretty hard to manage and edit. Does anyone have a suggestion on how to ignore those preceding line breaks?
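To illustrate why the entries fail to group, here is a small plain-Python sketch. It assumes the workflow values carry two leading line breaks, as described above, and shows that stripping whitespace before comparison makes them equal to the manual entries. The `normalize` helper and the sample `values` are hypothetical, and whether the stripping can be applied at the template or index level is not something this sketch establishes.

```python
from collections import Counter


def normalize(value):
    """Remove leading and trailing whitespace, including line breaks."""
    return value.strip()


# Manual entries alongside workflow entries that start with two line breaks.
values = ["Amazon", "\n\nAmazon", "Comcast", "\n\nComcast"]

raw_groups = Counter(values)
clean_groups = Counter(normalize(v) for v in values)

print(len(raw_groups))    # number of distinct groups before stripping
print(len(clean_groups))  # number of distinct groups after stripping
```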