regex explained

bwakkie
Posts: 31
Joined: Fri Feb 14, 2020 8:28 pm

Post by bwakkie »

I'm trying to find any information on how I should use the regex functionality:

I would like to show the different keywords in the sandbox.
What's wrong with my template, and where is the documentation of the regex usage in templates?

example content:

Code:

Received June 16, 1999

Abstract Based on the sequence analysis of 5.8S subunit and internal transcribed spacers (ITS )
of ribosomal RNA gene (rDNA), the molecular phylogenetic tree of representative species of Pipizini and three groups of Syrphidae with different feeding habits (seven species belong to six genera)
was constructed. Meanwhile, the phylogenetic tree of tribes (including Pipizini and other 17 tribes of
Syrphidae) was constructed using morphological characteristics of adults and larvae and the number of chromosomes. Both the results show that the relationship between Pipizini and predatory
groups is closer than that between Pipizini and saprophagous groups. So it is suggested that
Pipizini be transferred from Milesiinae to Syrphinae.
Keywords: rDNA, molecular phylogeny, cladistics, Syrphidae, Pipizini, phylogenetic position.

bla bla next section etc...

Code:

{{ keywords =  {% regex_search "Keywords\:\(,\? \(\w\+\)\)\+" "{{ document.content }} %}" }}
{% for keyword in keywords: %}
    <li>{{ keyword }}</li>
{% endfor %}
In the end I am looking for a way to automatically create tags based on the given keywords and assign them to the document.
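
A minimal Python sketch of the keyword extraction, assuming the template tag hands its pattern to Python's `re` module. In that syntax `\(` and `\+` match a literal parenthesis and a literal plus sign rather than grouping and repeating, and a repeated capture group keeps only its last repetition, so capturing the whole keyword line and splitting it on commas is usually simpler. The function name `extract_keywords` and the shortened sample text are illustrative only.

```python
import re

# Shortened version of the example content above.
content = """Pipizini be transferred from Milesiinae to Syrphinae.
Keywords: rDNA, molecular phylogeny, cladistics, Syrphidae, Pipizini, phylogenetic position.

bla bla next section etc..."""


def extract_keywords(text):
    """Return the comma-separated keywords found after 'Keywords:' (hypothetical helper)."""
    # Capture the rest of the line; '.' does not match a newline by default.
    m = re.search(r"Keywords:\s*(.+)", text)
    if m is None:
        return []
    # Split on commas, then trim whitespace and the trailing full stop.
    parts = (p.strip().rstrip(".").strip() for p in m.group(1).split(","))
    return [p for p in parts if p]


keywords = extract_keywords(content)
print(keywords)
```

Each entry of `keywords` could then be turned into a tag; how to create and attach tags from a template or workflow is not covered by this sketch.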
Arwis
Posts: 2
Joined: Wed Dec 23, 2020 12:45 pm

Re: regex explained

Post by Arwis »

I tried to use regex_search to find a date in my documents.
After a lot of trial and error I got this:

Code:

{% regex_search "((0[1-9]|[12]\d|3[01]).(0[1-9]|1[0-2]).[12]\d{3})" document.ocr_content|join:"" ignorecase=True %}
It works in the sandbox and also in a workflow. The problem is that I do not get just the matched text back.
This is what I get back:

Code:

<re.Match object; span=(474, 484), match='20.07.2020'>
Maybe the code can help with your problem, and maybe you have an idea how to get only the match back.
Planktom
Posts: 1
Joined: Fri Jan 01, 2021 5:15 pm

Re: regex explained

Post by Planktom »

Hi Arwis,
thanks for the regex_search line. That helped me a lot.
In return, I figured out how to get the results out of the Match object:

Code:

{% regex_search "((0[1-9]|[12]\d|3[01]).(0[1-9]|1[0-2]).[12]\d{3})" document.ocr_content|join:"" ignorecase=True as m %}
{{ m.0 }}
m.0 returns the whole match. If you want the individual capture groups, you can address them via m.1, m.2, and so on.
These Python docs helped me: https://docs.python.org/3/library/re.html#match-objects
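
The same group numbering can be tried in plain Python. This is a minimal sketch, assuming that `m.0`, `m.1`, ... in the template correspond to `m[0]`, `m[1]`, ... on a Python match object. The sample text is made up, and the dots are escaped here (`\.`) so they match only a literal dot; the pattern posted above uses a bare `.`, which matches any character.

```python
import re

# Arwis's date pattern, with the dots escaped (a change from the posted pattern).
DATE_PATTERN = r"((0[1-9]|[12]\d|3[01])\.(0[1-9]|1[0-2])\.[12]\d{3})"

text = "Invoice date: 20.07.2020, please pay within 14 days."  # made-up sample

m = re.search(DATE_PATTERN, text, re.IGNORECASE)

parts = {}
if m is not None:
    parts = {
        "whole": m[0],  # the entire match, same as m.group(0)
        "date": m[1],   # group 1: the outer parentheses, the full date
        "day": m[2],    # group 2: the day
        "month": m[3],  # group 3: the month
    }

print(parts)
```

Printing `m` itself shows the `<re.Match object; ...>` representation, which is why the tag output needs the `as m` form plus an index to get only the text.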

Thanks again. I hope I can help you as well.
Arwis
Posts: 2
Joined: Wed Dec 23, 2020 12:45 pm

Re: regex explained

Post by Arwis »

Hi Planktom,

Wow, many thanks. That helped me a lot.