Extract metadata from email

maxsabba
Posts: 7
Joined: Mon Dec 03, 2018 1:05 pm

Extract metadata from email

Post by maxsabba »

Following the "Exploring Mayan EDMS" book, I was able to create an email source. Now I need to extract information from the sender's email address and from the subject of each email, and store these values in two metadata fields.

Does anyone have experience with this?

Thanks.
sam228
Posts: 1
Joined: Wed Apr 29, 2020 8:38 am

Re: Extract metadata from email

Post by sam228 »

  • First you create a document type, e.g. mail
  • Then you set up two metadata types, subject and sender, and associate them with the document type you created.
  • When you set up the mail source, you choose that document type.
  • In the two drop-downs, "metadata subject" and "metadata sender", you select the two metadata types you created.
That's it!
maxsabba
Posts: 7
Joined: Mon Dec 03, 2018 1:05 pm

Re: Extract metadata from email

Post by maxsabba »

I can assign the subject and sender metadata manually, as you described. But I need a process that automatically extracts these metadata from the mail and assigns the values to the downloaded email and its attachments.

Thanks
crypta
Posts: 1
Joined: Sun Jan 24, 2021 11:01 am

Re: Extract metadata from email

Post by crypta »

I would like to second this request.
How can the header data be parsed for the information it contains?

Also, it seems the HTML body needs to be handled separately. By default it is just a "dead document".
Another problem I have: the "move into..." option after parsing does not seem to work for me. The mails just got deleted.
spirkaa
Posts: 1
Joined: Tue Jan 26, 2021 7:05 am

Re: Extract metadata from email

Post by spirkaa »

crypta wrote: Sun Jan 24, 2021 11:15 am
I would like to second this request.
How can the header data be parsed for the information it contains?

I configured this with a workflow and a few obscure templates to sanitize the input.
  • The mail source assigns a specific document type.
  • A workflow is applied to this document type.
  • The workflow has no reset transition, so it is executed only once.
  • The mail header looks like "DocNumber DocDate DocHeader" and is temporarily saved by the mail source in a metadata field called meta_header. For example: "01-08-320 01.09.2020 Hello There"
  • In the workflow state actions, the temporary value of meta_header is split on spaces and its parts are used to fill in the corresponding metadata fields: DocNumber => list[0] => meta_number, DocDate => list[1] => meta_date. The last action extracts DocHeader => list[2:] and updates meta_header with the final value.
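For readers who want to see the split on its own, here is a minimal plain-Python sketch of the mapping described in the last bullet. It is not Mayan code: the function name split_mail_header is made up for illustration, and it leaves out the digit checks that the workflow templates perform.

```python
def split_mail_header(raw_header: str) -> dict:
    """Map 'DocNumber DocDate DocHeader' onto the three metadata fields.

    Hypothetical helper for illustration only; it does not call Mayan.
    """
    # str.split() with no argument splits on any run of whitespace
    # and ignores leading/trailing whitespace.
    parts = raw_header.split()
    return {
        # list[0] => meta_number
        "meta_number": parts[0] if len(parts) > 0 else "",
        # list[1] => meta_date
        "meta_date": parts[1] if len(parts) > 1 else "",
        # list[2:] => final value of meta_header
        "meta_header": " ".join(parts[2:]),
    }


print(split_mail_header("01-08-320 01.09.2020 Hello There"))
```

In the actual setup each of the three values is produced by its own workflow action, which is why the header is re-split in every template.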
Workflow and States (screenshots omitted)
Transitions (Triggers: Document version parsing finished)
No actions for the 0% state.
Actions for the 100% state. It is important to put numbers in front of the action names, because actions execute in name order, and I want to add the metadata fields first and update the meta_header field last.
Action 0 - Add metadata fields meta_number and meta_date
Action 1 - Edit metadata field meta_number. The value of this field must be DocNumber.

Code:

{% regex_sub "\s+" " " document.metadata_value_of.meta_header as tmp_header %}{% with tmp_header.strip|split:" " as header_splitted %}{% regex_match "[0-9]" header_splitted.0 as starts_with_number %}{% if starts_with_number %}{{ header_splitted.0 }}{% endif %}{% endwith %}
What's going on here?
  1. regex_sub replaces runs of whitespace ("\s+") with a single space, and the result is saved as tmp_header.
  2. .strip removes spaces from the beginning and end of tmp_header; tmp_header is then split on spaces and the result is saved as header_splitted.
  3. regex_match checks that the item at index 0 of header_splitted starts with a digit, and the result is saved as starts_with_number.
  4. If starts_with_number is true, the value of header_splitted.0 is written to the metadata field. Otherwise nothing is written.
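The four steps can also be followed in plain Python, which may help when debugging a template. This is only a sketch: the function name header_part_if_numeric is hypothetical, and it assumes that regex_sub and regex_match behave like Python's re.sub and re.match (re.match anchors at the start of the string, which fits the "starts with a number" description). The index bounds check is an addition for the Python version.

```python
import re


def header_part_if_numeric(raw_header: str, index: int) -> str:
    """Return the part at `index` if it starts with a digit, else ''."""
    # Step 1: collapse every run of whitespace into a single space.
    tmp_header = re.sub(r"\s+", " ", raw_header)
    # Step 2: strip both ends, then split on single spaces.
    header_splitted = tmp_header.strip().split(" ")
    # Added guard (not in the template): a missing index yields ''.
    if index >= len(header_splitted):
        return ""
    part = header_splitted[index]
    # Step 3: re.match only matches at the start of the string,
    # so this tests whether the first character is a digit.
    starts_with_number = re.match(r"[0-9]", part) is not None
    # Step 4: write the value only when the check passed.
    return part if starts_with_number else ""


print(header_part_if_numeric("  01-08-566   26.01.2021  Test ", 0))
print(header_part_if_numeric("Hello There", 0))
```

Passing index 1 instead of 0 gives the corresponding check for the date part.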
Action 2 - Edit metadata field meta_date. The value of this field must be DocDate.

Code:

{% regex_sub "\s+" " " document.metadata_value_of.meta_header as tmp_header %}{% with tmp_header.strip|split:" " as header_splitted %}{% regex_match "[0-9]" header_splitted.1 as starts_with_number %}{% if starts_with_number %}{{ header_splitted.1 }}{% endif %}{% endwith %}
What's going on here? The same as in Action 1, but with header_splitted.1.
Action 3 - Edit metadata field meta_header. The value of this field must be only DocHeader.

Code:

{% regex_sub "\s+" " " document.metadata_value_of.meta_header as tmp_header %}{% with tmp_header.strip|split:" " as header_splitted %}{% regex_match "[0-9]" header_splitted.0 as starts_with_number_0 %}{% regex_match "[0-9]" header_splitted.1 as starts_with_number_1 %}{% if starts_with_number_0 and starts_with_number_1 %}{{ header_splitted | slice:"2:" | join:" " }}{% else %}{{ tmp_header.strip }}{% endif %}{% endwith %}
What's going on here? The same sanitizing with regex_sub and strip as above. Then meta_header is updated with the slice of header_splitted starting at index 2 (because index 0 is DocNumber and index 1 is DocDate), joined into a single string with a space delimiter, but only if starts_with_number_0 and starts_with_number_1 are both true. Otherwise the sanitized tmp_header is kept unchanged.
For testing in the Document Sandbox, you can replace document.metadata_value_of.meta_header in the template with the string " 01-08-566 26.01.2021 Test i need more space " or any other string of your choice.
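The Action 3 logic, including its fallback branch, can also be checked outside Mayan with a small plain-Python sketch. The function name remaining_header is hypothetical, the extra spaces in the sample string are my own, and the code assumes regex_sub and regex_match correspond to Python's re.sub and re.match.

```python
import re


def remaining_header(raw_header: str) -> str:
    """Return the header text without the leading number and date parts."""
    # Same sanitizing as in the templates: collapse whitespace, strip ends.
    tmp_header = re.sub(r"\s+", " ", raw_header).strip()
    parts = tmp_header.split(" ")
    # Both leading parts must start with a digit; the length check is
    # an added guard for the Python version.
    first_two_numeric = (
        len(parts) >= 2
        and re.match(r"[0-9]", parts[0]) is not None
        and re.match(r"[0-9]", parts[1]) is not None
    )
    if first_two_numeric:
        # Equivalent of slice:"2:" followed by join:" ".
        return " ".join(parts[2:])
    # Fallback: keep the sanitized header as it is.
    return tmp_header


print(remaining_header("  01-08-566   26.01.2021  Test i need   more space "))
print(remaining_header("  Hello   There "))
```

A header that does not start with a number and a date passes through with only the whitespace cleaned up, which matches the else branch of the template.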
rosarior
Developer
Posts: 624
Joined: Tue Aug 21, 2018 3:28 am
Location: Puerto Rico

Re: Extract metadata from email

Post by rosarior »

This is an impressive workflow setup @spirkaa. Thanks for sharing it!