Watch Folder/IMAP Source Processing

So I have IMAP and Watch folder sources configured and they work well. I know you can change the interval for how often it checks. The help text even says “Interval in seconds between checks for new documents.”

In my testing, however, it seems to process only one document or email per check. Is this by design? When I did a mass import I had to set the interval to every 10 seconds to get it to process faster. I just want to confirm that it is one document per check.

Yes, it is designed to work that way. This prevents a single large document from causing the whole task to get stuck.

Background tasks should be as atomic as possible. Take the OCR task, for example. If a document of 2,000 pages is scheduled for OCR, and OCR of a single page takes 2 minutes, that task would keep executing for 4,000 minutes, or about 66 hours. Adding 100 workers won’t make the OCR finish faster, because it is a single task and therefore a single process. The task will remain in the task queue for 4,000 minutes and might cause other workers to also start executing it from the start, thinking the previous worker died without finishing the task.
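To illustrate the idea, here is a rough Python/Celery sketch (not Mayan’s actual code; the task names, the `ocr_page` helper, and the broker URL are placeholders) of why many small per-page tasks scale while one monolithic task does not:

```python
from celery import Celery, group

app = Celery('ocr_example', broker='redis://localhost:6379/0')


def ocr_page(document_id, page_number):
    """Placeholder for the real per-page OCR call (~2 minutes per page)."""
    ...


@app.task
def task_ocr_page(document_id, page_number):
    """OCR a single page; short-lived, so any free worker can pick one up."""
    return ocr_page(document_id, page_number)


@app.task
def task_ocr_document_monolithic(document_id, page_count):
    """Anti-pattern: one task occupies a single worker for page_count * ~2 minutes."""
    for page_number in range(page_count):
        ocr_page(document_id, page_number)


def schedule_document_ocr(document_id, page_count):
    # Fan out: 2,000 small tasks can be spread across 100 workers,
    # instead of one 4,000-minute task blocking a single worker.
    job = group(
        task_ocr_page.s(document_id, page_number)
        for page_number in range(page_count)
    )
    return job.apply_async()
```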

The same applies to the email sources. Having a task process a single email allows many emails to be processed in parallel. The only improvement missing from the email source processing is that each task will try to process the first email it finds. We have been trying to come up with a way to tell each task to process a different email ID. To do this, the process launching the background tasks would need to connect to the mailbox, grab the list of email IDs, retain that list, and assign an ID to each task as it launches. While good in theory, this would defeat the purpose of the background task system by creating a single point of low concurrency: the initiator code, which runs in the foreground in synchronous mode.
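As a rough sketch of that idea (hypothetical code, not Mayan’s implementation; the task name and connection details are placeholders), the initiator would look something like this, and the UID listing step is exactly the synchronous bottleneck described above:

```python
import imaplib

from celery import Celery

app = Celery('imap_example', broker='redis://localhost:6379/0')


@app.task
def task_process_email(uid):
    """Fetch and process a single message by UID in a background worker."""
    ...


def dispatch_mailbox(host, username, password):
    # This listing step runs in the foreground, synchronously, before any
    # task can start: the single point of low concurrency mentioned above.
    connection = imaplib.IMAP4_SSL(host)
    connection.login(username, password)
    connection.select('INBOX')
    _, data = connection.uid('SEARCH', None, 'UNSEEN')
    uids = data[0].split()
    connection.logout()

    # Fan out: each worker gets a distinct UID instead of every task
    # racing to process the first email it finds.
    for uid in uids:
        task_process_email.delay(uid.decode())
```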

There are other approaches we want to try for distributing the email IDs to tasks and improving efficiency, but as with many other features, we do a bit of research and, if a good solution is not found, we move on to the next item on the agenda to complete the next version release. The item gets re-queued for further research in the next development cycle.

If you would like to learn more about background task processing and asynchronous scaling, I’ve given talks on the topic that are still available online.


I forgot to add that for mass imports we do have a specialized app that was created as part of a commercial collaboration. This app is being generalized to prepare it for addition to the free open source core version. It can import millions of documents in the span of a few days, tracking the state of each document and resuming intelligently in case of network failures. The app was designed for Mayan 4.2 and is in the process of being updated for version 4.5.


As always, a very nice explanation. I suspected it did one at a time but wanted to confirm. The batch process would have been nice to have a few weeks ago! However, the issues it caused taught me a lot, so it was not for nothing!

That batch importing looks awesome. It’s interesting that tools built to solve a problem for a customer may get added to the project. I’ll have to keep my eye out for that one.

