Mayan for Bioinformatics Data Management

Posted: Fri Oct 25, 2019 5:56 pm
by asclepios

I work for a genomics research center.

We are planning to use Mayan for our protocol and SOP management, and we are also looking for a solution to track our data files. Our genomics data files are basically big plaintext files (1-10 GB per file). We mainly want to archive, read, and gather extra metadata for these files, so Mayan looks like a good solution.

I am wondering whether Mayan would perform well on such big files. Are there any configuration parameters that I should adjust to ease the management of these big files?


Posted: Mon Oct 28, 2019 12:31 am
by rosarior

This is an interesting deployment! Mayan should work well with files of any size. The files are processed in the background when uploaded. The process detects the MIME type for conversion and preview and determines the page count. The speed at which uploaded documents appear and become ready for use will depend on the amount of resources you devote to the install. There are no inherent page size or document count limits anywhere in the code.

The Docker image is a one-size-fits-all that favors small to medium installations. It is a good choice to get started, but it sacrifices scalability for ease of install, so it might not be the best option for your deployment in the long term. I suggest a direct deployment or a custom Docker image. Also, launch multiple workers for the document processing queue to avoid a bottleneck when uploading many big documents at the same time.

The other change I recommend is storage. If you can, use block storage for performance, or object storage if your document count is going to be big.

Mayan has many settings that can be tweaked to optimize for different workloads, but those are the initial general suggestions.

Posted: Mon Oct 28, 2019 12:38 pm
by asclepios

Thanks for the reply, I will definitely apply the changes you are suggesting. I will also test both block storage and object storage (Minio) and see which performs best for our situation.
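
For anyone wanting to run the same kind of comparison, here is a rough sketch of a sequential throughput check in Python. It is not part of Mayan; the function name measure_throughput and its parameters are made up for illustration. It assumes each storage option is reachable as a directory path (object storage such as Minio would need to be mounted as a filesystem for this to apply, otherwise an S3 client would be needed instead). The read figure is likely to be inflated by the operating system's page cache, so treat the numbers as a first impression only.

```python
import os
import tempfile
import time


def measure_throughput(directory, size_mb=64, chunk_mb=4):
    """Write, then read back, a scratch file in `directory`.

    Returns (write_MB_per_s, read_MB_per_s). Hypothetical helper,
    not a Mayan API.
    """
    chunk = os.urandom(chunk_mb * 1024 * 1024)
    chunk_count = max(1, size_mb // chunk_mb)
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        # Timed sequential write, forced to storage with fsync.
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as handle:
            for _ in range(chunk_count):
                handle.write(chunk)
            handle.flush()
            os.fsync(handle.fileno())
        write_seconds = time.perf_counter() - start

        # Timed sequential read (may be served from the page cache).
        bytes_read = 0
        start = time.perf_counter()
        with open(path, "rb") as handle:
            while True:
                block = handle.read(len(chunk))
                if not block:
                    break
                bytes_read += len(block)
        read_seconds = time.perf_counter() - start
    finally:
        # Always remove the scratch file, even if a step failed.
        os.remove(path)

    written_mb = chunk_count * chunk_mb
    read_mb = bytes_read / (1024 * 1024)
    return written_mb / write_seconds, read_mb / read_seconds


if __name__ == "__main__":
    # Replace the temp directory with the mount point you want to test.
    write_rate, read_rate = measure_throughput(tempfile.gettempdir())
    print(f"write: {write_rate:.1f} MB/s, read: {read_rate:.1f} MB/s")
```

Running it once per candidate mount point, with a size_mb closer to the real 1-10 GB files, should give a fairer picture than the small default.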

Posted: Thu Jan 02, 2020 11:25 am
by rssfed23

A tip for anyone else interested in this use case: make sure you disable automatic OCR on document types where you're importing FASTQ or BAM files.
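
One way to keep sequencing files out of the OCR queue is to give them their own document type and route uploads to it by file extension. The sketch below shows only the routing decision, in plain Python. It does not call Mayan; the function name, the document type labels and the extension list are all assumptions for illustration, and you would create the matching document types (with automatic OCR turned off on the sequencing one) in your own install.

```python
from pathlib import Path

# Hypothetical document type labels; they must match types you create yourself.
SEQUENCING_TYPE = "Sequencing data (OCR off)"
DEFAULT_TYPE = "Protocols and SOPs (OCR on)"

# Assumed extensions for FASTQ and BAM files, plus common compression wrappers.
SEQUENCING_SUFFIXES = {".fastq", ".fq", ".bam"}
COMPRESSION_SUFFIXES = {".gz", ".bz2", ".xz"}


def pick_document_type(filename):
    """Return the document type label a file should be uploaded under."""
    suffixes = [suffix.lower() for suffix in Path(filename).suffixes]
    # Look past a trailing compression suffix, e.g. "reads.fastq.gz".
    if suffixes and suffixes[-1] in COMPRESSION_SUFFIXES:
        suffixes = suffixes[:-1]
    if suffixes and suffixes[-1] in SEQUENCING_SUFFIXES:
        return SEQUENCING_TYPE
    return DEFAULT_TYPE


if __name__ == "__main__":
    for name in ("sample_R1.fastq.gz", "tumor.bam", "SOP_v2.pdf"):
        print(name, "->", pick_document_type(name))
```

A pre-upload script could use a decision like this to pick the target document type before handing the file to Mayan.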

Aside from that, I've found no problem using Mayan to store 100GB+ NGS WGS files.

Asclepios: How did you get on in the end? Which storage backend did you find most performant in your setup, and for what type of files? I ask because over time we'll be adding tweaks to the documentation, so being able to share experiences on this would be beneficial.
In my setup (admittedly a home enthusiast one rather than a lab), I found that alignment tools (BWA) performed better over NFS than over Samba or iSCSI, but Mayan itself did well with object storage using Minio.