JEENK
Scalable genomics tools, powered by Apache Flink
Challenge
The rapid advancement of DNA and RNA sequencing technologies is driving exponential growth in the volume of data that sequencing centers must process. The falling cost of data acquisition enables new large-scale applications, but these are held back by the conventional computational techniques still used to process the data.
Overview
Jeenk is a collection of parallel, distributed tools that bring the distributed stream computing approach to large-scale genomics data analysis. Jeenk is built on the Apache Flink stream processing framework and uses Apache Kafka for data movement.
It consists of three Flink-based tools that implement a full raw-to-CRAM pipeline for Illumina data:
- A reader that reads the proprietary raw Illumina BCL files directly from the sequencer's run directory and converts them to read-based (FASTQ-like) data, which are sent to a Kafka broker for storage and further processing (akin to Illumina's bcl2fastq2);
- An aligner that aligns the reads to a reference genome using the BWA-MEM plugin through the RAPI library (http://github.com/crs4/rapi/);
- A CRAM writer that writes the aligned reads as space-efficient CRAM files.
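The data flow through the three tools can be illustrated with a small, self-contained Python sketch. It is not Jeenk code and uses none of the Flink, Kafka, RAPI or CRAM APIs: the `Topic` class is a hypothetical in-memory stand-in for a Kafka topic, the aligner is a toy exact-match search standing in for BWA-MEM, and the writer emits tab-separated text lines rather than CRAM. The byte layout used by `decode_bcl_byte` (base in the two low bits, quality in the six high bits, zero meaning "no call") is assumed from the commonly documented BCL format. All function and class names are illustrative.

```python
from collections import deque
from typing import Iterable, Iterator, List, NamedTuple, Sequence, Tuple

BASES = "ACGT"


class Read(NamedTuple):
    name: str
    bases: str
    quals: Tuple[int, ...]


class Topic:
    """Hypothetical in-memory stand-in for a Kafka topic (no broker involved)."""

    def __init__(self) -> None:
        self._queue: deque = deque()

    def send(self, item) -> None:
        self._queue.append(item)

    def poll(self) -> Iterator:
        while self._queue:
            yield self._queue.popleft()


def encode_bcl_byte(base: str, qual: int = 30) -> int:
    """Pack a base call and a quality score into one byte (assumed BCL layout)."""
    return (qual << 2) | BASES.index(base)


def decode_bcl_byte(value: int) -> Tuple[str, int]:
    """Unpack one byte: low 2 bits are the base, high 6 bits the quality."""
    if value == 0:  # a zero byte is treated as "no call"
        return "N", 0
    return BASES[value & 0b11], value >> 2


def reader(cycles: Sequence[bytes]) -> Iterator[Read]:
    """Turn cycle-major base calls (one byte per cluster per cycle) into reads."""
    for cluster in range(len(cycles[0])):
        calls = [decode_bcl_byte(cycle[cluster]) for cycle in cycles]
        bases = "".join(base for base, _ in calls)
        quals = tuple(qual for _, qual in calls)
        yield Read(f"cluster{cluster}", bases, quals)


def aligner(reads: Iterable[Read], reference: str) -> Iterator[Tuple[Read, int]]:
    """Toy exact-match aligner; a position of -1 means the read is unmapped."""
    for read in reads:
        yield read, reference.find(read.bases)


def writer(aligned: Iterable[Tuple[Read, int]]) -> List[str]:
    """Format each aligned read as 'name<TAB>1-based position<TAB>bases'."""
    return [
        f"{read.name}\t{pos + 1 if pos >= 0 else 0}\t{read.bases}"
        for read, pos in aligned
    ]


def run_pipeline(cycles: Sequence[bytes], reference: str) -> List[str]:
    """Chain reader -> topic -> aligner -> topic -> writer."""
    raw_reads, aligned_reads = Topic(), Topic()
    for read in reader(cycles):
        raw_reads.send(read)
    for record in aligner(raw_reads.poll(), reference):
        aligned_reads.send(record)
    return writer(aligned_reads.poll())


if __name__ == "__main__":
    # Two clusters sequenced for three cycles: reads "ACG" and "GTT".
    demo_cycles = [
        bytes([encode_bcl_byte("A"), encode_bcl_byte("G")]),
        bytes([encode_bcl_byte("C"), encode_bcl_byte("T")]),
        bytes([encode_bcl_byte("G"), encode_bcl_byte("T")]),
    ]
    for line in run_pipeline(demo_cycles, "TTACGTTA"):
        print(line)
```

The sketch runs the stages one after another for simplicity. In a streaming deployment the stages would run concurrently, with the topics decoupling producers from consumers so that each stage can be scaled independently.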
Innovative features
- ultra-scalable state-of-the-art distributed stream processing technology;
- reduced turnaround times.
Potential users
Bioinformatics researchers, sequencing center professionals
Impact sectors
Biotechnologies
Other resources
- https://github.com/crs4/Jeenk
- F. Versaci, L. Pireddu, G. Zanetti, "Scalable genomics: From raw data to aligned reads on Apache YARN", Proc. IEEE Int. Conf. Big Data (Big Data), pp. 1232-1241, Dec. 2016.
- F. Versaci, L. Pireddu, G. Zanetti, Proc. IEEE EMBS Int. Conf. on Biomedical & Health Informatics (BHI), Vol. 2018, pp. 259-262, 2018.