JEENK

Scalable genomics tools powered by Apache Flink

Contacts

Francesco Versaci, Luca Pireddu, Gianluigi Zanetti. E-mail: valorisation@crs4.it

Challenge

Rapid advances in DNA and RNA sequencing technologies are producing exponentially growing volumes of data for sequencing centers to process. The falling cost of data acquisition enables new large-scale applications, but the conventional computational techniques used to process the data hold them back.

Overview

Jeenk is a collection of parallel, distributed genomics tools that bring the distributed stream computing approach to large-scale genomics data analysis. Jeenk is built on the Apache Flink data streaming framework and uses Apache Kafka for data movement.

It consists of three Flink-based tools that implement a full raw-to-CRAM pipeline for Illumina data:

  • A reader that reads Illumina’s proprietary raw BCL files directly from the sequencer’s run directory and converts them to read-based, FASTQ-like data (much as Illumina’s bcl2fastq2 does), then sends the result to a Kafka broker for storage and further processing;
  • An aligner that aligns the reads to a reference genome using the BWA-MEM plugin through the RAPI library (http://github.com/crs4/rapi/);
  • A CRAM writer that writes the aligned reads as space-efficient CRAM files.
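
The reader's conversion step can be illustrated with a small sketch. BCL data is organized per sequencing cycle (one base call for every cluster), whereas FASTQ-like data is organized per read, so the core of the conversion is a transposition. The sketch below is not Jeenk's code: it assumes the commonly documented per-cycle BCL byte layout (base in the two low bits, quality score in the upper six, an all-zero byte meaning "no call"), works on in-memory byte strings rather than real run directories, and skips file headers, filter files, and the Kafka output. The function names are hypothetical.

```python
from typing import List, Tuple

BASES = "ACGT"  # assumed encoding: 0=A, 1=C, 2=G, 3=T in the two low bits


def decode_bcl_byte(b: int) -> Tuple[str, int]:
    """Decode one base-call byte into (base, Phred quality)."""
    if b == 0:  # an all-zero byte is reserved for "no call"
        return "N", 0
    return BASES[b & 0b11], b >> 2


def cycles_to_reads(cycles: List[bytes]) -> List[Tuple[str, str]]:
    """Transpose cycle-major base calls into read-major (sequence, quality) pairs.

    cycles[c][k] holds the call for cluster k at sequencing cycle c.
    """
    if not cycles:
        return []
    n_clusters = len(cycles[0])
    if any(len(cycle) != n_clusters for cycle in cycles):
        raise ValueError("every cycle must cover the same clusters")
    reads = []
    for k in range(n_clusters):
        calls = [decode_bcl_byte(cycle[k]) for cycle in cycles]
        seq = "".join(base for base, _ in calls)
        qual = "".join(chr(q + 33) for _, q in calls)  # Phred+33, as in FASTQ
        reads.append((seq, qual))
    return reads


def to_fastq(name: str, seq: str, qual: str) -> str:
    """Format one read as a four-line FASTQ record."""
    return f"@{name}\n{seq}\n+\n{qual}\n"


# Three cycles, two clusters; each byte is (quality << 2) | base_index.
example_cycles = [bytes([120, 163]), bytes([121, 0]), bytes([82, 120])]
example_reads = cycles_to_reads(example_cycles)
# example_reads == [("ACG", "??5"), ("TNA", "I!?")]
```

In a streaming setting, each (sequence, quality) record produced this way would be the unit published to the message broker, so that downstream stages can start aligning reads before the whole run has been converted.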

Innovative features

  • ultra-scalable state-of-the-art distributed stream processing technology;
  • reduced turnaround times.

Potential users

Bioinformatics researchers, sequencing center professionals

Impact sectors

Biotechnologies

Other resources

  1. https://github.com/crs4/Jeenk
  2. F. Versaci, L. Pireddu, G. Zanetti, “Scalable genomics: From raw data to aligned reads on Apache YARN”, Proc. IEEE Int. Conf. Big Data (Big Data), pp. 1232-1241, Dec. 2016.
  3. F. Versaci, L. Pireddu, G. Zanetti, Proc. IEEE EMBS Int. Conf. on Biomedical & Health Informatics (BHI), Vol. 2018, pp. 259-262, 2018.