JEENK

Scalable genomics tools powered by Apache Flink

Contacts

Francesco Versaci, Luca Pireddu, Gianluigi Zanetti. E-mail: valorisation@crs4.it

Challenge

Rapid advances in DNA and RNA sequencing technologies are producing an exponential increase in the volume of data that sequencing centers must process. The falling cost of data acquisition enables new large-scale applications, but these are held back by the conventional computational techniques used to process the data.

Overview

Jeenk is a collection of parallel, distributed tools that bring the distributed stream computing approach to large-scale genomics data analysis. Jeenk is based on the Apache Flink data streaming framework and uses Apache Kafka for data movement.

It consists of three Flink-based tools that implement a full raw-to-CRAM pipeline for Illumina data:

A reader (akin to Illumina’s bcl2fastq2) that reads the proprietary raw Illumina BCL files directly from the sequencer’s run directory and converts them to read-based, FASTQ-like data, which is sent to a Kafka broker for storage and further processing;
An aligner that aligns the reads to a reference genome using the BWA-MEM plugin through the RAPI library (http://github.com/crs4/rapi/);
A CRAM writer that writes the aligned reads as space-efficient CRAM files.
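
The three stages above can be pictured with a small, purely conceptual sketch. It is not Jeenk's code and uses neither Flink nor Kafka nor BWA-MEM: chained Python generators stand in for the streaming jobs, an exact substring search stands in for the aligner, and a tab-separated text line stands in for a CRAM record. All names (`Read`, `AlignedRead`, `reader`, `aligner`, `writer`) are hypothetical, and the input is assumed to be organised per sequencing cycle, so the reader's job is essentially a transposition from cycles to reads.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Optional, Tuple

# Hypothetical record types, for illustration only.
@dataclass
class Read:
    name: str
    bases: str
    quals: str  # Phred+33 encoded quality string, as in FASTQ


@dataclass
class AlignedRead:
    read: Read
    position: Optional[int]  # 0-based offset in the reference, None if unmapped


def reader(cycles: List[List[Tuple[str, int]]]) -> Iterator[Read]:
    """Turn per-cycle base calls into per-cluster reads.

    Assumption: cycles[c][k] is the (base, quality) call for cluster k at cycle c.
    """
    for k, calls in enumerate(zip(*cycles)):
        bases = "".join(base for base, _ in calls)
        quals = "".join(chr(q + 33) for _, q in calls)
        yield Read(f"cluster{k}", bases, quals)


def aligner(reads: Iterable[Read], reference: str) -> Iterator[AlignedRead]:
    """Toy stand-in for a real aligner: exact substring match only."""
    for read in reads:
        pos = reference.find(read.bases)
        yield AlignedRead(read, pos if pos >= 0 else None)


def writer(aligned: Iterable[AlignedRead]) -> List[str]:
    """Toy stand-in for the CRAM writer: one tab-separated line per read."""
    lines = []
    for a in aligned:
        pos = "*" if a.position is None else str(a.position)
        lines.append(f"{a.read.name}\t{pos}\t{a.read.bases}\t{a.read.quals}")
    return lines


# Two clusters observed over three cycles (made-up data).
reference = "ACGTTGCA"
cycles = [
    [("G", 30), ("T", 20)],
    [("T", 30), ("T", 21)],
    [("T", 30), ("A", 22)],
]

# Each stage consumes the previous stage's output lazily, record by record.
records = writer(aligner(reader(cycles), reference))
for line in records:
    print(line)
```

Because each stage only pulls records from the one before it, no stage needs the complete data set before it starts working. That is the property the streaming design relies on, although in Jeenk the stages are separate Flink tools exchanging data through a Kafka broker rather than generators in one process.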

Innovative features

ultra-scalable state-of-the-art distributed stream processing technology;
reduced turnaround times.

Potential users

Bioinformatics researchers, sequencing center professionals

Impact sectors

Biotechnologies
