Pydoop is an efficient, simplified programming interface layer on top of Hadoop that offers native compatibility with the full suite of Python tools for data science. Thanks to its flexible and efficient implementation, Pydoop proved to be an enabling technology for many other data-intensive activities. It formed the basis for many of the bioinformatics tools developed at CRS4, such as high-throughput tools that quickly search for lentiviral insertion points in DNA, calculate allele frequencies of single nucleotide polymorphisms, or preprocess DNA sequences. It was also widely used outside CRS4, with significant discussion in the community and public endorsements from private companies.
Pydoop was described in the following publication:
- S. Leo, G. Zanetti, Pydoop: a Python MapReduce and HDFS API for Hadoop. Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing (HPDC '10), pages 819-825, Chicago, Illinois, June 21-25, 2010.
The open source repository is maintained here: