Pydoop

Pydoop: a Python interface for Apache Hadoop

Contacts

Simone Leo, Gianluigi Zanetti. E-mail: valorisation@crs4.it

Challenge

Over the years, the list of tools for big data analysis has grown steadily. However, not all of them offer a multi-language API. Apache Hadoop, for instance, is written in Java and primarily expects users to write their applications in Java. Given the popularity of Python across many domains, most notably scientific computing, it is highly desirable to bring its rich toolset to the Hadoop environment.

Overview

Pydoop is a Python interface for Apache Hadoop that covers both HDFS access and MapReduce job submission.
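
To give a feel for the MapReduce side, the following is a minimal word-count sketch, not taken from the Pydoop distribution. The helper names (map_words, reduce_counts, run_locally) are hypothetical, and run_locally only imitates Hadoop's grouping step on a single machine. The optional block at the end assumes the pydoop.mapreduce.api module and its context.emit interface as described in the project documentation; it is skipped when Pydoop is not installed.

```python
from collections import defaultdict


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word, 1) for word in line.split()]


def reduce_counts(word, counts):
    """Reduce step: sum all the counts emitted for one word."""
    return word, sum(counts)


def run_locally(lines):
    """Hypothetical local stand-in for Hadoop: group pairs by key, then reduce."""
    grouped = defaultdict(list)
    for line in lines:
        for word, count in map_words(line):
            grouped[word].append(count)
    return dict(reduce_counts(word, counts) for word, counts in grouped.items())


try:
    # Assumed Pydoop wiring: the same logic wrapped in Mapper/Reducer classes.
    import pydoop.mapreduce.api as api

    class Mapper(api.Mapper):
        def map(self, context):
            # context.value is assumed to hold one line of input text.
            for word, count in map_words(context.value):
                context.emit(word, count)

    class Reducer(api.Reducer):
        def reduce(self, context):
            # context.values is assumed to iterate over the counts for context.key.
            context.emit(*reduce_counts(context.key, context.values))
except ImportError:
    pass  # Pydoop is not installed: only the local simulation is available.


if __name__ == "__main__":
    print(run_locally(["big data in python", "python on hadoop"]))
```

Keeping the map and reduce logic in plain functions means it can be checked on a laptop before being submitted to a cluster.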

Innovative features

  • simple to use;
  • compatible with most existing Python libraries, including SciPy and NumPy (it’s built as a CPython extension).

Potential users

Anyone who needs to process huge amounts of data in Python.

Impact sectors

Distributed computing – scientific computing – big data analysis.

Other resources

  1. https://crs4.github.io/pydoop/
  2. S. Leo, G. Zanetti. Pydoop: a Python MapReduce and HDFS API for Hadoop. In HPDC ’10: Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, pages 819–825. Chicago, Illinois, June 21–25, 2010.