Pydoop: a Python interface for Apache Hadoop
Simone Leo, Gianluigi Zanetti. E-mail:
The number of tools for big data analysis has grown steadily over the years, but not all of them offer a multi-language API. Apache Hadoop, for instance, is written in Java and expects users to write their applications in Java. Given the widespread popularity of Python across many domains, most notably scientific computing, it is highly desirable to bring its rich toolset to the Hadoop environment.
Pydoop is a Python interface for Apache Hadoop, which covers both HDFS access and MapReduce job submission.
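As an illustration of the MapReduce side, here is a minimal word count sketch. The Pydoop names it uses (pydoop.mapreduce.api, pydoop.mapreduce.pipes, context.value, context.emit, the __main__ entry point) are assumed to match the Pydoop 2.x documentation and may differ in other releases. The helpers tokenize, total and local_word_count are hypothetical names introduced only for this example; they keep the logic testable on a machine without Hadoop.

```python
from collections import defaultdict

# Assumption: module and class names follow the Pydoop 2.x documentation.
try:
    import pydoop.mapreduce.api as api
    import pydoop.mapreduce.pipes as pipes
except ImportError:
    api = pipes = None  # Pydoop not installed: only the plain functions below are usable


def tokenize(line):
    """Split one line of input text into words (hypothetical helper)."""
    return line.split()


def total(counts):
    """Sum the partial counts collected for one word (hypothetical helper)."""
    return sum(counts)


def local_word_count(lines):
    """Run the same map and reduce steps in-process, without Hadoop."""
    grouped = defaultdict(list)
    for line in lines:
        for word in tokenize(line):
            grouped[word].append(1)
    return {word: total(counts) for word, counts in grouped.items()}


if api is not None:

    class Mapper(api.Mapper):
        def map(self, context):
            # context.value holds one input record (a line of text)
            for word in tokenize(context.value):
                context.emit(word, 1)

    class Reducer(api.Reducer):
        def reduce(self, context):
            # context.values iterates over all counts emitted for context.key
            context.emit(context.key, total(context.values))

    def __main__():
        # Entry point assumed to be the one Pydoop looks up when the job runs
        pipes.run_task(pipes.Factory(Mapper, Reducer))
```

Calling local_word_count on a small list of strings is a quick way to check the logic before submitting the script to a cluster.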
Innovative features
- simple to use;
- compatible with most existing Python libraries, including SciPy and NumPy, because it is built as a CPython extension.
Potential users
Anyone who needs to process huge amounts of data in Python.
Impact sectors
Distributed computing - scientific computing - big data analysis.