Awesome Spark

A curated list of awesome Apache Spark packages and resources.

Apache Spark is an open-source cluster-computing framework. Originally developed at the University of California, Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since. Spark provides an interface for programming entire clusters with implicit data parallelism and fault-tolerance (Wikipedia 2017).

Users of Apache Spark may choose between the Python, R, Scala, and Java programming languages to interface with the Apache Spark APIs.

Packages

Language Bindings

Kotlin for Apache Spark - Kotlin API bindings and extensions.
.NET for Apache Spark - .NET bindings.
sparklyr - An alternative R backend, using dplyr.
sparkle - Haskell on Apache Spark.
spark-connect-rs - Rust bindings.
spark-connect-go - Golang bindings.
spark-connect-csharp - C# bindings.

Notebooks and IDEs

almond - A Scala kernel for Jupyter.
Apache Zeppelin - Web-based notebook that enables interactive data analytics with pluggable backends, integrated plotting, and extensive Spark support out of the box.
Polynote - An IDE-inspired polyglot notebook that supports mixing multiple languages in one notebook and sharing data between them seamlessly. It encourages reproducible notebooks with its immutable data model. Originated at Netflix.
sparkmagic - Jupyter magics and kernels for working interactively with remote Spark clusters through Livy, in Jupyter notebooks.

General Purpose Libraries

itachi - A library that brings useful functions from modern database management systems to Apache Spark.
spark-daria - A Scala library with essential Spark functions and extensions to make you more productive.
quinn - A native PySpark implementation of spark-daria.
Apache DataFu - A library of general-purpose functions and UDFs.
Joblib Apache Spark Backend - joblib backend for running tasks on Spark clusters.

SQL Data Sources

Spark SQL has several built-in data sources for files, including csv, json, parquet, orc, and avro. It also supports JDBC databases as well as Apache Hive. Additional data sources can be added by including the packages listed below or by writing your own.
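
As a rough illustration of how these sources are selected in practice, here is a minimal PySpark sketch. It assumes `pyspark` is installed. The helper names (`read_file_source`, `read_jdbc_table`, `demo`) and the `people.csv` path are hypothetical and exist only for this example; the only Spark API used is the `spark.read.format(...).options(...).load(...)` reader chain.

```python
# File formats named in the paragraph above. Note: avro normally needs the
# separate spark-avro module on the classpath (for example via --packages).
BUILTIN_FILE_FORMATS = ("csv", "json", "parquet", "orc", "avro")


def read_file_source(spark, fmt, path, **options):
    """Hypothetical helper: load `path` with one of the file formats above."""
    if fmt not in BUILTIN_FILE_FORMATS:
        raise ValueError(f"not a built-in file format: {fmt!r}")
    # The format name picks the data source; options are source-specific.
    return spark.read.format(fmt).options(**options).load(path)


def read_jdbc_table(spark, url, table, **options):
    """Hypothetical helper: load a table over JDBC.

    Assumes the matching JDBC driver JAR is already on Spark's classpath.
    """
    return spark.read.format("jdbc").options(url=url, dbtable=table, **options).load()


def demo():
    """Run on a machine with pyspark installed; not called automatically."""
    # Imported lazily so the helpers above can be read without pyspark.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("data-sources-sketch").getOrCreate()
    # "people.csv" is a placeholder path, not a file shipped with anything here.
    df = read_file_source(spark, "csv", "people.csv", header="true", inferSchema="true")
    df.printSchema()
    spark.stop()
```

A third-party data source from the list below is typically used the same way: add its package to the Spark session, then pass the format name that the package documents to `spark.read.format(...)`.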