The Data Incubator

Businesses are drowning in data
but starving for insights

Distributed Computing with Spark


Spark is a technology at the forefront of distributed computing, offering a higher-level and more powerful API than classic MapReduce. This module is taught using the Python API. We cover core Spark concepts such as resilient distributed datasets (RDDs), memory caching, actions, transformations, tuning, and optimization. Students build functioning applications end to end, then apply that knowledge directly by developing, building, and deploying Spark jobs that run on large, real-world data sets in the cloud (AWS and Google Cloud Platform).

Associated project work

Students will use Spark to parse and process 10GB of data on posts and users from a popular Q&A website. They will extract insights into users' posting habits and develop predictors of user behavior from their posts, using Spark's machine-learning capabilities to discover meaning in unstructured text data.

This module is currently part of our Data Science Fellowship.

Prerequisites

Basic to intermediate Python
Basic to intermediate programming