The Data Incubator

Businesses are drowning in data
but starving for insights

Distributed Computing with Spark


Spark is a technology at the forefront of distributed computing, offering a higher-level and more powerful API than classic MapReduce. This module is taught using the Python API. We cover core Spark concepts such as resilient distributed datasets (RDDs), memory caching, actions, transformations, tuning, and optimization. Students build functioning applications end to end, then apply that knowledge directly by developing, building, and deploying Spark jobs that run on large, real-world data sets in the cloud (AWS and Google Cloud Platform).

Associated project work

Students will use Spark to parse and process 10GB of data on posts and users from a popular Q&A website. They will extract insights into users' posting habits and develop predictors of user behavior from their posts, using Spark's machine-learning capabilities to discover meaning in unstructured text data.

This module is currently part of our Data Science Fellowship.

Prerequisites

Basic to intermediate Python
Basic to intermediate programming