Data Manipulation and Cleaning in Python

Summary

The first step in data science is mastering its computational foundations. We cover the fundamental programming topics relevant to data science, including pandas, NumPy, SciPy, matplotlib, regular expressions, SQL, JSON, XML, checkpointing, and web scraping, which together form the core toolkit for handling structured and unstructured data in Python. Students gain practical experience manipulating messy, real-world data using these libraries. They also walk away with a firm understanding of tools and practices such as pip, git, IPython, Jupyter notebooks, pdb, and unit testing, which build on existing open source packages to accelerate data exploration, development, debugging, and collaboration.

Associated project work

Students will scrape picture captions from a website that tracks the goings-on of New York’s socially well-to-do. By extracting names from these captions, they will assemble a graph of friendships among this crowd. Analysis of this graph will produce insights about the most connected New Yorkers.
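As a rough illustration of the caption-to-graph idea, here is a minimal sketch using only the standard library. The captions, the separator pattern, and the function names (`extract_names`, `build_graph`, `degree`) are all assumptions made up for this example and are not taken from the course materials; real captions are messier and would need more careful parsing.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical captions, invented for illustration only.
# In the project these would come from scraped web pages.
CAPTIONS = [
    "Alice Smith, Bob Jones, and Carol White",
    "Bob Jones and Carol White",
    "Alice Smith with Dan Brown",
]

# Assumed separators between names: commas (optionally followed by "and"),
# the word "and", or the word "with".
SEPARATORS = re.compile(r",\s*(?:and\s+)?|\s+and\s+|\s+with\s+")


def extract_names(caption):
    """Split a caption into a list of names using the assumed separators."""
    parts = SEPARATORS.split(caption)
    return [part.strip() for part in parts if part and part.strip()]


def build_graph(captions):
    """Count how often each pair of names appears in the same caption."""
    edges = Counter()
    for caption in captions:
        names = sorted(set(extract_names(caption)))
        for pair in combinations(names, 2):
            edges[pair] += 1
    return edges


def degree(edges):
    """Map each name to the number of distinct people it is linked to."""
    neighbors = {}
    for first, second in edges:
        neighbors.setdefault(first, set()).add(second)
        neighbors.setdefault(second, set()).add(first)
    return {name: len(linked) for name, linked in neighbors.items()}


edges = build_graph(CAPTIONS)
degrees = degree(edges)
most_connected = max(degrees, key=degrees.get)
```

Here each caption contributes one edge per pair of people pictured together, and the edge weight counts repeat appearances. Treating co-appearance as "friendship" is itself a modeling assumption.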

Students will gain experience with Python-based data wrangling technologies to extract insights from a structured, web-API-based dataset. Students will learn the fundamental building blocks of data extraction, manipulation, and aggregation via Pandas DataFrames and good Python programming practice.

This module is currently part of our Data Science Fellowship.

NumPy and SciPy

NumPy
Data types (the nouns)
Operations (the verbs)
Persisting NumPy objects
SciPy

Matplotlib

Matplotlib and Pyplot
Matplotlib plots from Pandas
Seaborn

Pandas

Nouns (objects) in Pandas
Verbs (operations) in Pandas
Loading data (and basic statistics / visualization)
Indexing data frames
Filtering data
Joining data
Aggregating data
Sorting by indices and columns
cut and qcut
Unique values
Handling missing and NA data
Manipulating strings
Indices in Pandas
Function application and mapping
Pandas Timestamps
Multi-indices, stacking, and pivot tables
Plugging into more advanced analytics
Taking advantage of parallel computing

Using Web APIs

The Requests package
Web APIs
Authenticated APIs
API Request Limitations
Conclusion

Scraping

Basic web scraping workflow
Find the webpage
Inspect the webpage HTML
Write code to parse HTML
Fetching subsequent pages
Conclusion

Overview of Scraping and Munging Technologies

Concepts, languages, and tools
Concrete tasks in Python
Python library cheat sheet

How to (Software) Engineer Real Good

Writing Functional Code
Version control and other tools
Testing
Testing the web in Flask
Linting
Writing "good code"
Self-documenting code
Code review
Time Management

Python

Jupyter notebooks and the kernel
Variables
Functions
Logic and program flow
Iteration
Whitespace Matters
Putting it all together

Dealing with Strings in Python

The string data structure
Unicode and Byte Strings
Basic string processing
StringIO in Python
Regular expressions

Functions

Python functions basics
Functions as first-class objects
Recursion
Closures
Decorators

Object-oriented programming

Everything is an Object
Defining a Python Class
Adding Attributes and Methods
Inheritance
Putting it all together...

Exceptions

Catching general exceptions
Handling success
Doing something with the error
Raising errors
Exceptions and the call stack
Reading traceback

Debugging

Errors and Exceptions
NameError
TypeError
AttributeError
KeyError
Reading code critically

Iterators, Generators, and Coroutines

Iterables and Iterators
Generators
Generator "pipelines"
Generator comprehensions
Time complexity
Itertools in Python
Coroutines
Coroutine "pipelines"
Broadcasting
Coroutines as classes
Unifying generators and coroutines

Prerequisites

Basic Python

The Data Incubator
