The Data Incubator

"Businesses are drowning in data but starving for insights." – Forrester

12-Day Program

Everyone is starting out at slightly different levels. We really suggest you set aside some time (approximately 5 hours a day over 12 days) between now and the start of the bootcamp to work through this preparatory material. Some people will be able to work through this material in less time; for others with less previous experience, it may take longer. Past program participants have said they got a lot more out of the program because they took the time to practice these skills before starting.

Take a look at each day's Action Items and look at the accompanying material. If you don't know it, take a deeper dive and try to understand it. If you already know some things, just focus on things you don't already know. But definitely make sure you do each day's Action Items.

Ideally, you will be able to go through all of the material of this 12-day program. However, the first eight days are the most important for starting the program on the right foot.

Warning: When working through a coding tutorial, it is really tempting to copy and paste code from the tutorial directly into the Python shell and think you understand it. Don't do this! Force yourself to type the code in character by character: this forces you to learn the commands so that when you need to actually use them, you'll have them closer to your fingertips.

There is a Milestone Project associated with this preparation, which draws on the material you will be learning and involves hosting your work with Binder on Day 8. Keep an eye out for Milestone action items.

Day 1: Python Basics

Python is a great language for data analytics. It offers many (although not all) of the tools available in languages like MATLAB and R. Unlike MATLAB, it's free. Unlike both of these languages, it promotes good coding practices. But more importantly (for when you start working), it's a real engineering language that makes it easy for you to:

  • Collect data from existing databases using the tools that are currently available in your company.
  • Integrate your code and contributions into the rest of the codebase that your company will use.

You are already familiar with programming; you just need to pick up Python's syntax (if you haven't already) and the numerical and scientific tools available. We are using Python 3.

You will use a prepared Python environment on a cloud server for our program, but you will likely find it useful to have a Python environment on your local machine as well. If you don't have one installed, we recommend the Anaconda distribution. It's free for personal use, and it comes with many of the packages we will use already installed.

Resources

  1. A decent online (free!) book is Think Python by Allen B. Downey. Features of Python particularly important to data science include the following (a short sketch after this list illustrates a few of them):
    • Dictionaries (and converting back and forth between other data structures)
    • List Comprehensions
    • Generators (as opposed to lists)
    • Lambda Functions
    • Classes
    • Decorators
    • Variable number of function arguments
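
To give a flavor of these features, here is a minimal sketch (the data and names are just for illustration):

    words = ["data", "science", "python", "data"]

    # Dictionary comprehension: map each word to its length
    lengths = {w: len(w) for w in words}

    # List comprehension: uppercase every word
    upper = [w.upper() for w in words]

    # Generator expression: yields squares lazily instead of building a list
    squares = (n * n for n in range(5))

    # Lambda function used as a sort key
    by_length = sorted(words, key=lambda w: len(w))

    # Variable number of positional and keyword arguments
    def describe(*args, **kwargs):
        return f"{len(args)} positional, {len(kwargs)} keyword"

    print(lengths, upper, list(squares), by_length, describe(1, 2, x=3))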

Action Items

  1. Take a minute to fill out the Pre-Fellowship Questionnaire.
  2. Get Python running on your local machine. If you don't know where to start, use the Anaconda distribution.
  3. To get started with learning Python syntax, we suggest beginning with either the Learn Python tutorial or this tutorial on the official Python website.
  4. Once you’ve gone through the above short tutorials, go to Project Euler and use Python to solve at least 5 problems. Try to choose problems that allow you to practice using dictionaries, list comprehensions, and other Pythonic features.

Day 2: More Python

Now that you've spent time learning Python, continue to improve your proficiency with the language. Spend this day getting familiar with the extensive Python standard library and filling any knowledge gaps you encountered the previous day. There is no problem with spending more time on the material of the first two days; it's really important that you start the program comfortable programming in Python.

Resources

  1. The following are important modules that are part of the Python standard library. It's good to be aware of these before the Fellowship program starts.
    • collections: While you can accomplish much with Python's built-in data structures such as list, tuple, dict, and set, the collections module provides some specialized data structures that make your life more convenient. A few to know about are Counter, namedtuple, and defaultdict.
    • itertools: There are times when you need to come up with all possible permutations or combinations and iterate over them. While you might be able to programmatically create them and store them in a list, that would be memory inefficient, slow, and prone to error. Instead, consider using some of the tools offered by itertools. In general, the module provides a set of fast and memory efficient iterators. A few iterators of note are permutations, combinations, and product.
    • os, glob, and pathlib: You may find yourself needing to perform operating system tasks like creating and removing directories. The tools in the os module will help make your code compatible across different operating systems. If you are looking for functions to work with paths, look into os.path. For example, os.path.join joins paths intelligently, using the correct directory separator for your operating system. The glob module lets you find all file paths matching a given pattern. For example, you can find all files in the "data" directory that end in .csv using the pattern data/*.csv. pathlib is an alternative to os.path and glob. Instead of representing paths and files as strings, it provides a Path class, offering an object-oriented way to work with paths. For example, a Path object has methods to check for file existence and to iterate over subdirectories. A short sketch after this list demonstrates a few of these tools.
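
Here is a minimal sketch of a few of these standard library tools (the "data" directory is just an illustration and need not exist):

    from collections import Counter, defaultdict, namedtuple
    from itertools import combinations, product
    from pathlib import Path

    # Counter tallies hashable items
    counts = Counter("mississippi")            # Counter({'i': 4, 's': 4, ...})

    # namedtuple gives tuple fields readable names
    Point = namedtuple("Point", ["x", "y"])
    p = Point(x=1, y=2)

    # defaultdict supplies a default value instead of raising KeyError
    groups = defaultdict(list)
    groups["evens"].append(2)

    # itertools builds fast, memory-efficient iterators
    pairs = list(combinations([1, 2, 3], 2))   # [(1, 2), (1, 3), (2, 3)]
    grid = list(product("ab", [0, 1]))         # [('a', 0), ('a', 1), ...]

    # pathlib represents paths as objects rather than strings
    for csv_file in Path("data").glob("*.csv"):
        print(csv_file.name)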

Action Items

  1. Continue with the Learn Python tutorial or this tutorial on the official Python website.
  2. Assess the areas of Python where you feel least comfortable. Create an account on HackerRank, a platform for coding challenges. Under the Python section, look for challenges that target those areas. HackerRank even has a Python skills assessment test you can take to better understand what areas you need to work on.

Day 3: Scientific Computation

Python has a wonderful suite of libraries for doing scientific computing. The two main packages are numpy and scipy.

What is NumPy, and why is it relevant for scientific computing? NumPy is a Python package that provides, among other things, a multi-dimensional array object, the ndarray, which allows for very fast array operations. NumPy is written in the C programming language, allowing the library to leverage the speed of C for fast vectorized operations. Rather than writing slow Python for loops, you should vectorize your operations with NumPy wherever possible.
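
As a rough illustration of the difference, both snippets below compute the same sum of squares; the vectorized version pushes the loop down into compiled code:

    import numpy as np

    x = np.arange(1_000_000)

    # Slow: an explicit Python loop over a million elements
    total = 0
    for v in x:
        total += v * v

    # Fast: the same computation, vectorized and executed in compiled code
    total_vec = np.sum(x * x)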

NumPy also forms the basis of, or is used extensively by, other packages such as scipy and scikit-learn, so having a firm understanding of NumPy is useful for using other packages.

scipy is a package that provides many efficient numerical routines such as numeric integration, optimization, linear algebra, and statistics methods. While useful for scientific computing, we won't use scipy much in the Fellowship program.

Resources

  1. An excellent tutorial on NumPy.
  2. If you like videos, try watching this one and following along with their GitHub code.
  3. For those of you familiar with MATLAB, here’s a guide for translating your MATLAB knowledge into NumPy.

Action Items

  1. Make certain that you know how to create an array in numpy. There are many ways to do this, such as the array function, zeros and ones for pre-filled arrays of a given size, and the arange and linspace functions for sequences. (A short sketch after this list shows each of these.)
  2. Create some NumPy arrays, manipulate them with basic operations, and try out some of the numpy universal functions.
  3. Do the one-star exercises and as many of the two-star exercises as you can from the NumPy 100 exercises page.
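
A minimal sketch of these creation routines and a universal function:

    import numpy as np

    a = np.array([[1, 2], [3, 4]])   # from a nested list
    z = np.zeros((2, 3))             # 2x3 array of zeros
    o = np.ones(5)                   # length-5 array of ones
    r = np.arange(0, 10, 2)          # array([0, 2, 4, 6, 8])
    s = np.linspace(0, 1, 5)         # 5 evenly spaced points from 0 to 1

    # Basic operations and a universal function (ufunc)
    print(a + 10)          # elementwise addition
    print(a @ a)           # matrix multiplication
    print(np.sqrt(a))      # ufunc applied elementwise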

Day 4: Pandas

pandas is the Python package for data analysis and manipulation. It's a package designed to make working with relational (or tabular) data fast and easy. The package provides many methods for performing data analysis, data manipulation and aggregation, as well as many visualization tools (largely built on top of the matplotlib library). If you are familiar with R, then you'll notice similarities with R's data.frame.

pandas is built on top of the numpy package, hence a good understanding of numpy is valuable for working easily in pandas, and pandas is designed to integrate well with other Python libraries such as scikit-learn and statsmodels. pandas also has great capabilities for working with time series data and includes a large suite of time series-specific methods.

Resources

  1. The homepage of pandas is a great place to start, as are the Getting started tutorials. Important things to know (a few of which are sketched after this list) include:
    • How to read data from a CSV (comma-separated value) file to create a DataFrame.
    • How to filter data in a DataFrame.
    • How to compute summary statistics for a DataFrame.
    • How to use the groupby method to aggregate data.
    • How to dump results of analysis to a CSV file.
  2. There are several nice pandas tutorial videos on YouTube. Here is a playlist of a series we liked. It's broken up by topic so feel free to skip around and focus on areas you need to strengthen.
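
As a hedged sketch of the operations listed above (the file and column names here are placeholders, not the actual data set):

    import pandas as pd

    # Hypothetical file and column names
    df = pd.read_csv("businesses.csv")

    # Filter rows with a boolean mask
    queens = df[df["Borough"] == "Queens"]

    # Summary statistics for a column
    print(df["Total Savings"].describe())

    # Group and aggregate
    jobs_by_borough = df.groupby("Borough")["Jobs Created"].sum()

    # Dump results to a CSV file
    jobs_by_borough.to_csv("jobs_by_borough.csv")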

Milestone Action Items

Download the data set about the Value of Energy Cost Saving Program for businesses in New York City (under the "Export" option, there is a way to retrieve a CSV file). Answer the following questions. A hedged sketch of the filter-then-aggregate pattern needed for question 4 appears after the list.

  1. How many different companies are represented in the data set?
  2. What is the total number of jobs created for businesses in Queens?
  3. How many unique email domains are there in the data set?
  4. Considering only NTAs with at least 5 listed businesses, what is the average total savings and the total jobs created for each NTA?
  5. Save your result for the previous question as a CSV file.
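
If you get stuck on the aggregation step, here is a hedged sketch of the filter-then-aggregate pattern; the file and column names are assumptions, so check them against the actual CSV header:

    import pandas as pd

    df = pd.read_csv("energy_cost_savings.csv")   # hypothetical file name

    # Keep only NTAs with at least 5 listed businesses...
    counts = df.groupby("NTA").size()
    big_ntas = counts[counts >= 5].index

    # ...then aggregate within each remaining NTA
    result = (
        df[df["NTA"].isin(big_ntas)]
        .groupby("NTA")
        .agg({"Total Savings": "mean", "Jobs Created": "sum"})
    )

    result.to_csv("nta_summary.csv")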

Day 5: Plotting

Matplotlib is Python's most popular plotting package. It provides an object-oriented programming API and a procedural API similar to that of MATLAB. While much can be done with Matplotlib, for both nicer looking and statistically focused visualization, consider using Seaborn, which extends and builds on Matplotlib.
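
A minimal sketch of the object-oriented API, using made-up data:

    import numpy as np
    import matplotlib.pyplot as plt

    # Made-up data, purely for illustration
    x = np.random.rand(50)
    y = 2 * x + 0.1 * np.random.randn(50)

    fig, ax = plt.subplots()          # a Figure containing one Axes
    ax.scatter(x, y)                  # plot on the Axes object directly
    ax.set_xscale("log")              # switch to a logarithmic x-axis
    ax.set_xlabel("average savings")
    ax.set_ylabel("jobs created")
    fig.savefig("scatter.png")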

With these plotting libraries, don't focus on trying to memorize the entire API. A good starting point is to visit the examples page and look for visualizations similar to what you are trying to build. Altair and Bokeh are two Python packages for creating interactive visualizations. In the Fellowship program, we'll focus on Altair.

Milestone Action Items

  1. Go through the Introductory Tutorials on Matplotlib.
  2. Using the same data set and results that you were working with in the pandas action items section (Day 4), create:
    • a scatter plot of jobs created versus average savings, using both a standard and a logarithmic scale for the average savings,
    • a histogram of the log of the average total savings, and
    • a line plot of the total jobs created for each month.
  3. If you have time, take a look at this short tutorial on Altair.

Day 6: Unix and Git

In the tech field, it's really hard to avoid working in a Unix-like operating system. Two of the most common Unix-like operating systems are Linux and macOS. As a data scientist, you need to be comfortable working in a Unix-like environment and leveraging Unix tools. The cloud server we will provide runs a Jupyter server (via JupyterHub) on a Debian-based Linux distribution.

When developing software, it's crucial to keep track of changes to the files that make up the project (the codebase). In short, developers need version control software. Git is the most commonly used version control software. With Git, you can better manage your code, easily try new ideas, and discard changes that don't work out. It also makes collaborating with others easier.

Action Items

  1. Get access to a Unix terminal. This should not be a problem if you run macOS or Linux, but may be an issue if you are running Windows. Until you've been given access to our cloud server, you may want to install either Windows Subsystem for Linux or Cygwin.
  2. Complete this tutorial about working in a Linux command line.
  3. Install Git and set up a GitHub account, if you haven't done so already. GitHub is a popular service for hosting Git-tracked projects. Additionally, GitHub provides other features that facilitate collaboration and development. GitHub no longer lets users authenticate with a password for Git operations. You'll need to authenticate using SSH keys.
  4. Complete the first ten labs of this Git tutorial.
  5. Learn branching by completing "Introduction Sequence" under the "Main" tab and "Push & Pull -- Git Remotes!" under the "Remote" tab of this interactive Git branching tutorial.

Day 7: HTTP, HTML, and Regular Expressions

Data scientists work with web APIs all the time; we'll cover what they are and how to use them in the Fellowship program. However, knowing how communication across the web works will help when using web APIs.
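
As a small taste of HTTP from Python, the sketch below issues a GET request with the requests package (httpbin.org is a public service that echoes back request details):

    import requests

    response = requests.get("https://httpbin.org/get", params={"q": "data"})

    print(response.status_code)              # 200 on success
    print(response.headers["Content-Type"])  # application/json
    print(response.json()["args"])           # {'q': 'data'}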

A common task for a data scientist is to extract data from messy text or HTML. During the program, we will go over web scraping in Python with Beautiful Soup, but in order for you to properly wield such a tool, you need to understand the basics of HTML.

A regular expression is a sequence of characters that defines a search pattern. With this search pattern, one can perform text processing operations such as find and replace. In Python, regular expressions are provided by the re module. There's a great interactive online tutorial that teaches you how to use them. For testing your regular expressions, use RegEx101. It's a free regex debugger, complete with real-time explanations, error detection, and analysis. (Just be certain to choose "Python" as the "flavor" in the menu on the left.)
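
A minimal sketch of the re module in action:

    import re

    text = "Contact us at info@example.com or support@data.org"

    # findall returns the captured group for every match
    domains = re.findall(r"@([\w.]+)", text)   # ['example.com', 'data.org']

    # sub performs a find-and-replace on every match
    redacted = re.sub(r"[\w.]+@[\w.]+", "[email]", text)
    print(domains, redacted)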

Resources

  1. Read this short tutorial and this other one explaining how the Hypertext Transfer Protocol (HTTP) works.
  2. Review this HTML tutorial by W3 Schools. In particular, look at the "HTML Introduction", "HTML Elements", and "HTML Attributes" sections.

Action Item

  1. Work through at least the first eight lessons on RegexOne. If you have more time, do as many of the remaining lessons as you can.

Day 8: Hosting your Work with Binder

Binder is a great (and free) service that makes it easy to share your Jupyter notebooks for others to both view and run. A Jupyter notebook is an interactive document that can contain code, text, and visualizations. Data scientists often use Jupyter notebooks to document and showcase their work. In fact, our curriculum is built using Jupyter notebooks and hosted with JupyterHub.

Since we'll be using Jupyter notebooks, you should spend some time familiarizing yourself with how they look and work. Launch an instance of JupyterHub here to go through the basics of interacting with Jupyter notebooks.

Milestone Project

For the milestone project, you will host a Git repository on GitHub that is capable of launching an instance of JupyterHub hosting a notebook file that contains your work from the pandas (Day 4) and plotting (Day 5) action items. Follow these steps:

  1. Fork this repo that contains the files to get going with Binder. Forking is a process that creates a copy of the repo that lives on your own GitHub account. That way, you can work on the forked copy. You can read about forking and how it's done here.
  2. Create and add a notebook file to the repository that contains the work from the action items of the pandas and plotting days. In other words, the notebook should now be part of your forked Git repo. Note that any data files you used will need to be part of the repo.
  3. Update the file requirements.txt. This file lists the third-party Python packages required to run the code in your notebooks. We already have a few packages listed, but you may need to include at least one more. While version numbers are not required, pinning them ensures that the same version of each package is installed every time, which makes your code more likely to run and your results more reproducible. You can read more about the requirements.txt file here.
  4. Update the Binder launch link in README.md. If you look at the raw README.md file, you will see that the "launch binder" link points to the original repo and not your copy. Make sure to update your README.md with the link to your forked copy of the repo.
  5. Push all of your changes to GitHub. If everything worked, then hitting the "launch binder" icon in the README.md will take you to a working instance of JupyterHub with the notebooks and files of your repo. Make sure to run your notebook to check that everything works.
  6. Submit your project here.

Day 9: SQL

As a data scientist, you will often have to work with databases, both reading from and writing to them. SQL is the language used to communicate with relational databases, i.e., databases that arrange data in tabular form (rows and columns). Feedback from some of our hiring partners indicates that being knowledgeable about more advanced SQL queries is important for interviewing.

During the program, we will spend time going over the syntax of SQL, advanced SQL features, and interfacing with relational databases with Python. However, this material will be easier to digest if you come in with some experience with SQL.

Working with SQL on your computer requires installing a relational database management system (RDBMS) that uses SQL. Installation and setup can be a little painful, and you will then need to set up a database.

Resources

  1. There’s a nice online interactive tutorial by Mode.
  2. If you want to install a relational database system on your own laptop, we suggest PostgreSQL or SQLite. Both are free. (SQLite is also bundled with Python via the sqlite3 module; see the sketch after this list.)
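
Since SQLite ships with Python's standard library, you can try SQL without installing anything; a minimal sketch:

    import sqlite3

    conn = sqlite3.connect(":memory:")   # a throwaway in-memory database
    conn.execute("CREATE TABLE jobs (borough TEXT, n INTEGER)")
    conn.executemany(
        "INSERT INTO jobs VALUES (?, ?)",
        [("Queens", 5), ("Bronx", 3), ("Queens", 2)],
    )

    # A SELECT with an aggregate function and GROUP BY
    query = "SELECT borough, SUM(n) FROM jobs GROUP BY borough"
    for row in conn.execute(query):
        print(row)   # e.g. ('Bronx', 3) and ('Queens', 7)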

Action Items

  1. Work through Mode's "Basic SQL" tutorial.
  2. If you have more time, consider looking at the "Intermediate SQL" section of the Mode tutorial. Specifically, look at the sections on aggregate functions, GROUP BY, and joins. Don't worry if you do not have time; we'll go over those operations during the program.

Day 10: Algorithms and Data Structures 1

Algorithms and data structures are important foundational knowledge for programming. While we can get away without knowing these concepts when we start to learn how to program, eventually we'll need a better understanding to become effective programmers. You should not think of the following materials as necessary to complete in the next two days, but as a starting point. Over the course of the program, if you put in the time, you'll be able to converse fluently about the fundamental topics of computer science.

Resources

There's a great interactive Python-based tutorial on data structures and algorithms available online. From Basic Data Structures, understand:

  • Stacks
  • Queues
  • Linked Lists

Many algorithms are implemented by breaking a problem down into smaller and simpler subproblems. You'll need to understand recursion and dynamic programming. Work through the Recursion chapter.
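
A minimal sketch of both ideas: plain recursion, and dynamic programming via memoization:

    from functools import lru_cache

    def factorial(n):
        """Recursion: a base case plus a smaller subproblem."""
        if n <= 1:
            return 1
        return n * factorial(n - 1)

    @lru_cache(maxsize=None)
    def fib(n):
        """Dynamic programming via memoization: subproblem results are cached."""
        if n < 2:
            return n
        return fib(n - 1) + fib(n - 2)

    print(factorial(5), fib(30))   # 120 832040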

Action Items

  1. As you go over chapters, make sure to work through the in-text activities.
  2. Complete the following HackerRank problems on linked lists, which will require you to implement recursion.

Attention

The material here and in the next section covers exactly the kinds of questions that tend to come up in programming job interviews for data science and engineering positions, so this is good practice for finding a job. Consider doing these exercises with a timer to simulate interview conditions.

Day 11: Algorithms and Data Structures 2

With an understanding of fundamental data structures and recursion, we can move onto more complicated data structures to solve a wider range of problems. As you're thinking about these topics and the ones from yesterday, keep time complexity in mind. It's an extremely common question to ask about the running time of an algorithm or potential solution. You can find more on Wikipedia and use this cheatsheet as reference. The following are some additional topics you need to be familiar with.
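
As a small example of reasoning about time complexity, here is binary search, a classic O(log n) algorithm:

    def binary_search(items, target):
        """Return the index of target in sorted items, or -1 if absent.

        O(log n): each iteration halves the remaining search interval.
        """
        lo, hi = 0, len(items) - 1
        while lo <= hi:
            mid = (lo + hi) // 2
            if items[mid] == target:
                return mid
            if items[mid] < target:
                lo = mid + 1
            else:
                hi = mid - 1
        return -1

    print(binary_search([1, 3, 5, 7, 9], 7))   # 3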

Resources

From Sorting and Searching, understand:

  • Hash Tables
  • The Bubble Sort
  • The Quick Sort

From Trees and Tree Algorithms, understand:

  • Binary Trees
  • Priority Queues with Binary Heaps
  • Binary Search Trees

From Graphs and Graph Algorithms, understand:

  • Breadth First Search
  • Depth First Search
  • Shortest Path

Action Item

  1. As you go over the chapters, make sure to work through the in-text activities.

Day 12: Statistics and Probability

At this point, you should learn or brush up on the basics of statistics and probability. There are a plethora of books on statistics and probability out there; however, Think Stats by Allen Downey is good, free, and written for Python programmers.

Resources

From Think Stats, understand:
  1. Distributions, Chapter 2
  2. Modeling distributions, Chapter 5
  3. Estimation, Chapter 8
  4. Hypothesis testing, Chapter 9
  5. Bayes' Theorem, Chapter 2 of Think Bayes 2 (a small worked example follows this list).
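
As a small worked example of Bayes' theorem, with made-up numbers:

    # Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
    # A test with 99% sensitivity and 95% specificity, for a
    # condition with 1% prevalence.
    p_disease = 0.01
    p_pos_given_disease = 0.99           # sensitivity
    p_pos_given_healthy = 0.05           # 1 - specificity

    # Total probability of a positive test (law of total probability)
    p_pos = (p_pos_given_disease * p_disease
             + p_pos_given_healthy * (1 - p_disease))

    p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
    print(round(p_disease_given_pos, 3))   # 0.167: most positives are false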

Attention

The material here covers exactly the kinds of questions that tend to come up in job interviews for data science and engineering positions, so this is good practice for finding a job. Again, consider doing these exercises with a timer to simulate interview conditions.

Extra: Web Apps with Flask and Streamlit

The following project is optional. If you have extra time before the program starts, you may want to attempt this project. Further, you should consult this resource if you want to create a web application for your capstone.

The power of the web comes from dynamic websites, which are backed by a web server. For your capstone, you may wish to create a web application, which will enable user interactivity or provide another way to showcase your work. If you follow this route for your capstone, your application may consist of three main parts:

  1. A database to store the data,
  2. A middle layer of code that handles the "business logic" of the website, and/or
  3. HTML which is rendered to the user.

Amongst a number of "frameworks" for building web servers, Python's Flask is probably the easiest to get started with. Recently, however, libraries like Streamlit have made it possible to build simple applications without handling the web framework directly. They simplify the development cycle, but offer limited functionality and customization as compared to Flask.
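
To show how little code getting started takes, here is a minimal Flask sketch:

    # app.py -- a minimal Flask app; run with `flask run` or `python app.py`
    from flask import Flask

    app = Flask(__name__)

    @app.route("/")
    def index():
        # A real app would render a template with your data and plots
        return "<h1>Hello from Flask!</h1>"

    if __name__ == "__main__":
        app.run(debug=True)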

Extra Project

Fork this Streamlit Demo repository and create your own Streamlit app on Render.com that accepts a stock ticker input from the user and plots closing price data for the last 100 days. If you want more room for customization, you can instead fork this repository to create the app using Flask.

You may use any data source you like for this project. One example is the Alpha Vantage API, which provides this data for free (you just need to request a free API key). For security reasons, you should avoid putting API keys (or any credentials) directly in your code, especially when it will be published in a public repository. One solution is to use the python-dotenv package for local development and Render.com's mechanism for providing a .env file that contains "secrets", or you may define environment variables for production.
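
A minimal sketch of this pattern with python-dotenv (the variable name ALPHA_VANTAGE_KEY is just an example):

    # Contents of a local .env file (listed in .gitignore, never committed):
    #   ALPHA_VANTAGE_KEY=your-key-here
    import os

    from dotenv import load_dotenv

    load_dotenv()   # reads .env into the process environment, if present
    api_key = os.environ["ALPHA_VANTAGE_KEY"]   # example variable name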

You can use Python's requests package to access the Alpha Vantage API. This guide demos Alpha Vantage and several other providers, as well as giving tips on the best practices for dealing with stock market data in general. You can analyze the data using pandas and plot using Altair, Bokeh, or Plotly. By the end, you should have some kind of visualization viewable from the internet.
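
As a hedged sketch of fetching closing prices with requests (the endpoint and JSON key names follow Alpha Vantage's daily time series API at the time of writing; verify them against the current docs):

    import os

    import pandas as pd
    import requests

    response = requests.get(
        "https://www.alphavantage.co/query",
        params={
            "function": "TIME_SERIES_DAILY",
            "symbol": "IBM",
            "apikey": os.environ["ALPHA_VANTAGE_KEY"],
        },
    )
    daily = response.json()["Time Series (Daily)"]

    # Closing prices for the most recent 100 days, oldest first
    closes = (
        pd.Series({day: float(v["4. close"]) for day, v in daily.items()})
        .sort_index()
        .tail(100)
    )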

Render.com is a popular cloud application platform, and their documentation is a great resource for understanding how to deploy a simple (and free!) web service. Follow this tutorial on how to use Render to launch a Python web application using Flask. You can connect a GitHub account to your Render account, or you should be able to deploy any publicly available repository to create a Render web application. Check our Streamlit repository for a few comments about configuring your Streamlit deployment on Render, as you need to give Render the proper Streamlit command to successfully run your application.

To get started with Streamlit, we recommend working through the examples in the official documentation. For a more in-depth look, try this tutorial. Finally, this site showcases a wide range of example apps that may provide further inspiration.

For more information about Flask, here is a good starting point, especially the links in the "Starting Off" section.

When you finish, you should end up with a project that looks like this if you are using Flask. (Note: It might take a minute or two for this application to start up.) If you go with Streamlit, your project might look something like this.

Project Deployment Tips

Environment Management

You may do your development on your local machine or on your remote server, but accessing the running Flask/Streamlit server on the remote system is difficult. To develop locally, you will need a Python distribution with the relevant packages. We highly recommend using a package and environment manager. Unfortunately, there are several in the Python ecosystem. Since this is a fairly simple project, you may want to just use the standard venv and pip.

Useful to Know

Is your app taking a while to start on Render? The first time you deploy your app, Render needs to build the image and start the web service, which can take up to about ten minutes for the whole deployment process. Subsequent visits to your app should load more quickly, but Render will shut down the server if it sits idle.