"Businesses are drowning in data but starving for insights." — Forrester
Everyone is starting out at a slightly different level. We strongly suggest you set aside some time (approximately 5 hours a day over 12 days) between now and the start of the bootcamp to work through this preparatory material. Some people will be able to work through it in less time; for others with less prior experience, it may take longer. Past program participants have said they got much more out of the program because they took the time to practice these skills before starting.
Take a look at each day's Action Items and review the accompanying material. If a topic is new to you, take a deeper dive and try to understand it. If you already know some of it, focus on what you don't. But definitely make sure you complete each day's Action Items.
Ideally, you will be able to go through all of the material of this 12-day program. However, the first eight days are the most important for starting the program on the right foot.
Warning: When working through a coding tutorial, it is really tempting to copy and paste code from the tutorial directly into the Python shell and think you understand it. Don't do this! Force yourself to type the code in character by character: doing so forces you to learn the commands so that when you actually need them, they'll be closer to your fingertips.
There is a Milestone Project associated with working through this material, which draws on the material you should be learning and involves hosting your work with Binder on day 8. Keep an eye out for Milestone action items.
Python is a great language for data analytics. It offers a lot (although not all) of the tools available in languages like Matlab and R. Unlike Matlab, it's free. Unlike both these languages, it promotes good coding practices. But more importantly (for when you start working), it's a real engineering language that makes it easy for you to:
You are already familiar with programming; you just have to get familiar with Python’s syntax (if you aren't already) and the numerical and scientific tools available. We are using Python 3.
You will use a prepared Python environment on a cloud server for our program, but you will likely find it useful to have a Python environment on your local machine as well. If you don't have one installed, we recommend the Anaconda distribution. It's free for personal use, and it comes with many of the packages we will use already installed.
Resources
Action Items
Now that you've spent time learning Python, continue to improve your proficiency with the language. You want to spend this day getting familiar with the extensive Python standard library and filling any knowledge gaps you encountered the previous day. It's fine to spend more time on the material from the first two days; it's really important that you start the program comfortable programming in Python.
Resources
`collections`: While you can accomplish much with Python's built-in data structures such as `list`, `tuple`, `dict`, and `set`, the `collections` module provides some specialized data structures that make your life more convenient. A few to know about are `Counter`, `namedtuple`, and `defaultdict`.

`itertools`: There are times when you need to come up with all possible permutations or combinations and iterate over them. While you might be able to programmatically create them and store them in a list, that would be memory-inefficient, slow, and prone to error. Instead, consider using some of the tools offered by `itertools`. In general, the module provides a set of fast and memory-efficient iterators. A few iterators of note are `permutations`, `combinations`, and `product`.

`os`, `glob`, and `pathlib`: You may find yourself needing to do operating system tasks like creating and removing directories. The tools in the `os` module will help make your code compatible across different operating systems. If you are looking for functions to work with paths, look into `os.path`. For example, `os.path.join` joins paths intelligently, using the correct directory separator for your operating system. The `glob` module lets you find all file paths matching a given pattern. For example, you can find all files in the "data" directory that end in `.csv` using the pattern `data/*.csv`. `pathlib` is an alternative to `os.path` and `glob`. Instead of representing paths and files as strings, it provides a `Path` class, offering an object-oriented way to work with paths. For example, a `Path` object has methods to check for file existence and to iterate over all subdirectories.

Action Items
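To warm up, you might type out and run a short sketch like this one, which exercises the modules above (the word list and file path are just invented examples):

```python
from collections import Counter, defaultdict
from itertools import combinations
from pathlib import Path

words = ["data", "science", "data", "python"]

# Counter tallies hashable items for you
counts = Counter(words)
print(counts["data"])         # 2
print(counts.most_common(1))  # [('data', 2)]

# defaultdict supplies a default value for missing keys
by_length = defaultdict(list)
for word in words:
    by_length[len(word)].append(word)
print(by_length[7])           # ['science']

# combinations yields pairs lazily, without building a big list up front
pairs = list(combinations(["a", "b", "c"], 2))
print(pairs)                  # [('a', 'b'), ('a', 'c'), ('b', 'c')]

# Path objects replace error-prone string manipulation for file paths
p = Path("data") / "raw.csv"
print(p.suffix)               # .csv
print(p.exists())             # False, unless the file actually exists
```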
Python has a wonderful suite of libraries for doing scientific computing. The two main packages are `numpy` and `scipy`.
What is NumPy, and why is it relevant for scientific computing? NumPy is a Python package that provides, among other things, a multi-dimensional array object, the `ndarray`, which allows for very fast array operations. NumPy is written in the C programming language, allowing the library to leverage the speed of C for fast vectorized operations. Rather than using slow `for` loops, you should vectorize your operations with NumPy wherever possible.
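As a sketch of what vectorization buys you, compare an explicit Python loop with the equivalent vectorized NumPy expression (the array here is arbitrary):

```python
import numpy as np

x = np.arange(100_000, dtype=np.float64)

# Slow: a Python-level loop visiting every element one at a time
squares_loop = np.empty_like(x)
for i in range(len(x)):
    squares_loop[i] = x[i] ** 2

# Fast: one vectorized operation, executed in compiled C code
squares_vec = x ** 2

print(np.allclose(squares_loop, squares_vec))  # True
```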
NumPy also forms the basis of, or is used extensively by, other packages such as `scipy` and `scikit-learn`, so a firm understanding of NumPy will help you use those packages as well.
`scipy` is a package that provides many efficient numerical routines, such as numerical integration, optimization, linear algebra, and statistics methods. While useful for scientific computing, we won't use `scipy` much in the Fellowship program.
Resources
Action Items
Create arrays with `numpy`. There are many different ways this can be done, such as using the `array` function, `zeros` and `ones` for pre-filled arrays of given sizes, and the `arange` and `linspace` functions for sequences. Get familiar with `numpy` universal functions.

`pandas` is the Python package for data analysis and manipulation. It's a package designed to make working with relational (or tabular) data fast and easy. The package provides many methods for data analysis, manipulation, and aggregation, as well as many visualization tools (largely built on top of the `matplotlib` library). If you are familiar with R, you'll notice similarities with R's `data.frame`.
`pandas` is built on top of the `numpy` package, so a good understanding of `numpy` makes working in `pandas` easier. `pandas` is designed to integrate well with other Python libraries such as `scikit-learn` and `statsmodels`. It also has great capabilities for working with time series data, including a large suite of time series-specific methods.
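A minimal sketch of the kind of tabular workflow pandas enables (the boroughs and dollar amounts are invented):

```python
import pandas as pd

df = pd.DataFrame({
    "borough": ["Queens", "Brooklyn", "Queens", "Brooklyn"],
    "savings": [100.0, 250.0, 300.0, 50.0],
})

# Select rows with a boolean mask
big = df[df["savings"] > 150]
print(len(big))  # 2

# Aggregate with groupby, much like SQL's GROUP BY
totals = df.groupby("borough")["savings"].sum()
print(totals["Queens"])  # 400.0
```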
Resources
Use the `groupby` method to aggregate data.

Milestone Action Items
Download the data set on the Value of Energy Cost Saving Program for businesses in New York City (under the "Export" option, there is a way to retrieve a CSV file). Answer the following questions.
Matplotlib is Python's most popular plotting package. It provides an object-oriented programming API and a procedural API similar to that of MATLAB. While much can be done with Matplotlib, for both nicer looking and statistically focused visualization, consider using Seaborn, which extends and builds on Matplotlib.
With these plotting libraries, don't try to memorize the entire API. A good starting point is to visit the examples page and look for visualizations similar to what you are trying to build. Altair and Bokeh are two Python packages for creating interactive visualizations. In the Fellowship program, we'll focus on using Altair.
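As a small sketch of Matplotlib's object-oriented API (the data and file name are arbitrary; the `Agg` backend just renders off-screen so no display is needed):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display window required
import matplotlib.pyplot as plt

xs = [0, 1, 2, 3, 4]
ys = [x ** 2 for x in xs]

# The object-oriented API: a Figure containing one Axes
fig, ax = plt.subplots()
ax.plot(xs, ys, marker="o", label="y = x^2")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()
fig.savefig("example.png")  # write the figure to a PNG file
```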
Milestone Action Items
In the tech field, it's really hard to avoid working in a Unix-like operating system. Two of the most common Unix-like operating systems are Linux and macOS. As a data scientist, you need to be comfortable working in a Unix-like environment and leveraging Unix tools. The cloud server we will provide runs a Jupyter server via JupyterHub on a Debian-based Linux distribution.
When developing software, it's crucial to keep track of the changes to the files that make up the project (the codebase). In short, developers need version control software. Git is the most commonly used version control system. With Git, you can better manage your code: you can easily try new ideas and discard those changes if they don't work out. It also makes collaborating with others easier.
Action Items
Data scientists work with web APIs all the time; we'll cover what they are and how to use them in the Fellowship program. However, knowing how communication across the web works will help when using web APIs.
A common task for a data scientist is to extract data from messy text or HTML. During the program, we will go over web scraping in Python with Beautiful Soup, but in order for you to properly wield such a tool, you need to understand the basics of HTML.
A regular expression is a sequence of characters that defines a search pattern. With such a pattern, you can perform text processing operations like find and replace. In Python, regular expressions are available through the `re` module. There's a great interactive online tutorial that teaches you how to use them. For testing your regular expressions, use RegEx101: it's a free regex debugger, complete with real-time explanations, error detection, and analysis. (Just be certain to choose "Python" as the "flavor" in the menu on the left.)
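A short sketch with the `re` module (the sample string is made up, and the email pattern is deliberately simple, not RFC-complete):

```python
import re

text = "Contact us at info@example.com or sales@example.org."

# findall: collect every token matching an email-like pattern
emails = re.findall(r"[\w.+-]+@[\w-]+\.[A-Za-z]+", text)
print(emails)  # ['info@example.com', 'sales@example.org']

# sub: find-and-replace, here masking the local part of each address
masked = re.sub(r"[\w.+-]+@", "***@", text)
print(masked)  # Contact us at ***@example.com or ***@example.org.
```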
Resources
Action Item
Binder is a great, and free, service that makes it easy to share Jupyter notebooks for others to both view and run. A Jupyter notebook is an interactive document that can contain code, text, and visualizations. Data scientists often use Jupyter notebooks to document and showcase their work. In fact, our curriculum is built using Jupyter notebooks and hosted with JupyterHub.
Since we'll be using Jupyter notebooks, you should spend some time familiarizing yourself with how they look and work. Launch an instance of JupyterHub here to go through the basics of interacting with Jupyter notebooks.
Milestone Project

For the milestone project, you will host a Git repository on GitHub that is capable of launching an instance of JupyterHub hosting a notebook file containing your work from the pandas (Day 4) and plotting (Day 5) action items. Follow these steps:

`requirements.txt`: This file lists the third-party Python packages required to run the code in your notebooks. We already have a few packages listed, but you may need to include at least one more. While version numbers are not strictly necessary, pinning them ensures that the specified version of each package is installed; by ensuring the exact version, we can be more certain that the code will run and the results are reproducible. You can read more about the `requirements.txt` file here.

`README.md`: If you look at the raw `README.md` file, you will see that the "launch binder" link points to the original repo and not your copy. Make sure to update your `README.md` with the link to your fork. Once updated, the link in your `README.md` will take you to a working instance of JupyterHub with the notebooks and files of your repo. Make sure to run your notebook to check that everything works.

As a data scientist, you will often have to work with databases, both reading from and writing to them. SQL is the language used to communicate with relational databases, i.e., data arranged in tabular form (rows and columns). Feedback from some of our hiring partners indicates that knowledge of more advanced SQL queries is important for interviewing.
During the program, we will spend time going over the syntax of SQL, advanced SQL features, and interfacing with relational databases with Python. However, this material will be easier to digest if you come in with some experience with SQL.
Working with SQL on your computer normally requires installing a Relational Database Management System (RDBMS). Installation and setup can be a little painful, and you will then need to set up a database.
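One low-friction way to practice in the meantime: Python ships with the `sqlite3` module, which gives you a small SQL database with zero installation. A sketch using an in-memory database (the table and values are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
cur = conn.cursor()

cur.execute("CREATE TABLE sales (region TEXT, amount REAL)")
cur.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 100.0), ("west", 250.0), ("east", 50.0)],
)

# Aggregate with GROUP BY, ordered by the aggregated total
cur.execute(
    "SELECT region, SUM(amount) FROM sales "
    "GROUP BY region ORDER BY SUM(amount) DESC"
)
rows = cur.fetchall()
print(rows)  # [('west', 250.0), ('east', 150.0)]

conn.close()
```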
Resources
Action Items
Algorithms and data structures are important foundational knowledge for programming. While we can get away without knowing these concepts when we start to learn how to program, eventually we'll need a better understanding to become effective programmers. You should not think of the following materials as necessary to complete in the next two days, but as a starting point. Over the course of the program, if you put in the time, you'll be able to converse fluently about the fundamental topics of computer science.
Resources
There's a great interactive Python-based tutorial on data structures and algorithms available online. From Basic Data Structures, understand:
Many algorithms are implemented by breaking a problem down into smaller and simpler subproblems. You'll need to understand recursion and dynamic programming. Work through the Recursion chapter.
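The classic illustration is the Fibonacci sequence: naive recursion recomputes the same subproblems exponentially many times, while memoization (a basic dynamic programming idea) caches each answer so it is computed only once. A minimal sketch using the standard library:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def fib(n: int) -> int:
    """Return the n-th Fibonacci number, caching each subproblem."""
    if n < 2:
        return n  # base cases: fib(0) = 0, fib(1) = 1
    return fib(n - 1) + fib(n - 2)

print(fib(10))   # 55
print(fib(100))  # returns instantly, thanks to the cache
```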
Action Items
Attention
The material here and in the next section covers exactly the kinds of questions that tend to come up in programming job interviews for data science/engineering positions, so this is good practice for finding a job. Consider doing these exercises with a timer, since timed practice will really help you prepare.
With an understanding of fundamental data structures and recursion, we can move on to more complicated data structures that solve a wider range of problems. As you're thinking about these topics and the ones from yesterday, keep time complexity in mind. Interviewers very commonly ask about the running time of an algorithm or a proposed solution. You can find more on Wikipedia and use this cheatsheet as a reference. The following are some additional topics you need to be familiar with.
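As a concrete sketch of why complexity matters, a membership test scans a `list` element by element (O(n)) but hashes straight to the answer in a `set` (O(1) on average); the sizes and target below are arbitrary:

```python
import timeit

items_list = list(range(100_000))
items_set = set(items_list)

target = 99_999  # worst case for the list: the element is at the very end

# Time 100 membership tests against each container
t_list = timeit.timeit(lambda: target in items_list, number=100)
t_set = timeit.timeit(lambda: target in items_set, number=100)

# For inputs this size, the list lookup is typically orders of magnitude slower
print(t_list, t_set)
```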
Resources
From Sorting and Searching, understand:
From Trees and Tree Algorithms, understand:
From Graphs and Graph Algorithms, understand:
Action Item
At this point, you should learn or brush up on the basics of statistics and probability. There is a plethora of books out there on statistics and probability. However, Think Stats by Allen Downey is good, free, and written for Python programmers.
Resources
From Think Stats, understand:

Attention
The material here covers exactly the kinds of questions that tend to come up in job interviews for data science/engineering positions, so this is good practice for finding a job. Again, consider doing these exercises with a timer.
The following project is optional. If you have extra time before the program starts, you may want to attempt this project. Further, you should consult this resource if you want to create a web application for your capstone.
The power of the web comes from building dynamic websites, which require a web server. For your capstone, you may wish to create a web application, which will enable user interactivity or provide another way to showcase your work. If you follow this route for your capstone, your application may consist of three main parts:
Among a number of "frameworks" for building web servers, Python's Flask is probably the easiest to get started with. Recently, however, libraries like Streamlit have made it possible to build simple applications without handling the web framework directly. They simplify the development cycle but offer limited functionality and customization compared to Flask.
Extra Project
Fork this Streamlit Demo repository and create your own Streamlit app on Render.com that accepts a stock ticker input from the user and plots closing price data for the last 100 days. If you want more room for customization, you can instead fork this repository to create the app using Flask.
You may use any data source you like for this project. One example is the Alpha Vantage API, which provides this data for free (you just need to request a free API key). For security reasons, you should avoid putting API keys (or any credentials) directly in your code, especially when it will be published in a public repository. One solution is to use the python-dotenv package for local development and Render.com's methods for inserting a `.env` file that contains "secrets", or you may define environment variables for production.
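A sketch of reading a credential from the environment instead of hard-coding it (the variable name `ALPHA_VANTAGE_KEY` and its value here are just illustrations; with python-dotenv you would call `load_dotenv()` first to populate the environment from a local `.env` file kept out of version control):

```python
import os

# Simulate the platform injecting a secret. In real use, Render (or a
# .env file read by python-dotenv) sets this outside your source code.
os.environ.setdefault("ALPHA_VANTAGE_KEY", "demo-key-for-illustration")

# Your code then reads the secret from the environment at runtime
api_key = os.environ["ALPHA_VANTAGE_KEY"]
print(bool(api_key))  # True
```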
You can use Python's `requests` package to access the Alpha Vantage API. This guide demonstrates Alpha Vantage and several other providers, and gives tips on best practices for dealing with stock market data in general. You can analyze the data using pandas and plot it using Altair, Bokeh, or Plotly. By the end, you should have some kind of visualization viewable from the internet.
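A sketch of how such a call might be assembled with `requests` (the key is a placeholder, and the query parameters are an assumption based on Alpha Vantage's daily time series endpoint; the request is only prepared, not sent, so no network access or real key is needed):

```python
import requests

params = {
    "function": "TIME_SERIES_DAILY",
    "symbol": "IBM",
    "apikey": "YOUR_API_KEY",  # placeholder: load the real key from the environment
}

# Prepare the request without sending it, so we can inspect the final URL
req = requests.Request("GET", "https://www.alphavantage.co/query", params=params)
prepared = req.prepare()
print(prepared.url)

# To actually fetch the data:
#   resp = requests.get("https://www.alphavantage.co/query", params=params)
#   data = resp.json()
```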
Render.com is a popular cloud application platform, and their documentation is a great resource for understanding how to deploy a simple (and free!) web service. Follow this tutorial on how to use Render to launch a Python web application using Flask. You can connect a GitHub account to your Render account, or you should be able to deploy any publicly available repository as a Render web application. Check our Streamlit repository for a few comments about configuring your Streamlit deployment on Render, as you need to give Render the proper Streamlit command to successfully run your application.
To get started with Streamlit, we recommend working through the examples in the official documentation. For a more in-depth look, try this tutorial. Finally, this site showcases a wide range of example apps that may provide further inspiration.
For more information about Flask, here is a good starting point, especially the links in the "Starting Off" section.
When you finish, you should end up with a project that looks like this if you are using Flask. (Note: it might take a minute or two for this application to start up.) If you go with Streamlit, your project might look something like this.
You may do your development on your local machine or on your remote server, but accessing the running Flask/Streamlit server on the remote system is difficult. To develop locally, you will need a Python distribution with the relevant packages. We highly recommend using a package and environment manager; unfortunately, there are several in the Python ecosystem. Since this is a fairly simple project, you can just use the standard `venv` and `pip`.
Is your app taking a while to start on Render? The first time you deploy your app, Render needs to build the image and start the web service, which can take up to about ten minutes for the whole deployment process. Subsequent visits to your app should load more quickly, but Render will shut down the server if it sits idle.