www.datascience.pizza
Module 1 Welcome!
What this is, and what it isn’t
This is not a textbook or an encyclopedia. This is not a reference manual. It is not exhaustive or comprehensive. We barely mention statistical tests at all. There is almost no theory. In fact, this curriculum is biased, non-representative, and incomplete – by design.
So what is this? This guide is an accelerator, an incubator designed to guide you along the most direct path from your first line of code to becoming a capable data scientist. Our goal is to help you through the most dangerous period in your data science education: your very first steps. The first three weeks. That is when 99% percent of people give up on learning to code.
But it doesn’t need to be this way. We made this book to reach more than just the 1%.
We have based our approach on three core premises:
Premise 1: We learn best by doing. Our goal is to get you doing data science. We will keep theory and detail to a minimum. We will give you the absolute basics, then offer you exercises and puzzles that motivate you to learn the rest. Then, once you’ve been doing data science for a bit, you soon begin thinking like a data scientist. By that, we mean tackling ambiguous problems with persistence, independence, and creative problem solving.
Premise 2: We learn best with purpose. Once you gain comfort with the basic skills, you will be able to start working on real data, for real projects, with real impact. You will start to care about what you are coding. And that is when the learning curve skyrockets – because you are motivated, and because you are learning reactively, instead of preemptively. Our goal is to get you to the point of take-off as quickly as possible.
Premise 3: A simple toolbox is all you need to build a house. Once you become comfortable with a few basic coding tools, you can build pretty much anything. The toolbox doesn’t need to be that big; if you know how to use your tools well, and if you have enough building supplies (i.e., data), the possibilities are limitless.
One more thing that this is not: This is not a fancy interactive tutorial with bells or whistles. We purposefully designed this to be simple and “analog”. You will not be typing your code into this website and getting feedback from a robot, or setting up an account to track your progress, or getting pretty merit badges or points when you complete each module.
Instead, you will be doing your work on your own machine, working with real folders and files, downloading data and moving it around, etc. – all the things you will be doing as a real data scientist in the real world.
Who this is for
This curriculum covers everything from the absolute basics of writing code in R
to machine learning. As such, it is designed to be useful to everyone in some way. But the target audience for these tutorials is the rookie: the student who wants to work with data but has zero formal training in programming, computer science, or statistics.
This curriculum was originally developed for the DataLab at Sewanee: The University of the South, TN, USA.
What you will learn
- The Core theory unit establishes the conceptual foundations and motivations for this work: what data science is, why it matters, and ethical issues surrounding it: the good, the bad, and the ugly. Don’t slog through this all at once. Sprinkle it in here and there. The most important thing, at first, is to start writing code.
The next several units comprise a core curriculum for tackling data science problems:
The Getting started unit teaches you how to use
R
(inRStudio
). Here you will add the first and most important tools to your toolbox: working with variables, vectors, dataframes, scripts, and file directories.The Basic
R
workflow unit teaches you how to bring in your own data and work with it inR
. You will learn how to produce beautiful plots and how to reorganize, filter, and summarize your datasets. You will also learn how to conduct basic statistics, from exploratory data analyses (e.g., producing and comparing distributions) to significance testing.
For these first two units, we encourage you to take on these modules one at a time, in the exact order they are presented: we put a lot of thought into what we included in these modules (and what we did not).
The Review exercises unit provides various puzzles that allow you to apply the basic
R
skills from the previous unit to fun questions and scenarios. In each of these exercises, questions are arranged in increasing order of difficulty, so that beginners will not feel stuck right out of the gate, nor will experienced coders become bored. This is where you really begin to cut your teeth on real-world data puzzles: figuring out how to use theR
tools in your toolbag to tackle an ambiguous problem and deliver an excellent data product.The Reproducible research unit equips you with basic tools needed for truly reproducible data science: documenting your research and code with
Markdown
; weaving together your code and your reporting withRMarkdown
; allowing users to explore the data themselves with an interactiveShiny
dashboard or web app; and sharing your code and tracking versions of your code usingGit
.The Presenting research unit teaches you how to produce well-organized and well-written research reports, and how to deliver compelling presentations about your work.
The final unit, Deep
R
, introduces you to a variety of more involvedR
tools and advanced data science techniques, from writing custom functions andfor
loops to producing interactive maps, iterative simulations, and machine learning algorithms. These modules are designed to be used as needed, in whatever order is most helpful for you in your own work.
Contributors
Eric Keen is a data scientist, marine ecologist, and educator. He is the Science Co-director at BCwhales, a research biologist at Marecotel, a data scientist at Hyfe, and a professor of Environmental Studies at Sewanee: the University of the South. He earned his BA at Sewanee (2008) and his PhD at Scripps Institution of Oceanography (2017). His research focuses on the ecology and conservation of whales in developing coastal habitats. He is passionate about whales, conservation, teaching, small-scale farming, running, and bicycles. And pizza.
Joe Brew is a data scientist, epidemiologist, and economist. He has worked with the Florida Department of Health (USA), the Chicago Department of Public Health (USA), the Barcelona Public Health Agency (Spain), the Tigray Regional Health Bureau (Ethiopia) and the Manhiça Health Research Center (Mozambique). He is a co-founder of Hyfe and DataBrew. His research focuses on the economics of malaria and its elimination. He earned his BA at Sewanee: The University of the South (2008), an MA at the Institut Catholique de Paris (2009) and an MPH at the Kobenhavns Universitet (2013). He is passionate about international development, infectious disease surveillance, teaching, running, and pizza.
Ben Brew is a data scientist, economist, and health sciences researcher. In addition to co-founding DataBrew, he has spent most of the last few years working with SickKids Hospital in Ontario on machine learning applications for cancer genomics. He earned his BA at Sewanee: The University of the South (2012), and a Master’s in Mathetical Models for Economics from the Paris School of Economics (Paris I) (2015). He is passionate about econometrics, applied machine learning, and cycling.
Matthew Rudd is a mathematician fascinated by statistical modeling, data analysis, and the tensions between theory, practice, and interpretability in data science. He has been using R
and RStudio
for years in his teaching and research as a professor at Sewanee, has experience with software development and web applications, and enjoys learning new tools and technologies. He welcomes opportunities to apply his skills and experience to practical problems.
Instructor tip:
Here are a few general strategies that should work with nearly every module in this curriculum:
(1) Use the module as your lecture notes. Read them on your own screen, or from the printed version of this book, as you manually type each command into a blank Rstudio
window on the screen you are sharing with the students. The R script you make as you teach becomes a record of your lecture notes.
(2) Ask that the students follow along with you, writing the code into Rstudio
themselves on their own computers.
(3) Once you introduce the concept of R
scripts in a few modules, begin each class by opening up a new script and saving it with a simple name that reflects the module. Alternatively, ask students to do this before class begins, so that you can rock and roll immediately.
(4) We recommend setting up a Slack channel for your class. Have Slack open during instruction, so that students can copy and paste errors into a public channel. This will make it easy to address questions, and it will open up the possibility that students will help those who are stuck on their own.
(5) Plan for (i) very little time spent on teaching a module, and (ii) a large amount of time reserved for working through the exercises at the end of each module. When you transition from teaching to introducing the exercises, share the link to the module so that students can reference the module as they work.
(6) If there is time, once students have the chance to tackle the exercises on their own, work through the puzzles together as a group.