- What Version Control Systems do and why they are awesome
- How to install
- How to work on your Terminal (command line / bash / shell / console).
- How to install Git
- How to pull and push code to and from GitHub
Perhaps the below is familiar to you…
… or this …
Version control systems (VCS) are software tools meant to help programmers collaborate, maintain source code, document changes, and generally keep track of their files. Instead of reading, sit back and watch this video.
Having a working knowledge of a version control system will allow you to better organize and track your own files, as well as collaborate on teams. Though there are lots of different systems out there, the most popular version control system is Git.
Git is software for version control, i.e., tracking changes to sets of files. It is a very commonly used tool in programming, computer science, data science, and software development in general. For anyone working in data science, knowing git is a must.
But what is it, exactly?
Git is a system for tracking, organizing, sharing, and reviewing changes to software. It’s very flexible, and very powerful. Learning it can sometimes feel a bit overwhelming, because it has so many features and capabilities. But the Pareto principle (i.e., the 80/20 rule) applies here: you can learn most of what you need to handle git competently in very little time.
Why git? Why not just save files with meaningful names, make changes to them, overwrite the old changes, etc.? Why not just treat code the same way we treat an MS Word document, or write code collaboratively using interactive, auto-saving tools like Google Docs?
There are a number of reasons:
First, writing code is not like writing a paper. If you make a mistake in the introduction of a term paper, it doesn’t “break” your conclusion. But with code, minor changes to one line of code can have a very large impact on how other parts of that code work. Therefore, tracking minor changes is essential to recovering from errors and managing complex, interdependent software components.
Second, collaborating on code is more complex than collaborating on a term paper. To combine (merge) one person’s work with another often requires very careful review. Git optimizes for this.
Third, code is rarely “done”. It’s usually a work in progress. Git takes this into account, and is set up for very structured checkpoints (commits), change suggestions (pull requests), etc.
Finally, git is the “lingua franca” of version control. Employers often request to see a prospective employee’s GitHub profile, and expect that programmers and developers of all types (including data scientists) be proficient in git.
It’s time to git started. Rather than diving into too much theory, let’s skip right to the practice. We’ll do first, then we will try to understand later. This comic knows what we mean:
To use git, we first need to get comfortable working in our computer’s Terminal. The Terminal goes by many names, depending on your operating system: shell, bash, Git bash, console, command line, command prompt, etc. For simplicity, we are going to use just one name: Terminal.
The Terminal is a way for you to interact with your computer using lines of commands (i.e., command line) instead of pointing or clicking. Just as in the R Console, in your Terminal you can type code and the computer will do things.
This will feel a bit unfamiliar at first, but fear not – there are just a few commands you will need to learn.
On Windows, to open Terminal, go to the search box in the Start Menu. Type cmd and hit Enter. A Terminal “shell” window will appear.
On a Mac, to open Terminal, open the Mac Search (Command + Spacebar) and search “Terminal”. Press Enter. A Terminal “shell” window will appear:
Let’s start with pwd. This stands for “print working directory”. Type it, press enter, and see what happens.
If everything went to plan, you’ll see a path (i.e., a location within your computer’s file structure). “Working directory” means the folder you are “in” right now, just like in R.
When you typed pwd you were asking your computer which directory (folder) you’re in. It responded. Great! Now let’s see what else is in the folder where you are.
Next, type ls. This stands for “list”, as in “list all the files in this directory”. ls is commonly used to quickly see the contents of a folder.
You now know how to (a) see where you are and (b) see what else is there.
But what about navigation? Let’s say you want to “move around” (i.e., go to other folders). You do this using the cd command, which stands for “change directory”.
Navigating into folders:
After typing cd, you should type a space and then the name of the directory you want to navigate to. This directory can be given as either (a) a relative path or (b) a full path. “Relative” means relative to where you are. So, for example, if you are currently in the following folder:
/home/abc/folder_1
And within that folder you have the following sub-folders:
folder_a folder_b folder_c
You could navigate to one of them by typing the full path:
cd /home/abc/folder_1/folder_a
Or by typing the relative path:
cd folder_a
Both of the last two commands have the same result: changing your working directory to /home/abc/folder_1/folder_a. You can confirm that using pwd.
Navigating out of folders:
What if you want to navigate “up” to the /home/abc folder? Again, you can do this using relative or full paths. To navigate there with the full path, simply type:
cd /home/abc
To go “up” one level using a relative path, use:
cd ..
So far so good. Now you can (a) see where you are with pwd, (b) list folder contents with ls and (c) navigate with cd. Those three commands will cover 80% of what you’ll need to do to use git from the command line.
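The three commands can be tried together in a safe place. The sketch below is only an illustration: it builds the example folders inside a temporary directory (an assumption, so that nothing in your own folders changes) and then walks through them with pwd, ls and cd.

```shell
# Build a throwaway folder tree that mirrors the example above.
# The temporary parent directory is an assumption made for this sketch.
base_dir="$(mktemp -d)"
mkdir -p "$base_dir/folder_1/folder_a" "$base_dir/folder_1/folder_b" "$base_dir/folder_1/folder_c"

cd "$base_dir/folder_1"   # navigate with a full path
pwd                       # where am I?
ls                        # what is in here?

cd folder_a               # navigate "down" with a relative path
pwd

cd ..                     # navigate "up" one level with a relative path
pwd
```

The last pwd should report the folder_1 directory again, since cd .. undoes the step into folder_a.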
Go here and follow the instructions for your operating system. We have also offered details for Windows and Mac below.
On Windows, once you’ve downloaded the installer, you’ll want to select the below parameters:
The program will prompt you to pick a text editor. Select Notepad.
For all other options, leave settings as default and click “next”. The program will now install.
After installation, there is an option to “Launch Git Bash” (Terminal). Click it.
This will open a Terminal window:
To install Git on your Mac, you will be typing commands into your computer’s Terminal.
Open your Terminal.
Next we need to download and install a software management tool called Homebrew. Copy the install command from the Homebrew homepage (https://brew.sh), paste it into your Terminal, and press Enter.
You will likely be asked for your computer password. Press Enter when instructed to do so. Downloading Homebrew might take a few minutes.
Now that Homebrew is installed, copy and paste the following command into your Terminal, and press Enter:
brew install git
After a minute or two, git should be downloaded, installed and ready to go.
Once you’ve installed git, you’ll likely want to configure some options and preferences. Go here and walk through the steps.
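The most common configuration step is telling git who you are. Here is a minimal sketch, assuming a scratch repository in a temporary folder so that your real settings stay untouched; the name and email are placeholders, not values from this course.

```shell
# Scratch repository so this sketch does not change your real settings.
demo_repo="$(mktemp -d)"
git init --quiet "$demo_repo"
cd "$demo_repo"

# Placeholder identity (assumption): replace with your own name and email.
# Writing "git config --global ..." instead applies the setting to every
# repository on your machine rather than only to this one.
git config user.name "Your Name"
git config user.email "you@example.com"

# Read the values back to confirm they were stored.
git config user.name
git config user.email
```

On your own machine you would normally use the --global form once, with the same email address you plan to use for GitHub.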
Now you’ve got git. Great! Git is often used in conjunction with a third party, web-based platform. The most popular is GitHub.
Go to www.github.com and create a user account. Make sure you use the same email address you used when configuring git.
Once you’ve created an account and logged in to GitHub, let’s create a repository.
What is a repository? Basically, it’s a coding project in the form of a folder.
Having logged into GitHub, click the “plus” icon in the upper right and then click “New repository” (or go directly here).
You can fill in the “Repository name” field with whatever you’d like. For the following examples, we’ll use the word “testrepo”.
Fill in the “Description” field with the words “My first git repository”.
Set the repo as “Public” (unless you plan on putting any secrets here!), and then click the “Add a README file” checkbox. Finally, click “Create repository”.
Cool! You’ve now created your first git repository. Its public URL is https://github.com/<YOUR USERNAME>/<repo-name>. Others can see your code there, and you can too.
Your new repo exists on the internet, but does not yet exist on your local machine. In order to get “testrepo” on your computer, you’ll need to do something that you’ll only do once: “clone” the repo. “Cloning” in git-speak means creating a copy of the repository locally.
To clone, you’ll first open your terminal window and cd into a directory where you’d like to clone your repo. For example, if you want to put your testrepo directory into ~/Documents, you’ll type the below into terminal:
cd ~/Documents
You can confirm that you are in ~/Documents by typing pwd. There? Good.
Now, we’ll write the code to “clone” the repo:
git clone https://github.com/<USERNAME>/<repo-name>
Now, you’ve got a folder on your machine named testrepo. You can confirm this by writing ls in the terminal. testrepo there? Great!
In your local testrepo folder, you have a “cloned” copy of the repository at https://github.com/<YOUR USERNAME>/<repo-name>.
As of now, it’s a pretty uninteresting folder. The only thing in it is a file named README.md. A “README” file is generally an explanation of a repository’s content, purpose, etc. Like all files, a README can be tracked in git.
Let’s open the README file and make a change to it. We’ll add the below line of “code”:
This is my first git repository.
Then save and close the README file.
Now, let’s ask git if it noticed that we made any changes. Type the below into terminal:
git status
If everything has gone well until now, git will reply by telling us that we’ve made a change to the file. We can ask git what change we made by typing:
git diff README.md
diff stands for “difference”, as in “what is the difference between the code I had and the code I have now?” Git will spit back some color-coded lines showing the change you’ve made.
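To see what this looks like without touching your real repo, here is a small sketch. It assumes a scratch repository in a temporary folder with a placeholder identity and a one-line starting README.md; only the added line comes from the text above.

```shell
# Scratch repository with one committed README (all names are placeholders).
diff_repo="$(mktemp -d)"
git init --quiet "$diff_repo"
cd "$diff_repo"
git config user.name "Your Name"
git config user.email "you@example.com"
echo "# testrepo" > README.md
git add README.md
git commit --quiet -m "initial README"

# Make the same kind of edit described above.
echo "This is my first git repository." >> README.md

git status --short    # README.md is listed as modified
git diff README.md    # the new line appears with a leading "+"
```

Lines that start with "+" in the diff were added; lines that start with "-" were removed.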
Satisfied with your change? Great, now it’s time to confirm it by doing the following:
git add README.md
This tells git that we want it to notice and track the change we made to README.md. Next, type:
git commit -m 'my first change'
This tells git that we are “committing” our change, i.e., marking a checkpoint (to which we can revert later). The -m flag is followed by a message in quotations which will help us to find this checkpoint in the future.
Almost there. Now that we’ve added and committed, we need to “push” our change to the remote repository (GitHub), by running:
git push
You did it! Go to https://github.com/<YOUR USERNAME>/<repo-name> and open the README.md file. You’ll notice that your most recent changes are there. Now, if someone else wants to get your code, they can “clone” your repository, and they’ll have the code you’ve “pushed” there.
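The whole add, commit, push cycle can be rehearsed offline. The sketch below is an illustration under stated assumptions: a bare repository in a temporary folder stands in for GitHub, the identity is a placeholder, and README.md is created locally because the stand-in remote starts out empty.

```shell
work_root="$(mktemp -d)"

# Stand-in for GitHub (assumption): a bare repository in a temporary folder.
git init --quiet --bare "$work_root/remote.git"

# Clone it, just as you would clone from a GitHub URL.
git clone --quiet "$work_root/remote.git" "$work_root/testrepo"
cd "$work_root/testrepo"
git config user.name "Your Name"          # placeholder identity
git config user.email "you@example.com"

# Make a change, then add, commit and push it.
echo "This is my first git repository." >> README.md
git status --short                         # README.md shows up as a new (untracked) file
git add README.md
git commit --quiet -m 'my first change'
git push --quiet origin HEAD               # send the current branch to the remote

git log --oneline                          # one checkpoint, labelled with our message
```

With a repo cloned from GitHub, a plain git push is usually enough, because the clone already knows which remote branch to push to.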
We get it: git can be daunting and a bit confusing. This diagram tries to summarize the way git works:
There are three workspaces to keep in mind: the remote version of your repo, the local version of it, and your work area. You interact with your remote repo on the GitHub website, your local repo through Terminal, and your work area through RStudio.
You sync your remote and local repos using the git pull and git push commands in Terminal. When you pull, you are updating your local repo using the remote repo. When you push, you do the opposite: update the remote repo using your local repo.
You sync your local repo and your work area using the git add and git commit commands. You use these commands when you are ready to move files off of your work bench (RStudio) and back onto the shelf (local repo): you’ve reached a stopping point and want to stop working on your code for now.
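To make the pull direction concrete, here is a sketch with two local copies of one repo. Everything in it is an assumption made for illustration: the bare repository stands in for GitHub, and the folder names laptop and desktop, the file notes.txt and the identity are hypothetical.

```shell
sync_root="$(mktemp -d)"
git init --quiet --bare "$sync_root/remote.git"   # stand-in for GitHub (assumption)

# First copy: make a commit and push it to the remote.
git clone --quiet "$sync_root/remote.git" "$sync_root/laptop"
cd "$sync_root/laptop"
git config user.name "Your Name"
git config user.email "you@example.com"
echo "first line" > notes.txt
git add notes.txt
git commit --quiet -m "add notes"
git push --quiet origin HEAD

# Second copy: cloned now, so it starts with that first commit.
git clone --quiet "$sync_root/remote.git" "$sync_root/desktop"

# Back in the first copy: commit and push another change.
echo "second line" >> notes.txt
git add notes.txt
git commit --quiet -m "extend notes"
git push --quiet origin HEAD

# The second copy is now behind the remote; pull brings it up to date.
cd "$sync_root/desktop"
git pull --quiet
cat notes.txt
```

After the pull, the second copy contains both lines, which is the "update the local repo using the remote repo" step described above.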
Setting up a new repo:
1. Start a new remote repo (GitHub)
2. Navigate to the local folder into which you want to clone your repo (Terminal, using a combination of pwd, ls, and cd)
3. Clone the remote repo to your local machine (Terminal: git clone https://github.com/<your username>/<repo name>)
Working in a repo:
4. cd into your repo (Terminal)
5. Pull from your remote repo to make sure your local repo contains the latest version of your project (Terminal: git pull)
6. Make changes to your code (revisions, new files, etc.) (RStudio)
7. Stage those revised/new files for a commit (Terminal: git add <filename>)
8. Commit your changes (Terminal: git commit -m "add specific message here")
9. Push your changes to the remote repo (Terminal: git push)
10. Check out your remote repo to verify that changes were pushed (GitHub)
Create another repo:
Let’s face it: testrepo is a pretty lame name for a repository. How about we make a repo that’s actually real and useful? We’ll make one for storing all the code we’re writing in this course.
1) Go to https://github.com/
2) Click the “plus” icon in the upper right and then click “New repository” (or go directly here).
3) Now, for “Repository name”, write “datalab”.
4) Fill in the “Description” field with the words “Code I wrote during my intro to data science course”.
5) Set the repo as “Public” (unless you plan on putting any secrets here!), and then click the “Add a README file” checkbox. Finally, click “Create repository”.
6) Clone the repo to your computer (Documents folder).
7) Create a new R script in your
datalab repository called
8) Create a histogram of life expectancy in 1982.
9) Create a line plot for population in Asia, colored by country. Make the lines a bit thicker and more transparent.
10) Add new x and y axis labels, as well as a chart title.
11) Create a bar chart of all European countries’ GDP per capita in 2002.
12) Make the bars transparent and filled with the color blue.
13) Create a new data set called
the_nineties that only contains years from the 1990s.
14) Save this dataset to your repository (use
15) Add, commit, and push your files to GitHub.
If you cd into your local datalab repo and then type git status, you might note that it’s a bit “busy”. That is, there are a lot of documents there! You’re going to want to (i) add, (ii) commit, and (iii) push these documents, but perhaps there are some kinds of documents you don’t want to push.
For example, maybe you want to push R code files (.R), but not data files (.csv). In this case, you can explicitly tell git that you don’t want it to pay any attention to .csv files by creating a .gitignore file. A .gitignore file is simply a text file in a git repository that lists the files (or patterns of file names) that git should ignore.
Let’s do it.
16) First, create an empty
.gitignore file. In
17) Then, open the
.gitignore file in RStudio.
18) Finally, add the following line to it:
*.csv
The star is a “wildcard”, meaning that it stands in place of anything (here, any file name). With this in your repo, git now knows to ignore anything that ends with the extension .csv.
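You can check that the rule works before pushing anything. This is a minimal sketch, assuming a scratch repository in a temporary folder; the file names analysis.R and the_nineties.csv are hypothetical stand-ins for your own script and data file.

```shell
ignore_repo="$(mktemp -d)"
git init --quiet "$ignore_repo"
cd "$ignore_repo"

# Two hypothetical files: an R script we want to track and a data file we do not.
echo 'print("hello")' > analysis.R
echo 'year,value' > the_nineties.csv

# Tell git to ignore every file whose name ends in .csv.
echo '*.csv' > .gitignore

git status --short                   # lists .gitignore and analysis.R, but no .csv file
git check-ignore the_nineties.csv    # prints the file name because a rule matches it
```

git check-ignore prints nothing for files that no rule matches, so it is a quick way to test a pattern.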
19) Push everything to your repo.
Now you can share your code with others, and your future self.