Getting started with Git - Dave Tang's blog

Git is a distributed version control and source code management (SCM) system with an emphasis on speed. What's version control? Version control is a system that records changes to a file or a set of files over time so that you can recall specific versions later. Here's an example: check out this tweet and the corresponding replies. It was a tweet regarding this scientist. If you read the latest version of the article there's nothing flamboyant (as stated in the tweet) about it because it has been edited since that tweet. However, if you wanted to see "the most glowing Wikipedia article written about any scientist", you can click view history on the article page and look at previous versions of the article. For example, through version control you can access this older version of the article; one that's definitely flashier than the current one.

Anyway, back to the topic. Git is a Distributed Version Control System (DVCS), which means that clients don’t just check out the latest snapshot of the files: they fully mirror the repository. So what makes Git different from other version control systems? Quoting this guide:

The major difference between Git and any other Version Control Systems (VCS) is the way Git thinks about its data. Conceptually, most other systems store information as a list of file-based changes. These systems think of the information they keep as a set of files and the changes made to each file over time. Git doesn't think of or store its data this way. Instead, Git thinks of its data more like a set of snapshots of a mini file-system. Every time you commit, or save the state of your project in Git, it basically takes a picture of what all your files look like at that moment and stores a reference to that snapshot. To be efficient, if files have not changed, Git doesn't store the file again - just a link to the previous identical file it has already stored.
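One way to see the "identical files are only stored once" idea is that Git names each stored file by a hash of its contents, so two files with the same contents end up with the same ID. Here is a minimal sketch, assuming Git is installed; the file names first.txt and second.txt are made up for illustration and the temporary directory is generated on the fly.

```shell
# Work in a throwaway directory (generated here, not part of the post's session).
demo=$(mktemp -d)
cd "$demo"

# Two files with different names but identical contents.
echo "Hello world" > first.txt
cp first.txt second.txt

# git hash-object prints the ID Git would give each file's contents.
first_id=$(git hash-object first.txt)
second_id=$(git hash-object second.txt)

echo "first.txt:  $first_id"
echo "second.txt: $second_id"
```

Both lines should show the same ID, which is why an unchanged file costs nothing extra in the next snapshot.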

In Git, there are three main states that a file can reside in: committed, modified, and staged. Committed means that the data is safely stored in your local database. Modified means that you have changed the file but have not committed it to your database yet. Staged means that you have marked a modified file in its current version to go into your next commit snapshot.

The Git directory is where Git stores the metadata and object database for your project. This is the most important part of Git, and it is what is copied when you clone a repository from another computer.

The working directory is a single checkout of one version of the project. These files are pulled out of the compressed database in the Git directory and placed on disk for you to use or modify. The staging area is a simple file, generally contained in your Git directory, that stores information about what will go into your next commit. It’s sometimes referred to as the index, but it’s becoming standard to refer to it as the staging area.

The basic Git workflow goes something like this:

You modify files in your working directory.
You stage the files, adding snapshots of them to your staging area.
You do a commit, which takes the files as they are in the staging area and stores that snapshot permanently to your Git directory.

If a particular version of a file is in the Git directory, it’s considered committed. If it’s modified but has been added to the staging area, it is staged. And if it was changed since it was checked out but has not been staged, it is modified.
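The states can be seen side by side with the short form of git status. This is a minimal sketch, assuming Git is installed; the file names notes.txt and todo.txt and the "Example User" identity are placeholders and not part of the session in this post.

```shell
# Throwaway repository for the demonstration.
demo=$(mktemp -d)
cd "$demo"
git init -q

# Committed: notes.txt is saved in the Git directory.
echo "first draft" > notes.txt
git add notes.txt
git -c user.name="Example User" -c user.email="user@example.com" commit -q -m "Add notes.txt"

# Modified: notes.txt has changed since the commit but is not staged.
echo "second draft" >> notes.txt

# Staged: todo.txt has been added to the staging area.
echo "buy milk" > todo.txt
git add todo.txt

# Short status: first column is the staging area, second is the working directory.
status_output=$(git status --short)
echo "$status_output"
```

notes.txt should be listed with an M in the second column (modified, not staged) and todo.txt with an A in the first column (staged). A committed file with no changes does not appear at all.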

Let's try it out

So a while ago I found this course on Code School for testing out Git. I'm going to follow the Try Git course but on my own computer. At the moment I'm working on my Windows laptop, but it should be all the same. Also please ignore the syntax colouring for the git code; I chose to colour the syntax by shell code (since I didn't have a Git code option), so the colouring doesn't make any sense.

I'll start by creating a new directory:

mkdir getting_started_with_git
cd getting_started_with_git

Now to initialize a Git repository:

git init
Initialized empty Git repository in C:/Users/Dave/Documents/GitHub/getting_started_with_git/.git/

Once you've done this step, a directory named ".git" is created, which is actually the (empty) repository residing inside the getting_started_with_git directory. To find out the status of the project, type:

git status
# On branch master
#
# Initial commit
#
nothing to commit (create/copy files and use "git add" to track)
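If you are curious about the ".git" directory that git init created, you can ask Git where it is and peek inside. This is a small sketch in a throwaway directory, assuming Git is installed; the exact list of files inside can differ between Git versions.

```shell
# Throwaway repository for the demonstration.
demo=$(mktemp -d)
cd "$demo"
git init -q

# Ask Git where the repository data lives, relative to the current directory.
git_dir=$(git rev-parse --git-dir)
echo "Git directory: $git_dir"

# HEAD records which branch is checked out; objects/ is the object database.
ls "$git_dir"
```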

Let's create a new file called hello.txt in the getting_started_with_git directory:

echo Hello world > hello.txt

We can run git status to see the status change:

git status
# On branch master
#
# Initial commit
#
# Untracked files:
#   (use "git add <file>..." to include in what will be committed)
#
#       hello.txt
nothing added to commit but untracked files present (use "git add" to track)

We find out that hello.txt is untracked, which means that Git is not tracking changes made to this file. Now comes step two of the Git workflow, which is to stage the file by adding a snapshot of hello.txt to the staging area. We can do this by running:

git add hello.txt
warning: LF will be replaced by CRLF in hello.txt.
The file will have its original line endings in your working directory.
#check status again
git status
# On branch master
#
# Initial commit
#
# Changes to be committed:
#   (use "git rm --cached <file>..." to unstage)
#
#       new file:   hello.txt
#

The status tells us that there are changes to be committed, i.e. adding hello.txt to the repository. Now comes step three of the Git workflow where we commit and store a snapshot of the hello.txt file to the Git directory. We can also include a message that documents the change:

git commit -m "Added hello.txt"
[master (root-commit) 89fbe8a] Added hello.txt
warning: LF will be replaced by CRLF in hello.txt.
The file will have its original line endings in your working directory.
 1 file changed, 1 insertion(+)
 create mode 100644 hello.txt

Let's make some more files:

echo Goodbye > goodbye.txt
echo Goodnight > goodnight.txt

Stage all the files using wildcards:

#add more text files to the staging area
git add '*.txt'
warning: LF will be replaced by CRLF in goodbye.txt.
The file will have its original line endings in your working directory.
warning: LF will be replaced by CRLF in goodnight.txt.
The file will have its original line endings in your working directory.
#commit all the text files
git commit -m 'Added goodbye and goodnight'
[master b695217] Added goodbye and goodnight
warning: LF will be replaced by CRLF in goodbye.txt.
The file will have its original line endings in your working directory.
warning: LF will be replaced by CRLF in goodnight.txt.
The file will have its original line endings in your working directory.
 2 files changed, 2 insertions(+)
 create mode 100644 goodbye.txt
 create mode 100644 goodnight.txt

To see what we've done, we can check the log. Do this via:

#We can see the two commit steps we ran
git log
commit b695217a6510429fb9076093d8e30915001e1838
Author: davetang <davetingpongtang at gmail.com>
Date:   Wed Sep 4 23:46:14 2013 +0900

    Added goodbye and goodnight

commit 89fbe8a69b63e8376cc4a39fb5c61d0380d70c9a
Author: davetang <davetingpongtang@gmail.com>
Date:   Wed Sep 4 23:43:36 2013 +0900

    Added hello.txt
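When the full log gets long, a one-line-per-commit view is easier to scan. This is a minimal sketch that rebuilds a two-commit history in a throwaway directory, assuming Git is installed; the "Example User" identity and the commit helper function are placeholders of mine, so the hashes will not match the ones above.

```shell
# Throwaway repository for the demonstration.
demo=$(mktemp -d)
cd "$demo"
git init -q

# Helper with a placeholder identity so commits work without global configuration.
commit() {
  git -c user.name="Example User" -c user.email="user@example.com" commit -q -m "$1"
}

echo "Hello world" > hello.txt
git add hello.txt
commit "Added hello.txt"

echo "Goodbye" > goodbye.txt
git add goodbye.txt
commit "Added goodbye"

# One line per commit, newest first: abbreviated hash followed by the message.
log_output=$(git log --oneline)
echo "$log_output"
```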

Remote Repositories

Now we can register a repository on the GitHub server as a remote of our local repository (I've named the remote github) by running:

git remote add github https://github.com/davetang/getting_started_with_git/

The next step is to push our local commits to the github remote:

#The name of our remote is github
#the default local branch name is master
#-u tells Git to remember the parameters
#so that next time we can simply run git push
git push -u github master
Counting objects: 7, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (3/3), done.
Writing objects: 100% (7/7), 538 bytes | 0 bytes/s, done.
Total 7 (delta 0), reused 0 (delta 0)
To https://github.com/davetang/getting_started_with_git/
 * [new branch]      master -> master
Branch master set up to track remote branch master from github.
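If you want to practise the remote steps without touching GitHub, a bare repository on your own disk can stand in for the server. This is a sketch under that assumption; the remote.git and work directory names and the "Example User" identity are made up, and the branch name is looked up because the default differs between Git versions.

```shell
# Throwaway directory holding both the stand-in remote and the working repository.
demo=$(mktemp -d)
cd "$demo"

# A bare repository (no working directory) plays the role of the GitHub server.
git init -q --bare remote.git

# The local repository we work in.
git init -q work
cd work
echo "Hello world" > hello.txt
git add hello.txt
git -c user.name="Example User" -c user.email="user@example.com" commit -q -m "Added hello.txt"

# Look up the current branch name instead of assuming master.
branch=$(git symbolic-ref --short HEAD)

# Register the remote under the name github and push, remembering the upstream.
git remote add github "$demo/remote.git"
git push -q -u github "$branch"

# Show the configured remotes and what the remote now contains.
git remote -v
git ls-remote github
```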

The opposite of pushing is pulling, which fetches the remote changes and merges them into our local branch:

git pull github master
From https://github.com/davetang/getting_started_with_git
 * branch            master     -> FETCH_HEAD
Already up-to-date.

#I'll make a change to goodbye.txt on the remote side and pull again
git pull github master
remote: Counting objects: 5, done.
remote: Compressing objects: 100% (2/2), done.
remote: Total 3 (delta 0), reused 0 (delta 0)
Unpacking objects: 100% (3/3), done.
From https://github.com/davetang/getting_started_with_git
 * branch            master     -> FETCH_HEAD
Updating b695217..473679b
Fast-forward
 goodbye.txt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Now if I check goodbye.txt in my local directory, I can see the updated file:

cat goodbye.txt
Goodbye matey
#make another file for the next demonstration
echo Sayonara > sayonara.txt

Now let's stage the new file:

#add another file to the stage
git add sayonara.txt
warning: LF will be replaced by CRLF in sayonara.txt.
The file will have its original line endings in your working directory.

We can use diff to look at changes within files that have been staged:

git diff --staged
diff --git a/sayonara.txt b/sayonara.txt
new file mode 100644
index 0000000..2aa8570
--- /dev/null
+++ b/sayonara.txt
@@ -0,0 +1 @@
+Sayonara
warning: LF will be replaced by CRLF in sayonara.txt.
The file will have its original line endings in your working directory.

To unstage a file use git reset:

git reset sayonara.txt
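To check that git reset only unstages the file and leaves it on disk, you can compare the short status before and after. This is a minimal sketch in a throwaway repository, assuming Git is installed; the "Example User" identity is a placeholder, and I make one commit first so the repository is not empty.

```shell
# Throwaway repository with one commit.
demo=$(mktemp -d)
cd "$demo"
git init -q
echo "Hello world" > hello.txt
git add hello.txt
git -c user.name="Example User" -c user.email="user@example.com" commit -q -m "Added hello.txt"

# Create and stage a new file.
echo "Sayonara" > sayonara.txt
git add sayonara.txt
before=$(git status --short)

# Unstage it; -q hides the summary message. The file itself is untouched.
git reset -q -- sayonara.txt
after=$(git status --short)

echo "before reset: $before"
echo "after reset:  $after"
```

Before the reset the file is marked A (staged); afterwards it is marked ?? (untracked), and sayonara.txt is still in the directory.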

Branching out

Whenever I'm testing something new, I tend to make a new folder to do the testing. In Git, developers often create a separate line of development that they can commit to independently, i.e. they branch out. Then, when all the testing is done and all is well, they can merge this branch back into the main master branch. Say we wanted to clean up all the text files; let's create a new branch called clean_up:

git branch clean_up
#check the branches
git branch
  clean_up
* master

Notice the asterisk? That means we're on the master branch; to switch branches, use the git checkout command:

git checkout clean_up
Switched to branch 'clean_up'

Now to remove all the files use git rm, which will not only remove the actual files from disk, but will also stage the removal of the files for us:

git rm '*.txt'
rm 'goodbye.txt'
rm 'goodnight.txt'
rm 'hello.txt'
#now to commit
git commit -m "Removed all text files"
[clean_up 01fb521] Removed all text files
 3 files changed, 3 deletions(-)
 delete mode 100644 goodbye.txt
 delete mode 100644 goodnight.txt
 delete mode 100644 hello.txt

Now to switch back to the master branch, merge the clean_up branch into it, and delete the clean_up branch:

git checkout master
Switched to branch 'master'
Your branch is ahead of 'github/master' by 1 commit.
  (use "git push" to publish your local commits)
#merge with the clean_up branch
git merge clean_up
Updating 473679b..01fb521
Fast-forward
 goodbye.txt   | 1 -
 goodnight.txt | 1 -
 hello.txt     | 1 -
 3 files changed, 3 deletions(-)
 delete mode 100644 goodbye.txt
 delete mode 100644 goodnight.txt
 delete mode 100644 hello.txt
git branch -d clean_up
Deleted branch clean_up (was 01fb521).
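The whole branch-and-merge round trip can be condensed into a throwaway repository if you want to replay it safely. This is a sketch, assuming Git is installed; the "Example User" identity and the commit helper are placeholders of mine, and the starting branch name is looked up because newer Git versions may not call it master.

```shell
# Throwaway repository for the demonstration.
demo=$(mktemp -d)
cd "$demo"
git init -q

# Helper with a placeholder identity so commits work without global configuration.
commit() {
  git -c user.name="Example User" -c user.email="user@example.com" commit -q -m "$1"
}

echo "Hello world" > hello.txt
git add hello.txt
commit "Added hello.txt"
main_branch=$(git symbolic-ref --short HEAD)

# Create the clean_up branch and switch to it in one step.
git checkout -q -b clean_up
git rm -q hello.txt
commit "Removed hello.txt"

# Back on the original branch the file is still there until we merge.
git checkout -q "$main_branch"
ls

# Merge the clean_up branch, then delete it now that its work is merged.
git merge -q clean_up
git branch -d clean_up
git branch
```

Because the original branch had no new commits of its own, the merge is a fast-forward, just like in the session above.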

And finally let's push the changes to the remote repository on GitHub:

git push
Counting objects: 3, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (1/1), done.
Writing objects: 100% (2/2), 205 bytes | 0 bytes/s, done.
Total 2 (delta 0), reused 0 (delta 0)
To https://github.com/davetang/getting_started_with_git/
   473679b..01fb521  master -> master
#now to exit
exit

Now if I check my local directory:

#all that's left is sayonara
ls
sayonara.txt

Lastly, to work on the project on another computer, clone it:

git clone https://github.com/davetang/getting_started_with_git

Git on Linux

There's a bit more setting up to get Git working on Linux (CentOS for me):

Download the source and compile
Generate private and public keys
The remote address should be git@github.com:davetang/getting_started_with_git/ instead of https://github.com/davetang/getting_started_with_git/ (as I showed above)
Push as per usual, e.g. git push -u github master

Conclusions

I've finally taken a little bit of time to get started with Git; it's long overdue. I'm trying to become more and more systematic in my work, not only to promote reproducibility, but to become more efficient. In the past I've created README files in directories that document what I've done and relied on comments within my code. However, I'm trying to move on to the next level, i.e. by working like the pros and using version control software. There is other version control software out there, but I chose Git because one of my colleagues, who I think is such a great bioinformatician (and biologist), uses it. So I'm always trying to follow in his footsteps.

If you didn't find this useful, have a look at this Git guide from Lifehacker.

This work is licensed under a Creative Commons
Attribution 4.0 International License.
