Git is a distributed version control and source code management (SCM) system with an emphasis on speed. What's version control? Version control is a system that records changes to a file or a set of files over time so that you can recall specific versions later. Here's an example: check out this tweet and the corresponding replies. It was a tweet regarding this scientist. If you read the latest version of the article there's nothing flamboyant (as stated in the tweet) about it because it has been edited since that tweet. However, if you wanted to see "the most glowing Wikipedia article written about any scientist", you can click view history on the article page and look at previous versions of the article. For example, through version control you can access this older version of the article; one that's definitely flashier than the current one.
Anyway, back to the topic: Git is a Distributed Version Control System (DVCS), which means that clients don’t just check out the latest snapshot of the files: they fully mirror the repository. So what makes Git different from other version control systems? Quoting this guide:
The major difference between Git and any other Version Control System (VCS) is the way Git thinks about its data. Conceptually, most other systems store information as a list of file-based changes. These systems think of the information they keep as a set of files and the changes made to each file over time. Git doesn't think of or store its data this way. Instead, Git thinks of its data more like a set of snapshots of a mini file-system. Every time you commit, or save the state of your project in Git, it basically takes a picture of what all your files look like at that moment and stores a reference to that snapshot. To be efficient, if files have not changed, Git doesn't store the file again - just a link to the previous identical file it has already stored.
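To see the "link to the previous identical file" idea for myself, here is a minimal sketch, assuming a throwaway repository in a temporary directory (the file names and the "Example User" identity are made up for the demo). It compares the object IDs Git records for a file in two consecutive commits:

```shell
set -e
repo=$(mktemp -d)                      # hypothetical scratch repository
cd "$repo"
git init -q
git config user.name "Example User"    # placeholder identity, local to this repo
git config user.email "user@example.com"

echo "unchanged" > stable.txt
echo "v1" > changing.txt
git add .
git commit -q -m "first snapshot"

echo "v2" > changing.txt
git commit -q -am "second snapshot"

# Object ID of each file as recorded in the previous and the current commit
stable_first=$(git rev-parse HEAD~1:stable.txt)
stable_second=$(git rev-parse HEAD:stable.txt)
changing_first=$(git rev-parse HEAD~1:changing.txt)
changing_second=$(git rev-parse HEAD:changing.txt)

echo "stable.txt:   $stable_first -> $stable_second"
echo "changing.txt: $changing_first -> $changing_second"
```

The two IDs printed for stable.txt should be identical, meaning both snapshots point at the same stored object, while the IDs for changing.txt differ.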
In Git, there are three main states that a file can reside in: committed, modified, and staged. Committed means that the data is safely stored in your local database. Modified means that you have changed the file but have not committed it to your database yet. Staged means that you have marked a modified file in its current version to go into your next commit snapshot.
The Git directory is where Git stores the metadata and object database for your project. This is the most important part of Git, and it is what is copied when you clone a repository from another computer.
The working directory is a single checkout of one version of the project. These files are pulled out of the compressed database in the Git directory and placed on disk for you to use or modify. The staging area is a simple file, generally contained in your Git directory, that stores information about what will go into your next commit. It’s sometimes referred to as the index, but it’s becoming standard to refer to it as the staging area.
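As a quick check that the staging area really is a file inside the Git directory, here is a small sketch, again assuming a throwaway repository created with mktemp (the identity settings are placeholders). It stages one file and then looks at .git/index and at what Git reports as staged:

```shell
set -e
repo=$(mktemp -d)                      # hypothetical scratch repository
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

echo "Hello world" > hello.txt
git add hello.txt

# The staging area (index) is stored as a single file in the Git directory
ls -l .git/index

# Each line shows file mode, object ID, stage number, and path
staged=$(git ls-files --stage)
echo "$staged"
```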
The basic Git workflow goes something like this:
- You modify files in your working directory.
- You stage the files, adding snapshots of them to your staging area.
- You do a commit, which takes the files as they are in the staging area and stores that snapshot permanently to your Git directory.
If a particular version of a file is in the Git directory, it’s considered committed. If it’s modified but has been added to the staging area, it is staged. And if it was changed since it was checked out but has not been staged, it is modified.
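The three states can be seen side by side with git status --short. This is only a sketch under my own assumptions: a scratch repository, and three files named after the state each one ends up in.

```shell
set -e
repo=$(mktemp -d)                      # hypothetical scratch repository
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

echo "one" > committed.txt
echo "two" > staged.txt
echo "three" > modified.txt
git add .
git commit -q -m "baseline"

echo "more" >> staged.txt
git add staged.txt                     # changed and staged
echo "more" >> modified.txt            # changed but not staged

status=$(git status --short)
echo "$status"
# prints:
#  M modified.txt
# M  staged.txt
```

In the short format the first column describes the staging area and the second column the working directory, so "M " means staged and " M" means modified but not staged. committed.txt is not listed because it matches the last commit.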
Let's try it out
So a while ago I found this course on Code School for testing out Git. I'm going to follow the Try Git course but on my own computer. At the moment I'm working on my Windows laptop, but it should be all the same. Also please ignore the syntax colouring for the git code; I chose to colour the syntax by shell code (since I didn't have a Git code option), so the colouring doesn't make any sense.
I'll start by creating a new directory:
mkdir getting_started_with_git
cd getting_started_with_git
Now to initialize a Git repository:
git init
Initialized empty Git repository in C:/Users/Dave/Documents/GitHub/getting_started_with_git/.git/
Once you've done this step, a directory named ".git" is created; this is the (empty) repository, and it resides inside the getting_started_with_git directory. To find out the status of the project, type:
git status
# On branch master
#
# Initial commit
#
nothing to commit (create/copy files and use "git add" to track)
Let's create a new file called hello.txt in the getting_started_with_git directory:
echo Hello world > hello.txt
We can run git status to see the status change:
git status
# On branch master
#
# Initial commit
#
# Untracked files:
#   (use "git add <file>..." to include in what will be committed)
#
#       hello.txt
nothing added to commit but untracked files present (use "git add" to track)
We find out that hello.txt is untracked, which means that Git is not tracking changes made to this file. Now comes step two of the Git workflow, which is to stage the file by adding a snapshot of hello.txt to the staging area. We can do this by running:
git add hello.txt
warning: LF will be replaced by CRLF in hello.txt.
The file will have its original line endings in your working directory.

#check status again
git status
# On branch master
#
# Initial commit
#
# Changes to be committed:
#   (use "git rm --cached <file>..." to unstage)
#
#       new file:   hello.txt
#
The status tells us that there are changes to be committed, i.e. adding hello.txt to the repository. Now comes step three of the Git workflow where we commit and store a snapshot of the hello.txt file to the Git directory. We can also include a message that documents the change:
git commit -m "Added hello.txt"
[master (root-commit) 89fbe8a] Added hello.txt
warning: LF will be replaced by CRLF in hello.txt.
The file will have its original line endings in your working directory.
 1 file changed, 1 insertion(+)
 create mode 100644 hello.txt
Let's make some more files:
echo Goodbye > goodbye.txt
echo Goodnight > goodnight.txt
Stage all the files using wildcards:
#add more text files to the staging area
git add '*.txt'
warning: LF will be replaced by CRLF in goodbye.txt.
The file will have its original line endings in your working directory.
warning: LF will be replaced by CRLF in goodnight.txt.
The file will have its original line endings in your working directory.

#commit all the text files
git commit -m 'Added goodbye and goodnight'
[master b695217] Added goodbye and goodnight
warning: LF will be replaced by CRLF in goodbye.txt.
The file will have its original line endings in your working directory.
warning: LF will be replaced by CRLF in goodnight.txt.
The file will have its original line endings in your working directory.
 2 files changed, 2 insertions(+)
 create mode 100644 goodbye.txt
 create mode 100644 goodnight.txt
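The quotes around '*.txt' matter. Without them the shell expands the pattern before Git sees it, and only for the current directory. With them Git does the matching itself and also looks inside subdirectories. A small sketch, assuming a scratch repository with one text file at the top level and one in a made-up notes directory:

```shell
set -e
repo=$(mktemp -d)                      # hypothetical scratch repository
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

mkdir notes
echo "a" > top.txt
echo "b" > notes/nested.txt

# Unquoted: the shell turns *.txt into top.txt before git runs
git add *.txt
unquoted=$(git ls-files)
echo "after unquoted add: $unquoted"

# Quoted: Git receives the pattern and matches it against all paths
git add '*.txt'
quoted=$(git ls-files)
echo "after quoted add:"
echo "$quoted"
```

After the unquoted add only top.txt is in the staging area; after the quoted add notes/nested.txt is staged as well.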
To see what we've done, we can check the log:
#We can see the two commit steps we ran
git log
commit b695217a6510429fb9076093d8e30915001e1838
Author: davetang <davetingpongtang at gmail.com>
Date:   Wed Sep 4 23:46:14 2013 +0900

    Added goodbye and goodnight

commit 89fbe8a69b63e8376cc4a39fb5c61d0380d70c9a
Author: davetang <davetingpongtang@gmail.com>
Date:   Wed Sep 4 23:43:36 2013 +0900

    Added hello.txt
Remote Repositories
Now we can connect our local repository to a repository on GitHub by registering it as a remote named github:
git remote add github https://github.com/davetang/getting_started_with_git/
The next step is to push our local commits to the github remote:
#The name of our remote is github
#the default local branch name is master
#-u tells Git to remember the parameters
#so that next time we can simply run git push
git push -u github master
Counting objects: 7, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (3/3), done.
Writing objects: 100% (7/7), 538 bytes | 0 bytes/s, done.
Total 7 (delta 0), reused 0 (delta 0)
To https://github.com/davetang/getting_started_with_git/
 * [new branch]      master -> master
Branch master set up to track remote branch master from github.
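What -u "remembers" can be inspected without touching GitHub. The sketch below is an offline stand-in, not what I ran above: it assumes a local bare repository in a temporary directory plays the role of the remote, and it reads the branch name from Git instead of assuming master.

```shell
set -e
work=$(mktemp -d)                      # hypothetical working repository
remote=$(mktemp -d)                    # hypothetical stand-in for the GitHub repository
git init -q --bare "$remote"

cd "$work"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"
echo "Hello world" > hello.txt
git add hello.txt
git commit -q -m "Added hello.txt"
branch=$(git symbolic-ref --short HEAD)   # master or main, depending on configuration

git remote add github "$remote"
git push -q -u github "$branch"

# -u recorded an upstream for the current branch
upstream=$(git rev-parse --abbrev-ref "@{u}")
echo "upstream: $upstream"

# so a bare "git push" now knows where to go
git push
```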
The opposite of pushing is pulling, which retrieves remote changes into our local repository:
git pull github master
From https://github.com/davetang/getting_started_with_git
 * branch            master     -> FETCH_HEAD
Already up-to-date.

#I'll make a change to goodbye.txt on the remote side and pull again
git pull github master
remote: Counting objects: 5, done.
remote: Compressing objects: 100% (2/2), done.
remote: Total 3 (delta 0), reused 0 (delta 0)
Unpacking objects: 100% (3/3), done.
From https://github.com/davetang/getting_started_with_git
 * branch            master     -> FETCH_HEAD
Updating b695217..473679b
Fast-forward
 goodbye.txt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
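The "change on the remote side" can also be reproduced offline. This is a sketch under my own assumptions: a local bare repository stands in for GitHub, and a second clone (the directory names first and second are made up) plays the part of whoever edits the file remotely.

```shell
set -e
base=$(mktemp -d)                      # hypothetical scratch area
git init -q --bare "$base/remote.git"  # stand-in for the GitHub repository

# First copy: create a commit and push it
git init -q "$base/first"
cd "$base/first"
git config user.name "Example User"
git config user.email "user@example.com"
echo "Goodbye" > goodbye.txt
git add goodbye.txt
git commit -q -m "Added goodbye"
branch=$(git symbolic-ref --short HEAD)
git remote add github "$base/remote.git"
git push -q -u github "$branch"

# Second copy: change the file and push, simulating an edit on the remote side
git clone -q "$base/remote.git" "$base/second"
cd "$base/second"
git config user.name "Example User"
git config user.email "user@example.com"
echo "Goodbye matey" > goodbye.txt
git commit -q -am "Updated goodbye"
git push -q origin "$branch"

# Back in the first copy: pull brings the change in
cd "$base/first"
git pull -q github "$branch"
cat goodbye.txt
```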
Now if I check goodbye.txt in my local directory, I can see the updated file:
cat goodbye.txt
Goodbye matey

#make another file for the next demonstration
echo Sayonara > sayonara.txt
Now let's add another file:
#add another file to the stage
git add sayonara.txt
warning: LF will be replaced by CRLF in sayonara.txt.
The file will have its original line endings in your working directory.
We can use diff to look at changes within files that have been staged:
git diff --staged
diff --git a/sayonara.txt b/sayonara.txt
new file mode 100644
index 0000000..2aa8570
--- /dev/null
+++ b/sayonara.txt
@@ -0,0 +1 @@
+Sayonara
warning: LF will be replaced by CRLF in sayonara.txt.
The file will have its original line endings in your working directory.
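It helps to compare git diff --staged with plain git diff, which shows changes that have not been staged yet. A minimal sketch, assuming a scratch repository and made-up file contents:

```shell
set -e
repo=$(mktemp -d)                      # hypothetical scratch repository
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

echo "Sayonara" > sayonara.txt
git add sayonara.txt
git commit -q -m "Added sayonara"

echo "staged line" >> sayonara.txt
git add sayonara.txt                   # this change is now in the staging area
echo "unstaged line" >> sayonara.txt   # this one only exists in the working directory

staged_diff=$(git diff --staged)       # last commit vs staging area
unstaged_diff=$(git diff)              # staging area vs working directory

echo "--- staged ---"
echo "$staged_diff"
echo "--- unstaged ---"
echo "$unstaged_diff"
```

The staged diff should contain only "+staged line" as an addition, and the unstaged diff only "+unstaged line".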
To unstage a file use git reset:
git reset sayonara.txt
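Unstaging does not delete anything; the file stays on disk and simply goes back to being untracked. Here is a sketch that checks this, assuming a scratch repository that already has one commit:

```shell
set -e
repo=$(mktemp -d)                      # hypothetical scratch repository
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

echo "Hello world" > hello.txt
git add hello.txt
git commit -q -m "Added hello.txt"

echo "Sayonara" > sayonara.txt
git add sayonara.txt                   # staged
git reset -q sayonara.txt              # unstaged again

after=$(git status --short)
echo "$after"
# prints: ?? sayonara.txt
```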
Branching out
Whenever I'm testing something new, I tend to make a new folder to do the testing. In Git, developers often create a new copy of their code that they can make separate commits to, i.e. branching out. Then when all the testing is done and all is well, they can merge this branch back into the master branch. Say we wanted to clean up all the text files; let's create a new branch called clean_up:
git branch clean_up

#check the branches
git branch
  clean_up
* master
Notice the asterisk? That means we're on the master branch; to switch branches, use git checkout:
git checkout clean_up
Switched to branch 'clean_up'
Now to remove all the files use git rm, which will not only remove the actual files from disk, but will also stage the removal of the files for us:
git rm '*.txt'
rm 'goodbye.txt'
rm 'goodnight.txt'
rm 'hello.txt'

#now to commit
git commit -m "Removed all text files"
[clean_up 01fb521] Removed all text files
 3 files changed, 3 deletions(-)
 delete mode 100644 goodbye.txt
 delete mode 100644 goodnight.txt
 delete mode 100644 hello.txt
Now to switch back to the master branch, merge it with the clean_up branch, and remove the clean_up branch:
git checkout master
Switched to branch 'master'
Your branch is ahead of 'github/master' by 1 commit.
  (use "git push" to publish your local commits)

#merge with the clean_up branch
git merge clean_up
Updating 473679b..01fb521
Fast-forward
 goodbye.txt   | 1 -
 goodnight.txt | 1 -
 hello.txt     | 1 -
 3 files changed, 3 deletions(-)
 delete mode 100644 goodbye.txt
 delete mode 100644 goodnight.txt
 delete mode 100644 hello.txt

git branch -d clean_up
Deleted branch clean_up (was 01fb521).
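The "Fast-forward" in the merge output means Git did not need to create a merge commit; it just moved the branch pointer forward to the tip of clean_up. A sketch that checks this in a scratch repository (the branch name is read from Git, since the default may be master or main):

```shell
set -e
repo=$(mktemp -d)                      # hypothetical scratch repository
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

echo "Hello world" > hello.txt
git add hello.txt
git commit -q -m "Added hello.txt"
branch=$(git symbolic-ref --short HEAD)

git branch clean_up
git checkout -q clean_up
git rm -q '*.txt'
git commit -q -m "Removed all text files"

git checkout -q "$branch"
before=$(git rev-parse HEAD)
tip=$(git rev-parse clean_up)
git merge -q clean_up
after=$(git rev-parse HEAD)

echo "before merge: $before"
echo "after merge:  $after"
echo "clean_up tip: $tip"

git branch -q -d clean_up              # safe to delete: its commits are now on the main branch
```

After the merge, the current branch and the old clean_up tip should be the same commit.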
And finally let's push the changes to the remote repository on GitHub:
git push
Counting objects: 3, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (1/1), done.
Writing objects: 100% (2/2), 205 bytes | 0 bytes/s, done.
Total 2 (delta 0), reused 0 (delta 0)
To https://github.com/davetang/getting_started_with_git/
   473679b..01fb521  master -> master

#now to exit
exit
Now if I check my local directory:
#all that's left is sayonara
ls
sayonara.txt
Lastly, to work on the project on another computer, clone it:
git clone https://github.com/davetang/getting_started_with_git
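A clone carries the whole history, not just the latest files. The sketch below assumes a local scratch repository as the source instead of the GitHub URL (the directory names source and copy are made up) and compares the number of commits on each side:

```shell
set -e
base=$(mktemp -d)                      # hypothetical scratch area
git init -q "$base/source"
cd "$base/source"
git config user.name "Example User"
git config user.email "user@example.com"

echo "Hello world" > hello.txt
git add hello.txt
git commit -q -m "Added hello.txt"
echo "Goodbye" > goodbye.txt
git add goodbye.txt
git commit -q -m "Added goodbye"

git clone -q "$base/source" "$base/copy"

source_count=$(git rev-list --count HEAD)
copy_count=$(git -C "$base/copy" rev-list --count HEAD)
echo "commits in source: $source_count, commits in clone: $copy_count"
```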
Git on Linux
There's a bit more setting up to get Git working on Linux (CentOS for me):
- Download the source and compile
- Generate private and public keys
- The remote address should be git@github.com:davetang/getting_started_with_git/ instead of https://github.com/davetang/getting_started_with_git/ (as I showed above)
- Push as per usual, e.g. git push -u github master
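If the remote was already added with the https address, it can be switched to the SSH address rather than removed and re-added. A sketch, assuming a scratch repository; nothing is pushed, so no network access or keys are needed to run it:

```shell
set -e
repo=$(mktemp -d)                      # hypothetical scratch repository
cd "$repo"
git init -q

# Key generation is usually done with ssh-keygen; it is not run in this sketch.

git remote add github https://github.com/davetang/getting_started_with_git/
git remote set-url github git@github.com:davetang/getting_started_with_git/

url=$(git config --get remote.github.url)
echo "github now points to: $url"
git remote -v
```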
Conclusions
I've finally taken a little bit of time to get started with Git; it's long overdue. I'm trying to become more and more systematic in my work, not only to promote reproducibility, but to become more efficient. In the past I've created README files in directories that document what I've done and relied on comments within my code. However, I'm trying to move on to the next level, i.e. by working like the pros and using version control software. There is other version control software out there, but I chose Git because one of my colleagues, who I think is such a great bioinformatician (and biologist), uses it. So I'm always trying to follow in his footsteps.
If you didn't find this useful, have a look at this Git guide from Lifehacker.
This work is licensed under a Creative Commons Attribution 4.0 International License.