If you plan on running some of the GATK workflows locally, you will need various reference files, collectively known as the Resource Bundle. I initially downloaded the files using FTP but later realised that various files, such as the BWA index and scatter files, were missing from the FTP server. Only later did I realise that the required files are available in a Google Cloud bucket. This post covers using the Google Cloud SDK, which contains tools and libraries for interacting with Google Cloud products and services, to download the GATK Resource Bundle files.

We'll use Conda to install Google Cloud SDK into a new environment called google_cloud.

# install the Google Cloud SDK from conda-forge into a new environment
conda create -y -n google_cloud -c conda-forge google-cloud-sdk

# activate the environment
conda activate google_cloud

# outputs subcommands
gsutil


We will use the cp subcommand from gsutil to transfer the files from the bucket to our local computer. You can find more information about cp using the help command.

gsutil help cp


In addition, we will be using the -m option:

Causes supported operations (acl ch, acl set, cp, mv, rm, rsync, and setmeta) to run in parallel. This can significantly improve performance if you are performing operations on a large number of files over a reasonably fast network connection.

gsutil help options


Before we begin transferring files, let's use the ls subcommand to list the files inside the bucket.

gsutil ls gs://genomics-public-data/references/hg38/v0/

# output not shown


Let's transfer the README file first into the current directory.

# transfer README into the current directory
gsutil cp gs://genomics-public-data/references/hg38/v0/README .
- [1 files][  3.9 KiB/  3.9 KiB]
Operation completed over 1 objects/3.9 KiB.

# check out the first line
head -1 README

Before we start transferring, let's find out how much data will be sent using the du subcommand with the -h (human readable sizes) option.

gsutil du -h gs://genomics-public-data/references/hg38/v0 > bundle_files.txt

tail bundle_files.txt
571.95 KiB   gs://genomics-public-data/references/hg38/v0/scattered_calling_intervals/temp_0047_of_50/scattered.interval_list
571.95 KiB   gs://genomics-public-data/references/hg38/v0/scattered_calling_intervals/temp_0047_of_50/
570.8 KiB    gs://genomics-public-data/references/hg38/v0/scattered_calling_intervals/temp_0048_of_50/scattered.interval_list
570.8 KiB    gs://genomics-public-data/references/hg38/v0/scattered_calling_intervals/temp_0048_of_50/
570.85 KiB   gs://genomics-public-data/references/hg38/v0/scattered_calling_intervals/temp_0049_of_50/scattered.interval_list
570.85 KiB   gs://genomics-public-data/references/hg38/v0/scattered_calling_intervals/temp_0049_of_50/
572.71 KiB   gs://genomics-public-data/references/hg38/v0/scattered_calling_intervals/temp_0050_of_50/scattered.interval_list
572.71 KiB   gs://genomics-public-data/references/hg38/v0/scattered_calling_intervals/temp_0050_of_50/
27.88 MiB    gs://genomics-public-data/references/hg38/v0/scattered_calling_intervals/
294.08 GiB   gs://genomics-public-data/references/hg38/v0/
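Since each line of the `du -h` output is a size, a unit, and a path, a small helper (my own sketch, not part of gsutil) can total any filtered subset of bundle_files.txt in GiB:

```shell
# to_gib: sum "size unit path" lines from `gsutil du -h` into a single GiB total
to_gib() {
  awk '$2 == "GiB" { t += $1 }
       $2 == "MiB" { t += $1 / 1024 }
       $2 == "KiB" { t += $1 / 1048576 }
       $2 == "B"   { t += $1 / 1073741824 }
       END { printf "%.2f GiB\n", t }'
}

# usage, e.g. total size excluding the directory entries:
#   grep -v "/$" bundle_files.txt | to_gib
```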


The command below will transfer all files nested under gs://genomics-public-data/references/hg38/v0 into hg38 following the same directory structure.

mkdir -p hg38
gsutil -m cp -r gs://genomics-public-data/references/hg38/v0 hg38/


That would have been the end of the post, but I found that several files could not be downloaded, such as:

1. autosomes-1kg-minusNA12878-ALL.vcf
2. Homo_sapiens_assembly38.dbsnp138.vcf

This warning from gsutil may be the reason why I couldn't download these files:

CommandException:
Downloading this composite object requires integrity checking with CRC32c,
but your crcmod installation isn't using the module's C extension, so the
hash computation will likely throttle download performance. For help
installing the extension, please see "gsutil help crcmod".

To download regardless of crcmod status, see the "check_hashes" option in
your boto config file.

NOTE: It is strongly recommended that you not disable integrity checks. Doing so
could allow data corruption to go undetected during uploading/downloading.

gsutil help crcmod provided information on how to overcome this problem, which I didn't attempt at the time. Since I needed Homo_sapiens_assembly38.dbsnp138.vcf for several of the pipelines, I just downloaded it manually from my web browser.
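For a scripted alternative to the browser, objects in public GCS buckets are also served over plain HTTPS at storage.googleapis.com/&lt;bucket&gt;/&lt;object&gt;, so wget or curl works too. The URL below is simply assembled from the bucket and object names used throughout this post:

```shell
# build the public HTTPS URL for the object
bucket=genomics-public-data
object=references/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf
url="https://storage.googleapis.com/${bucket}/${object}"
echo "${url}"

# uncomment to download (this VCF is large, so expect it to take a while)
# wget "${url}"
```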

The gsutil tool can accept a list of files to transfer from STDIN. For example, the CrossSpeciesContamination folder is 222.3 GiB in size and, as far as I'm aware, isn't used in any of the pipelines I want to run.

cat bundle_files.txt | grep Cross | tail -1
222.3 GiB    gs://genomics-public-data/references/hg38/v0/CrossSpeciesContamination/


We can use grep -v to exclude files in the CrossSpeciesContamination folder and create a list of files to download.

cat bundle_files.txt |
grep -v CrossSpeciesContamination |
grep -v "/$" | # exclude directories
awk '{print $3}' > files_to_download.txt


Now we just need to pipe the list of files to gsutil; the -I option makes cp read the list of URLs to copy from STDIN.

mkdir -p hg38
cat files_to_download.txt | gsutil -m cp -I hg38/
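After the transfer, a quick sanity check is to compare how many objects were requested against how many files landed on disk. This is my own sketch; it assumes each URL in files_to_download.txt becomes one file under hg38/:

```shell
# count_match: report requested versus downloaded file counts
# $1 = list of URLs, $2 = download directory
count_match() {
  requested=$(wc -l < "$1" | tr -d ' ')
  downloaded=$(find "$2" -type f | wc -l | tr -d ' ')
  echo "requested: ${requested} downloaded: ${downloaded}"
}

# usage after the transfer finishes:
#   count_match files_to_download.txt hg38
```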


### Large files

If you try to download large files, you may get the following note:

gsutil cp gs://genomics-public-data/references/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf .
Copying gs://genomics-public-data/references/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf...
==> NOTE: You are downloading one or more large file(s), which would run
significantly faster if you enabled sliced object downloads. This
feature is enabled by default but requires that compiled crcmod be
installed (see "gsutil help crcmod").

CommandException:
Downloading this composite object requires integrity checking with CRC32c,
but your crcmod installation isn't using the module's C extension, so the
hash computation will likely throttle download performance. For help
installing the extension, please see "gsutil help crcmod".

To download regardless of crcmod status, see the "check_hashes" option in
your boto config file.

NOTE: It is strongly recommended that you not disable integrity checks. Doing so
could allow data corruption to go undetected during uploading/downloading.


We will use my Docker image to set up crcmod. Modify the mount path accordingly, so that you have access to the downloaded file outside of Docker.

# run Docker interactively
docker run --rm -it -v /data/:/data/ davetang/base /bin/bash

# activate the environment with gsutil (google_cloud, as created earlier)
conda activate google_cloud

# install required libraries
apt update
apt install -y gcc python3-dev python3-setuptools

# reinstall crcmod so it builds its compiled C extension
pip3 uninstall -y crcmod
pip3 install --no-cache-dir -U crcmod

gsutil cp gs://genomics-public-data/references/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf .
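To confirm that the reinstall actually gave you the compiled extension, gsutil version -l reports a "compiled crcmod" line; you can also ask the module directly. Note that _usingExtension is an internal crcmod flag, so treat this as a quick check rather than a stable API:

```shell
# report whether crcmod is using its compiled C extension
python3 - <<'EOF'
try:
    import crcmod.crcmod as c
    print("compiled crcmod:", c._usingExtension)
except ImportError:
    print("crcmod not installed")
EOF
```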