Using Google Cloud SDK to download GATK resource bundle files

If you are planning on running some of the GATK workflows locally, you will need various reference files, known as the Resource Bundle. I initially downloaded the files using FTP, but later realised that various files, such as the BWA index and scatter files, were missing from the FTP server; the required files are, however, available in a Google Cloud bucket. This post is on using the Google Cloud SDK, which contains tools and libraries for interacting with Google Cloud products and services, to download the GATK resource bundle files.

We'll use Conda to install Google Cloud SDK into a new environment called google_cloud.

# install
conda create -c conda-forge -n google_cloud google-cloud-sdk

# activate the environment
conda activate google_cloud

# outputs subcommands
gsutil
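
To confirm that the installation worked, print the version; the exact output will depend on the release you installed.

gsutil version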

We will use the cp subcommand from gsutil to transfer the files from the bucket to our local computer. You can find more information about cp using the help command.

gsutil help cp

In addition, we will be using the -m option:

Causes supported operations (acl ch, acl set, cp, mv, rm, rsync, and setmeta) to run in parallel. This can significantly improve performance if you are performing operations on a large number of files over a reasonably fast network connection.

For more information on gsutil's global options, use the help command.

gsutil help options

Before we begin transferring files, let's use the ls subcommand to list the files inside the bucket.

gsutil ls gs://genomics-public-data/references/hg38/v0/

# output not shown
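
If you also want object sizes and timestamps, ls supports a long listing; the exact output format may vary between gsutil versions.

# long listing with sizes and modification times
gsutil ls -l gs://genomics-public-data/references/hg38/v0/ | head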

Let's transfer the README file first into the current directory.

# transfer README into the current directory
gsutil cp gs://genomics-public-data/references/hg38/v0/README .
Copying gs://genomics-public-data/references/hg38/v0/README...
- [1 files][  3.9 KiB/  3.9 KiB]                                                
Operation completed over 1 objects/3.9 KiB.

# check out the first line
head -1 README 
Details about this reference are available in PO-1914
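
gsutil also understands shell-style wildcards, which is handy for grabbing a group of related files in one go. As a sketch, the command below should fetch the reference FASTA along with its index files, assuming they share the Homo_sapiens_assembly38.fasta prefix (quote the URL so your shell does not expand the asterisk).

gsutil -m cp 'gs://genomics-public-data/references/hg38/v0/Homo_sapiens_assembly38.fasta*' .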

Before we start the full transfer, let's find out how much data will be sent using the du subcommand with the -h (human-readable sizes) option, saving the output to a file for later use.

gsutil du -h gs://genomics-public-data/references/hg38/v0 > bundle_files.txt

tail bundle_files.txt 
571.95 KiB   gs://genomics-public-data/references/hg38/v0/scattered_calling_intervals/temp_0047_of_50/scattered.interval_list
571.95 KiB   gs://genomics-public-data/references/hg38/v0/scattered_calling_intervals/temp_0047_of_50/
570.8 KiB    gs://genomics-public-data/references/hg38/v0/scattered_calling_intervals/temp_0048_of_50/scattered.interval_list
570.8 KiB    gs://genomics-public-data/references/hg38/v0/scattered_calling_intervals/temp_0048_of_50/
570.85 KiB   gs://genomics-public-data/references/hg38/v0/scattered_calling_intervals/temp_0049_of_50/scattered.interval_list
570.85 KiB   gs://genomics-public-data/references/hg38/v0/scattered_calling_intervals/temp_0049_of_50/
572.71 KiB   gs://genomics-public-data/references/hg38/v0/scattered_calling_intervals/temp_0050_of_50/scattered.interval_list
572.71 KiB   gs://genomics-public-data/references/hg38/v0/scattered_calling_intervals/temp_0050_of_50/
27.88 MiB    gs://genomics-public-data/references/hg38/v0/scattered_calling_intervals/
294.08 GiB   gs://genomics-public-data/references/hg38/v0/
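
Directory entries in the du output end with a slash, so excluding them gives a quick count of how many individual objects make up that total:

# count objects by excluding directory entries
grep -v "/$" bundle_files.txt | wc -l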

The command below will transfer all files nested under gs://genomics-public-data/references/hg38/v0 into hg38 following the same directory structure.

mkdir -p hg38
gsutil -m cp -r gs://genomics-public-data/references/hg38/v0 hg38/
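
If a transfer of this size gets interrupted, re-running cp will start copying from scratch. The rsync subcommand only transfers objects that are missing or differ from the local copy, so it can be used to resume; a sketch, assuming the hg38/v0 layout created by the cp command above:

# resume an interrupted transfer; rsync skips files already present
mkdir -p hg38/v0
gsutil -m rsync -r gs://genomics-public-data/references/hg38/v0 hg38/v0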

That would have been the end of the post, but I found that several files could not be downloaded, such as the two below (see the sketch after the list for tracking down all missing files):

  1. autosomes-1kg-minusNA12878-ALL.vcf
  2. Homo_sapiens_assembly38.dbsnp138.vcf
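
Rather than spotting missing files by eye, we can compare the remote listing against what actually arrived. A sketch, assuming the hg38/v0 layout created by the recursive cp above; remote.txt and local.txt are just temporary working files:

# remote objects, skipping directory entries
awk '{print $3}' bundle_files.txt | grep -v "/$" | sort > remote.txt

# local files, rewritten as bucket URLs so the lists are comparable
find hg38 -type f | sed 's|^hg38/|gs://genomics-public-data/references/hg38/|' | sort > local.txt

# objects present remotely but missing locally
comm -13 local.txt remote.txt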

This warning from gsutil may be the reason why I couldn't download these files:

CommandException:
Downloading this composite object requires integrity checking with CRC32c,
but your crcmod installation isn't using the module's C extension, so the
hash computation will likely throttle download performance. For help
installing the extension, please see "gsutil help crcmod".

To download regardless of crcmod performance or to skip slow integrity
checks, see the "check_hashes" option in your boto config file.

NOTE: It is strongly recommended that you not disable integrity checks. Doing so
could allow data corruption to go undetected during uploading/downloading.

Running gsutil help crcmod provides information on how to overcome this problem, which I didn't attempt at the time. Since I need Homo_sapiens_assembly38.dbsnp138.vcf for several of the pipelines, I simply downloaded it manually from my web browser.
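
Since this is a public bucket, individual objects can also be fetched over plain HTTPS, which is effectively what the browser download does. A sketch, assuming the standard storage.googleapis.com URL scheme for publicly readable objects (note that you lose gsutil's integrity checking this way):

wget https://storage.googleapis.com/genomics-public-data/references/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf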

Downloading files from a list

The gsutil tool can accept a list of files to transfer from STDIN, via the -I option. For example, the CrossSpeciesContamination folder is 222.3 GiB in size and, as far as I'm aware, isn't used in any of the pipelines I want to run.

cat bundle_files.txt | grep Cross | tail -1
222.3 GiB    gs://genomics-public-data/references/hg38/v0/CrossSpeciesContamination/

We can use grep -v to exclude files in the CrossSpeciesContamination folder and create a list of files to download.

cat bundle_files.txt |
grep -v CrossSpeciesContamination |
grep -v "/$" | # excludes directories
awk '{print $3}' > files_to_download.txt
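
As a quick sanity check before starting the transfer, confirm that the list looks sensible:

# number of files to download and a peek at the first few entries
wc -l files_to_download.txt
head -3 files_to_download.txt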

Now we just need to pipe the list of files to gsutil, using the -I option to read the source URLs from STDIN.

mkdir -p hg38
cat files_to_download.txt | gsutil -m cp -I hg38

Large files

If you try to download large files, you may get the following note:

gsutil cp gs://genomics-public-data/references/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf .
Copying gs://genomics-public-data/references/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf...
==> NOTE: You are downloading one or more large file(s), which would            
run significantly faster if you enabled sliced object downloads. This
feature is enabled by default but requires that compiled crcmod be
installed (see "gsutil help crcmod").

CommandException: 
Downloading this composite object requires integrity checking with CRC32c,
but your crcmod installation isn't using the module's C extension, so the
hash computation will likely throttle download performance. For help
installing the extension, please see "gsutil help crcmod".

To download regardless of crcmod performance or to skip slow integrity
checks, see the "check_hashes" option in your boto config file.

NOTE: It is strongly recommended that you not disable integrity checks. Doing so
could allow data corruption to go undetected during uploading/downloading.

We will use my Docker image to set up crcmod. Modify the mount path accordingly, so that you have access to the downloaded files outside of Docker.

# run Docker interactively
docker run --rm -it -v /data/:/data/ davetang/base /bin/bash

# install the Google Cloud SDK into a new environment
conda create -c conda-forge -n google_cloud google-cloud-sdk

# activate the environment
conda activate google_cloud

# install the compiler and Python headers needed to build crcmod's C extension
apt update
apt install -y gcc python3-dev python3-setuptools

# reinstall crcmod so that the C extension is compiled
pip3 uninstall -y crcmod
pip3 install --no-cache-dir -U crcmod
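
# sanity check: "gsutil version -l" should report "compiled crcmod: True"
gsutil version -l | grep crcmod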

# download
gsutil cp gs://genomics-public-data/references/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf .