If you are planning on running some of the GATK workflows locally, you will require various reference files, known as the Resource Bundle. I initially downloaded the files via FTP, but later realised that various files, such as the BWA index and scatter files, were missing from the FTP server. Only later did I realise that the required files are available in a Google Cloud bucket. This post is on using the Google Cloud SDK, which contains tools and libraries for interacting with Google Cloud products and services, to download the GATK resource bundle files.
We'll use Conda to install Google Cloud SDK into a new environment called google_cloud.
# install
conda create -c conda-forge -n google_cloud google-cloud-sdk

# activate the environment
conda activate google_cloud

# outputs subcommands
gsutil
We will use the cp subcommand from gsutil to transfer the files from the bucket to our local computer. You can find more information about cp using the help command.
gsutil help cp
In addition, we will be using the -m option:
Causes supported operations (acl ch, acl set, cp, mv, rm, rsync, and setmeta) to run in parallel. This can significantly improve performance if you are performing operations on a large number of files over a reasonably fast network connection.
For more information on various options use the help command.
gsutil help options
Before we begin transferring files, let's use the ls subcommand to list the files inside the bucket.
gsutil ls gs://genomics-public-data/references/hg38/v0/ # output not shown
Let's transfer the README file first into the current directory.
# transfer README into the current directory
gsutil cp gs://genomics-public-data/references/hg38/v0/README .
Copying gs://genomics-public-data/references/hg38/v0/README...
- [1 files][  3.9 KiB/  3.9 KiB]
Operation completed over 1 objects/3.9 KiB.

# check out the first line
head -1 README
Details about this reference are available in PO-1914
Before we start transferring, let's find out how much data will be sent using the du subcommand with the -h (human readable sizes) option.
gsutil du -h gs://genomics-public-data/references/hg38/v0 > bundle_files.txt
tail bundle_files.txt
571.95 KiB   gs://genomics-public-data/references/hg38/v0/scattered_calling_intervals/temp_0047_of_50/scattered.interval_list
571.95 KiB   gs://genomics-public-data/references/hg38/v0/scattered_calling_intervals/temp_0047_of_50/
570.8 KiB    gs://genomics-public-data/references/hg38/v0/scattered_calling_intervals/temp_0048_of_50/scattered.interval_list
570.8 KiB    gs://genomics-public-data/references/hg38/v0/scattered_calling_intervals/temp_0048_of_50/
570.85 KiB   gs://genomics-public-data/references/hg38/v0/scattered_calling_intervals/temp_0049_of_50/scattered.interval_list
570.85 KiB   gs://genomics-public-data/references/hg38/v0/scattered_calling_intervals/temp_0049_of_50/
572.71 KiB   gs://genomics-public-data/references/hg38/v0/scattered_calling_intervals/temp_0050_of_50/scattered.interval_list
572.71 KiB   gs://genomics-public-data/references/hg38/v0/scattered_calling_intervals/temp_0050_of_50/
27.88 MiB    gs://genomics-public-data/references/hg38/v0/scattered_calling_intervals/
294.08 GiB   gs://genomics-public-data/references/hg38/v0/
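Since `du -h` prints human-readable units, totalling a subset of the listing takes a little unit juggling. Below is a small sketch, shown on a fabricated two-line sample; in practice, redirect from bundle_files.txt and filter to the entries you care about first, since the listing repeats sizes for both files and their parent directories.

```shell
# convert "SIZE UNIT URL" lines to a single total in GiB
awk '
  { size = $1
    if ($2 == "KiB") size /= 1024 * 1024
    if ($2 == "MiB") size /= 1024
    total += size }
  END { printf "total: %.2f GiB\n", total }
' <<'EOF'
27.88 MiB gs://genomics-public-data/references/hg38/v0/scattered_calling_intervals/
222.3 GiB gs://genomics-public-data/references/hg38/v0/CrossSpeciesContamination/
EOF
# -> total: 222.33 GiB
```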
The command below will transfer all files nested under gs://genomics-public-data/references/hg38/v0 into hg38 following the same directory structure.
mkdir -p hg38
gsutil -m cp -r gs://genomics-public-data/references/hg38/v0 hg38/
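Once the transfer finishes, a quick local check is worthwhile; a minimal sketch, whose counts and sizes can then be eyeballed against the bucket listing in bundle_files.txt:

```shell
mkdir -p hg38
# how many files landed locally, and how much space do they use?
find hg38 -type f | wc -l
du -sh hg38
```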
That would have been the end of this post, but I found that several files could not be downloaded, such as:
- autosomes-1kg-minusNA12878-ALL.vcf
- Homo_sapiens_assembly38.dbsnp138.vcf
This warning from gsutil may be the reason why I couldn't download these files:
CommandException:
Downloading this composite object requires integrity checking with CRC32c,
but your crcmod installation isn't using the module's C extension, so the
hash computation will likely throttle download performance. For help
installing the extension, please see "gsutil help crcmod".

To download regardless of crcmod performance or to skip slow integrity
checks, see the "check_hashes" option in your boto config file.

NOTE: It is strongly recommended that you not disable integrity checks. Doing so
could allow data corruption to go undetected during uploading/downloading.
gsutil help crcmod provides information on how to overcome this problem, which I didn't attempt at first. Since I needed Homo_sapiens_assembly38.dbsnp138.vcf for several of the pipelines, I just downloaded it manually in my web browser.
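The browser route works because objects in public buckets are also served over HTTPS via storage.googleapis.com. A small sketch for deriving the HTTPS URL from the gs:// URL; the sed rewrite is an assumption on my part, but it matches the public object URL scheme:

```shell
# rewrite a gs:// URL into its public HTTPS equivalent
gs_url=gs://genomics-public-data/references/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf
https_url=$(echo "$gs_url" | sed 's|^gs://|https://storage.googleapis.com/|')
echo "$https_url"
# the file can then be fetched with e.g. wget "$https_url"
```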
Downloading files from a list of files
The gsutil tool can accept a list of files to transfer from STDIN. For example, the CrossSpeciesContamination folder is 222.3 GiB in size and, as far as I'm aware, isn't used in any of the pipelines I want to run.
cat bundle_files.txt | grep Cross | tail -1
222.3 GiB    gs://genomics-public-data/references/hg38/v0/CrossSpeciesContamination/
We can use grep -v to exclude files in the CrossSpeciesContamination folder and create a list of files to download.
cat bundle_files.txt |
  grep -v CrossSpeciesContamination |
  grep -v "/$" | # excludes directories
  awk '{print $3}' > files_to_download.txt
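To see what the pipeline keeps, here's the same filter applied to a fabricated three-line listing: the excluded folder and the directory entry are dropped, and only the object URL (the third field) survives.

```shell
printf '%s\n' \
  '222.3 GiB  gs://genomics-public-data/references/hg38/v0/CrossSpeciesContamination/' \
  '3.9 KiB    gs://genomics-public-data/references/hg38/v0/README' \
  '27.88 MiB  gs://genomics-public-data/references/hg38/v0/scattered_calling_intervals/' |
  grep -v CrossSpeciesContamination |
  grep -v "/$" |
  awk '{print $3}'
# -> gs://genomics-public-data/references/hg38/v0/README
```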
Now we just need to pipe the list of files to gsutil.
mkdir -p hg38
cat files_to_download.txt | gsutil -m cp -I hg38
Large files
If you try to download large files, you may get the following note:
gsutil cp gs://genomics-public-data/references/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf .
Copying gs://genomics-public-data/references/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf...
==> NOTE: You are downloading one or more large file(s), which would run
significantly faster if you enabled sliced object downloads. This feature
is enabled by default but requires that compiled crcmod be installed
(see "gsutil help crcmod").

CommandException:
Downloading this composite object requires integrity checking with CRC32c,
but your crcmod installation isn't using the module's C extension, so the
hash computation will likely throttle download performance. For help
installing the extension, please see "gsutil help crcmod".

To download regardless of crcmod performance or to skip slow integrity
checks, see the "check_hashes" option in your boto config file.

NOTE: It is strongly recommended that you not disable integrity checks. Doing so
could allow data corruption to go undetected during uploading/downloading.
We will use my Docker image to set up crcmod. Modify the mount path accordingly, so that you have access to the downloaded file outside of Docker.
# run Docker interactively
docker run --rm -it -v /data/:/data/ davetang/base /bin/bash

# install google cloud
conda create -c conda-forge -n google_cloud google-cloud-sdk

# activate the environment
conda activate google_cloud

# install required libraries
apt update
apt install -y gcc python3-dev python3-setuptools

# install crcmod
pip3 uninstall -y crcmod
pip3 install --no-cache-dir -U crcmod

# download
gsutil cp gs://genomics-public-data/references/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf .
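To confirm that the compiled extension was actually picked up, crcmod exposes a flag that gsutil itself checks. A hedged probe; note that `_usingExtension` is an internal detail and could change between crcmod versions:

```shell
python3 - <<'EOF'
# report whether crcmod is using its compiled C extension
try:
    from crcmod.crcmod import _usingExtension
    print("compiled crcmod:", _usingExtension)
except ImportError:
    print("crcmod not installed")
EOF
# "gsutil version -l" also reports a "compiled crcmod" line
```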
This work is licensed under a Creative Commons
Attribution 4.0 International License.