Execute gatk-workflows locally

The Broad Institute has shared its GATK workflows on GitHub; however, they are configured to run on Google Cloud. I could not find much information on executing the workflows locally; the only resource I found was this tutorial. I ran into problems while following it but eventually got it working. This post contains the steps I used to run seq-format-validation locally with Docker.

First, download and install Docker; if you are new to Docker, I prepared a tutorial. Once Docker is installed, pull the latest GATK image. I wanted to pull a specific version, but the WDL file specifies broadinstitute/gatk:latest.

docker pull broadinstitute/gatk:latest

I’ll use the same directory structure as the tutorial.

mkdir gatk-workflows
cd gatk-workflows
mkdir inputs

Download this example BAM file (wgs_bam_NA12878_24RG_hg38_NA12878_24RG_small.hg38.bam) from Google Cloud and save it in the inputs folder we just created. You will have to log into your Google account (or make one if you don’t already have an account).

Download Cromwell and store it in the gatk-workflows folder.

wget https://github.com/broadinstitute/cromwell/releases/download/47/cromwell-47.jar

The next step requires Git, so make sure you have that installed. Clone the seq-format-validation workflow into the gatk-workflows folder.

git clone https://github.com/gatk-workflows/seq-format-validation.git
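
To see which Docker image the workflow refers to, you can search the cloned WDL file for the word "docker". The helper below is a minimal sketch; list_docker_refs is a name I made up for illustration, and it only prints matching lines without interpreting the WDL.

```shell
# list_docker_refs WDL_FILE
# Hypothetical helper: print every line (with its line number) that
# mentions "docker", case-insensitively, in the given WDL file.
list_docker_refs() {
    grep -n -i 'docker' "$1"
}

# Example call, run from the gatk-workflows folder:
#   list_docker_refs seq-format-validation/validate-bam.wdl
```

Cromwell reads the image name from the WDL at run time, so you do not need to start a container yourself before launching the workflow.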

Next, modify seq-format-validation/validate-bam.inputs.json so that it looks like the block below, replacing the placeholder path with the location where you saved wgs_bam_NA12878_24RG_hg38_NA12878_24RG_small.hg38.bam on your computer:

{
  "ValidateBamsWf.bam_array": [
    "/path/to/gatk-workflows/inputs/wgs_bam_NA12878_24RG_hg38_NA12878_24RG_small.hg38.bam"
  ],
  "ValidateBamsWf.ValidateBAM.validation_mode": "SUMMARY",

  "##Comment3":"Runtime - uncomment the lines below and supply a valid docker container to override the default",
  "ValidateBamsWf.ValidateBAM.machine_mem_gb": "1",
  "ValidateBamsWf.ValidateBAM.disk_space_gb": "100",
  "#ValidateBamsWf.ValidateBAM.gatk_path_override": "String (optional)",
  "#ValidateBamsWf.gatk_docker_override": "String (optional)"
}

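If you would rather generate the inputs file than edit it by hand, something like the following sketch could work. write_inputs_json is a hypothetical helper, not part of the workflow repository. It assumes the four uncommented keys from the block above are sufficient and that the BAM path contains no characters that need escaping in JSON, such as double quotes or backslashes.

```shell
# write_inputs_json BAM_PATH OUT_JSON
# Hypothetical helper: write a Cromwell inputs file that points at BAM_PATH,
# converted to an absolute path.
write_inputs_json() {
    bam_path=$1
    out_json=$2
    # Resolve the directory part so the JSON holds an absolute path.
    bam_dir=$(cd "$(dirname "$bam_path")" && pwd) || return 1
    bam_abs="$bam_dir/$(basename "$bam_path")"
    cat > "$out_json" <<EOF
{
  "ValidateBamsWf.bam_array": [
    "$bam_abs"
  ],
  "ValidateBamsWf.ValidateBAM.validation_mode": "SUMMARY",
  "ValidateBamsWf.ValidateBAM.machine_mem_gb": "1",
  "ValidateBamsWf.ValidateBAM.disk_space_gb": "100"
}
EOF
}

# Example call, run from the gatk-workflows folder:
#   write_inputs_json inputs/wgs_bam_NA12878_24RG_hg38_NA12878_24RG_small.hg38.bam \
#       seq-format-validation/validate-bam.inputs.json
```

Note that the example call overwrites the inputs file that came with the repository, so keep a copy if you want the original comments.
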
Your directory structure should look something like this:

|-- cromwell-47.jar
|-- inputs
|   `-- wgs_bam_NA12878_24RG_hg38_NA12878_24RG_small.hg38.bam
|-- seq-format-validation
|   |-- LICENSE
|   |-- README.md
|   |-- generic.google-papi.options.json
|   |-- validate-bam.inputs.json
|   `-- validate-bam.wdl
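
Before running anything, it can help to confirm that the files are where the commands in this post expect them. The function below is a rough sketch of such a check; check_layout is a hypothetical name, and the file list simply mirrors the tree above.

```shell
# check_layout [BASE_DIR]
# Hypothetical helper: report which of the expected files are missing under
# BASE_DIR (default: current directory). Returns 0 only if all are present.
check_layout() {
    base=${1:-.}
    missing=0
    for f in \
        cromwell-47.jar \
        seq-format-validation/validate-bam.wdl \
        seq-format-validation/validate-bam.inputs.json
    do
        if [ ! -f "$base/$f" ]; then
            echo "missing: $f" >&2
            missing=1
        fi
    done
    # At least one BAM file is expected in inputs/.
    if ! ls "$base"/inputs/*.bam >/dev/null 2>&1; then
        echo "missing: inputs/*.bam" >&2
        missing=1
    fi
    return "$missing"
}

# Example call, run from the gatk-workflows folder:
#   check_layout && echo "layout looks fine"
```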

Lastly, make sure you have Java installed (use version 8).

java -version
java version "1.8.0_144"
Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)
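
If you want to check the version in a script, one option is to match on the first line of that output. This is only a sketch: is_java8 is a hypothetical helper, and it assumes the version string has the "1.8." form shown above.

```shell
# is_java8 VERSION_LINE
# Hypothetical helper: succeed if the given line looks like a Java 8
# version string such as: java version "1.8.0_144"
is_java8() {
    case $1 in
        *'version "1.8.'*) return 0 ;;
        *) return 1 ;;
    esac
}

# java -version writes to standard error, hence the 2>&1 redirection:
#   is_java8 "$(java -version 2>&1 | head -n 1)" && echo "Java 8 detected"
```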

Run the following command in the directory gatk-workflows.

java -jar cromwell-47.jar run ./seq-format-validation/validate-bam.wdl --inputs ./seq-format-validation/validate-bam.inputs.json

You should see two new folders: cromwell-executions and cromwell-workflow-logs. Inside cromwell-executions are the outputs generated by the workflow.

cat cromwell-executions/ValidateBamsWf/810afc1e-5b06-4a71-a6e5-65b64a72462d/call-ValidateBAM/shard-0/execution/wgs_bam_NA12878_24RG_hg38_NA12878_24RG_small.hg38.validation_SUMMARY.txt 
No errors found
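
The long identifier in that path is the workflow ID, which changes on every run, so typing the path by hand gets tedious. A minimal sketch for locating the report files with find is shown below; find_validation_summaries is a hypothetical helper, and the file name pattern is an assumption based on the single output file shown above.

```shell
# find_validation_summaries [ROOT]
# Hypothetical helper: list validation report files written by the
# ValidateBAM call under ROOT (default: cromwell-executions).
find_validation_summaries() {
    root=${1:-cromwell-executions}
    find "$root" -type f -path '*/call-ValidateBAM/*' -name '*.validation_*.txt'
}

# Example call, run from the gatk-workflows folder; prints each report
# preceded by its path:
#   find_validation_summaries | while IFS= read -r f; do
#       echo "== $f"
#       cat "$f"
#   done
```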

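The path to each output follows the format cromwell-executions/<workflow name>/<workflow ID>/call-<task name>/shard-<index>/execution/, where the workflow ID is a new identifier generated for every run.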


Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
12 comments
  1. Hello Dave,

    I tried to follow your instructions but am not able to execute the script. Could you please help me?

    I did install Docker. Do I have to run the Docker image broadinstitute/gatk:latest before executing the following command: java -jar cromwell-47.jar run ./seq-format-validation/validate-bam.wdl --inputs ./seq-format-validation/validate-bam.inputs.json

    Thanks a lot.

    1. Hi Akhil,

      What error did you receive? Do you have the same directory setup as the post? And did you edit the JSON file so that it points to where you downloaded the example BAM file?


      1. Thanks Dave for the reply.

        I did all that you suggested: I set up the same directory structure as the post, and I edited the JSON file as well.

        What I am not able to work out is which command in your post points to the Docker image. Do I have to run the Docker image before executing the script? What code in the script ensures that the Docker image is used?


        1. The WDL script specifies that the latest Docker GATK image should be used and Cromwell takes care of the execution; I didn’t have to modify anything.

          Can you ensure that you have successfully pulled the latest GATK image and can run it as a container? Other than that I am not sure what could be the problem.

              1. I get the following output:

                Usage template for all tools (uses --spark-runner LOCAL when used with a Spark tool)
                gatk AnyTool toolArgs

                Usage template for Spark tools (will NOT work on non-Spark tools)
                gatk SparkTool toolArgs [ -- --spark-runner sparkArgs ]

                Getting help
                gatk --list Print the list of available tools

                gatk Tool --help Print help on a particular tool

                Configuration File Specification
                --gatk-config-file PATH/TO/GATK/PROPERTIES/FILE

                gatk forwards commands to GATK and adds some sugar for submitting spark jobs

                --spark-runner controls how spark tools are run
                valid targets are:
                LOCAL: run using the in-memory spark runner
                SPARK: run using spark-submit on an existing cluster
                --spark-master must be specified
                --spark-submit-command may be specified to control the Spark submit command
                arguments to spark-submit may optionally be specified after --
                GCS: run using Google cloud dataproc
                commands after the -- will be passed to dataproc
                --cluster must be specified after the --
                spark properties and some common spark-submit parameters will be translated
                to dataproc equivalents

                --dry-run may be specified to output the generated command line without running it
                --java-options 'OPTION1[ OPTION2=Y ... ]' optional - pass the given string of options to the
                java JVM at runtime.
                Java options MUST be passed inside a single string with space-separated values.

  2. Hello Dave,

    I gave your instructions another try and it worked like a charm. Thanks a lot for all the wonderful help you are giving to the bioinformatics community, I really appreciate it.

    Now, could you please direct me to a few good use cases for running GATK locally?

    Thanks a lot.
