Execute gatk-workflows locally

The Broad Institute has shared its GATK workflows on GitHub; however, they are configured to run on Google Cloud. I could not find much information on executing the workflows locally apart from this tutorial. I ran into problems while following it but eventually got everything working. This post describes the steps I used to run seq-format-validation locally with Docker.

First, download and install Docker; if you are new to Docker, I have prepared a tutorial. Once Docker is installed, pull the latest GATK image. I wanted to pull a specific version but noticed that broadinstitute/gatk:latest is specified in the WDL file. Let's pull the image and take a quick look inside.

docker pull broadinstitute/gatk:latest

docker run --rm -it broadinstitute/gatk:latest /bin/bash

# which OS is used
cat /etc/os-release 
VERSION="16.04.6 LTS (Xenial Xerus)"
PRETTY_NAME="Ubuntu 16.04.6 LTS"

# where is GATK installed
which gatk

# Conda is installed too
which conda

# R is also installed
R --version
R version 3.2.5 (2016-04-14) -- "Very, Very Secure Dishes"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under the terms of the
GNU General Public License versions 2 or 3.
For more information about these matters see

I'll use the same directory structure as the tutorial.

mkdir gatk-workflows
cd gatk-workflows
mkdir inputs

Download this example BAM file (wgs_bam_NA12878_24RG_hg38_NA12878_24RG_small.hg38.bam) from Google Cloud and save it in the inputs folder we just created. You will have to log into your Google account (or create one if you don't already have one).

Download Cromwell and store it in the gatk-workflows folder.

wget https://github.com/broadinstitute/cromwell/releases/download/47/cromwell-47.jar

The next step requires Git, so make sure you have that installed. Clone the seq-format-validation workflow into the gatk-workflows folder.

git clone https://github.com/gatk-workflows/seq-format-validation.git

Lastly, modify seq-format-validation/validate-bam.inputs.json so that it looks like the block below (but change the location of wgs_bam_NA12878_24RG_hg38_NA12878_24RG_small.hg38.bam to where you downloaded it on your computer):

{
  "ValidateBamsWf.bam_array": [
    "/Users/dtang/github/gatk-workflows/inputs/wgs_bam_NA12878_24RG_hg38_NA12878_24RG_small.hg38.bam"
  ],
  "ValidateBamsWf.ValidateBAM.validation_mode": "SUMMARY",

  "##Comment3": "Runtime - uncomment the lines below and supply a valid docker container to override the default",
  "ValidateBamsWf.ValidateBAM.machine_mem_gb": "1",
  "ValidateBamsWf.ValidateBAM.disk_space_gb": "100",
  "#ValidateBamsWf.ValidateBAM.gatk_path_override": "String (optional)",
  "#ValidateBamsWf.gatk_docker_override": "String (optional)"
}

Your directory structure should look something like this:

|-- cromwell-47.jar
|-- inputs
|   `-- wgs_bam_NA12878_24RG_hg38_NA12878_24RG_small.hg38.bam
|-- seq-format-validation
|   |-- LICENSE
|   |-- README.md
|   |-- generic.google-papi.options.json
|   |-- validate-bam.inputs.json
|   `-- validate-bam.wdl

Finally, make sure you have Java installed (use version 8).

java -version
java version "1.8.0_144"
Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)

Run the following command in the directory gatk-workflows.

java -jar cromwell-47.jar run ./seq-format-validation/validate-bam.wdl --inputs ./seq-format-validation/validate-bam.inputs.json

You should see two new folders: cromwell-executions and cromwell-workflow-logs. Inside cromwell-executions are the outputs generated by the workflow.

cat cromwell-executions/ValidateBamsWf/810afc1e-5b06-4a71-a6e5-65b64a72462d/call-ValidateBAM/shard-0/execution/wgs_bam_NA12878_24RG_hg38_NA12878_24RG_small.hg38.validation_SUMMARY.txt 
No errors found

The directory structure follows the format cromwell-executions/<workflow name>/<workflow id>/call-<task name>/shard-<shard index>/execution.


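As a quick illustration, the path to a shard's output can be assembled from those pieces; the Python sketch below uses the workflow id from my run, and execution_dir is an illustrative helper of my own, not a Cromwell function:

```python
def execution_dir(workflow, workflow_id, task, shard):
    # Cromwell layout: cromwell-executions/<workflow>/<id>/call-<task>/shard-<n>/execution
    return (f"cromwell-executions/{workflow}/{workflow_id}/"
            f"call-{task}/shard-{shard}/execution")

print(execution_dir("ValidateBamsWf", "810afc1e-5b06-4a71-a6e5-65b64a72462d", "ValidateBAM", 0))
# cromwell-executions/ValidateBamsWf/810afc1e-5b06-4a71-a6e5-65b64a72462d/call-ValidateBAM/shard-0/execution
```

The shard-<n> level only appears for scattered calls, which is why it shows up for ValidateBAM here.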
This part looks at some specifics of the WDL script. I have written a basic introduction to WDL that may be helpful.

workflow ValidateBamsWf {
  Array[File] bam_array
  String? gatk_docker_override
  String gatk_docker = select_first([gatk_docker_override, "broadinstitute/gatk:latest"])
  String? gatk_path_override
  String gatk_path = select_first([gatk_path_override, "/gatk/gatk"])

  # Process the input files in parallel
  scatter (input_bam in bam_array) {

    # Get the basename, i.e. strip the filepath and the extension
    String bam_basename = basename(input_bam, ".bam")

    # Run the validation
    call ValidateBAM {
      input:
        input_bam = input_bam,
        output_basename = bam_basename + ".validation",
        docker = gatk_docker,
        gatk_path = gatk_path
    }
  }

  # Outputs that will be retained when execution is complete
  output {
    Array[File] validation_reports = ValidateBAM.validation_report
  }
}
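The basename(input_bam, ".bam") call strips both the directory part and the given suffix. A minimal Python sketch of that behaviour (wdl_basename is my own name for the stand-in, not part of WDL or any library):

```python
import os

def wdl_basename(path, suffix=""):
    # Mimic WDL's basename(): drop the directory part, then the suffix if present
    name = os.path.basename(path)
    if suffix and name.endswith(suffix):
        name = name[:-len(suffix)]
    return name

bam = "inputs/wgs_bam_NA12878_24RG_hg38_NA12878_24RG_small.hg38.bam"
print(wdl_basename(bam, ".bam") + ".validation")
# wgs_bam_NA12878_24RG_hg38_NA12878_24RG_small.hg38.validation
```

This matches the report name we saw earlier in the execution folder.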

You can process more than one BAM file at a time by listing them in the JSON input. This list of files is stored in an array called bam_array.

"ValidateBamsWf.bam_array": [
   # input bam goes here

You can also specify a different Docker image that may have GATK installed somewhere else besides /gatk/gatk by setting the override variables in the JSON input. The select_first function will select the first defined value. I have the following two inputs commented out because I was using the Docker image provided by the GATK team.

 "#ValidateBamsWf.ValidateBAM.gatk_path_override": "String (optional)",
  "#ValidateBamsWf.gatk_docker_override": "String (optional)"

Next in the WDL is the scatter call to the ValidateBAM task, which will validate the BAMs in parallel. This tutorial explains how to use scatter-gather to joint call genotypes. The command that is run:

command {
  ${gatk_path} \
    ValidateSamFile \
    --INPUT ${input_bam} \
    --OUTPUT ${output_name} \
    --MODE ${default="SUMMARY" validation_mode}
}
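The scatter/gather flow can be sketched in plain Python; validate below is a hypothetical stand-in that only derives each shard's report name, not a real GATK call:

```python
def validate(input_bam):
    # Stand-in for the ValidateBAM task: derive the report name for one shard
    name = input_bam.rsplit("/", 1)[-1]
    if name.endswith(".bam"):
        name = name[:-len(".bam")]
    return name + ".validation_SUMMARY.txt"

bam_array = ["inputs/sample1.bam", "inputs/sample2.bam"]
# scatter: run the task once per array element; output: gather the per-shard results
validation_reports = [validate(bam) for bam in bam_array]
print(validation_reports)
# ['sample1.validation_SUMMARY.txt', 'sample2.validation_SUMMARY.txt']
```

Cromwell runs the real shards as independent jobs, so they can execute in parallel rather than in a sequential loop.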

If you are wondering what actual Docker command was run, you can check script.submit in the execution folder.

# make sure there is no preexisting Docker CID file
rm -f /Users/dtang/github/gatk-workflows/cromwell-executions/ValidateBamsWf/15000b39-6a1e-4091-8069-3a48305eeb73/call-ValidateBAM/shard-0/execution/docker_cid
# run as in the original configuration without --rm flag (will remove later)
docker run \
  --cidfile /Users/dtang/github/gatk-workflows/cromwell-executions/ValidateBamsWf/15000b39-6a1e-4091-8069-3a48305eeb73/call-ValidateBAM/shard-0/execution/docker_cid \
  -i \
  --entrypoint /bin/bash \
  -v /Users/dtang/github/gatk-workflows/cromwell-executions/ValidateBamsWf/15000b39-6a1e-4091-8069-3a48305eeb73/call-ValidateBAM/shard-0:/cromwell-executions/ValidateBamsWf/15000b39-6a1e-4091-8069-3a48305eeb73/call-ValidateBAM/shard-0:delegated \
  broadinstitute/gatk@sha256:2c0e2ba20c9beb58842ba2149efc29059bc52a5178ce05debf0f38238c0bde86 /cromwell-executions/ValidateBamsWf/15000b39-6a1e-4091-8069-3a48305eeb73/call-ValidateBAM/shard-0/execution/script

# get the return code (working even if the container was detached)
rc=$(docker wait `cat /Users/dtang/github/gatk-workflows/cromwell-executions/ValidateBamsWf/15000b39-6a1e-4091-8069-3a48305eeb73/call-ValidateBAM/shard-0/execution/docker_cid`)

# remove the container after waiting
docker rm `cat /Users/dtang/github/gatk-workflows/cromwell-executions/ValidateBamsWf/15000b39-6a1e-4091-8069-3a48305eeb73/call-ValidateBAM/shard-0/execution/docker_cid`

# return exit code
exit $rc

The call-ValidateBAM folder is mounted in the Docker container and a shell script called script is run.

You can alter the docker run command by supplying a configuration file to Cromwell. I used the following conf file to mount an extra volume (/data/) in the container.

# include the application.conf at the top
include required(classpath("application"))

backend {
  default = "Docker"
  providers {
    Docker {
      actor-factory = "cromwell.backend.impl.sfs.config.ConfigBackendLifecycleActorFactory"
      config {

        concurrent-job-limit = 1
        run-in-background = true

        runtime-attributes = """
        String? docker
        String? docker_user
        """

        submit = "/bin/bash ${script}"

        submit-docker = """
        docker run \
          --rm -i \
          ${"--user " + docker_user} \
          --entrypoint /bin/bash \
          -v ${cwd}:${docker_cwd} \
          -v /data/:/data/ \
          ${docker} < ${script}
        """
      }
    }
  }
}
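To make Cromwell use this file (saved here as local.conf, a filename I chose), pass it to the JVM via the config.file property when launching Cromwell:

```shell
# tell Cromwell to load the custom backend configuration
java -Dconfig.file=local.conf -jar cromwell-47.jar run \
  ./seq-format-validation/validate-bam.wdl \
  --inputs ./seq-format-validation/validate-bam.inputs.json
```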


This work is licensed under a Creative Commons Attribution 4.0 International License.
16 comments
  1. Hello Dave,

    I tried to follow your instructions but am not able to execute the script, could you please help me?

    I did install Docker. Do I have to run the Docker image broadinstitute/gatk:latest before executing the following command: java -jar cromwell-47.jar run ./seq-format-validation/validate-bam.wdl --inputs ./seq-format-validation/validate-bam.inputs.json

    Thanks a lot.

    1. Hi Akhil,

      what error did you receive? Do you have the same directory setup as the post? And did you edit the JSON file so that it points to where you downloaded the example BAM file?


      1. Thanks Dave for the reply.

        I did all that you suggested: I set up the same directory structure as the post, and I edited the JSON file as well.

        What I am not able to understand is which command in your post points to the Docker image? Do I have to run the Docker image before executing the script? What code in the script ensures that the Docker image is used?


        1. The WDL script specifies that the latest Docker GATK image should be used and Cromwell takes care of the execution; I didn’t have to modify anything.

          Can you ensure that you have successfully pulled the latest GATK image and can run it as a container? Other than that I am not sure what could be the problem.

              1. I get the following output:

                Usage template for all tools (uses --spark-runner LOCAL when used with a Spark tool)
                gatk AnyTool toolArgs

                Usage template for Spark tools (will NOT work on non-Spark tools)
                gatk SparkTool toolArgs [ -- --spark-runner sparkArgs ]

                Getting help
                gatk --list Print the list of available tools

                gatk Tool --help Print help on a particular tool

                Configuration File Specification
                --gatk-config-file PATH/TO/GATK/PROPERTIES/FILE

                gatk forwards commands to GATK and adds some sugar for submitting spark jobs

                --spark-runner controls how spark tools are run
                valid targets are:
                LOCAL: run using the in-memory spark runner
                SPARK: run using spark-submit on an existing cluster
                --spark-master must be specified
                --spark-submit-command may be specified to control the Spark submit command
                arguments to spark-submit may optionally be specified after --
                GCS: run using Google cloud dataproc
                commands after the -- will be passed to dataproc
                --cluster must be specified after the --
                spark properties and some common spark-submit parameters will be translated
                to dataproc equivalents

                --dry-run may be specified to output the generated command line without running it
                --java-options 'OPTION1[ OPTION2=Y ... ]' optional - pass the given string of options to the
                java JVM at runtime.
                Java options MUST be passed inside a single string with space-separated values.

        2. This is the error I am getting:

          Workflow 86ab41f6-323d-42df-8fc9-ffc40b391409 transitioned to state Failed

  2. Hello Dave,

    I gave your instructions another try and it worked like a charm. Thanks a lot for all the wonderful help you are giving to the bioinformatics community, I really appreciate it.

    Now, could you please direct me to a few good use cases of running GATK locally.

    Thanks a lot.

  3. Hi Dave

    I am a newbie in the NGS downstream analysis field. You have very nicely explained each and every step for GATK WDL use. But I have a question: what if I want to run GATK WDL scripts on a remote server without Docker installed? Additionally, I want to use a Python script to submit my jobs on that server, so how can I use these GATK WDL scripts from Python?
    Your help will be highly appreciated!

    1. Hi AR,

      you don’t have to use Docker but you will have to set up all the programs you need on the server and modify the WDL scripts accordingly. Go through these sets of tutorials https://support.terra.bio/hc/en-us/articles/360037117492-Getting-Started-with-WDL so that you can learn how to modify the WDL scripts.

      You can use Cromwell to submit your jobs on your server; have a look at https://cromwell.readthedocs.io/en/stable/tutorials/HPCIntro/.

      There is a lot to learn and it can become frustrating. You can try to build a simple pipeline first and build up your knowledge. For example, take a look at https://github.com/ecerami/wdl_sandbox.

      Good luck!


      1. Thanks a ton Dave!

        Thanks for your quick reply and I am pretty sure these resources will be of great help to me.
        You are doing a great job.


  4. Hi Dave,

    I am trying to replicate GATK WDL commands in bash for NovaSeq whole-genome sequencing data of humans. I ran the scatter-gather Python script provided by GATK and was able to produce 18 different files, as the GATK team suggested. But when I run the BaseRecalibrator step with the genomic intervals option (files generated by scatter-gather), BaseRecalibrator does not accept these files and gives the error: invalid file, not in proper format. So I went the other way around and used the WGS.intervals_list provided by the GATK reference bundle for GRCh38. I am able to run BaseRecalibrator with the WGS.intervals_list and it generated a single recalibration report. So, my confusion is:

    1. Will it affect the overall output of my GATK pipeline if I am not using the scatter-gather step and instead directly use the WGS.intervals_list provided by the GATK reference bundle for GRCh38 for BaseRecalibrator? Other than speed (parallelism).

    2. Why am I not able to use the scatter-gather output files for the BaseRecalibrator step?

    3. What if I do not provide a genomic interval list via the -L option in the BaseRecalibrator step? Will the recal_table be different from the one generated when provided with the WGS.intervals_list from the GATK reference bundle for GRCh38?

    4. In the reference bundle provided by GATK (GRCh38), there are two files for the WGS intervals list. One is inside the beta folder (WGS.intervals_list) and the other in the main hg38 folder (WGS_Calling_regions.hg38.intervals_list). I used the second one as the only difference was that it was coordinate sorted and genomic intervals for all chromosomes were also provided. Is that fine?

    For your reference, I am attaching here the scatter-gather script. Pardon me if my questions are too naive. I just want to understand the basics. I have searched everywhere for this.

    Any help pls!!

    Python Script:

    import sys

    with open("/Users/AARTI/Desktop/hg38.dict", "r") as ref_dict_file:
        sequence_tuple_list = []
        longest_sequence = 0
        for line in ref_dict_file:
            if line.startswith("@SQ"):
                line_split = line.split("\t")
                # (Sequence_Name, Sequence_Length)
                sequence_tuple_list.append((line_split[1].split("SN:")[1], int(line_split[2].split("LN:")[1])))
        longest_sequence = sorted(sequence_tuple_list, key=lambda x: x[1], reverse=True)[0][1]

    # We are adding this to the intervals because hg38 has contigs named with embedded colons and a bug in GATK strips off
    # the last element after a :, so we add this as a sacrificial element.
    hg38_protection_tag = ":1+"
    max_line = 0
    # initialize the tsv string with the first sequence
    tsv_string = sequence_tuple_list[0][0]
    temp_size = sequence_tuple_list[0][1]
    for sequence_tuple in sequence_tuple_list[1:]:
        if temp_size + sequence_tuple[1] <= longest_sequence and max_line <= 1600:
            temp_size += sequence_tuple[1]
            if not ":" in sequence_tuple[0]:
                tsv_string += "\t" + sequence_tuple[0]
            else:
                tsv_string += "\t" + sequence_tuple[0] + hg38_protection_tag
            max_line = max_line + 1
        else:
            if not ":" in sequence_tuple[0]:
                tsv_string += "\n" + sequence_tuple[0]
            else:
                tsv_string += "\n" + sequence_tuple[0] + hg38_protection_tag
            temp_size = sequence_tuple[1]
            max_line = 0
    # add the unmapped sequences as a separate line to ensure that they are recalibrated as well
    tsv_string += '\n' + "unmapped"
    sequence_groups_unmapped = tsv_string.split("\n")
    for i in range(0, len(sequence_groups_unmapped)):
        with open("sequence_grouping_with_unmapped_{0}.txt".format(i), "w") as tsv_file_with_unmapped:
            tsv_file_with_unmapped.write(sequence_groups_unmapped[i] + "\n")