BioWDL: jointgenotyping

A BioWDL workflow for generating a multisample VCF file from gVCF files.

This is not a stable version!
You are currently viewing the documentation for a development version. This documentation is not guaranteed to be up to date, and things will likely change without announcement or a version increment. If no other documentation is available, there are likely no releases for this repository; the content is then likely still in development and not production ready. Use at your own risk!

The workflow can be used to aggregate and genotype GVCF files for multiple samples using GATK’s GenotypeGVCFs.

This workflow is part of BioWDL, developed by the SASC team at Leiden University Medical Center.

Usage

This workflow can be run using Cromwell:

java -jar cromwell-<version>.jar run -i inputs.json jointgenotyping.wdl

Inputs

Inputs are provided through a JSON file. The minimally required inputs are described below; a template containing all possible inputs can be generated using Womtool, as described in the Womtool documentation.

{
  "JointGenotyping.gvcfFiles": "A list of GVCF files and their indexes (see the example)",
  "JointGenotyping.dbsnpVCF": {
    "file": "A dbSNP VCF file",
    "index": "The index (.tbi) for the dbSNP VCF file"
  },
  "JointGenotyping.outputDir": "The path to the output directory",
  "JointGenotyping.reference": {
    "fasta": "A reference fasta file",
    "fai": "The index for the reference fasta",
    "dict": "The dict file for the reference fasta"
  }
}

Some additional inputs that may be of interest are:

{
  "JointGenotyping.mergeGvcfFiles": "Whether or not to output a merged GVCF file, defaults to true",
  "JointGenotyping.scatterSize": "The size of scatter regions (see explanation of scattering below), defaults to 10,000,000",
  "JointGenotyping.vcfBasename": "The basename of the output VCF files, defaults to 'multisample'",
  "JointGenotyping.scatterList.regions": "The path to a BED file containing the regions to be processed"
}

An output directory can be set using an options.json file. See the Cromwell documentation for more information.

Example options.json file:

{
  "final_workflow_outputs_dir": "my-analysis-output",
  "use_relative_output_paths": true,
  "default_runtime_attributes": {
    "docker_user": "$EUID"
  }
}
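
The usage command shown earlier does not pass an options file. As a minimal sketch, the helper below assembles a Cromwell command line that includes one through Cromwell's -o (--options) flag. The function name cromwell_run_command and the file names are illustrative assumptions, not part of this workflow.

```python
import shlex


def cromwell_run_command(cromwell_jar, workflow, inputs_json, options_json=None):
    """Build the argument list for `cromwell run` (hypothetical helper)."""
    cmd = ["java", "-jar", cromwell_jar, "run", "-i", inputs_json]
    if options_json is not None:
        # -o / --options tells Cromwell which workflow options file to use.
        cmd += ["-o", options_json]
    cmd.append(workflow)
    return cmd


if __name__ == "__main__":
    command = cromwell_run_command(
        "cromwell-<version>.jar",  # placeholder, as in the usage section
        "jointgenotyping.wdl",
        "inputs.json",
        "options.json",
    )
    # Print a shell-quoted command line for copy and paste.
    print(shlex.join(command))
```

The returned list can be handed to subprocess.run(command, check=True) to launch the workflow from a script.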

Alternatively, an output directory can be set with JointGenotyping.outputDir. This directory must be mounted in the Docker container, and Cromwell will need a custom configuration to allow this.

Example

{
  "JointGenotyping.gvcfFiles": [
    {
      "file": "/home/user/analysis/results/s1.vcf.gz",
      "index": "/home/user/analysis/results/s1.vcf.gz.tbi"
    }, {
      "file": "/home/user/analysis/results/s2.vcf.gz",
      "index": "/home/user/analysis/results/s2.vcf.gz.tbi"
    }
  ],
  "JointGenotyping.dbsnpVCF": {
    "file": "/home/user/genomes/human/dbsnp/dbsnp-151.vcf.gz",
    "index": "/home/user/genomes/human/dbsnp/dbsnp-151.vcf.gz.tbi"
  },
  "JointGenotyping.outputDir": "/home/user/analysis/results/genotyping",
  "JointGenotyping.reference": {
    "fasta": "/home/user/genomes/human/GRCh38.fasta",
    "fai": "/home/user/genomes/human/GRCh38.fasta.fai",
    "dict": "/home/user/genomes/human/GRCh38.dict"
  }
}
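
When there are many samples, writing the gvcfFiles list by hand becomes tedious. The following sketch generates the minimally required inputs from a list of GVCF paths. It assumes that every index sits next to its file with a .tbi suffix, and that the reference .fai and .dict files follow the naming used in the example above; the function name build_inputs is hypothetical.

```python
import json
from pathlib import PurePosixPath


def build_inputs(gvcf_paths, dbsnp_vcf, reference_fasta, output_dir):
    """Assemble the minimally required JointGenotyping inputs as a dict."""
    fasta = PurePosixPath(reference_fasta)
    return {
        "JointGenotyping.gvcfFiles": [
            # Assumption: each index is located at <file>.tbi.
            {"file": str(path), "index": f"{path}.tbi"} for path in gvcf_paths
        ],
        "JointGenotyping.dbsnpVCF": {
            "file": str(dbsnp_vcf),
            "index": f"{dbsnp_vcf}.tbi",
        },
        "JointGenotyping.outputDir": str(output_dir),
        "JointGenotyping.reference": {
            "fasta": str(fasta),
            # Assumption: <fasta>.fai and <fasta without extension>.dict.
            "fai": f"{fasta}.fai",
            "dict": str(fasta.with_suffix(".dict")),
        },
    }


if __name__ == "__main__":
    inputs = build_inputs(
        ["/home/user/analysis/results/s1.vcf.gz",
         "/home/user/analysis/results/s2.vcf.gz"],
        "/home/user/genomes/human/dbsnp/dbsnp-151.vcf.gz",
        "/home/user/genomes/human/GRCh38.fasta",
        "/home/user/analysis/results/genotyping",
    )
    # Redirect this output to inputs.json to use it with Cromwell.
    print(json.dumps(inputs, indent=2))
```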

Dependency requirements and tool versions

BioWDL pipelines use Docker images to ensure reproducibility. This means that BioWDL pipelines will run on any system that has Docker installed. Alternatively, they can be run with Singularity.

For more advanced configuration of Docker or Singularity, please check the Cromwell documentation on containers.

Images from BioContainers are preferred for BioWDL pipelines. The list of default images for this pipeline can be found in the default value of the dockerImages input.

Output

The workflow produces a multisample VCF file. If mergeGvcfFiles is set to true, it also produces a multisample GVCF file.

Scattering

This pipeline performs scattering to speed up analysis on grid computing clusters. This is done by splitting the reference genome into regions of roughly equal size (see the scatterSize input). Each region is analyzed in a separate job, so the regions can be processed in parallel.
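
To illustrate the idea, the sketch below splits each contig into near-equal chunks that never exceed a given scatter size. It is not the workflow's own implementation, which may differ in detail (for example in how small contigs are grouped); the function name scatter_regions and the contig lengths in the usage example are assumptions for illustration.

```python
import math


def scatter_regions(contig_lengths, scatter_size=10_000_000):
    """Split each contig into near-equal chunks no larger than scatter_size.

    Returns (contig, start, end) tuples with 0-based, half-open
    coordinates, following the BED convention.
    """
    regions = []
    for contig, length in contig_lengths.items():
        # Smallest number of chunks that keeps every chunk within the limit.
        n_chunks = max(1, math.ceil(length / scatter_size))
        for i in range(n_chunks):
            # Integer arithmetic spreads the remainder evenly over chunks.
            start = i * length // n_chunks
            end = (i + 1) * length // n_chunks
            regions.append((contig, start, end))
    return regions


if __name__ == "__main__":
    # Illustrative contig lengths, not taken from a real reference.
    for contig, start, end in scatter_regions({"chr1": 25_000_000, "chrM": 16_569}):
        print(f"{contig}\t{start}\t{end}")
```

Each resulting region could then be handed to its own genotyping job, which is what allows the jobs to run in parallel.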

Contact

For any questions about running this workflow, or for feature requests, please use the GitHub issue tracker or contact the SASC team directly at sasc@lumc.nl.