This is not a stable version!
You are currently viewing the documentation for a development version. This documentation is not guaranteed to be up to date, and things are likely to change without announcement or a version increment. If no other documentation is available, there are probably no releases for this repository yet; the content is therefore likely still in development and not production ready. Use at your own risk!
This workflow performs the preprocessing steps required for variant calling, based on the GATK Best Practices. It can be used for both DNA and RNA-seq data. It recalibrates a BAM file and optionally splits spliced reads.
This workflow is part of BioWDL developed by the SASC team at Leiden University Medical Center.
Usage
This workflow can be run using Cromwell:
java -jar cromwell-<version>.jar run -i inputs.json gatk-preprocess.wdl
Inputs
Inputs are provided through a JSON file. The minimally required inputs are described below. A template containing all possible inputs can be generated using Womtool, as described in the Womtool documentation. For an overview of all available inputs, see this page.
{
"GatkPreprocess.referenceFasta": "The path to the reference fasta file",
"GatkPreprocess.referenceFastaFai": "The path to the index for the reference fasta",
"GatkPreprocess.referenceFastaDict": "The path to the sequence dictionary (.dict) file for the reference fasta",
"GatkPreprocess.bam": "A path to an input BAM file",
"GatkPreprocess.bamIndex": "A path to the index of the BAM file",
"GatkPreprocess.bamName": "The base name for the outputs: <bamName>.bqsr and, if a BAM file is produced, <bamName>.bam",
"GatkPreprocess.dbsnpVCF": "A path to a dbSNP VCF file",
"GatkPreprocess.dbsnpVCFIndex": "The path to the index (.tbi) file associated with the dbSNP VCF"
}
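Before starting a run, it can help to confirm that an inputs file contains every required key. The following is a minimal sketch, not part of the workflow itself: the function names `missing_inputs` and `check_inputs_file` are hypothetical, and the key list is simply copied from the required inputs above. Keys are matched exactly, including case.

```python
import json

# Required keys, copied from the minimal inputs listed above.
REQUIRED_INPUTS = [
    "GatkPreprocess.referenceFasta",
    "GatkPreprocess.referenceFastaFai",
    "GatkPreprocess.referenceFastaDict",
    "GatkPreprocess.bam",
    "GatkPreprocess.bamIndex",
    "GatkPreprocess.bamName",
    "GatkPreprocess.dbsnpVCF",
    "GatkPreprocess.dbsnpVCFIndex",
]


def missing_inputs(inputs):
    """Return the required keys that are absent from an inputs mapping."""
    return [key for key in REQUIRED_INPUTS if key not in inputs]


def check_inputs_file(path):
    """Load an inputs JSON file and return its missing required keys."""
    with open(path) as handle:
        return missing_inputs(json.load(handle))
```

Calling `check_inputs_file("inputs.json")` returns an empty list when nothing is missing. Because the comparison is case-sensitive, a misspelled key is reported as missing.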
Some additional inputs that may be of interest are:
{
"GatkPreprocess.scatters": "A list of bed files describing the regions to be processed.",
"GatkPreprocess.splitSplicedReads": "Whether or not SplitNCigarReads should be executed (recommended for RNA-seq data), defaults to false"
}
Each BED file supplied with scatters will be used in a separate job for most of the steps in this workflow. This allows for parallelization if the backend supports it. It is recommended to use this input and supply one BED file per chromosome (small chromosomes can be combined in one BED file).
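One way to prepare such BED files is to derive them from the reference index (.fai), whose first two tab-separated columns are the contig name and its length. The sketch below is an illustration under assumptions: the helper names and the `min_size` threshold for what counts as a "small" chromosome are hypothetical and not part of the workflow.

```python
def read_fai(lines):
    """Parse .fai lines into (contig, length) pairs."""
    contigs = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        contigs.append((fields[0], int(fields[1])))
    return contigs


def group_contigs(contigs, min_size):
    """Give each large contig its own group; pool all contigs shorter than min_size."""
    groups = [[contig] for contig in contigs if contig[1] >= min_size]
    small = [contig for contig in contigs if contig[1] < min_size]
    if small:
        groups.append(small)
    return groups


def bed_text(group):
    """Render a group of contigs as whole-contig BED lines (0-based start, exclusive end)."""
    return "".join(f"{name}\t0\t{length}\n" for name, length in group)


def write_bed_files(fai_path, prefix, min_size):
    """Write one BED file per group and return the paths, e.g. for use as scatters."""
    with open(fai_path) as handle:
        groups = group_contigs(read_fai(handle), min_size)
    paths = []
    for index, group in enumerate(groups):
        path = f"{prefix}{index}.bed"
        with open(path, "w") as out:
            out.write(bed_text(group))
        paths.append(path)
    return paths
```

The returned list of paths can then be used as the value of GatkPreprocess.scatters in the inputs JSON.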
An output directory can be set using an options.json file. See the Cromwell documentation for more information.
Example options.json file:
{
"final_workflow_outputs_dir": "my-analysis-output",
"use_relative_output_paths": true,
"default_runtime_attributes": {
"docker_user": "$EUID"
}
}
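As a rough sketch of how the options file fits into a run, the snippet below writes a reduced options file and assembles a Cromwell command line that passes it with the `-o` (options) flag next to `-i` (inputs). The function names and the jar path are placeholders chosen for illustration, and the `docker_user` runtime attribute from the example above is left out for brevity.

```python
import json


def write_options(path, output_dir):
    """Write a Cromwell options file that sets the final output directory."""
    options = {
        "final_workflow_outputs_dir": output_dir,
        "use_relative_output_paths": True,
    }
    with open(path, "w") as handle:
        json.dump(options, handle, indent=2)
    return options


def cromwell_command(cromwell_jar, inputs_json, options_json,
                     wdl="gatk-preprocess.wdl"):
    """Build the argument list for a `cromwell run` with inputs and options."""
    return ["java", "-jar", cromwell_jar, "run",
            "-i", inputs_json, "-o", options_json, wdl]
```

The resulting list can be passed to `subprocess.run`, with `cromwell_jar` pointing at the Cromwell jar available on your system.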
Alternatively, an output directory can be set with GatkPreprocess.outputDir. GatkPreprocess.outputDir must be mounted in the Docker container. Cromwell will need a custom configuration to allow this.
Example
{
"GatkPreprocess.referenceFasta": "/home/user/genomes/human/GRCh38.fasta",
"GatkPreprocess.referenceFastaFai": "/home/user/genomes/human/GRCh38.fasta.fai",
"GatkPreprocess.referenceFastaDict": "/home/user/genomes/human/GRCh38.dict",
"GatkPreprocess.bamName": "s1_preprocessed",
"GatkPreprocess.dbsnpVCF": "/home/user/genomes/human/dbsnp/dbsnp-151.vcf.gz",
"GatkPreprocess.dbsnpVCFIndex": "/home/user/genomes/human/dbsnp/dbsnp-151.vcf.gz.tbi",
"GatkPreprocess.bam": "/home/user/mapping/results/s1.bam",
"GatkPreprocess.bamIndex": "/home/user/mapping/results/s1.bai",
"GatkPreprocess.splitSplicedReads": true
}
Dependency requirements and tool versions
BioWDL pipelines use Docker images to ensure reproducibility. This means that BioWDL pipelines will run on any system that has Docker installed. Alternatively, they can be run with Singularity.
For more advanced configuration of Docker or Singularity, please check the Cromwell documentation on containers.
Images from BioContainers are preferred for BioWDL pipelines. The list of default images for this pipeline can be found in the default for the dockerImages input.
Output
This workflow will produce a BQSR report named according to the bamName input (bamName + '.bqsr'). If either the splitSplicedReads or the outputRecalibratedBam input is set to true, a new BAM file (bamName + '.bam') will be produced as well.
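The naming rule above can be written down as a small helper, which may be useful when scripting downstream steps. This is only a restatement of the rule as described here; the function name `expected_outputs` and its parameter names are hypothetical.

```python
def expected_outputs(bam_name, split_spliced_reads=False,
                     output_recalibrated_bam=False):
    """List the output file names implied by bamName and the two flags."""
    outputs = [bam_name + ".bqsr"]  # the BQSR report is always produced
    if split_spliced_reads or output_recalibrated_bam:
        outputs.append(bam_name + ".bam")  # a BAM is produced if either flag is true
    return outputs
```

For the example inputs shown earlier (bamName `s1_preprocessed`, splitSplicedReads set to true), this yields `s1_preprocessed.bqsr` and `s1_preprocessed.bam`.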
Scattering
This pipeline performs scattering to speed up analysis on grid computing clusters. This is done by splitting the reference genome into regions of roughly equal size (see the scatterSize input). Each of these regions will be analyzed in a separate job, allowing them to be processed in parallel.
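To illustrate the general idea of splitting a genome into regions of roughly equal size, here is a simplified sketch. It is not the workflow's actual scattering implementation, which may group or balance regions differently; the function name is hypothetical, and treating `scatter_size` as a maximum region length in bases is an assumption.

```python
def scatter_regions(contigs, scatter_size):
    """Cut each (name, length) contig into consecutive regions of at most scatter_size bases.

    Regions are returned as (name, start, end) with 0-based start and exclusive end.
    """
    if scatter_size <= 0:
        raise ValueError("scatter_size must be positive")
    regions = []
    for name, length in contigs:
        start = 0
        while start < length:
            end = min(start + scatter_size, length)
            regions.append((name, start, end))
            start = end
    return regions
```

Each resulting region could then be handled by its own job, which is what makes parallel processing possible on a cluster backend.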
Contact
For any questions about running this workflow, or for feature requests, please use the GitHub issue tracker or contact the SASC team directly at sasc@lumc.nl.