This workflow can be used to perform expression quantification for multiple BAM files. Expression levels will be determined for each BAM file/sample and will be merged together into a single table including all samples.
Expression quantification will be performed using StringTie and HTSeq-Count.
This workflow is part of BioWDL developed by the SASC team at Leiden University Medical Center.
Usage
This workflow can be run using Cromwell:
First, download the latest version of the workflow WDL file(s) from the GitHub page.
The workflow can then be started with the following command:
java \
-jar cromwell-<version>.jar \
run \
-o options.json \
-i inputs.json \
multi-bam-quantify.wdl
Where options.json contains the following JSON:
{
"final_workflow_outputs_dir": "/path/to/outputs",
"use_relative_output_paths": true,
"final_workflow_log_dir": "/path/to/logs/folder"
}
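As a minimal sketch, the options file and the Cromwell invocation shown above could also be produced from a small Python script. The helper names `write_options` and `cromwell_command` are hypothetical and not part of BioWDL or Cromwell; the keys and command-line arguments are copied from the example above.

```python
import json
from pathlib import Path


def write_options(path, outputs_dir, log_dir):
    """Write a Cromwell options file with the three keys shown above."""
    options = {
        "final_workflow_outputs_dir": str(outputs_dir),
        "use_relative_output_paths": True,
        "final_workflow_log_dir": str(log_dir),
    }
    Path(path).write_text(json.dumps(options, indent=2) + "\n")
    return options


def cromwell_command(cromwell_jar, options_json, inputs_json,
                     workflow="multi-bam-quantify.wdl"):
    """Return the argument list for the 'run' invocation shown above."""
    return [
        "java", "-jar", str(cromwell_jar),
        "run",
        "-o", str(options_json),
        "-i", str(inputs_json),
        str(workflow),
    ]
```

The returned list can be passed to `subprocess.run(..., check=True)`; the path to the Cromwell jar has to be supplied by the caller, since the version is not fixed here.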
Inputs
Inputs are provided through a JSON file. The minimally required inputs are described below, but additional inputs are available. A template containing all possible inputs can be generated using Womtool, as described in the Womtool documentation. For an overview of all available inputs, see this page.
{
"MultiBamExpressionQuantification.bams": "A list of BAM files and sample identifiers (see 'BAM file input' below).",
"MultiBamExpressionQuantification.strandedness": "The strandedness of the samples: FR (forward-reverse), RF (reverse-forward) or None.",
"MultiBamExpressionQuantification.outputDir": "The path to the output directory.",
"MultiBamExpressionQuantification.referenceGtfFile": "The path to the annotations GTF file. If not specified, StringTie will be run unguided and the GTF file it produces will be used for HTSeq-Count."
}
BAM file input
BAM files need to be given as a list with one item per sample. Each item should be an object containing a "left" element (the sample ID) and a "right" element (the BAM file and its index), following the structure shown here:
{
"left": "Sample identifier",
"right": {
"file": "The path to the sample's BAM file",
"index": "The path to the index for the sample's BAM file"
}
}
Example
{
"MultiBamExpressionQuantification.bams": [
{
"left": "s1",
"right": {
"file": "/home/user/mapping/results/s1.bam",
"index": "/home/user/mapping/results/s1.bai"
}
},
{
"left": "s2",
"right": {
"file": "/home/user/mapping/results/s2.bam",
"index": "/home/user/mapping/results/s2.bai"
}
}
],
"MultiBamExpressionQuantification.strandedness": "FR",
"MultiBamExpressionQuantification.outputDir": "/home/user/expression/results",
"MultiBamExpressionQuantification.referenceGtfFile": "/home/user/genomes/human/features/ensembl87.gtf"
}
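For larger sample sets, the inputs file can be generated rather than written by hand. The following is a minimal sketch that reproduces the structure of the example above. The function names `bam_entry` and `build_inputs` are hypothetical, and the default index path (the BAM path with `.bam` replaced by `.bai`) is an assumption taken from the example paths, not a rule of the workflow.

```python
import json
from pathlib import PurePosixPath

PREFIX = "MultiBamExpressionQuantification"


def bam_entry(sample_id, bam_path, index_path=None):
    """Build one item of the 'bams' list for a single sample."""
    bam_path = PurePosixPath(bam_path)
    if index_path is None:
        # Assumption: the index sits next to the BAM file as <name>.bai,
        # as in the example above. Pass index_path explicitly otherwise.
        index_path = bam_path.with_suffix(".bai")
    return {
        "left": sample_id,
        "right": {"file": str(bam_path), "index": str(index_path)},
    }


def build_inputs(samples, strandedness, output_dir, reference_gtf=None):
    """Build the inputs dict from (sample_id, bam_path) pairs."""
    if strandedness not in ("FR", "RF", "None"):
        raise ValueError("strandedness must be 'FR', 'RF' or 'None'")
    inputs = {
        f"{PREFIX}.bams": [bam_entry(sid, bam) for sid, bam in samples],
        f"{PREFIX}.strandedness": strandedness,
        f"{PREFIX}.outputDir": str(output_dir),
    }
    if reference_gtf is not None:
        # Left out when not given, in which case StringTie runs unguided.
        inputs[f"{PREFIX}.referenceGtfFile"] = str(reference_gtf)
    return inputs


if __name__ == "__main__":
    example = build_inputs(
        [("s1", "/home/user/mapping/results/s1.bam"),
         ("s2", "/home/user/mapping/results/s2.bam")],
        "FR",
        "/home/user/expression/results",
        "/home/user/genomes/human/features/ensembl87.gtf",
    )
    print(json.dumps(example, indent=2))
```

Redirecting the printed JSON to inputs.json gives a file that can be passed to Cromwell with `-i`.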
Dependency requirements and tool versions
BioWDL workflows use Docker images to ensure reproducibility. This means that BioWDL workflows will run on any system that has Docker installed. Alternatively, they can be run with Singularity.
For more advanced configuration of Docker or Singularity, please check the Cromwell documentation on containers.
Images from biocontainers are preferred for BioWDL workflows. The list of default images for this workflow can be found in the default value of the dockerImages input.
Output
The multi-bam-quantify workflow produces two directories:
- stringtie: Contains the StringTie output. Includes two additional files, all_samples.FPKM and all_samples.TPM, which contain the FPKM and TPM values for all samples.
- fragments_per_gene: Contains the HTSeq-Count output. Also contains a file called all_samples.fragments_per_gene, which contains the counts for all samples.
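To locate the merged tables from a downstream script, the paths can be derived from the output directory. This is a minimal sketch that assumes the two directories listed above are created directly inside the directory given as outputDir; the function name `merged_tables` is hypothetical.

```python
from pathlib import PurePosixPath


def merged_tables(output_dir):
    """Return the expected locations of the merged, all-sample tables."""
    out = PurePosixPath(output_dir)
    return {
        # Assumption: both directories sit directly under outputDir.
        "FPKM": out / "stringtie" / "all_samples.FPKM",
        "TPM": out / "stringtie" / "all_samples.TPM",
        "fragments_per_gene":
            out / "fragments_per_gene" / "all_samples.fragments_per_gene",
    }
```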
Contact
For questions about running this workflow or for feature requests (such as adding additional tools and options), please use the GitHub issue tracker or contact the SASC team directly at: sasc@lumc.nl.