This pipeline can be used to process long-read RNA sequencing data from either the Pacific Biosciences or Oxford Nanopore platform, starting from FastQ files. It will perform mapping to a reference genome (using minimap2), INDEL/mismatch and noncanonical splice junction correction (using TranscriptClean), and identify and count known and novel genes/transcripts (using TALON).
This pipeline is part of BioWDL developed by the SASC team at Leiden University Medical Center.
Usage
You can run the pipeline using Cromwell:
java -jar cromwell-<version>.jar run -i inputs.json talon-wdl.wdl
Inputs
Inputs are provided through a JSON file. Additional inputs are available beyond the minimally required ones described below; a template containing all possible inputs can be generated using WOMtool as described in the WOMtool documentation. For an overview of all available inputs, see this page.
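For example, a template can be generated like this (a sketch assuming a locally downloaded WOMtool jar; the jar name depends on the version you downloaded):

```sh
# Generate a template JSON listing every available input of the pipeline.
java -jar womtool-<version>.jar inputs talon-wdl.wdl > inputs.json
```

The minimally required inputs are: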
{
"Pipeline.sampleConfigFile": "A sample configuration file (see below).",
"Pipeline.outputDirectory": "The path to the output directory.",
"Pipeline.annotationGTF": "GTF annotation containing genes, transcripts, and edges.",
"Pipeline.genomeBuild": "Name of genome build that the GTF file is based on (ie hg38).",
"Pipeline.annotationVersion": "Name of supplied annotation (will be used to label data).",
"Pipeline.referenceGenome": "Reference genome fasta file.",
"Pipeline.sequencingPlatform": "The sequencing platform used to generate long reads.",
"Pipeline.organismName": "The name of the organism from which the samples originated.",
"Pipeline.pipelineRunName": "A short name to distinguish a run.",
"Pipeline.dockerImagesFile": "A file listing the used docker images.",
"Pipeline.runTranscriptClean": "Set to true in order to run TranscriptClean, set to false in order to disable TranscriptClean.",
"Pipeline.executeSampleWorkflow.presetOption": "This option applies multiple options at the same time to minimap2, this should be either 'splice'(directRNA) or 'splice:hq'(cDNA).",
"Pipeline.executeSampleWorkflow.variantVCF": "A VCF file with common variants should be supplied when running TranscriptClean, this will make sure TranscriptClean does not correct those known variants.",
}
Optional settings:
{
"Pipeline.novelIDprefix": "A prefix for novel transcript discoveries.",
"Pipeline.executeSampleWorkflow.howToFindGTAG": "How to find canonical splicing sites GT-AG - f: transcript strand; b: both strands; n: no attempt to match GT-AG.",
"Pipeline.spliceJunctionsFile": "A pre-generated splice junction annotation file.",
"Pipeline.talonDatabase": "A pre-generated TALON database file."
}
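For reference, these minimap2-related inputs correspond roughly to the following minimap2 invocation (an illustrative sketch; the actual command line is assembled by the pipeline and includes further options):

```sh
# -a: produce SAM output; -x: preset chosen via 'presetOption';
# -u: GT-AG matching mode chosen via 'howToFindGTAG'.
minimap2 -a -x splice -u f reference.fasta reads.fastq > mapped.sam
```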
Sample configuration
Verification
All samplesheet formats can be verified using biowdl-input-converter. It can be installed with `pip install biowdl-input-converter` or `conda install biowdl-input-converter` (from the bioconda channel). Python 3.7 or higher is required. With `biowdl-input-converter --validate samplesheet.csv` the file "samplesheet.csv" will be checked. The presence of all files in the samplesheet will also be checked to ensure no typos were made. For more information, check out the biowdl-input-converter readthedocs page.
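Put together, a verification session could look like this (a sketch; the samplesheet path is just an example):

```sh
# Install the converter (pick one).
pip install biowdl-input-converter
# conda install -c bioconda biowdl-input-converter

# Check the samplesheet format and that every file it references exists.
biowdl-input-converter --validate samplesheet.csv
```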
CSV format
The sample configuration can be given as a csv file with the following columns: sample, library, readgroup, R1, R1_md5, R2, R2_md5.
column name | function |
---|---|
sample | sample ID |
library | library ID. These are the libraries that are sequenced. Usually there is only one library per sample. |
readgroup | readgroup ID. Usually a library is sequenced on multiple lanes in the sequencer, which gives multiple fastq files (referred to as readgroups). Each readgroup pair should have an ID. |
R1 | The fastq file containing the first reads of the read pairs. |
R1_md5 | Optional: md5sum for the R1 file. |
R2 | Optional: the fastq file containing the second reads of the read pairs. |
R2_md5 | Optional: md5sum for the R2 file. |
The easiest way to create a samplesheet is to use a spreadsheet program such as LibreOffice Calc or Microsoft Excel, and create a table:
sample | library | readgroup | R1 | R1_md5 | R2 | R2_md5 |
---|---|---|---|---|---|---|
NOTE: R1_md5, R2 and R2_md5 are optional and do not have to be filled. Additional fields may be added (e.g. for documentation purposes); these will be ignored by the pipeline.
After creating the table in a spreadsheet program it can be saved in csv format.
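For instance, a minimal samplesheet for this pipeline could look like this (sample names and paths are made up for illustration):

```csv
sample,library,readgroup,R1
GM12878,lib1,rg1,tests/data/GM12878.fastq
K562,lib1,rg1,tests/data/K562.fastq
```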
Example
The following is an example of what an inputs JSON might look like:
{
"Pipeline.sampleConfigFile": "tests/samplesheets/GM12878.K562.csv",
"Pipeline.outputDirectory": "tests/test-output",
"Pipeline.annotationGTF": "tests/data/gencode.v29.annotation.gtf",
"Pipeline.genomeBuild": "hg38",
"Pipeline.annotationVersion": "gencode_v29",
"Pipeline.referenceGenome": "tests/data/grch38.fasta",
"Pipeline.sequencingPlatform": "Nanopore",
"Pipeline.organismName": "Human",
"Pipeline.pipelineRunName": "testRun",
"Pipeline.dockerImagesFile": "dockerImages.yml",
"Pipeline.runTranscriptClean": "true",
"Pipeline.executeSampleWorkflow.presetOption": "splice",
"Pipeline.executeSampleWorkflow.variantVCF": "tests/data/common.variants.vcf",
"Pipeline.executeSampleWorkflow.howToFindGTAG": "f"
}
Dependency requirements and tool versions
Biowdl pipelines use docker images to ensure reproducibility. This means that biowdl pipelines will run on any system that has docker installed. Alternatively they can be run with singularity.
For more advanced configuration of docker or singularity, please check the Cromwell documentation on containers.
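For example, the Cromwell container documentation describes a local backend along these lines for running docker images through singularity (a sketch to adapt, not a drop-in configuration):

```
backend {
  default = singularity
  providers {
    singularity {
      actor-factory = "cromwell.backend.impl.sfs.config.ConfigBackendLifecycleActorFactory"
      config {
        run-in-background = true
        # Expose the 'docker' runtime attribute so tasks can request an image.
        runtime-attributes = """
        String? docker
        """
        # Run the job script inside the requested image via singularity.
        submit-docker = """
        singularity exec --containall --bind ${cwd}:${docker_cwd} docker://${docker} ${job_shell} ${docker_script}
        """
      }
    }
  }
}
```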
Images from biocontainers are preferred for biowdl pipelines. The list of default images for this pipeline can be found in the default for the dockerImages input.
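The dockerImages file itself is a small YAML map from tool name to docker image. A hypothetical sketch (the keys and image tags below are illustrative, not the pipeline's actual defaults):

```yaml
# Hypothetical example; see the dockerImages input for the real defaults.
biowdl-input-converter: "quay.io/biocontainers/biowdl-input-converter:0.2.1--py_0"
minimap2: "quay.io/biocontainers/minimap2:2.17--h84994c4_0"
transcriptclean: "quay.io/biocontainers/transcriptclean:2.0.2--py_0"
talon: "quay.io/biocontainers/talon:4.4.2--py37he860b03_1"
```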
Output
The workflow will output: the reads mapped by minimap2 in a .sam file; a cleaned .sam file and log information from TranscriptClean; a TALON database containing transcript information, together with a log file of the read/transcript comparisons; and an abundance file plus a summary file of the database.
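As an illustration only, a run could produce a layout along these lines (all file names are hypothetical; the actual paths depend on the inputs):

```
tests/test-output/
    GM12878/
        GM12878.sam                  # reads mapped by minimap2
        GM12878_clean.sam            # TranscriptClean corrected alignments
        GM12878_clean.log            # TranscriptClean log
    testRun_talon.db                 # TALON database
    testRun_talon_read_annot.tsv     # read/transcript comparison log
    testRun_talon_abundance.tsv      # abundance file
    testRun_summary.tsv              # summary file
```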
Contact
For any questions about running this pipeline and feature requests (such as adding additional tools and options), please use the GitHub issue tracker or contact the SASC team directly at: sasc@lumc.nl.