Now that you know how to run bioinformatics software in Docker containers, it’s time to connect them together into a pipeline. If you missed the last post, the link is here: Getting started with Docker for bioinformatics.

Content Overview

  1. What is a pipeline?
  2. Nextflow vs Snakemake
  3. Using Nextflow and Docker containers to create your pipeline
  4. Summary

What is a pipeline anyway?

The term ‘pipeline’ is thrown around a lot in bioinformatics. In simple terms, it refers to a series of programs that have to be run in a certain order to complete an analysis. Some of these programs take the outputs of earlier programs and process them further to achieve a specific objective.

This can be as simple as a couple of programs or it can become a messy spider web. A basic example would be: raw fastq files from the sequencer –> alignment against the human genome –> variant calling.

nf-core maintains a nice set of pipelines here, including much more complex ones.

How to get started

There are two key frameworks that I will point you to: Nextflow and Snakemake. These are the frameworks that will run your programs for you and orchestrate the input and output files.

Nextflow vs Snakemake

Disclaimer: this is my personal opinion on this topic. Try both and make up your mind as to which suits your thought process better.

1. The programming language difference doesn’t matter

Snakemake is based on Python, whereas Nextflow uses Groovy (roughly speaking, a Python-like language for the JVM). When I first came across these comparisons I immediately jumped to Snakemake, as I’m pretty comfortable with Python.

Having tried both now, I would say that since you need to learn the pipeline syntax anyway, the actual difference between Python and Groovy is pretty minimal.

2. The biggest difference is the ’thinking direction’ of the pipeline

Let me explain what I mean. In Snakemake you have to ‘think backwards’ and work your way from the desired output towards the input. I find Nextflow far more intuitive in that you think about your pipeline in the same sequence in which you run your programs.

Here’s how you have to think in Snakemake: I need output C, which comes from program B, and to get program B to run I need the output of program A.

Compare that with Nextflow: I take my input files, run them through program A, then feed the result into program B, which produces output C.
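
To make that concrete, here’s a minimal DSL1-style sketch of that forward flow - program_a and program_b are made-up commands, but the pipeline reads top to bottom in the same order the programs run:

// Hypothetical sketch: input files -> program A -> program B -> output C
input_ch = Channel.fromPath("$baseDir/data/*.txt")

process programA {
    input:
    file infile from input_ch

    output:
    file 'output_a.txt' into output_a_ch

    script:
    """
    program_a $infile > output_a.txt
    """
}

process programB {
    input:
    file output_a from output_a_ch

    output:
    file 'output_c.txt'

    script:
    """
    program_b $output_a > output_c.txt
    """
}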

3. Why I switched to Nextflow: The separation of how the pipeline is run from what the pipeline is

The key reason why I switched from Snakemake to Nextflow is ‘the decoupling of pipeline execution’, or in plain terms, the separation of where and how your pipeline is run from what the pipeline actually does. The series of steps is described in the main file, while the execution is controlled by the configuration. We will cover pipeline execution in a separate article.
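
As a rough sketch of what that separation looks like (the profile names and executors here are just an illustration), the same main.nf can run on your laptop or on an HPC cluster purely by switching profiles in nextflow.config:

// nextflow.config - main.nf stays untouched, only the chosen profile changes
profiles {
    standard {
        process.executor = 'local'
        docker.enabled = true
    }
    cluster {
        process.executor = 'slurm'
        docker.enabled = true
    }
}

You then pick one at launch time, e.g. nextflow run main.nf -profile cluster.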

Nextflow Pipeline

It is now time to go into more detail on setting up your pipeline in Nextflow. Here’s an example pipeline which takes in Illumina paired reads, runs them through fastqc and bowtie2, and performs some simple statistics on the aligned file.

Prerequisites

  1. Install Nextflow
  2. Install Docker
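
If you don’t have these yet, the commands below are the usual install routes on Linux (check the official documentation for your platform before running them):

# Install Nextflow (requires Java) - this drops a 'nextflow' executable into the current directory
curl -s https://get.nextflow.io | bash

# Install Docker via the official convenience script
curl -fsSL https://get.docker.com -o get-docker.sh
sh get-docker.sh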

1. Set up and define your input and outputs

First we set up a couple of parameters that can be easily changed depending on whether you are running test files vs actual data. These parameters can be modified in your config profiles.

// main.nf file
#!/usr/bin/env nextflow

// PIPELINE PARAMETERS HERE

// Input Files
params.reads = "$baseDir/data/*/*_{R1,R2}_*.fastq.gz"

// Report Directory
params.outdir = 'reports'

// Reference Genomes
params.bowtie2_reference_index = "$baseDir/reference_db/bowtie2/bt2_index.tar.gz"
bowtie2_db_ch = Channel.value(file("${params.bowtie2_reference_index}"))
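
Because these are declared as params, they can be overridden without editing main.nf - either on the command line (e.g. nextflow run main.nf --outdir my_reports) or in a config profile. A hypothetical test profile might look like this:

// nextflow.config - hypothetical 'test' profile overriding the defaults above
profiles {
    test {
        params.reads  = "$baseDir/test_data/*_{R1,R2}_*.fastq.gz"
        params.outdir = 'test_reports'
    }
}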

2. Create ‘channels’ and feed your input into them

In Nextflow, there is this concept of ‘channels’. The basic premise is simple - a channel is a queue of items (typically files), and its contents can only be consumed once (unless it’s a ‘value channel’). Think of channels as pipes that you feed files into, where each file flows through to be used exactly once by a downstream process.

In this example, we generate a channel using the fromFilePairs method and split it into 2 channels - fastqc_reads and reads_for_alignment. *Side note: this uses the original Nextflow syntax (DSL1) rather than the newer DSL2.

reads = Channel.fromFilePairs(params.reads)
reads.into { fastqc_reads; reads_for_alignment }
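
If you want to double-check what is flowing through a channel, the view operator prints each item as it passes through unchanged - a quick debugging sketch (the sample and file names below are just an illustration):

// view() prints each item and passes it along, so it can sit in the middle of the chain
Channel.fromFilePairs(params.reads)
    .view()
    .into { fastqc_reads; reads_for_alignment }

// Each emitted item looks roughly like:
// [sampleA, [sampleA_R1_001.fastq.gz, sampleA_R2_001.fastq.gz]]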

3. Now let’s have a look at our first process and break it down

Here is the process block - this process is named fastqc_run but you can change this to whatever name suits.

process fastqc_run {
    publishDir "$params.outdir/fastqc/$sample_id/", mode: 'copy'
    container 'biocontainers/fastqc:v0.11.9_cv8'
    cpus 16

    input:
    tuple val(sample_id), file(reads_file) from fastqc_reads

    output:
    file "*_fastqc.{zip,html}"

    script:
    """
    fastqc $reads_file -o . --threads ${task.cpus}
    """
}

In general, there are 4 components to each process:

  1. A configuration block
  2. Process inputs
  3. Process outputs
  4. Script that runs the desired program

Let’s break down the above process codeblock into the 4 components.

Configuration

  publishDir "$params.outdir/fastqc/$sample_id/", mode: 'copy'
  container 'biocontainers/fastqc:v0.11.9_cv8'
  cpus 16

publishDir This allows you to copy the output files of this process to a desired location for easy access. Note that by default it only creates a symbolic link back to the actual file in the work directory - you have to explicitly set mode: 'copy' for it to give you the actual file.

container This is where the real magic happens. If you specify a publicly available Docker container, Nextflow will seamlessly pull the image, run the script inside the container and generate the output that you’re looking for. *Side note: in your config you will need to set docker enabled to true, as shown below.
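
That side note boils down to one small block in your nextflow.config:

  // nextflow.config
  docker {
      enabled = true
  }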

cpus This specifies the number of CPUs you wish to allocate to this process. You can also specify memory options, which helps when autoscaling your cloud computation needs. In general these options are useful if you are using VMs of varying sizes, and they work quite well with Google Cloud.
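
As a sketch (the numbers are arbitrary), resource requests can live in the process itself or be applied to every process centrally from nextflow.config:

  // in the process block
  cpus 16
  memory '32 GB'

  // or centrally in nextflow.config, applied to every process
  process {
      cpus = 16
      memory = '32 GB'
  }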

Inputs

  input:
  tuple val(sample_id), file(reads_file) from fastqc_reads

The Channel.fromFilePairs method generates a tuple [wildcard value, [the pair of files for processing]]. Read here for full documentation.

So when we call it into the input, we assign the variable name sample_id to the wildcard value and call the pair of files reads_file. As a side note, if you need to access the first file you can use reads_file[0], and the second file is reads_file[1].

Outputs

  output:
  file "*_fastqc.{zip,html}"

Here we are telling Nextflow to expect a zip file and an HTML file from the output of fastqc. Since we declared publishDir above, Nextflow will then copy these files to the report directory.

Script

Finally, let’s tell nextflow how to run fastqc!

  script:
  """
  fastqc $reads_file -o . --threads ${task.cpus}
  """

A couple of comments. -o . tells fastqc to place the output in the current working directory - this will be located in Nextflow’s work directory. --threads ${task.cpus} is how we parallelise fastqc to take advantage of the CPUs available to this process.

4. Here is how the pipeline continues

The code below is hopefully pretty straightforward. The output from bowtie2 is split into two channels, aligned_ch and stats_ch. stats_ch is sent to samtools to run flagstat, while aligned_ch continues on for processing in the next steps of the pipeline.

process bowtie2 {
    container 'biocontainers/bowtie2:v2.4.1_cv1'
    cpus 16

    input:
    tuple val(sample_id), file(reads_file) from reads_for_alignment
    file db from bowtie2_db_ch

    output:
    tuple val(sample_id), file('*.sam') into aligned_ch, stats_ch

    script:
    """
    tar -xvf $db
    bowtie2 -t -p ${task.cpus} -x bowtie2/GRCh38_bowtie2 -1 ${reads_file[0]} -2 ${reads_file[1]} -S ${sample_id}.sam
    """
}

process samtools_flagstat {
    publishDir "$params.outdir/samtools_flagstat/"
    container 'biocontainers/samtools:v1.9-4-deb_cv1'
    cpus 16
    tag "$sample_id"

    input:
    tuple val(sample_id), file(sam_file) from stats_ch

    output:
    path "${sample_id}_flagstat.txt"

    script:
    """
    samtools flagstat -@ ${task.cpus} $sam_file > ${sample_id}_flagstat.txt
    """
}

Summary

The above is a simple introduction to using Nextflow to pull in containers and orchestrate a pipeline. It is well worth investing some time to automate your pipeline - write your code once and use it forever!
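
With main.nf and a config in place, launching everything is a single command (the cluster profile here is the hypothetical one sketched earlier):

nextflow run main.nf
# or, picking one of the config profiles
nextflow run main.nf -profile cluster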

More importantly, this is one step closer to creating your cloud genomics supercomputer. We will cover this next.