Nextflow

This module introduces Nextflow, arguably one of the most widely used workflow managers for bioinformatics applications. Nextflow is designed to be scalable, to support reproducible data analysis, and to run on local hardware, HPC clusters, or the cloud. A major benefit of Nextflow is the community-developed nf-core library of workflows, which are easy to download and run immediately.

After going through this module, students should be able to:

Describe core Nextflow concepts and terminology, and compare those to other workflow managers
Install and configure Nextflow and nf-core on an HPC cluster
Write, modify, and run simple Nextflow workflows
Retrieve workflows from nf-core and run them on an HPC cluster

About Nextflow

The developers of Nextflow took inspiration from Unix. Unix is a collection of many simple command-line tools that, together, can do very powerful things. Similarly, Nextflow was designed around the concept of simple “processes” that can be linked together into powerful “workflows”. Each process is a standalone task that is language agnostic and declares its own inputs and outputs.

Another major concept in Nextflow is the idea of the “channel”. The channel is like a conveyor belt that shuttles data from one step to the next in the workflow. Channels can work through serial data flows, or support more complex parallel data flows.

The Nextflow language, technically a “workflow language”, is based on Groovy, which in turn runs on the Java platform. The language is fairly easy to read and may look somewhat familiar to Python developers. Although there is a steep learning curve associated with developing workflows in a new language, the time invested will be well worth the effort in the long run.

Community Nextflow Pipelines: nf-core

nf-core is a community effort to develop, curate, and maintain a set of best-practice analysis pipelines built with Nextflow, with a strong emphasis on reproducibility, portability, and standardized practices. If you have a general idea of the type of analysis you want to run (e.g., RNA-seq, variant calling, ChIP-seq), it is often best to first search nf-core for an existing, well-tested pipeline rather than starting from scratch.

nf-core also provides a dedicated command-line interface (nf-core), which includes tools for discovering pipelines, checking compatibility, and managing configurations, as we will see below.

Core Terminology

Process: A computational step with inputs, outputs, and a script
Channel: A data stream that connects processes and passes data between them
Executor: The backend that runs the jobs (e.g., local, SLURM, cloud)
Profile: A named configuration bundle for running Nextflow

Installation

After logging in to Lonestar6, create a new directory to organize this work. Update the version of Java, and download and run the Nextflow installer script:

# organize the work
[ls6]$ cd $WORK
[ls6]$ mkdir nextflow-example && cd nextflow-example

# update java
[ls6]$ curl -s https://get.sdkman.io | bash
[ls6]$ source $HOME/.sdkman/bin/sdkman-init.sh
[ls6]$ sdk install java 17.0.10-tem

# install nextflow
[ls6]$ curl -s https://get.nextflow.io | bash
[ls6]$ chmod +x nextflow

# move nextflow into an executable path
[ls6]$ mkdir -p $HOME/.local/bin/
[ls6]$ mv nextflow $HOME/.local/bin/
[ls6]$ export PATH=$HOME/.local/bin/:$PATH

# verify the installation
[ls6]$ nextflow info

Note

It would be a good idea to add the following two lines to your ~/.bashrc in order to keep Nextflow and the right version of Java in your PATH:

source $HOME/.sdkman/bin/sdkman-init.sh
export PATH=$HOME/.local/bin/:$PATH
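
If you script this step, it helps to avoid appending the same line every time the script is run. The following is a minimal sketch, assuming a Bourne-compatible shell; add_line_once is a hypothetical helper name introduced here for illustration and is not part of Nextflow or SDKMAN.

```shell
# Hypothetical helper: append LINE to FILE only if FILE does not already
# contain that exact line, so repeated runs do not create duplicates.
add_line_once() {
    line="$1"
    file="$2"
    touch "$file"                                   # create the file if missing
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Example calls (not executed here). Single quotes keep $HOME and $PATH
# unexpanded so that they are evaluated at login time:
#   add_line_once 'source $HOME/.sdkman/bin/sdkman-init.sh' "$HOME/.bashrc"
#   add_line_once 'export PATH=$HOME/.local/bin/:$PATH' "$HOME/.bashrc"
```

Open a new shell, or source ~/.bashrc, for the change to take effect.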

The installation process for nf-core is a little bit simpler. It is a Python package that can be installed with pip, so create a virtual environment and install with the following commands:

# create a virtual environment
[ls6]$ python -m venv .venv
[ls6]$ source .venv/bin/activate

# install and verify nf-core
(venv)[ls6]$ pip3 install nf-core
(venv)[ls6]$ nf-core --version
(venv)[ls6]$ nf-core pipelines list

Nextflow Usage

First Process

Create a new Nextflow file (called hello_world.nf). The first step is to write a process and a workflow that executes that process. Consider the following code:

process sayHello {
    output:
    path "hello_world.txt"

    script:
    """
    echo "Hello World!" > hello_world.txt
    """
}

workflow {
    main:
    sayHello()
}

There is one process that takes no inputs and writes one output: a text file called hello_world.txt. The process is orchestrated within the workflow block, which in this case contains only one item to execute.

After saving the file, run Nextflow with the following command:

[ls6]$ nextflow run hello_world.nf
 N E X T F L O W   ~  version 25.10.4

Launching `hello_world.nf` [suspicious_volhard] DSL2 - revision: 61b9a538a0

executor >  local (1)
[14/f3d424] sayHello [100%] 1 of 1 ✔

The console indicates success, so list the files in the current directory to confirm the expected output is there.

Tip

Having trouble finding the output file? Try tree -a . and examine all of the output files and folders you find.
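
Nextflow runs each task in its own subdirectory of the work folder, named after the task hash shown in the console output (14/f3d424 in the run above). As an alternative to tree, the sketch below searches the work folder by file name. find_task_output is a hypothetical helper name introduced here for illustration.

```shell
# Hypothetical helper: list every copy of a named file under a Nextflow
# work directory. The second argument is optional and defaults to ./work.
find_task_output() {
    name="$1"
    root="${2:-work}"
    find "$root" -type f -name "$name"
}

# Example call (not executed here):
#   find_task_output hello_world.txt
```

Each path that is printed corresponds to one task execution, so running the workflow several times yields several copies.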

Publish Outputs

After running the first Nextflow workflow once, try running it again with the exact same command. In contrast to Snakemake, Nextflow does not check by default whether the output already exists. Instead, it runs the workflow again and stores the output in a new work subdirectory. This is the intended default behavior: the work folder is meant to be populated with temporary or intermediate files. We can mark the output file as something we would like to keep by adding the following instructions:

process sayHello {
    output:
    path "hello_world.txt"

    script:
    """
    echo "Hello World!" > hello_world.txt
    """
}

workflow {
    main:
    sayHello()

    publish:
    first_output = sayHello.out
}

output {
    first_output {
        path "."
        mode "copy"
    }
}

The publish section assigns the output of the sayHello process (sayHello.out) to a named workflow output, first_output, and the output block instructs Nextflow to copy it from the temporary work folder to a user-defined path, in this case ".". Although the output block specifies the path as ".", that path is relative to a new, local results folder that Nextflow will create. Run Nextflow again and verify that the output file is now in the expected location:

[ls6]$ nextflow run hello_world.nf

Next, try running Nextflow with the -resume flag (Note: only one hyphen) and it will use cached values if possible:

[ls6]$ nextflow run hello_world.nf -resume

This is excellent for long workflows that failed partway through: tasks that already completed successfully are taken from the cache, and only the remaining or changed tasks are executed.

Variables

After creating a very simple workflow, the next step is to abstract away from hardcoding values and filenames where possible. Instead, replace them with variables. Consider the following changes:

process sayHello {
    input:
    val name

    output:
    path "hello_${name}.txt"

    script:
    """
    echo "Hello ${name}" > hello_${name}.txt
    """
}

workflow {
    main:
    sayHello(params.input)

    publish:
    first_output = sayHello.out
}

output {
    first_output {
        path "."
        mode "copy"
    }
}

We added a new input called name to the sayHello process. Within the workflow, we pass a value into that input through a parameter called params.input, whose value is set on the command line with the --input flag:

[ls6]$ nextflow run hello_world.nf --input Joe

Tip

To hardcode a default value, add a line like the following at the top of the Nextflow script:

params.input = "Joe"

Channels

Channels are a core component of Nextflow and are a powerful way to scale up input data. The values passed through a channel can be hardcoded in the workflow, can be dynamically generated, or can be read in from a file. Consider the following update to the workflow:

process sayHello {
    input:
    val name

    output:
    path "hello_${name}.txt"

    script:
    """
    echo "Hello ${name}" > hello_${name}.txt
    """
}

workflow {
    main:
    name_ch = channel.of("Alice", "Bob", "Carol")
    sayHello(name_ch)

    publish:
    first_output = sayHello.out
}

output {
    first_output {
        path "."
        mode "copy"
    }
}

Again, run the workflow and verify that the expected output files are generated. Notice that the process is executed three times, once for each value in the channel.

Multi-Step Workflows

To demonstrate the real potential of Nextflow, next implement a new process to form a two-step workflow. The second process will take the output from the first process, modify it, and write a new output. Consider the following code:

process sayHello {
    input:
    val name

    output:
    path "hello_${name}.txt"

    script:
    """
    echo "Hello ${name}" > hello_${name}.txt
    """
}

process sayGoodbye {
    input:
    path input_file

    output:
    path "goodbye_${input_file}"

    script:
    """
    sed s/Hello/Goodbye/ ${input_file} > goodbye_${input_file}
    """
}

workflow {
    main:
    name_ch = channel.of("Alice", "Bob", "Carol")
    sayHello(name_ch)
    sayGoodbye(sayHello.out)

    publish:
    first_output = sayHello.out
    second_output = sayGoodbye.out
}

output {
    first_output {
        path "."
        mode "copy"
    }
    second_output {
        path "."
        mode "copy"
    }
}

In this workflow, there is a new process called sayGoodbye, which replaces the word Hello with Goodbye in the specified file and writes the result to a new file with a “goodbye” prefix. The output from the sayHello process is passed directly into the sayGoodbye process. New blocks are also added to specify that the output from the second process should be published to the results folder as well.

EXERCISE

Modify the sayGoodbye process so that its output file name is built from the name value instead of from the input file name. For example, the output file could be named goodbye_${name}.txt instead of goodbye_${input_file}. You will also need to modify how the process is called in the workflow so that it receives the name as an additional input.

Summary of Important Nextflow Commands

Some of the more commonly used Nextflow commands and flags include:

nextflow run: run a Nextflow workflow
nextflow run -resume: use cached values if possible
nextflow log: print log of previous runs
nextflow clean: clean up files from previous runs
nextflow help: print help message
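
Because every run adds new subdirectories to the work folder, it is worth previewing a cleanup before deleting anything. The sketch below wraps two of the commands above in a function; nf_housekeeping is a hypothetical name introduced here, and the function assumes that nextflow is on your PATH and that it is called from the directory where the workflow was launched.

```shell
# Hypothetical wrapper around two commands from the list above.
nf_housekeeping() {
    nextflow log         # list previous runs with their names and status
    nextflow clean -n    # dry run: print what would be removed, delete nothing
}

# Example call (not executed here):
#   nf_housekeeping
```

If the dry run lists only files you no longer need, nextflow clean -f performs the actual deletion.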

Hands-On Exercise: Variant Calling

Recall the workflow we implemented in Unit 11 for variant calling. It was a four-step workflow that took a reference genome and aligned reads in SAM format as input and produced a VCF file as the final result.

Collect the following two inputs again:

[ls6]$ cp /work/03439/wallen/public/samtools_example/ecoli_reads_aligned.sam ./
[ls6]$ cp /work/03439/wallen/public/samtools_example/ecoli_NC_008253.fna ./

A summary of the commands that were required to run the original workflow (manually) is as follows:

module load biocontainers
module load samtools/ctr-1.20--h50ea8bc_0
module load bcftools/ctr-1.21--h3a4d415_1

samtools view -b -S -o ecoli_reads_aligned.bam ecoli_reads_aligned.sam
samtools sort -o ecoli_reads_aligned_sorted.bam ecoli_reads_aligned.bam
samtools index ecoli_reads_aligned_sorted.bam
bcftools mpileup -f ecoli_NC_008253.fna -o ecoli_variants.vcf ecoli_reads_aligned_sorted.bam
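
Each command above consumes the file written by the previous one, which is the same dependency chain the Nextflow processes will need to express. The sketch below makes that chain explicit by deriving the intermediate file names from the input name. The variable names and the run_variant_calling function are hypothetical and introduced here for illustration; the function assumes the three modules above are already loaded.

```shell
# Inputs, as copied in the previous step.
sam="ecoli_reads_aligned.sam"
ref="ecoli_NC_008253.fna"

# Derive each intermediate file name from the input name.
base="${sam%.sam}"                 # strip the .sam suffix
bam="${base}.bam"
sorted_bam="${base}_sorted.bam"
vcf="ecoli_variants.vcf"

# Hypothetical wrapper: run the four steps in order, stopping at the first
# step that fails. Requires samtools and bcftools to be available.
run_variant_calling() {
    samtools view -b -S -o "$bam" "$sam" &&
    samtools sort -o "$sorted_bam" "$bam" &&
    samtools index "$sorted_bam" &&
    bcftools mpileup -f "$ref" -o "$vcf" "$sorted_bam"
}

# Example call (not executed here):
#   run_variant_calling
```

In the Nextflow version, these file names no longer need to be tracked by hand, because each process passes its output to the next one through a channel.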

Modify the following template Nextflow workflow file to implement the same workflow as above using four processes, one for each step. Publish only the final output, the variant call file.

params.reads_sam = "${projectDir}/ecoli_reads_aligned.sam"
params.reference = "${projectDir}/ecoli_NC_008253.fna"

process convertToBam {
    input:
    path input_sam

    output:
    path "${input_sam}.bam"

    script:
    """
    samtools view -b -S -o ${input_sam}.bam ${input_sam}
    """
}

process sortBam {
    ...
}

process indexBam {
    ...
}

process variantCalling {
    ...
}

workflow {
    main:
    convertToBam(params.reads_sam)
    sortBam(...)
    indexBam(...)
    variantCalling(...)

    publish:
    vcf_out = ...
}

output {
    vcf_out {
        path "."
        mode "copy"
    }
}

Hint

To specify which containers you need for each step, refer to the Nextflow documentation on containers. You will use a configuration file (called nextflow.config) similar to:

process {
    withName:convertToBam {
        container = '/work/projects/singularity/TACC/bio_modules/biocontainers/samtools/samtools-1.20--h50ea8bc_0.sif'
    }
    withName:sortBam {
        container = '...'
    }
    withName:indexBam {
        container = '...'
    }
    withName:variantCalling {
        container = '...'
    }
}
apptainer {
    enabled = true
}

Run the workflow with nextflow run <workflow file> and verify the output.

Hands-On Exercise: nf-core

As mentioned previously, nf-core is a community effort to develop, curate, and maintain a set of best-practice analysis pipelines built with Nextflow. If you have a general idea of the type of analysis you want to run (e.g., RNA-seq, variant calling, ChIP-seq), it is often best to first search nf-core for an existing, well-tested pipeline rather than starting from scratch.

You can search the nf-core pipelines with the following command:

(venv)[ls6]$ nf-core pipelines list

And you can run a demo FastQC pipeline with the following command:

(venv)[ls6]$ nextflow run nf-core/demo -profile test,apptainer --outdir results

The -profile flag selects two profiles: test, which uses some pre-canned test data, and apptainer, which uses Apptainer to automatically pull the necessary containers. The --outdir flag specifies where to put the output. After running, check the results folder to verify that the expected output is there.

Other Considerations and Next Steps

Nextflow has many other features not mentioned here. A few of the most important features to be aware of include:

Profiles: Profiles are a powerful way to manage different configurations for running the same workflow in different environments. For example, you could have one profile for running on your local machine, and another profile for running on an HPC cluster. Profiles can specify different executors, different resource requirements, and different parameters.
Executors: Nextflow supports a wide range of executors, including local execution, SLURM, SGE, LSF, Kubernetes, and more. This allows you to run the same workflow on different computing environments without changing the workflow code.
Error Handling: Nextflow has built-in error handling and retry mechanisms. If a process fails, Nextflow can automatically retry it a specified number of times. You can also specify custom error handling logic in your workflow.

Debugging workflows in an interactive session on a compute node can be very helpful for understanding how the workflow is executing and for troubleshooting issues. When ready to run a workflow in batch mode, simply put the Nextflow command in a SLURM script and submit it to the scheduler. Take care to source the appropriate environment and load any necessary modules in the SLURM script before running:

#!/bin/bash
#SBATCH -J nextflow_job
#SBATCH -o nextflow_job.o%j
#SBATCH -e nextflow_job.e%j
#SBATCH -p development
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -t 00:30:00
#SBATCH -A OTH24028

# assuming you are in the working directory that contains your workflow file
source $HOME/.sdkman/bin/sdkman-init.sh
export PATH=$HOME/.local/bin/:$PATH

nextflow run hello_world.nf