nf-core rnaseq Pipeline
Introduction
Now that you know about Linux, containers, pixi, and Nextflow, we get to start with the really cool part of our course! In this section, we will create a pixi environment containing nf-core and nextflow. Once we’ve done that, we will turn our attention to nf-core to set up the rnaseq pipeline. Finally, we will run the pipeline.
Setting Up Your Pixi Environment
In our course directory, execute these commands one after the other.
Let’s initialise an environment for this. Again, please substitute your name in the name part of the commands.
pixi init name_nextflow -c conda-forge -c bioconda
Change directory into the project you created, and just list the files there
cd name_nextflow
ls
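You should see a pixi.toml manifest that pixi init generated. A minimal sketch of what it might contain (the exact layout depends on your pixi version):
[project]
name = "name_nextflow"
channels = ["conda-forge", "bioconda"]
platforms = ["linux-64"]

[dependencies]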
Add nf-core and Nextflow
pixi add nextflow nf-core
While apptainer is sticky-loaded on this server, that won’t always be the case on other servers. So, if you are running within a Linux environment (and not otherwise), you can add apptainer with the add command.
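For example, on a Linux machine:
pixi add apptainer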
And just to check that everything worked, summon the help message from nf-core:
pixi run nf-core --help
Preparing The Run
From the nf-core homepage, we can search for pipelines. We are going to demo the rnaseq pipeline. You can see that there is a lot going on in this pipeline! We will chat more about these things in class rather than including screenshots of everything.
Under the Usage tab, you will find a lot of information that describes the pipeline, including input information and the samplesheet.csv (this one we will make together in class).
Under the Parameters tab, you will find information on all of the things that we will set up in the next step.
Under the Output tab, you will find information on the expected output generated by the pipeline. This is useful to help you interpret what the pipeline produces.
To set up our own analysis, we will click on the launch version 3.20.0 button. (This was the version on the website at the time of writing; the version number may change.) We are then redirected to a page where we can fill in all of our information about input files, as well as select or deselect certain parts of the pipeline. We list here the things that you need to input each time, and will go through some finer details in discussion with you.
Setting working and results directories
We recommend that you use absolute paths rather than relative paths for setting up your runs.
During the first part, you need to set a working and a results directory. If you are using a server that has a profile established, you can put the name of the server there. If not, we will create our own configuration profile in case we run into memory issues.
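For example, on a project file system these could be (both paths are hypothetical; pick locations with enough space):
working directory: /proj/nobackup/your_project/nextflow_work
results directory: /proj/nobackup/your_project/results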
We will compile the input CSV together in class. This is entirely unique to each analysis.
To list all fastq files with their absolute paths, one per line, use:
find . -maxdepth 1 -type f -name "*.fastq.gz" -exec realpath {} \;
You can substitute the . (which tells find to look only in the directory you’re currently in) with any other path on your file system.
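These absolute paths then go into the samplesheet. A minimal sketch of what a paired-end samplesheet.csv could look like (sample names and paths are hypothetical; the columns follow the nf-core/rnaseq usage documentation):
sample,fastq_1,fastq_2,strandedness
control_rep1,/proj/data/control_rep1_R1.fastq.gz,/proj/data/control_rep1_R2.fastq.gz,auto
treated_rep1,/proj/data/treated_rep1_R1.fastq.gz,/proj/data/treated_rep1_R2.fastq.gz,auto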
Configuration profiles
Since we are working on a server whose configuration profile is established but not available via nf-core, we have downloaded it and put it in the course folder. If you want to fetch it for yourself:
wget https://raw.githubusercontent.com/hpc2n/intro-course/master/exercises/NEXTFLOW/INTERACTIVE/hpc2n.config
Here is the configuration profile on HPC2N from the above link. The most important things we need to pay attention to are the max_memory, max_cpus, and max_time settings. If you want to create your own profile, you can adjust these to suit your system requirements.
// Config profile for HPC2N
params {
    config_profile_description = 'Cluster profile for HPC2N'
    config_profile_contact     = 'Pedro Ojeda @pojeda'
    config_profile_url         = 'https://www.hpc2n.umu.se/'
    project        = null
    clusterOptions = null
    max_memory     = 128.GB
    max_cpus       = 28
    max_time       = 168.h
    email          = 'pedroojeda2011@gmail.com'
}

singularity {
    enabled = true
}

process {
    executor       = 'slurm'
    clusterOptions = { "-A $params.project ${params.clusterOptions ?: ''}" }
}
slurm is a job scheduler installed on many servers to distribute compute resources fairly among users.
If you copy this config file to use on your own server, you need to remove the process section, unless you have the slurm job manager installed on your cluster. If you have slurm, you will definitely have a system administrator who can help you write this block to suit your system!
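If you do strip it down, a minimal sketch of a profile for a standalone Linux machine could look like this (the resource values are placeholders; set them to match your hardware):
// Hypothetical profile for a single machine without slurm
params {
    config_profile_description = 'Custom profile for my workstation'
    max_memory = 64.GB
    max_cpus   = 16
    max_time   = 72.h
}

singularity {
    enabled = true
}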
Setting all other inputs that are required
In this section, you set variables that are related to your reference genome. If you are using something listed on iGenomes, you can input that name. If you are working with your own reference genome, or something not listed, you need to input the absolute path of the reference genomes you have downloaded.
Depending on your strategy, you might need to input a corresponding GFF file as well. It really depends on the kind of analysis you are hoping to perform.
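For orientation, in the parameter file you will obtain in the next step, these inputs might look like one of the following (the paths are hypothetical; genome, fasta, and gtf are documented nf-core/rnaseq parameters):
"genome": "GRCh38"
or, for your own downloaded reference:
"fasta": "/proj/refs/my_genome.fa",
"gtf": "/proj/refs/my_genome.gtf"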
Obtaining your JSON file
Once everything is filled in, click on Launch and you will be redirected to another page containing your JSON file, which holds the information on your run. You can either run the analyses by copying the command at the top of the page (BUT DON’T PRESS ENTER JUST YET) or by copying the JSON file a bit lower on the screen and saving it as nf-params.json in your folder on HPC2N.
You then have to add one more row to the file, specifying the compute project, so the computation time can be debited from the correct account.
"project": "account_name"
Starting The Run
Submit directly via pixi
Now you can run the pipeline with the following command:
pixi run nextflow run nf-core/rnaseq -r 3.19.0 -resume -params-file nf-params.json -c /proj/nobackup/slubi_sida/nextflow_input/hpc2n.config
If the config file is within your project directory you can start the run with:
pixi run nextflow run nf-core/rnaseq -r 3.19.0 -resume -params-file nf-params.json -c hpc2n.config
There are several layers to this command:
First we invoke pixi and tell it to run the following commands.
Then we say which program we want to run, namely Nextflow.
The remaining arguments are Nextflow/nf-core commands:
- we want to run the nf-core/rnaseq pipeline, version 3.19.0
- we want to resume from cached results if there was an earlier, interrupted run
- we want to use the parameter file called nf-params.json
- we want to use the HPC configuration file called hpc2n.config
Submit via sbatch
Alternatively, you can run Nextflow via pixi using a batch script and slurm: copy the following text to a file called name_submit_rnaseq.sh, where name is your name.
#!/bin/bash -l
#SBATCH -A our_proj_allocation   # the compute project to charge
#SBATCH -n 5                     # number of tasks
#SBATCH -t 24:00:00              # maximum wall time (hh:mm:ss)
/your_home_directory/.pixi/bin/pixi run nextflow run nf-core/rnaseq -r 3.19.0 -params-file /your_path/nf-params.json
And then submit it to slurm with
sbatch name_submit_rnaseq.sh
You can check the progress of your job with
squeue -u your_username
And now we wait until the run is done!
Nextflow is notoriously bad at cleaning up after itself. You can check previous runs with pixi run nextflow log, and then clean up with, for example, pixi run nextflow clean -f -before <run_name>. In that command, -f forcibly deletes the work files, and -before <run_name> targets every run launched before the named run.
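If you want to preview what would be removed before deleting anything, nextflow clean also has a dry-run flag, -n. A cautious sequence might be:
pixi run nextflow log
pixi run nextflow clean -n -before <run_name>
pixi run nextflow clean -f -before <run_name>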