Analyzing RNAseq expression with Salmon

Based on the Eel-Pond RNAseq workflow protocol here.

Getting Started

If needed, start up a fresh Ubuntu 16.04 instance on Jetstream using the instructions here.

On a Jetstream instance, run the following commands to update the base software:

sudo apt-get update && \
sudo apt-get -y install screen git curl gcc make g++ python-dev unzip \
        default-jre pkg-config libncurses5-dev r-base-core r-cran-gplots \
        python-matplotlib python-pip python-virtualenv sysstat fastqc \
        trimmomatic bowtie samtools blast2 wget bowtie2 openjdk-8-jre \
        hmmer ruby


We will use Salmon to quantify expression. Salmon is a new breed of software for quantifying RNAseq reads that is both really fast and takes transcript length into consideration (Patro et al. 2015).

For further reading, see

  • Intro blog post:
  • A 2016 blog post evaluating and comparing methods here
  • Salmon github repo here

Download and make Salmon available in the path

tar xvfz Salmon-0.8.2_linux_x86_64.tar.gz
export PATH=$PATH:$HOME/Salmon-0.8.2_linux_x86_64/bin 

Then, let’s check we still have our reads from yesterday’s QC work

set -u
printf "\nMy trimmed data is in $PROJECT/quality/, and consists of $(ls -1 ${PROJECT}/quality/*.qc.fq.gz | wc -l) files\n\n"
set +u

where set -u should let you know if you have any unset variables, i.e. if the $PROJECT variable is not defined.

If you see -bash: PROJECT: unbound variable, then you need to set the $PROJECT variable.

export PROJECT=/mnt/work

and then re-run the printf code block.

NOTE: if you do not have files, please rerun quality trimming steps here

Run Salmon

Build an index for your new transcriptome:

   salmon index --index nema --transcripts trinity.nema.annot.fasta --type quasi

Next, link in the QC reads (produced in quality:

   ln -s ../quality/*R1*.qc.fq.gz .
   ln -s ../quality/*R2*.qc.fq.gz .

Then, run the salmon command:

  for R1 in *R1*.qc.fq.gz
    sample=$(basename $R1 extract.qc.fq.gz)
    echo sample is $sample, R1 is $R1
    echo R2 is $R2
    salmon quant -i nema -p 2 -l IU -1 <(gunzip -c $R1) -2 <(gunzip -c $R2) -o ${sample}quant

This will create a bunch of directories named something like 0Hour_ATCACG_L002001.quant, containing a bunch of files. Take a look at what files there are:

    find 0Hour_ATCACG_L002_R1_001* -type f

You should see::


The two most interesting files are salmon_quant.log and quant.sf. The latter contains the counts; the former contains the log information from running things.

Working with the counts

The quant.sf files actually contain the relevant information about expression – take a look

    head -20 0Hour_ATCACG_L002_R1_001.quant/quant.sf

The first column contains the transcript names, and the fifth column is what edgeR etc will want - the “raw counts”. However, they’re not in a convenient location / format for edgeR to use; let’s fix that.

Other useful tutorials and references