# Analyzing RNAseq expression with Salmon Based on the Eel-Pond RNAseq workflow protocol [here](http://eel-pond.readthedocs.io/en/latest/). ## Getting Started If needed, start up a fresh Ubuntu 16.04 instance on Jetstream using the instructions [here](jetstream/boot.html). On a Jetstream instance, run the following commands to update the base software: ``` sudo apt-get update && \ sudo apt-get -y install screen git curl gcc make g++ python-dev unzip \ default-jre pkg-config libncurses5-dev r-base-core r-cran-gplots \ python-matplotlib python-pip python-virtualenv sysstat fastqc \ trimmomatic bowtie samtools blast2 wget bowtie2 openjdk-8-jre \ hmmer ruby ``` --- ## Installation We will use [Salmon](http://salmon.readthedocs.org/en/latest/) to quantify expression. Salmon is a new breed of software for quantifying RNAseq reads that is both really fast and takes transcript length into consideration ([Patro et al. 2015](http://dx.doi.org/10.1038/nmeth.4197)). For further reading, see * Intro blog post: http://robpatro.com/blog/?p=248 * A 2016 blog post evaluating and comparing methods [here](https://cgatoxford.wordpress.com/2016/08/17/why-you-should-stop-using-featurecounts-htseq-or-cufflinks2-and-start-using-kallisto-salmon-or-sailfish/) * Salmon github repo [here](https://github.com/COMBINE-lab/salmon) ## Download and make Salmon available in the path ``` cd wget https://github.com/COMBINE-lab/salmon/releases/download/v0.8.2/Salmon-0.8.2_linux_x86_64.tar.gz tar xvfz Salmon-0.8.2_linux_x86_64.tar.gz export PATH=$PATH:$HOME/Salmon-0.8.2_linux_x86_64/bin ``` ## Then, let's check we still have our reads from yesterday's QC work ``` set -u printf "\nMy trimmed data is in $PROJECT/quality/, and consists of $(ls -1 ${PROJECT}/quality/*.qc.fq.gz | wc -l) files\n\n" set +u ``` where set -u should let you know if you have any unset variables, i.e. if the `$PROJECT` variable is not defined. If you see `-bash: PROJECT: unbound variable`, then you need to set the $PROJECT variable. ``` export PROJECT=/mnt/work ``` and then re-run the `printf` code block. NOTE: if you do not have files, please rerun quality trimming steps [here](quality-trimming.html) ## Download or link an assembly We can download a full assembly to use for mapping. This assembly was made with all Nematostella vectensis reads, rather than the subset we used in the [assembly](assembly-trinity.html) tutorial. ``` cd ${PROJECT} mkdir -p quant cd quant ``` ``` curl -O https://s3.amazonaws.com/public.ged.msu.edu/trinity-nematostella-raw.fa.gz gunzip trinity-nematostella-raw.fa.gz ln -s trinity-nematostella-raw.fa trinity.nema.full.fasta ``` Note: if you prefer, you can use the annotated assembly we generated with the read subsets instead ``` #ln -s ${PROJECT}/annotation/trinity.nema.fasta.dammit/trinity.nema.fasta.dammit.fasta ./trinity.nema.annot.fasta ``` ## Run Salmon Build an index for your new transcriptome: ``` salmon index --index nema --transcripts trinity.nema.annot.fasta --type quasi ``` Next, link in the QC reads (produced in [quality](quality-trimming.html): ``` ln -s ../quality/*R1*.qc.fq.gz . ln -s ../quality/*R2*.qc.fq.gz . ``` Then, run the salmon command: ``` for R1 in *R1*.qc.fq.gz do sample=$(basename $R1 extract.qc.fq.gz) echo sample is $sample, R1 is $R1 R2=${R1/R1/R2} echo R2 is $R2 salmon quant -i nema -p 2 -l IU -1 <(gunzip -c $R1) -2 <(gunzip -c $R2) -o ${sample}quant done ``` This will create a bunch of directories named something like `0Hour_ATCACG_L002001.quant`, containing a bunch of files. Take a look at what files there are: ``` find 0Hour_ATCACG_L002_R1_001* -type f ``` You should see:: ``` 0Hour_ATCACG_L002_R1_001.extract.quant/quant.sf 0Hour_ATCACG_L002_R1_001.extract.quant/aux_info/observed_bias.gz 0Hour_ATCACG_L002_R1_001.extract.quant/aux_info/observed_bias_3p.gz 0Hour_ATCACG_L002_R1_001.extract.quant/aux_info/fld.gz 0Hour_ATCACG_L002_R1_001.extract.quant/aux_info/expected_bias.gz 0Hour_ATCACG_L002_R1_001.extract.quant/aux_info/meta_info.json 0Hour_ATCACG_L002_R1_001.extract.quant/cmd_info.json 0Hour_ATCACG_L002_R1_001.extract.quant/logs/salmon_quant.log 0Hour_ATCACG_L002_R1_001.extract.quant/libParams/flenDist.txt 0Hour_ATCACG_L002_R1_001.extract.quant/lib_format_counts.json 0Hour_ATCACG_L002_R1_001.extract.quant.counts ``` The two most interesting files are `salmon_quant.log` and `quant.sf`. The latter contains the counts; the former contains the log information from running things. # Working with the counts The `quant.sf` files actually contain the relevant information about expression -- take a look ``` head -20 0Hour_ATCACG_L002_R1_001.quant/quant.sf ``` The first column contains the transcript names, and the fifth column is what edgeR etc will want - the "raw counts". However, they're not in a convenient location / format for edgeR to use; let's fix that. # Other useful tutorials and references https://github.com/ngs-docs/2015-nov-adv-rna/blob/master/salmon.rst http://angus.readthedocs.io/en/2016/rob_quant/tut.html https://2016-aug-nonmodel-rnaseq.readthedocs.io/en/latest/quantification.html