This workflow will take us through the Trinity package and we can get a feel for using another tool to complete RNAseq work. Also we are using the same exact data, as was used with the Advanced RNA-Seq Analysis Using TopHat exercise, so we can see how using a different workflow can change the results. In this instance we will also perform a de novo transcriptome assembly as part of the exercise.
1. Data preparation
a. Create a new directory ‘Trinity’ in your Desktop and change to the Trinity directory;
cd ~/Desktop mkdir Trinity cd Trinity
b. Combine all the left and right reads into single files.
For the left reads:
cat ~/Desktop/rnaseq_a/input/tophat/*left.fq > ALL.LEFT.fq
For the right reads:
cat ~/Desktop/rnaseq_a/input/tophat/*right.fq > ALL.RIGHT.fq
2. Run Trinity
Run trinity on the new files;
Trinity seqType fq SS_lib_type RF left ALL.LEFT.fq \ right ALL.RIGHT.fq CPU 4 \ max_memory 4G
3. Trinity Results
a. Examine the Trinity output:
b. We can get stats on the trinity run with the following command:
TrinityStats.pl trinity_out_dir/Trinity.fasta > Trinity_stats.txt
# Viola! You have just performed your first de novo assembly of a transcriptome!
c. Examine the stats file.
- How many genes did you assemble?
- How many total transcripts?
- What was the average contig length for your assembly?
- Based upon what you know about how de bruin graphs work do you think we all have the exact same assembly? Why do you feel this way, explain?
4. Abundance Estimates using RSEM
a. Next we will use RSEM to do abundance estimates for these newly assembled transcripts. This process takes the original reads and maps them to the transcripts.
Trinity has built in scripts to process these types of workflows:
align_and_estimate_abundance.pl prep_reference \ seqType fq \ est_method RSEM \ aln_method bowtie \ left ~/Desktop/rnaseq_a/input/tophat/Sp_ds.left.fq \ right ~/Desktop/rnaseq_a/input/tophat/Sp_ds.right.fq \ transcripts trinity_out_dir/Trinity.fasta \ output_dir Sp_ds
5. RSEM outputs
a. Once finished, RSEM will have generated two files: in the directory Sp_ds RSEM.isoforms.results and RSEM.genes.results. These files contain the Trinity transcript and component (the Trinity analogs to Isoform and gene respectively). RNA-seq fragment counts and normalized expression values. Examine the format of the RSEM.isoforms.results file by looking at the top few lines of the file:
Now run the same command on the remaining three samples.
NOTE: use the up arrow to get the last command and simply change the left and right file names as well as the prefix. You can use tab complete to prevent typos:
Use the !!:gs tool to make this much faster and saves typing
Rerun the command from step 4.a.
!!:gs/ds/hs !!:gs/hs/plat !!:gs/plat/log
6. Differential Expression Using edgeR
a. To run edgeR and identify differentially expressed transcripts, we need a data table containing the raw RNA-seq fragment counts for each transcript and sample analyzed. We can combine the RSEM-computed isoform fragment counts into a matrix file.
abundance_estimates_to_matrix.pl est_method RSEM \ Sp_ds/RSEM.isoforms.results \ Sp_hs/RSEM.isoforms.results \ Sp_log/RSEM.isoforms.results \ Sp_plat/RSEM.isoforms.results \ name_sample_by_basedir \ out_prefix isoforms
b. Also for the gene files:
abundance_estimates_to_matrix.pl est_method RSEM \ Sp_ds/RSEM.genes.results \ Sp_hs/RSEM.genes.results \ Sp_log/RSEM.genes.results \ Sp_plat/RSEM.genes.results \ name_sample_by_basedir \ out_prefix genes
c. This matrix step also performs a TPM TMM normalization step that normalizes TPM across samples. Normalization will account for differences in RNA composition (ex. highly expressed transcripts in one or more samples that skew the relative proportions of transcripts in each sample). Here, we applied TMM normalization (see: http://genomebiology.com/2010/11/3/r25) to generate a matrix of normalized TPM values across all samples.
d. Now we can run the edgeR pipeline on the isoforms and/or the genes:
run_DE_analysis.pl --matrix isoforms.counts.matrix \ --output isoforms_DE \ --dispersion 0.1 \ --DE_method edgeR
run_DE_analysis.pl --matrix genes.counts.matrix \ --output genes_DE \ --dispersion 0.1 \ --method edgeR
7. EdgeR outputs
a. Now we can do the analysis of these DE isoforms/genes:
cd isoforms_DE analyze_diff_expr.pl matrix ../isoforms.TMM.EXPR.matrix output DE_isoforms
b. And the genes:
cd ../genes_DE analyze_diff_expr.pl matrix ../genes.TMM.EXPR.matrix output DE_genes
c. This produces heat maps for the highly DE isoforms or genes.
d. This expression investigation can be continued through functional annotation and GO enrichment analysis by using the Trinotate (http://trinotate.github.io/). I encourage you to look into this if your interested.