Loading...
 

Basic RNA-Seq Analysis Using TopHat

1. Demo Data

a. Login to your MolBioCloud system using X2GO.

b. Create your own ‘mydemo’ folder and copy raw fastq files into it

mkdir -p mydemo/fastq
cp ~/Desktop/rnaseq/fastq/* mydemo/fastq

c. Do all personal testing in this folder

cd mydemo
ls fastq

d. All demo testing results are provided in... ~/Desktop/rnaseq
Compare with corresponding files to check your results

--
RNA-Seq data sets: ENCODE project, C2C12 myoblast cells

10,000 reads, 100 bp, paired-end reads

Two samples: single-end reads

Wt and deletion (10 Mb fragment of chr1)

~/mydemo/fastq/myoblast_wt.fastq

~/mydemo/fastq/myoblast_deletion.fastq

One sample: paired-end reads

~/mydemo/fastq/myoblast_wt_1.fastq

~/mydemo/fastq/myoblast_wt_2.fastq


2. View Raw Data

Top 10 lines of fastq file

head –n 10 fastq/myoblast_wt.fastq

Bottom 10 lines

tail –n 10 fastq/myoblast_wt_fastq

3. Run FastQC

Run fastqc on fastq files for analyzing the raw data quality

cd ~/mydemo
mkdir fastqc
fastqc -o fastqc fastq/*.fastq
ls fastqc

View html report in web browser
fastqc/myoblast_wt_fastqc/fastqc_report.html


4. Run Fastx-toolkit

4.1. Trimming reads

cd ~/mydemo/fastq
fastx_trimmer -t 10 -i myoblast_wt.fastq -o myoblast_wt_trim10.fastq
head -n 4 myoblast_wt.fastq

@HWI-ST0787:100:C02F9ACXX:2:1101:10006:188055_1:N:0:/1
CTCATACGCAGCAGCTTGAGCATTTACTCGAGTTTCTGTGCCCTCTACAGGCAAAGCGAAACTGTCCATGATGATCATGGTCTCGCCGTCGACTTTCCCG
+
CCCFFFFFHHHHHJJJJJIJJJJJJJJJJJJJJJIIJJIIJIIIIIJJIIIJIJIJJJJJJHHHHFFFFFFFEEFCEEEDDDDDCDDDDDDBDDDDDDDD

head –n 4 myoblast_wt_trim10.fastq

@HWI-ST0787:100:C02F9ACXX:2:1101:10006:188055_1:N:0:/1
CTCATACGCAGCAGCTTGAGCATTTACTCGAGTTTCTGTGCCCTCTACAGGCAAAGCGAAACTGTCCATGATGATCATGGTCTCGCCGTC
+
CCCFFFFFHHHHHJJJJJIJJJJJJJJJJJJJJJIIJJIIJIIIIIJJIIIJIJIJJJJJJHHHHFFFFFFFEEFCEEEDDDDDCDDDDD


5. Mapping reads

5.1. Files needed for tophat

  1. Fastq files of test data set
  2. Bowtie2 indexes
  3. Gene annotation: GTF file
  4. Genome sequence .fasta file

a. List of different available genomes at the bowtie2 website
http://bowtie-bio.sourceforge.net/bowtie2/index.shtml 
‘Indexes’ section to the right of page

b. Files for several organisms available as part of illumina iGenomes
http://support.illumina.com/sequencing/sequencing_software/igenome.html

c. For our purposes:
All files already available in ‘Desktop/rnaseq’ folder

5.1.1. Bowtie2 indexes

ls -1 ~/Desktop/rnaseq/bowtie2_indexes/*.bt2

mm9_chr1.1.bt2
mm9_chr1.2.bt2
mm9_chr1.3.bt2
mm9_chr1.4.bt2
mm9_chr1.rev.1.bt2
mm9_chr1.rev.2.bt2

5.1.2. Gene annotation GTF

a. Gene transfer format (GTF) - Annotation of regions into exons, introns, UTR
b. GTF file fields:

c. Demo file: mm9_chr1.gtf

cp ~/Desktop/rnaseq/mm9_chr1.gtf ~/mydemo
cd ~/mydemo
head -n 4 mm9_chr1.gtf

chr1    unknown exon    3204563 3207049 .       -       .       gene_id “Xkr4”; gene_name “Xkr4”; p_id “P1231”; transcript_id “NM_001011874”; tss_id “TSS1903”;
chr1    unknown stop_codon      3206103 3206105 .       -       .       gene_id “Xkr4”; gene_name “Xkr4”; p_id “P1231”; transcript_id “NM_001011874”; tss_id “TSS1903”;
chr1    unknown CDS     3206106 3207049 .       -       2       gene_id “Xkr4”; gene_name “Xkr4”; p_id “P1231”; transcript_id “NM_001011874”; tss_id “TSS1903”;

5.1.3. Genome sequence FASTA

Raw sequence data
Demo file: mm9_chr1.fa

tail –n 5 ~/Desktop/rnaseq/bowtie2_indexes/mm9_chr1.fa

agaatttggaaaactaaaatatatacatgtttgtatcgatacttgttcag
gatttccttttgcttctctgcatacaggagaagcttctaaaaacgtaatt
gatcattgcctacaagcctttaatgccatgggattgcctaaatttattaa
gacagataacggccatcctattctagtaaaaattttatttcattctgtaa
agaatttggtattaaacttaaaactggaattc

5.2. Run tophat2

5.2.1. General format

tophat2 [options] bowtie-index input.fastq

-o     Directory to write all results

-G/--GTF     Annotation file supplied to tophat

--transcriptome-index     Directory to save transcriptome index. Use first time along with –G/GTF option. Omit –G/GTF option on subsequent runs to avoid rebuilding the index

-p     Number of processors to be used

5.2.2. Map single-end reads

cd ~/mydemo

tophat2 –o tophat_wt –G mm9_chr1.gtf --transcriptome-index mm9-chr1-transcriptome ~/Desktop/rnaseq/bowtie2_indexes/mm9_chr1 fastq/myoblast_wt.fastq

tophat2 –o tophat_deletion –G mm9_chr1.gtf --transcriptome-index mm9-chr1-
transcriptome ~/Desktop/rnaseq/bowtie2_indexes/mm9_chr1
fastq/myoblast_deletion.fastq

5.2.3. Map paired-end reads

tophat2 –o tophat_wt_PE –G mm9_chr1.gtf --transcriptome-index mm9-chr1-transcriptome ~/Desktop/rnaseq/bowtie2_indexes/mm9_chr1 fastq/myoblast_wt_1.fastq fastq/myoblast_wt_2.fastq

5.3. tophat output files

a. Tophat produces several files

ls -1 ~/mydemo/tophat_wt

logs/
accepted_hits.bam 
align_summary.txt
deletions.bed
insertions.bed
junctions.bed
prep_reads.info
unmapped.bam

b. Useful files

Each module has separate log
Can find number of reads mapping at different stages, etc
tophat.log – record of tophat output to command-line during run
run.log – record of individual commands that are part of tophat run
align_summary.txt – brief summary of alignment results

6. View Alignment

a. Alignment results: accepted_hits.bam

View with samtools

samtools view –h tophat_wt/accepted_hits.bam | head –n 5

@HD    VN:1.0    SO:coordinate
@SQ    SN:chr1    LN:197195432
@PG    ID:TopHat    VN:2.0.13    CL:/usr/bin/tophat -o tophat_wt -G mm9_chr1.gtf --transcriptome-index=mm9-chr1-transcriptome bowtie2_indexes/mm9_chr1 /media/sf_FAES/demo/rnaseq-demo-new/fastq/myoblast_wt.fastq

HWI-ST0787:100:C02F9ACXX:2:1102:10662:94956_2:N:0:    0    chr1    10012599    50    100M    *    0    0    GAAAACCCCAGACAAAACCCCAGGCTTTTACCAAGACAATAGATCGCTCTCCATAAACTGACAGCACGTCTCCATTGCTGAAGACAACACCTATACAACT    @@@FFFFFHHHGHJJJJJJJJJJJIIJJJJJJJIIJBGFHGIJIIIIJJJGGHIIGJIJGJFHHAHFFFDECCEEEDDDD@CCDCCCABBDDCCCDDDAC    AS:i:0    XN:i:0    XM:i:0    XO:i:0    XG:i:0    NM:i:0    MD:Z:100    YT:Z:UU    NH:i:1
HWI-ST0787:100:C02F9ACXX:2:1102:10662:94956_1:N:0:    16    chr1    10012649    50    100M    *    0    0    CCATAAACTGACAGCACGTCTCCATTGCTGAAGACAACACCTATACAACTCCTTGAACATGGAGAAGTCATGCTAGTGCCTGCAGAGTACCTATGTCCTA    DCDEEFCCADDEDDDEFFFFFFHEGGGHGHGFDJJJHFBEIIIGFIHDD;JIGGJJJJIHGJJJJJJJJIJJJIIIIJIIIJJJJJJHFHDFFDEDF@@B    AS:i:0    XN:i:0    XM:i:0    XO:i:0    XG:i:0    NM:i:0    MD:Z:100    YT:Z:UU    NH:i:1

b. Index and visualize alignments with samtools tview

cd ~/mydemo/tophat_wt
samtools index accepted_hits.bam
samtools tview -d C -p chr1:13615979 accepted_hits.bam

 

7. Alignment statistics

7.1. Uniquely mapping reads

Calculate number of uniquely mapping reads using SAM mapping quality (-q option)

samtools view -q 5 tophat_wt/accepted_hits.bam | wc -l

7.2. Alignment statistics

Get alignment statistics with samtools flagstat

samtools flagstat tophat_wt_PE/accepted_hits.bam