1. Demo Data
a. Login to your MolBioCloud system using X2GO.
b. Create your own ‘mydemo’ folder and copy raw fastq files into it
mkdir -p mydemo/fastq cp ~/Desktop/rnaseq/fastq/* mydemo/fastq
c. Do all personal testing in this folder
cd mydemo ls fastq
d. All demo testing results are provided in... ~/Desktop/rnaseq
Compare with corresponding files to check your results
--
RNA-Seq data sets: ENCODE project, C2C12 myoblast cells
10,000 reads, 100 bp, paired-end reads
Two samples: single-end reads
Wt and deletion (10 Mb fragment of chr1)
~/mydemo/fastq/myoblast_wt.fastq
~/mydemo/fastq/myoblast_deletion.fastq
One sample: paired-end reads
~/mydemo/fastq/myoblast_wt_1.fastq
~/mydemo/fastq/myoblast_wt_2.fastq
2. View Raw Data
Top 10 lines of fastq file
head –n 10 fastq/myoblast_wt.fastq
Bottom 10 lines
tail –n 10 fastq/myoblast_wt_fastq
3. Run FastQC
Run fastqc on fastq files for analyzing the raw data quality
cd ~/mydemo mkdir fastqc fastqc -o fastqc fastq/*.fastq ls fastqc
View html report in web browser
fastqc/myoblast_wt_fastqc/fastqc_report.html
4. Run Fastx-toolkit
4.1. Trimming reads
cd ~/mydemo/fastq fastx_trimmer -t 10 -i myoblast_wt.fastq -o myoblast_wt_trim10.fastq head -n 4 myoblast_wt.fastq
@HWI-ST0787:100:C02F9ACXX:2:1101:10006:188055_1:N:0:/1
CTCATACGCAGCAGCTTGAGCATTTACTCGAGTTTCTGTGCCCTCTACAGGCAAAGCGAAACTGTCCATGATGATCATGGTCTCGCCGTCGACTTTCCCG
+
CCCFFFFFHHHHHJJJJJIJJJJJJJJJJJJJJJIIJJIIJIIIIIJJIIIJIJIJJJJJJHHHHFFFFFFFEEFCEEEDDDDDCDDDDDDBDDDDDDDD
head –n 4 myoblast_wt_trim10.fastq
@HWI-ST0787:100:C02F9ACXX:2:1101:10006:188055_1:N:0:/1
CTCATACGCAGCAGCTTGAGCATTTACTCGAGTTTCTGTGCCCTCTACAGGCAAAGCGAAACTGTCCATGATGATCATGGTCTCGCCGTC
+
CCCFFFFFHHHHHJJJJJIJJJJJJJJJJJJJJJIIJJIIJIIIIIJJIIIJIJIJJJJJJHHHHFFFFFFFEEFCEEEDDDDDCDDDDD
5. Mapping reads
5.1. Files needed for tophat
- Fastq files of test data set
- Bowtie2 indexes
- Gene annotation: GTF file
- Genome sequence .fasta file
a. List of different available genomes at the bowtie2 website
http://bowtie-bio.sourceforge.net/bowtie2/index.shtml
‘Indexes’ section to the right of page
b. Files for several organisms available as part of illumina iGenomes
http://support.illumina.com/sequencing/sequencing_software/igenome.html
c. For our purposes:
All files already available in ‘Desktop/rnaseq’ folder
5.1.1. Bowtie2 indexes
ls -1 ~/Desktop/rnaseq/bowtie2_indexes/*.bt2
mm9_chr1.1.bt2
mm9_chr1.2.bt2
mm9_chr1.3.bt2
mm9_chr1.4.bt2
mm9_chr1.rev.1.bt2
mm9_chr1.rev.2.bt2
5.1.2. Gene annotation GTF
a. Gene transfer format (GTF) - Annotation of regions into exons, introns, UTR
b. GTF file fields:
c. Demo file: mm9_chr1.gtf
cp ~/Desktop/rnaseq/mm9_chr1.gtf ~/mydemo cd ~/mydemo head -n 4 mm9_chr1.gtf
chr1 unknown exon 3204563 3207049 . - . gene_id “Xkr4”; gene_name “Xkr4”; p_id “P1231”; transcript_id “NM_001011874”; tss_id “TSS1903”;
chr1 unknown stop_codon 3206103 3206105 . - . gene_id “Xkr4”; gene_name “Xkr4”; p_id “P1231”; transcript_id “NM_001011874”; tss_id “TSS1903”;
chr1 unknown CDS 3206106 3207049 . - 2 gene_id “Xkr4”; gene_name “Xkr4”; p_id “P1231”; transcript_id “NM_001011874”; tss_id “TSS1903”;
5.1.3. Genome sequence FASTA
Raw sequence data
Demo file: mm9_chr1.fa
tail –n 5 ~/Desktop/rnaseq/bowtie2_indexes/mm9_chr1.fa
agaatttggaaaactaaaatatatacatgtttgtatcgatacttgttcag
gatttccttttgcttctctgcatacaggagaagcttctaaaaacgtaatt
gatcattgcctacaagcctttaatgccatgggattgcctaaatttattaa
gacagataacggccatcctattctagtaaaaattttatttcattctgtaa
agaatttggtattaaacttaaaactggaattc
5.2. Run tophat2
5.2.1. General format
tophat2 [options] bowtie-index input.fastq
-o Directory to write all results
-G/--GTF Annotation file supplied to tophat
--transcriptome-index Directory to save transcriptome index. Use first time along with –G/GTF option. Omit –G/GTF option on subsequent runs to avoid rebuilding the index
-p Number of processors to be used
5.2.2. Map single-end reads
cd ~/mydemo tophat2 –o tophat_wt –G mm9_chr1.gtf --transcriptome-index mm9-chr1-transcriptome ~/Desktop/rnaseq/bowtie2_indexes/mm9_chr1 fastq/myoblast_wt.fastq tophat2 –o tophat_deletion –G mm9_chr1.gtf --transcriptome-index mm9-chr1- transcriptome ~/Desktop/rnaseq/bowtie2_indexes/mm9_chr1 fastq/myoblast_deletion.fastq
5.2.3. Map paired-end reads
tophat2 –o tophat_wt_PE –G mm9_chr1.gtf --transcriptome-index mm9-chr1-transcriptome ~/Desktop/rnaseq/bowtie2_indexes/mm9_chr1 fastq/myoblast_wt_1.fastq fastq/myoblast_wt_2.fastq
5.3. tophat output files
a. Tophat produces several files
ls -1 ~/mydemo/tophat_wt
logs/
accepted_hits.bam
align_summary.txt
deletions.bed
insertions.bed
junctions.bed
prep_reads.info
unmapped.bam
b. Useful files
Each module has separate log
Can find number of reads mapping at different stages, etc
tophat.log – record of tophat output to command-line during run
run.log – record of individual commands that are part of tophat run
align_summary.txt – brief summary of alignment results
6. View Alignment
a. Alignment results: accepted_hits.bam
View with samtools
samtools view –h tophat_wt/accepted_hits.bam | head –n 5
@HD VN:1.0 SO:coordinate
@SQ SN:chr1 LN:197195432
@PG ID:TopHat VN:2.0.13 CL:/usr/bin/tophat -o tophat_wt -G mm9_chr1.gtf --transcriptome-index=mm9-chr1-transcriptome bowtie2_indexes/mm9_chr1 /media/sf_FAES/demo/rnaseq-demo-new/fastq/myoblast_wt.fastq
HWI-ST0787:100:C02F9ACXX:2:1102:10662:94956_2:N:0: 0 chr1 10012599 50 100M * 0 0 GAAAACCCCAGACAAAACCCCAGGCTTTTACCAAGACAATAGATCGCTCTCCATAAACTGACAGCACGTCTCCATTGCTGAAGACAACACCTATACAACT @@@FFFFFHHHGHJJJJJJJJJJJIIJJJJJJJIIJBGFHGIJIIIIJJJGGHIIGJIJGJFHHAHFFFDECCEEEDDDD@CCDCCCABBDDCCCDDDAC AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:100 YT:Z:UU NH:i:1
HWI-ST0787:100:C02F9ACXX:2:1102:10662:94956_1:N:0: 16 chr1 10012649 50 100M * 0 0 CCATAAACTGACAGCACGTCTCCATTGCTGAAGACAACACCTATACAACTCCTTGAACATGGAGAAGTCATGCTAGTGCCTGCAGAGTACCTATGTCCTA DCDEEFCCADDEDDDEFFFFFFHEGGGHGHGFDJJJHFBEIIIGFIHDD;JIGGJJJJIHGJJJJJJJJIJJJIIIIJIIIJJJJJJHFHDFFDEDF@@B AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:100 YT:Z:UU NH:i:1
b. Index and visualize alignments with samtools tview
cd ~/mydemo/tophat_wt samtools index accepted_hits.bam samtools tview -d C -p chr1:13615979 accepted_hits.bam
7. Alignment statistics
7.1. Uniquely mapping reads
Calculate number of uniquely mapping reads using SAM mapping quality (-q option)
samtools view -q 5 tophat_wt/accepted_hits.bam | wc -l
7.2. Alignment statistics
Get alignment statistics with samtools flagstat
samtools flagstat tophat_wt_PE/accepted_hits.bam