1. Extract subsets of rows
1.1. grep
Search for patterns and extract the matching lines
wc -l ncbiRefSeq.txt
grep stat5 ncbiRefSeq.txt    #case sensitive
grep -i stat5 ncbiRefSeq.txt    #ignore case, non word search
grep -in stat5a ncbiRefSeq.txt    #ignore case, non word, line number
grep -iv chr1 ncbiRefSeq.txt | head    #ignore case, invert results
grep -iv chr1 ncbiRefSeq.txt | grep -i chr10    #no output: chr10 lines also contain chr1
grep -ivw chr1 ncbiRefSeq.txt | grep -i chr10 | head    #ignore case, invert, word bound
grep -P 'chr10\s+' ncbiRefSeq.txt    #Perl style regular expression (\s+ is one or more whitespace)
The following is from man grep:
A regular expression may be followed by one of several repetition operators:
? The preceding item is optional and matched at most once.
* The preceding item will be matched zero or more times.
+ The preceding item will be matched one or more times.
{n} The preceding item is matched exactly n times.
{n,} The preceding item is matched n or more times.
{,m} The preceding item is matched at most m times. This is a GNU extension.
{n,m} The preceding item is matched at least n times, but not more than m times.
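To see a few of these operators in action, here is a minimal sketch using grep -E (extended regular expressions, where ?, + and {} need no backslash) together with -x to match whole lines. The sample lines are made up for illustration and do not come from ncbiRefSeq.txt.

```shell
# Hypothetical sample lines, for illustration only
sample=$(printf 'chr1\nchr10\nchr100\nchrX\n')

# ? : the trailing 0 is optional, so chr1 and chr10 match
opt=$(printf '%s\n' "$sample" | grep -E -x 'chr10?')

# + : one or more 0s after chr1, so chr10 and chr100 match
plus=$(printf '%s\n' "$sample" | grep -E -x 'chr10+')

# {n,m} : one or two digits after chr, so chr100 and chrX are excluded
range=$(printf '%s\n' "$sample" | grep -E -x 'chr[0-9]{1,2}')

printf '%s\n' "$opt" "$plus" "$range"
```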
2. Extract subsets of columns
2.1. cut
Cut and extract specific columns of data
head peaks.txt
cut -c4-5 peaks.txt | head    #Print only the chromosome numbers
cut -c4-5 peaks.txt | sort -n | uniq    #Extract, sort, keep unique chromosome numbers
head ncbiRefSeq.txt
cut -f13 ncbiRefSeq.txt | head    #Cut column 13
cut -f2,13 ncbiRefSeq.txt | head    #Cut columns 2 and 13
cut -d$'\t' -f2,13 ncbiRefSeq.txt | head    #Cut columns 2 and 13, with explicit tab delimiter
cut -f2 ncbiRefSeq.txt | cut -d'.' -f1 > refseqids    #Cut & save RefSeq IDs (version suffix removed) in new file
cut -f2,13 ncbiRefSeq.txt > refseqnames    #Cut & save IDs and gene names in new file
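The refseqids step chains two cut calls: the first selects a tab-separated column, the second splits that column on '.' to drop the version suffix. A minimal sketch on a single made-up row (the accession NM_000000.1 is hypothetical):

```shell
# Hypothetical tab-separated row: bin, accession.version, chromosome
row=$(printf '585\tNM_000000.1\tchr1\n')

# Field 2 of the tab-separated row, then the part before the first '.'
id=$(printf '%s\n' "$row" | cut -f2 | cut -d '.' -f1)

echo "$id"
```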
3. Merge
3.1. paste
Paste columns from different files
paste refseqids refseqnames | head    #Paste IDs and names side-by-side
paste refseqids refseqnames > refseqidnames    #Paste IDs and names into new file
3.2. join
Integrate two files based on a key column. Both files must be sorted on their key column.
head refseqidnames
head knownToRefSeq.txt
sort -k2,2 knownToRefSeq.txt > knownToRefSeqSorted.txt    #Sort on 2nd column
sort -k1,1 refseqidnames > refseqidnamesSorted.txt    #Sort on 1st column
#Join using the 2nd column of the first file and the 1st column of the second file as keys
join -1 2 -2 1 knownToRefSeqSorted.txt refseqidnamesSorted.txt | head
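The sort-then-join pattern can be tried on two tiny files. This is only a sketch: the file names, IDs and gene names below are invented, and LC_ALL=C is set on the assumption that sort and join should agree on one collation order.

```shell
# Work in a scratch directory so no real files are touched
tmp=$(mktemp -d)
cd "$tmp" || exit 1
TAB=$(printf '\t')

# Hypothetical inputs: (known-gene ID, RefSeq ID) and (RefSeq ID, gene name)
printf 'uc002\tNM_222\nuc001\tNM_111\nuc003\tNM_333\n' > known.txt
printf 'NM_333\tGENE3\nNM_111\tGENE1\n' > names.txt

# Sort each file on its key column only (-kN,N)
LC_ALL=C sort -t "$TAB" -k2,2 known.txt > known.sorted
LC_ALL=C sort -t "$TAB" -k1,1 names.txt > names.sorted

# Join column 2 of the first file with column 1 of the second;
# NM_222 has no partner in names.sorted and is dropped
joined=$(LC_ALL=C join -t "$TAB" -1 2 -2 1 known.sorted names.sorted)
printf '%s\n' "$joined"
```

The key column is printed first, followed by the remaining columns of the first file and then those of the second.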
3.3. sort
Sort lines of text files
head peaks.txt
sort -k5,5 peaks.txt | head    #Sort on column 5 (as text)
sort -k5,5n peaks.txt | head    #Sort on column 5, numerically
sort -k5,5nr peaks.txt | head    #Sort on column 5, numeric, reverse
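The difference between text and numeric ordering is easiest to see on a few rows. The sketch below uses three invented peak lines whose fifth column is a score; it is not the real peaks.txt.

```shell
# Hypothetical five-column lines; column 4 is a peak name, column 5 a score
peaks=$(printf 'chr1\t100\t200\tp1\t9\nchr2\t100\t200\tp2\t100\nchr3\t100\t200\tp3\t25\n')
TAB=$(printf '\t')

# Text sort on column 5: "100" sorts before "25", which sorts before "9"
lex=$(printf '%s\n' "$peaks" | LC_ALL=C sort -t "$TAB" -k5,5 | cut -f4 | paste -s -d ' ' -)

# Numeric sort on column 5: 9, 25, 100
num=$(printf '%s\n' "$peaks" | sort -t "$TAB" -k5,5n | cut -f4 | paste -s -d ' ' -)

# Numeric sort, reversed: 100, 25, 9
rev=$(printf '%s\n' "$peaks" | sort -t "$TAB" -k5,5nr | cut -f4 | paste -s -d ' ' -)

printf '%s\n' "$lex" "$num" "$rev"
```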
3.4. cat
Concatenate file contents
cat f1.txt    #View f1 content
cat f2.txt    #View f2 content
cat f1.txt f2.txt    #View f1 and f2 contents
cat f1.txt > f3.txt    #Write f1 to f3
cat f3.txt
cat f2.txt > f3.txt    #Write f2 to f3 (overwrites f3)
cat f1.txt >> f3.txt    #Append f1 to f3
cat f3.txt
4. Compare data
4.1. uniq
Write only unique entries. uniq compares adjacent lines, so sort the input first.
cut -f13 ncbiRefSeq.txt | sort | head    #Gene names alone
cut -f13 ncbiRefSeq.txt | sort | uniq | head    #Unique gene names
cut -f13 ncbiRefSeq.txt | sort | uniq -c | head    #Number of occurrences of each gene name
cut -f13 ncbiRefSeq.txt | sort | uniq -cu | head    #Only names that occur once
cut -f13 ncbiRefSeq.txt | sort | uniq -cd | head    #Only names that occur more than once
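Why the sort step matters can be shown with a short made-up list in which a repeated name is not on adjacent lines. The gene names here are only sample input, not taken from the file.

```shell
# Hypothetical list with a repeated, non-adjacent entry
names=$(printf 'STAT5A\nTP53\nSTAT5A\nBRCA1\n')

# Without sorting, the two STAT5A lines are not adjacent, so both survive
unsorted=$(printf '%s\n' "$names" | uniq | wc -l | tr -d ' ')

# After sorting, the duplicates are adjacent and collapse to one line
sorted=$(printf '%s\n' "$names" | sort | uniq | wc -l | tr -d ' ')

# -u keeps names seen exactly once; -d keeps names seen more than once
once=$(printf '%s\n' "$names" | LC_ALL=C sort | uniq -u | paste -s -d ' ' -)
dup=$(printf '%s\n' "$names" | sort | uniq -d)

echo "unsorted: $unsorted, sorted: $sorted, once: $once, duplicated: $dup"
```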
4.2. comm
Compare two sorted files line by line
cut -f13 ncbiRefSeq.txt | sort | uniq | head -n 200 | tail -n 100 > glist2    #Unique names 101-200
cut -f13 ncbiRefSeq.txt | sort | uniq | head -n 150 > glist1    #Unique names 1-150
comm glist1 glist2    #Compare two lists
- The first column (zero tabs) is lines that only appear in the first file.
- The second column (one tab) is lines that only appear in the second file.
- The third column (two tabs) is lines that appear in both files.
comm -12 glist1 glist2    #Only lines appearing in both files
comm -23 glist1 glist2    #Lines only in file 1
comm -13 glist1 glist2    #Lines only in file 2
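The -1, -2 and -3 options each suppress one output column, so -12 leaves only the shared lines. A minimal sketch on two short, already sorted lists (the file names list1 and list2 and their contents are made up for the example):

```shell
# Work in a scratch directory so no real files are touched
tmp=$(mktemp -d)
cd "$tmp" || exit 1
export LC_ALL=C    # assumption: use one fixed collation order for comm

# Two hypothetical sorted gene lists
printf 'A1BG\nA2M\nAAAS\n' > list1
printf 'A2M\nAAAS\nAACS\n' > list2

both=$(comm -12 list1 list2 | paste -s -d ' ' -)    # lines in both files
only1=$(comm -23 list1 list2)                       # lines only in list1
only2=$(comm -13 list1 list2)                       # lines only in list2

echo "both: $both; only list1: $only1; only list2: $only2"
```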
4.3. diff
Compare file contents
diff glist1 glist2
diff -y glist1 glist2    #Side-by-side output
5. Datamash
datamash performs numeric and string operations on tabular input, optionally grouped by one or more columns
a. Count number of isoforms per gene.
#Sort, group by 13 (genes), count 2 (isoforms)
datamash -s -g 13 count 2 < ncbiRefSeq.txt | head
b. Count number of isoforms per gene and list the isoforms.
#Sort, group by 13 (genes), count 2 (isoforms), collapse 2 (isoforms)
datamash -s -g 13 count 2 collapse 2 < ncbiRefSeq.txt | head
cat ncbiRefSeq.txt | datamash -s -g 13 count 2 | head
cat ncbiRefSeq.txt | datamash -s -g 13 count 2 | awk '$2>5' | head    #Genes with more than 5 isoforms
c. Total number of transcripts on each chromosome.
datamash -s -g 3 count 2 < ncbiRefSeq.txt | head
d. Total number of transcripts on each chromosome, per strand.
datamash -s -g 3,4 count 2 < ncbiRefSeq.txt | head
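Where datamash is not installed, a grouped count like the one in (c) can be roughly cross-checked with cut, sort and uniq -c from the earlier sections. This is a different technique from datamash, shown only as a sketch: it counts rows per group, which matches the datamash count only on the assumption that every row has a value in the counted column. The rows below are invented.

```shell
# Hypothetical rows: column 2 = transcript ID, column 3 = chromosome
rows=$(printf '1\tNM_1\tchr1\n2\tNM_2\tchr2\n3\tNM_3\tchr1\n')

# Take the group column, sort it, count repeats, then print "group<TAB>count"
counts=$(printf '%s\n' "$rows" | cut -f3 | LC_ALL=C sort | uniq -c | awk '{print $2 "\t" $1}')

printf '%s\n' "$counts"
```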