
Data Manipulation Using Linux Shell

1. Extract subsets of rows

1.1. grep

Search and extract patterns of data

wc -l ncbiRefSeq.txt
grep stat5 ncbiRefSeq.txt		#case sensitive
grep -i stat5 ncbiRefSeq.txt		#ignore case, substring match (also matches stat5a)
grep -in stat5a ncbiRefSeq.txt		#ignore case, substring match, show line numbers
grep -iv chr1 ncbiRefSeq.txt | head		#ignore case, invert match (lines without chr1)
grep -iv chr1 ncbiRefSeq.txt | grep -i chr10		#no output: chr10 lines contain chr1, so -v already removed them
grep -ivw chr1 ncbiRefSeq.txt | grep -i chr10 | head		#ignore case, invert, whole-word match: chr10 lines are kept
grep -P 'chr10\s\+' ncbiRefSeq.txt		#Perl-style regex: chr10, one whitespace character, then a literal + (plus strand)
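The difference between substring and whole-word matching is easiest to see on a small input. The sketch below uses a hypothetical three-line table (the file name and gene labels are made up for illustration) rather than ncbiRefSeq.txt.

```shell
# Toy table (hypothetical): gene label, chromosome, strand
tmp=$(mktemp -d)
printf 'geneA\tchr1\t+\ngeneB\tchr10\t-\ngeneC\tchr2\t+\n' > "$tmp/toy.txt"

# Without -w, "chr1" also matches inside "chr10"
substring_hits=$(grep -c chr1 "$tmp/toy.txt")

# With -w, "chr1" must be a whole word, so chr10 no longer matches
word_hits=$(grep -cw chr1 "$tmp/toy.txt")

# -v inverts the match: keep lines that lack the whole word chr1
not_chr1=$(grep -vw chr1 "$tmp/toy.txt" | cut -f1)

printf 'substring=%s word=%s\n' "$substring_hits" "$word_hits"
printf '%s\n' "$not_chr1"
rm -r "$tmp"
```

This is why the -w variant is needed when the pattern for one chromosome is a prefix of another.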

 

The following is from man grep:

      A regular expression may be followed by one of several repetition operators:

      ?      The preceding item is optional and matched at most once.

      *      The preceding item will be matched zero or more times.

      +      The preceding item will be matched one or more times.

      {n}    The preceding item is matched exactly n times.

      {n,}   The preceding item is matched n or more times.

      {,m}   The preceding item is matched at most m times.  This is a GNU extension.

      {n,m}  The preceding item is matched at least n times, but not more than m times.
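The repetition operators above can be tried on a toy list of chromosome names. This is a minimal sketch that assumes grep -E (extended regular expressions), where the operators are written without backslashes; the input file is hypothetical.

```shell
# Toy input (hypothetical): one chromosome name per line
tmp=$(mktemp -d)
printf 'chr1\nchr10\nchr100\nchrX\n' > "$tmp/chroms.txt"

# ? : the trailing 0 is optional; -x requires the whole line to match
optional=$(grep -Ex 'chr10?' "$tmp/chroms.txt")

# + : one or more digits after "chr", so chrX is excluded (-c counts lines)
one_or_more=$(grep -Ec '^chr[0-9]+$' "$tmp/chroms.txt")

# {n} : exactly three digits
exactly_three=$(grep -E '^chr[0-9]{3}$' "$tmp/chroms.txt")

# {n,m} : at least two and at most three digits
two_to_three=$(grep -Ec '^chr[0-9]{2,3}$' "$tmp/chroms.txt")

printf '%s\n' "$optional" "$one_or_more" "$exactly_three" "$two_to_three"
rm -r "$tmp"
```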


2. Extract subsets of columns

2.1. cut

Cut and extract specific columns of data

head peaks.txt 
cut -c4-5 peaks.txt | head  #Print only the chromosome numbers
cut -c4-5 peaks.txt | sort -n | uniq  #Extract, sort, then keep unique chromosome numbers (uniq needs sorted input)

head ncbiRefSeq.txt
cut -f13 ncbiRefSeq.txt | head  #Cut column 13
cut -f2,13 ncbiRefSeq.txt | head  #Cut column 2,13
cut -d$'\t' -f2,13 ncbiRefSeq.txt | head  #Cut column 2,13, with tab delimiter
	
cut -f2 ncbiRefSeq.txt | cut -d'.' -f1 > refseqids  #Cut & save refseqids in new file
cut -f2,13 ncbiRefSeq.txt > refseqnames  #Cut,save ids,refseqnames in new file
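The three selection modes of cut used above (fields, a custom delimiter, character positions) can be checked on a tiny table. The file and the accession values below are hypothetical stand-ins, not rows from ncbiRefSeq.txt.

```shell
# Toy table (hypothetical): bin, versioned accession, chromosome
tmp=$(mktemp -d)
printf '1\tNM_000001.2\tchr1\n2\tNR_000002.1\tchr12\n' > "$tmp/toy.txt"

# -f selects fields; tab is cut's default delimiter
accessions=$(cut -f2 "$tmp/toy.txt")

# A second cut with -d'.' drops the version suffix after the dot
ids=$(cut -f2 "$tmp/toy.txt" | cut -d'.' -f1)

# -c selects character positions; 4- means from the 4th character onward
chrom_numbers=$(cut -f3 "$tmp/toy.txt" | cut -c4-)

printf '%s\n' "$accessions" "$ids" "$chrom_numbers"
rm -r "$tmp"
```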

3. Merge

3.1. paste

Paste columns from different files

paste refseqids refseqnames | head  #paste ids and names side-by-side
paste refseqids refseqnames > refseqidnames  #paste ids,names into new file
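Note that paste pairs lines purely by position, not by any key, so both files must already be in the same row order. A small sketch with two hypothetical one-column files:

```shell
# Two toy files (hypothetical) with the same number of lines
tmp=$(mktemp -d)
printf 'NM_1\nNM_2\n' > "$tmp/ids.txt"
printf 'STAT5A\nSTAT5B\n' > "$tmp/names.txt"

# Line 1 is glued to line 1, line 2 to line 2, separated by a tab
pasted=$(paste "$tmp/ids.txt" "$tmp/names.txt")

# -d changes the separator, here to a comma
pasted_csv=$(paste -d',' "$tmp/ids.txt" "$tmp/names.txt")

printf '%s\n' "$pasted" "$pasted_csv"
rm -r "$tmp"
```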

3.2. join

Integrate two files based on key column

head refseqidnames
head knownToRefSeq.txt
sort -k2,2 knownToRefSeq.txt > knownToRefSeqSorted.txt  #Sort on 2nd column only (the join key)
sort -k1,1 refseqidnames > refseqidnamesSorted.txt  #Sort on 1st column only (the join key)
join -1 2 -2 1 knownToRefSeqSorted.txt refseqidnamesSorted.txt | head  #Join on column 2 of the first file and column 1 of the second file
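join expects both inputs to be sorted on their key columns, and it only prints keys present in both files. The sketch below illustrates this with two hypothetical toy files; LC_ALL=C is an added assumption here so that sort and join agree on the ordering.

```shell
tmp=$(mktemp -d)
# File 1 (hypothetical): known-gene id, RefSeq id (key in column 2)
printf 'uc002\tNM_2\nuc001\tNM_1\nuc003\tNM_9\n' > "$tmp/known.txt"
# File 2 (hypothetical): RefSeq id (key in column 1), gene name
printf 'NM_2\tSTAT5B\nNM_1\tSTAT5A\n' > "$tmp/names.txt"

# Sort each file on its own key column only
LC_ALL=C sort -k2,2 "$tmp/known.txt" > "$tmp/known.sorted"
LC_ALL=C sort -k1,1 "$tmp/names.txt" > "$tmp/names.sorted"

# Output is: key, remaining fields of file 1, remaining fields of file 2.
# NM_9 has no partner in file 2, so it is dropped.
joined=$(LC_ALL=C join -1 2 -2 1 "$tmp/known.sorted" "$tmp/names.sorted")

printf '%s\n' "$joined"
rm -r "$tmp"
```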

3.3. sort

Sort lines of text files

head peaks.txt
sort -k5,5 peaks.txt | head  #Sort on column 5 only, as text
sort -k5,5n peaks.txt | head  #Sort on column 5, numerically
sort -k5,5nr peaks.txt | head  #Sort on column 5, numerically, in reverse (largest first)
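The difference between text and numeric sorting is easy to miss on real data. Here is a small sketch on a hypothetical two-column score table (labels p1 to p4 are made up); LC_ALL=C is assumed so the text ordering is byte-wise and reproducible.

```shell
# Toy table (hypothetical): peak label, score
tmp=$(mktemp -d)
printf 'p1\t9\np2\t100\np3\t25\np4\t3\n' > "$tmp/scores.txt"

# Text sort compares characters, so "100" comes before "25" and "9"
lexical=$(LC_ALL=C sort -k2,2 "$tmp/scores.txt" | cut -f1 | paste -sd' ' -)

# n compares the values as numbers
numeric=$(LC_ALL=C sort -k2,2n "$tmp/scores.txt" | cut -f1 | paste -sd' ' -)

# nr is numeric, largest first
numeric_rev=$(LC_ALL=C sort -k2,2nr "$tmp/scores.txt" | cut -f1 | paste -sd' ' -)

printf '%s\n' "$lexical" "$numeric" "$numeric_rev"
rm -r "$tmp"
```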

3.4. cat

Concatenate file contents

cat f1.txt  #View f1 content
cat f2.txt  #View f2 content
cat f1.txt f2.txt  #View f1 and f2 contents
cat f1.txt > f3.txt  #Write f1 to f3 (creates f3, or overwrites it if it exists)
cat f3.txt
cat f2.txt > f3.txt  #Overwrite f3 with f2
cat f1.txt >> f3.txt  #Append f1 to the end of f3
cat f3.txt
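The sequence above hinges on the difference between > (overwrite) and >> (append). A minimal sketch with hypothetical one-line files, run in a temporary directory so nothing existing is clobbered:

```shell
tmp=$(mktemp -d)
printf 'a\n' > "$tmp/f1.txt"
printf 'b\n' > "$tmp/f2.txt"

cat "$tmp/f1.txt" > "$tmp/f3.txt"    # f3 now holds: a
cat "$tmp/f2.txt" > "$tmp/f3.txt"    # > replaces the content, f3 now holds: b
cat "$tmp/f1.txt" >> "$tmp/f3.txt"   # >> appends, f3 now holds: b then a

result=$(cat "$tmp/f3.txt")
printf '%s\n' "$result"
rm -r "$tmp"
```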

4. Compare data

4.1. uniq

Write only unique entries

cut -f13 ncbiRefSeq.txt | sort | head  #Gene names alone
cut -f13 ncbiRefSeq.txt | sort | uniq | head  #Unique Gene names
cut -f13 ncbiRefSeq.txt | sort | uniq -c | head  #Number of occurrences of each gene name
cut -f13 ncbiRefSeq.txt | sort | uniq -cu | head  #Only names that occur exactly once
cut -f13 ncbiRefSeq.txt | sort | uniq -cd | head  #Only names that occur more than once
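Every pipeline above sorts before calling uniq because uniq only collapses adjacent duplicate lines. The sketch below shows this on a hypothetical unsorted list of labels.

```shell
# Toy list (hypothetical), deliberately unsorted with no adjacent repeats
tmp=$(mktemp -d)
printf 'B\nA\nB\nC\nA\nB\n' > "$tmp/genes.txt"

# Without sorting, nothing is adjacent, so uniq removes nothing
unsorted_lines=$(uniq "$tmp/genes.txt" | wc -l | tr -d ' ')

# After sorting, duplicates sit next to each other and collapse
unique=$(sort "$tmp/genes.txt" | uniq | paste -sd' ' -)

# -c prefixes each line with its count (reformatted here as name=count)
counts=$(sort "$tmp/genes.txt" | uniq -c | awk '{print $2 "=" $1}' | paste -sd' ' -)

# -u keeps lines seen exactly once; -d keeps lines seen more than once
singletons=$(sort "$tmp/genes.txt" | uniq -u | paste -sd' ' -)
duplicated=$(sort "$tmp/genes.txt" | uniq -d | paste -sd' ' -)

printf '%s\n' "$unsorted_lines" "$unique" "$counts" "$singletons" "$duplicated"
rm -r "$tmp"
```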

4.2. comm

Compare two sorted files line by line

cut -f13 ncbiRefSeq.txt| sort | uniq | head -n 200 | tail -n 100 > glist2
cut -f13 ncbiRefSeq.txt| sort | uniq | head -n 150 > glist1
comm glist1 glist2  #Compare two lists

 

  • The first column (zero tabs) is lines that only appear in the first file.

  • The second column (one tab) is lines that only appear in the second file.

  • The third column (two tabs) is lines that appear in both files.

 

comm -12 glist1 glist2  #Only lines appearing in both files
comm -23 glist1 glist2  #Lines only in file 1
comm -13 glist1 glist2  #Lines only in file 2
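The column-suppression flags can be read as "hide these columns": -12 hides columns 1 and 2 and leaves the shared lines. A minimal sketch on two hypothetical sorted lists:

```shell
# Two toy lists (hypothetical), both already sorted as comm requires
tmp=$(mktemp -d)
printf 'A\nB\nC\n' > "$tmp/list1.txt"
printf 'B\nC\nD\n' > "$tmp/list2.txt"

# Hide columns 1 and 2: lines present in both files
both=$(comm -12 "$tmp/list1.txt" "$tmp/list2.txt" | paste -sd' ' -)

# Hide columns 2 and 3: lines only in the first file
only_first=$(comm -23 "$tmp/list1.txt" "$tmp/list2.txt")

# Hide columns 1 and 3: lines only in the second file
only_second=$(comm -13 "$tmp/list1.txt" "$tmp/list2.txt")

printf 'both: %s\nonly in 1: %s\nonly in 2: %s\n' "$both" "$only_first" "$only_second"
rm -r "$tmp"
```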

4.3. diff

Compare file contents

diff glist1 glist2
diff -y glist1 glist2
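In the default diff output, lines starting with < come from the first file and lines starting with > come from the second. The sketch below uses two hypothetical lists; note that diff exits with status 1 when the files differ, which is handled with || true so the script continues.

```shell
tmp=$(mktemp -d)
printf 'A\nB\nC\n' > "$tmp/list1.txt"
printf 'B\nC\nD\n' > "$tmp/list2.txt"

# diff returns exit status 1 when differences are found; that is not an error
diff_out=$(diff "$tmp/list1.txt" "$tmp/list2.txt" || true)

# Lines only in the first file are prefixed with "<"
removed=$(printf '%s\n' "$diff_out" | grep '^<')

# Lines only in the second file are prefixed with ">"
added=$(printf '%s\n' "$diff_out" | grep '^>')

printf '%s\n' "$diff_out"
rm -r "$tmp"
```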

5. Datamash

Datamash performs numeric and string operations on input

a. Count number of isoforms per gene.

#Sort,group by 13 (genes), count 2 (isoforms)
datamash -s -g 13 count 2 < ncbiRefSeq.txt | head

b. Count number of isoforms per gene, list isoforms

#Sort,group by 13 (genes), count 2 (isoforms), collapse 2 (isoforms)
datamash -s -g 13 count 2 collapse 2 < ncbiRefSeq.txt | head
cat ncbiRefSeq.txt | datamash -s -g 13 count 2 | head  #Same count, reading from a pipe
cat ncbiRefSeq.txt | datamash -s -g 13 count 2 | awk '$2>5' | head  #Keep only genes with more than 5 isoforms

c. Total number of transcripts on each chromosome

datamash -s -g 3 count 2 < ncbiRefSeq.txt | head

d. Total number of transcripts on each chromosome, per strand

datamash -s -g 3,4 count 2 < ncbiRefSeq.txt | head
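To see what the group, count and collapse operations compute, the same idea can be sketched with awk on a toy table. This is an illustrative stand-in, not how datamash is implemented, and the two-column file below is hypothetical. On this file the corresponding datamash call would be datamash -s -g 2 count 1 collapse 1.

```shell
# Toy table (hypothetical): transcript id, gene name
tmp=$(mktemp -d)
printf 'NM_1\tSTAT5A\nNM_2\tSTAT5B\nNM_3\tSTAT5A\nNM_4\tSTAT5A\n' > "$tmp/toy.txt"

# Group by column 2: n[] counts rows per gene, ids[] collects column 1
# as a comma-separated list. awk's "for (g in n)" order is unspecified,
# so the result is sorted afterwards.
summary=$(awk -F'\t' '
  { n[$2]++; ids[$2] = (ids[$2] == "" ? $1 : ids[$2] "," $1) }
  END { for (g in n) print g "\t" n[g] "\t" ids[g] }
' "$tmp/toy.txt" | LC_ALL=C sort)

# Keep only genes with more than 2 transcripts, mirroring the awk filter above
frequent=$(printf '%s\n' "$summary" | awk -F'\t' '$2 > 2 {print $1}')

printf '%s\n' "$summary"
rm -r "$tmp"
```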