Population genomics

1 Introduction to Linux and Servers

Author: Lizel Potgieter, adapted by Amrei Binzer-Panchal

Linux is a family of open-source Unix-like operating systems based on the Linux kernel, an operating system kernel first released on September 17, 1991, by Linus Torvalds (https://en.wikipedia.org/wiki/Linux). Most servers run on a Linux-based operating system.

If you have little or no experience with the command line, please take some time to follow the Software Carpentry course on the Unix shell.

If you have some experience with the command line, you can look through the commands below (Section 2 onwards) to refresh your knowledge.

Either way, to check that your level of proficiency is adequate, jump to the last part of this page and take the Linux Exercise Quiz (Section 10).

And last but not least, here are some other resources with many other cool tips and tricks for all of your bioinformatics needs. For the full cheat sheets and other commands, please see:

  • Cheatography
  • Stephen Turner’s GitHub
  • Ming Tang’s GitHub

2 Basic Structure of Commands

cmd refers to a command.

Input of cmd from file

cmd < file

Output of cmd2 as file input to cmd1

cmd1 <(cmd2)

Standard output (stdout) of cmd to file

cmd > file

Append stdout to file

cmd >> file

stdout of cmd1 to cmd2

cmd1 | cmd2

Run cmd1 then cmd2

cmd1 ; cmd2

Run cmd2 if cmd1 is successful

cmd1 && cmd2

Run cmd2 if cmd1 is not successful

cmd1 || cmd2
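
The operators above can be combined in a short session. The following is a minimal sketch, assuming a throwaway directory created with mktemp; the file names (names.txt, sorted.txt) and variable names are made up for illustration.

```shell
# Work in a throwaway directory so nothing important is overwritten
workdir=$(mktemp -d)
cd "$workdir"

# '>' writes stdout to a file (overwriting it), '>>' appends to it
printf 'sample_B\nsample_A\n' > names.txt
printf 'sample_C\n' >> names.txt

# '<' feeds the file to the command's standard input
sort < names.txt > sorted.txt

# '|' passes the stdout of one command to the next
first_name=$(sort names.txt | head -n 1)

# '&&' runs the second command only if the first succeeds,
# '||' runs it only if the first fails
grep -q "sample_A" names.txt && found="yes"
grep -q "sample_Z" names.txt || missing="yes"
```

Because > overwrites without asking, it is worth checking the target file name before pressing enter.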

3 Short commands

Stop current command

CTRL-c

Go to start of line

CTRL-a

Go to end of line

CTRL-e

Cut from cursor to start of line

CTRL-u

Cut from cursor to end of line

CTRL-k

Search history

CTRL-r 

Run previous command, replacing abc with 123

^abc^123

4 Grep commands

Case-insensitive search

grep -i

Recursive search

grep -r

Inverted search

grep -v

Show only the matched part of each line

grep -o
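
To see what each flag does, it can help to run them on a tiny file. This is a sketch only: genes.txt and its contents are invented, and the extra flags -c (count matching lines) and -l (list matching file names) are used here just to keep the output short.

```shell
# Hypothetical example file, written to a temporary directory
workdir=$(mktemp -d)
printf 'Gene1 ATG\ngene2 TTG\nGENE3 ATG\nother CCC\n' > "$workdir/genes.txt"

# -i: match "gene" regardless of case; -c counts the matching lines
n_gene=$(grep -i -c "gene" "$workdir/genes.txt")

# -v: keep only the lines that do NOT match
not_gene=$(grep -i -v "gene" "$workdir/genes.txt")

# -o: print only the matched text, one match per output line
n_atg=$(grep -o "ATG" "$workdir/genes.txt" | wc -l)

# -r: search every file below a directory; -l lists the matching file names
hits=$(grep -r -l "TTG" "$workdir")
```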

5 File-based commands

Create file1

touch file1

Concatenate files and output

cat file1 file2

View and paginate file1

less file1

Get type of file1

file file1

Copy file1 to file2

cp file1 file2

Move file1 to file2

mv file1 file2

Delete file1

rm file1

Show first 10 lines of file1

head file1

Show first 50 lines of file1

head -n 50 file1

Show last 10 lines of file1

tail file1

Output last lines of file1 as it changes

tail -F file1
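
A short practice run of the file commands, as a sketch in a temporary directory. The seq command (which prints a range of numbers) and the -n option of tail are not covered above and are used only to make a test file and keep the output short.

```shell
# Work in a throwaway directory
workdir=$(mktemp -d)
cd "$workdir"

# file1 holds the numbers 1 to 100, one per line
seq 1 100 > file1

first=$(head -n 3 file1)   # lines 1 to 3
last=$(tail -n 1 file1)    # line 100

cp file1 file2             # file1 and file2 now both exist
mv file2 file3             # file2 is gone, its content is now in file3
rm file3                   # file3 is deleted; there is no recycle bin
```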

6 Replacing patterns with other patterns with sed

Replacing a pattern and writing to a new file (use this until you are certain you know what you are doing)

sed "s/foo/bar/g" "$infile" > "$outfile"

Replacing a pattern in the same file (there is no going back)

sed -i "s/foo/bar/g" "$infile"

Replacing a pattern (here bar) only in lines that contain a given string (here foo)

sed -i "/foo/s/bar/foobar/g" "$infile"
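
Before using -i, the same expressions can be tried without it, so the result is printed or written elsewhere and the input stays untouched. A minimal sketch, assuming made-up file names (in.txt, out.txt) in a temporary directory:

```shell
workdir=$(mktemp -d)
infile="$workdir/in.txt"
outfile="$workdir/out.txt"
printf 'foo bar\nbaz bar\nfoo foo\n' > "$infile"

# Replace every foo with bar and write the result to a new file
sed "s/foo/bar/g" "$infile" > "$outfile"

# The input file is unchanged, so the two versions can be compared
# (diff exits non-zero when the files differ, hence the || true)
diff "$infile" "$outfile" || true

# Only on lines containing foo, replace bar with foobar;
# without -i the result is captured here instead of written back
restricted=$(sed "/foo/s/bar/foobar/g" "$infile")
```

Once the output looks right, adding -i applies the same change to the file itself.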

7 Some Useful Commands for Bioinformatics

Count the entries in a fasta file. You can replace the header character (>) with any other pattern to count the number of lines in your file that contain it

grep ">" "$infile" | wc -l
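
A small check of what this count means, as a sketch on an invented fasta file (example.fa). Note that the > must stay inside quotes, otherwise the shell treats it as a redirection and overwrites the file. The -c and -o flags of grep are used here for comparison.

```shell
workdir=$(mktemp -d)
infile="$workdir/example.fa"
printf '>seq1\nACGT\nAC\n>seq2 len>3\nGGGG\n' > "$infile"

# Count header lines by piping grep into wc -l
n_pipe=$(grep ">" "$infile" | wc -l)

# Same count with grep -c; ^ anchors the match to the start of the line
n_count=$(grep -c "^>" "$infile")

# grep counts matching lines, not individual matches:
# the second header contains ">" twice but is one line.
# With -o every match is printed on its own line, so this counts matches.
n_matches=$(grep -o ">" "$infile" | wc -l)
```

Here the file has two entries, while the > character itself occurs three times.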

8 File manipulation with awk

Print columns 2, 4, and 5 to new file

awk '{print $2,$4,$5}' input.txt > outfile

Print lines where the value in column 3 is larger than in column 5

awk '$3>$5' file.txt

Print sum of column 1

awk '{sum+=$1} END {print sum}' file.txt

Compute the mean of column 2

awk '{x+=$2}END{print x/NR}' file.txt

Remove duplicates while keeping the order of the file

awk '!visited[$0]++' file.txt
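
The one-liners above can be tried on a small table where the answers are easy to check by hand. This is a sketch; table.txt and its values are invented for illustration.

```shell
workdir=$(mktemp -d)
# Three lines, five columns; the first and third lines are identical
printf '1 10 5 a 2\n2 40 1 b 7\n1 10 5 a 2\n' > "$workdir/table.txt"

# Sum of column 1: 1 + 2 + 1
col1_sum=$(awk '{sum+=$1} END {print sum}' "$workdir/table.txt")

# Mean of column 2: (10 + 40 + 10) divided by NR, the number of lines
col2_mean=$(awk '{x+=$2} END {print x/NR}' "$workdir/table.txt")

# Lines where column 3 is larger than column 5
n_rows=$(awk '$3>$5' "$workdir/table.txt" | wc -l)

# Drop repeated lines, keeping the first occurrence of each
n_unique=$(awk '!visited[$0]++' "$workdir/table.txt" | wc -l)
```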

Split multi-fasta into individual fasta files

awk '/^>/{s=++d".fa"} {print > s}' multi.fa

Length of each sequence in a multi-fasta file

awk '/^>/ {if (seqlen){print seqlen}; print ;seqlen=0;next; } { seqlen = seqlen +length($0)}END{print seqlen}' file.fa
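
To see how the length one-liner handles sequences that span several lines, here is a sketch on an invented two-entry file (file.fa in a temporary directory). Each header is printed, followed by the total length of its sequence.

```shell
workdir=$(mktemp -d)
# seq1 is split over two lines (4 + 2 bases), seq2 is on one line
printf '>seq1\nACGT\nAC\n>seq2\nGGGG\n' > "$workdir/file.fa"

# On a header: print the previous length (if any), print the header, reset.
# On a sequence line: add its length. At the end: print the last length.
lengths=$(awk '/^>/ {if (seqlen){print seqlen}; print ;seqlen=0;next; } { seqlen = seqlen +length($0)}END{print seqlen}' "$workdir/file.fa")

printf '%s\n' "$lengths"
```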

Sort VCF with header

cat my.vcf | awk '$0~"^#" { print $0; next } { print $0 | "sort -k1,1V -k2,2n" }'

9 A basic for loop

Often we wish to run the same code for all files that are in a folder, have the same extension (like .fq), or have a similar string in the filename. Instead of changing the name in the code and rerunning it manually, we use for loops. You can write this directly into the terminal, or save it into a bash file (extension .sh). This line of code uses i as the variable for each file in the folder that has a .fq extension, and runs fastqc on each of them. The -o ${i}_fastqc option sets the output directory for each file: the original file name with _fastqc appended. FastQC does not create this directory itself, so mkdir -p creates it first.

for i in *.fq ; do mkdir -p "${i}_fastqc" && fastqc "${i}" -o "${i}_fastqc" ; done
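
Before running a loop on real data, it can be useful to do a dry run that only prints the commands. The sketch below assumes invented file names and puts echo in front of fastqc, so nothing is actually executed; ${i%.fq} is shell syntax for the file name with the .fq ending removed.

```shell
workdir=$(mktemp -d)
cd "$workdir"
touch sampleA.fq sampleB.fq notes.txt

# Loop over every .fq file; notes.txt does not match the pattern
for i in *.fq ; do
    # ${i%.fq} strips the .fq extension from the file name
    outdir="${i%.fq}_fastqc"
    # Create the output directory if it does not exist yet
    mkdir -p "$outdir"
    # echo prints the command instead of running it (a dry run);
    # remove the echo once the printed commands look right
    echo fastqc "$i" -o "$outdir"
done
```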

10 Linux Exercise Quiz

Please try to complete each task without looking at the answer first.

  1. Make a folder in the proj folder with your name
solution:
mkdir your_name
  2. Navigate to your folder
solution:
cd your_name
  3. Create an empty file
solution:
touch randomfile
  4. Rename randomfile
solution:
mv randomfile randomfile2
  5. Delete randomfile2
solution:
rm randomfile2
  6. Create a directory
solution:
mkdir randomdir
  7. Delete the directory
solution:
rm -r randomdir
  8. Create a symbolic link (symlink) from the source data to your own folder. Please do not copy it to your own directories! There will be a new folder for each subsection of the workshop. This example is only for the fastq files we will use for read mapping
solution:
ln -s /1_fastqc/*fq .
  9. List the contents of your directory. The symlinks should have a colour other than white
solution:
ls
  10. Load the bwa module on the server
solution:
module load bwa/0.7.4
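
If you want to practise symlinks without touching the course data, the following sketch builds a stand-in source folder in a temporary directory. All folder and file names here are invented; on the server you would link from the real source path instead.

```shell
workdir=$(mktemp -d)
mkdir "$workdir/source_data" "$workdir/my_folder"
touch "$workdir/source_data/reads_1.fq" "$workdir/source_data/reads_2.fq"

cd "$workdir/my_folder"
# Link every fq file into the current directory (the final dot)
ln -s "$workdir"/source_data/*fq .

# ls -l shows where each link points
ls -l

# [ -L name ] is true when name is a symlink
[ -L reads_1.fq ] && echo "reads_1.fq is a symlink"
```

Deleting a symlink with rm removes only the link, not the file it points to.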