Comparing genes and genomes using bioinformatics

BLAST and its specialized derivative programs (Primer-BLAST, conserved domains, vector contamination, Align two sequences, Global Sequence Alignment Tool, WGS sequences) may be the most widely used tools in bioinformatics. The BLAST algorithm 1) is optimized for speed and is used to search protein and DNA databases for sequence similarities.

Useful knowledge about BLAST

Access BLAST through the internet or standalone on your computer

The National Center for Biotechnology Information (NCBI) offers a web interface for running BLAST searches against its extensive databases to find sequences that are similar to the user’s query sequence.

The BLAST program can also be installed on a personal computer rather than accessed through the internet. This is necessary when one wishes to compare a query sequence against a set of sequences (the database) that is not yet included in GenBank or any other public sequence database (DDBJ, EMBL). The standalone BLAST programs for installation on a personal computer are available from NCBI Blast and from BlastStation. In February 2011, the current version was ncbi-blast-2.2.24; version 2.2.26 was released in March 2012. An NCBI page is useful for setting up Command Line BLAST, and the NCBI help pages about Blast are a good resource.
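
As a minimal sketch of a standalone search, the Python snippet below builds the two BLAST+ command lines that are typically involved: makeblastdb, which formats a local FASTA file as a database, and blastn, which searches it. The file names (my_sequences.fasta, query.fasta, hits.tsv) and the database name local_db are hypothetical. The example assumes a nucleotide search with the BLAST+ executables on the PATH; for protein data one would use -dbtype prot and blastp instead.

```python
import shutil
import subprocess


def build_blast_commands(db_fasta, query_fasta, out_file,
                         db_name="local_db", evalue=1e-5):
    """Return the makeblastdb and blastn command lines as argument lists."""
    # Step 1: format the local FASTA file as a nucleotide BLAST database.
    make_db = ["makeblastdb", "-in", db_fasta, "-dbtype", "nucl", "-out", db_name]
    # Step 2: search the query against it, writing tabular output (-outfmt 6).
    search = [
        "blastn",
        "-query", query_fasta,
        "-db", db_name,
        "-evalue", str(evalue),
        "-outfmt", "6",
        "-out", out_file,
    ]
    return make_db, search


def run_blast(db_fasta, query_fasta, out_file):
    """Run both steps; raises if BLAST+ is missing or a step fails."""
    if shutil.which("makeblastdb") is None or shutil.which("blastn") is None:
        raise RuntimeError("BLAST+ executables not found on PATH")
    for cmd in build_blast_commands(db_fasta, query_fasta, out_file):
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Only print the commands here; call run_blast(...) to execute them.
    for cmd in build_blast_commands("my_sequences.fasta", "query.fasta", "hits.tsv"):
        print(" ".join(cmd))
```

Keeping command construction separate from execution makes it possible to inspect the exact command lines before running a long search.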

Batch blast

Whether you access BLAST through the NCBI website or from your personal computer, you may compare multiple query sequences at a time in a Batch blast. In that case, the resulting BLAST output file will be very long. You can save the output file in various formats, for instance .txt, .xml or .csv.
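
Because batch output is long, it is usually easier to process it with a script than to read it by eye. The sketch below assumes the results were saved in the BLAST+ tabular format (-outfmt 6), whose twelve default columns are qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue and bitscore. The sample rows and the E-value cutoff are made up for illustration.

```python
import csv
import io
from collections import defaultdict

# Default column order of BLAST+ tabular output (-outfmt 6)
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def hits_per_query(handle, max_evalue=1e-5):
    """Group hits by query, keeping only those at or below max_evalue."""
    hits = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        record = dict(zip(COLUMNS, row))
        evalue = float(record["evalue"])
        if evalue <= max_evalue:
            hits[record["qseqid"]].append(
                (record["sseqid"], float(record["pident"]), evalue))
    return dict(hits)


# Hypothetical batch output for two query sequences
SAMPLE = (
    "query1\tsubjA\t98.50\t200\t3\t0\t1\t200\t51\t250\t2e-100\t360\n"
    "query1\tsubjB\t81.00\t100\t19\t0\t5\t104\t10\t109\t0.5\t40.1\n"
    "query2\tsubjC\t100.00\t150\t0\t0\t1\t150\t1\t150\t1e-75\t278\n"
)

if __name__ == "__main__":
    for query, hits in hits_per_query(io.StringIO(SAMPLE)).items():
        print(query, hits)
```

To read a real results file instead of the sample string, pass an open file handle, for example `open("hits.tsv", newline="")`, where hits.tsv is again a hypothetical file name.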

  • Query: the sequence a user wants to get more information about
  • Database: a large set of sequences the query is compared to
  • E-value: a number that assesses how likely a similarity is to arise by chance. It is based on a model of random sequences. The lower the number, the less likely it is that the similarity occurred by chance. The NCBI Blast tutorial explains: “Expect value. The E-value is a parameter that describes the number of hits one can “expect” to see by chance when searching a database of a particular size. It decreases exponentially with the score (S) that is assigned to a match between two sequences. Essentially, the E-value describes the random background noise that exists for matches between sequences. For example, an E-value of 1 assigned to a hit can be interpreted as meaning that in a database of the current size, one might expect to see one match with a similar score simply by chance. This means that the lower the E-value, or the closer it is to “0”, the higher is the “significance” of the match. However, it is important to note that searches with short sequences can be virtually identical and have relatively high E-value. This is because the calculation of the E-value also takes into account the length of the query sequence. This is because shorter sequences have a high probability of occurring in the database purely by chance.” More information is available in the NCBI online books.
  • HSP: high-scoring segment pairs; all segment pairs whose scores cannot be improved by extension or trimming
  • Identities: proportion of identical residues between the query and the hit from the database
  • More terms are defined in the NCBI Blast Handbook and in the general NCBI glossary
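
The relationship between score and E-value described in the list above can be illustrated with the Karlin-Altschul formula E = K * m * n * exp(-λS), where m is the query length, n is the total database length and S is the raw alignment score. The snippet below is a sketch only: the default values of K and λ are illustrative assumptions (the real values depend on the scoring system), the database size is hypothetical, and BLAST itself applies length corrections to m and n that are omitted here.

```python
import math


def expect_value(score, query_len, db_len, k=0.13, lam=0.32):
    """E = K * m * n * exp(-lambda * S); k and lam are assumed values."""
    return k * query_len * db_len * math.exp(-lam * score)


if __name__ == "__main__":
    db_len = 1_000_000_000  # hypothetical database size in residues

    # E falls exponentially as the score rises.
    for score in (20, 40, 80):
        print(score, expect_value(score, query_len=300, db_len=db_len))

    # A short query can only reach a low score even when it matches
    # perfectly, so its E-value stays comparatively high.
    print(expect_value(20, query_len=20, db_len=db_len))
```

Doubling db_len doubles the result, which is why the same alignment receives a different E-value when searched against databases of different sizes.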

Examples of uses of the BLAST program are described here and here on the wiki.

1) Altschul S, Madden T, Schaffer A, Zhang J, Zhang Z, Miller W, Lipman D (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research 25:3389-3402