Comparing genes and genomes using bioinformatics

The BLAST program and its specialized derivative programs (Primer-BLAST, conserved domains, vector contamination, Align two sequences, Global Sequence Alignment Tool, WGS sequences) may be some of the most widely used tools in bioinformatics. The BLAST algorithm 1) is optimized for speed and is used to search protein and DNA databases for sequence similarities.

Access BLAST through the internet or standalone on your computer

The National Center for Biotechnology Information (NCBI) offers a web interface for running BLAST searches against its extensive databases, to find sequences that could be similar to the user’s query sequence.

The BLAST program can also be installed on a personal computer rather than accessed through the internet. This is necessary when one wishes to compare a query sequence to a set of sequences (the database) that is not yet included in GenBank or any other public sequence database (DDBJ, EMBL). The standalone version of BLAST is available on the NCBI website. As of March 2015, the current version is ncbi-blast-2.2.30. NCBI offers information regarding setup, help pages and FAQs: see the setup of Command Line BLAST and the NCBI help pages about BLAST.
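As a rough illustration of the standalone workflow, the sketch below builds the two commands typically involved: formatting a local set of sequences as a database, then searching it with a query. It assumes the BLAST+ command-line programs `makeblastdb` and `blastn` are installed and that the sequences are DNA. The file names (`my_sequences.fasta`, `query.fasta`, `hits.tsv`), the database name `my_db` and the E-value cutoff are hypothetical placeholders, not values from this page.

```python
import shlex


def makeblastdb_command(fasta_path, db_name, dbtype="nucl"):
    """Build the command that formats a FASTA file as a local BLAST database."""
    if dbtype not in ("nucl", "prot"):
        raise ValueError("dbtype must be 'nucl' (DNA) or 'prot' (protein)")
    return ["makeblastdb", "-in", fasta_path, "-dbtype", dbtype, "-out", db_name]


def blastn_command(query_path, db_name, out_path, evalue=1e-5):
    """Build a nucleotide search command that writes tabular output."""
    return [
        "blastn",
        "-query", query_path,
        "-db", db_name,
        "-out", out_path,
        "-evalue", str(evalue),   # hypothetical cutoff, adjust to your needs
        "-outfmt", "6",           # tab-separated table, one line per hit
    ]


# Hypothetical file and database names, for illustration only.
db_cmd = makeblastdb_command("my_sequences.fasta", "my_db")
search_cmd = blastn_command("query.fasta", "my_db", "hits.tsv")

# Print the commands so they can be checked or pasted into a terminal.
print(shlex.join(db_cmd))
print(shlex.join(search_cmd))
```

The commands are only printed here. To run them from Python, each list could be passed to `subprocess.run(cmd, check=True)` once BLAST is installed and the input files exist.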

Batch BLAST

Whether you access BLAST through the NCBI website or from your personal computer, you may need to compare multiple query sequences at a time, in a batch BLAST. In that case, the resulting BLAST output file can be very long. You can save the output file in various formats, for instance .txt, .xml or .csv.
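A long batch output is easier to handle with a small script. The sketch below assumes the output was saved in the BLAST+ tabular format (`-outfmt 6`), whose default columns are listed in `COLUMNS`. It groups hits by query and keeps the hit with the lowest E-value for each query. The sample lines are invented for illustration and are not real BLAST results.

```python
import csv
import io
from collections import defaultdict

# Default column order of BLAST+ tabular output (-outfmt 6).
COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def parse_tabular(handle):
    """Read tab-separated BLAST hits and group them by query identifier."""
    hits = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        record = dict(zip(COLUMNS, row))
        for field in ("pident", "evalue", "bitscore"):
            record[field] = float(record[field])
        hits[record["qseqid"]].append(record)
    return hits


def best_hits(hits):
    """For each query, keep the hit with the smallest E-value."""
    return {q: min(records, key=lambda r: r["evalue"]) for q, records in hits.items()}


# Invented example lines (not real results), one hit per line.
sample_rows = [
    ["query1", "subjA", "98.50", "200", "3", "0", "1", "200", "51", "250", "2e-100", "360"],
    ["query1", "subjB", "85.00", "120", "18", "0", "10", "129", "5", "124", "1e-20", "95.0"],
    ["query2", "subjC", "100.00", "50", "0", "0", "1", "50", "1", "50", "3e-18", "93.5"],
]
sample_text = "\n".join("\t".join(row) for row in sample_rows)

hits = parse_tabular(io.StringIO(sample_text))
best = best_hits(hits)
for query, hit in best.items():
    print(query, hit["sseqid"], hit["pident"], hit["evalue"])
```

For a real file, `io.StringIO(sample_text)` could be replaced by `open("hits.tsv", newline="")`, where `hits.tsv` is a hypothetical output file name.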

Glossary
  • Query: the sequence a user wants to get more information about
  • Database: a large set of sequences the query is compared to
  • E-value: a number used to assess how likely it is that a similarity arose by chance. It relies on a model of random sequences. The smaller the number, the less likely it is that the similarity occurred by chance. The NCBI Blast tutorial explains: “Expect value. The E-value is a parameter that describes the number of hits one can “expect” to see by chance when searching a database of a particular size. It decreases exponentially with the score (S) that is assigned to a match between two sequences. Essentially, the E-value describes the random background noise that exists for matches between sequences. For example, an E-value of 1 assigned to a hit can be interpreted as meaning that in a database of the current size, one might expect to see one match with a similar score simply by chance. This means that the lower the E-value, or the closer it is to “0”, the higher is the “significance” of the match. However, it is important to note that searches with short sequences can be virtually identical and have relatively high E-value. This is because the calculation of the E-value also takes into account the length of the query sequence. This is because shorter sequences have a high probability of occurring in the database purely by chance.” More information is available in the NCBI online books.
  • HSP: high-scoring segment pairs; all segment pairs whose scores cannot be improved by extension or trimming
  • Identities: the proportion of identical residues between the query and the hit from the database
  • More terms are defined in the NCBI Blast Handbook and in the general NCBI glossary
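The dependence of the E-value on score and database size can be sketched numerically. The example below assumes the commonly cited relation between the E-value and the normalized bit score S', namely E = m × n × 2^(−S'), where m is the query length and n the total database length. It ignores the length corrections that BLAST applies in practice, so the numbers are only indicative. The function name `expected_hits` and all input values are hypothetical.

```python
def expected_hits(bit_score, query_length, database_length):
    """Approximate E-value from a bit score: E = m * n * 2**(-S').

    Simplified sketch: real BLAST uses corrected ("effective") lengths.
    """
    return query_length * database_length * 2.0 ** (-bit_score)


# Hypothetical query of 100 residues against a database of one million residues.
e_small_db = expected_hits(bit_score=40, query_length=100, database_length=1_000_000)

# The same score in a database twice as large gives twice the E-value.
e_large_db = expected_hits(bit_score=40, query_length=100, database_length=2_000_000)

# Each additional bit of score halves the E-value (exponential decrease).
e_higher_score = expected_hits(bit_score=41, query_length=100, database_length=1_000_000)

print(e_small_db, e_large_db, e_higher_score)
```

This illustrates why the same alignment score is less significant in a larger database, and why small increases in score lower the E-value sharply.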

Examples of uses of the BLAST program are described here and here on the wiki.

1)
Altschul S, Madden T, Schaffer A, Zhang J, Zhang Z, Miller W, Lipman D (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research 25:3389-3402