Sequence searches using standalone BLAST

In January 2011, Annie Archambault (research professional at the QCBS) and Christopher Cameron (professor at Université de Montréal) set up a sequence similarity search between 451 spicule matrix proteins from the sea urchin (Strongylocentrotus purpuratus, an echinoderm) that are involved in biomineralization 1) and the genomes of Saccoglossus kowalevskii (a hemichordate that forms biominerals) and Ciona (a urochordate that does not form biominerals).

The Saccoglossus genome is only partially sequenced and is not available from GenBank, so the BLAST algorithm has to be run locally (standalone), which is a first challenge. A second challenge lies in organizing the BLAST output, and then in sorting the large number of search results (one similarity search for each of the 451 sea urchin protein sequences) into functional categories.

Parsing the BLAST output file

The output file from the 451 query sequences was too large to be easily interpreted. We developed a small R script that parses the BLAST output file and keeps only the best match for each query sequence: the one hit with the lowest e-value. The BLAST parser in R for tab-delimited BLAST result files is available here, and was inspired by a forum post.
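The filtering idea can also be sketched in a few lines of shell and awk. This is not the R script mentioned above, only an illustration of the same technique. It assumes the usual 12-column tab-delimited BLAST layout, with the query ID in column 1 and the e-value in column 11; the function name `best_hits` is made up for this example.

```shell
# Sketch only: keep, for each query, the hit with the lowest e-value.
# Assumption: 12-column tab-delimited BLAST output,
# column 1 = query ID, column 11 = e-value.
best_hits() {
    # $1 = tab-delimited BLAST result file
    awk -F '\t' '
        /^#/ || NF < 11 { next }       # skip comment, empty or short lines
        !($1 in best) {                # first hit seen for this query
            order[++n] = $1            # remember the order of first appearance
            best[$1] = $11 + 0         # "+ 0" forces a numeric e-value (1e-50, 0.0, ...)
            line[$1] = $0
            next
        }
        ($11 + 0) < best[$1] {         # strictly lower e-value: replace the stored hit
            best[$1] = $11 + 0
            line[$1] = $0
        }
        END { for (i = 1; i <= n; i++) print line[order[i]] }
    ' "$1"
}

# Example call (file names are placeholders):
#   best_hits blast_results.tsv > best_hits.tsv
```

When two hits for the same query have the same e-value, this sketch keeps the one that appears first in the file. The R script may break ties differently.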

Method

Here are the steps to quickly parse that large file:

  • Save your BLAST results in tab-delimited format
  • Install R on your computer from the R homepage; see our Wiki section on R for more resources
  • When installing on a Mac, you first need to install a GNU Fortran compiler (available from the R tools page) and Xcode, which you can find on your original installation CD or online at Mac Developers tools
  • Make a folder that contains your BLAST output file and the R script
  • In a terminal, go to the directory that contains the two files
  • Run the script by typing:

Rscript unique_lowest_Evalue.R <inputfile> <outputfile>

where you replace <inputfile> with the name of your BLAST result file, and <outputfile> with the name you want for your output file.

Warnings:

  1. Make sure the number and the order of arguments are correct because an existing file will be overwritten if given the same name as your new output file.
  2. The script currently does not keep the column headings; any improvement is welcome.
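Both warnings can be worked around from the terminal. The sketch below is a separate helper, not part of the R script, and its name `write_with_header` is made up for this example. It refuses to overwrite an existing file and puts a heading line in front of the parsed hits. The 12 column names are the default fields of tabular BLAST output; this is an assumption about your file, so adjust them if your columns differ.

```shell
# Sketch only: copy parsed hits to a new file with column headings,
# without overwriting an existing file.
# Assumption: the input has the 12 default tabular BLAST columns.
write_with_header() {
    # $1 = parsed hits file (input), $2 = output file to create
    if [ -e "$2" ]; then
        echo "Refusing to overwrite existing file: $2" >&2
        return 1
    fi
    {
        # heading line, tab-separated like the data
        printf 'qseqid\tsseqid\tpident\tlength\tmismatch\tgapopen\tqstart\tqend\tsstart\tsend\tevalue\tbitscore\n'
        cat "$1"
    } > "$2"
}

# Example call (file names are placeholders):
#   write_with_header best_hits.tsv best_hits_with_header.tsv
```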

References

1) Mann K, Wilt FH, Poustka AJ (2010) Proteomic analysis of sea urchin (Strongylocentrotus purpuratus) spicule matrix. Proteome Science 8:33. DOI 10.1186/1477-5956-8-33