Differences

This shows you the differences between two versions of the page.

standalone_blast [2014/11/04 15:59]
sebastien.renaut
standalone_blast [2014/11/21 14:43] (current)
sebastien.renaut
Line 1: Line 1:
- +===== Install BLAST+ locally =====
- +Find and install the latest version that corresponds to your operating system (Mac, Windows, Linux): ftp://ftp.ncbi.nlm.nih.gov/blast/executables/LATEST/
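Once BLAST+ is installed, a quick way to confirm that the programs used on this page are reachable from the terminal is sketched below. This is only an illustration: `check_tools` is a hypothetical helper name, not something shipped with BLAST+.

```shell
# Report which of the given programs are missing from PATH.
# Returns 0 only if every program was found.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# The BLAST+ programs used on this page.
check_tools blastp makeblastdb update_blastdb.pl || echo "Add the BLAST+ bin directory to your PATH."
```

If everything is found, `blastp -version` prints the installed version.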
-===== Sequence searches using standalone ​BLAST ===== +
-In January 2011, Annie Archambault (research professional at the [[http://qcbs.ca/|QCBS]]) and [[http://www.bio.umontreal.ca/personnel/CAMERON_Christopher/index.html|Christopher Cameron]] (professor at Université de Montréal) set up a sequence similarity between 451 spicule matrix proteins from the sea urchin (Strongylocentrotus purpuratus, an echinoderm) that are involved in biomineralization ((Mann K, Wilt FH, Poustka AJ (2010) Proteomic analysis of sea urchin (Strongylocentrotus purpuratus) spicule matrix. Proteome Science 8 DOI 33 10.1186/1477-5956-8-33)) and the genome of //Saccoglossus kowlevskii// (a hemichordate that forms biominerals) and //Ciona// (a hemichordate that do not form biominerals). +
- +
-That the //​Saccoglossus//​ genome is partially sequenced, but not available from GenBank and requires to run the BLAST algorithm locally (standalone) represents a challenge. Another challenge resides in organizing the BLAST output, and then in organizing the large number of searches results (451 similarity searches for each sea urchin protein sequences) into functional categories. +
- +
- +
-===Parsing the Blast output file=== +
-The output file from the 451 sequences was too large to be easily understood. We developed a small script in R to parse the blast output file, and kept only the best match: the one hit that has the minimum e-value, for each query sequence. The {{:unique_lowest_evalue.r|Blast parser in R}} for tab delimited blast result files is available here, and was inspired by a [[http://seqanswers.com/forums/showthread.php?t=9052|forum post]]. +
- +
- +
-**Method** +
-Here are the steps to quickly parse that large file +
-  *Save your blast result in tab delimited format +
-  *Install the R package on you computer [[http://​cran.r-project.org/|from the R homepage]], see our [[http://qcbs.ca/wiki/​resources_for_r|Wiki section on R]] for more resources.  +
-  *When installing on a Mac, you need to install beforehand a GNU Fortran compiler (that you can find on the [[http://​r.research.att.com/​tools/​|R tools page]]) and XCode, which you can find on your original installation CD or online at [[http://​developer.apple.com/​technologies/​xcode.html|Mac Developers tools]].  +
-  *Make a folder that will include your blast output file and the R script +
-  *In a terminal, go to the directory where are the two files  +
-  *Run the script by typing:  +
-Rscript unique_lowest_Evalue.R <​inputfile>​ <​outputfile>​  +
- +
-where you will type the name of your blast result file instead of <​inputfile>,​ and you will type a name you wish for you output file instead of <​outputfile>​  +
- +
-Warnings:  +
-  -Make sure the number and the order of arguments are correct because an existing file will be overwritten if given the same name as your new output file.  +
-  -That script currently does not keep the headings of the columns, any improvement is welcome.  +
- +
-===== More examples on how to use standalone BLAST+ ===== +
-This assumes that you've already installed BLAST+ locally.+
  
 ### DOWNLOAD A PREFORMATTED DATABASE FROM NCBI ###
 +  *Download the protein nr database
 <code>
-$ update_blastdb.pl nr #download the protein nr database
+$ update_blastdb.pl nr
 </code>
-#the database is in *tar.gz format which needs to be unzipped like this:
+  *The database is downloaded as *.tar.gz archives, which need to be extracted one at a time like this:
 <code>
 $ for f in *.tar.gz; do tar -xvzf "$f"; done
Line 43: Line 16:
 <code>
 $ makeblastdb -dbtype nucl -in genes1_fas_pathway.txt
 +</code>
 OR
 +<code>
 $ makeblastdb -dbtype prot -in Plantcyc_Enzymes_Without_Tags_BLASTset.fasta
 </code>
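Before building a database, it can help to confirm that the input file really contains FASTA records. The sketch below is an illustration rather than part of the original procedure: `count_fasta_records` is a hypothetical helper name and the demo sequences are made up. It counts header lines, which start with `>`.

```shell
# Count the sequences in a FASTA file: each record has one header line starting with ">".
count_fasta_records() {
    grep -c '^>' "$1"
}

# Demo on a small made-up protein file (hypothetical sequences).
demo=$(mktemp)
printf '>seq1\nMKTAYIAK\n>seq2\nMSLLTEVE\n' > "$demo"
echo "records: $(count_fasta_records "$demo")"
rm -f "$demo"
```

A count of zero means makeblastdb would have nothing to index, so it is worth running `count_fasta_records genes1_fas_pathway.txt` before the makeblastdb commands above.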
Line 77: Line 52:
 ### BLASTp with gi restriction ###
   *protein against protein
-  *gi_viriplantae contains a list of all GI for all viriplantae. This speeds things up a lot since it restrict the search to plants only.
+  *gi_viridiplantae contains a list of all GIs for all Viridiplantae. This speeds things up a lot since it restricts the search to plants only.
 <code>
-$ blastp -query mygenes.fasta -db ~/blast/database/ncbi_nr/nr -gilist ~/blast/database/green_plants/gi_viriplantae -outfmt 4 -out mygenes.blast.out
+$ blastp -query mygenes.fasta -db ~/blast/database/ncbi_nr/nr -gilist gi_viridiplantae -outfmt 4 -out mygenes.blast.out
 </code>
 +  *To get a GI list, go to http://www.ncbi.nlm.nih.gov/
 +  *Search for "viridiplantae" in the "protein" database.
 +  *Download all GIs (top right: Send to > choose destination File > format GI List > Create File)
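A downloaded GI list is expected to be a plain text file with one numeric GI per line. The sketch below checks that before the file is passed to `-gilist`. It assumes the plain-text form of the list; `check_gilist` is a hypothetical helper name and the GI numbers in the demo are made up.

```shell
# Print any line of a GI list that is not a plain positive integer,
# prefixed with its line number.
# Exit status is 0 when the file is clean, 1 when bad lines were found.
check_gilist() {
    if grep -nvE '^[0-9]+$' "$1"; then
        return 1
    fi
    return 0
}

# Demo on a small made-up list.
demo=$(mktemp)
printf '15228391\n15228392\n' > "$demo"
if check_gilist "$demo"; then
    echo "GI list looks clean"
fi
rm -f "$demo"
```

Lines with Windows line endings are also reported, since the trailing carriage return keeps them from matching the pattern.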
  
-===References===
+===== More examples =====
 +### Sequence searches using standalone BLAST ###
 +In January 2011, Annie Archambault (research professional at the [[http://qcbs.ca/|QCBS]]) and [[http://www.bio.umontreal.ca/personnel/CAMERON_Christopher/index.html|Christopher Cameron]] (professor at Université de Montréal) set up a sequence similarity search between 451 spicule matrix proteins from the sea urchin (//Strongylocentrotus purpuratus//, an echinoderm) that are involved in biomineralization ((Mann K, Wilt FH, Poustka AJ (2010) Proteomic analysis of sea urchin (Strongylocentrotus purpuratus) spicule matrix. Proteome Science 8:33. DOI 10.1186/1477-5956-8-33)) and the genomes of //Saccoglossus kowalevskii// (a hemichordate that forms biominerals) and //Ciona// (a urochordate that does not form biominerals).
 
 +The //Saccoglossus// genome is partially sequenced but not available from GenBank, so the BLAST algorithm has to be run locally (standalone), which is a first challenge. A second challenge lies in organizing the BLAST output, and then in sorting the large number of search results (one similarity search for each of the 451 sea urchin protein sequences) into functional categories.
  
  
 +===Parsing the BLAST output file===
 +The output file from the 451 searches was too large to read easily. We developed a small R script to parse the BLAST output file and keep only the best match for each query sequence: the hit with the lowest e-value. The {{:unique_lowest_evalue.r|BLAST parser in R}} for tab-delimited BLAST result files is available here; it was inspired by a [[http://seqanswers.com/forums/showthread.php?t=9052|forum post]].
 +
 +
 +**Method**
 +Here are the steps to quickly parse that large file:
 +  *Save your BLAST result in tab-delimited format.
 +  *Install R on your computer [[http://cran.r-project.org/|from the R homepage]]; see our [[http://qcbs.ca/wiki/resources_for_r|Wiki section on R]] for more resources.
 +  *When installing on a Mac, you first need to install a GNU Fortran compiler (available from the [[http://r.research.att.com/tools/|R tools page]]) and XCode, which you can find on your original installation CD or online at [[http://developer.apple.com/technologies/xcode.html|Mac Developers tools]].
 +  *Make a folder containing your BLAST output file and the R script.
 +  *In a terminal, go to the directory that contains the two files.
 +  *Run the script by typing:
 +Rscript unique_lowest_Evalue.R <inputfile> <outputfile>
 +
 +where <inputfile> is the name of your BLAST result file and <outputfile> is the name you want for your output file.
 +
 +Warnings:
 +  -Make sure the number and order of the arguments are correct, because an existing file will be overwritten if it has the same name as your new output file.
 +  -The script currently does not keep the column headings; any improvement is welcome.
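The same "best hit per query" idea can be sketched in the shell, as an alternative to the R script rather than a copy of it. The sketch assumes the default tabular column layout (`-outfmt 6`), where the query ID is column 1 and the e-value is column 11; `best_hits` is a hypothetical helper name and the demo rows are made up.

```shell
# Keep, for each query (column 1), the hit with the smallest e-value (column 11).
best_hits() {
    tab=$(printf '\t')
    # Sort by query ID, then by e-value as a general number (so 1e-50 sorts
    # before 1e-5), and let awk keep only the first line seen for each query.
    sort -t "$tab" -k1,1 -k11,11g "$1" | awk -F '\t' '!seen[$1]++'
}

# Demo on three made-up hits: two for query q1, one for query q2.
demo=$(mktemp)
printf 'q1\ts1\t90\t100\t0\t0\t1\t100\t1\t100\t1e-5\t50\n'   >  "$demo"
printf 'q1\ts2\t99\t100\t0\t0\t1\t100\t1\t100\t1e-50\t200\n' >> "$demo"
printf 'q2\ts3\t80\t100\t0\t0\t1\t100\t1\t100\t0.001\t30\n'  >> "$demo"
best_hits "$demo"
rm -f "$demo"
```

The function writes to standard output, so nothing is overwritten unless you redirect it yourself, for example `best_hits mygenes.blast.tab > mygenes.best.tab` (both file names are placeholders).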
 +
 +===References===