Differences

This shows you the differences between two versions of the page.

Link to this comparison view

Both sides previous revision Previous revision
Next revision
Previous revision
standalone_blast [2014/11/04 15:56]
sebastien.renaut
standalone_blast [2014/11/21 14:43] (current)
sebastien.renaut
Line 1: Line 1:
- +===== Install ​BLASTlocally. ===== 
- +Find and install ​the latest version ​that corresponds to your operating system (MACWindowsLinux)ftp://ftp.ncbi.nlm.nih.gov/blast/executables/LATEST/
-===Sequence searches using standalone BLAST==+
-In January 2011, Annie Archambault (research professional at the [[http://​qcbs.ca/​|QCBS]]) and [[http://​www.bio.umontreal.ca/​personnel/​CAMERON_Christopher/​index.html|Christopher Cameron]] (professor at Université de Montréal) set up a sequence similarity between 451 spicule matrix proteins from the sea urchin (Strongylocentrotus purpuratus, an echinoderm) that are involved in biomineralization ((Mann K, Wilt FH, Poustka AJ (2010) Proteomic analysis of sea urchin (Strongylocentrotus purpuratus) spicule matrix. Proteome Science 8 DOI 33 10.1186/​1477-5956-8-33)) and the genome of //​Saccoglossus kowlevskii//​ (a hemichordate that forms biominerals) and //Ciona// (a hemichordate that do not form biominerals). ​  +
- +
-That the //​Saccoglossus//​ genome is partially sequenced, but not available from GenBank and requires to run the BLAST algorithm ​locally ​(standalone) represents a challengeAnother challenge resides in organizing the BLAST output, and then in organizing the large number of searches results (451 similarity searches for each sea urchin protein sequences) into functional categories. +
- +
- +
-==Parsing the Blast output file== +
-The output file from the 451 sequences was too large to be easily understood. We developed a small script in R to parse the blast output file, and kept only the best match: ​the one hit that has the minimum e-valuefor each query sequence. The {{:​unique_lowest_evalue.r|Blast parser in R}} for tab delimited blast result files is available hereand was inspired by a [[http://​seqanswers.com/​forums/​showthread.php?​t=9052|forum post]]. ​  +
- +
- +
-**Method** +
-Here are the steps to quickly parse that large file +
-  *Save your blast result in tab delimited format +
-  *Install the R package on you computer [[http://cran.r-project.org/|from the R homepage]], see our [[http://​qcbs.ca/​wiki/​resources_for_r|Wiki section on R]] for more resources +
-  *When installing on a Mac, you need to install beforehand a GNU Fortran compiler (that you can find on the [[http://r.research.att.com/tools/|R tools page]]) and XCode, which you can find on your original installation CD or online at [[http://​developer.apple.com/​technologies/​xcode.html|Mac Developers tools]].  +
-  *Make a folder that will include your blast output file and the R script +
-  *In a terminal, go to the directory where are the two files  +
-  *Run the script by typing:  +
-Rscript unique_lowest_Evalue.R <​inputfile>​ <​outputfile>​  +
- +
-where you will type the name of your blast result file instead of <​inputfile>,​ and you will type a name you wish for you output file instead of <​outputfile>​  +
- +
-Warnings:  +
-  -Make sure the number and the order of arguments are correct because an existing file will be overwritten if given the same name as your new output file.  +
-  -That script currently does not keep the headings of the columns, any improvement is welcome.  +
- +
-=== Examples on how to use standalone BLAST+ === +
-This assumes that you've already installed BLAST+ locally.+
  
 ### DOWNLOAD A PREFORMATED DATABASE FROM NCBI### ### DOWNLOAD A PREFORMATED DATABASE FROM NCBI###
 +  *download the protein nr database
 <​code>​ <​code>​
-$ update_blastdb.pl nr #download the protein nr database+$ update_blastdb.pl nr
 </​code>​ </​code>​
-#the database is in *tar.gz format which needs to be unzipped like this: +  *The database is in *tar.gz format which needs to be unzipped like this: 
 <​code>​ <​code>​
 $ tar -xvzf *tar.gz $ tar -xvzf *tar.gz
Line 40: Line 13:
  
 ### MAKE A REFERENCE DATABASE ### ### MAKE A REFERENCE DATABASE ###
-#this is to generate your own reference database from your own fasta file (either nucleotide or protein)+  *this is to generate your own reference database from your own fasta file (either nucleotide or protein)
 <​code>​ <​code>​
-$makeblastdb -dbtype nucl -in genes1_fas_pathway.txt+$ makeblastdb -dbtype nucl -in genes1_fas_pathway.txt 
 +</​code>​
 OR OR
-$makeblastdb -dbtype prot -in Plantcyc_Enzymes_Without_Tags_BLASTset.fasta+<​code>​ 
 +$ makeblastdb -dbtype prot -in Plantcyc_Enzymes_Without_Tags_BLASTset.fasta
 </​code>​ </​code>​
  
 ### BLASTn ### ### BLASTn ###
-#BLASTn (nucleotide against nucleotide) with discontiguous megablast, usually for distant interspecific relationships. +  *BLASTn (nucleotide against nucleotide) with discontiguous megablast, usually for distant interspecific relationships. 
-#Mininum e-value to report result set as 1e-10 +  *Mininum e-value to report result set as 1e-10 
-#report a maximum of 100 target sequences +  *report a maximum of 100 target sequences 
-#output format is in XML format (5), (tabular ouput: 6, 0: “web-like” output, default) +  *output format is in XML format (5), (tabular ouput: 6, 0: “web-like” output, default) 
-#blast is run on 12 CPUs simultaneously+  *blast is run on 12 CPUs simultaneously
 <​code>​ <​code>​
 $ blastn -task dc-megablast -evalue 1e-10 -max_target_seqs 100 -query mygenes.fasta -db nr -num_threads 12 -outfmt 5 -out mygenes.blast.out $ blastn -task dc-megablast -evalue 1e-10 -max_target_seqs 100 -query mygenes.fasta -db nr -num_threads 12 -outfmt 5 -out mygenes.blast.out
 </​code>​ </​code>​
 ### BLASTn ### ### BLASTn ###
-#with tabular output (outfmt 6)+  *with tabular output (outfmt 6)
 <​code>​ <​code>​
 $ blastn -task dc-megablast -evalue 1e-10 -max_target_seqs 5 -query mygenes.fasta -db database/​ara_pathway/​genes1_fas_pathway.txt -num_threads 4 -outfmt 6 -out mygenes.blast.out $ blastn -task dc-megablast -evalue 1e-10 -max_target_seqs 5 -query mygenes.fasta -db database/​ara_pathway/​genes1_fas_pathway.txt -num_threads 4 -outfmt 6 -out mygenes.blast.out
Line 63: Line 38:
  
 ### BLASTn – short sequences ### ### BLASTn – short sequences ###
-#short sequences option +  *short sequences option 
-#run it with a nohup command (i.e run it in the background and continue performing other tasks without the risk of session interrupting)+  *run it with a nohup command (i.e run it in the background and continue performing other tasks without the risk of session interrupting)
 <​code>​ <​code>​
 $ nohup blastn -task blastn-short -evalue 1e-5 -max_target_seqs 20 -query mygenes.fasta -db ../​../​blast/​database/​all_mito_sequences_e50 -num_threads 8 -outfmt 6 -out mygenes.blast.out > log & $ nohup blastn -task blastn-short -evalue 1e-5 -max_target_seqs 20 -query mygenes.fasta -db ../​../​blast/​database/​all_mito_sequences_e50 -num_threads 8 -outfmt 6 -out mygenes.blast.out > log &
Line 70: Line 45:
  
 ### BLASTx with nohup### ### BLASTx with nohup###
-#Blastx translate your sequence in all 6 possible framework and searches a protein databases.+  *Blastx translate your sequence in all 6 possible framework and searches a protein databases.
 <​code>​ <​code>​
 $ nohup blastx -evalue 1e-10 -max_target_seqs 5 -query mygenes.fasta -db database/​ara_pathway/​Plantcyc_Enzymes_Without_Tags_BLASTset.fasta -num_threads 6 -outfmt 6 -out mygenes.blast.out > log & $ nohup blastx -evalue 1e-10 -max_target_seqs 5 -query mygenes.fasta -db database/​ara_pathway/​Plantcyc_Enzymes_Without_Tags_BLASTset.fasta -num_threads 6 -outfmt 6 -out mygenes.blast.out > log &
Line 76: Line 51:
  
 ### BLASTp with gi restriction###​ ### BLASTp with gi restriction###​
-#protein against protein +  *protein against protein 
-#​gi_viriplantae ​contains a list of all GI for all viriplantae. This speeds things up a lot since it restrict the search to plants only.+  ​*gi_viridiplantae ​contains a list of all GI for all viriplantae. This speeds things up a lot since it restrict the search to plants only.
 <​code>​ <​code>​
-$ blastp -query mygenes.fasta -db ~/​blast/​database/​ncbi_nr/​nr -gilist ​~/​blast/​database/​green_plants/​gi_viriplantae ​-outfmt 4 -out mygenes.blast.out+$ blastp -query mygenes.fasta -db ~/​blast/​database/​ncbi_nr/​nr -gilist ​gi_viridiplantae ​-outfmt 4 -out mygenes.blast.out
 </​code>​ </​code>​
 +  *To get a GI list, go to http://​www.ncbi.nlm.nih.gov/​
 +  *Search for "​viridiplantae"​ in "​protein"​ database.
 +  *Download all GI (top right, >send to, >choose destination file, >format gilist, >create file)
  
-===References===+===== More examples ​===== 
 +###Sequence searches using standalone BLAST 
 +In January 2011, Annie Archambault (research professional at the [[http://​qcbs.ca/​|QCBS]]) and [[http://​www.bio.umontreal.ca/​personnel/​CAMERON_Christopher/​index.html|Christopher Cameron]] (professor at Université de Montréal) set up a sequence similarity between 451 spicule matrix proteins from the sea urchin (Strongylocentrotus purpuratus, an echinoderm) that are involved in biomineralization ((Mann K, Wilt FH, Poustka AJ (2010) Proteomic analysis of sea urchin (Strongylocentrotus purpuratus) spicule matrix. Proteome Science 8 DOI 33 10.1186/​1477-5956-8-33)) and the genome of //​Saccoglossus kowlevskii//​ (a hemichordate that forms biominerals) and //Ciona// (a hemichordate that do not form biominerals).  ​
  
 +That the //​Saccoglossus//​ genome is partially sequenced, but not available from GenBank and requires to run the BLAST algorithm locally (standalone) represents a challenge. Another challenge resides in organizing the BLAST output, and then in organizing the large number of searches results (451 similarity searches for each sea urchin protein sequences) into functional categories.
  
  
 +===Parsing the Blast output file===
 +The output file from the 451 sequences was too large to be easily understood. We developed a small script in R to parse the blast output file, and kept only the best match: the one hit that has the minimum e-value, for each query sequence. The {{:​unique_lowest_evalue.r|Blast parser in R}} for tab delimited blast result files is available here, and was inspired by a [[http://​seqanswers.com/​forums/​showthread.php?​t=9052|forum post]].  ​
 +
 +
 +**Method**
 +Here are the steps to quickly parse that large file
 +  *Save your blast result in tab delimited format
 +  *Install the R package on you computer [[http://​cran.r-project.org/​|from the R homepage]], see our [[http://​qcbs.ca/​wiki/​resources_for_r|Wiki section on R]] for more resources. ​
 +  *When installing on a Mac, you need to install beforehand a GNU Fortran compiler (that you can find on the [[http://​r.research.att.com/​tools/​|R tools page]]) and XCode, which you can find on your original installation CD or online at [[http://​developer.apple.com/​technologies/​xcode.html|Mac Developers tools]]. ​
 +  *Make a folder that will include your blast output file and the R script
 +  *In a terminal, go to the directory where are the two files 
 +  *Run the script by typing: ​
 +Rscript unique_lowest_Evalue.R <​inputfile>​ <​outputfile> ​
 +
 +where you will type the name of your blast result file instead of <​inputfile>,​ and you will type a name you wish for you output file instead of <​outputfile> ​
 +
 +Warnings: ​
 +  -Make sure the number and the order of arguments are correct because an existing file will be overwritten if given the same name as your new output file. 
 +  -That script currently does not keep the headings of the columns, any improvement is welcome. ​
 +
 +===References===