$update_blastdb.pl nr • The database is in *tar.gz format which needs to be unzipped like this: $ tar -xvzf *tar.gz

### MAKE A REFERENCE DATABASE ###

• this is to generate your own reference database from your own fasta file (either nucleotide or protein)
$makeblastdb -dbtype nucl -in genes1_fas_pathway.txt OR $ makeblastdb -dbtype prot -in Plantcyc_Enzymes_Without_Tags_BLASTset.fasta

### BLASTn ###

• BLASTn (nucleotide against nucleotide) with discontiguous megablast, usually for distant interspecific relationships.
• Mininum e-value to report result set as 1e-10
• report a maximum of 100 target sequences
• output format is in XML format (5), (tabular ouput: 6, 0: “web-like” output, default)
• blast is run on 12 CPUs simultaneously
$blastn -task dc-megablast -evalue 1e-10 -max_target_seqs 100 -query mygenes.fasta -db nr -num_threads 12 -outfmt 5 -out mygenes.blast.out ### BLASTn ### • with tabular output (outfmt 6) $ blastn -task dc-megablast -evalue 1e-10 -max_target_seqs 5 -query mygenes.fasta -db database/ara_pathway/genes1_fas_pathway.txt -num_threads 4 -outfmt 6 -out mygenes.blast.out

### BLASTn – short sequences ###

• short sequences option
• run it with a nohup command (i.e run it in the background and continue performing other tasks without the risk of session interrupting)
$nohup blastn -task blastn-short -evalue 1e-5 -max_target_seqs 20 -query mygenes.fasta -db ../../blast/database/all_mito_sequences_e50 -num_threads 8 -outfmt 6 -out mygenes.blast.out > log & ### BLASTx with nohup### • Blastx translate your sequence in all 6 possible framework and searches a protein databases. $ nohup blastx -evalue 1e-10 -max_target_seqs 5 -query mygenes.fasta -db database/ara_pathway/Plantcyc_Enzymes_Without_Tags_BLASTset.fasta -num_threads 6 -outfmt 6 -out mygenes.blast.out > log &

### BLASTp with gi restriction###

• protein against protein
• gi_viridiplantae contains a list of all GI for all viriplantae. This speeds things up a lot since it restrict the search to plants only.
\$ blastp -query mygenes.fasta -db ~/blast/database/ncbi_nr/nr -gilist gi_viridiplantae -outfmt 4 -out mygenes.blast.out
• To get a GI list, go to http://www.ncbi.nlm.nih.gov/
• Search for “viridiplantae” in “protein” database.
• Download all GI (top right, >send to, >choose destination file, >format gilist, >create file)

###Sequence searches using standalone BLAST In January 2011, Annie Archambault (research professional at the QCBS) and Christopher Cameron (professor at Université de Montréal) set up a sequence similarity between 451 spicule matrix proteins from the sea urchin (Strongylocentrotus purpuratus, an echinoderm) that are involved in biomineralization 1) and the genome of Saccoglossus kowlevskii (a hemichordate that forms biominerals) and Ciona (a hemichordate that do not form biominerals).

That the Saccoglossus genome is partially sequenced, but not available from GenBank and requires to run the BLAST algorithm locally (standalone) represents a challenge. Another challenge resides in organizing the BLAST output, and then in organizing the large number of searches results (451 similarity searches for each sea urchin protein sequences) into functional categories.

Parsing the Blast output file

The output file from the 451 sequences was too large to be easily understood. We developed a small script in R to parse the blast output file, and kept only the best match: the one hit that has the minimum e-value, for each query sequence. The Blast parser in R for tab delimited blast result files is available here, and was inspired by a forum post.

Method Here are the steps to quickly parse that large file

• Save your blast result in tab delimited format
• Install the R package on you computer from the R homepage, see our Wiki section on R for more resources.
• When installing on a Mac, you need to install beforehand a GNU Fortran compiler (that you can find on the R tools page) and XCode, which you can find on your original installation CD or online at Mac Developers tools.
• Make a folder that will include your blast output file and the R script
• In a terminal, go to the directory where are the two files
• Run the script by typing:

Rscript unique_lowest_Evalue.R <inputfile> <outputfile>

where you will type the name of your blast result file instead of <inputfile>, and you will type a name you wish for you output file instead of <outputfile>

Warnings:

1. Make sure the number and the order of arguments are correct because an existing file will be overwritten if given the same name as your new output file.
2. That script currently does not keep the headings of the columns, any improvement is welcome.

References

1)
Mann K, Wilt FH, Poustka AJ (2010) Proteomic analysis of sea urchin (Strongylocentrotus purpuratus) spicule matrix. Proteome Science 8 DOI 33 10.1186/1477-5956-8-33