====Mining genomic data for tandem repeats==== ===Abstract=== Microsatellites are short sequences repeated in tandem in a genome, where the unit is generally from 2 to 6 base pairs (bp). These genomic regions tend to rapidly accumulate mutations in the form of duplication and deletion of repeat units, due to specific molecular mechanisms such as replication slippage, unequal crossing-over, unequal sister chromatide exchange and slipped-strand mis-pairing. Subsequent point mutation will also occur during DNA replication, repair, or recombination. The elevated length polymorphism (in base pairs) level makes the microsatellites appropriate markers for estimating within-population genetic diversity, which is an important indicator in biodiversity science. Traditionally, microsatellites loci discovery involved long and costly molecular biology experiments using magnetic beads for enrichment in specific repeats, followed by cloning, clone screening and sequencing. The level of polymorphism in the discovered loci is then assessed by PCR amplification from several individual of a population using primers specific for conserved regions flanking the microsatellite loci, followed by length estimation in a capillary-based or acrylamide gel electrophoresis instrument. Depending on the flanking regions mutation level, the primers may amplify successfully in genomes of related species of the same genus, or even in more distantly related genera. With the advent of genome sequencing projects, many bioinformatics tools have been developed for mining available sequence data for presence of microsatellite loci. These tools have methodological bias, and mining one set of sequences with different algorithms commonly results in widely different results (detected loci). In all case, they represent a useful, fast and low-cost alternative for discovering novel microsatellites loci in a genome. A few of these bioinformatics tools are presented here. Table 1 Non-exhaustive list of bioinformatic algorithms and pipelines available for detecting tandem repeat sites from genomic sequence data; updated from Sharma et al 2007(([[http://linkinghub.elsevier.com/retrieve/pii/S0167779907002399|Sharma, P. C., Grover, A., and Kahl, G. (2007). Mining microsatellites in eukaryotic genomes. Trends in Biotechnology 25, 490-498.]])) and Lerat 2009(([[http://www.nature.com/doifinder/10.1038/hdy.2009.165|Lerat, E. (2009). Identifying repeats and transposable elements in sequenced genomes: how to find your way through the dense forest of programs. Heredity 104, 520-533.]])). WARNING: The table is still in construction, any inaccuracy can be reported to the [[mailto:annie.archambault@mail.mcgill.ca|QCBS team]]. ^ Name ^ Year released ^ Latest update ^ Number of citations ^ Original publication ^ Repeats detected ^ Algorithm ^ Parameters ^ Input file ^ Output file ^ Detect imperfect or compound loci ^ Design primers ^ Online access ^ Interface (GUI or command-line) ^ Platform ^ language ^ Speed ^ Special Features ^ Limitations ^ ^Sputnik and Sputnik II | 1994 | 2005 | [[http://scholar.google.ca/scholar?cites=2036014199810014882&as_sdt=2005&sciodt=0,5&hl=fr|124]] | Abajian, 1994, with modifications in La Rota et al. 2005(([[http://www.biomedcentral.com/1471-2164/6/23|La Rota, M., Kantety, R., Yu, J.-K., and Sorrells, M. (2005). Nonrandom distribution and frequencies of genomic and EST-derived microsatellite markers in rice, wheat, and barley. BMC Genomics 6, 23.]])) | 2 to 5 bp, or 1 to 5 bp in Morgante et al 2002(([[http://www.nature.com/doifinder/10.1038/ng822|Morgante, M., Hanafey, M., and Powell, W. (2002). Microsatellites are preferentially associated with nonrepetitive DNA in plant genomes. Nature Genetics 30, 194-200.]])) | Recursive | Match bonus and mismatch penalty, the validation score, the fail score, the maximum number of recursions, the minimum percentage of perfection, and the period size. | One file with one sequence | Resulting hits are written to stdout along with their position in the sequence, length, and score | Yes, Insertions, mismatches and deletions and compound | ? | ? | ? [[http://espressosoftware.com/sputnik/index.html|Download]], SputnikII modified in 2005 [[http://wheat.pw.usda.gov/ITMI/EST-SSR/LaRota/|Download]] | Windows | C | Execution time is dependent on the sequence composition | | Automated statistical analysis files not generated. | | ^Not named | 1998 | ? | [[http://scholar.google.ca/scholar?cites=2876768191293615039&as_sdt=2005&sciodt=0,5&hl=fr|37]] | Sagot and Myers 1998(([[http://www.liebertonline.com/doi/abs/10.1089/cmb.1998.5.539|Sagot, M.-F., and Myers, E. W. (1998). Identifying Satellites and Periodic Repetitions in Biological Sequences. Journal of Computational Biology 5, 539-553.]])) | Detect repeats of fixed length, of 2 to 45 bp long per unit. | Exhaustive | Minimal size of a repeat (min_repeat), minimum number of repeats within a TR (max-range), ? (max-jump) | ? | ? | Yes, substitutions and indels | No | No | ? | ? | ? | Execution time increases with the number of TR detected | Can be adapted to finding TR in protein sequences; and to identifying mixed direct-inverse TR. | Does not detect duplicated genes | | ^TEIRESIAS | 1998 | ? | [[http://scholar.google.ca/scholar?cites=8485486606296648212&as_sdt=2005&sciodt=0,5&hl=fr|502]] |Rigoustos et al 1998(([[http://bioinformatics.oxfordjournals.org/cgi/doi/10.1093/bioinformatics/14.1.55|Rigoutsos, I., and Floratos, A. (1998). Combinatorial pattern discovery in biological sequences: The TEIRESIAS algorithm. Bioinformatics 14, 55-67.]])) | Detects nested TR | Heuristic | L (length of interrogated sequence) and W (length of one unit) and threshold K (length of the unit going to verification step) | One file with multiple sequences. | ? | Yes, only substitutions (not indels) | no | No | ? Available upon request to the [[mailto:rigoutso@us.ibm.com;gustavo@us.ibm.com|authors]] | ? | ? | Execution time increases quasi-linearly with length of units (W) and length of interrogated sequence (L) | Motifs are guaranteed to be maximal in length and composition. Can identify duplicated genes. Can search in amino acid sequences. | ? | | ^TRF | 1999 | 2004 | [[http://scholar.google.ca/scholar?cites=2912855801419178739&as_sdt=2005&sciodt=0,5&hl=fr|1487]] | Benson 1999(([[http://www.nar.oupjournals.org/cgi/doi/10.1093/nar/27.2.573|Benson, G. (1999). Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Research 27, 573-580.]])) | >6 bp repeats | Heuristic; based on the alignment procedure and k-tuples. | a) alignment weights for match, mismatch and indels; b) pM and pI; c) minimum size for patterns to report; d) minimum alignment score (threshold). | One file with one sequence | A summary table file with location and statistical properties of the TR found. The other file is alignment of each repeat with its consensus sequence | Yes, imperfect and compound | | [[http://tandem.bu.edu/trf/trf.submit.options.html|Yes]] | ? [[http://tandem.bu.edu/trf/trf.download.html|Download]] | platform independent | Source code not available | Slow. Execution time is exponential to the number of repeats detected | Able to detect TR in a wide range of size and copy number | Does not accept sequences longer than 5 Mb; Automated statistical analysis files not generated; output files are difficult to manage. | | ^Not named | 2001 | ? | [[http://scholar.google.ca/scholar?cites=1469843543430168168&as_sdt=2005&sciodt=0,5&hl=fr|94]] | Landau et al 2001(([[http://www.liebertonline.com/doi/abs/10.1089/106652701300099038|Landau, G. M., Schmidt, J. P., and Sokol, D. (2001). An algorithm for approximate tandem repeats. Journal of Computational Biology 8, 1-18.]])) | ? | Exhaustive; uses Hamming distance (k mismatches) and edit distance (k differences) | ? | ? | ? | Yes, substitutions (but not indels) | | No | ? Available upon request to the [[mailto:landau@cs.haifa.ac.il|authors]] | ? | ? | ? | ? | ? | | ^REPuter | 2001 | 2005 | [[http://scholar.google.ca/scholar?cites=16580208599321300202&as_sdt=2005&sciodt=0,5&hl=fr|341]] | Kurtz et al 2001(([[http://www.nar.oupjournals.org/cgi/doi/10.1093/nar/29.22.4633|Kurtz, S. (2001). REPuter: the manifold applications of repeat analysis on a genomic scale. Nucleic Acids Research 29, 4633-4642.]])) | Detect repeats of fixed length. Repeats need not to be in tandem | Heuristic; uses the Hamming distance model and the edit distance model | sequencetype, smap, direct, palindromic, length, hamming, edit, best, content, identity, evalue (error threshold k ≥ 0 and a length threshold l > 0 are given) | Limit of 5 Mb in the online version, DNA or protein | Statistical and graphical analysis | Yes, imperfect including substitutions and compound | | [[http://bibiserv.techfak.uni-bielefeld.de/reputer/submission.html |Yes]] | Command-line, [[http://bibiserv.techfak.uni-bielefeld.de/reputer/|Download]] | Linux, OSX, Solaris, Irix and Alpha | ? | ? | Nucleic acid or amino acid sequence; combine linear time efficiency with exhaustive analysis | Is not primarily designed for microsatellites | | ^SSRIT (and CUGIssr) | 2001 | ? | [[http://scholar.google.ca/scholar?cites=13806189873679233839&as_sdt=2005&sciodt=0,5&hl=fr|666]] | Temnykh et al 2001(([[http://www.genome.org/cgi/doi/10.1101/gr.184001|Temnykh, S. (2001). Computational and experimental analysis of microsatellites in rice (//Oryza sativa// L.): frequency, length variation, transposon associations, and genetic marker potential. Genome Research 11, 1441-1452.]])) | 2 to >6 bp long repeats | Uses regular expressions and similarity searches | ? | Multiple files with one sequence in a fasta format. Limit of 1 Mb per sequence analyzed. | Reports the GenBank ID, SSR motif, number of repeats, sequence coordinates for each SSR and GC% in DNA sequences (up to 500 bp in length) immediately adjoining SSR | No, perfect only | Yes, using Primer 0.5 | [[http://www.gramene.org/db/markers/ssrtool|Yes]] | Command-line [[ftp://ftp.gramene.org/pub/gramene/software/scripts/ssr.pl|Download]] | platform independent | Perl script | ? | ? | Automated statistical analysis files not generated; | | ^ComplexTR | 2002 | 2005 | [[http://scholar.google.ca/scholar?cites=10462322261601920203&as_sdt=2005&sciodt=0,5&hl=fr|33]] | Hauth and Joseph 2002(([[http://bioinformatics.oxfordjournals.org/cgi/doi/10.1093/bioinformatics/18.suppl_1.S31|Hauth, A. M., and Joseph, D. A. (2002). Beyond tandem repeats: complex pattern structures and distant regions of similarity. Bioinformatics 18, S31-S37.]])) | Long repeats, variable length TR (VLTRs) and multiperiod TR (MPTRs) | Seed-extension technique; analyses k-length substrings | ? | One file with one sequence | HTML page with detailed characterization of each TR region | | | The [[http://megasun.bch.umontreal.ca/People/ahauth/tools/ |web access]] is not functioning | Command-line [[http://megasun.bch.umontreal.ca/People/ahauth/tools/|Download]] | ? | C++, perl. Tandem repeat identification code (C++); Web Interface via CGI scripts (perl) | ? | Identifies duplicated genes and pseudogenes | ? | | ^TROLL | 2002 | ? | [[http://scholar.google.ca/scholar?cites=12253928681016535900&as_sdt=2005&sciodt=0,5&hl=fr|125]] | Castelo et al 2002(([[http://bioinformatics.oxfordjournals.org/cgi/doi/10.1093/bioinformatics/18.4.634|Castelo, A. T., Martins, W., and Gao, G. R. (2002). TROLL--Tandem Repeat Occurrence Locator. Bioinformatics 18, 634-636.]])) and Martins et al. 2006(([[http://www.nar.oxfordjournals.org/cgi/doi/10.1093/nar/gnj030|Martins, W. (2006). New softwares for automated microsatellite marker development. Nucleic Acids Research 34, e31-e31.]])) | Only searches for predefined motifs | Dictionary approach and the failure links for mismatches | Minimum length desired; maximum motif length; file containg the motif list; | One file with one sequence | Can be easily imported to other applications | Yes, imperfect and compound (indels?) | Yes, with Primer3 | No | Command-line [[http://finder.sourceforge.net|Download]] | Linux | C++ (Tcp/Tk script) | ? | ? | ? | | ^ATRHunter | 2004 | Not updated on a regular basis. | [[http://scholar.google.ca/scholar?cites=11187065548778505423&as_sdt=2005&sciodt=0,5&hl=fr|47]] | [[http://bioinfo.cs.technion.ac.il/atrhunter/ATR.pdf|Wexler et al 2004]] or Wexler et al 2005(([[http://www.liebertonline.com/doi/abs/10.1089/cmb.2005.12.928|Wexler, Y., Yakhini, Z., Kashi, Y., and Geiger, D. (2005). Finding approximate tandem repeats in genomic sequences. Journal of Computational Biology 12, 928-942.]])) | ? | Heuristic | 1) Alignment parameters (match, mismatch, gap, terminal gap) 2) Maximum motif length 3) Similarity level between adjacent copies 4) Average similarity level between adjacent copies 5) Minimum alignment score with a repeating copy. | One sequence, no limit in length. | ? | Yes | no | [[http://bioinfo.cs.technion.ac.il/atrhunter/|Yes]] | Command-line [[http://bioinfo.cs.technion.ac.il/atrhunter/ATRinformation.htm|Download]] | ? | ? | ? | ? | ? | | ^[[http://pgrc-35.ipk-gatersleben.de/portal/page/portal/PG_BICGH/P_BICGH/P_BICGH_TOOLS/P_BICGH_ITOOLS_MISA|MISA]] | 2003 | 2010 | [[http://scholar.google.ca/scholar?cites=17518492932948765286&as_sdt=2005&sciodt=0,5&hl=fr|468]] | Thiel et al 2003(([[http://www.springerlink.com/content/mv768txyrmxm4kga/|Thiel, Michalek, Varshney, and Graner (2003). Exploiting EST databases for the development and characterization of gene-derived SSR-markers in barley (//Hordeum vulgare// L.). TAG Theoretical and Applied Genetics 106, 411-422.]])) | One to six bp motifs | ? | Unit sizes; lower threshold of repeats for that specific unit; maximal number of bases between two adjacent microsatellites to be recognised as being a compound microsatellite | Individual or group of sequences in fasta format | Two files: one table file with localization and type of identified microsatellite(s); one file with statistics (e.g. frequency of a specific microsatellite type) | Yes, compound and imperfect (but no indels) (other source mention it recognizes only perfect) | Yes, with Primer3 | No | Command-line [[http://pgrc.ipk-gatersleben.de/misa/misa.html|Download]] | Platform independent | perl script | ? | ? | Detection of interrupted TR not efficient; inappropriate classification of different motifs; automated statistical analysis files not generated. | | ^Mreps | 2003 | ? | [[http://scholar.google.ca/scholar?cites=7808444266556719865&as_sdt=2005&sciodt=0,5&hl=fr|156]] | Kolpakov et al 2003(([[http://www.nar.oupjournals.org/cgi/doi/10.1093/nar/gkg617|Kolpakov, R., Bana, G., and Kucherov, G. (2003). mreps: efficient and flexible detection of tandem repeats in DNA. Nucleic Acids Research 31, 3672-3678.]])) | Sites shorter than period + 9 are automatically discarded. | Mixed combinatorial/heuristic paradigm | Start and end positions of the region to be processed; length interval, period interval, minimal exponent of the repetitions to report; resolution level, use or not of the sliding window. | One file with multiple sequences in the fasta format. | list of all repeats, with start and end positions of the repeat in the sequence, overall size of the repeat, period, exponent, error level, the repeat sequence itself | Yes, imperfect and compound (not indels) | Yes | [[http://bioinfo.lifl.fr/mreps/mreps.php|Yes]] (or [[http://mobyle.pasteur.fr/cgi-bin/portal.py?#forms::mreps|here]]) | Command-line [[http://bioinfo.lifl.fr/mreps//|Download]] | Linux, SunOS, Digital Unix and Windows systems. Online? | ANSI C | Should be fast. linear in the sequence length | ? | Automated statistical analysis files not generated; | | ^STRING | 2003 | 2003 | [[http://scholar.google.ca/scholar?cites=7212951020758895252&as_sdt=2005&sciodt=0,5&hl=fr|32]] | Parisi et al 2003(([[http://bioinformatics.oxfordjournals.org/cgi/doi/10.1093/bioinformatics/btg268|Parisi, V., De Fonzo, V., and Aluffi-Pentini, F. (2003). STRING: finding tandem repeats in DNA sequences. Bioinformatics 19, 1733-1738.]])) | No limit in repeat length | Heuristic; dynamic programming procedure. | ? | One file with one sequence, no limit in length. | Length of the consensus word; first and last position of the TR; number of repeated units; score; consensus word; flanking sequences; alignment between the model TR and the given sequence; number of indels; number of matches and mismatches; TR base composition percentages; flag indicating a likely nested expansion. | Yes, imperfect and compound (and indels?) | No but may be possible | A web interface is mentioned in the original publication, but cannot be found | Command-line [[http://www.caspur.it/~castri/STRING/|Download]] | Unix, Windows, MacOS | C | Slowly increases as a function of the sequence length, while it increases more quickly as a function of the number of TRs. | ? | Automated statistical analysis files not generated. | | ^W-SSRF | 2003 | ? | [[http://scholar.google.ca/scholar?cites=6592174860693683120&as_sdt=2005&sciodt=0,5&hl=fr|8]] | Sreenu et al 2003(([[http://www.ncbi.nlm.nih.gov/pubmed/15130803|Sreenu, V. B., Ranjitkumar, G., Swaminathan, S., Priya, S., Bose, B., Pavan, M. N., Thanu, G., Nagaraju, J., and Nagarajaram, H. A. (2003). MICAS: a fully automated web server for microsatellite extraction and analysis from prokaryote and viral genomic sequences. Appl. Bioinformatics 2, 165-168.]])) | 1 to 10 bp long | Scans a nucleotide sequence | ? | One file with one sequence. Upload limit is a 20 kb file. | Sequence content of the motif, repeat numbers, start and end position of the tract in the sequence | No, perfect only | Yes, using Autoprimer (in MICAS) | No | The user friendly GUI (graphical user interface) is MICAS, Available upon request to the [[mailto:han@cdfd.irg.in|authors]] | MICAS is [[http://micas.cdfd.org.in:8080/MIC/|web-only]] | Java | ? | ? | ? | | ^IRF program | 2004 | 2007 | [[http://scholar.google.ca/scholar?cites=7789233353140150696&as_sdt=2005&sciodt=0,5&hl=fr|80]] | Warburton et al 2004(([[http://www.genome.org/cgi/doi/10.1101/gr.2542904|Warburton, P. E. (2004). Inverted repeat structure of the human genome: The X-chromosome contains a preponderance of large, highly homologous inverted repeats that contain testes genes. Genome Research 14, 1861-1869.]])) | ? | ? | ? | ? | ? | | | ? | Command-line [[http://tandem.bu.edu/news.html#mar2007|Download]] | ? | ? | ? | ? | ? | | ^ExTRS | 2004 | ? | [[http://scholar.google.ca/scholar?cites=10980601425678591553&as_sdt=2005&sciodt=0,5&hl=fr|33]] | Krishnan and Tang 2004(([[http://bioinformatics.oxfordjournals.org/cgi/doi/10.1093/bioinformatics/bth311|Krishnan, A., and Tang, F. (2004). Exhaustive whole-genome tandem repeats search. Bioinformatics 20, 2702-2710.]])) | ? | Exhaustive | ? | One file with one sequence, no limit in length. | Redundancy in the output is reduced | Yes, substitutions (not indels) | | ? | ? Available upon request to the [[mailto:arun@bii.a-star.edu.sg|authors]] | ? | Source code available only on request | Near-proportional to the number of TR found | ? | ? | | ^SRF | 2004 | 2004 | [[http://scholar.google.ca/scholar?cites=7521918802336115515&as_sdt=5&sciodt=0&hl=fr|55]] | Sharma et al. 2004(([[http://bioinformatics.oxfordjournals.org/cgi/doi/10.1093/bioinformatics/bth103|Sharma, D., Issac, B., Raghava, G. P. S., and Ramaswamy, R. (2004). Spectral Repeat Finder (SRF): identification of repetitive sequences using Fourier transformation. Bioinformatics 20, 1405-1412.]])) | 2-300 bp (default is 2-10 bp) | Spectral technique | Minimum repeat length; Maximum repeat length; Minimum % Match; FFT peak cut-off; (optional) Minimum number of copies. | One file with one sequence, no limit in length. In fasta or genbank format. | Information about the repeat unit, consensus pattern, region, copy number and score; as well as Fourier spectrum and the detailed analysis of any particular repeat unit. | Yes, imperfect and indels | No | [[http://www.imtech.res.in/raghava/srf/submit.html|Yes]] | Available upon request to the [[mailto:r.ramaswamy@mail.jnu.ac.in|author]] | Linux and Mac | Perl | Execution time increases as a function of repeat/sequence length | Repeats can be tandem, dispersed and/or imperfect | ? | | ^STAR | 2004 | ? | [[http://scholar.google.ca/scholar?cites=6796575892001734157&as_sdt=2005&sciodt=0,5&hl=fr|52]] | Delgrange and Rivals 2004(([[http://bioinformatics.oxfordjournals.org/cgi/doi/10.1093/bioinformatics/bth335|Delgrange, O., and Rivals, E. (2004). STAR: an algorithm to search for tandem approximate repeats. Bioinformatics 20, 2812-2820.]])) | Only searches for predefined motifs >2 bp, but allows approximate repeats. | Exhaustive. As searching for microsatellites is a special cases of approximate TR, it can be done globally in a single run using the MotifFile parameter. Contact the [[http://www.lirmm.fr/~rivals/|authors]] for more informations. | File with one motif per line, and then each motif is searched independently. Position offset | One file with one sequence | ? | Yes, substitutions and indels | | As a service on an [[http://www.atgc-montpellier.fr/star/|online platform]] | Command-line [[http://www.atgc-montpellier.fr/star/binaries.php|Download]] | Linux, SunOS, Mac OSX, and Windows systems. | Source code not available | Slow. The overall time complexity needed by STAR to find all ATR of a given motif of length p, in a sequence of length n, is O(np+n log n) | Does not detect TR that would appear in a random sequence (false positives) | | | ^TRA | 2004 | 2004 | [[http://scholar.google.ca/scholar?cites=13056073161710952438&as_sdt=5&sciodt=0&hl=fr|21]] | Bilgen et al 2004(([[http://bioinformatics.oxfordjournals.org/cgi/doi/10.1093/bioinformatics/bth410|Bilgen, M., Karaca, M., Onus, A. N., and Ince, A. G. (2004). A software program combining sequence motif searches with keywords for finding repeats containing DNA sequences. Bioinformatics 20, 3379-3386.]])) | ? | Heuristic | ? | Multiple files with multiple sequences each (Max 1 Mb sequence) from ESTs. | ? | Yes, searches for exact–inexact TRs and exact–inexact compound repeats | No? | ? | ? [[ftp://ftp.akdeniz.edu.tr/Araclar/TRA/|Download]] | Windows | C++ (with Microsoft Visual C++) | ? | Searches among the organisms, organs, tissue types and development stages | ? | | ^MsatFinder | 2005 | 2007 | 48 | Thurston and Field 2005(([[http://www.genomics.ceh.ac.uk/msatfinder/|Thurston, M. I., and Field, D. (2006). Msatfinder, (Oxford, UK: Centre for Ecology and Hydrology. Computer program)]])) | One to 6 bp long | ? | 1) length of repeat 2) number repeat unit in the site 3) search engine (regex, multipass or iterative search) | Limit of 10 Mb of sequence in the online access. Accepts GenBank, EMBL, Swissprot, FASTA, ASCII. | Repeats, GFF, Counts, Msat_tabs, Flank_tabs, Fasta, MINE, Primers | No, but detects compound perfect repeats | | [[http://www.genomics.ceh.ac.uk/cgi-bin/msatfinder/msatfinder.cgi|Yes]] | Command-line [[http://www.genomics.ceh.ac.uk/msatfinder/#download|Download]] | Unix (may work on Mac OSX) | perl script | ? | Nucleic acid or amino acid sequence | ? | | ^FireµSat | 2006 | 2011 | [[http://scholar.google.ca/scholar?cites=17060705618169575807&as_sdt=2005&sciodt=0,5&hl=fr|5]] and [[http://scholar.google.ca/scholar?cites=66353776030260478&as_sdt=5&sciodt=0&hl=fr|this]] | de Ridder at al 2006(([[http://portal.acm.org/citation.cfm?doid=1216262.1216289|de Riddera, C., Kourie, D. G., and Watson, B. W. (2006). FireµSat. In Proceedings of the 2006 annual research conference of the South African institute of computer scientists and information technologists on IT research in developing countries - SAICSIT ’06 (Somerset West, South Africa), pp. 247-256.]])) and de Ridder at al 2013(([[http://www.sciencedirect.com/science/article/pii/S1570866712001657|De Ridder, C., D.G. Kourie, B.W. Watson, T.R. Fourie, and P.V. Reyneke (2013). Fine-tuning the search for microsatellites. Journal of Discrete Algorithms 20: 21–37.]])) | 1 to 5 bp. Length set by the user. The next update should allow for detection of 6 to 100 bp repeats. | Uses Counting Finite Automata (which are regular language acceptors) | Max Motif Error (per motif); Max adjacent ATR elements; Motif Range Options; Min required TR elements; Max substring error (a threshold); Mismatch penalty (m_p); Delete penalty (d_p); Insert penalty (i_p). | One fasta file with one sequence | File in .csv format. | Yes, substitutions and indels, but not compound loci. | No | No | GUI and command-line, [[http://www.dna-algo.co.za/|Download]] | Windows; Linux in progress | C++ and MatLab | Run time increases linearly with the sequence length; does not increase with longer motif lengths. | Designed for microsatellites, but can detect any type of TR | Fast, simple and flexible. | | ^Phobos | 2006 | 2010 | ? | Mayer 2010(([[http://www.ruhr-uni-bochum.de/spezzoo/cm/cm_phobos.htm |Mayer, C. (2010). Phobos: Highly accurate search for perfect and imperfect tandem repeats in complete genomes by Christoph Mayer, (Bochum, Germany: Ruhr-Universität Bochum,Faculty of Biological Sciences and Biotechnology). Computer program.]])) | Perfect and imperfect TR, with a pattern size of 1 - 10 000 bp | Exhaustive, uses alignment scores | Mismatch score, indel score, minimum score, minimum length, minimum perfection, and others. | One file in fasta format, with multiple sequence. No limit in sequence length. | Text file, different formats, including gff and fasta | Yes. Substitutions and indels. | Not in itself, but yes as implemented in STAMP or Geneious. | No | User friendly GUI and easily scriptable Command-line program [[http://www.ruhr-uni-bochum.de/ecoevo/cm/cm_phobos.htm|Download]] | MacOSX, Linux, Windows. | C++ | Execution time increases with pattern size range. Very fast in the size range 1-10 bp, slow for patterns in the size range above 10-20 bp. | Can be incorporated into pipelines. Implemented in STAMP and Geneious. | Free only for academic users. | | ^SSRscanner | 2006 | ? | [[http://scholar.google.ca/scholar?cites=15246530144073441813&as_sdt=2005&sciodt=0,5&hl=fr|3]] | Anwar and Khan 2006(([[http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1891659/|Anwar, T., and Khan, A. (2006). SSRscanner: a program for reporting distribution and exact location of simple sequence repeats. Bioinformation 1, 89-91.]])) | Only searches for predefined motifs | Exhaustive, uses dictionary approach. | File containing motifs of different repeat types; number of times for the motifs to be repeated | One file with one sequence | Motifposition.txt (gives the frequency of each repeat provided in the motif file) and (2) Motifresult.exe (gives the specific location of each repeat) | No, perfect only | No | No | Command-line Availability unknown, contact the [[mailto:huzzi99@hotmail.com|author]] | Platform independent | perl script | ? | ? | ? | | ^TandemSWAN | 2006 | 2006 | [[http://scholar.google.ca/scholar?cites=9674288721814272003&as_sdt=2005&sciodt=0,5&hl=fr|35]] | Boeva et al 2006(([[http://bioinformatics.oxfordjournals.org/cgi/doi/10.1093/bioinformatics/btk032|Boeva, V., Regnier, M., Papatsenko, D., and Makeev, V. (2006). Short fuzzy tandem repeats in genomic sequences, identification, and possible role in regulation of gene expression. Bioinformatics 22, 676-684.]])) | Fuzzy TR (degenerate TR without indels, but with a high substitution rate), > 3 bp long | Heuristic | Degeneracy level; the minimal and the maximal period sizes; the mode of statistical significance calculation (i.e. the "motif" mode or the "mask" mode); in plain, fasta, EMBL or GenBank format. | One file with one sequence, no limit in length. In plain, EMBL, fasta or GenBank format. | A file with: the start, end, and length of TR; motif size (period); the number of copies; the motif consensus sequence; the number of words in the motif that satisfy the consensus; the probability, P-value and statistical significance for the "motif" and the "mask"; the TR itself. | Yes, only imperfect (no indels) | No | [[http://favorov.imb.ac.ru/swan/form_TS.html?web_form=Query+the+new+version+of+TandemSWAN+web+form+now!|Yes]] | Command-line [[http://favorov.imb.ac.ru/swan/|Download]] | Platform independent | C | ? | Can simultaneously identify both short- and large period repeats with approximately the same fuzziness | ? | | ^OMWSA | 2007 | ? | [[http://scholar.google.ca/scholar?cites=8389825704026262517&as_sdt=2005&sciodt=0,5&hl=fr|10]] | Du et al 2007(([[http://bioinformatics.oxfordjournals.org/cgi/doi/10.1093/bioinformatics/btm008|Du, L., Zhou, H., and Yan, H. (2007). OMWSA: detection of DNA repeats using moving window spectral analysis. Bioinformatics 23, 631-633.]])) | From 3 bp long to unlimited length | Spectral technique | Window size | ? | A spectrogram (an image file) | Yes, imperfect compound and indels | No | ? | Command-line [[http://www.hy8.com/~tec/papers/omwsa01.zip|Download]] | ? | ? | ? | Graphically displays the potential repeats in the location-frequency plane | ? | | ^IMex | 2007 | 2009 | [[http://scholar.google.ca/scholar?cites=15709318602016110348&as_sdt=5&sciodt=0&hl=fr|27]] | Mudunuri et al 2007(([[http://bioinformatics.oxfordjournals.org/cgi/doi/10.1093/bioinformatics/btm097|Mudunuri, S. B., and Nagarajaram, H. A. (2007). IMEx: Imperfect Microsatellite Extractor. Bioinformatics 23, 1181-1187.]])) | One to 6 bp long | Simple string-matching algorithm | Number of edit operations/motif (k); percentage imperfection for the entire tract (_p); minimum repeat number (n); coding information file | One file with one sequence, no limit in length. | Pairwise alignment between the identified tract and its perfect counter part is produced to indicate the matches, mismatches and gaps | Yes, imperfects with substitutions and indels. | Yes, linked to Primer3 | [[http://imex.cdfd.org.in/IMEX/extract.html|Yes]] | User friendly GUI (graphical user interface) standalone. [[http://imex.cdfd.org.in/IMEX/download.html|Download]] | ? | Standard C language. Web server using CGI-Perl | Fast. Execution time is linear (directly proportional) to the sequence length rather than the number of repeats detected | Tells if in coding or non-coding region. | ? | | ^SciRoKo | 2007 | 2008 | [[http://scholar.google.ca/scholar?cites=16996584693111364287&as_sdt=2005&sciodt=0,5&hl=fr|44]] | Kofler et al 2007(([[http://bioinformatics.oxfordjournals.org/cgi/doi/10.1093/bioinformatics/btm157|Kofler, R., Schlotterer, C., and Lelley, T. (2007). SciRoKo: a new tool for whole genome microsatellite search and investigation. Bioinformatics 23, 1683-1685.]])) | One to 6 bp long | ? | Hits (identity with a virtual perfect microsatellite), number of mismatches (mm), mismatch penalty (mmP) and the length of the SSR motif (mL). | One file with multiple sequences in the fasta format. | ? | Yes, imperfect and compound (indels?) | | No | User friendly GUI (graphical user interface) standalone. [[http://www.kofler.or.at/bioinformatics/SciRoKo/index.html|Download]] | Windows. Should be platform independent, but Mac users have not been able install | C# | Fast | ? | Depends on .NET framework | | ^Msatcommander | 2008 | 2011 | [[http://scholar.google.ca/scholar?cites=3185583603820687905&as_sdt=2005&sciodt=0,5&hl=fr|102]] | Faircloth 2008(([[http://doi.wiley.com/10.1111/j.1471-8286.2007.01884.x|Faircloth, B. C. (2008). msatcommander: detection of microsatellite repeat arrays and automated, locus-specific primer design. Molecular Ecology Resources 8, 92-94.]])) | ? | Uses regular expressions | ? | One file with multiple sequences in the fasta format. | Either a summary file (array detection only) or a directory at a user-selectable location | No, but accepts N | Yes, with Primer3, includes 5'-tailing | No | User friendly GUI (graphical user interface) standalone. [[http://code.google.com/p/msatcommander/|Download]] | MacOS X, Windows, Unix. | Python | ? | Rapid and automated microsatellite array detection, locus-specific primer design, and 5'-tailing of designed primers | ? | | ^ReRep | 2008 | 2008 | [[http://scholar.google.ca/scholar?cites=255635357253003447&as_sdt=2005&sciodt=0,5&hl=fr|5]] | Otto et al 2008(([[http://www.biomedcentral.com/1471-2105/9/366|Otto, T. D., Gomes, L. H. F., Alves-Ferreira, M., de Miranda, A. B., and Degrave, W. M. (2008). ReRep: Computational detection of repetitive sequences in genome survey sequences (GSS). BMC Bioinformatics 9, 366.]])) | ? | Uses self-similarity searches | ? | Genome survey sequences(GSS) files, including 454-reads | ? | Yes, substitutions and indels | No | No | Command-line [[http://www.dbbm.fiocruz.br/labwim/bioinfoteam/index.pl?action=services|Download]] | Linux | Perl | ? | Can detect de novo repeats in Genome Sequence Survey sequence data | ? | ^T-REKS | 2009 | ? | [[http://scholar.google.ca/scholar?cites=208721574341442296&as_sdt=2005&sciodt=0,5&hl=fr|5]] | Jorda and Kajava 2009(([[http://bioinformatics.oxfordjournals.org/cgi/doi/10.1093/bioinformatics/btp482|Jorda, J., and Kajava, A. V. (2009). T-REKS: identification of Tandem REpeats in sequences with a K-meanS based algorithm. Bioinformatics 25, 2632-2638.]])) | No limits in repeat length | Short string extension and K-means algorithm. | delta-l Allowed % of length variability, P*sim—similarity threshold and an option to allow or not the detection of overlaping TR. | Sequences in FASTA format. | Output with start, end, length of TR and multiple alignment of the repeats. | Yes, substitutions and indels. | No | [[http://bioinfo.montp.cnrs.fr/?r=t-reks|Yes]] | User friendly GUI (graphical user interface) standalone. [[http://bioinfo.montp.cnrs.fr/?r=t-reks|Download]] | Platform independent | Java | Fast. Execution time is linear (directly proportional) to the sequence length. | Can be applied to nucleic acid, amino acid or any text sequence. | ? | ^BwTRS | 2010 | 2009 | [[http://scholar.google.ca/scholar?cites=7603849771779688492&as_sdt=2005&sciodt=0,5&hl=fr|2]] | Pokrzywa and Polanski 2010(([[http://linkinghub.elsevier.com/retrieve/pii/S0888754310001758|Pokrzywa, R., and Polanski, A. (2010). BWtrs: A tool for searching for tandem repeats in DNA sequences based on the Burrows–Wheeler transform. Genomics 96, 316-321.]])) | ? | Exhaustive; uses efficient data compression algorithm | "Minimum motif size", "Maximum motif size", "Minimum repeat size" and "Minimum repeat ratio". | One file with multiple sequences in the fasta format; or GenBank id. Accepts nucleotides and amino acids sequences. | List of all TR with: Start and End (position of the TR in the sequence); Motif length; Ratio (between the motif length and the consensus repeat length); the motif itself. HTML or text. | No | No | [[http://bwtools.polsl.pl/BWtrs/input.jsp|Yes]], runs on a standard Tomcat servlet container without any local database | ? Availability unknown, contact the [[mailto: rafal.pokrzywa@polsl.pl|authors]]. | ? | Java | Depends on sequence length | Nucleic acid or amino acid sequence | ? | ^QDD, QDD2 | 2010 | 2011 | [[http://scholar.google.ca/scholar?cites=146872267015635519&as_sdt=5&sciodt=0&hl=fr|16]] | Meglecz et al 2010(([[http://bioinformatics.oxfordjournals.org/cgi/doi/10.1093/bioinformatics/btp670|Meglecz, E., Costedoat, C., Dubut, V., Gilles, A., Malausa, T., Pech, N., and Martin, J.-F. (2009). QDD: a user-friendly program to select microsatellite markers and design primers from large sequencing projects. Bioinformatics 26, 403-404.]])) | 2 to 6 bp motifs | First step: detects all perfect TR with at least 5 repetitions as target. Later (step 3, primer design), TR of 3-4 repeats are also detected and blocks of repetitions are pooled into a complex microsatellite if the distance between them is not greater than the footprint of the longest motif of two neighbors. | Not parametrable | One fasta file with multiple short sequences (< ~2Mb). Multiple files can be run in batch. | (a) Fasta file with sequences containing a perfect TR of at least 5 repeats. (b) Text file with sequence code and length, number of microsatellites, TR motif, first position, last position, and number of repeats. csv file with primer information. | Yes if it contains at least 5 perfect repeats. Imperfect TR will be detected as composite site. | Yes | [[http://gsite.univ-provence.fr/gsite/Local/egee/dir/meglecz/QDD.html|Coming soon]] | Command-line. Availability unknown, contact the [[mailto:emese.meglecz@univ-provence.fr|authors]]. | Linux, Windows | Perl script (with Bioperl) | ? | Sorts raw sequences by tag, removes adapters, detects TR sites, redundancy, selects target microsatellites, and designs primer. Detects contamination. | ? | ^TRStalker (in TReaDS) | 2010 | ? | [[http://scholar.google.ca/scholar?cites=10471014057375486653&as_sdt=5&sciodt=0&hl=fr|1]] | Pellegrini et al 2010(([[http://bioinformatics.oxfordjournals.org/cgi/doi/10.1093/bioinformatics/btq209|Pellegrini, M., Renda, M. E., and Vecchio, A. (2010). TRStalker: an efficient heuristic for finding fuzzy tandem repeats. Bioinformatics 26, i358-i366.]])) | ? | ? | ? | ? | ? | Yes, imperfect and indels | | [[http://yucca.iit.cnr.it:8080/treads/|Yes]] | ? Availability unknown, contact the [[mailto:marco.pellegrini@iit.cnr.it|authors]] | ? | ? | ? | ? | ? | ^Pipeline name ^ year released ^ Latest update ^ Number of citations ^ Original publication ^ Repeats detected ^ Algorithm used ^ Parameters ^ Input file ^ Output file ^ Detect imperfect or compound loci ^ Design primers ^ Online access ^ Interface (GUI or commandline) ^ Platform ^ language ^ Speed ^ Special Features ^ Limitations | ^repeatfinder | 2001 | ? | ? | Volfovsky et al 2001(([[http://genomebiology.com/2001/2/8/research/0027|Volfovsky, N., Haas, B., and Salzberg, S. (2001). A clustering method for repeat analysis in DNA sequences. Genome Biology 2, research0027.1 - research0027.11.]])) | ? | Uses REPuter | ? | ? | ? | ? | ? | No | Command-line [[http://www.cbcb.umd.edu/software/RepeatFinder/|Download]] | Linux RedHat 6.x+, Sun Solaris, and Alpha OSF1 | ? Open Source | ? | ? | ? | ^MsatMiner | 2005 | ? | | Thurston and Field 2005(([[http://www.genomics.ceh.ac.uk/msatminer/|Thurston, M. I. (2005). Msatminer - scripts for processing msatfinder output, (Oxford, UK: Centre for Ecology and Hydrology). Computer program.]])) | ? | Uses msatfinder for motif discovery | | Different format of sequence file | More statistical analysis are possible. | Yes, imperfect and compound (indels?) | Possible when using additional scripts | ? | Command-line | Unix and MacOS | Collection of perl scripts | ? | ? | Running scripts is possibly complicated | ^E-TRA | 2005 | 2004 | [[http://scholar.google.ca/scholar?cites=6055835177193667419&as_sdt=5&sciodt=0&hl=fr|12]] | Karaca et al 2005(([[http://www.springerlink.com/index/10.1007/BF02715889|Karaca, M., Bilgen, M., Onus, A. N., Ince, A. G., and Elmasulu, S. Y. (2005). Exact tandem repeats analyzer (E-TRA): A new program for DNA sequence mining. J Genet 84, 49-54.]])) | 1 to 1000 bp repeat | Uses TRA | ? | Multiple files with multiple sequences each (maximum of 1 Mb long) | | Yes, compound and imperfect | Yes | No | user friendly GUI (graphical user interface) [[ftp://ftp.akdeniz.edu.tr/Araclar/TRA/|Download]] | Windows | C++ (with Microsoft Visual C++) | | Searches among the organisms, organs, tissue types and development stages | Only 1 Mb of input sequence length | ^SSRprimerII | 2006 | 2009 | 29 | Robinson et al. 2004(([[http://bioinformatics.oxfordjournals.org/cgi/doi/10.1093/bioinformatics/bth104|Robinson, A. J., Love, C. G., Batley, J., Barker, G., and Edwards, D. (2004). Simple sequence repeat marker loci discovery using SSR primer. Bioinformatics 20, 1475-1476.]])) and Jewell et al. 2006(([[http://www.nar.oxfordjournals.org/cgi/doi/10.1093/nar/gkl083|Jewell, E. et al. (2006). SSRPrimer and SSR Taxonomy Tree: Biome SSR discovery. Nucleic Acids Research 34, W656-W659.]])) | 2 to >6 bp long repeats | Uses Sputnik | | Limit of 4000 bp sequence | | | Yes uses Primer3 | [[http://flora.acpfg.com.au/ssrprimer2/cgi-bin/index|Yes]] | ? Contact the [[http://www.appliedbioinformatics.com.au/|group]] | ? | Perl scripts | ? | ? | ? | ^TRAP | 2006 | 2005 | 10 | Sobreira et al 2006(([[http://bioinformatics.oxfordjournals.org/cgi/doi/10.1093/bioinformatics/bti809|Sobreira, T. J. P., Durham, A. M., and Gruber, A. (2005). TRAP: automated classification, quantification and annotation of tandemly repeated sequences. Bioinformatics 22, 361-362.]])) | 1 to 2000 bp repeat | Uses TRF | All TRF parameters, as well as: min and max number of motifs; min and max motif size (period), min size of flanking regions and min match % between adjacent motifs. | One file with multiple sequences in fasta format. | Many formats (csv, HTML, flat files, and GFF) | Yes, imperfect and compound | No, but generates a fasta file with TR regions masked with Ns | no | Command-line [[http://www.coccidia.icb.usp.br/trap/trapRegister/|Download]] | Unix; MacOSX | Perl scripts | Near-proportional to the number of TRs found by TRF | Selection, classification, quantification and automated annotation of TR sequences | Not compatible with MS Windows version of TRF | ^cid | 2008 | ? | 11 | Freita et al 2008(([[http://doi.wiley.com/10.1111/j.1471-8286.2007.01950.x|Freitas, P. D., Martins, D. S., and Galetti, P. M. (2008). cid: a rapid and efficient bioinformatic tool for the detection of SSRs from genomic libraries. Molecular Ecology Resources 8, 107-108.]])) | Same as MISA | Uses MISA for tandem repeat detection, and other external programs for other steps | | Set of chromatograms or multiFASTA file | List of useful primers | | Yes, using Primer3 | ? | Web environment, Availability unknown, contact the [[mailto: patdf@iris.ufscar.br|authors]] | ? | perl and php to connect the different tools | ? | Can mask vectors and adaptors regions of cloned sequences | ? | ^Etandem (in EMBOSS) | 2008 | ? | [[http://scholar.google.ca/scholar?cites=2718801218157611554&as_sdt=2005&sciodt=0,5&hl=fr|2119]] | Rice et al 2010(([[http://linkinghub.elsevier.com/retrieve/pii/S0168952500020242|Rice, P., Longden, I., and Bleasby, A. (2000). EMBOSS: The European Molecular Biology Open Software Suite. Trends in Genetics 16, 276-277.]])) | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ^STAMP | 2009 | 2010 | [[http://scholar.google.ca/scholar?cites=595056583974417882&as_sdt=2005&sciodt=0,5&hl=fr|16]] | Kraemer et al 2009(([[http://www.biomedcentral.com/1471-2105/10/41|Kraemer, L., Beszteri, B., Gäbler-Schwarz, S., Held, C., Leese, F., Mayer, C., Pöhlmann, K., and Frickenhaus, S. (2009). STAMP: Extensions to the STADEN sequence analysis package for high throughput interactive microsatellite marker design. BMC Bioinformatics 10, 41.]])) | Imperfect and perfect, 1-unlimited | Based on STADEN, uses Phobos (exhaustive search) to detect all TR that match user-defined options. TROLL was previously in STADEN. | Mismatch score, indel score, minimum score, minimum length, minimum perfection, and others. | Fasta file with one or multiple sequences and no limit in sequence length, Staden experimental files. | Text files, different formats | Yes (substitutions and indels), but only those defined by user. | Yes, also for multiplexing microsatellites markers. Integrates Primer3 into GAP4 | No | Plugged into the GUI offered by STADEN, [[http://www.awi.de/en/research/scientific_computing/research_topics/bioinformatics/software/#c6699|Download]] | Mac Os X, Linux, Windows. Others upon request. | TCL, C++ | Execution time increases with pattern size range. Very fast in the size range 1-10 bp, slow for patterns in the size range above 20 bp | Automatic pipeline, interaction in primer design step possible. | Free only for academic users. | ^WebSat | 2009 | ? | [[http://scholar.google.ca/scholar?cites=16509104044085592718&as_sdt=5&sciodt=0&hl=fr|22]] | Martins et al 2009(([[https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2646864/|Martins, W. S., Soares Lucas, D., de Souza Neves, K. F., and Bertioli, D. J. (2009). WebSat ‐ A web software for microsatellite marker development. Bioinformation 3, 282-283.]])) | Same as TROLL | Uses TROLL | Motif length and minimum number of motifs | Individual sequences, in raw or FASTA format, or a group of sequences in a multi-FASTA format, maximum of 150 000 characters. | ? | ? | ? | [[http://wsmartins.net/websat/|Yes]] Uses Ajax techniques | Command-line [[http://sourceforge.net/projects/satfinder/|Download]] | ? | Written in PHP and JavaScript, making use of Ajax techniques | ? | ? | ? | ^TReaDS | 2010 | ? | ? | Renda et al 2010(([[http://bioalgo.iit.cnr.it/papers/TReaDS-IIT-TR06-08.pdf|Renda, E., Vecchio, A., and Pellegrini, M. (2011). TReaDS: Tandem Repeats Discovery Service. In BITS’11 (Pisa, Italy), pp. 102-104.]])) | ? | Includes Approximate Tandem Repeat Hunter (ATRHunter), mreps, Tandem Repeat finder (TRF), TandemSwan, TRStalker (new algorithm, only available on TReaDS) | ? | ? | ? | ? | ? | [[http://yucca.iit.cnr.it:8080/treads/|Yes]] | ? Availability unknown, contact the [[mailto:a.vecchio@ing.unipi.it;elena.renda@iit.cnr.it;marco.pellegrini@iit.cnr.it|authors]] | ? | ? | ? | ? | ? | ===Algorithms details=== ==Sputnik== Abadjian 1994 and La Rota et al. 2005(([[http://www.biomedcentral.com/1471-2164/6/23|La Rota, M., Kantety, R., Yu, J.-K., and Sorrells, M. (2005). Nonrandom distribution and frequencies of genomic and EST-derived microsatellite markers in rice, wheat, and barley. BMC Genomics 6, 23.]])) Sputnik uses a recursive algorithm and the performance depends on the recursion depth of the program. Sputnik reads through the entire sequence, assumes the existence of a repeat at every position, compares subsequent nucleotides and applies a simple scoring rule. If the resulting score rises above a preset threshold, the region along with its position and score is written out. Score is determined by the length of the repeat and the number of errors. Each nucleotide that matches the value predicted (by assuming a repeat) adds to the score. Each "error" subtracts from the score. When an error is encountered, the three possible kinds of errors (mismatch, insertion and deletion) are assumed and recursive calls to the comparison routine are made. If the resulting score from one of these is above the cutoff threshold, it is returned and the best of three pursued. ==Unnamed 1998== Sagot and Myers 1998(([[http://www.liebertonline.com/doi/abs/10.1089/cmb.1998.5.539|Sagot, M.-F., and Myers, E. W. (1998). Identifying Satellites and Periodic Repetitions in Biological Sequences. Journal of Computational Biology 5, 539-553.]])) It is an exhaustive search, but with a pre-filtering using a heuristic. It uses a general combinatorial framework of “consensus repeat” and uses of some heuristic filtering steps to avoid exponential increase in time complexity. The algorithm exhaustively searches for all possible consensus models whose edit distance to the real repeated units is no bigger than an upper bound. The algorithm works by extending prefix models. The first step is a filter that eliminates all regions whose probability of containing a satellite is less than one in a certain threshold (?), when percentage of sequence variation between units is 10%. The second part realizes an exhaustive exploration of the space of all possible models for the repeating units present in the sequence. ==TEIRESIAS== Rigoustos et al 1998(([[http://bioinformatics.oxfordjournals.org/cgi/doi/10.1093/bioinformatics/14.1.55|Rigoutsos, I., and Floratos, A. (1998). Combinatorial pattern discovery in biological sequences: The TEIRESIAS algorithm. Bioinformatics 14, 55-67.]])) Is a heuristic variation on TRF algorithm. It is a model less algorithm that replaces the TRF contiguous k-tuples with patterns of not necessarily contiguous k characters. The first scanning step detects potential elementary patterns, using the heuristics that the number of equi-spaced patterns reported is larger in TR regions than elsewhere in the sequence. The second step is a convolution phase, it is a verification step where elementary patterns are combined into larger patterns and false positives are discarded using a scoring system. ==TRF== Benson 1999(([[http://www.nar.oupjournals.org/cgi/doi/10.1093/nar/27.2.573|Benson, G. (1999). Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Research 27, 573-580.]])) It was originally designed to search for long tandem repeats. TRF is based on the alignment procedure and k-tuples. TRF uses a probabilistic algorithm that includes a “detection step” to identify the candidate repeats (seeds of miminum of 5 bp) and an “analysis” step, according to statistically-founded criteria to filter the candidate repeats. ==Unnamed 2001== Landau et al 2001(([[http://www.liebertonline.com/doi/abs/10.1089/106652701300099038|Landau, G. M., Schmidt, J. P., and Sokol, D. (2001). An algorithm for approximate tandem repeats. Journal of Computational Biology 8, 1-18.]])) The algorithm finds all approximate single repeats within a sequence. Considers two criterions of similarity: the Hamming distance (k mismatches) and the edit distance (k differences). In each iteration i, the input string is divided into substrings. Each substring is searched individually for all repeats that span its centre character. ==REPuter== Kurtz et al 2001(([[http://www.nar.oupjournals.org/cgi/doi/10.1093/nar/29.22.4633|Kurtz, S. (2001). REPuter: the manifold applications of repeat analysis on a genomic scale. Nucleic Acids Research 29, 4633-4642.]])) The algorithm uses two different distance models: the Hamming distance model and the edit distance model. It then assesses the significance of a repeat by computing its E-value, i.e. the number of repeats of the same length or longer and with the same number of errors or fewer that one would expect to find in a random DNA of the same length. ==SSRIT== Temnykh et al 2001(([[http://www.genome.org/cgi/doi/10.1101/gr.184001|Temnykh, S. (2001). Computational and experimental analysis of microsatellites in rice (//Oryza sativa// L.): frequency, length variation, transposon associations, and genetic marker potential. Genome Research 11, 1441-1452.]])) The script uses regular expressions to locate SSR patterns in sequence files. In a second step, it eliminates redundancy by locating candidate sites to original sequence using BLAST search. ==ComplexTR== Hauth and Joseph 2002(([[http://bioinformatics.oxfordjournals.org/cgi/doi/10.1093/bioinformatics/18.suppl_1.S31|Hauth, A. M., and Joseph, D. A. (2002). Beyond tandem repeats: complex pattern structures and distant regions of similarity. Bioinformatics 18, S31-S37.]])) The algorithm both locates and characterizes TR regions. Three main tasks are performed: 1) isolate a tandem repeat by determining its period and its approximate sequence location; 2) determine the pattern associated with a region period; 3) characterize the region using the pattern. As in TRF(([[http://www.nar.oupjournals.org/cgi/doi/10.1093/nar/27.2.573|Benson, G. (1999). Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Research 27, 573-580.]])), the algorithm uses k-length substrings in a sequence by finding recurring distances between identical substrings. The ComplexTR differs from TRF in that a statistical model is not used to locate interesting periods, but a simple and accurate filter. In the first step, a distance arrays D, parallel to S is constructed to record a distance d between identical occurrences of k-length substrings (termed words). The algorithm requires at least five identical d occurrences to be present in a run. ComplexTR uses word and distances similarities to determine significant periods within a region. The general procedure for the second step (construct region base pattern) is to select a segment of sequence corresponding to all or a portion of a copy, to align the segment to several copies and to for a pattern using the consensus. This constitutes the region based pattern. The third step (characterize using region pattern) uses wraparound dynamic programming to handle regular expressions un the context of TR identifications. The final alignment is displayed as a series of copies, each representing a pattern occurrence in S, and a consensus copy is formed. ==TROLL== Castelo et al 2002(([[http://bioinformatics.oxfordjournals.org/cgi/doi/10.1093/bioinformatics/18.4.634|Castelo, A. T., Martins, W., and Gao, G. R. (2002). TROLL--Tandem Repeat Occurrence Locator. Bioinformatics 18, 634-636.]])) and Martins et al. 2006(([[http://www.nar.oxfordjournals.org/cgi/doi/10.1093/nar/gnj030|Martins, W. (2006). New softwares for automated microsatellite marker development. Nucleic Acids Research 34, e31-e31.]])) The algorithm uses the dictionary approach to find tandem repeats of pre-selected motifs. Find all occurrences from a list of patterns in a text string (the Aho–Corasick algorithm). At a mismatch character, it uses the failure links to continue to search without having to re-sample characters in the text. It uses "repeat buffer" operation to avoid missing repeats or counting the same site multiple times. Reads and write are done in constant time ==ATRHunter== [[http://bioinfo.cs.technion.ac.il/atrhunter/ATR.pdf|Wexler et al 2004]] or Wexler et al 2005(([[http://www.liebertonline.com/doi/abs/10.1089/cmb.2005.12.928|Wexler, Y., Yakhini, Z., Kashi, Y., and Geiger, D. (2005). Finding approximate tandem repeats in genomic sequences. Journal of Computational Biology 12, 928-942.]])) The algorithm is heuristic search originally designed to search for long tandem repeats. First, the screening phase identifies substrings that have an unusually high probability of being an ATR using three similarity criteria. Second, the verification phase determines which of these candidates are in fact ATRs of the desired type. ==MISA== Thiel et al 2003(([[http://www.springerlink.com/content/mv768txyrmxm4kga/|Thiel, Michalek, Varshney, and Graner (2003). Exploiting EST databases for the development and characterization of gene-derived SSR-markers in barley (//Hordeum vulgare// L.). TAG Theoretical and Applied Genetics 106, 411-422.]])) The algorithm is not described ==Mreps== Kolpakov et al 2003(([[http://www.nar.oupjournals.org/cgi/doi/10.1093/nar/gkg617|Kolpakov, R., Bana, G., and Kucherov, G. (2003). mreps: efficient and flexible detection of tandem repeats in DNA. Nucleic Acids Research 31, 3672-3678.]])) The algorithm is a mixed combinatorial/heuristic paradigm. First an exhaustive combinatorial algorithm finds all approximate tandem repeats. Those repeats are then submitted to a heuristic treatment in order to obtain more biologically relevant representation of the repeats. This heuristic step includes trimming edges, computing the best period and merging, filtering out statistically expected repeats, gathering the results. The user specifies a resolution parameter that determines the ‘fuzziness’ of found repeats instead of introducing a scoring function and specifying a threshold score value for found repeats. ==STRING== Parisi et al 2003(([[http://bioinformatics.oxfordjournals.org/cgi/doi/10.1093/bioinformatics/btg268|Parisi, V., De Fonzo, V., and Aluffi-Pentini, F. (2003). STRING: finding tandem repeats in DNA sequences. Bioinformatics 19, 1733-1738.]])) The algorithm is a dynamic programming procedure originally designed to search for long tandem repeats. The general strategy is the following: (a) Instead of studying the whole sequence we examine only the tracts (interesting zones) that we consider to be the more promising ones. (b) Instead of studying all possible consensus words we examine only the words that we consider to be the more promising ones. Detailed steps: 1) First using a local alignment (autoalignment) procedure; 2) Second, by group in suitable clusters all the autoalignments obtained in phase 1, by putting in the same cluster all the autoalignments whose augmented extensions are not disjoint. ==W-SSRF== Sreenu et al 2003(([[http://www.ncbi.nlm.nih.gov/pubmed/15130803|Sreenu, V. B., Ranjitkumar, G., Swaminathan, S., Priya, S., Bose, B., Pavan, M. N., Thanu, G., Nagaraju, J., and Nagarajaram, H. A. (2003). MICAS: a fully automated web server for microsatellite extraction and analysis from prokaryote and viral genomic sequences. Appl. Bioinformatics 2, 165-168.]])) Scans a nucleotide sequence for the presence of perfect simple sequence repeats from motifs of 1 to 10 bp long. ==IRF== Warburton et al 2004(([[http://www.genome.org/cgi/doi/10.1101/gr.2542904|Warburton, P. E. (2004). Inverted repeat structure of the human genome: The X-chromosome contains a preponderance of large, highly homologous inverted repeats that contain testes genes. Genome Research 14, 1861-1869.]])) Algorithm not known ==ExTRS== Krishnan and Tang 2004(([[http://bioinformatics.oxfordjournals.org/cgi/doi/10.1093/bioinformatics/bth311|Krishnan, A., and Tang, F. (2004). Exhaustive whole-genome tandem repeats search. Bioinformatics 20, 2702-2710.]])) It is an exhaustive search. The algorithm takes a definition of tandem repeat that uses the Hamming distance measure, and provably finds all tandem repeats. The definition is parameterized by a mismatch ratio p which allows for more mismatches in longer tandem repeats (and fewer in shorter), thereby avoiding the problems of a fixed mismatch count. Additionally, the algorithm uses a filtering algorithm that prunes the resultant exhaustive set to a smaller one with fewer redundancies. The different algorithms are parallel (can independently search for tandem repeats for different pattern lengths k) and well suited for Beowulf-class clusters. ==SRF== Sharma et al. 2004(([[http://bioinformatics.oxfordjournals.org/cgi/doi/10.1093/bioinformatics/bth103|Sharma, D., Issac, B., Raghava, G. P. S., and Ramaswamy, R. (2004). Spectral Repeat Finder (SRF): identification of repetitive sequences using Fourier transformation. Bioinformatics 20, 1405-1412.]])) The algorithm uses the following steps: 1) Input a DNA sequence of length n. 2) Compute the power spectrum, S( f ), and the spectral average, Š, of the entire sequence. 3) Identify all peaks with S(fi)/ Š > T (the threshold, here chosen to be 4). For each frequency fi so identified, there are potential repeats of length Ni = 1/fi. 4) For each of these, compute the Pm( j ) = S(fi)/ Š in a sliding window of length m centered on position j in the sequence. Regions containing the repeat of length Ni can be identified directly as those where Pm( j ) exceeds the threshold. 5) Since both the repeat length, Ni, and its location are known, an exact method (step 6) is used to identify the repeat units. 6) Consider all Ni-mers in the repeat region and identify those occurring most frequently by local alignment. This automatically makes it possible to allow for insertions and deletions to any desired level. ==STAR== Delgrange and Rivals 2004(([[http://bioinformatics.oxfordjournals.org/cgi/doi/10.1093/bioinformatics/bth335|Delgrange, O., and Rivals, E. (2004). STAR: an algorithm to search for tandem approximate repeats. Bioinformatics 20, 2812-2820.]])) The algorithm detects all significant Approximate Tandem Repeats (ATR) of a given motif, where significance is assessed using the Minimum Description Length (MDL) criterion. MDL provides an absolute measure of the significance of an ATR independently of the motif. It evaluates how many mutations are allowed in an ATR when compared to an ETR of the best possible length. The STAR algorithm needs no threshold value and optimally locates ATR of any input motif with respect to the MDL criterion. ==TRA== Bilgen et al 2004(([[http://bioinformatics.oxfordjournals.org/cgi/doi/10.1093/bioinformatics/bth410|Bilgen, M., Karaca, M., Onus, A. N., and Ince, A. G. (2004). A software program combining sequence motif searches with keywords for finding repeats containing DNA sequences. Bioinformatics 20, 3379-3386.]])) Is a useful algorithm for expressed sequences. There are two different algorithms implemented in two modules. The first module searches for exact and compound motifs. It searches for Sn, a string of repeated units in a DNA sequence wn- S1 = w1 [i1, j1] symbolizes the S1 starting with the i1-th and ending with the j1-th bases of the DNA sequence w1. The distance between i1 and j1 will therefore be equal to m1 × r1 where m1 and r1 refer to a type of DNA motif length and the number of repeats in S1 string of each w of a fixed length, respectively. When applicable, strings in a sequence of w are referred to as S1, S2, S3 . . . Sn for each consecutive string in a sequence w. The distance between S1 and S2 is referred to as d1, the distance between S2 and S3 is d2 and so on till Sn, dn- . The second module that searches for inexact and compound motifs with STRING algorithm (Parisi). Overall, program goes as follows: (i) searches the user defined organism (s) and/or keywords (organs, cell lines, tissue types or development stages) analyzing the whole data set provided in a data folder; (ii) isolates simple and non-simple (compound, imperfect and extended compound) tandem repeats by determining their type, lengths, and sequence location in a given DNA strings within DNA sequences; (iii) characterizes the repeats containing sequences based on the user defined parameters/ options, and; (iv) displays the results according to the user’s parameters/options. ==MsatFinder== Thurston and Field 2005(([[http://www.genomics.ceh.ac.uk/msatfinder/|Thurston, M. I., and Field, D. (2006). Msatfinder, (Oxford, UK: Centre for Ecology and Hydrology. Computer program)]])) No published description of the algorithm could be found. According to the online access, regular expression (regex), multipass or iterative search engines can be used. ==FireµSat== de Ridder at al 2006(([[http://portal.acm.org/citation.cfm?doid=1216262.1216289|de Riddera, C., Kourie, D. G., and Watson, B. W. (2006). FireµSat. In Proceedings of the 2006 annual research conference of the South African institute of computer scientists and information technologists on IT research in developing couuntries - SAICSIT ’06 (Somerset West, South Africa), pp. 247-256.]])) and de Ridder at al 2013(([[http://www.sciencedirect.com/science/article/pii/S1570866712001657|De Ridder, C., D.G. Kourie, B.W. Watson, T.R. Fourie, and P.V. Reyneke (2013). Fine-tuning the search for microsatellites. Journal of Discrete Algorithms 20: 21–37.]])), Is a combination of straightforward FA technology combined with a flavour of Moore machine technology. It uses Counting Finite Automata, which are regular language acceptors. The parameters details are the following: * Max Motif Error: Is the number of motif errors the user wants to allow per motif (mutations/ motif errors allowed: deletions, mismatches, insertions). * Max adjacent ATR elements: The number of ATREs that the user allows next to each other. * Motif Range Options: The motif range option enables the user to specify a range of motifs FireµSat should search for. * Min required TR elements: Indicates the minimum number of TREs that should occur before a TR is output – this parameter serves as a length filter. * Mismatch penalty (m_p): The penalty value allocated to a mismatch. * Delete penalty (d_p): The penalty value assigned to a deletion. * Insert penalty (i_p): The penalty value allocated to an insertion. * Max substring error: Is a threshold value that the user enters. The substring error should always be smaller than the Max substring error. The calculation of the substring-error (α) is simply: α = (m_p x n_m) + (i_p x n_i) + (p_d + n_d) where Mismatch count (n_m) is the number of mismatches; delete count (n_d) is the number of deletions and insert count (n_i) is the number of insertions. ==Phobos== Mayer 2010(([[http://www.ruhr-uni-bochum.de/spezzoo/cm/cm_phobos.htm |Mayer, C. (2010). Phobos: Highly accurate search for perfect and imperfect tandem repeats in complete genomes by Christoph Mayer, (Bochum, Germany: Ruhr-Universität Bochum,Faculty of Biological Sciences and Biotechnology). Computer program.]])) Uses the alignment score as an optimality criterion to decide whether to extend a satellite up or downstream; is able to find alternative alignments with different units for the same satellite. ==SSRscanner== Anwar and Khan 2006(([[http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1891659/|Anwar, T., and Khan, A. (2006). SSRscanner: a program for reporting distribution and exact location of simple sequence repeats. Bioinformation 1, 89-91.]])) Uses dictionary approach to find simple sequence repeats of pre-selected motifs. ==TandemSWAN== Boeva et al 2006(([[http://bioinformatics.oxfordjournals.org/cgi/doi/10.1093/bioinformatics/btk032|Boeva, V., Regnier, M., Papatsenko, D., and Makeev, V. (2006). Short fuzzy tandem repeats in genomic sequences, identification, and possible role in regulation of gene expression. Bioinformatics 22, 676-684.]])) The algorithm is based on calculation of the repeat’ statistical significance, and identifies the length of the repeated unit and the number of repetitions. It consider only the repeated structures whose number of copies is integer and greater than two. Since algorithm does not search for tandem repeats with period size less than 3, it does not detect poly-A or TATA-like sequences. ==OMWSA== Du et al 2007(([[http://bioinformatics.oxfordjournals.org/cgi/doi/10.1093/bioinformatics/btm008|Du, L., Zhou, H., and Yan, H. (2007). OMWSA: detection of DNA repeats using moving window spectral analysis. Bioinformatics 23, 631-633.]])) The optimized moving window spectral analysis uses Fourier spectrogram, is a text-free digital signal processing based method. The moving window spectral analysis procedure produces the spectrogram at each frequency k and location n in the location-frequency plane. Therefore, the weight vector W can be set dependent on the spectrum at the frequency k and location n. If it shows the periodic components as highlighted regions in the spectrogram in the location-frequency plane, then from the coordinates (n, k) of the regions, it can obtain the information of both the periods and locations of DNA repeats. ==IMEX== Mudunuri et al 2007(([[http://bioinformatics.oxfordjournals.org/cgi/doi/10.1093/bioinformatics/btm097|Mudunuri, S. B., and Nagarajaram, H. A. (2007). IMEx: Imperfect Microsatellite Extractor. Bioinformatics 23, 1181-1187.]])) It is a simple string-matching algorithm that scans the entire sequence using sliding window approach and reports the results in a single run. Is a two-step procedure: (a) identification of microsatellite nucleation sites (b) extension of the nucleation sites on both sides. IMEx progressively scans for nucleation sites starting from the longest repeat unit. The repeating motif at every iteration can harbor up to ‘k’ number of point mutations (substitutions or indels of nucleotides) and the user can set a value for ‘k’ between 0 and m where m = repeat motif size. ==SciRoKo== Kofler et al 2007(([[http://bioinformatics.oxfordjournals.org/cgi/doi/10.1093/bioinformatics/btm157|Kofler, R., Schlotterer, C., and Lelley, T. (2007). SciRoKo: a new tool for whole genome microsatellite search and investigation. Bioinformatics 23, 1683-1685.]])) The programs contain two main modules: an SSR search module, which supports five different SSR search modes and a module for SSR-statistics, notably for mismatch frequency and compound microsatellite analysis. A nucleotide at position i is tested for identity with the nucleotide at position i + t, where t is the motif length (1–6). Upon identity i is increased i = i + 1 until no further identity can be found. ==Msatcommander== Faircloth 2008(([[http://doi.wiley.com/10.1111/j.1471-8286.2007.01884.x|Faircloth, B. C. (2008). msatcommander: detection of microsatellite repeat arrays and automated, locus-specific primer design. Molecular Ecology Resources 8, 92-94.]])) It uses regular expression pattern matching within each DNA sequence to locate microsatellite arrays within user-selected repeat classes. Is used to locate all microsatellite repeats fitting these designations. DNA sequences are first scanned in the 5'–3' orientation. The program then takes a second pass through the complement of the sequence in the 3'–5'orientation. ==ReRep== Otto et al 2008(([[http://www.biomedcentral.com/1471-2105/9/366|Otto, T. D., Gomes, L. H. F., Alves-Ferreira, M., de Miranda, A. B., and Degrave, W. M. (2008). ReRep: Computational detection of repetitive sequences in genome survey sequences (GSS). BMC Bioinformatics 9, 366.]])) It is a pipeline that is based on similarity searches, the interpretation of sequence landscapes (SL), the assembly of clustered sequences and in-house Perl scripts. First, all reads are compared to each other with BLAST, with a word size of 8 and an e-value cut-off of 10-20, or with NUCMER tool, with a word size of 11. Each result is pre-processed by joining overlapping hits and by deleting self-hits. For each read, an SL is constructed by counting how often each base of the read is part of a hit with another read. Generate the graph using external tools. The minimal length that an alignment must have to enter into the analysis was defined as l. Runs with different values for l can be performed. ==T-REKS== Jorda and Kajava 2009(([[http://bioinformatics.oxfordjournals.org/cgi/doi/10.1093/bioinformatics/btp482|Jorda, J., and Kajava, A. V. (2009). T-REKS: identification of Tandem REpeats in sequences with a K-meanS based algorithm. Bioinformatics 25, 2632-2638.]])) Mainly designed for amino acid repeats. Based on clustering of lengths between identical short strings by using a K-means algorithm and on the idea that in a tandem repeat region, the most frequently occurred length between identical SSs should be equal to the repeat length. Therefore, detection of regions of an analyzed sequence where certain lengths between identical SSs have anomalously high occurrence may lead to the localization of the tandem repeats. Steps are as follow: 1) Short string probes and K-means clustering (Use K-means algorithm for unsupervised classification) 2) Establishment of tandem repeat lengths 3) Contiguity filtering, results in identification of hypothetical repeats 4) Extension and bridging of runs 5) Similarity filtering using a build-in program based on “center-star algorithm”, CLUSTALW and MUSCLE. ==BwTRS== Pokrzywa and Polanski 2010(([[http://linkinghub.elsevier.com/retrieve/pii/S0888754310001758|Pokrzywa, R., and Polanski, A. (2010). BWtrs: A tool for searching for tandem repeats in DNA sequences based on the Burrows–Wheeler transform. Genomics 96, 316-321.]])) It is an efficient data compression algorithm based on the idea of backward search with the Burrows–Wheeler Transform (BWT) algorithm. It allows listing all occurrences of exact tandem repeats in a given string of length n in O(n log n) time. It then uses the efficient string indexing structure by Ferragina and Manzini for searching for the occurrences of so called rearmost tandem repeats that are then used to list the locations of the desire preferred tandem repeats, namely, the maximal tandem repeats of the primitive motif. ==QDD, QDD2== Meglecz et al 2010(([[http://bioinformatics.oxfordjournals.org/cgi/doi/10.1093/bioinformatics/btp670|Meglecz, E., Costedoat, C., Dubut, V., Gilles, A., Malausa, T., Pech, N., and Martin, J.-F. (2009). QDD: a user-friendly program to select microsatellite markers and design primers from large sequencing projects. Bioinformatics 26, 403-404.]])) The algorithm takes the following steps: - Sequence cleaning and microsatellite detection (Sort according to sequence tags, trims vector and adaptors sequences). - Sequence similarity detection (Is time consuming, is meant to remove redundancy, and sequences that are part of a repetitive region of the genome). - Iterative primer design using Primer3. Iterations (a) from designing primers only for perfect microsatellites with no short repeats in the flanking region till allowing multiple target microsatellites, short repetitions and homopolymers in the amplified region (b) produce primer pairs with PCR products in different size ranges. - BLASTs selected sequences to Genbank to detect serious contaminations or mixing up samples.