bioinformatic_tools_to_detect_microsatellites_loci_from_genomic_data [CSBQ-QCBS Wiki]

Mining genomic data for tandem repeats

Abstract

Microsatellites are short sequences repeated in tandem in a genome, where the repeat unit is generally 2 to 6 base pairs (bp) long. These genomic regions tend to rapidly accumulate mutations in the form of duplications and deletions of repeat units, due to specific molecular mechanisms such as replication slippage, unequal crossing-over, unequal sister chromatid exchange and slipped-strand mispairing. Point mutations also occur subsequently during DNA replication, repair, or recombination. The resulting high level of length polymorphism (in base pairs) makes microsatellites appropriate markers for estimating within-population genetic diversity, which is an important indicator in biodiversity science.

Traditionally, the discovery of microsatellite loci involved long and costly molecular biology experiments using magnetic beads to enrich for specific repeats, followed by cloning, clone screening and sequencing. The level of polymorphism at the discovered loci is then assessed by PCR amplification from several individuals of a population, using primers specific to the conserved regions flanking each microsatellite locus, followed by fragment length estimation by capillary or acrylamide gel electrophoresis. Depending on the mutation level in the flanking regions, the primers may also amplify successfully in related species of the same genus, or even in more distantly related genera.

With the advent of genome sequencing projects, many bioinformatics tools have been developed to mine available sequence data for microsatellite loci. These tools have methodological biases, and mining the same set of sequences with different algorithms commonly yields widely different sets of detected loci. In all cases, they represent a useful, fast and low-cost alternative for discovering novel microsatellite loci in a genome. A few of these bioinformatics tools are presented here.

Table 1. Non-exhaustive list of bioinformatic algorithms and pipelines available for detecting tandem repeat sites from genomic sequence data; updated from Sharma et al 2007¹⁾ and Lerat 2009²⁾. WARNING: The table is still under construction; any inaccuracy can be reported to the QCBS team.

Name	Year released	Latest update	Number of citations	Original publication	Repeats detected	Algorithm	Parameters	Input file	Output file	Detect imperfect or compound loci	Design primers	Online access	Interface (GUI or command-line)	Platform	language	Speed	Special Features	Limitations
Sputnik and Sputnik II	1994	2005	124	Abajian, 1994, with modifications in La Rota et al. 2005³⁾	2 to 5 bp, or 1 to 5 bp in Morgante et al 2002⁴⁾	Recursive	Match bonus and mismatch penalty, the validation score, the fail score, the maximum number of recursions, the minimum percentage of perfection, and the period size.	One file with one sequence	Resulting hits are written to stdout along with their position in the sequence, length, and score	Yes, Insertions, mismatches and deletions and compound	?	?	? Download, SputnikII modified in 2005 Download	Windows	C	Execution time is dependent on the sequence composition		Automated statistical analysis files not generated.
Not named	1998	?	37	Sagot and Myers 1998⁵⁾	Detect repeats of fixed length, of 2 to 45 bp long per unit.	Exhaustive	Minimal size of a repeat (min_repeat), minimum number of repeats within a TR (max-range), ? (max-jump)	?	?	Yes, substitutions and indels	No	No	?	?	?	Execution time increases with the number of TR detected	Can be adapted to finding TR in protein sequences; and to identifying mixed direct-inverse TR.	Does not detect duplicated genes
TEIRESIAS	1998	?	502	Rigoutsos et al 1998⁶⁾	Detects nested TR	Heuristic	L (length of interrogated sequence) and W (length of one unit) and threshold K (length of the unit going to verification step)	One file with multiple sequences.	?	Yes, only substitutions (not indels)	no	No	? Available upon request to the authors	?	?	Execution time increases quasi-linearly with length of units (W) and length of interrogated sequence (L)	Motifs are guaranteed to be maximal in length and composition. Can identify duplicated genes. Can search in amino acid sequences.	?
TRF	1999	2004	1487	Benson 1999⁷⁾	>6 bp repeats	Heuristic; based on the alignment procedure and k-tuples.	a) alignment weights for match, mismatch and indels; b) pM and pI; c) minimum size for patterns to report; d) minimum alignment score (threshold).	One file with one sequence	A summary table file with location and statistical properties of the TR found. The other file is alignment of each repeat with its consensus sequence	Yes, imperfect and compound		Yes	? Download	platform independent	Source code not available	Slow. Execution time is exponential to the number of repeats detected	Able to detect TR in a wide range of size and copy number	Does not accept sequences longer than 5 Mb; Automated statistical analysis files not generated; output files are difficult to manage.
Not named	2001	?	94	Landau et al 2001⁸⁾	?	Exhaustive; uses Hamming distance (k mismatches) and edit distance (k differences)	?	?	?	Yes, substitutions (but not indels)		No	? Available upon request to the authors	?	?	?	?	?
REPuter	2001	2005	341	Kurtz et al 2001⁹⁾	Detect repeats of fixed length. Repeats need not to be in tandem	Heuristic; uses the Hamming distance model and the edit distance model	sequencetype, smap, direct, palindromic, length, hamming, edit, best, content, identity, evalue (error threshold k ≥ 0 and a length threshold l > 0 are given)	Limit of 5 Mb in the online version, DNA or protein	Statistical and graphical analysis	Yes, imperfect including substitutions and compound		Yes	Command-line, Download	Linux, OSX, Solaris, Irix and Alpha	?	?	Nucleic acid or amino acid sequence; combine linear time efficiency with exhaustive analysis	Is not primarily designed for microsatellites
SSRIT (and CUGIssr)	2001	?	666	Temnykh et al 2001¹⁰⁾	2 to >6 bp long repeats	Uses regular expressions and similarity searches	?	Multiple files with one sequence in a fasta format. Limit of 1 Mb per sequence analyzed.	Reports the GenBank ID, SSR motif, number of repeats, sequence coordinates for each SSR and GC% in DNA sequences (up to 500 bp in length) immediately adjoining SSR	No, perfect only	Yes, using Primer 0.5	Yes	Command-line Download	platform independent	Perl script	?	?	Automated statistical analysis files not generated;
ComplexTR	2002	2005	33	Hauth and Joseph 2002¹¹⁾	Long repeats, variable length TR (VLTRs) and multiperiod TR (MPTRs)	Seed-extension technique; analyses k-length substrings	?	One file with one sequence	HTML page with detailed characterization of each TR region			The web access is not functioning	Command-line Download	?	C, perl. Tandem repeat identification code (C++); Web Interface via CGI scripts (perl)	?	Identifies duplicated genes and pseudogenes	?
TROLL	2002	?	125	Castelo et al 2002¹²⁾ and Martins et al. 2006¹³⁾	Only searches for predefined motifs	Dictionary approach and the failure links for mismatches	Minimum length desired; maximum motif length; file containing the motif list	One file with one sequence	Can be easily imported to other applications	Yes, imperfect and compound (indels?)	Yes, with Primer3	No	Command-line Download	Linux	C (Tcl/Tk script)	?	?	?
ATRHunter	2004	Not updated on a regular basis.	47	Wexler et al 2004 or Wexler et al 2005¹⁴⁾	?	Heuristic	1) Alignment parameters (match, mismatch, gap, terminal gap) 2) Maximum motif length 3) Similarity level between adjacent copies 4) Average similarity level between adjacent copies 5) Minimum alignment score with a repeating copy.	One sequence, no limit in length.	?	Yes	no	Yes	Command-line Download	?	?	?	?	?
MISA	2003	2010	468	Thiel et al 2003¹⁵⁾	One to six bp motifs	?	Unit sizes; lower threshold of repeats for that specific unit; maximal number of bases between two adjacent microsatellites to be recognised as being a compound microsatellite	Individual or group of sequences in fasta format	Two files: one table file with localization and type of identified microsatellite(s); one file with statistics (e.g. frequency of a specific microsatellite type)	Yes, compound and imperfect (but no indels) (other source mention it recognizes only perfect)	Yes, with Primer3	No	Command-line Download	Platform independent	perl script	?	?	Detection of interrupted TR not efficient; inappropriate classification of different motifs; automated statistical analysis files not generated.
Mreps	2003	?	156	Kolpakov et al 2003¹⁶⁾	Sites shorter than period + 9 are automatically discarded.	Mixed combinatorial/heuristic paradigm	Start and end positions of the region to be processed; length interval, period interval, minimal exponent of the repetitions to report; resolution level, use or not of the sliding window.	One file with multiple sequences in the fasta format.	list of all repeats, with start and end positions of the repeat in the sequence, overall size of the repeat, period, exponent, error level, the repeat sequence itself	Yes, imperfect and compound (not indels)	Yes	Yes (or here)	Command-line Download	Linux, SunOS, Digital Unix and Windows systems. Online?	ANSI C	Should be fast. linear in the sequence length	?	Automated statistical analysis files not generated;
STRING	2003	2003	32	Parisi et al 2003¹⁷⁾	No limit in repeat length	Heuristic; dynamic programming procedure.	?	One file with one sequence, no limit in length.	Length of the consensus word; first and last position of the TR; number of repeated units; score; consensus word; flanking sequences; alignment between the model TR and the given sequence; number of indels; number of matches and mismatches; TR base composition percentages; flag indicating a likely nested expansion.	Yes, imperfect and compound (and indels?)	No but may be possible	A web interface is mentioned in the original publication, but cannot be found	Command-line Download	Unix, Windows, MacOS	C	Slowly increases as a function of the sequence length, while it increases more quickly as a function of the number of TRs.	?	Automated statistical analysis files not generated.
W-SSRF	2003	?	8	Sreenu et al 2003¹⁸⁾	1 to 10 bp long	Scans a nucleotide sequence	?	One file with one sequence. Upload limit is a 20 kb file.	Sequence content of the motif, repeat numbers, start and end position of the tract in the sequence	No, perfect only	Yes, using Autoprimer (in MICAS)	No	The user friendly GUI (graphical user interface) is MICAS, Available upon request to the authors	MICAS is web-only	Java	?	?	?
IRF program	2004	2007	80	Warburton et al 2004¹⁹⁾	?	?	?	?	?			?	Command-line Download	?	?	?	?	?
ExTRS	2004	?	33	Krishnan and Tang 2004²⁰⁾	?	Exhaustive	?	One file with one sequence, no limit in length.	Redundancy in the output is reduced	Yes, substitutions (not indels)		?	? Available upon request to the authors	?	Source code available only on request	Near-proportional to the number of TR found	?	?
SRF	2004	2004	55	Sharma et al. 2004²¹⁾	2-300 bp (default is 2-10 bp)	Spectral technique	Minimum repeat length; Maximum repeat length; Minimum % Match; FFT peak cut-off; (optional) Minimum number of copies.	One file with one sequence, no limit in length. In fasta or genbank format.	Information about the repeat unit, consensus pattern, region, copy number and score; as well as Fourier spectrum and the detailed analysis of any particular repeat unit.	Yes, imperfect and indels	No	Yes	Available upon request to the author	Linux and Mac	Perl	Execution time increases as a function of repeat/sequence length	Repeats can be tandem, dispersed and/or imperfect	?
STAR	2004	?	52	Delgrange and Rivals 2004²²⁾	Only searches for predefined motifs >2 bp, but allows approximate repeats.	Exhaustive. As searching for microsatellites is a special cases of approximate TR, it can be done globally in a single run using the MotifFile parameter. Contact the authors for more informations.	File with one motif per line, and then each motif is searched independently. Position offset	One file with one sequence	?	Yes, substitutions and indels		As a service on an online platform	Command-line Download	Linux, SunOS, Mac OSX, and Windows systems.	Source code not available	Slow. The overall time complexity needed by STAR to find all ATR of a given motif of length p, in a sequence of length n, is O(np+n log n)	Does not detect TR that would appear in a random sequence (false positives)
TRA	2004	2004	21	Bilgen et al 2004²³⁾	?	Heuristic	?	Multiple files with multiple sequences each (Max 1 Mb sequence) from ESTs.	?	Yes, searches for exact–inexact TRs and exact–inexact compound repeats	No?	?	? Download	Windows	C (with Microsoft Visual C++)	?	Searches among the organisms, organs, tissue types and development stages	?
MsatFinder	2005	2007	48	Thurston and Field 2005²⁴⁾	One to 6 bp long	?	1) length of repeat 2) number repeat unit in the site 3) search engine (regex, multipass or iterative search)	Limit of 10 Mb of sequence in the online access. Accepts GenBank, EMBL, Swissprot, FASTA, ASCII.	Repeats, GFF, Counts, Msat_tabs, Flank_tabs, Fasta, MINE, Primers	No, but detects compound perfect repeats		Yes	Command-line Download	Unix (may work on Mac OSX)	perl script	?	Nucleic acid or amino acid sequence	?
FireµSat	2006	2011	5 and this	de Ridder et al 2006²⁵⁾ and de Ridder et al 2013²⁶⁾	1 to 5 bp. Length set by the user. The next update should allow for detection of 6 to 100 bp repeats.	Uses Counting Finite Automata (which are regular language acceptors)	Max Motif Error (per motif); Max adjacent ATR elements; Motif Range Options; Min required TR elements; Max substring error (a threshold); Mismatch penalty (m_p); Delete penalty (d_p); Insert penalty (i_p).	One fasta file with one sequence	File in .csv format.	Yes, substitutions and indels, but not compound loci.	No	No	GUI and command-line, Download	Windows; Linux in progress	C and MatLab	Run time increases linearly with the sequence length; does not increase with longer motif lengths.	Designed for microsatellites, but can detect any type of TR	Fast, simple and flexible.
Phobos	2006	2010	?	Mayer 2010²⁷⁾	Perfect and imperfect TR, with a pattern size of 1 - 10 000 bp	Exhaustive, uses alignment scores	Mismatch score, indel score, minimum score, minimum length, minimum perfection, and others.	One file in fasta format, with multiple sequences. No limit in sequence length.	Text file, different formats, including gff and fasta	Yes. Substitutions and indels.	Not in itself, but yes as implemented in STAMP or Geneious.	No	User friendly GUI and easily scriptable Command-line program Download	MacOSX, Linux, Windows.	C	Execution time increases with pattern size range. Very fast in the size range 1-10 bp, slow for patterns in the size range above 10-20 bp.	Can be incorporated into pipelines. Implemented in STAMP and Geneious.	Free only for academic users.
SSRscanner	2006	?	3	Anwar and Khan 2006²⁸⁾	Only searches for predefined motifs	Exhaustive, uses dictionary approach.	File containing motifs of different repeat types; number of times for the motifs to be repeated	One file with one sequence	(1) Motifposition.txt (gives the frequency of each repeat provided in the motif file) and (2) Motifresult.exe (gives the specific location of each repeat)	No, perfect only	No	No	Command-line Availability unknown, contact the author	Platform independent	perl script	?	?	?
TandemSWAN	2006	2006	35	Boeva et al 2006²⁹⁾	Fuzzy TR (degenerate TR without indels, but with a high substitution rate), > 3 bp long	Heuristic	Degeneracy level; the minimal and the maximal period sizes; the mode of statistical significance calculation (i.e. the “motif” mode or the “mask” mode); in plain, fasta, EMBL or GenBank format.	One file with one sequence, no limit in length. In plain, EMBL, fasta or GenBank format.	A file with: the start, end, and length of TR; motif size (period); the number of copies; the motif consensus sequence; the number of words in the motif that satisfy the consensus; the probability, P-value and statistical significance for the “motif” and the “mask”; the TR itself.	Yes, only imperfect (no indels)	No	Yes	Command-line Download	Platform independent	C	?	Can simultaneously identify both short- and large period repeats with approximately the same fuzziness	?
OMWSA	2007	?	10	Du et al 2007³⁰⁾	From 3 bp long to unlimited length	Spectral technique	Window size	?	A spectrogram (an image file)	Yes, imperfect compound and indels	No	?	Command-line Download	?	?	?	Graphically displays the potential repeats in the location-frequency plane	?
IMex	2007	2009	27	Mudunuri et al 2007³¹⁾	One to 6 bp long	Simple string-matching algorithm	Number of edit operations/motif (k); percentage imperfection for the entire tract (_p); minimum repeat number (n); coding information file	One file with one sequence, no limit in length.	Pairwise alignment between the identified tract and its perfect counterpart is produced to indicate the matches, mismatches and gaps	Yes, imperfects with substitutions and indels.	Yes, linked to Primer3	Yes	User friendly GUI (graphical user interface) standalone. Download	?	Standard C language. Web server using CGI-Perl	Fast. Execution time is linear (directly proportional) to the sequence length rather than the number of repeats detected	Tells if in coding or non-coding region.	?
SciRoKo	2007	2008	44	Kofler et al 2007³²⁾	One to 6 bp long	?	Hits (identity with a virtual perfect microsatellite), number of mismatches (mm), mismatch penalty (mmP) and the length of the SSR motif (mL).	One file with multiple sequences in the fasta format.	?	Yes, imperfect and compound (indels?)		No	User friendly GUI (graphical user interface) standalone. Download	Windows. Should be platform independent, but Mac users have not been able to install it	C#	Fast	?	Depends on .NET framework
Msatcommander	2008	2011	102	Faircloth 2008³³⁾	?	Uses regular expressions	?	One file with multiple sequences in the fasta format.	Either a summary file (array detection only) or a directory at a user-selectable location	No, but accepts N	Yes, with Primer3, includes 5'-tailing	No	User friendly GUI (graphical user interface) standalone. Download	MacOS X, Windows, Unix.	Python	?	Rapid and automated microsatellite array detection, locus-specific primer design, and 5'-tailing of designed primers	?
ReRep	2008	2008	5	Otto et al 2008³⁴⁾	?	Uses self-similarity searches	?	Genome survey sequences (GSS) files, including 454-reads	?	Yes, substitutions and indels	No	No	Command-line Download	Linux	Perl	?	Can detect de novo repeats in Genome Sequence Survey sequence data	?
T-REKS	2009	?	5	Jorda and Kajava 2009³⁵⁾	No limits in repeat length	Short string extension and K-means algorithm.	delta-l (allowed % of length variability), P*sim (similarity threshold) and an option to allow or not the detection of overlapping TR.	Sequences in FASTA format.	Output with start, end, length of TR and multiple alignment of the repeats.	Yes, substitutions and indels.	No	Yes	User friendly GUI (graphical user interface) standalone. Download	Platform independent	Java	Fast. Execution time is linear (directly proportional) to the sequence length.	Can be applied to nucleic acid, amino acid or any text sequence.	?
BwTRS	2010	2009	2	Pokrzywa and Polanski 2010³⁶⁾	?	Exhaustive; uses efficient data compression algorithm	“Minimum motif size”, “Maximum motif size”, “Minimum repeat size” and “Minimum repeat ratio”.	One file with multiple sequences in the fasta format; or GenBank id. Accepts nucleotide and amino acid sequences.	List of all TR with: Start and End (position of the TR in the sequence); Motif length; Ratio (between the motif length and the consensus repeat length); the motif itself. HTML or text.	No	No	Yes, runs on a standard Tomcat servlet container without any local database	? Availability unknown, contact the authors.	?	Java	Depends on sequence length	Nucleic acid or amino acid sequence	?
QDD, QDD2	2010	2011	16	Meglecz et al 2010³⁷⁾	2 to 6 bp motifs	First step: detects all perfect TR with at least 5 repetitions as target. Later (step 3, primer design), TR of 3-4 repeats are also detected and blocks of repetitions are pooled into a complex microsatellite if the distance between them is not greater than the footprint of the longest motif of two neighbors.	Not parameterizable	One fasta file with multiple short sequences (< ~2Mb). Multiple files can be run in batch.	(a) Fasta file with sequences containing a perfect TR of at least 5 repeats. (b) Text file with sequence code and length, number of microsatellites, TR motif, first position, last position, and number of repeats. csv file with primer information.	Yes if it contains at least 5 perfect repeats. Imperfect TR will be detected as composite site.	Yes	Coming soon	Command-line. Availability unknown, contact the authors.	Linux, Windows	Perl script (with Bioperl)	?	Sorts raw sequences by tag, removes adapters, detects TR sites, redundancy, selects target microsatellites, and designs primer. Detects contamination.	?
TRStalker (in TReaDS)	2010	?	1	Pellegrini et al 2010³⁸⁾	?	?	?	?	?	Yes, imperfect and indels		Yes	? Availability unknown, contact the authors	?	?	?	?	?
Pipeline name	Year released	Latest update	Number of citations	Original publication	Repeats detected	Algorithm used	Parameters	Input file	Output file	Detect imperfect or compound loci	Design primers	Online access	Interface (GUI or command-line)	Platform	language	Speed	Special Features	Limitations
repeatfinder	2001	?	?	Volfovsky et al 2001³⁹⁾	?	Uses REPuter	?	?	?	?	?	No	Command-line Download	Linux RedHat 6.x+, Sun Solaris, and Alpha OSF1	? Open Source	?	?	?
MsatMiner	2005	?		Thurston and Field 2005⁴⁰⁾	?	Uses msatfinder for motif discovery		Different format of sequence file	More statistical analyses are possible.	Yes, imperfect and compound (indels?)	Possible when using additional scripts	?	Command-line	Unix and MacOS	Collection of perl scripts	?	?	Running scripts is possibly complicated
E-TRA	2005	2004	12	Karaca et al 2005⁴¹⁾	1 to 1000 bp repeat	Uses TRA	?	Multiple files with multiple sequences each (maximum of 1 Mb long)		Yes, compound and imperfect	Yes	No	User friendly GUI (graphical user interface) Download	Windows	C (with Microsoft Visual C)		Searches among the organisms, organs, tissue types and development stages	Only 1 Mb of input sequence length
SSRprimerII	2006	2009	29	Robinson et al. 2004⁴²⁾ and Jewell et al. 2006⁴³⁾	2 to >6 bp long repeats	Uses Sputnik		Limit of 4000 bp sequence			Yes uses Primer3	Yes	? Contact the group	?	Perl scripts	?	?	?
TRAP	2006	2005	10	Sobreira et al 2006⁴⁴⁾	1 to 2000 bp repeat	Uses TRF	All TRF parameters, as well as: min and max number of motifs; min and max motif size (period), min size of flanking regions and min match % between adjacent motifs.	One file with multiple sequences in fasta format.	Many formats (csv, HTML, flat files, and GFF)	Yes, imperfect and compound	No, but generates a fasta file with TR regions masked with Ns	no	Command-line Download	Unix; MacOSX	Perl scripts	Near-proportional to the number of TRs found by TRF	Selection, classification, quantification and automated annotation of TR sequences	Not compatible with MS Windows version of TRF
cid	2008	?	11	Freita et al 2008⁴⁵⁾	Same as MISA	Uses MISA for tandem repeat detection, and other external programs for other steps		Set of chromatograms or multiFASTA file	List of useful primers		Yes, using Primer3	?	Web environment, Availability unknown, contact the authors	?	perl and php to connect the different tools	?	Can mask vectors and adaptors regions of cloned sequences	?
Etandem (in EMBOSS)	2008	?	2119	Rice et al 2010⁴⁶⁾	?	?	?	?	?	?	?	?	?	?	?	?	?	?
STAMP	2009	2010	16	Kraemer et al 2009⁴⁷⁾	Imperfect and perfect, 1-unlimited	Based on STADEN, uses Phobos (exhaustive search) to detect all TR that match user-defined options. TROLL was previously in STADEN.	Mismatch score, indel score, minimum score, minimum length, minimum perfection, and others.	Fasta file with one or multiple sequences and no limit in sequence length, Staden experimental files.	Text files, different formats	Yes (substitutions and indels), but only those defined by user.	Yes, also for multiplexing microsatellites markers. Integrates Primer3 into GAP4	No	Plugged into the GUI offered by STADEN, Download	Mac Os X, Linux, Windows. Others upon request.	TCL, C	Execution time increases with pattern size range. Very fast in the size range 1-10 bp, slow for patterns in the size range above 20 bp	Automatic pipeline, interaction in primer design step possible.	Free only for academic users.
WebSat	2009	?	22	Martins et al 2009⁴⁸⁾	Same as TROLL	Uses TROLL	Motif length and minimum number of motifs	Individual sequences, in raw or FASTA format, or a group of sequences in a multi-FASTA format, maximum of 150 000 characters.	?	?	?	Yes Uses Ajax techniques	Command-line Download	?	Written in PHP and JavaScript, making use of Ajax techniques	?	?	?
TReaDS	2010	?	?	Renda et al 2010⁴⁹⁾	?	Includes Approximate Tandem Repeat Hunter (ATRHunter), mreps, Tandem Repeat finder (TRF), TandemSwan, TRStalker (new algorithm, only available on TReaDS)	?	?	?	?	?	Yes	? Availability unknown, contact the authors	?	?	?	?	?

Algorithms details

Sputnik

Abajian 1994 and La Rota et al. 2005⁵⁰⁾ Sputnik uses a recursive algorithm, and its performance depends on the recursion depth. Sputnik reads through the entire sequence, assumes the existence of a repeat at every position, compares subsequent nucleotides and applies a simple scoring rule. If the resulting score rises above a preset threshold, the region is written out along with its position and score. The score is determined by the length of the repeat and the number of errors. Each nucleotide that matches the value predicted (by assuming a repeat) adds to the score. Each “error” subtracts from the score. When an error is encountered, each of the three possible kinds of error (mismatch, insertion and deletion) is assumed in turn and a recursive call to the comparison routine is made for each. If the resulting score from one of these is above the cutoff threshold, it is returned and the best of the three is pursued.
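The scoring rule described above can be sketched in a few lines of Python. This is a simplified illustration, not Sputnik's code: it handles mismatches only (no recursive branching on insertions and deletions), and the function name `scan_repeats` and the default parameter values are assumptions chosen for the example.

```python
def scan_repeats(seq, period=2, match_bonus=1, mismatch_penalty=6,
                 report_score=8, fail_score=-1):
    """Mismatch-only sketch of a Sputnik-style scoring scan.

    Returns a list of (start, end, score) tuples, end exclusive.
    All default values are illustrative assumptions.
    """
    hits = []
    n = len(seq)
    i = 0
    while i + period < n:
        unit = seq[i:i + period]          # assume a repeat starts here
        score = 0
        best_score, best_end = 0, i + period
        j = i + period
        while j < n:
            # Base predicted by assuming the unit keeps repeating.
            predicted = unit[(j - i) % period]
            if seq[j] == predicted:
                score += match_bonus
            else:
                score -= mismatch_penalty
            if score < fail_score:
                break                      # too many errors, give up
            j += 1
            if score > best_score:
                best_score, best_end = score, j
        if best_score >= report_score:
            hits.append((i, best_end, best_score))
            i = best_end                   # resume after the reported region
        else:
            i += 1
    return hits
```

For instance, `scan_repeats("GG" + "CA" * 8 + "TT")` reports the CA tract as one region, and a single substitution inside a longer tract lowers the score without ending the region, as long as the running score stays above the fail score.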

Unnamed 1998

Sagot and Myers 1998⁵¹⁾ It is an exhaustive search, preceded by a heuristic pre-filtering step. It uses a general combinatorial framework of “consensus repeats” and some heuristic filtering steps to avoid an exponential increase in time complexity. The algorithm exhaustively searches for all possible consensus models whose edit distance to the real repeated units is no greater than an upper bound. The algorithm works by extending prefix models. The first step is a filter that eliminates all regions whose probability of containing a satellite is less than one in a certain threshold (?), when the percentage of sequence variation between units is 10%. The second step performs an exhaustive exploration of the space of all possible models for the repeating units present in the sequence.

TEIRESIAS

Rigoutsos et al 1998⁵²⁾ It is a heuristic variation on the TRF algorithm. It is a model-less algorithm that replaces the contiguous k-tuples of TRF with patterns of k characters that are not necessarily contiguous. The first step, a scanning phase, detects potential elementary patterns, using the heuristic that the number of equi-spaced patterns reported is larger in TR regions than elsewhere in the sequence. The second step, a convolution phase, is a verification step in which elementary patterns are combined into larger patterns and false positives are discarded using a scoring system.

TRF

Benson 1999⁵³⁾ TRF was originally designed to search for long tandem repeats. It is based on an alignment procedure and k-tuple matching. TRF uses a probabilistic algorithm with a “detection” step that identifies candidate repeats (from seeds of a minimum of 5 bp) and an “analysis” step that filters the candidates according to statistically founded criteria.

Unnamed 2001

Landau et al 2001⁵⁴⁾ The algorithm finds all approximate single repeats within a sequence. It considers two similarity criteria: the Hamming distance (k mismatches) and the edit distance (k differences). In each iteration i, the input string is divided into substrings. Each substring is searched individually for all repeats that span its centre character.
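The difference between the two similarity criteria can be made concrete with a short Python sketch. The function names are illustrative and the code is a textbook implementation of the two distances, not the search procedure of the published algorithm.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def edit_distance(a, b):
    """Minimum number of substitutions, insertions and deletions
    needed to turn a into b (classic dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion from a
                           cur[j - 1] + 1,           # insertion into a
                           prev[j - 1] + (x != y)))  # match or substitution
        prev = cur
    return prev[-1]
```

A single indel shows why the choice matters: the copies `ACGTAC` and `CGTACA` differ at every position under the Hamming model (distance 6) but are only two edits apart under the edit distance model.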

REPuter

Kurtz et al 2001⁵⁵⁾ The algorithm uses two different distance models: the Hamming distance model and the edit distance model. It then assesses the significance of a repeat by computing its E-value, i.e. the number of repeats of the same length or longer, and with the same number of errors or fewer, that one would expect to find in a random DNA sequence of the same length.

SSRIT

Temnykh et al 2001⁵⁶⁾ The script uses regular expressions to locate SSR patterns in sequence files. In a second step, it eliminates redundancy by mapping candidate sites back to the original sequences using a BLAST search.
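A regular-expression search for perfect repeats can be sketched as follows. This is a minimal Python illustration of the general technique, not the SSRIT Perl script itself; the function name and the default thresholds are assumptions made for the example.

```python
import re


def find_perfect_ssrs(seq, min_unit=2, max_unit=6, min_repeats=5):
    """Return (start, end, motif, copies) for each perfect tandem repeat.

    Coordinates are 0-based and end-exclusive.
    """
    seq = seq.upper()
    # Group 1 lazily captures the shortest unit of min_unit..max_unit bases;
    # the backreference \1 then requires at least min_repeats - 1 more copies.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1)
    )
    hits = []
    for m in pattern.finditer(seq):
        motif = m.group(1)
        copies = len(m.group(0)) // len(motif)
        hits.append((m.start(), m.end(), motif, copies))
    return hits
```

Note that with these settings a homopolymer run such as `AAAAAAAAAA` would be reported with the unit `AA`, so a real tool would need an extra filter for such cases.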

ComplexTR

Hauth and Joseph 2002⁵⁷⁾ The algorithm both locates and characterizes TR regions. Three main tasks are performed: 1) isolate a tandem repeat by determining its period and its approximate sequence location; 2) determine the pattern associated with a region period; 3) characterize the region using the pattern. As in TRF⁵⁸⁾, the algorithm analyses k-length substrings in a sequence by finding recurring distances between identical substrings. ComplexTR differs from TRF in that it does not use a statistical model to locate interesting periods, but a simple and accurate filter. In the first step, a distance array D, parallel to the sequence S, is constructed to record the distance d between identical occurrences of k-length substrings (termed words). The algorithm requires at least five identical d occurrences to be present in a run. ComplexTR uses word and distance similarities to determine significant periods within a region. The general procedure for the second step (construct the region base pattern) is to select a segment of sequence corresponding to all or a portion of a copy, to align the segment to several copies and to form a pattern using the consensus. This constitutes the region base pattern. The third step (characterize using the region pattern) uses wraparound dynamic programming to handle regular expressions in the context of TR identification. The final alignment is displayed as a series of copies, each representing a pattern occurrence in S, and a consensus copy is formed.
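The first step (the distance array D) lends itself to a small Python sketch. The code below is only an illustration of the idea under stated assumptions: the function names are hypothetical, a "run" is interpreted as consecutive positions sharing the same distance, and the later pattern-construction and wraparound alignment steps are not shown.

```python
def word_distances(seq, k=3):
    """Distance array D, parallel to seq: D[i] is the distance back to the
    previous occurrence of the word seq[i:i+k], or 0 if it is its first."""
    last_seen = {}
    dist = [0] * max(len(seq) - k + 1, 0)
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in last_seen:
            dist[i] = i - last_seen[word]
        last_seen[word] = i
    return dist


def candidate_periods(dist, min_run=5):
    """Distances d that occur at least min_run times in a row
    (assumed reading of the 'five identical d occurrences' rule)."""
    periods = set()
    run_value, run_len = 0, 0
    for d in dist:
        if d != 0 and d == run_value:
            run_len += 1
        else:
            run_value, run_len = d, 1
        if run_value != 0 and run_len >= min_run:
            periods.add(run_value)
    return periods
```

For a sequence made of four copies of `ACGT`, every word from the second copy onward lies at distance 4 from its previous occurrence, so the only candidate period returned is 4.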

TROLL

Castelo et al 2002⁵⁹⁾ and Martins et al. 2006⁶⁰⁾ The algorithm uses a dictionary approach to find tandem repeats of pre-selected motifs. It finds all occurrences of a list of patterns in a text string using the Aho–Corasick algorithm. At a mismatched character, it follows the failure links to continue the search without having to re-read characters in the text. It uses a “repeat buffer” operation to avoid missing repeats or counting the same site multiple times. Reads and writes are done in constant time.

ATRHunter

Wexler et al. 2004 or Wexler et al. 2005⁶¹⁾ The algorithm is a heuristic search originally designed to search for long tandem repeats. First, the screening phase identifies substrings that have an unusually high probability of being an approximate tandem repeat (ATR), using three similarity criteria. Second, the verification phase determines which of these candidates are in fact ATRs of the desired type.

MISA

Thiel et al. 2003⁶²⁾ The algorithm is not described.

Mreps

Kolpakov et al. 2003⁶³⁾ The algorithm follows a mixed combinatorial/heuristic paradigm. First, an exhaustive combinatorial algorithm finds all approximate tandem repeats. Those repeats are then submitted to a heuristic treatment in order to obtain a more biologically relevant representation of the repeats. This heuristic step includes trimming edges, computing the best period, merging, filtering out statistically expected repeats, and gathering the results. Instead of introducing a scoring function and a threshold score for found repeats, the user specifies a resolution parameter that determines the ‘fuzziness’ of the repeats found.

STRING

Parisi et al. 2003⁶⁴⁾ The algorithm is a dynamic programming procedure originally designed to search for long tandem repeats. The general strategy is the following: (a) instead of studying the whole sequence, only the tracts (interesting zones) considered the most promising are examined; (b) instead of studying all possible consensus words, only the words considered the most promising are examined. Detailed steps: 1) first, autoalignments are computed with a local alignment (autoalignment) procedure; 2) second, all the autoalignments obtained in phase 1 are grouped into suitable clusters, putting in the same cluster all the autoalignments whose augmented extensions are not disjoint.

W-SSRF

Sreenu et al 2003⁶⁵⁾ Scans a nucleotide sequence for the presence of perfect simple sequence repeats from motifs of 1 to 10 bp long.

IRF

Warburton et al 2004⁶⁶⁾ Algorithm not known

ExTRS

Krishnan and Tang 2004⁶⁷⁾ It is an exhaustive search. The algorithm takes a definition of tandem repeat that uses the Hamming distance measure, and provably finds all tandem repeats. The definition is parameterized by a mismatch ratio p which allows for more mismatches in longer tandem repeats (and fewer in shorter), thereby avoiding the problems of a fixed mismatch count. Additionally, the algorithm uses a filtering algorithm that prunes the resultant exhaustive set to a smaller one with fewer redundancies. The different algorithms are parallel (can independently search for tandem repeats for different pattern lengths k) and well suited for Beowulf-class clusters.
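
The idea of a mismatch ratio, as opposed to a fixed mismatch count, can be illustrated with a short Python sketch. The exact ExTRS definition is not reproduced here: comparing the region with itself shifted by one pattern length, and the names hamming and is_tandem_repeat, are assumptions made for illustration.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def is_tandem_repeat(seq, start, k, copies, p):
    """Accept seq[start:start + k*copies] as a tandem repeat of pattern
    length k if the mismatches between each copy and the next one do not
    exceed p times the number of compared positions (assumed criterion)."""
    region = seq[start:start + k * copies]
    if copies < 2 or len(region) < k * copies:
        return False
    # Compare the region with itself shifted by one pattern length.
    mismatches = hamming(region[:-k], region[k:])
    return mismatches <= p * (len(region) - k)


if __name__ == "__main__":
    seq = "ACGTACGTACGA"
    print(is_tandem_repeat(seq, 0, 4, 3, p=0.1))
    print(is_tandem_repeat(seq, 0, 4, 3, p=0.2))
```

Because the allowance is p multiplied by the compared length, a longer region tolerates more mismatches in absolute terms, which is the property the description above attributes to the ratio-based definition.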

SRF

Sharma et al. 2004⁶⁸⁾ The algorithm uses the following steps: 1) Input a DNA sequence of length n. 2) Compute the power spectrum, S(f), and the spectral average, Š, of the entire sequence. 3) Identify all peaks with S(f_i)/Š > T (the threshold, here chosen to be 4). For each frequency f_i so identified, there are potential repeats of length N_i = 1/f_i. 4) For each of these, compute P_m(j) = S(f_i)/Š in a sliding window of length m centered on position j in the sequence. Regions containing the repeat of length N_i can be identified directly as those where P_m(j) exceeds the threshold. 5) Since both the repeat length, N_i, and its location are known, an exact method (step 6) is used to identify the repeat units. 6) Consider all N_i-mers in the repeat region and identify those occurring most frequently by local alignment. This automatically makes it possible to allow for insertions and deletions to any desired level.
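
Steps 1 to 3 can be sketched in Python as follows. This is a minimal illustration and not the SRF code: summing the spectra of four base indicator sequences, restricting the frequencies to k/n for k = 1 to n/2, and the function names are assumptions; only the threshold T = 4 is taken from the description above. The discrete Fourier transform is computed naively, so the sketch is only suitable for short sequences.

```python
import cmath


def power_spectrum(seq):
    """S(k/n) for k = 1..n//2, summed over the indicator sequences of
    A, C, G and T (an assumed, commonly used DNA-to-signal mapping)."""
    n = len(seq)
    spectrum = {}
    for k in range(1, n // 2 + 1):
        total = 0.0
        for base in "ACGT":
            coeff = sum(cmath.exp(-2j * cmath.pi * k * j / n)
                        for j, ch in enumerate(seq) if ch == base)
            total += abs(coeff) ** 2
        spectrum[k] = total
    return spectrum


def repeat_periods(seq, threshold=4.0):
    """Candidate repeat lengths N_i = 1/f_i for peaks whose height
    exceeds threshold times the spectral average."""
    n = len(seq)
    spectrum = power_spectrum(seq)
    if not spectrum:
        return []
    average = sum(spectrum.values()) / len(spectrum)
    if average < 1e-9:  # flat spectrum, e.g. a homopolymer
        return []
    return [round(n / k) for k, s in spectrum.items()
            if s / average > threshold]


if __name__ == "__main__":
    print(repeat_periods("ACG" * 10))
```

The sliding-window localisation (step 4) and the identification of repeat units by local alignment (steps 5 and 6) are not covered by this sketch.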

STAR

Delgrange and Rivals 2004⁶⁹⁾ The algorithm detects all significant Approximate Tandem Repeats (ATR) of a given motif, where significance is assessed using the Minimum Description Length (MDL) criterion. MDL provides an absolute measure of the significance of an ATR independently of the motif. It evaluates how many mutations are allowed in an ATR when compared to an exact tandem repeat (ETR) of the best possible length. The STAR algorithm needs no threshold value and optimally locates ATRs of any input motif with respect to the MDL criterion.

TRA

Bilgen et al. 2004⁷⁰⁾ The algorithm is useful for expressed sequences. Two different algorithms are implemented in two modules. The first module searches for exact and compound motifs. It searches for Sn, a string of repeated units in a DNA sequence wn. S1 = w1[i1, j1] denotes the string S1 starting at the i1-th and ending at the j1-th base of the DNA sequence w1. The distance between i1 and j1 is therefore equal to m1 × r1, where m1 and r1 refer to the DNA motif length and the number of repeats, respectively, in the string S1 of each w of a fixed length. When applicable, consecutive strings in a sequence w are referred to as S1, S2, S3 ... Sn. The distance between S1 and S2 is referred to as d1, the distance between S2 and S3 as d2, and so on up to Sn and dn-1. The second module searches for inexact and compound motifs with the STRING algorithm (Parisi). Overall, the program proceeds as follows: (i) it searches for the user-defined organism(s) and/or keywords (organs, cell lines, tissue types or development stages), analyzing the whole data set provided in a data folder; (ii) it isolates simple and non-simple (compound, imperfect and extended compound) tandem repeats by determining their type, length and location in the given DNA sequences; (iii) it characterizes the repeat-containing sequences based on the user-defined parameters/options; and (iv) it displays the results according to the user’s parameters/options.

MsatFinder

Thurston and Field 2005⁷¹⁾ No published description of the algorithm could be found. According to the online access, regular expression (regex), multipass or iterative search engines can be used.

FireµSat

de Ridder et al. 2006⁷²⁾ and de Ridder et al. 2013⁷³⁾ It combines straightforward finite automaton (FA) technology with a flavour of Moore machine technology. It uses Counting Finite Automata, which are regular language acceptors. The parameters are as follows:

Max Motif Error: The number of motif errors the user wants to allow per motif (motif errors allowed: deletions, mismatches, insertions).
Max adjacent ATR elements: The number of ATREs that the user allows next to each other.
Motif Range Options: The motif range option enables the user to specify a range of motifs FireµSat should search for.
Min required TR elements: Indicates the minimum number of TREs that should occur before a TR is output – this parameter serves as a length filter.
Mismatch penalty (m_p): The penalty value allocated to a mismatch.
Delete penalty (d_p): The penalty value assigned to a deletion.
Insert penalty (i_p): The penalty value allocated to an insertion.
Max substring error: A threshold value entered by the user. The substring error should always be smaller than the Max substring error. The substring error (α) is calculated as α = (m_p × n_m) + (i_p × n_i) + (d_p × n_d), where the mismatch count (n_m) is the number of mismatches, the delete count (n_d) is the number of deletions and the insert count (n_i) is the number of insertions.
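
The substring error can be written as a small Python function. This sketch assumes that the deletion term is the product of the delete penalty and the delete count, by analogy with the mismatch and insertion terms; the function names and the default penalty of 1 are hypothetical and not taken from FireµSat.

```python
def substring_error(n_m, n_i, n_d, m_p=1, i_p=1, d_p=1):
    """Weighted error: mismatches, insertions and deletions, each
    multiplied by its penalty (default penalties are assumptions)."""
    return m_p * n_m + i_p * n_i + d_p * n_d


def within_threshold(n_m, n_i, n_d, max_substring_error, **penalties):
    """True if the substring error is strictly smaller than the
    user-supplied Max substring error."""
    return substring_error(n_m, n_i, n_d, **penalties) < max_substring_error


if __name__ == "__main__":
    # 2 mismatches, 1 insertion, 1 deletion with penalties 1, 2 and 3
    print(substring_error(2, 1, 1, m_p=1, i_p=2, d_p=3))
```

With these example values the error is 2 + 2 + 3 = 7, so the substring would pass a Max substring error of 8 but not one of 7.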

Phobos

Mayer 2010⁷⁴⁾ Uses the alignment score as an optimality criterion to decide whether to extend a satellite upstream or downstream; it is able to find alternative alignments with different units for the same satellite.

SSRscanner

Anwar and Khan 2006⁷⁵⁾ Uses a dictionary approach to find simple sequence repeats of pre-selected motifs.

TandemSWAN

Boeva et al. 2006⁷⁶⁾ The algorithm is based on the calculation of the repeat’s statistical significance, and identifies the length of the repeated unit and the number of repetitions. It considers only the repeated structures whose number of copies is an integer and greater than two. Since the algorithm does not search for tandem repeats with a period size of less than 3, it does not detect poly-A or TATA-like sequences.

OMWSA

Du et al. 2007⁷⁷⁾ The optimized moving window spectral analysis (OMWSA) is a text-free method based on digital signal processing that uses a Fourier spectrogram. The moving window spectral analysis procedure produces the spectrogram at each frequency k and location n in the location-frequency plane. Therefore, the weight vector W can be set dependent on the spectrum at frequency k and location n. Periodic components appear as highlighted regions of the spectrogram in the location-frequency plane, and the coordinates (n, k) of these regions give both the periods and the locations of DNA repeats.

IMEX

Mudunuri et al. 2007⁷⁸⁾ It is a simple string-matching algorithm that scans the entire sequence using a sliding window approach and reports the results in a single run. It is a two-step procedure: (a) identification of microsatellite nucleation sites; (b) extension of the nucleation sites on both sides. IMEx progressively scans for nucleation sites starting from the longest repeat unit. At every iteration, the repeating motif can harbor up to k point mutations (substitutions or indels of nucleotides), and the user can set k to a value between 0 and m, where m is the repeat motif size.

SciRoKo

Kofler et al. 2007⁷⁹⁾ The program contains two main modules: an SSR search module, which supports five different SSR search modes, and a module for SSR statistics, notably for mismatch frequency and compound microsatellite analysis. A nucleotide at position i is tested for identity with the nucleotide at position i + t, where t is the motif length (1–6). Upon identity, i is incremented (i = i + 1) until no further identity can be found.
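
The identity test between positions i and i + t can be sketched in Python for perfect repeats. This is only an illustration of the scan described above: the function name, the half-open coordinates and the min_length filter are assumptions, and the mismatch handling and the five search modes of SciRoKo are not reproduced.

```python
def scan_period(seq, t, min_length):
    """Return (start, end, motif) for every stretch in which
    seq[i] == seq[i + t] holds for consecutive i; end is exclusive."""
    hits = []
    n = len(seq)
    i = 0
    while i + t < n:
        start = i
        while i + t < n and seq[i] == seq[i + t]:
            i += 1  # identity found: advance i
        if i > start:
            end = i + t  # the last matched partner is at i + t - 1
            if end - start >= min_length:
                hits.append((start, end, seq[start:start + t]))
        else:
            i += 1
    return hits


if __name__ == "__main__":
    print(scan_period("GGATATATATCC", t=2, min_length=6))
```

A scan with t = 2 also reports homopolymer runs, since a period of 1 implies a period of 2; a complete tool would need to filter such cases, which this sketch does not do.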

Msatcommander

Faircloth 2008⁸⁰⁾ It uses regular expression pattern matching within each DNA sequence to locate all microsatellite arrays fitting the user-selected repeat classes. DNA sequences are first scanned in the 5'–3' orientation. The program then takes a second pass through the complement of the sequence in the 3'–5' orientation.
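
A two-pass regular expression search of this kind might look as follows in Python. The sketch is not Msatcommander's code: the function names, the min_units parameter and the "+"/"-" strand labels are hypothetical, and scanning the complemented string from left to right is one reading of “the complement of the sequence in the 3'–5' orientation”.

```python
import re

# Translation table giving the complementary base for A, C, G and T.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def scan(seq, motifs, min_units=4):
    """Return (start, end, motif) for each array of at least min_units
    consecutive copies of one of the given motifs; end is exclusive."""
    hits = []
    for motif in motifs:
        pattern = re.compile("(?:%s){%d,}" % (re.escape(motif), min_units))
        for match in pattern.finditer(seq):
            hits.append((match.start(), match.end(), motif))
    return hits


def find_arrays(seq, motifs, min_units=4):
    """First pass on the sequence as given, second pass on its
    complement; coordinates refer to the input sequence in both cases."""
    forward = [(s, e, m, "+") for s, e, m in scan(seq, motifs, min_units)]
    complement = seq.translate(COMPLEMENT)
    second = [(s, e, m, "-") for s, e, m in scan(complement, motifs, min_units)]
    return sorted(forward + second)


if __name__ == "__main__":
    print(find_arrays("TTACACACACGGTGTGTGTGAA", ["AC"]))
```

In the example, the AC array is found on the first pass and the TG array is found on the second pass, because its complement reads as AC.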

ReRep

Otto et al. 2008⁸¹⁾ It is a pipeline based on similarity searches, the interpretation of sequence landscapes (SL), the assembly of clustered sequences and in-house Perl scripts. First, all reads are compared to each other with BLAST, with a word size of 8 and an e-value cut-off of 10^-20, or with the NUCMER tool, with a word size of 11. Each result is pre-processed by joining overlapping hits and by deleting self-hits. For each read, an SL is constructed by counting how often each base of the read is part of a hit with another read. The graph is generated using external tools. The minimal length that an alignment must have to enter the analysis is defined as l, and runs with different values of l can be performed.
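
The construction of a sequence landscape from pre-computed hits can be sketched in Python. The hit format (partner read, start, end with an exclusive end) and the function names are assumptions, as is the choice of joining overlapping hits separately for each partner read; running BLAST or NUCMER is outside the scope of the sketch.

```python
from collections import defaultdict


def merge_intervals(intervals):
    """Join overlapping or touching half-open intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def sequence_landscape(read_id, read_length, hits):
    """Count, for each base of the read, how many other reads hit it.
    hits is a list of (partner_read_id, start, end) on this read."""
    by_partner = defaultdict(list)
    for partner, start, end in hits:
        if partner != read_id:  # delete self-hits
            by_partner[partner].append((start, end))
    landscape = [0] * read_length
    for intervals in by_partner.values():
        for start, end in merge_intervals(intervals):
            for pos in range(max(0, start), min(read_length, end)):
                landscape[pos] += 1
    return landscape


if __name__ == "__main__":
    hits = [("r2", 0, 4), ("r2", 2, 6), ("r3", 3, 8), ("r1", 0, 10)]
    print(sequence_landscape("r1", 10, hits))
```

Bases with high counts in the landscape are those shared with many other reads, which is the signal the pipeline interprets as repetitive sequence.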

T-REKS

Jorda and Kajava 2009⁸²⁾ Mainly designed for amino acid repeats. It is based on K-means clustering of the lengths between identical short strings (SSs) and on the idea that, in a tandem repeat region, the most frequently occurring length between identical SSs should be equal to the repeat length. Therefore, detecting regions of the analyzed sequence where certain lengths between identical SSs occur anomalously often may lead to the localization of tandem repeats. The steps are as follows: 1) short string probes and K-means clustering (the K-means algorithm is used for unsupervised classification); 2) establishment of tandem repeat lengths; 3) contiguity filtering, which results in the identification of hypothetical repeats; 4) extension and bridging of runs; 5) similarity filtering using a built-in program based on the “center-star algorithm”, CLUSTALW and MUSCLE.

BwTRS

Pokrzywa and Polanski 2010⁸³⁾ It is an efficient algorithm based on the idea of backward search with the Burrows–Wheeler Transform (BWT), a transform that originates in data compression. It lists all occurrences of exact tandem repeats in a given string of length n in O(n log n) time. It then uses the efficient string indexing structure of Ferragina and Manzini to search for the occurrences of so-called rearmost tandem repeats, which are then used to list the locations of the preferred tandem repeats, namely the maximal tandem repeats of a primitive motif.
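
Backward search on the BWT can be illustrated with a small Python sketch. It is not the BwTRS algorithm and does not reach O(n log n): the suffix array is built by naive sorting and occurrence counts are recomputed on each call. The function names and the use of "$" as a terminator are assumptions. Searching for a motif concatenated with itself returns the positions where two copies occur in tandem.

```python
def bwt(text):
    """Return the BWT of text + '$' and the suffix array used to
    build it (naive construction, for illustration only)."""
    text += "$"
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    last = "".join(text[i - 1] for i in suffix_array)
    return last, suffix_array


def backward_search(text, pattern):
    """Return the sorted start positions of pattern in text, processing
    the pattern from its last character to its first."""
    last, suffix_array = bwt(text)
    # first_row[c]: index of the first suffix starting with character c
    first_row = {}
    for i, c in enumerate(sorted(last)):
        first_row.setdefault(c, i)
    lo, hi = 0, len(last)
    for c in reversed(pattern):
        if c not in first_row:
            return []
        lo = first_row[c] + last[:lo].count(c)
        hi = first_row[c] + last[:hi].count(c)
        if lo >= hi:
            return []
    return sorted(suffix_array[lo:hi])


if __name__ == "__main__":
    motif = "AC"
    # Positions where the motif is immediately followed by a second copy
    print(backward_search("ACACACGT", motif * 2))
```

A full-text index in the style of Ferragina and Manzini would replace the repeated counting with precomputed rank tables; that optimisation is omitted here to keep the example short.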

QDD, QDD2

Meglecz et al 2010⁸⁴⁾ The algorithm takes the following steps:

Sequence cleaning and microsatellite detection (sorts sequences according to their tags, trims vector and adaptor sequences).
Sequence similarity detection (time consuming; meant to remove redundancy and sequences that are part of a repetitive region of the genome).
Iterative primer design using Primer3. The iterations (a) range from designing primers only for perfect microsatellites with no short repeats in the flanking region to allowing multiple target microsatellites, short repetitions and homopolymers in the amplified region, and (b) produce primer pairs with PCR products in different size ranges.
BLAST of selected sequences against GenBank to detect serious contamination or sample mix-ups.

¹⁾

Sharma, P. C., Grover, A., and Kahl, G. (2007). Mining microsatellites in eukaryotic genomes. Trends in Biotechnology 25, 490-498.

²⁾

Lerat, E. (2009). Identifying repeats and transposable elements in sequenced genomes: how to find your way through the dense forest of programs. Heredity 104, 520-533.

³⁾ , ⁵⁰⁾

La Rota, M., Kantety, R., Yu, J.-K., and Sorrells, M. (2005). Nonrandom distribution and frequencies of genomic and EST-derived microsatellite markers in rice, wheat, and barley. BMC Genomics 6, 23.

⁴⁾

Morgante, M., Hanafey, M., and Powell, W. (2002). Microsatellites are preferentially associated with nonrepetitive DNA in plant genomes. Nature Genetics 30, 194-200.

⁵⁾ , ⁵¹⁾

Sagot, M.-F., and Myers, E. W. (1998). Identifying Satellites and Periodic Repetitions in Biological Sequences. Journal of Computational Biology 5, 539-553.

⁶⁾ , ⁵²⁾

Rigoutsos, I., and Floratos, A. (1998). Combinatorial pattern discovery in biological sequences: The TEIRESIAS algorithm. Bioinformatics 14, 55-67.