Mining genomic data for tandem repeats

Abstract

Microsatellites are short sequences repeated in tandem in a genome, where the unit is generally from 2 to 6 base pairs (bp). These genomic regions tend to rapidly accumulate mutations in the form of duplication and deletion of repeat units, due to specific molecular mechanisms such as replication slippage, unequal crossing-over, unequal sister chromatide exchange and slipped-strand mis-pairing. Subsequent point mutation will also occur during DNA replication, repair, or recombination. The elevated length polymorphism (in base pairs) level makes the microsatellites appropriate markers for estimating within-population genetic diversity, which is an important indicator in biodiversity science.

Traditionally, microsatellites loci discovery involved long and costly molecular biology experiments using magnetic beads for enrichment in specific repeats, followed by cloning, clone screening and sequencing. The level of polymorphism in the discovered loci is then assessed by PCR amplification from several individual of a population using primers specific for conserved regions flanking the microsatellite loci, followed by length estimation in a capillary-based or acrylamide gel electrophoresis instrument. Depending on the flanking regions mutation level, the primers may amplify successfully in genomes of related species of the same genus, or even in more distantly related genera.

With the advent of genome sequencing projects, many bioinformatics tools have been developed for mining available sequence data for presence of microsatellite loci. These tools have methodological bias, and mining one set of sequences with different algorithms commonly results in widely different results (detected loci). In all case, they represent a useful, fast and low-cost alternative for discovering novel microsatellites loci in a genome. A few of these bioinformatics tools are presented here.

Table 1 Non-exhaustive list of bioinformatic algorithms and pipelines available for detecting tandem repeat sites from genomic sequence data; updated from Sharma et al 20071) and Lerat 20092). WARNING: The table is still in construction, any inaccuracy can be reported to the QCBS team.

Name Year released Latest update Number of citations Original publication Repeats detected Algorithm Parameters Input file Output file Detect imperfect or compound loci Design primers Online access Interface (GUI or command-line) Platform language Speed Special Features Limitations
Sputnik and Sputnik II 1994 2005 124 Abajian, 1994, with modifications in La Rota et al. 20053) 2 to 5 bp, or 1 to 5 bp in Morgante et al 20024) Recursive Match bonus and mismatch penalty, the validation score, the fail score, the maximum number of recursions, the minimum percentage of perfection, and the period size. One file with one sequence Resulting hits are written to stdout along with their position in the sequence, length, and score Yes, Insertions, mismatches and deletions and compound ? ? ? Download, SputnikII modified in 2005 Download Windows C Execution time is dependent on the sequence composition Automated statistical analysis files not generated.
Not named 1998 ? 37 Sagot and Myers 19985) Detect repeats of fixed length, of 2 to 45 bp long per unit. Exhaustive Minimal size of a repeat (min_repeat), minimum number of repeats within a TR (max-range), ? (max-jump) ? ? Yes, substitutions and indels No No ? ? ? Execution time increases with the number of TR detected Can be adapted to finding TR in protein sequences; and to identifying mixed direct-inverse TR. Does not detect duplicated genes
TEIRESIAS 1998 ? 502 Rigoustos et al 19986) Detects nested TR Heuristic L (length of interrogated sequence) and W (length of one unit) and threshold K (length of the unit going to verification step) One file with multiple sequences. ? Yes, only substitutions (not indels) no No ? Available upon request to the authors ? ? Execution time increases quasi-linearly with length of units (W) and length of interrogated sequence (L) Motifs are guaranteed to be maximal in length and composition. Can identify duplicated genes. Can search in amino acid sequences. ?
TRF 1999 2004 1487 Benson 19997) >6 bp repeats Heuristic; based on the alignment procedure and k-tuples. a) alignment weights for match, mismatch and indels; b) pM and pI; c) minimum size for patterns to report; d) minimum alignment score (threshold). One file with one sequence A summary table file with location and statistical properties of the TR found. The other file is alignment of each repeat with its consensus sequence Yes, imperfect and compound Yes ? Download platform independent Source code not available Slow. Execution time is exponential to the number of repeats detected Able to detect TR in a wide range of size and copy number Does not accept sequences longer than 5 Mb; Automated statistical analysis files not generated; output files are difficult to manage.
Not named 2001 ? 94 Landau et al 20018) ? Exhaustive; uses Hamming distance (k mismatches) and edit distance (k differences) ? ? ? Yes, substitutions (but not indels) No ? Available upon request to the authors ? ? ? ? ?
REPuter 2001 2005 341 Kurtz et al 20019) Detect repeats of fixed length. Repeats need not to be in tandem Heuristic; uses the Hamming distance model and the edit distance model sequencetype, smap, direct, palindromic, length, hamming, edit, best, content, identity, evalue (error threshold k ≥ 0 and a length threshold l > 0 are given) Limit of 5 Mb in the online version, DNA or protein Statistical and graphical analysis Yes, imperfect including substitutions and compound Yes Command-line, Download Linux, OSX, Solaris, Irix and Alpha ? ? Nucleic acid or amino acid sequence; combine linear time efficiency with exhaustive analysis Is not primarily designed for microsatellites
SSRIT (and CUGIssr) 2001 ? 666 Temnykh et al 200110) 2 to >6 bp long repeats Uses regular expressions and similarity searches ? Multiple files with one sequence in a fasta format. Limit of 1 Mb per sequence analyzed. Reports the GenBank ID, SSR motif, number of repeats, sequence coordinates for each SSR and GC% in DNA sequences (up to 500 bp in length) immediately adjoining SSR No, perfect only Yes, using Primer 0.5 Yes Command-line Download platform independent Perl script ? ? Automated statistical analysis files not generated;
ComplexTR 2002 2005 33 Hauth and Joseph 200211) Long repeats, variable length TR (VLTRs) and multiperiod TR (MPTRs) Seed-extension technique; analyses k-length substrings ? One file with one sequence HTML page with detailed characterization of each TR region The web access is not functioning Command-line Download ? C, perl. Tandem repeat identification code (C++); Web Interface via CGI scripts (perl) (Tcp/Tk script) ? ? ?
ATRHunter 2004 Not updated on a regular basis. 47 Wexler et al 2004 or Wexler et al 200514) ? Heuristic 1) Alignment parameters (match, mismatch, gap, terminal gap) 2) Maximum motif length 3) Similarity level between adjacent copies 4) Average similarity level between adjacent copies 5) Minimum alignment score with a repeating copy. One sequence, no limit in length. ? Yes no Yes Command-line Download ? ? ? ? ?
MISA 2003 2010 468 Thiel et al 200315) One to six bp motifs ? Unit sizes; lower threshold of repeats for that specific unit; maximal number of bases between two adjacent microsatellites to be recognised as being a compound microsatellite Individual or group of sequences in fasta format Two files: one table file with localization and type of identified microsatellite(s); one file with statistics (e.g. frequency of a specific microsatellite type) Yes, compound and imperfect (but no indels) (other source mention it recognizes only perfect) Yes, with Primer3 No Command-line Download Platform independent perl script ? ? Detection of interrupted TR not efficient; inappropriate classification of different motifs; automated statistical analysis files not generated.
Mreps 2003 ? 156 Kolpakov et al 200316) Sites shorter than period + 9 are automatically discarded. Mixed combinatorial/heuristic paradigm Start and end positions of the region to be processed; length interval, period interval, minimal exponent of the repetitions to report; resolution level, use or not of the sliding window. One file with multiple sequences in the fasta format. list of all repeats, with start and end positions of the repeat in the sequence, overall size of the repeat, period, exponent, error level, the repeat sequence itself Yes, imperfect and compound (not indels) Yes Yes (or here) Command-line Download Linux, SunOS, Digital Unix and Windows systems. Online? ANSI C Should be fast. linear in the sequence length ? Automated statistical analysis files not generated;
STRING 2003 2003 32 Parisi et al 200317) No limit in repeat length Heuristic; dynamic programming procedure. ? One file with one sequence, no limit in length. Length of the consensus word; first and last position of the TR; number of repeated units; score; consensus word; flanking sequences; alignment between the model TR and the given sequence; number of indels; number of matches and mismatches; TR base composition percentages; flag indicating a likely nested expansion. Yes, imperfect and compound (and indels?) No but may be possible A web interface is mentioned in the original publication, but cannot be found Command-line Download Unix, Windows, MacOS C Slowly increases as a function of the sequence length, while it increases more quickly as a function of the number of TRs. ? Automated statistical analysis files not generated.
W-SSRF 2003 ? 8 Sreenu et al 200318) 1 to 10 bp long Scans a nucleotide sequence ? One file with one sequence. Upload limit is a 20 kb file. Sequence content of the motif, repeat numbers, start and end position of the tract in the sequence No, perfect only Yes, using Autoprimer (in MICAS) No The user friendly GUI (graphical user interface) is MICAS, Available upon request to the authors MICAS is web-only Java ? ? ?
IRF program 2004 2007 80 Warburton et al 200419) ? ? ? ? ? ? Command-line Download ? ? ? ? ?
ExTRS 2004 ? 33 Krishnan and Tang 200420) ? Exhaustive ? One file with one sequence, no limit in length. Redundancy in the output is reduced Yes, substitutions (not indels) ? ? Available upon request to the authors ? Source code available only on request Near-proportional to the number of TR found ? ?
SRF 2004 2004 55 Sharma et al. 200421) 2-300 bp (default is 2-10 bp) Spectral technique Minimum repeat length; Maximum repeat length; Minimum % Match; FFT peak cut-off; (optional) Minimum number of copies. One file with one sequence, no limit in length. In fasta or genbank format. Information about the repeat unit, consensus pattern, region, copy number and score; as well as Fourier spectrum and the detailed analysis of any particular repeat unit. Yes, imperfect and indels No Yes Available upon request to the author Linux and Mac Perl Execution time increases as a function of repeat/sequence length Repeats can be tandem, dispersed and/or imperfect ?
STAR 2004 ? 52 Delgrange and Rivals 200422) Only searches for predefined motifs >2 bp, but allows approximate repeats. Exhaustive. As searching for microsatellites is a special cases of approximate TR, it can be done globally in a single run using the MotifFile parameter. Contact the authors for more informations. File with one motif per line, and then each motif is searched independently. Position offset One file with one sequence ? Yes, substitutions and indels As a service on an online platform Command-line Download Linux, SunOS, Mac OSX, and Windows systems. Source code not available Slow. The overall time complexity needed by STAR to find all ATR of a given motif of length p, in a sequence of length n, is O(np+n log n) Does not detect TR that would appear in a random sequence (false positives)
TRA 2004 2004 21 Bilgen et al 200423) ? Heuristic ? Multiple files with multiple sequences each (Max 1 Mb sequence) from ESTs. ? Yes, searches for exact–inexact TRs and exact–inexact compound repeats No? ? ? Download Windows C(with Microsoft Visual C++) and MatLab Run time increases linearly with the sequence length; does not increase with longer motif lengths. Designed for microsatellites, but can detect any type of TR Fast, simple and flexible.
Phobos 2006 2010 ? Mayer 201027) Perfect and imperfect TR, with a pattern size of 1 - 10 000 bp Exhaustive, uses alignment scores Mismatch score, indel score, minimum score, minimum length, minimum perfection, and others. One file in fasta format, with multiple sequence. No limit in sequence length. Text file, different formats, including gff and fasta Yes. Substitutions and indels. Not in itself, but yes as implemented in STAMP or Geneious. No User friendly GUI and easily scriptable Command-line program Download MacOSX, Linux, Windows. C (with Microsoft Visual C) Execution time increases with pattern size range. Very fast in the size range 1-10 bp, slow for patterns in the size range above 20 bp Automatic pipeline, interaction in primer design step possible. Free only for academic users.
WebSat 2009 ? 22 Martins et al 200948) Same as TROLL Uses TROLL Motif length and minimum number of motifs Individual sequences, in raw or FASTA format, or a group of sequences in a multi-FASTA format, maximum of 150 000 characters. ? ? ? Yes Uses Ajax techniques Command-line Download ? Written in PHP and JavaScript, making use of Ajax techniques ? ? ?
TReaDS 2010 ? ? Renda et al 201049) ? Includes Approximate Tandem Repeat Hunter (ATRHunter), mreps, Tandem Repeat finder (TRF), TandemSwan, TRStalker (new algorithm, only available on TReaDS) ? ? ? ? ? Yes ? Availability unknown, contact the authors ? ? ? ? ?

Algorithms details

Sputnik

Abadjian 1994 and La Rota et al. 200550) Sputnik uses a recursive algorithm and the performance depends on the recursion depth of the program. Sputnik reads through the entire sequence, assumes the existence of a repeat at every position, compares subsequent nucleotides and applies a simple scoring rule. If the resulting score rises above a preset threshold, the region along with its position and score is written out. Score is determined by the length of the repeat and the number of errors. Each nucleotide that matches the value predicted (by assuming a repeat) adds to the score. Each “error” subtracts from the score. When an error is encountered, the three possible kinds of errors (mismatch, insertion and deletion) are assumed and recursive calls to the comparison routine are made. If the resulting score from one of these is above the cutoff threshold, it is returned and the best of three pursued.

Unnamed 1998

Sagot and Myers 199851) It is an exhaustive search, but with a pre-filtering using a heuristic. It uses a general combinatorial framework of “consensus repeat” and uses of some heuristic filtering steps to avoid exponential increase in time complexity. The algorithm exhaustively searches for all possible consensus models whose edit distance to the real repeated units is no bigger than an upper bound. The algorithm works by extending prefix models. The first step is a filter that eliminates all regions whose probability of containing a satellite is less than one in a certain threshold (?), when percentage of sequence variation between units is 10%. The second part realizes an exhaustive exploration of the space of all possible models for the repeating units present in the sequence.

TEIRESIAS

Rigoustos et al 199852) Is a heuristic variation on TRF algorithm. It is a model less algorithm that replaces the TRF contiguous k-tuples with patterns of not necessarily contiguous k characters. The first scanning step detects potential elementary patterns, using the heuristics that the number of equi-spaced patterns reported is larger in TR regions than elsewhere in the sequence. The second step is a convolution phase, it is a verification step where elementary patterns are combined into larger patterns and false positives are discarded using a scoring system.

TRF

Benson 199953) It was originally designed to search for long tandem repeats. TRF is based on the alignment procedure and k-tuples. TRF uses a probabilistic algorithm that includes a “detection step” to identify the candidate repeats (seeds of miminum of 5 bp) and an “analysis” step, according to statistically-founded criteria to filter the candidate repeats.

Unnamed 2001

Landau et al 200154) The algorithm finds all approximate single repeats within a sequence. Considers two criterions of similarity: the Hamming distance (k mismatches) and the edit distance (k differences). In each iteration i, the input string is divided into substrings. Each substring is searched individually for all repeats that span its centre character.

REPuter

Kurtz et al 200155) The algorithm uses two different distance models: the Hamming distance model and the edit distance model. It then assesses the significance of a repeat by computing its E-value, i.e. the number of repeats of the same length or longer and with the same number of errors or fewer that one would expect to find in a random DNA of the same length.

SSRIT

Temnykh et al 200156) The script uses regular expressions to locate SSR patterns in sequence files. In a second step, it eliminates redundancy by locating candidate sites to original sequence using BLAST search.

ComplexTR

Hauth and Joseph 200257) The algorithm both locates and characterizes TR regions. Three main tasks are performed: 1) isolate a tandem repeat by determining its period and its approximate sequence location; 2) determine the pattern associated with a region period; 3) characterize the region using the pattern. As in TRF58), the algorithm uses k-length substrings in a sequence by finding recurring distances between identical substrings. The ComplexTR differs from TRF in that a statistical model is not used to locate interesting periods, but a simple and accurate filter. In the first step, a distance arrays D, parallel to S is constructed to record a distance d between identical occurrences of k-length substrings (termed words). The algorithm requires at least five identical d occurrences to be present in a run. ComplexTR uses word and distances similarities to determine significant periods within a region. The general procedure for the second step (construct region base pattern) is to select a segment of sequence corresponding to all or a portion of a copy, to align the segment to several copies and to for a pattern using the consensus. This constitutes the region based pattern. The third step (characterize using region pattern) uses wraparound dynamic programming to handle regular expressions un the context of TR identifications. The final alignment is displayed as a series of copies, each representing a pattern occurrence in S, and a consensus copy is formed.

TROLL

Castelo et al 200259) and Martins et al. 200660) The algorithm uses the dictionary approach to find tandem repeats of pre-selected motifs. Find all occurrences from a list of patterns in a text string (the Aho–Corasick algorithm). At a mismatch character, it uses the failure links to continue to search without having to re-sample characters in the text. It uses “repeat buffer” operation to avoid missing repeats or counting the same site multiple times. Reads and write are done in constant time

ATRHunter

Wexler et al 2004 or Wexler et al 200561) The algorithm is heuristic search originally designed to search for long tandem repeats. First, the screening phase identifies substrings that have an unusually high probability of being an ATR using three similarity criteria. Second, the verification phase determines which of these candidates are in fact ATRs of the desired type.

MISA

Thiel et al 200362) The algorithm is not described

Mreps

Kolpakov et al 200363) The algorithm is a mixed combinatorial/heuristic paradigm. First an exhaustive combinatorial algorithm finds all approximate tandem repeats. Those repeats are then submitted to a heuristic treatment in order to obtain more biologically relevant representation of the repeats. This heuristic step includes trimming edges, computing the best period and merging, filtering out statistically expected repeats, gathering the results. The user specifies a resolution parameter that determines the ‘fuzziness’ of found repeats instead of introducing a scoring function and specifying a threshold score value for found repeats.

STRING

Parisi et al 200364) The algorithm is a dynamic programming procedure originally designed to search for long tandem repeats. The general strategy is the following: (a) Instead of studying the whole sequence we examine only the tracts (interesting zones) that we consider to be the more promising ones. (b) Instead of studying all possible consensus words we examine only the words that we consider to be the more promising ones. Detailed steps: 1) First using a local alignment (autoalignment) procedure; 2) Second, by group in suitable clusters all the autoalignments obtained in phase 1, by putting in the same cluster all the autoalignments whose augmented extensions are not disjoint.

W-SSRF

Sreenu et al 200365) Scans a nucleotide sequence for the presence of perfect simple sequence repeats from motifs of 1 to 10 bp long.

IRF

Warburton et al 200466) Algorithm not known

ExTRS

Krishnan and Tang 200467) It is an exhaustive search. The algorithm takes a definition of tandem repeat that uses the Hamming distance measure, and provably finds all tandem repeats. The definition is parameterized by a mismatch ratio p which allows for more mismatches in longer tandem repeats (and fewer in shorter), thereby avoiding the problems of a fixed mismatch count. Additionally, the algorithm uses a filtering algorithm that prunes the resultant exhaustive set to a smaller one with fewer redundancies. The different algorithms are parallel (can independently search for tandem repeats for different pattern lengths k) and well suited for Beowulf-class clusters.

SRF

Sharma et al. 200468) The algorithm uses the following steps: 1) Input a DNA sequence of length n. 2) Compute the power spectrum, S( f ), and the spectral average, Š, of the entire sequence. 3) Identify all peaks with S(fi)/ Š > T (the threshold, here chosen to be 4). For each frequency fi so identified, there are potential repeats of length Ni = 1/fi. 4) For each of these, compute the Pm( j ) = S(fi)/ Š in a sliding window of length m centered on position j in the sequence. Regions containing the repeat of length Ni can be identified directly as those where Pm( j ) exceeds the threshold. 5) Since both the repeat length, Ni, and its location are known, an exact method (step 6) is used to identify the repeat units. 6) Consider all Ni-mers in the repeat region and identify those occurring most frequently by local alignment. This automatically makes it possible to allow for insertions and deletions to any desired level.

STAR

Delgrange and Rivals 200469) The algorithm detects all significant Approximate Tandem Repeats (ATR) of a given motif, where significance is assessed using the Minimum Description Length (MDL) criterion. MDL provides an absolute measure of the significance of an ATR independently of the motif. It evaluates how many mutations are allowed in an ATR when compared to an ETR of the best possible length. The STAR algorithm needs no threshold value and optimally locates ATR of any input motif with respect to the MDL criterion.

TRA

Bilgen et al 200470) Is a useful algorithm for expressed sequences. There are two different algorithms implemented in two modules. The first module searches for exact and compound motifs. It searches for Sn, a string of repeated units in a DNA sequence wn- S1 = w1 [i1, j1] symbolizes the S1 starting with the i1-th and ending with the j1-th bases of the DNA sequence w1. The distance between i1 and j1 will therefore be equal to m1 × r1 where m1 and r1 refer to a type of DNA motif length and the number of repeats in S1 string of each w of a fixed length, respectively. When applicable, strings in a sequence of w are referred to as S1, S2, S3 . . . Sn for each consecutive string in a sequence w. The distance between S1 and S2 is referred to as d1, the distance between S2 and S3 is d2 and so on till Sn, dn- . The second module that searches for inexact and compound motifs with STRING algorithm (Parisi). Overall, program goes as follows: (i) searches the user defined organism (s) and/or keywords (organs, cell lines, tissue types or development stages) analyzing the whole data set provided in a data folder; (ii) isolates simple and non-simple (compound, imperfect and extended compound) tandem repeats by determining their type, lengths, and sequence location in a given DNA strings within DNA sequences; (iii) characterizes the repeats containing sequences based on the user defined parameters/ options, and; (iv) displays the results according to the user’s parameters/options.

MsatFinder

Thurston and Field 200571) No published description of the algorithm could be found. According to the online access, regular expression (regex), multipass or iterative search engines can be used.

FireµSat

de Ridder at al 200672) and de Ridder at al 201373), Is a combination of straightforward FA technology combined with a flavour of Moore machine technology. It uses Counting Finite Automata, which are regular language acceptors. The parameters details are the following:

Phobos

Mayer 201074) Uses the alignment score as an optimality criterion to decide whether to extend a satellite up or downstream; is able to find alternative alignments with different units for the same satellite.

SSRscanner

Anwar and Khan 200675) Uses dictionary approach to find simple sequence repeats of pre-selected motifs.

TandemSWAN

Boeva et al 200676) The algorithm is based on calculation of the repeat’ statistical significance, and identifies the length of the repeated unit and the number of repetitions. It consider only the repeated structures whose number of copies is integer and greater than two. Since algorithm does not search for tandem repeats with period size less than 3, it does not detect poly-A or TATA-like sequences.

OMWSA

Du et al 200777) The optimized moving window spectral analysis uses Fourier spectrogram, is a text-free digital signal processing based method. The moving window spectral analysis procedure produces the spectrogram at each frequency k and location n in the location-frequency plane. Therefore, the weight vector W can be set dependent on the spectrum at the frequency k and location n. If it shows the periodic components as highlighted regions in the spectrogram in the location-frequency plane, then from the coordinates (n, k) of the regions, it can obtain the information of both the periods and locations of DNA repeats.

IMEX

Mudunuri et al 200778) It is a simple string-matching algorithm that scans the entire sequence using sliding window approach and reports the results in a single run. Is a two-step procedure: (a) identification of microsatellite nucleation sites (b) extension of the nucleation sites on both sides. IMEx progressively scans for nucleation sites starting from the longest repeat unit. The repeating motif at every iteration can harbor up to ‘k’ number of point mutations (substitutions or indels of nucleotides) and the user can set a value for ‘k’ between 0 and m where m = repeat motif size.

SciRoKo

Kofler et al 200779) The programs contain two main modules: an SSR search module, which supports five different SSR search modes and a module for SSR-statistics, notably for mismatch frequency and compound microsatellite analysis. A nucleotide at position i is tested for identity with the nucleotide at position i + t, where t is the motif length (1–6). Upon identity i is increased i = i + 1 until no further identity can be found.

Msatcommander

Faircloth 200880) It uses regular expression pattern matching within each DNA sequence to locate microsatellite arrays within user-selected repeat classes. Is used to locate all microsatellite repeats fitting these designations. DNA sequences are first scanned in the 5'–3' orientation. The program then takes a second pass through the complement of the sequence in the 3'–5'orientation.

ReRep

Otto et al 200881) It is a pipeline that is based on similarity searches, the interpretation of sequence landscapes (SL), the assembly of clustered sequences and in-house Perl scripts. First, all reads are compared to each other with BLAST, with a word size of 8 and an e-value cut-off of 10-20, or with NUCMER tool, with a word size of 11. Each result is pre-processed by joining overlapping hits and by deleting self-hits. For each read, an SL is constructed by counting how often each base of the read is part of a hit with another read. Generate the graph using external tools. The minimal length that an alignment must have to enter into the analysis was defined as l. Runs with different values for l can be performed.

T-REKS

Jorda and Kajava 200982) Mainly designed for amino acid repeats. Based on clustering of lengths between identical short strings by using a K-means algorithm and on the idea that in a tandem repeat region, the most frequently occurred length between identical SSs should be equal to the repeat length. Therefore, detection of regions of an analyzed sequence where certain lengths between identical SSs have anomalously high occurrence may lead to the localization of the tandem repeats. Steps are as follow: 1) Short string probes and K-means clustering (Use K-means algorithm for unsupervised classification) 2) Establishment of tandem repeat lengths 3) Contiguity filtering, results in identification of hypothetical repeats 4) Extension and bridging of runs 5) Similarity filtering using a build-in program based on “center-star algorithm”, CLUSTALW and MUSCLE.

BwTRS

Pokrzywa and Polanski 201083) It is an efficient data compression algorithm based on the idea of backward search with the Burrows–Wheeler Transform (BWT) algorithm. It allows listing all occurrences of exact tandem repeats in a given string of length n in O(n log n) time. It then uses the efficient string indexing structure by Ferragina and Manzini for searching for the occurrences of so called rearmost tandem repeats that are then used to list the locations of the desire preferred tandem repeats, namely, the maximal tandem repeats of the primitive motif.

QDD, QDD2

Meglecz et al 201084) The algorithm takes the following steps:

  1. Sequence cleaning and microsatellite detection (Sort according to sequence tags, trims vector and adaptors sequences).
  2. Sequence similarity detection (Is time consuming, is meant to remove redundancy, and sequences that are part of a repetitive region of the genome).
  3. Iterative primer design using Primer3. Iterations (a) from designing primers only for perfect microsatellites with no short repeats in the flanking region till allowing multiple target microsatellites, short repetitions and homopolymers in the amplified region (b) produce primer pairs with PCR products in different size ranges.
  4. BLASTs selected sequences to Genbank to detect serious contaminations or mixing up samples.