Tandemizer Documentation

Tandem repeat sequences occur in abundance throughout the genomes of prokaryotic organisms as single nucleotide repeats (SNRs), microsatellites, and minisatellites. Microsatellites, or simple sequence repeats (SSRs), consist of repeating units of 2 to 6 bases in arrays of 10-100 bases, while minisatellites consist of repeating units of about 15 bases in arrays of hundreds to tens of thousands of bases [1]. SSRs [2,3] and microsatellite variable-number tandem-repeat (VNTR) loci have been used as species-, strain-, and population-specific phylogenetic markers, particularly in multiple-locus VNTR analysis (MLVA) [3,4].

During an initial examination of tandem arrays between pairs of bacterial genomes, it was observed that tandem array blocks (TABs: multiple, ordered tandem repeat array loci separated by identical or similar spacer regions; see Figure 1) appeared to be shared, with the number of shared blocks correlating with the degree of relatedness. We therefore designed a Java tool, Tandemizer, which uses output from the MUMmer suite exact-tandems program [5], combined with whole-genome sequences and optional GenBank files, to graphically compare TABs among two or more prokaryotic whole genomes. Tandemizer is an extensible, flexible tool that allows visualization of whole-genome arrangement among many genomes based upon TABs, with correlations to GenBank annotations for genes and other features, and drill-down to the sequence level for single-nucleotide polymorphism (SNP) analysis.

Figure 1 - Tandem Array Block (TAB)

AATAAT<250 bp spacer>CGCGCGCG<50 bp spacer>TTTTTT<836 bp spacer>CTAGCTAG

Figure 1 - Example of a Tandem Array Block (TAB), consisting of four or more tandem array repeat loci, each with 2 or more repeating units (underlined) totaling at least 6 bases in length, and separated by identical or variable-length nucleotide sequence spacer regions.
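
The TAB criteria in the Figure 1 caption can be written as a small check. The following Python sketch is illustrative only and is not Tandemizer's code: the names RepeatLocus and is_tab are hypothetical, and it assumes the 6-base minimum applies to each repeat locus individually. It ignores the spacer regions entirely.

```python
from dataclasses import dataclass


@dataclass
class RepeatLocus:
    """One tandem repeat locus: a repeating unit and its copy number (hypothetical type)."""
    unit: str
    copies: int

    @property
    def length(self) -> int:
        # Total number of bases covered by the repeat array.
        return len(self.unit) * self.copies


def is_tab(loci, min_loci=4, min_copies=2, min_length=6):
    """Return True if the ordered loci meet the Figure 1 thresholds (as read here)."""
    if len(loci) < min_loci:
        return False
    return all(l.copies >= min_copies and l.length >= min_length for l in loci)


# The four loci drawn in Figure 1: AATAAT, CGCGCGCG, TTTTTT, CTAGCTAG.
figure1 = [
    RepeatLocus("AAT", 2),
    RepeatLocus("CG", 4),
    RepeatLocus("T", 6),
    RepeatLocus("CTAG", 2),
]
print(is_tab(figure1))
```

With the Figure 1 loci the check passes; dropping any one locus leaves only three and the check fails.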


How Tandemizer Works

Tandem repeat loci are first detected with the exact-tandems program [6], which is run on the complete FASTA sequence of each genome under comparison to find repeats of 6 or more bases and writes its results to individual text files. The text files, along with the FASTA sequences, are placed in a directory, and two subdirectories are created (GenBank and Excel). Though GenBank files are not necessary, they are highly desirable to provide comparative annotations of the genomes; these files are placed in the GenBank directory.
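
To make the detection step concrete, here is a deliberately naive Python sketch of exact tandem repeat detection. It is not the exact-tandems implementation from MUMmer and would be far too slow for whole genomes; the function name find_exact_tandems and the max_unit cap are assumptions made for illustration. Only the 6-base minimum comes from the text above.

```python
def find_exact_tandems(seq, min_length=6, max_unit=10):
    """Scan left to right and report (start, unit, copies) for exact tandem repeats.

    At each position, every unit length up to max_unit is tried and the
    longest array with at least 2 copies and min_length bases is kept.
    On ties the shorter unit wins because it is tried first.
    """
    hits = []
    i = 0
    while i < len(seq):
        best = None
        for k in range(1, max_unit + 1):
            unit = seq[i:i + k]
            if len(unit) < k:
                break  # ran off the end of the sequence
            copies = 1
            while seq[i + copies * k:i + (copies + 1) * k] == unit:
                copies += 1
            total = copies * k
            if copies >= 2 and total >= min_length and (best is None or total > best[2]):
                best = (unit, copies, total)
        if best:
            hits.append((i, best[0], best[1]))
            i += best[2]  # continue after the reported array
        else:
            i += 1
    return hits


# Two Figure 1 style arrays separated by a short made-up spacer (GCAT).
print(find_exact_tandems("AATAATGCATCGCGCGCG"))
# → [(0, 'AAT', 2), (10, 'CG', 4)]
```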

Tandemizer is a Java jar file that requires JRE 1.5 or greater and uses BioJava to parse the GenBank files. It can be run in stand-alone mode or remotely on a server. The latter mode speeds up processing when comparing five or more genomes, since a large amount of memory can be allocated to the Java virtual machine using the -Xmx flag.

The program is menu-driven, and has two main functions. The initial function, New, reads the tandem text files, along with the FASTA and GenBank files, comparing tandem block patterns between each pair of genomes, and creating Excel files for each pair. The structure of the Excel file is tab-delimited, listing the details of each TAB as well as GenBank annotations for features that encompass, overlap, or are contained within each TAB (Figure 2).
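
Because one output file is written per pair of genomes, the number of files grows quadratically with the number of genomes. The short Python sketch below illustrates the pairing; it is not Tandemizer's code, and the function name genome_pairs is hypothetical. The genome labels are taken from the figures.

```python
from itertools import combinations


def genome_pairs(genomes):
    """Return every unordered pair of genomes, one pair per output file."""
    return list(combinations(genomes, 2))


pairs = genome_pairs(["SaCOL", "SaN315", "SaMu50", "SaMW2"])
for a, b in pairs:
    print(a, "vs", b)
print(len(pairs))  # → 6
```

By the same count, a 13-genome comparison such as the one in Figure 3 involves 13 x 12 / 2 = 78 pairs.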

Figure 2 - Tandemizer output file

Figure 2 - Tandemizer output file in Excel format. Each output file is a very large, multi-megabyte, tab-delimited file. One output file is created for each pair of genomes in the comparison. Fields include information about each tandem repeat within a tandem array block (TAB), as well as any GenBank annotations that are associated with the block.
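
Since the output files are plain tab-delimited text, they can also be read outside Tandemizer. The Python sketch below streams rows one at a time, which suits multi-megabyte files. The column names (genome, position, repeat_unit, copies) and the sample values are placeholders invented for this example; the actual field layout of the output file is not specified here.

```python
import csv
import io


def iter_tab_rows(handle):
    """Yield one dict per data row of a tab-delimited file with a header line."""
    yield from csv.DictReader(handle, delimiter="\t")


# In-memory stand-in for an output file; headers and values are hypothetical.
sample = io.StringIO(
    "genome\tposition\trepeat_unit\tcopies\n"
    "SaCOL\t1000\tAAT\t2\n"
    "SaMW2\t1200\tAAT\t2\n"
)
rows = list(iter_tab_rows(sample))
print(rows[0]["repeat_unit"], rows[0]["copies"])
```

For a real file, the same function can be given a handle from open(path, newline="") in place of the in-memory sample.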

Figure 3 - Tandemizer diagram of 13 Staphylococcus genomes


Figure 3 - Tandemizer diagram showing TABs of 13 Staphylococcus genomes, including 9 in-group strains of S. aureus (Sa), 2 strains of S. epidermidis, and one strain each of S. haemolyticus and S. saprophyticus. Colors correspond to the number of connections to other genomes (e.g., black TABs are unique, dark blue TABs are shared with one other genome, goldenrod TABs are shared with 6 other genomes). Very large TABs shared between two taxa (e.g., dark blue TABs in SaN315 and SaMu50) may contain many smaller TABs that connect to other taxa (smaller colored regions within the large, dark blue TABs).

Figure 4 - TAB detail view

Figure 4 - TAB detail view, selected by right-clicking on a TAB in the genome diagram. This view gives detailed information about each tandem repeat, including sequence, length, number of repeats, and position in each genome. Yellow clickable highlighting in the InterLength columns indicates sequences containing SNPs. Numbers in the overlap column are also clickable and indicate number of genes overlapping the spacer region. Green highlighting shows presence of a selected gene (mecA, selected in Figure 8 and indicated by arrows in Figure 9).

Figure 5 - GenBank detail view


Figure 5 - GenBank detail view showing GenBank record for the locus selected in the TAB detail view (Figure 4). This selection shows records for the mecA gene contained in the SaCOL block indicated by the arrow in Figure 9.

Figure 6 - SNP details of mecA region

Figure 6 - SNP details for the chosen spacer sequence in the mecA region (InterLength labeled 191 in Figure 4, corresponding to the spacer region in SaCOL near position 41,603). SNP positions are indicated by red bars. In this case, SaMW2 and SaUSA300 share a common allele (A), while the other four taxa have a C at that same position.

Figure 7 - Filtering tool

Figure 7 - The filtering tool provides many options for selecting blocks to visualize in the multi-genome view (Figure 3). Blocks can be selected based on a number of block and spacer region parameters, including filtering for blocks that contain only a specific sequence pattern.

Figure 8 - Pointing/selection tool

Figure 8 - Pointing/selection tool. Text can be searched in any or all of the above four fields in the GenBank record. In this case, the mecA gene was searched, resulting in the arrows pointing to specific blocks in Figure 9.

The second function, Open, reads the Excel files and creates a graphical display, showing each genome as a linear set of blocks corresponding to each TAB and color-coded for the number of connections to other genomes (Figure 3). The display is highly interactive, and each genome can be moved vertically to a different position or rendered invisible. The interactive mode allows the user to click on a block to see connections to similar TABs in other genomes (Figure 3), and to drill down to TAB details (Figure 4), GenBank annotations (Figure 5) and sequences to visualize SNPs (Figure 6). Menu functions allow visualization of any combination of connections, and powerful filtering (Figure 7) and pointing/selection (Figures 8, 9) options are available. The program has a number of export options, allowing the export of a subset of spacer regions within selected tandem blocks of an entire genome or specifically chosen single spacer regions.
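
The SNP drill-down amounts to comparing corresponding spacer sequences position by position. The Python sketch below shows that idea for spacers of equal length. It is not Tandemizer's code: the function name snp_columns is hypothetical, the six-base sequences are invented toy data, and real spacers of differing length would need to be aligned first.

```python
def snp_columns(spacers):
    """Map each variable position to the allele seen in every genome.

    spacers maps a genome label to its spacer sequence; all sequences
    must have the same length.
    """
    names = list(spacers)
    length = len(spacers[names[0]])
    if any(len(seq) != length for seq in spacers.values()):
        raise ValueError("spacers must have equal length; align them first")
    snps = {}
    for pos in range(length):
        alleles = {name: spacers[name][pos] for name in names}
        if len(set(alleles.values())) > 1:
            snps[pos] = alleles
    return snps


# Toy data echoing the A/C split described for Figure 6.
toy = {"SaCOL": "GATCCA", "SaMW2": "GATACA", "SaUSA300": "GATACA"}
print(snp_columns(toy))
# → {3: {'SaCOL': 'C', 'SaMW2': 'A', 'SaUSA300': 'A'}}
```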

Figure 9 - Selection of mecA gene

Figure 9 - Arrows below TABs are pointers to the mecA gene selected in Figure 8. Clicking on the TAB above the arrow in SaCOL shows connections between equivalent TABs, which also contain the mecA gene. Detailed views of this region are shown in Figures 4-6.

Conclusion

Tandemizer has been used to visually assess whole-genome rearrangements among genomes of Francisella, Staphylococcus, Streptococcus, and other bacterial taxa. It has been used to select unique marker regions for rtPCR assay development in Staphylococcus, and to correlate duplicated regions, IS element differences, and virulence-gene differences among subspecies of Francisella. Tandemizer is under active development to interface with other data showing sequence similarity, including coords files from MUMmer and SNP and indel loci files from SIDACS (a SNP and Indel Discovery And Classification System based on MUMmer and developed jointly between TGen and NAU). The program has a proposed release time-frame of mid-year 2007.