A suffix tree is a tree data structure typically used to store a list of strings. It is also referred to as the compressed version of a trie, as, unlike a trie, each unique suffix in the list is compressed together and represented by a single node or branch in a suffix tree.
There are many ways to construct a suffix tree, but the semantics that is shared by most if not all types of suffix trees are as follows:
Before we learn about suffix trees, it is important to know the difference between a trie and a suffix tree.
In a trie, each alphabet of all the strings in the prescribed string is parsed one by one and represented by a single node. If two or more words start with the same sub-string, the identical sub-string is represented by the same chain of nodes. The chain breaks where the sub-string ends and the unique suffix begins. Subsequently, each alphabet of the remaining suffix is represented by a separate node.
The list below contains the names of seven animals:
A simple illustration of the trie for our list of animal names would look like this:
As mentioned earlier, each unique suffix in the list is compressed together in a suffix tree. An illustration of the suffix tree of the same list of animal names that we used earlier would look like this:
Typically, we do not want an implicit suffix tree, and terminating each node with any special character helps us avoiding one. In our examples, we use the
$
character.
The application of suffix trees is diverse and inter-disciplinary in nature.
In Computational Biology, suffix trees are widely used to identify the repeating structures in a DNA molecule. Similarly, it may be used to find the longest common sub-string or sub-sequence in a DNA sequence. These techniques are vital to the study of evolution and to trace similarities between organisms.
Moreover, in Forensic Science, it is crucial to make sure that DNA samples are not contaminated. Using suffix trees, analysts can verify if a given DNA sequence is contaminated or not!
Free Resources