Suffix Array¶
Definition¶
Let
A suffix array will contain integers that represent the starting indexes of the all the suffixes of a given string, after the aforementioned suffixes are sorted.
As an example look at the string
After sorting these strings:
Therefore the suffix array for
As a data structure it is widely used in areas such as data compression, bioinformatics and, in general, in any area that deals with strings and string matching problems.
Construction¶
approach¶
This is the most naive approach.
Get all the suffixes and sort them using quicksort or mergesort and simultaneously retain their original indices.
Sorting uses
approach¶
Strictly speaking the following algorithm will not sort the suffixes,
but rather the cyclic shifts of a string.
However we can very easily derive an algorithm for sorting suffixes from
it:
it is enough to append an arbitrary character to the end of the string
which is smaller than any character from the string.
It is common to use the symbol $.
Then the order of the sorted cyclic shifts is equivalent to the order of
the sorted suffixes, as demonstrated here with the string
Since we are going to sort cyclic shifts, we will consider cyclic substrings.
We will use the notation
The algorithm we discuss will perform
In each iteration of the algorithm, in addition to the permutation
Let's look at an example.
Consider the string
It is worth noting that the values of
Let us now focus on the implementation of the algorithm.
We will write a function that takes a string
vector<int> sort_cyclic_shifts(string const& s) {
int n = s.size();
const int alphabet = 256;
At the beginning (in the
vector<int> p(n), c(n), cnt(max(alphabet, n), 0);
for (int i = 0; i < n; i++)
cnt[s[i]]++;
for (int i = 1; i < alphabet; i++)
cnt[i] += cnt[i-1];
for (int i = 0; i < n; i++)
p[--cnt[s[i]]] = i;
c[p[0]] = 0;
int classes = 1;
for (int i = 1; i < n; i++) {
if (s[p[i]] != s[p[i-1]])
classes++;
c[p[i]] = classes - 1;
}
Now we have to talk about the iteration step.
Let's assume we have already performed the
To do this, note that the cyclic substrings of length
This gives us a very simple solution:
sort the substrings of length
How do we quickly perform such a sorting of the pairs?
Since the elements of the pairs do not exceed
We use here the technique on which radix sort is
based: to sort the pairs we first sort them by the second element, and
then by the first element (with a stable sort, i.e. sorting without
breaking the relative order of equal elements).
However the second elements were already sorted in the previous
iteration.
Thus, in order to sort the pairs by the second elements, we just need to
subtract
So only by simple subtractions we can sort the second elements of the pairs in
The only thing left is to compute the equivalence classes
Here is the remaining implementation.
We use temporary arrays
vector<int> pn(n), cn(n);
for (int h = 0; (1 << h) < n; ++h) {
for (int i = 0; i < n; i++) {
pn[i] = p[i] - (1 << h);
if (pn[i] < 0)
pn[i] += n;
}
fill(cnt.begin(), cnt.begin() + classes, 0);
for (int i = 0; i < n; i++)
cnt[c[pn[i]]]++;
for (int i = 1; i < classes; i++)
cnt[i] += cnt[i-1];
for (int i = n-1; i >= 0; i--)
p[--cnt[c[pn[i]]]] = pn[i];
cn[p[0]] = 0;
classes = 1;
for (int i = 1; i < n; i++) {
pair<int, int> cur = {c[p[i]], c[(p[i] + (1 << h)) % n]};
pair<int, int> prev = {c[p[i-1]], c[(p[i-1] + (1 << h)) % n]};
if (cur != prev)
++classes;
cn[p[i]] = classes - 1;
}
c.swap(cn);
}
return p;
}
If it is known that the string only contains a subset of characters,
e.g. only lowercase letters, then the implementation can be optimized,
but the optimization factor would likely be insignificant, as the size
of the alphabet only matters on the first iteration. Every other
iteration depends on the number of equivalence classes, which may
quickly reach
Also note, that this algorithm only sorts the cycle shifts.
As mentioned at the beginning of this section we can generate the sorted
order of the suffixes by appending a character that is smaller than all
other characters of the string, and sorting this resulting string by
cycle shifts, e.g. by sorting the cycle shifts of $s + $$.
This will obviously give the suffix array of
vector<int> suffix_array_construction(string s) {
s += "$";
vector<int> sorted_shifts = sort_cyclic_shifts(s);
sorted_shifts.erase(sorted_shifts.begin());
return sorted_shifts;
}
Applications¶
Finding the smallest cyclic shift¶
The algorithm above sorts all cyclic shifts (without appending a character to the string), and therefore
Finding a substring in a string¶
The task is to find a string
Comparing two substrings of a string¶
We want to be able to compare two substrings of the same length of a given string
For this we construct the suffix array in
Using this information we can compare any two substring whose length is equal to a power of two in O(1): for this it is sufficient to compare the equivalence classes of both substrings. Now we want to generalize this method to substrings of arbitrary length.
Let's compare two substrings of length
Here is the implementation of the comparison.
Note that it is assumed that the function gets called with the already calculated
int compare(int i, int j, int l, int k) {
pair<int, int> a = {c[k][i], c[k][(i+l-(1 << k))%n]};
pair<int, int> b = {c[k][j], c[k][(j+l-(1 << k))%n]};
return a == b ? 0 : a < b ? -1 : 1;
}
Longest common prefix of two substrings with additional memory¶
For a given string
The method described here uses
We construct the suffix array in
Let's compute the LCP for two suffixes starting at
int lcp(int i, int j) {
int ans = 0;
for (int k = log_n; k >= 0; k--) {
if (c[k][i % n] == c[k][j % n]) {
ans += 1 << k;
i += 1 << k;
j += 1 << k;
}
}
return ans;
}
Here log_n
denotes a constant that is equal to the logarithm of
Longest common prefix of two substrings without additional memory¶
We have the same task as in the previous section.
We have compute the longest common prefix (LCP) for two suffixes of a string
Unlike the previous method this one will only use
The basis for this algorithm is the following idea:
we will compute the longest common prefix for each pair of adjacent suffixes in the sorted order.
In other words we construct an array
Thus if we have such an array
So the main task is to build this array
Let's look at two adjacent suffixes in the sorted order (order of the suffix array).
Let their starting positions be
Now we already can implement the algorithm.
We will iterate over the suffixes in order of their length. This way we can reuse the last value
vector<int> lcp_construction(string const& s, vector<int> const& p) {
int n = s.size();
vector<int> rank(n, 0);
for (int i = 0; i < n; i++)
rank[p[i]] = i;
int k = 0;
vector<int> lcp(n-1, 0);
for (int i = 0; i < n; i++) {
if (rank[i] == n - 1) {
k = 0;
continue;
}
int j = p[rank[i] + 1];
while (i + k < n && j + k < n && s[i+k] == s[j+k])
k++;
lcp[rank[i]] = k;
if (k)
k--;
}
return lcp;
}
It is easy to see, that we decrease
Number of different substrings¶
We preprocess the string
To do this, we will think about which new substrings begin at position
Because the suffixes are sorted, it is clear that the current suffix
Practice Problems¶
- Uva 760 - DNA Sequencing
- Uva 1223 - Editor
- Codechef - Tandem
- Codechef - Substrings and Repetitions
- Codechef - Entangled Strings
- Codeforces - Martian Strings
- Codeforces - Little Elephant and Strings
- SPOJ - Ada and Terramorphing
- SPOJ - Ada and Substring
- UVA - 1227 - The longest constant gene
- SPOJ - Longest Common Substring
- UVA 11512 - GATTACA
- LA 7502 - Suffixes and Palindromes
- GYM - Por Costel and the Censorship Committee
- UVA 1254 - Top 10
- UVA 12191 - File Recover
- UVA 12206 - Stammering Aliens
- Codechef - Jarvis and LCP
- LA 3943 - Liking's Letter
- UVA 11107 - Life Forms
- UVA 12974 - Exquisite Strings
- UVA 10526 - Intellectual Property
- UVA 12338 - Anti-Rhyme Pairs
- UVA 12191 - File Recover
- SPOJ - Suffix Array
- LA 4513 - Stammering Aliens
- SPOJ - LCS2
- Codeforces - Fake News (hard)
- SPOJ - Longest Commong Substring
- SPOJ - Lexicographical Substring Search
- Codeforces - Forbidden Indices
- Codeforces - Tricky and Clever Password
- LA 6856 - Circle of digits