Dart - KMP (Knuth-Morris-Pratt) Algorithm Implementation

KMP (Knuth-Morris-Pratt) is a string matching algorithm for finding the occurrences of a pattern in a string. Unlike brute force algorithm, which shifts only one character if not matching, KMP uses a smarter way for shifting by skipping unnecessary check.

It finds what is the longest possible shift to avoid wasteful comparisons. which is the largest prefix of P[0 .. j-1] that is a suffix of P[1 .. j-1] where P is the pattern. Because it will be used multiple times especially if the string is long, it's better to compute and store the largest identical prefix suffix at each index.

Computing Longest Prefix Suffix (LPS)

Before going to the algorithm, I'll give you an example. For example, the pattern is abaab. At the first character, the LPS is always 0 because suffix must always start at least from index 1 (the second character). Move to the next index and now we have ab substring. Clearly those two characters aren't the same, so the LPS is 0 too. At index 2 where the substring is aba, we can get a (the first character) as the prefix and a (the third character) as the suffix, so the LPS is a whose length is 0. At the index 4, the LPS is ab (the first two and last two characters). That's an example how to compute LPS.

The algorithm for finding longest suffix prefix is not long, but a bit tricky. If we use 0 (not 1) as the first character index i, there is a loop from i = 0 until i = m, where m is the pattern length. We have another variable j which holds the current longest prefix suffix. While i < m, if the character at j equals the character at i, it means the LPS at the index is the value of j + 1 and we can increment the value of j and i. Else, if the character at j doesn't equal the character at i and j > 0, update the value of j to use the LPS at j - 1. We need to handle the case where j equals 0. If that happens, it means the LPS at the index is 0 and we can move to the next index.

  List computeLPS(String pattern) {
    List lps = new List(pattern.length);
    lps[0] = 0;
    int m = pattern.length;
    int j = 0;
    int i = 1;
  
    while (i < m) {
      if (pattern[j] == pattern[i]) {
        lps[i] = j + 1;
        i++;
        j++;
      } else if (j > 0) {
        j = lps[j-1];
      } else { // no match
        lps[i] = 0;
        i++;
      }
    }
  
    return lps;
  }

And below is the implementation of KMP algorithm. The computeLPS function is called first before the iteration.

  List<int> kmp(String text, String pattern) {
    List<int> foundIndexes = new List<int>();
    int n = text.length;
    int m = pattern.length;
  
    int i = 0;
    int j = 0;
    List lps = computeLPS(pattern);
  
    while (i < n) {
      if (pattern[j] == text[i]) {
        i++;
        j++;
      }

      if (j == m) {
        foundIndexes.add(i - m); // Match
        print(j);
        j = lps[j - 1];
        i += m - 1;
      } else if (i < n && pattern[j] != text[i]) {
        if (j != 0) {
          j = lps[j - 1];
        } else {
          i = i + 1;
        }
      }
    }
  
    return foundIndexes;
  }
  
  main() {
    print(kmp('ABCABCDEFGABC', 'ABC')); // Output: [0, 3, 10]
    print(kmp('AAAAAAAAAAAAAAABBCCDDAAA', 'AAAA')); // Output: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
    print(kmp('AAAABBCCDDAAA', 'ABAAB')); // Output: []
  }

This Knuth-Morris-Pratt algorithm implementation in Dart returns the list of indexes where the pattern found. There is a loop as many as the text length. Variable i stores the current text index, while variable j stores the current pattern index. m is the pattern length, while n is the text length. At every index, we compare the value of pattern at index j and text at index i. If the two characters are the same, the potential to get match is still alive and we increment the value of i and j by one. If it has reached the last pattern character, it means we found a match. If the two characters are different and j > 0, assign j with the value of LPS at j - 1. It means we don't need to compare from the beginning. If j is 0, just increment the i value to check the next text character.

As you can see on the output, for the second example, it returns the presence of pattern AAAA at every index the pattern matches. It means there are some overlaps in the result. If you don't want to get overlapped result, you can add i += m - 1; when a match is found.