Experiments in program optimisation

Searching for a better indexOf

Tags: Java optimisation search

13 Apr 2019

Today we’ll be searching for strings inside other strings. We all know that this topic has been studied comprehensively, and virtually all possible algorithms have already been developed and published. We’ll deliberately not look there. Our objective is not to google the best algorithm but to see what ideas naturally come to a programmer, implement them in Java, and measure performance.

The journey in front of us is rather long. We’re going to look at twenty-two different solutions. For the impatient ones and the scared ones I suggest jumping straight to the LastByteMatcher, SuffixMatcher and LightestChainSuffixMatcher. The summary of all results is here. The other results of interest are Big random patterns and Big random patterns, huge data.

One very good matcher suggested by a reader is here.

The problem

In my Java coding practice I once had to search for certain byte sequences inside bigger byte arrays. If these were character strings, I would have used String.indexOf straight away, and there would have been no topic for this study. I had to implement indexOf for byte arrays, and, as far as I know, there is no library function for that.

There are three factors that affect search algorithms.

The first one is the length of a pattern. There are fast ways to search for a single byte, and there are (as we’ll see later) fast ways to search for a megabyte-long pattern. Different algorithms prefer different pattern lengths. The patterns I had to search were 10 – 20 bytes long.

The second one is the nature of the data. On one end of the spectrum lies searching for random byte sequences inside other random byte sequences. On the other end we have regular repeating patterns, such as searching for aaaabaa in aaaaaacaaaaaabaaaa. Somewhere in between lies data with structure of a much less pathological kind, such as natural language text. In my case, the pattern was such a text, while the array contained both natural text and binary data. Natural text search is a good enough approximation of this problem.

The third factor is possibility of pre-processing the data. This makes three types of search problems:

1) searching for the same pattern in multiple strings (pattern pre-compilation)

2) searching for multiple patterns in the same string (indexing)

3) searching for a single ad-hoc pattern in a single ad-hoc string (both indexing and pre-compilation can work if they are fast, but pre-compilation seems to have a better chance).

We’ll concentrate on the first class, although the simplest solutions cover all the classes.

Conventions

Here are some naming and formatting conventions used in this article:

The snapshots of search algorithms running will be shown in the following format:

We have almost found our pattern in the text
                        pattern

Here the top line contains the current portion and some of its neighbourhood, while the second one shows our pattern, placed right under the current portion. We’ll avoid patterns starting or ending with spaces for these demonstrations.

The test

As a text we need something with a non-trivial distribution of characters, repeating patterns and good chances to find short strings in more than one place. Some piece of poetry will do, so we’ll use The Tragedy of Hamlet, Prince of Denmark (file text in the root directory). We’ll convert it into lowercase, remove punctuation and replace line breaks with spaces. Our text is 150128 characters long. We’ll use one arbitrarily chosen verse as the base of our patterns:

doubt thou the stars are fire
doubt that the sun doth move
doubt truth to be a liar
but never doubt i love

We’ll use substrings of this pattern base of various lengths as our patterns.

Since there are very good ways to search for short patterns (such as one or two bytes), and complex algorithms are unlikely to perform well there, we’ll limit the study to substrings of length of at least four.

Our pattern base is 106 bytes long. To save space, we’ll only publish the exponential results (for patterns of lengths 4, 8, 16, 32, 64 and 106). For each of those lengths we’ll use all possible substrings of that length and average the results. This makes the full 106-byte pattern special, so the results may be a bit odd for that one. To get averaged results for long patterns, we’ll add length 96 as well.

The test will count how many times each pattern occurs in the entire text (each pattern must be found at least once). We’ll measure the time spent, in nanoseconds, per byte of the text. Since the text is short and is iterated more than once, caching effects are ignored for now. We will, however, try very long data sets later.

Our matchers will extend the abstract class Matcher with the following method:

public abstract int indexOf (byte [] text, int fromIdx);

Each class will be accompanied by a corresponding factory class – some implementations will use different classes for different patterns. The classes will contain debug code, which collects appropriate statistics; I’ll omit this code from the included snippets.
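To make the setup concrete, here is a minimal sketch of how these pieces might fit together (an assumption for illustration only: the real harness is in the repository, and the factory method name may differ):

public interface MatcherFactory
{
    Matcher createMatcher (byte [] pattern); // pattern pre-processing happens here
}

// counts the occurrences and reports nanoseconds per byte of the text
static double run (MatcherFactory factory, byte [] pattern, byte [] text)
{
    Matcher matcher = factory.createMatcher (pattern); // paid once per pattern
    int count = 0;
    long start = System.nanoTime ();
    for (int pos = matcher.indexOf (text, 0); pos >= 0;
             pos = matcher.indexOf (text, pos + 1))
    {
        ++ count; // each of our patterns must be found at least once
    }
    return (double) (System.nanoTime () - start) / text.length;
}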

The main class of the test is byteIndexof.Indexof.

We’ll run our tests on Java 1.8.142 on Xeon® CPU E5-2620 v3 @ 2.40GHz, using Linux. The full source code is available in the repository. Now we’re ready to go.

A Simple Matcher

Repository reference: SimpleMatcher.java.

This is the simplest of all implementations. We’ll use it as the reference point: everything else will be tested against this.

public int indexOf (byte[] text, int fromIdx)
{
    int pattern_len = pattern.length;
    int text_len = text.length;
    for (int i = fromIdx; i + pattern_len <= text_len; i++) {
        if (compare (text, i, pattern, pattern_len)) return i;
    }
    return -1;
}

where compare() is defined in the base Matcher class:

public static boolean compare (byte [] text, int offset,
                               byte [] pattern, int pattern_len)
{
    for (int i = 0; i < pattern_len; i++) {
        if (text [offset + i] != pattern [i]) return false;
    }
    return true;
}

Here are the times for our exponential set (nanoseconds per byte of the text):

Matcher   4     8     16    32    64    96    106
Simple    2.90  2.89  3.30  3.28  3.29  3.21  2.68

We promised some statistics. For this matcher, we’ll calculate the average inequality index, that is, the average value of i at the exit of compare():

Value             4     8     16    32    64    96    106
Inequality index  0.10  0.10  0.10  0.10  0.10  0.10  0.04

The reason this index is so much smaller for length 106 is that there the first character of the pattern is fixed (d), and this letter occurs rather infrequently (3.22%), while smaller pattern lengths involve a variety of first characters.

This index is bigger than the value of 1/26 = 0.038, which is the mean predicted value for the alphabet of 27 characters. The reason is non-uniform distribution of characters, both in the text and in the pattern.
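Where does 1/26 come from? For uniformly random text and pattern over an alphabet of A equally probable characters, the expected exit index is a geometric series:

E[i] = Σ k≥1 P(first k characters match) = Σ k≥1 (1/A)^k = 1 / (A − 1)

which for A = 27 (26 letters plus the space) gives 1/26 ≈ 0.038.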

First Byte Matcher

Repository reference: FirstByteMatcher.java.

Since 90% of compare failures happen at the first byte, it looks attractive to treat this byte specially:

public int indexOf (byte [] text, int fromIdx)
{
    int pattern_len = pattern.length;
    int text_len = text.length;
    byte a = pattern [0];

    for (int i = fromIdx; i <= text_len - pattern_len; i++) {
        if (text [i] == a && compare (text, i + 1, pattern, 1, pattern_len - 1))
            return i;
    }
    return -1;
}

This is indeed faster than the simple matcher:

Matcher    4     8     16    32    64    96    106
Simple     2.90  2.89  3.30  3.28  3.29  3.21  2.68
FirstByte  1.47  1.51  1.52  1.52  1.49  1.46  0.92

We even broke the one-nanosecond barrier, but we already know why this happened: the first character of the pattern is a rare letter d.

The statistic we collect this time is the rate at which compare() is invoked or, in other words, the frequency of the first byte matching:

Value           4     8     16    32    64    96    106
Compare() rate  8.4%  8.2%  8.4%  8.3%  8.2%  7.6%  3.2%

First Bytes Matcher

Repository reference: FirstBytesMatcher.java.

The success of special treatment of the first byte makes us think of extending this approach to multiple first bytes. This would definitely help if we wrote in C, where reading (at least, on Intel architecture) a multi-byte word is as cheap as reading one byte. In Java, we aren’t so sure, so it needs testing. We’ll use the first four bytes:

// a, b, c and d are fields holding pattern [0] to pattern [3]
public int indexOf (byte[] text, int fromIdx)
{
    for (int i = fromIdx; i <= text.length - pattern.length; i++) {
        if (text[i] == a && text[i+1] == b && text[i+2] == c && text[i+3] == d
            && compare (text, i + 4, pattern, 4, pattern.length - 4))
        {
            return i;
        }
    }
    return -1;
}

Here are the results:

Matcher     4     8     16    32    64    96    106
FirstByte   1.47  1.51  1.52  1.52  1.49  1.46  0.92
FirstBytes  1.18  1.35  1.36  1.37  1.38  1.32  0.86

There is some improvement, although it isn’t very big. We didn’t expect much anyway, because the change only addresses 8% of all cases.

We collect the compare() rate again, and it is much lower than before:

Value           8      16     32     64     96     106
compare() rate  0.09%  0.09%  0.10%  0.15%  0.16%  0.01%

First Bytes Matcher improved

Repository reference: FirstBytesMatcher2.java.

We can save a bit on reading text [i+1], text [i+2] and text [i+3] in the following way:

private int indexOf (byte[] text, int fromIdx, byte [] pattern)
{
    int text_len = text.length;
    int pattern_len = pattern.length;
    if (text_len < pattern_len) {
        return -1;
    }
    byte a = pattern [0];
    byte b = pattern [1];
    byte c = pattern [2];
    byte d = pattern [3];
    byte A = text [fromIdx];
    byte B = text [fromIdx+1];
    byte C = text [fromIdx+2];

    for (int i = fromIdx; i <= text_len - pattern_len; i++) {
        byte D = text [i+3];
        if (A == a && B == b && C == c && D == d &&
            compare (text, i+4, pattern, 4, pattern_len - 4))
        {
            return i;
        }
        A = B;
        B = C;
        C = D;
    }
    return -1;
}

Generally, I don’t like such manual improvements, because I believe that a good compiler should be clever enough to perform them for us. However, it did help a bit:

Matcher      4     8     16    32    64    96    106
FirstBytes   1.18  1.35  1.36  1.37  1.38  1.32  0.86
FirstBytes2  1.21  1.18  1.19  1.19  1.20  1.16  0.83

Compiled matcher

Repository reference: CompileMatcher.java.

Let’s try a different approach and make use of the fact that we’re writing in Java. What if we create a new class for every pattern we search for, compile it and load it? All of that must happen in the factory class. It takes time, but this time doesn’t affect our test results.

Of the possible ways to generate a class (such as producing the bytecode directly, or generating Java source and compiling it), we’ll take the simplest route of generating and compiling Java source; the generator is in the repository. Here is an example of the generated code for a small pattern (doubt th):

    public int indexOf (byte[] buf, int fromIdx)
    {
        for (int i = fromIdx; i < buf.length - 7; i++) {
            if (buf [i+0] != 100) continue;
            if (buf [i+1] != 111) continue;
            if (buf [i+2] != 117) continue;
            if (buf [i+3] != 98) continue;
            if (buf [i+4] != 116) continue;
            if (buf [i+5] != 32) continue;
            if (buf [i+6] != 116) continue;
            if (buf [i+7] != 104) continue;
            return i;
        }
        return -1;
    }

Results are, however, not better than before, so we can discard this method:

Matcher      4     8     16    32    64    96    106
FirstBytes2  1.21  1.18  1.19  1.19  1.20  1.16  0.83
Compile      1.18  1.38  1.38  1.39  1.34  1.27  0.80

This is actually quite strange. Perhaps it’s a topic for another study.

Hash matcher

Repository reference: HashMatcher.java.

For the first time we’ll try not to improve little things here and there, but rather use some sort of non-trivial algorithm. We’ll start with a hash matcher.

The idea is to apply some function to both the pattern and the current portion to obtain their hash value. This function must be additive, i.e. allow for quick addition and removal of bytes, so that we could slide the current portion along the text. We’ll only perform full comparison when the hash values match. To make things simple, we’ll use addition as a hash function:

private static int hashCode (byte[] array, int fromIdx, int length)
{
    int hash = 0;
    for (int i = fromIdx; i < fromIdx + length; i++)
        hash += array[i];
    return hash;
}

public int indexOf (byte[] text, int fromIdx)
{
    byte [] pattern = this.pattern;
    int pattern_len = pattern.length;
    int text_len = text.length;

    if (fromIdx + pattern_len > text_len) return -1;
    int pattern_hash = this.hash;
    
    int hash = hashCode (text, fromIdx, pattern_len);

    for (int i = fromIdx; ; i++) {
        if (hash == pattern_hash && compare (text, i, pattern, pattern_len))
            return i;
        if (i + pattern_len >= text_len) return -1;
        hash -= text [i];
        hash += text [i + pattern_len];
    }
}

Here are the times:

Matcher      4     8     16    32    64    96    106
FirstBytes2  1.21  1.18  1.19  1.19  1.20  1.16  0.83
Hash         2.59  2.47  2.43  2.42  2.40  2.39  2.40

The rather poor performance of this algorithm is surprising. Let’s look at the compare() rate statistics:

Value           4     8     16    32    64    96    106
Compare() rate  1.6%  1.2%  1.0%  0.8%  0.7%  0.6%  0.6%

The compare rate isn’t very low. It is much higher than for FirstBytesMatcher, and some operations (two hash updates and one comparison) happen on each iteration. Some better algorithm is required, one that helps reduce the number of iterations.

Last Byte Matcher

Repository reference: LastByteMatcher.java.

The idea of the last byte matcher is that when our pattern does not match the current portion, in most cases we can do better than shift the pattern by one. We can look at the last byte of the current portion, find it in the pattern and shift accordingly. This is how it works:

Let’s assume we are searching for the first line of our pattern: doubt thou the stars are fire, and let’s try matching it against an arbitrary portion of our text:

to be or not to be that is the question whether tis nobler in the mind to suff
doubt thou the stars are fire

The strings clearly don’t match, and the last character of the pattern (e) falls against an h. Instead of shifting the pattern by 1, which would put this h against an r, we can search our pattern right to left and find the rightmost h in it, which is the one in the word the (16 positions from the end). We can then shift our pattern by 16 to put h under h:

to be or not to be that is the question whether tis nobler in the mind to suff
                doubt thou the stars are fire

Completely coincidentally, the last e fell on h once more, so we can shift by 16 again:

ot to be that is the question whether tis nobler in the mind to suffer the sli
                      doubt thou the stars are fire

Here we got very lucky: e falls on n, which does not occur in our pattern. We can shift by the entire pattern length (29):

whether tis nobler in the mind to suffer the slings and arrows of outrageous fo
                     doubt thou the stars are fire

Note that shift by 1 can still occur, such as in this case:

to be or not to be that is the question whether tis nobler
                  doubt thou the stars are fire

What if the last byte matches, but not the rest?

to be or not to be that is the question whether tis nobler
 doubt thou the stars are fire

In this case we must find the second rightmost occurrence of the last character (in this case it’s e in are) and shift there (by 5):

to be or not to be that is the question whether tis nobler
      doubt thou the stars are fire

So here is the plan: for every possible byte we record how far from the end of the pattern its rightmost occurrence is (for the last byte of the pattern, we ignore its occurrence at the very last position). For the bytes that never occur, we record the pattern length. Then we’ll take the last byte of the current portion, look up the recorded value for that byte and shift our pattern by this number. This is how we calculate the shifts:

private final int [] shifts = new int [256];

public MatcherImpl (byte [] pattern)
{
    this.pattern = pattern;

    int len = pattern.length;
    for (int i = 0; i < 256; i++) {
        shifts [i] = len;
    }
    for (int i = 0; i < len-1; i++) {
        shifts [pattern[i] & 0xFF] = len - i - 1;
    }
}
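
As a quick check against the examples above: for the pattern doubt thou the stars are fire this gives shifts ['h'] = 16 (the rightmost h is the one in the), shifts ['e'] = 5 (the final e is skipped, so the rightmost counted e is the one in are), and shifts ['n'] = 29 (n doesn’t occur in the pattern at all).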

And this is how we search:

public int indexOf (byte[] text, int fromIndex)
{
    byte [] pattern = this.pattern;
    int pattern_len = pattern.length;
    byte last = pattern [pattern_len-1];
    int [] shifts = this.shifts;
    
    for (int pos = fromIndex; pos < text.length - pattern_len + 1;) {
        byte b = text [pos + pattern_len - 1];
        if (b == last && compare (text, pos, pattern, pattern_len - 1)) {
            return pos;
        }
        pos += shifts [b & 0xFF];
    }
    return -1;
}

Except for very short patterns, it really runs faster than everything we tried so far:

Matcher      4     8     16    32    64    96    106
FirstBytes2  1.21  1.18  1.19  1.19  1.20  1.16  0.83
LastByte     1.54  0.90  0.56  0.40  0.30  0.26  0.22

The one-nanosecond barrier has now been broken convincingly, and this time it doesn’t depend on a specially chosen first character of the pattern.

The statistics of interest for this matcher are the average shift and the compare rate:

Value           4     8     16    32    64    96    106
Avg shift       3.5   6.0   9.6   13.9  19.2  22.2  24.8
Compare() rate  8.9%  9.2%  9.1%  9.0%  8.9%  7.8%  10.1%

The compare rate is pretty much the same as for the first byte matcher (except for the last pattern; probably, because it ends with a very frequent letter e).

The average shift is quite a bit bigger than one (that’s why this matcher is so fast), but it still disappoints: sometimes it is as much as four times smaller than the pattern length. This is to be expected, though, since our alphabet is so small. The chances for a given character to be found inside a long pattern are quite high. If the text and the pattern were random byte sequences, we could be much better off with this matcher (or, rather, the same problem would require longer patterns to occur), but for our string example we need to increase the shift.

For medium-sized patterns (such as length 16), however, the shift value is satisfactory, the speed is good, and the simplicity of this matcher makes it very attractive. I ended up using this one in my production code.

Multi-byte matcher

Repository reference: MultiByteMatcher.java.

One way to increase the average shift value is to work with more than one byte. Let’s look at the example we used when introducing the LastByteMatcher:

to be or not to be that is the question whether tis nobler in the mind
doubt thou the stars are fire

We looked at the last character (h), and found it 16 positions from the end of the pattern, so we could shift the pattern by this number. The next character in the current portion (t) doesn’t work so well, since it occurs in stars and only provides the shift value of 11. The next one (space) is even worse, as it occurs just two positions to the left. The next one (s) is no better: it gives the shift value of 6. Finally, i becomes the champion: it never occurs in the pattern before this point, so the shift is 25. There is no point looking further – the shift can only get smaller.

The last byte has the potential to produce the biggest shift (the shift can never be bigger than the character’s position from the left), that’s why if we can only use one byte, the last one is the best choice. What if we use more than one? We’ll start from the right-hand side and go left while there is a chance to improve the shift.

One downside of this approach is that instead of one array of size 256 we’ll have to keep a whole collection of those – one for each position in the pattern:

private final int [][] shifts;

public MatcherImpl (byte [] pattern)
{
    this.pattern = pattern;

    int len = pattern.length;
    shifts = new int [pattern.length][256];
    for (int pos = len-1; pos >= 0; pos --) {
        for (int i = 0; i < 256; i++) {
            shifts [pos][i] = pos+1;
        }
        for (int i = 0; i < pos; i++) {
            shifts [pos][pattern[i] & 0xFF] = pos - i;
        }
    }
}

This is what matching looks like:

public int indexOf (byte[] text, int fromIndex)
{
    byte [] pattern = this.pattern;
    int pattern_len = pattern.length;
    int [][] shifts = this.shifts;
    
    for (int pos = fromIndex; pos < text.length - pattern_len + 1;) {
        if (compare (text, pos, pattern, pattern_len)) {
            return pos;
        }
        
        int shift = 0;
        for (int i = pattern_len - 1; i >= 0; i--) {
            int sh = shifts [i] [text [pos + i] & 0xFF];
            if (sh > shift) shift = sh;
            if (shift > i) {
                break;
            }
        }
        pos += shift;
    }
    return -1;
}

We collect two statistics for this matcher: the average shift and the average number of iterations needed to obtain this shift:

Value       4    8    16    32    64    96    106
Avg shift   3.7  7.3  14.6  29.1  57.6  87.2  97.3
Iterations  1.3  1.6  2.4   3.9   7.4   9.8   9.7

The shift looks very good (very close to the pattern length), but the iteration count is quite high, which can undermine the entire effort.

This seems indeed to be happening: the results aren’t very good.

Matcher    4     8     16    32    64    96    106
LastByte   1.54  0.90  0.56  0.40  0.30  0.26  0.22
MultiByte  2.59  1.90  1.32  0.95  0.69  0.54  0.45

Limited Multi-byte matcher

Repository reference: LimitedMultiByteMatcher.java.

We can limit the number of iterations in MultiByteMatcher to some small value (two or three). This will reduce the average shift, but it will also reduce the cost per position, so the overall performance may improve. We just copy the code of MultiByteMatcher and introduce a constructor parameter n for the factory and the matcher. The main loop will run to pattern_len - n instead of 0:

for (int i = pattern_len - 1; i >= pattern_len - n; i--) {
    int sh = shifts [i] [text [pos + i] & 0xFF];
    if (sh > shift) shift = sh;
    if (shift > i) {
        break;
    }
}

First, some statistics. Let’s start with the average shift:

Matcher         4    8    16    32    64    96    106
LastByte        3.5  6.0  9.6   13.9  19.2  22.2  24.8
LMultiByte (2)  3.7  7.1  12.7  20.4  30.3  36.6  39.8
LMultiByte (3)  3.7  7.3  13.9  24.1  37.6  47.0  50.5
LMultiByte (4)  3.7  7.3  14.4  26.1  42.6  54.9  59.6
MultiByte       3.7  7.3  14.6  29.1  57.6  87.2  97.3

As expected, the shift gradually increases with the maximal iteration count.

Here are the actual iteration counts used:

Matcher         4    8    16   32   64   96   106
LMultiByte (2)  1.2  1.4  1.6  1.8  1.9  1.9  1.9
LMultiByte (3)  1.3  1.6  2.0  2.4  2.7  2.7  2.7
LMultiByte (4)  1.3  1.6  2.2  2.8  3.4  3.5  3.5
MultiByte       1.3  1.6  2.4  3.9  7.4  9.8  9.7

Now let’s see if smaller iteration counts compensate for smaller shifts:

Matcher         4     8     16    32    64    96    106
LastByte        1.54  0.90  0.56  0.40  0.30  0.26  0.22
LMultiByte (2)  2.63  1.90  1.20  0.71  0.45  0.37  0.22
LMultiByte (3)  2.66  1.96  1.29  0.80  0.51  0.41  0.28
LMultiByte (4)  2.66  1.98  1.35  0.86  0.55  0.43  0.30
MultiByte       2.59  1.90  1.32  0.95  0.69  0.54  0.45

It doesn’t. The limited multi-byte matcher is still slower than the ordinary last-byte matcher.

Unrolled Limited Multi-byte matcher

Repository reference: UnrolledLimitedMultiByteMatcher.java.

We don’t give up so easily. When the number of iterations of the limited multi-byte version is known and small, we can unroll the loop and, possibly, increase performance.

This is what the unrolled routine looks like for the number of iterations of three:

public int indexOf (byte[] text, int fromIndex)
{
    byte [] pattern = this.pattern;
    int pattern_len = pattern.length;
    int [] shifts1 = shifts [pattern_len-1];
    int [] shifts2 = shifts [pattern_len-2];
    int [] shifts3 = shifts [pattern_len-3];
    byte last = pattern [pattern_len-1];
    
    for (int pos = fromIndex; pos < text.length - pattern_len + 1;) {
        byte b1 = text [pos + pattern_len-1];
        if (b1 == last && compare (text, pos, pattern, pattern_len-1)) {
            return pos;
        }
        
        int shift = shifts1 [b1 & 0xFF];
        if (shift < pattern_len) {
            byte b2 = text [pos + pattern_len-2];
            int sh = shifts2 [b2 & 0xFF];
            if (sh > shift) shift = sh;
            if (shift < pattern_len) {
                byte b3 = text [pos + pattern_len-3];
                int sh3 = shifts3 [b3 & 0xFF];
                if (sh3 > shift) shift = sh3;
            }
        }
        pos += shift;
    }
    return -1;
}

Here are the times:

Matcher                 4     8     16    32    64    96    106
LastByte                1.54  0.90  0.56  0.40  0.30  0.26  0.22
UnrolledLMultiByte (2)  2.14  1.38  0.79  0.48  0.33  0.27  0.16
UnrolledLMultiByte (3)  2.19  1.49  0.90  0.54  0.36  0.29  0.15
UnrolledLMultiByte (4)  2.31  1.57  0.97  0.59  0.38  0.30  0.11

Unrolling improves times, but not enough to outperform the LastByteMatcher, except for our pattern of length 106, which is non-representative by its nature.

Search String Suffixes

Until now, we have looked for occurrences of single bytes in our pattern; when more than one byte was used, they were searched for independently of each other. It may be more productive to search for groups of bytes, e.g. for entire suffixes of the current portion.

Let’s look at an example: we’ll search for the second line of our pattern base (which is 28 characters long) close to where it occurs in the text:

doubt thou the stars are fire doubt that the sun doth move doubt truth to be a
    doubt that the sun doth move

If we record positions of single characters only (as in LastByteMatcher), we get the shift value of 2, as this is how far the last o is found from the end of the pattern.

If we work with suffixes of length 2, we’ll search for do and find it 7 characters from the end, which is better but still far from perfect.

Three characters (" do") take us to the same place (the shift value of 7).

Finally, four characters (e do) are not found in the pattern at all. We can shift by 25, so that the next current portion starts at " do":

doubt thou the stars are fire doubt that the sun doth move doubt truth to be a
                             doubt that the sun doth move

We can do better than this if for all suffix lengths we record whether these suffixes occur in the beginning of the pattern. We know the pattern does not start with " do", so we can increase the shift to 26. Since the pattern starts with do, we can’t increase it more, and we can see that 26 is indeed the right value – it takes us straight to the match:

doubt thou the stars are fire doubt that the sun doth move doubt truth to be a
                              doubt that the sun doth move

What if no suffix is found in the beginning of the pattern? Then we can shift by the entire pattern length, as in the example from the LastByteMatcher section:

to be or not to be that is the question whether tis nobler in the mind to suff
doubt thou the stars are fire

Here h, th and " th" give the same shift (16), and none of them is found in the beginning. The suffix of length 4 (s th) isn’t found at all, which allows us to shift the pattern by its entire length of 29.

What if more than one suffix is found in the beginning of the pattern? We must be conservative and use the longest one, which will result in the smallest shift. This case seems impossible at first, but it is in fact quite real:

none my lord but that the worlds grown honest
       t that the sun d

Three suffixes are found in the beginning: t, t t, t that t, the longest one being eight characters long. This means that if we get as far as the suffix ut that t, which does not occur in the pattern, we must shift by the pattern length minus 8, which is 8:

none my lord but that the worlds grown honest
               t that the sun d

In short, here are the simple rules we can follow (assuming P is the pattern length); a brute-force sketch follows the list:

1) if the current portion matches the pattern, we’ve found an occurrence

2) otherwise, find the longest suffix of the current portion that is also a prefix of the pattern; if its length is S, shift the pattern by P − S

3) if no suffix of the current portion is a prefix of the pattern, shift by the entire P.
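
To make the rules concrete, here is a brute-force sketch that applies them literally to a failed current portion (for illustration only: none of the matchers below computes the shift this way, and this loop is quadratic):

// illustration only: the shift prescribed by the rules above;
// portion is the current portion, of the same length P as the pattern
static int shiftOnFailure (byte [] pattern, byte [] portion)
{
    int P = pattern.length;
    for (int s = P - 1; s >= 1; s--) {
        boolean is_prefix = true;
        for (int j = 0; j < s; j++) {
            if (portion [P - s + j] != pattern [j]) {
                is_prefix = false;
                break;
            }
        }
        if (is_prefix) return P - s; // longest suffix that is a prefix
    }
    return P; // no suffix of the portion is a prefix of the pattern
}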

Last Byte Suffix Matcher

Repository reference: LastByteSuffixMatcher.java.

Let’s start with a simplistic implementation of this idea, which only deals with the suffixes of the pattern.

Let’s go back to the LastByteMatcher. There we used the last byte of the current portion to calculate the shift, no matter if it matched the last byte of the pattern or not. Let’s modify this a bit. If the last byte doesn’t match, we do exactly as before. If it does match, we look how many more bytes match (find the good suffix), and apply the entire procedure described above to that suffix. The good part of it is that the suffix is also a suffix of the pattern and is therefore entirely defined by its length. We can pre-calculate whatever we want for these suffixes. Here is what we do to initialise the matcher:

private final byte [] pattern;
private final int [] shifts;
private final int [] suffix_shifts; // index is the suffix length

private int find (byte [] pattern, int suffix_len)
{
    int len = pattern.length;
    
    for (int pos = len - suffix_len - 1; pos >= 0; pos --) {
        if (compare (pattern, pos, pattern, len - suffix_len, suffix_len)) {
            return pos;
        }
    }
    return -1;
}

public MatcherImpl (byte [] pattern)
{
    this.pattern = pattern;

    int len = pattern.length;
    shifts = new int [256];
    for (int i = 0; i < 256; i++) {
        shifts [i] = len;
    }
    for (int i = 0; i < len-1; i++) {
        shifts [pattern[i] & 0xFF] = len - i - 1;
    }
    
    suffix_shifts = new int [pattern.length];
    suffix_shifts [0] = shifts [pattern [len-1] & 0xFF];
    int atzero_len = 0;
    
    for (int suffix_len = 1; suffix_len < pattern.length; suffix_len ++) {
        int pos = find (pattern, suffix_len);
        int suffix_shift;
        
        if (pos < 0) {
            suffix_shift = len - atzero_len;
        } else {
            suffix_shift = len - pos - suffix_len;
        }
        suffix_shifts [suffix_len] = suffix_shift;

        if (compare (pattern, len - suffix_len, pattern, 0, suffix_len)) {
            atzero_len = suffix_len;
        }
    }
}

And here is the matcher:

public int indexOf (byte[] text, int fromIndex)
{
    byte [] pattern = this.pattern;
    int len = pattern.length;
    byte last = pattern [len-1];
    int [] shifts = this.shifts;
    
    for (int pos = fromIndex; pos <= text.length - len;) {
        byte b = text [pos + len - 1];
        int shift;
        if (b != last) {
            shift = shifts [b & 0xFF];
        } else {
            int i = len-2;
            while (true) {
                if (text [pos + i] != pattern [i]) {
                    break;
                }
                if (i == 0) {
                    return pos;
                }
                -- i;
            }
            int suffix_len = len - i - 1;
            shift = suffix_shifts [suffix_len];
        }
        pos += shift;
    }
    return -1;
}

Here is an example of this matcher in operation:

had he been put on to have proved most royally
                ver doubt i love

Here the pattern length is 16, and the good suffix is of size 3 (ove); it doesn’t occur anywhere else in the pattern. However, this doesn’t mean we can shift by 16: a shorter suffix (ve) matches the beginning of the pattern, so the correct shift is 14:

had he been put on to have proved most royally
                              ver doubt i love

The times, however, aren’t great:

Matcher         4     8     16    32    64    96    106
LastByte        1.54  0.90  0.56  0.40  0.30  0.26  0.22
LastByteSuffix  1.56  0.91  0.57  0.40  0.30  0.25  0.20

It is easy to see why the speed hasn’t improved. Remember, in the LastByteMatcher the compare rate was 9 – 10%: this is the fraction of positions where the last byte did match, and that is the only case where the new matcher tries to make a difference. Even if we produced very high shift values in these 10% of cases, we wouldn’t have made a big difference; but we didn’t achieve even that: when the good suffix length is exactly one, we are back in LastByteMatcher territory, where shift values aren’t very big, as any single character is quite likely to be found in a long enough pattern.

To check this, let’s look at statistics: the average shift counts in general, the average shift counts for the suffixes of length one, and the average shift counts for the suffixes of length more than one (together with the same statistic from the LastByteMatcher):

Matcher              4    8    16    32    64    96    106
Avg shift, LastByte  3.5  6.0  9.6   13.9  19.2  22.2  24.8
Avg shift            3.5  6.1  9.7   14.2  19.6  22.7  24.9
Avg shift, size = 1  3.8  5.7  7.9   9.7   10.3  10.7  14.0
Avg shift, size > 1  3.9  7.5  13.7  22.9  37.2  43.4  29.9

We see that increasing the suffix size does increase the shift, but not enough; besides, the situation when it happens is rather rare.
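
A quick sanity check with the numbers above: good suffixes only come into play when the last byte matches, which happens at roughly 9% of positions (the compare rate of the LastByteMatcher). For P = 64, even if every such case produced the maximal shift of 64, the average shift could grow by at most about

0.09 × (64 − 19.2) ≈ 4,

and the actual gain (19.2 → 19.6) is much smaller than even that bound.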

Next Byte Suffix Matcher

Repository reference: NextByteSuffixMatcher.java.

There are ways to improve these solutions. Let’s look at one of them.

We can combine the LastByteSuffixMatcher with the MultiByteMatcher. For that, we’ll look both at the good suffix (just like in the previous case) and at several of the bytes that follow it, choosing the bigger of the two shifts. The simplest option is to look at exactly one such byte – the bad byte. For that, we need a full collection of 256-element shift arrays, one for each position in the pattern at which the bad byte can occur.

I’ll skip the initialisation code (a sketch of it follows the matcher code below). The matcher code looks like this:

public int indexOf (byte[] text, int fromIndex)
{
    byte [] pattern = this.pattern;
    int len = pattern.length;
    byte last = pattern [len-1];
    int [][] shifts = this.shifts;
    int [] last_shifts = shifts [len-1];
    
    for (int pos = fromIndex; pos <= text.length - len;) {
        byte b = text [pos + len - 1];
        int shift;
        if (b != last) {
            shift = last_shifts [b & 0xFF];
        } else {
            int i = len-2;
            while (true) {
                b = text [pos + i];
                if (b != pattern [i]) {
                    break;
                }
                if (i == 0) {
                    return pos;
                }
                -- i;
            }
            int suffix_len = len - i - 1;
            shift = Math.max (shifts [i][b & 0xFF],
                              suffix_shifts [suffix_len]);
        }
        pos += shift;
    }
    return -1;
}
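
For reference, here is a sketch of the skipped initialisation, under the assumption that it simply combines the per-position shift tables of MultiByteMatcher with the suffix_shifts of LastByteSuffixMatcher (the repository code may differ in details):

public MatcherImpl (byte [] pattern)
{
    this.pattern = pattern;
    int len = pattern.length;

    shifts = new int [len][256];
    for (int pos = len-1; pos >= 0; pos--) {
        for (int i = 0; i < 256; i++) {
            shifts [pos][i] = pos + 1; // byte not found to the left: jump past it
        }
        for (int i = 0; i < pos; i++) {
            shifts [pos][pattern [i] & 0xFF] = pos - i;
        }
    }
    // suffix_shifts is built exactly as in LastByteSuffixMatcher
}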

The shift count improved a bit:

Matcher                    4    8    16    32    64    96    106
Avg shift, LastByteSuffix  3.5  6.0  9.6   13.9  19.2  22.2  24.8
Avg shift, NextByteSuffix  3.5  6.1  10.0  14.8  20.9  24.3  27.2

This wasn’t, however, enough to achieve better performance. It stayed the same or even dropped a bit:

Matcher         4     8     16    32    64    96    106
LastByteSuffix  1.56  0.91  0.57  0.40  0.30  0.25  0.20
NextByteSuffix  1.70  0.98  0.60  0.42  0.31  0.26  0.21

Possible improvement: Next Suffix Matcher

Another possible way to improve this solution is to extend the set of the indexed suffixes. Currently, we only work with the direct suffixes of the pattern. We can extend it to strings of the form “a suffix of the pattern plus one arbitrary character”. This will require allocating an array of size 256 for every position in the pattern (just like what we did in the MultiByteMatcher). The shifts are quite likely to improve a lot. A big overall performance improvement is, however, highly unlikely, because these big shifts will only be produced in 10% of all the cases – when the last bytes match. The shifts in the other 90% of the cases will be exactly the same as for LastByteMatcher. That’s why we’ll skip this solution.

If we want to get good shifts, we must really go wild and index more suffixes. What if we index everything?

Suffix Matcher

Repository reference: SuffixMatcher.java.

The procedure of searching suffixes described above was rather complex; this was due to limitations on the kind of strings we indexed, both in their length and their nature.

If we can afford to index suffixes of any length, things suddenly become much simpler. There is no such thing anymore as “a suffix matches some string in the middle of the pattern”. If it does, expand it by one byte. Eventually, we’ll either find a mismatch, or reach the beginning of the pattern.

The description of our algorithm becomes very easy and straightforward:

1) walk the current portion from its last byte leftwards, descending the tree as we go

2) every time we pass a boundary node (meaning that the bytes read so far, in text order, form a prefix of the pattern), remember the current position as the shift candidate

3) stop when the tree has no matching child; the last remembered value is the shift (the full pattern length if we never passed a boundary, and zero if the entire current portion matched).

If we look at the previously considered example:

none my lord but that the worlds grown honest
       t that the sun d

There are three suffixes of the current portion that are also prefixes of the pattern: the strings t, t t, t that t of lengths 1, 3 and 8. Since the pattern length is 16, we must shift by 8:

none my lord but that the worlds grown honest
               t that the sun d

We don’t need to index all the substrings of our pattern; we only need the prefixes, and there are only P of them. We could use a hash table for that, but I feel that a search tree is more appropriate. This is what we do:

private static final class Node
{
    Node [] nodes = null;
    boolean boundary = false;
}

private final Node root;

public MatcherImpl (byte [] pattern)
{
    int pattern_len = pattern.length;
    root = new Node ();

    for (int prefix_len = 1; prefix_len <= pattern_len; prefix_len++) {
        Node node = root;
        for (int i = prefix_len - 1; i >= 0; --i) {
            int b = pattern [i] & 0xFF;
            if (node.nodes == null) {
                node.nodes = new Node [256];
            }
            if (node.nodes [b] == null) {
                node.nodes [b] = new Node ();
            }
            node = node.nodes [b];
        }
        node.boundary = true;
    }
}
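
For example, for the pattern doub this constructor inserts, right to left, the reversed prefixes d, od, uod and buod, marking the node at the end of each of them as a boundary.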

This is much simpler than initialisation of the LastByteSuffixMatcher: we don’t need to search anything, we just index what we have. We make use of a relatively small branching factor (256).

Here are the conventions:

1) the root corresponds to the empty string

2) every other node corresponds to a substring of the pattern: the bytes on the path from the root, which are read right to left during the search (they are the last bytes of the current portion)

3) node.boundary is set if this substring is a prefix of the pattern

4) node.nodes == null means this substring can’t be extended to the left within the pattern (it occurs only at the very beginning).

Note that node.nodes == null can only happen if the node.boundary flag is set. The opposite is not true: a substring that is a prefix of the pattern can be expanded left to another prefix of the pattern. We’ve seen examples of this before.

This is what the search procedure now looks like:

public int indexOf (byte[] text, int fromIndex)
{
    byte [] pattern = this.pattern;
    int len = pattern.length;
    
    for (int pos = fromIndex; pos < text.length - len + 1;) {
        Node node = root;
        int shift = len;

        for (int i = len - 1; ; --i) {
            node = node.nodes [text [pos + i] & 0xFF];
            if (node == null) {
                break;
            }
            if (node.boundary) {
                shift = i;
                if (node.nodes == null) {
                    break;
                }
            }
        }
        if (shift == 0) return pos;
        pos += shift;
    }
    return -1;
}

Both initialisation and search procedures are very simple and easy to read – much simpler, for instance, than the LastByteSuffixMatcher.

Here are some comments on this code:

1) shift starts at len: if even the last byte of the current portion isn’t found in the tree, we shift by the full pattern length

2) when we pass a boundary node, the bytes from position i to the end of the current portion form a prefix of the pattern, so shifting by i aligns this prefix with the start of the pattern; we record it and keep going, so the last recorded (smallest) value wins

3) when we reach a node with no children, no longer suffix can match, and we stop

4) a shift of zero means a boundary was reached at i == 0, that is, the entire current portion matched the pattern.

Here are the stats:

Statistic  4    8    16    32    64    96    106
Avg shift  3.9  7.9  15.9  31.9  63.9  95.8  105.9
Avg count  1.3  1.5  1.8   2.1   2.4   2.6   2.7

Both statistics look very good. We manage to shift the pattern by very close to its full length, and to establish this shift in under three iterations.

How is the speed?

Matcher   4     8     16    32    64    96    106
LastByte  1.54  0.90  0.56  0.40  0.30  0.26  0.22
Suffix    1.89  1.33  0.74  0.41  0.27  0.22  0.11

The suffix matcher is slower on short patterns and faster on long ones. The speed achieved for patterns of length 106 is very good: it is 24 times higher than that of the Simple matcher and 9 times higher than that of the FirstByteMatcher. The nice thing here is that the speed was achieved by a better algorithm alone – no intricate optimisations or dirty tricks.

Compiled versions

The last result was quite good. Can we possibly improve on it by employing the technique of generating Java code again? After all, the entire search tree is known when we create a matcher, so why not hard-code the entire tree traversal in Java?

I’ll be brief here. We tried three approaches:

1) CompileSuffixMatcher: we generate a method for every tree node; it calls other methods as it goes deeper. We rely on the JVM to optimise these methods and perform all necessary inlining. Here is an example of one of the methods (all the methods return the calculated shift):

private int match_74 (byte [] text, int pos)
{
    switch (text [pos + 6]) {
        case 32: return match_74_20 (text, pos);
        case 98: return match_74_62 (text, pos);
    }
    return 8;
}

2) CompileSuffixMatcher2: we put the entire search into one method, with a corresponding depth of nested ifs and switches. It slows down at length 64 and fails completely at 106: the code of the method becomes too big for the Java compiler.

3) CompileSuffixMatcher3: we’ve seen that the average iteration count is 2.7 for the longest pattern (length 106). It seems attractive to perform the first iterations in generated Java code, and the rest on a tree, as before. Here is an example of generated code (pattern doub):

private static int match (byte [] text, int pos)
{
    switch (text [pos + 3]) {
    case 98: 
        if (text [pos + 2] == 117) return match (text, pos + 1, node_62_75);
        return 4;
    case 100: return 3;
    case 111: 
        if (text [pos + 2] == 100) return 2;
        return 4;
    case 117: 
        if (text [pos + 2] == 111) return match (text, pos + 1, node_75_6F);
        return 4;
    }
    return 4;
}

Here are the results:

Matcher            4     8     16    32    64    96    106
Suffix             1.89  1.33  0.74  0.41  0.27  0.22  0.11
CompileSuffix      1.93  1.43  0.91  0.52  0.32  0.24  0.16
CompileSuffix2     1.76  1.59  0.99  0.55  1.31  0.90  fail
CompileSuffix3(1)  2.05  1.70  1.05  0.56  0.37  0.29  0.18
CompileSuffix3(2)  2.08  1.60  1.02  0.57  0.38  0.29  0.18
CompileSuffix3(3)  1.94  1.40  0.99  0.55  0.34  0.28  0.18
CompileSuffix3(4)  1.76  1.56  0.98  0.54  0.33  0.26  0.17
CompileSuffix3(5)  1.76  1.57  0.97  0.54  0.33  0.25  0.16
CompileSuffix3(6)  1.76  1.57  0.97  0.54  0.32  0.25  0.16
CompileSuffix3(7)  1.76  1.56  0.97  0.53  0.32  0.25  0.81
CompileSuffix3(8)  1.76  1.55  0.97  0.54  0.32  0.91  0.84

The results are not so good. The original version is the best (and, as we remember, on small patterns the LastByteMatcher is even better).

Saving memory: SmallSuffixMatcher

Repository reference: SmallSuffixMatcher.java.

The SuffixMatcher, being fast, has one big disadvantage: it allocates a 256-element array of objects (one kilobyte on JVM with compressed pointers) for each internal tree node, and the upper limit for the number of these nodes is

N = P (P − 1) / 2 + 1

where P is the pattern length. The actual values are a bit lower, but not by much: the value for our 106-byte pattern is 5353, while the limit is 5365.

This can become a problem for longer patterns, so it is attractive to reduce this number, even at the cost of some speed degradation. It may, however, even help improve the speed for very long patterns due to better caching.

Let’s implement tree nodes as different classes depending on their number of children. There is a big variety of options we could take: vectors with linear search, vectors with binary search, small hash tables and so on. For now, let’s implement a very simple solution where we define three classes of nodes:

1) EmptyNode: a terminal node with no children

2) SingleNode: a node with exactly one child

3) ArrayNode: a node with a full 256-element array of children, as before.

We could build these nodes directly, but it felt easier to build the old-style nodes first (just using maps instead of arrays) and then convert them into the proper ones. This is a matter of taste; practical solutions can do it differently.

So we rename our previous class to BuildNode:

private static final class BuildNode
{
    boolean boundary = false;
    ByteMap<BuildNode> map = null;
    
    BuildNode add (byte b)
    {
        if (map == null) {
            map = new ByteMap<> ();
        }
        BuildNode n = map.get (b);
        if (n == null) {
            n = new BuildNode ();
            map.put (b, n);
        }
        return n;
    }
    
    Node convert ()
    {
        if (map == null) {
            return terminal;
        }
        if (map.size () == 1) {
            ByteMap.Entry<BuildNode> e = map.iterator ().next ();
            return new SingleNode (boundary, e.b, e.n.convert ());
        }
        ArrayNode n = new ArrayNode (boundary);
        for (ByteMap.Entry<BuildNode> e : map) {
            n.put (e.b, e.n.convert ());
        }
        return n;
    }
}

Here ByteMap is a home-made map from byte to BuildNode (in the repository). A normal Java map could be used here; I just don’t like excessive boxing.
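
For reference, here is a sketch of what ByteMap might look like (an assumption: the repository version may differ), providing the small interface the code above relies on – get, put, size and iteration over entries with fields b and n:

// a compact byte-to-T map: parallel arrays with linear search, no boxing of keys
public final class ByteMap<T> implements Iterable<ByteMap.Entry<T>>
{
    public static final class Entry<T>
    {
        public final byte b;
        public final T n;
        Entry (byte b, T n) {this.b = b; this.n = n;}
    }

    private byte [] keys = new byte [4];
    private Object [] values = new Object [4];
    private int size = 0;

    @SuppressWarnings ("unchecked")
    public T get (byte b)
    {
        for (int i = 0; i < size; i++) {
            if (keys [i] == b) return (T) values [i];
        }
        return null;
    }

    public void put (byte b, T n)
    {
        for (int i = 0; i < size; i++) {
            if (keys [i] == b) {values [i] = n; return;}
        }
        if (size == keys.length) {
            keys = java.util.Arrays.copyOf (keys, size * 2);
            values = java.util.Arrays.copyOf (values, size * 2);
        }
        keys [size] = b;
        values [size] = n;
        ++ size;
    }

    public int size () {return size;}

    @Override
    @SuppressWarnings ("unchecked")
    public java.util.Iterator<Entry<T>> iterator ()
    {
        java.util.List<Entry<T>> list = new java.util.ArrayList<> ();
        for (int i = 0; i < size; i++) {
            list.add (new Entry<> (keys [i], (T) values [i]));
        }
        return list.iterator ();
    }
}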

The new Node class and its subclasses look like this:

static final EmptyNode terminal = new EmptyNode ();

private static abstract class Node
{
    public final boolean boundary;
    
    Node (boolean boundary) {
        this.boundary = boundary;
    }
    
    abstract Node get (byte b);

    boolean terminal () { return false;}
}

private static class EmptyNode extends Node
{
    EmptyNode () {super (true);}
    @Override Node get (byte b)   {return null;}
    @Override boolean terminal () {return true;}
}

private static final class SingleNode extends Node
{
    private final int b;
    private final Node node;
    
    SingleNode (boolean boundary, int b, Node node)
    {
        super (boundary);
        this.b = b;
        this.node = node;
    }
    
    @Override
    Node get (byte b)
    {
        return b == this.b ? node : null;
    }
}

private static final class ArrayNode extends Node
{
    private final Node [] nodes = new Node [256];
    
    ArrayNode (boolean boundary)
    {
        super (boundary);
    }

    void put (byte b, Node node)
    {
        nodes [b & 0xFF] = node;
    }
    
    @Override
    Node get (byte b)
    {
        return nodes [b & 0xFF];
    }
}

There is no information inside a terminal node. That’s why we can allocate exactly one of those and re-use it (previously we couldn’t do it, because a terminal node could become internal as we built the tree).

The search procedure changes, too:

for (int pos = fromIndex; pos < text.length - len + 1;) {
    Node node = root;
    int shift = len;
    int i;

    for (i = len - 1; ; --i) {
        node = node.get (text [pos + i]);
        if (node == null) {
            break;
        }
        if (node.boundary) {
            shift = i;
            if (node.terminal ()) {
                break;
            }
        }
    }
    if (shift == 0) return pos;
    pos += shift;
}
return -1;

Obviously, so many virtual calls should affect performance, and they did:

Matcher      4     8     16    32    64    96    106
LastByte     1.54  0.90  0.56  0.40  0.30  0.26  0.22
Suffix       1.89  1.33  0.74  0.41  0.27  0.22  0.11
SmallSuffix  2.09  1.86  1.10  0.61  0.35  0.25  0.10

Results are still better than those of LastByteMatcher for long patterns, but the break-even pattern length is now higher (96 rather than 32).

The number of allocated arrays improved dramatically: the pattern of length 106 produced 4335 instances of SingleNode and 49 of ArrayNode. Of these, 33 had two children, 7 had three and 4 had four. It is unclear if these numbers justify special treatment for nodes with very small branch factors.

Possible other options

We know that for patterns of length 106 the average iteration count is 2.7. We could keep the first three levels of our tree as ArrayNodes, and implement the rest using some space-saving technique. We could even hard-code this in our search routine, avoiding virtual calls at these levels. This is very speculative, as it might not work with data of a different nature. And, in any case, it won’t run faster than the SuffixMatcher. After all, the purpose of this exercise is to make a matcher close to SuffixMatcher in performance, but less memory-hungry. However, there is one option that has the potential to be faster as well.

Chain matcher

Repository reference: ChainSuffixMatcher.java.

If we dump a typical search tree now, we’ll see lots of nodes of type SingleNode whose children are again SingleNodes, and so on. These SingleNodes form a chain, and the search engine traverses this chain node by node. How about modelling such a chain explicitly as a special type of node? This requires a change in the interface of Node: a node must now match itself against a byte sequence instead of a single byte:

abstract int match (byte [] text, int pos, int i, int shift);

Each node matches several incoming bytes against a portion of the tree and then calls the matcher of the subsequent node. The method returns the obtained shift. This way iteration is replaced by tail recursion, and the main loop body becomes very small:

for (int pos = fromIndex; pos < text.length - len + 1;) {
    int shift = root.match (text, pos, len, len);
    if (shift == 0) return pos;
    pos += shift;
}
return -1;

We’ll follow the same approach as for SmallSuffixMatcher: build a generic tree first, converting it later to the specialised tree. Here are the classes of that tree:

private static abstract class Node
{
    public final boolean boundary;
    
    Node (boolean boundary) {
        this.boundary = boundary;
    }
    abstract int match (byte [] text, int pos, int i, int shift);
}

private static class EmptyNode extends Node
{
    EmptyNode () {super (true);}

    @Override
    int match (byte [] text, int pos, int i, int shift)
    {
        return i;
    }
}

private static final class SingleNode extends Node
{
    private final byte b;
    private final Node node;
    
    SingleNode (boolean boundary, byte b, Node node)
    {
        super (boundary);
        this.b = b;
        this.node = node == terminal ? null : node;
    }

    @Override
    int match (byte [] text, int pos, int i, int shift)
    {
        if (boundary) shift = i;
        -- i;
        if (text [pos + i] != b) return shift;
        if (node == null) return i;
        return node.match (text, pos, i, shift);
    }
}

private static final class ArrayNode extends Node
{
    private final Node [] nodes = new Node [256];
    
    ArrayNode (boolean boundary)
    {
        super (boundary);
    }

    void put (byte b, Node node)
    {
        nodes [b & 0xFF] = node;
    }
    
    @Override
    int match (byte [] text, int pos, int i, int shift)
    {
        if (boundary) shift = i;
        -- i;
        Node n = nodes [text [pos + i] & 0xFF];
        if (n == null) return shift;
        return n.match (text, pos, i, shift);
    }
}

private static final class ChainNode extends Node
{
    private final byte [] chain;
    private final Node node;
    
    ChainNode (boolean boundary, byte [] chain, Node node)
    {
        super (boundary);
        this.chain = chain;
        this.node = node;
    }

    @Override
    int match (byte [] text, int pos, int i, int shift)
    {
        if (boundary) shift = i;
        i -= chain.length;
        for (int j = 0; j < chain.length; j++) {
            if (text [pos + i + j] != chain [j]) return shift;
        }
        if (node == null) return i;
        return node.match (text, pos, i, shift);
    }
}

Most single nodes were converted to chains: our 106-byte pattern now produces 49 arrays, 9 single nodes and 107 chains of average length 49.5.

Unfortunately, the times didn’t improve:

Matcher      4     8     16    32    64    96    106
Suffix       1.89  1.33  0.74  0.41  0.27  0.22  0.11
SmallSuffix  2.09  1.86  1.10  0.61  0.35  0.25  0.10
ChainSuffix  2.50  1.77  1.11  0.61  0.35  0.25  0.15

There is still some value in this solution: it allocates less memory. Here is the amount of memory, in kilobytes, allocated for several types of matchers:

Matcher         4  8   16   32   64    96    106
LastByte        1  1   1    1    1     1     1
MultiByte       4  8   17   33   67    100   110
LastByteSuffix  1  1   1    1    1     1     1
NextByteSuffix  4  8   17   33   67    100   111
Suffix          8  30  117  500  2020  4662  5698
SmallSuffix     2  3   10   27   77    149   179
ChainSuffix     2  2   8    18   37    54    62

Light chain suffix matcher

Repository reference: LightChainSuffixMatcher.java.

There are multiple ways the ChainMatcher can be improved. We tried two more, which are not presented here because they didn’t improve speed. They are, however, published in the repository. In one, ChainSuffixMatcher2, tree nodes were specialised by the presence of a child (node being null in SingleNode or ChainNode). In the other, ChainSuffixMatcher3, they were also specialised by the boundary flag. This complicated the code but made no performance difference.

This doesn’t mean there are no ways to improve speed. And there are definitely ways to improve memory footprint. The table above shows very good memory use for ChainSuffix, but only compared to other methods. If the pattern length grows, so does the memory use, eventually making this method inapplicable. We want to move this point as far right as possible on the pattern length scale.

One way to save memory looks next to trivial: all our chains are in fact substrings of our pattern, so they can be represented as offsets and lengths, with no need to copy them into specially allocated arrays. This is a second-degree improvement compared to what we’ve just done (remember, previously each byte of a chain was represented by a node, a Java object, and before that by an array of 256 elements). Still, getting rid of these arrays looks attractive. All that’s necessary is to collect the positions of the original nodes in the pattern and, obviously, provide the ChainNodes with a reference to the pattern.

I’ll skip the modifications to the tree-building procedure (see the repository); here is the new ChainNode:

private static final class ChainNode extends Node
{
    private final byte [] pattern;
    private final int start;
    private final int len;
    private final Node node;
    
    ChainNode (byte [] pattern, int position, int len,
               boolean boundary, Node node)
    {
        super (boundary);
        this.pattern = pattern;
        this.start = position - len;
        this.len = len;
        this.node = node;
    }
    
    @Override
    int match (byte [] text, int pos, int i, int shift)
    {
        if (boundary) shift = i;
        i -= len;
        for (int j = 0; j < len; j++) {
            if (text [pos + i + j] != pattern [start + j]) {
                return shift;
            }
        }
        if (node == null) {
            return i;
        }
        return node.match (text, pos, i, shift);
    }
}

Here are the times:

Matcher           4     8     16    32    64    96    106
ChainSuffix       2.50  1.77  1.11  0.61  0.35  0.25  0.15
LightChainSuffix  2.51  1.79  1.13  0.62  0.35  0.25  0.14

And the memory use:

Matcher           4  8  16  32  64  96  106
ChainSuffix       2  2  8   18  37  54  62
LightChainSuffix  2  2  8   17  34  48  56

The memory use isn’t much lower than before, but we’ll look at this again when considering longer patterns.

There are other ways to reduce memory, e.g. to re-use some subtrees, essentially converting the tree into a DAG. This, however, falls outside of the scope of this article, which already has grown too long.

Let’s, however, make one last effort.

Lightest chain suffix matcher

Repository reference: LightestChainSuffixMatcher.java.

The LightChainSuffixMatcher is good in speed and in memory, but lacks a good procedure for its instantiation. The way the tree is created is obviously non-optimal: the final tree is well optimised for space, but the initial tree is not, and, as we’ll see later, it may require quite an amount of memory. It is also not very quick to build. Even though our primary interest is in the application of matchers rather than their instantiation, we still prefer our solution to be practically usable for a wide range of inputs.

Let’s save some time and space by creating the tree in one step. Another change is that we’ll also insert full strings into the tree rather than bytes one by one.

Here is the base class for a tree node:

private static abstract class Node
{
    boolean boundary;
            
    abstract Node add (byte [] pattern, int start, int len);
    abstract int match (byte [] text, int pos, int i, int shift);
}

A new method add will insert a given substring of the pattern into this node, creating, if necessary, more nodes and, possibly, replacing this node with a new one.

We’ll sacrifice the SingleNode (it will be replaced with a ChainNode of length 1; it could be re-introduced, but the code would become longer and less clear, with uncertain benefits). There will be three subclasses of Node:

1) EmptyNode: a terminal node (always a boundary)

2) ArrayNode: a node with a 256-element array of children

3) ChainNode: a node containing a chain of bytes (a substring of the pattern), followed, possibly, by another node.

The boundary bit indicates that the path in the tree from the root to (but not including) this node represents a valid prefix of the pattern. The EmptyNode always has this bit on. This bit always applies to the whole node; if a prefix of the pattern ends in the middle of a chain, this chain is split in two.

Here are our nodes:

private static class EmptyNode extends Node
{
    EmptyNode () {boundary = true;}
    
    @Override
    Node add (byte [] pattern, int start, int len)
    {
        Node n = new ChainNode (pattern, start, len, null);
        n.boundary = true;
        return n;
    }

    @Override
    int match (byte [] text, int pos, int i, int shift)
    {
        return i;
    }
}

private static final class ArrayNode extends Node
{
    private final Node [] nodes = new Node [256];
    
    @Override
    Node add (byte [] pattern, int start, int len)
    {
        byte b = pattern [start + --len];
        Node n = nodes [b & 0xFF];
        if (n != null) {
            if (len == 0) {
                n.boundary = true;
            } else {
                nodes [b & 0xFF] = n.add (pattern, start, len);
            }
        } else {
            nodes [b & 0xFF] = len == 0 ? terminal :
                                new ChainNode (pattern, start, len, null);
        }
        return this;
    }
    
    void put (byte b, Node node)
    {
        if (node == null) node = terminal;
        nodes [b & 0xFF] = node;
    }

    @Override
    int match (byte [] text, int pos, int i, int shift)
    {
        if (boundary) shift = i;
        -- i;
        Node n = nodes [text [pos + i] & 0xFF];
        if (n == null) return shift;
        return n.match (text, pos, i, shift);
    }
}

private static final class ChainNode extends Node
{
    private final byte [] pattern;
    private int start;
    private int len;
    private Node node;
    
    ChainNode (byte [] pattern, int start, int len, Node node)
    {
        this.pattern = pattern;
        this.start = start;
        this.len = len;
        this.node = node;
    }

    @Override
    Node add (byte [] pattern, int start, int len)
    {
        Node result = this;
        int end = start + len - 1;
        int this_end = this.start + this.len - 1;
        
        int min_len = Math.min (len,  this.len);
        int i;
        for (i = 0; i < min_len; i++) {
            if (pattern [end - i] != pattern [this_end - i]) {
                break;
            }
        }
        if (i == this.len) {
            if (this.len == len) {
                if (node != null) {
                    node.boundary = true;
                }
            } else {
                if (node != null) {
                    node = node.add (pattern, start, len - i);
                } else {
                    node = new ChainNode (pattern, start, len - i, null);
                    node.boundary = true;
                }
            }
        } else if (i == len) {
            node = new ChainNode (pattern, this.start, this.len - len, node);
            node.boundary = true;
            this.start = this.start + this.len - i;
            this.len = i;
        } else {
            byte chain_byte = pattern [this_end - i];
            ArrayNode array = new ArrayNode ();
            if (i == 0) {
                array.boundary = boundary;
                boundary = false;
                -- this.len;
                array.put (chain_byte, this.len == 0 ? node : this);
                result = array;
            } else if (i == this.len - 1) {
                ++ this.start;
                -- this.len;
                array.put (chain_byte, node);
                node = array;
            } else {
                Node new_chain = new ChainNode (pattern, this.start,
                                                this.len-i-1, node);
                this.start = this.start + this.len - i;
                this.len = i;
                array.put (chain_byte, new_chain);
                node = array;
            }
            array.add (pattern, start, len - i);
        }
        return result;
    }
    
    @Override
    int match (byte [] text, int pos, int i, int shift)
    {
        if (boundary) shift = i;
        i -= len;
        for (int j = 0; j < len; j++) {
            if (text [pos + i + j] != pattern [start + j])
                return shift;
        }
        if (node == null) return i;
        return node.match (text, pos, i, shift);
    }
}

The ChainNode.add looks complicated but actually is not. It must cater for four major cases:

1) the new string matches the whole chain and ends exactly where the chain ends: the next node is marked as a boundary;

2) the new string matches the whole chain and is longer: the remainder is passed down to the next node (a new chain is created if there is none);

3) the new string is shorter than the chain and matches its tail: the chain is split in two, with the boundary marked between the parts;

4) the new string diverges from the chain at some point: the chain is split at that point and an ArrayNode is inserted, branching to the remainder of the chain and to the remainder of the new string.

The tree-building procedure now looks very neat:

private Node build_root ()
{
    Node root = terminal;
    for (int prefix_len = pattern.length; prefix_len >= 1; prefix_len--) {
        root = root.add (pattern, 0,  prefix_len);
    }
    return root;
}

The matching procedure looks nice, too:

public int indexOf (byte[] text, int fromIndex)
{
    int len = pattern.length;
    for (int pos = fromIndex; pos < text.length - len + 1;) {
        int shift = root.match (text, pos, len, len);
        if (shift == 0) return pos;
        pos += shift;
    }
    return -1;
}
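
As a usage illustration (assuming the constructor simply takes the pattern bytes, like in the other matchers):

byte [] text = "doubt thou the stars are fire".getBytes ();
LightestChainSuffixMatcher matcher =
        new LightestChainSuffixMatcher ("stars".getBytes ());
System.out.println (matcher.indexOf (text, 0));   // prints 15
System.out.println (matcher.indexOf (text, 16));  // prints -1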

The times are very close to those of the LightChainSuffixMatcher – removal of the SingleNode hasn’t affected them:

Matcher 4 8 16 32 64 96 106
LightChainSuffix 2.51 1.79 1.13 0.62 0.35 0.25 0.14
LightestChainSuffix 2.43 1.71 1.11 0.61 0.35 0.25 0.14

This was the last of the algorithms we planned to look at today. Now it’s time to look at other aspects of string search.

String matcher

The primary reason we went on this journey was the absence of an indexOf method for byte arrays, which I needed for my real-world problem. However, such methods do exist for Java Strings, and we can compare them with our solutions converted to work with strings.

We’ll convert some of our solutions (too much work to do it for everything; besides, it feels unnecessary), namely Simple, FirstBytes, LastByte and Suffix. To simplify things, we’ll ignore Unicode and limit our character set to 256 characters.
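
To give the flavour of such a conversion, here is a minimal sketch of a string-based last-byte matcher (a hypothetical stand-alone method, not the repository code), with characters truncated to the 256-character range as described:

static int lastByteIndexOf (String text, String pattern, int fromIndex)
{
    int len = pattern.length ();

    // shift table: distance from the rightmost occurrence of each
    // character (the last position excluded) to the end of the pattern
    int [] shifts = new int [256];
    java.util.Arrays.fill (shifts, len);
    for (int i = 0; i < len - 1; i++) {
        shifts [pattern.charAt (i) & 0xFF] = len - 1 - i;
    }
    char last = pattern.charAt (len - 1);

    for (int pos = fromIndex; pos <= text.length () - len;) {
        char c = text.charAt (pos + len - 1);
        if (c != last) {
            pos += shifts [c & 0xFF];
        } else if (text.regionMatches (pos, pattern, 0, len)) {
            return pos;
        } else {
            pos += shifts [c & 0xFF];  // shift for the matching last character
        }
    }
    return -1;
}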

Besides the mentioned matchers, we’ll add three string-specific ones:

1) Standard, based on the library String.indexOf;

2) Indexof, our own implementation of the same simple algorithm;

3) Regex, based on the standard Java regular expression library (java.util.regex).

The code is here, and the main class is here.

Here are the times:

Matcher 4 8 16 32 64 96 106
Standard 1.43 1.37 1.41 1.41 1.43 1.35 0.72
Simple 2.99 3.42 3.45 3.45 3.49 3.38 2.58
Indexof 1.53 1.51 1.54 1.53 1.56 1.47 0.80
FirstBytes 1.16 2.14 2.16 2.16 2.18 2.10 1.56
Regex 2.25 1.26 0.79 0.56 0.41 0.35 0.31
LastByte 2.03 0.90 0.58 0.43 0.33 0.28 0.23
Suffix 1.93 1.32 0.73 0.42 0.27 0.22 0.13

The first place in each column is marked green, the second is yellow and the third is red.

The observations:

1) the standard String.indexOf runs at virtually the same speed for all pattern lengths, and so do the other simple matchers (Simple, Indexof, FirstBytes);

2) the Regex matcher starts slowly but improves steadily with the pattern length, overtaking the standard matcher at length 16;

3) FirstBytes is the fastest on the shortest patterns, but from length 8 onwards LastByte and Suffix win, Suffix taking the lead on the longest ones.

The Regex pattern matcher

Repository reference: RegexByteMatcher.java.

The pattern matching engine of Java is designed to handle much more complex cases than ours: it works with potentially very sophisticated regular expressions. Taking this into account, this engine performed very well in our test, even though we managed to overtake it. How was this achieved?

The pattern search routine, starting at java.util.regex.Matcher.find (), quickly ends up at java.util.regex.Pattern.BnM.match () (BnM standing for the Boyer–Moore search algorithm, 1977):

boolean match(Matcher matcher, int i, CharSequence seq) {
    int[] src = buffer;
    int patternLength = src.length;
    int last = matcher.to - patternLength;

NEXT:
     while (i <= last) {
         for (int j = patternLength - 1; j >= 0; j--) {
             int ch = seq.charAt(i+j);
             if (ch != src[j]) {
                 i += Math.max(j + 1 - lastOcc[ch&0x7F], optoSft[j]);
                 continue NEXT;
             }
         }
         .....
     }
     .....
}

There is more code there, related to matching more complex patterns than just a simple string, which we can ignore for now. This is, however, the main execution path in our case. Here are some immediate observations on this code:

1) the comparison runs right to left, just like in our suffix matchers;

2) the character is truncated to seven bits (ch&0x7F) for the table lookup, so the bad character table (lastOcc) only needs 128 entries, with all characters outside this range sharing the slots;

3) on each mismatch two shifts are computed, one from the bad character table and one from the good suffix table (optoSft), and the greater one is applied.

We didn’t use this exact algorithm anywhere above; the closest to it was the NextByteSuffixMatcher. This is what the Regex code does:

1) compare the current portion to the pattern right to left and identify the good suffix (recall that this is the longest suffix that matches, and that the most common case is the suffix of length zero);

2) if this suffix spans the entire pattern, we have a match;

3) otherwise, look up the shift for this suffix in a prepared table, just as we did;

4) then look at the bad character (the first character that didn’t match) and get the shift for this character; the greater of the two shifts is used.

The last step differs from our implementation: we had a separate shift table for each position; they use just one table, calculated for the last position (exactly the same as our table in the LastByteMatcher), adjusting the shift by the bad character’s position. The shift obtained this way, while safe, must generally be no better than the shift we got. Let’s see by how much:

Statistic 4 8 16 32 64 96 106
Avg shift, LastByteSuffix 3.5 6.0 9.6 13.9 19.2 22.2 24.8
Avg shift, Regex 3.5 6.1 10.0 14.8 20.9 24.3 27.2
Avg shift, NextByteSuffix 3.5 6.1 10.0 14.8 20.9 24.3 27.2

Here comes a surprise: the average shift counts for the Regex are exactly the same as for NextByte. Is it simply a rounding problem, the results being similar but not equal? We can print the raw data used to calculate averages. They are exactly the same. For instance, for the pattern length of 16 both algorithms report the shift sum of 1514837185 for 151637812 operations.

This can’t be a coincidence, and, indeed, it’s not. The shift generated by multiple tables (NextByte) can differ from the shift from a single table (Regex) only when the latter is negative, that is, when the bad character occurs in the good suffix.

Anything that can only be found to the left of the bad character gets the same shifts from both algorithms:

 beetles oer his base into the sea and there ass
                doubt thou the s

Here the bad character is o, which gets the shift of one from the NextByteMatcher array; the shift of 7 found in the Regex table, adjusted by its position 6, also produces 1.

When the Regex shift of the bad character is negative, the final shift is determined by the good suffix shift. It is easy to see that it is never smaller than the shift of NextByteMatcher for the bad character.

Let the pattern length be l and the good suffix length be n. We have two cases:

1) the good suffix occurs nowhere else in the pattern; then the table contains the shift l − m, where m is the length of the longest tail of the good suffix that is also a prefix of the pattern. Since m < n, this shift is at least l − n + 1, more than any shift the bad character (which sits immediately to the left of the suffix) can produce;

2) the good suffix occurs again in the pattern, k positions to the left; then the table contains the shift k. The bad character occurs inside the good suffix, and therefore inside its copy as well, k positions to the left of itself, so the NextByte shift can’t exceed k.

Here is an example demonstrating the first case, where m = 3:

which his melancholy sits on brood and i do doubt the hatch and
               doubt that the sun doth move dou

The bad character is o, and its shift is −3, so the shift of 29 is produced by the good suffix " dou", of which the suffix dou is a prefix of the pattern (l = 32, n = 4, m = 3).

Here is an example demonstrating the second case:

for every thing is seald and done
               sun doth move do

Here the suffix is " do" of length 3, and its shift is k = 10. The bad character is d (it occurs in this suffix), and its previous occurrence is 8 positions to the left. The NextByte implementation will give 8, the Regex will produce −2, and the total shift will be determined by the suffix shift (10).

So the algorithm with one table indeed produces exactly the same results as the algorithm with a table at every character position. This is remarkable, and it wasn’t immediately obvious. There is a reason algorithms get named after their inventors.

It’s worth noticing, however, that the Regex algorithm, smart as it is, is still a variation of our LastByteMatcher: in 90% of cases (when the last character does not match) it produces the same shift. The reason it performs so well as a string matcher (although it lost to LastByte, and lost big time to Suffix on long patterns) is its simplicity.

The RegexByteMatcher

Repository reference: RegexByteMatcher.java.

A detour into string matchers was quite productive: we’ve learned another good algorithm, close, but not identical, to those considered before. Let’s return to byte array matching territory and try implementing this algorithm. The new matcher gets its name from its origin (RegexByteMatcher):

public int indexOf (byte[] text, int fromIndex)
{
    byte [] pattern = this.pattern;
    int len = pattern.length;
    int [] shifts = this.shifts;
    int [] suffix_shifts = this.suffix_shifts;
    
    int pos = fromIndex;
NEXT:
    while (pos <= text.length - len) {
        for (int i = len-1; i >= 0; i--) {
            byte b = text [pos + i];
            if (b != pattern [i]) {
                pos += Math.max (shifts [b & 0xFF] - (len - 1 - i),
                                 suffix_shifts [i]);
                continue NEXT;
            }
        }
        return pos;
    }
    return -1;
}
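
The code above relies on two pre-computed tables, shifts and suffix_shifts, whose construction is not shown. A naive sketch of it could look like this (an illustration only, not the repository code; the preprocessing can be done more cleverly, in linear time):

int len = pattern.length;

// bad character table: distance from the rightmost occurrence of each
// byte (the last position excluded) to the end of the pattern
int [] shifts = new int [256];
java.util.Arrays.fill (shifts, len);
for (int i = 0; i < len - 1; i++) {
    shifts [pattern [i] & 0xFF] = len - 1 - i;
}

// good suffix table: for a mismatch at position i, the smallest shift
// at which the suffix pattern [i+1 .. len-1] still matches the pattern;
// for the empty suffix it contains zero, so the bad character shift wins
int [] suffix_shifts = new int [len];
suffix_shifts [len - 1] = 0;
for (int i = 0; i < len - 1; i++) {
    int d = 1;
    search:
    for (; d < len; d++) {
        for (int j = i + 1; j < len; j++) {
            if (j - d >= 0 && pattern [j - d] != pattern [j]) continue search;
        }
        break;   // all suffix positions matched: d is the shift
    }
    suffix_shifts [i] = d;
}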

The shift statistics are exactly as expected, which means we’ve implemented it correctly. The times, however, are not as good:

Matcher 4 8 16 32 64 96 106
NextByteSuffix 1.70 0.98 0.60 0.42 0.31 0.26 0.21
Regex 2.16 1.24 0.76 0.52 0.38 0.32 0.27

What could have caused such a difference? The NextByteSuffix matcher, apart from using multiple tables, also employs special treatment of the last byte. This is a classic case of a partial loop unroll: if the loop is very likely to be executed exactly once (which is our case: the probability is 90%), and the loop body in this case is simpler than the general one, move it out of the loop and adjust the iteration count. Sometimes a compiler can perform this transformation automatically, but often it does not have enough information to do so. Let’s help it. We saw previously that this helped in the FirstByteMatcher; let’s try again and call the result RegexMatcher2 (repository reference: RegexByteMatcher2.java).

public int indexOf (byte[] text, int fromIndex)
{
    byte [] pattern = this.pattern;
    int len = pattern.length;
    int [] shifts = this.shifts;
    int [] suffix_shifts = this.suffix_shifts;
    byte last = pattern [len-1];
    
    int pos = fromIndex;
NEXT:
    while (pos <= text.length - len) {
        byte b = text [pos + len-1];
        if (b != last) {
            pos += shifts [b & 0xFF];
            continue NEXT;
        }
        for (int i = len-2; i >= 0; i--) {
            b = text [pos + i];
            if (b != pattern [i]) {
                pos += Math.max (shifts [b & 0xFF] - (len - 1 - i),
                                 suffix_shifts [i]);
                continue NEXT;
            }
        }
        return pos;
    }
    return -1;
}

Two factors may contribute to the performance of this code: the last variable is loaded once (and, hopefully, kept in a register), and the suffix_shifts array is not accessed when the suffix length is zero (the shift array contains zero in this case, but it would be too much to expect the compiler to make use of that). This is the speed:

Matcher 4 8 16 32 64 96 106
LastByte 1.54 0.90 0.56 0.40 0.30 0.26 0.22
NextByteSuffix 1.70 0.98 0.60 0.42 0.31 0.26 0.21
Regex 2.16 1.24 0.76 0.52 0.38 0.32 0.27
Regex2 1.74 1.00 0.62 0.43 0.31 0.26 0.21

Performance has improved quite a bit and is now pretty much the same as that of the NextByteSuffixMatcher, but with simpler code. It is, however, still more complex than the simple LastByteMatcher, without being faster.

Results summary

We tried many options. Here are all the results:

Matcher 4 8 16 32 64 96 106
Simple 2.90 2.89 3.30 3.28 3.29 3.21 2.68
FirstByte 1.47 1.51 1.52 1.52 1.49 1.46 0.92
FirstBytes 1.18 1.35 1.36 1.37 1.38 1.32 0.86
FirstBytes2 1.21 1.18 1.19 1.19 1.20 1.16 0.83
Compile 1.18 1.38 1.38 1.39 1.34 1.27 0.80
Hash 2.59 2.47 2.43 2.42 2.40 2.39 2.40
LastByte 1.54 0.90 0.56 0.40 0.30 0.26 0.22
MultiByte 2.59 1.90 1.32 0.95 0.69 0.54 0.45
MultiByte2 2.39 1.73 1.22 0.88 0.63 0.48 0.41
LMultiByte (2) 2.63 1.90 1.20 0.71 0.45 0.37 0.22
LMultiByte (3) 2.66 1.96 1.29 0.80 0.51 0.41 0.28
LMultiByte (4) 2.66 1.98 1.35 0.86 0.55 0.43 0.30
LMultiByte2 2.06 1.32 0.77 0.48 0.34 0.28 0.18
UnrolledLMultiByte (2) 2.14 1.38 0.79 0.48 0.33 0.27 0.16
UnrolledLMultiByte (3) 2.19 1.49 0.90 0.54 0.36 0.29 0.15
UnrolledLMultiByte (4) 2.31 1.57 0.97 0.59 0.38 0.30 0.11
NextByteSuffix 1.70 0.98 0.60 0.42 0.31 0.26 0.21
Regex 2.16 1.24 0.76 0.52 0.38 0.32 0.27
Regex2 1.74 1.00 0.62 0.43 0.31 0.26 0.21
LastByteSuffix 1.56 0.91 0.57 0.40 0.30 0.25 0.20
Suffix 1.89 1.33 0.74 0.41 0.27 0.22 0.11
CompileSuffix 1.93 1.43 0.91 0.52 0.32 0.24 0.16
SmallSuffix 2.09 1.86 1.10 0.61 0.35 0.25 0.10
ChainSuffix 2.50 1.77 1.11 0.61 0.35 0.25 0.15
LightChainSuffix 2.51 1.79 1.13 0.62 0.35 0.25 0.14
LightestChainSuffix 2.43 1.71 1.11 0.61 0.35 0.25 0.14

The colour scheme is the same as before (green, yellow and red for the first, second and third place). The compiled versions are excluded from the competition as very exotic. LastByteMatcher and its variants are clearly winning on shorter patterns, while the SuffixMatcher family takes over on the longer ones.

It is remarkable that the biggest improvements were obtained by algorithmic changes rather than low-level optimisations. It is also worth noticing that the generic Suffix matcher turned out both neater and faster than the specialised LastByteSuffix matcher.

Note: since the publication of this article, a new, better matcher has been suggested; see Update 2.

Random data

Until now we’ve been working with natural text, significantly simplified (lowercase, no punctuation, no line breaks, a total alphabet size of 27). How will the results change if our text and our pattern become totally random sequences of bytes?

We’ll generate a text a bit bigger than before (4 MB) and use some position inside it as the source of our patterns. Since the data are random, we don’t need very many patterns to get an average value: we’ll just use four. We won’t run all our algorithms now, just the most significant ones:

Matcher 4 8 16 32 64 128 256
Simple 1.92 1.92 1.92 1.92 2.55 2.55 2.54
FirstBytes 0.46 1.01 1.01 1.01 1.01 1.01 1.01
Hash 2.38 2.38 2.36 2.37 2.36 2.36 2.36
LastByte 1.26 0.64 0.31 0.16 0.10 0.07 0.07
NextByteSuffix 1.20 0.61 0.31 0.16 0.10 0.06 0.06
Regex2 1.20 0.60 0.30 0.16 0.10 0.06 0.06
Suffix 0.52 0.28 0.16 0.10 0.08 0.06 0.04
SmallSuffix 0.63 0.35 0.20 0.13 0.11 0.08 0.06
ChainSuffix 0.72 0.40 0.23 0.15 0.11 0.09 0.07
LightChainSuffix 0.73 0.40 0.23 0.15 0.11 0.09 0.07
LightestChainSuffix 0.64 0.36 0.23 0.15 0.11 0.09 0.07

Surprisingly, the top performing solutions haven’t changed. Suffix is the best, although its advantage over LastByte is small.

All simple solutions fall way behind, so better algorithms are beneficial for random data as well.
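
For reference, the setup of this test could look like the following sketch (the seed and the pattern position here are arbitrary assumptions):

int len = 64;   // one of the tested pattern lengths
java.util.Random random = new java.util.Random (12345);
byte [] text = new byte [4 * 1024 * 1024];
random.nextBytes (text);
// take the pattern from inside the text, so that it is found at least once
byte [] pattern = java.util.Arrays.copyOfRange (text, 1000000, 1000000 + len);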

Let’s try longer patterns:

Matcher 512 1024 2048 4096 8192
LastByte 0.057 0.051 0.054 0.054 0.051
Suffix 0.025 0.023 0.030 fail fail
SmallSuffix 0.025 0.013 0.010 0.012 0.018
ChainSuffix 0.025 0.012 0.008 0.007 0.010
LightChainSuffix 0.024 0.010 0.006 0.005 0.004
LightestChainSuffix 0.033 0.020 0.014 0.009 0.006

The SuffixMatcher failed on long patterns due to shortage of RAM, so it was good that we bothered to implement the more compact SmallSuffix and ChainSuffix. They started slowing down due to shortage of L3 cache and will eventually also fail on even longer patterns; on the current ones, however, they perform quite well. Here are the amounts of allocated memory, in megabytes:

Matcher 512 1024 2048 4096 8192
LastByte 0 0 0 0 0
Suffix 139 556 2228 fail fail
SmallSuffix 3.3 12.8 51 202 806
ChainSuffix 0.3 0.8 2.5 9.0 35
LightChainSuffix 0.2 0.3 0.4 0.5 1.1
LightestChainSuffix 0.2 0.3 0.4 0.5 1.1

Big random patterns

How about even longer patterns? The SmallSuffix is already about to fail. The chain suffix matchers look much better, but two of them suffer from the same problem: both the ChainSuffixMatcher and the LightChainSuffixMatcher first build a classic search tree, which takes a lot of space. Here is the amount of memory this tree takes, in megabytes:

Matcher 512 1024 2048 4096 8192
ChainSuffix 14 54 219 872 3489

This is where our effort to make the LightestChainSuffixMatcher pays off: of the tree-based matchers, it is the only one that can survive doubling of the pattern size (obviously, the less efficient but less memory-hungry matchers can survive it, too). Let’s try to grow our pattern to one megabyte. First of all, let’s check if using LightestChain is feasible at all: we know it builds a big tree; how long does that take?

Pattern size 16K 32K 64K 128K 256K 512K 1M
Time, ms 2 8 10 25 60 100 310

The time of 310 ms isn’t insignificant, but still appropriate for practical use. Now, let’s check the matching times:

Matcher 16K 32K 64K 128K 256K 512K 1M
LastByte 0.053 0.055 0.063 0.073 0.095 0.140 0.227
Regex2 0.052 0.052 0.060 0.067 0.082 0.116 0.178
LightestChain 0.006 0.010 0.023 0.046 0.090 0.177 0.353

The results are somewhat disappointing: starting impressively at nearly 10 times the speed of Regex, the LightestChain matcher eventually (at 256K-byte patterns) loses its advantage, and it’s easy to see why. Let’s look at the memory consumption of the matchers:

Matcher 16K 32K 64K 128K 256K 512K 1M
Regex2 0.067 0.132 0.263 0.525 1.050 2.098 4.195
LightestChain 2.7 7.7 21 46 74 95 137

(all LastByte matchers use exactly 1064 bytes).

We ran out of L3 cache, which on our machine is only 15 MB per chip. Our text (4 MB) fits in the cache, even together with the data of the simple matchers. That’s why, despite their higher number of iterations, they outperform the chain matcher.

Big random patterns, huge data

Finally, let’s run the ultimate test: we’ll search for long patterns (16K to 1M) in a sixteen-gigabyte text. Technically, this text is arranged as 16 arrays of one gigabyte each, one of which contains the pattern. The search happens exactly once, to prevent caching.
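
A sketch of this arrangement (an assumption of how it could be coded; a single Java array can’t hold 16 GB anyway, since array indices are limited to Integer.MAX_VALUE):

byte [][] chunks = new byte [16][1 << 30];   // 16 arrays of 1 GB each
// ... fill them with random data and plant the pattern in one of them ...
for (byte [] chunk : chunks) {               // matcher is any byte-array matcher
    int pos = matcher.indexOf (chunk, 0);
    if (pos >= 0) {
        // found; in this test the pattern never crosses a chunk boundary
    }
}

Here are the times: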

Matcher 16K 32K 64K 128K 256K 512K 1M
LastByte 0.37  0.34  0.35  0.34   0.35   0.35   0.35  
Regex2 0.37  0.35  0.35  0.34   0.34   0.34   0.34  
Lightest 0.009 0.005 0.003 0.0018 0.0009 0.0005 0.0003

At first it looks strange that the lightest chain suffix matcher is so much faster now than when the text was cached. Most probably, this is due to a much lower success rate (previously the pattern was found once in every four megabytes of scanned text), and it is the matches that are slow.

The performance advantage of this matcher on uncached data is impressive. When the pattern does not match the current portion, we shift the pattern by a value very close to the length of the pattern, avoiding a lot of uncached reads. Instrumenting the matcher code with a counter shows that matching each gigabyte of the text involves reading only about 3K bytes of it. The total number of bytes read from all 16 gigabytes is 2150497, of which 2100280 were read when processing the part that contained the match.

This makes it a good solution for searching in files, too: if one has a multi-terabyte file and needs to find a one-megabyte fragment inside, this is the way to go. This, however, doesn’t seem like a realistic practical problem.

Obviously, the LightestChainSuffixMatcher has its weak points, too. The tree for a one-megabyte pattern occupies 137 MB, so searching for much longer patterns requires radical space reduction. The time to build the tree is small, but not completely insignificant: for the same tree it is 310 ms. Some improvement may be needed there, too.

Update: UTF-8

On reddit, user pgris asked how these algorithms would perform on UTF-8 data, e.g. a long Chinese text. This is a very valid question. First of all, will the algorithms work at all? Characters in UTF-8 take more than one byte, and there is a risk of misaligning the pattern (trying to match the beginning of one character to the middle of another). Fortunately, the design of UTF-8 makes this situation impossible: the first byte of each character has high bits 00, 01 or 11, while all subsequent bytes start with 10. This makes it possible to search in UTF-8 strings just as in ordinary byte blobs.
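
In code, this property is a trivial check (shown only for illustration):

// a UTF-8 continuation byte always has the form 10xxxxxx,
// so the start of the pattern can never match the middle of a character
static boolean isContinuation (byte b)
{
    return (b & 0xC0) == 0x80;
}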

Performance, however, is a different issue: multi-byte characters may affect some of the algorithms negatively. For instance, if Chinese characters use a small set of first bytes, the first byte matcher will slow down; if some last bytes are very frequent, the last byte matcher will suffer, and so on. Let’s try it.

On the Loyal Books site I found a long enough (1.8M) Chinese text: “The Three Kingdoms Romance” by Guanzhong Luo. The text contains Chinese characters, some punctuation and line breaks. It is in the repository under the name Book-23950.txt. We’ll use some arbitrarily chosen line as our pattern; its length is 114 bytes (38 characters). When searching for substrings, their lengths must be adjusted so that we always search for whole characters.

Here are the times:

Matcher 6 9 18 33 66 96 114
Simple 2.38 2.36 2.36 2.36 2.36 3.14 3.14
FirstByte 1.04 1.03 1.03 1.03 1.03 1.04 1.04
FirstBytes 0.97 0.96 0.96 0.96 0.96 0.96 0.96
LastByte 0.84 0.56 0.32 0.20 0.16 0.14 0.12
MultiByte 1.24 0.93 0.64 0.52 0.40 0.32 0.29
LastByteSuffix 0.83 0.54 0.30 0.20 0.15 0.13 0.12
RegexByte2 0.84 0.55 0.31 0.20 0.15 0.13 0.12
Suffix 0.63 0.49 0.37 0.30 0.20 0.15 0.14
SmallSuffix 0.76 0.58 0.45 0.43 0.33 0.20 0.17
ChainSuffix 0.85 0.67 0.54 0.50 0.30 0.20 0.17
LightChainSuffix 0.86 0.68 0.54 0.47 0.30 0.20 0.17
LightestChainSuffix 0.53 0.44 0.54 0.49 0.31 0.21 0.18

The LastByte matcher and its family (such as Regex) performed very well, while the Suffix matcher and friends fell a bit behind (although they don’t look too bad, either). Perhaps they need longer patterns to excel in the UTF-8 case.

Another important feature of UTF-8 searching is Unicode normalisation, but that falls beyond the limits of our study. The same applies to case-insensitive search. After all, initially we only wanted to search for bytes…

Update 2: TwoByteHashShiftMatcher

During a discussion on reddit, user john16384 suggested another solution, which outperforms those published above.

The idea is similar to the LastByteMatcher, except that instead of a single last byte we look at the suffix of length 2. The shifts for these suffixes are pre-calculated and stored in an array. Storing shifts for all possible suffixes can be expensive (we’d need an array of 64K elements, or 256K bytes, which is still smaller than many of our solutions), so we save space by using hash values. Of all possible hash functions, the simplest one is used:

    hash = (first_byte << shift_value) ^ second_byte;

Varying the shift, we can change the table size and the algorithm performance.
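
Here is a sketch of how the shift table could be filled (a hypothetical constructor fragment; the actual repository code may differ):

int len = pattern.length;
int [] shifts = new int [256 << p2];

// a pair that never occurs in the pattern allows a shift of len - 1:
// the second byte of the pair may still match the first pattern byte
java.util.Arrays.fill (shifts, len - 1);

// walk the pattern left to right: rightmost occurrences overwrite earlier
// ones, so colliding hash values keep the smallest (i.e. the safe) shift;
// the last pair gets the shift of zero, meaning "compare here"
for (int i = 1; i < len; i++) {
    int hash = ((pattern [i - 1] & 0xff) << p2) ^ (pattern [i] & 0xff);
    shifts [hash] = len - 1 - i;
}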

The matcher looks like this (full text in repository):

public int indexOf (byte[] text, int fromIdx) {
    int pattern_len = pattern.length;
    int text_len = text.length;
    int offset = pattern_len - 1;
    int i = fromIdx + offset;
    int maxLen = text_len - pattern_len + offset;

    while(i < maxLen) {
        int hash = ((text[i - 1] & 0xff) << p2) ^ (text[i] & 0xff);
        int skip = shifts[hash];

        if(skip == 0) {  // No skip, let's compare
            if(compare (text, i - offset, pattern, pattern_len)) {
                return i - offset;
            }
            i++;  // Compare failed, move ahead 1.
        }
        i += skip;  // Can be done always, if skip was zero it does nothing.
    }
    return -1;
}

Another difference from the LastByteMatcher is that it doesn’t compare the last bytes to those of the pattern; instead, it uses the shift value of zero to detect that they match. This reduces the shift achieved in this case (one instead of the shift to the second rightmost occurrence of these two bytes), but this happens very rarely.

This matcher works very fast:

Matcher 4 8 16 32 64 96 106
LastByte 1.54 0.90 0.56 0.40 0.30 0.26 0.22
Suffix 1.89 1.33 0.74 0.41 0.27 0.22 0.11
TwoByteHashShift(0) 1.93 0.87 0.45 0.27 0.20 0.17 0.155
TwoByteHashShift(1) 1.66 0.74 0.37 0.20 0.14 0.12 0.113
TwoByteHashShift(2) 1.66 0.74 0.37 0.20 0.13 0.11 0.107
TwoByteHashShift(3) 1.67 0.76 0.36 0.19 0.12 0.10 0.098
TwoByteHashShift(4) 1.83 0.81 0.39 0.21 0.13 0.10 0.093
TwoByteHashShift(5) 1.67 0.72 0.35 0.19 0.12 0.09 0.088

Results improve as shift_value (called p2 in the program) grows, and from value 1 onwards they are better than those of both LastByte and Suffix. The improvement rate slows down after two bits and stops after five. To see why, let’s look at the average shift values:

Matcher 4 8 16 32 64 96 106
LastByte 3.5 6.0 9.6 13.9 19.2 22.2 24.8
TwoByteHashShift(0) 2.8 6.2 12.1 21.0 33.1 43.3 47.1
TwoByteHashShift(1) 2.9 6.5 13.2 24.6 43.0 56.1 60.2
TwoByteHashShift(2) 2.9 6.5 13.2 24.8 44.2 60.2 62.9
TwoByteHashShift(3) 2.9 6.6 13.6 25.9 47.3 65.5 70.7
TwoByteHashShift(4) 2.9 6.6 13.6 26.0 47.9 67.3 74.9
TwoByteHashShift(5) 2.9 6.7 13.7 26.4 48.9 69.1 75.3
TwoByteHashShift(8) 2.9 6.7 13.7 26.4 48.9 69.1 75.3
Suffix 3.9 7.9 15.9 31.9 63.9 95.8 105.9

The shifts are worse than those of Suffix but better than those of LastByte. They gradually improve with shift_value but saturate at 5. With the 5-bit shift (13 bits in total, an array of 8192 elements) the solution works just as well as with the full 8 bits – probably thanks to our small, 27-character, alphabet.

The compare() rate varies from 0.9% (long patterns, 5 bits) to 3% (short patterns, 0 bits), with a typical value of 1.5% in the middle.

Now let’s look at the binary data.

Matcher 4 8 16 32 64 128 256
LastByte 1.26 0.64 0.31 0.16 0.10  0.07  0.07 
Suffix 0.52 0.28 0.16 0.10 0.08  0.06  0.04 
TwoByteHashShift(0) 1.72 0.74 0.35 0.18 0.104 0.073 0.068
TwoByteHashShift(1) 1.71 0.73 0.35 0.16 0.094 0.060 0.053
TwoByteHashShift(2) 1.76 0.76 0.35 0.16 0.091 0.055 0.047
TwoByteHashShift(3) 1.76 0.75 0.35 0.16 0.091 0.056 0.046
TwoByteHashShift(4) 1.76 0.76 0.36 0.16 0.097 0.059 0.048

The results (except for very small patterns) are better than those of LastByte but slightly worse than those of Suffix. Still, this makes the matcher very attractive because of its simplicity. The shift value of 2 is enough; here are its average shifts, which are very good:

Matcher 4 8 16 32 64 128 256
TwoByteHashShift(2) 3.0 6.7 15.0 30.5 61.1 119.4 226.0

Now, the big patterns:

Matcher 16K 32K 64K 128K 256K 512K 1M
LastByte 0.053 0.055 0.063 0.073 0.095 0.140 0.227
LightestChain 0.006 0.010 0.023 0.046 0.090 0.177 0.353
TwoByteHash(0) 0.056 0.060 0.068 0.076 0.097 0.141 0.226
TwoByteHash(1) 0.039 0.042 0.047 0.057 0.081 0.123 0.216
TwoByteHash(2) 0.019 0.023 0.032 0.041 0.063 0.108 0.197
TwoByteHash(3) 0.009 0.012 0.020 0.032 0.058 0.106 0.203
TwoByteHash(4) 0.006 0.009 0.015 0.028 0.051 0.098 0.192

Again, it performs better than our previous solutions.

Finally, the huge test:

Matcher 16K 32K 64K 128K 256K 512K 1M
LastByte 0.37   0.34   0.35   0.34   0.35   0.35   0.35  
Lightest 0.0088 0.0054 0.0031 0.0018 0.0009 0.0005 0.0003
TBHS(4) 0.0293 0.0288 0.0319 0.0272 0.0308 0.0263 0.0295
TBHS(8) 0.0108 0.0061 0.0038 0.0030 0.0026 0.0026 0.0026

Only now has the Lightest matcher managed to catch up with the two-byte hashing one. However, while the Lightest is approaching its limit of applicability at the pattern length of 1M, the two-byte one has no such limit. Besides, there is a chance that for the huge-array case it can be upgraded to a three-byte or four-byte one, using some good enough hash function.
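
For instance, the hash could be extended like this (purely an assumption, not something tested here):

// three-byte variant; the mask keeps the value within a table
// whose size is a power of two
int hash = (((text [i - 2] & 0xff) << (2 * p2))
          ^ ((text [i - 1] & 0xff) << p2)
          ^  (text [i]     & 0xff)) & (shifts.length - 1);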

Conclusions

Things to do

I want to run similar tests in C++. This presents multiple challenges. On the one hand, C++ also has a pattern-matching library. On the other, it allows easy access to the CPU instructions, which provide both fast byte search (SCAS) and wide comparisons (64 bits in normal mode and up to 512 in the SSE/AVX case).

Comments are welcome below or at reddit.
