Some time ago I was facing an interesting issue, which was filtering data based on a part of one of the attributes. It was a classic case of allowing users to get data by entering something in the search box. I figured out that there are more than one ways of completing the task, so I decided to research each of them a bit to have a better understanding of how good those options are.

The problem

Imagine that we have some entity described with a bunch of tags, that allow us to organize and view them, eg. in some API. Let's say that there are blog posts that can be described with some categories, and we want to be able to fetch all that are describing the same topic. Sounds like a pretty simple task, but when your dataset grows to a certain size, you must think about doing it as optimal as possible. For the sake of this example, we let our dataset to grow to huge proportions: we'll have 1 million posts and 50 tags, with each post being described with anything between one and ten tags. The table in its simplest form looks as follows:

> SELECT * FROM posts;
post_id | tags
---------------------------------------------------
post-1  | carly eda brittanie
post-2  | apolonia caitlyn brittanie
post-3  | caitlyn
post-4  | andree diana
post-5  | chante brittanie
...

Note: the list of tags was created using this generator.

The first and probably the simplest solution we can think of, given basic SQL knowledge, is looking for matching rows using SQL's LIKE keyword, but since this is a template search, we can assume that it might be really slow:

> SELECT * FROM posts WHERE tags LIKE '%carly%';
...
110224 rows in set (0.70 sec)

While this is not dramatic, I'm interested how does this result compare with other possible solutions.

Searching for words

One approach we might take is using a full-text search that is built into MySQL. It requires putting a FULLTEXT index on the column that consists the data we want to search through:

CREATE TABLE posts (
    post_id VARCHAR(10) NOT NULL PRIMARY KEY,
    tags VARCHAR(255) NOT NULL,
    FULLTEXT(tags)
);

Also, our way of making the search itself changes a bit, because we no longer are looking for records like our template, but those that MATCH AGAINST it. It also comes in two modes: NATURAL LANGUAGE (default) and BOOLEAN. The first one is case insensitive, and it automatically sorts the results in the order of how accurate the results are (how good they satisfy the template):

> SELECT * FROM posts WHERE MATCH (tags) AGAINST ('carly');
...
110224 rows in set (3.70 sec)

SELECT COUNT(*) FROM posts WHERE tags LIKE '%carly%';
...
1 row in set (1.12 sec)

Well, this is much worse than I thought... Maybe BOOLEAN mode helps, as it does not count the score, but only checks if the word(s) is present or not?

> SELECT * FROM posts WHERE MATCH (tags) AGAINST ('+carly' IN BOOLEAN MODE);
110224 rows in set (4.10 sec)

So disappointing... Does this look better for counting rows?

> SELECT COUNT(*) FROM posts WHERE MATCH (tags) AGAINST ('carly');
...
1 row in set (0.13 sec)

> SELECT COUNT(*) FROM posts WHERE MATCH (tags) AGAINST ('carly' IN BOOLEAN MODE);
...
1 row in set (0.14 sec)

It actually does! Maybe full-text searches are good only for counting rows?

Using an additional table

While searching for data using full-text search didn't give me the results I was hoping for, I decided to take another route and split the data into two tables. That would require having a JOIN between tables but would allow us to use more efficient indexes on each of the tables. This is also more realistic if the requirement for searching posts by tags was not defined earlier, and we need to add that to an existing system where posts and tags are stored separately. Also, if the tags are a part of a post data item, but adding an index while having the system running might take a long time and have some negative impact on the database, introducing another table might be a good solution:

> SELECT * FROM posts;
post_id | tags
---------------------------------------------------
post-1  | carly eda brittanie
post-2  | apolonia caitlyn brittanie
post-3  | caitlyn

> SELECT * FROM tags:
tag_id
-----------
apolonia
brittanie
caitlyn
carly
eda

> SELECT * FROM post_tag:
post_id | tag_id
---------------------------------------------------
post-1  | carly
post-1  | eda
post-1  | brittanie
post-2  | apolonia
post-2  | caitlyn
post-2  | brittanie
post-3  | caitlyn

You can already see, that the data will take much more space than in the previous approach, but fortunately, disk storage is rarely a problem in the modern software development - we are more concerned by the UX and the speed of our systems. So, how efficient is this approach?

> SELECT post_id FROM posts p JOIN post_tag pt USING (post_id) JOIN tags t USING (tag_id) WHERE t.tag_id = 'carly';
...
110224 rows in set (6.11 sec)

> SELECT COUNT(post_id) FROM posts p JOIN post_tag pt USING (post_id) JOIN tags t USING (tag_id) WHERE t.tag_id = 'carly';
...
1 row in set (4.75 sec)

Wow. Another approach and it's still worse than the original LIKE which we thought would be so bad. Turns out that the assumptions you have might not be so accurate, so it's always better to try them in action yourself.

Summary

As it turns out, the simplest approach is actually the best. We started by running queries using LIKE, then we tried our luck with FULLTEXT index which turned out to be more efficient only when counting the rows, not when returning the actual data. Finally, we switched our approach and used an additional table to create a classical many-to-many relationship, but the results here were by far the worst.

In one of the following posts, we will try to get to the bottom of each query to see if there is something we may improve to see better results.

Links