Litecene: Full-Text Search for Google BigQuery

I just released the first beta version of litecene, a Java library that implements a common boolean search syntax for full-text search, with its first transpiler for BigQuery.

Litecene makes Searching Text in BigQuery Easy!

Litecene Syntax

Litecene uses a simple, user-friendly syntax that is derived from Lucene query syntax that should be familiar to most users. It includes clauses for term with wildcard, phrase with proximity, AND, OR, NOT, and grouping.

As an example, this might be a good Litecene query to identify social media posts talking about common ways people user their smartphones:

(smartphone OR "smart phone" OR iphone OR "apple phone" OR android OR "google phone" OR "windows phone" OR "phone app"~8) AND (call* OR dial* OR app OR surf* OR brows* OR camera* OR pic* OR selfie)

The syntax is documented in more detail here.

Converting Litecene to BigQuery SQL

A developer could take a Litecene query like this:

iphone OR "i phone" OR android OR smartphone OR "smart phone" OR "google phone"~4 OR "apple phone"~4 OR "windows phone"~4

and quickly translate it into SQL to search a field t.text like this:

String predicate = new BigQuerySearchCompiler("t.text")
  .compile(BigQuerySearching.recommendedParseQuery(
      "iphone OR \"i phone\" OR android OR smartphone OR \"smart phone\" OR \"google phone\"~4 OR \"apple phone\"~4 OR \"windows phone\"~4"));

This produces the following SQL predicate, for the curious:

((REGEXP_CONTAINS(t.text, r"\b\Qiphone\E\b")) OR (REGEXP_CONTAINS(t.text, r"\b\Qi\E\b \b\Qphone\E\b")) OR (REGEXP_CONTAINS(t.text, r"\b\Qandroid\E\b")) OR (REGEXP_CONTAINS(t.text, r"\b\Qsmartphone\E\b")) OR (REGEXP_CONTAINS(t.text, r"\b\Qsmart\E\b \b\Qphone\E\b")) OR (EXISTS (SELECT 1 FROM (SELECT i, t FROM UNNEST(REGEXP_EXTRACT_ALL(t.text, r"[a-z0-9]+")) AS t WITH OFFSET i WHERE REGEXP_CONTAINS(t, r"\b\Qgoogle\E\b")) AS _q0 CROSS JOIN (SELECT i, t FROM UNNEST(REGEXP_EXTRACT_ALL(t.text, r"[a-z0-9]+")) AS t WITH OFFSET i WHERE REGEXP_CONTAINS(t, r"\b\Qphone\E\b")) AS _q1 WHERE GREATEST(_q0.i, _q1.i)-LEAST(_q0.i, _q1.i)+1 <= 4)) OR (EXISTS (SELECT 1 FROM (SELECT i, t FROM UNNEST(REGEXP_EXTRACT_ALL(t.text, r"[a-z0-9]+")) AS t WITH OFFSET i WHERE REGEXP_CONTAINS(t, r"\b\Qapple\E\b")) AS _q0 CROSS JOIN (SELECT i, t FROM UNNEST(REGEXP_EXTRACT_ALL(t.text, r"[a-z0-9]+")) AS t WITH OFFSET i WHERE REGEXP_CONTAINS(t, r"\b\Qphone\E\b")) AS _q1 WHERE GREATEST(_q0.i, _q1.i)-LEAST(_q0.i, _q1.i)+1 <= 4)) OR (EXISTS (SELECT 1 FROM (SELECT i, t FROM UNNEST(REGEXP_EXTRACT_ALL(t.text, r"[a-z0-9]+")) AS t WITH OFFSET i WHERE REGEXP_CONTAINS(t, r"\b\Qwindows\E\b")) AS _q0 CROSS JOIN (SELECT i, t FROM UNNEST(REGEXP_EXTRACT_ALL(t.text, r"[a-z0-9]+")) AS t WITH OFFSET i WHERE REGEXP_CONTAINS(t, r"\b\Qphone\E\b")) AS _q1 WHERE GREATEST(_q0.i, _q1.i)-LEAST(_q0.i, _q1.i)+1 <= 4)))

The user can then append this predicate into a SQL WHERE clause anywhere they like.

BigQuery Preparation

Litecene does require text to be analyzed before it can be searched. Specifically, when searching a field called “example,” the default configuration for BigQuery requires that field to be analyzed with the following SQL expression:

LOWER(TRIM(REGEXP_REPLACE(REGEXP_REPLACE(NORMALIZE(example, NFKD), r"\p{M}", ''), r"[^a-zA-Z0-9]+", ' ')))

This allows Litecene to search latin-based languages (e.g., English, French, Spanish, German, Italian, Portuguese, etc.) with high performance. Fortunately, this can be done quickly without any changes to application code using a materialized view, now that BigQuery supports non-aggregate materialized views.

This is documented in more detail in the Litecene BigQuery README.

BigQuery Performance

Whenever possible, litecene makes use of BigQuery’s new SEARCH function and search indexes to optimize performance, but search indexing is not required.

Regardless, BigQuery is a beast, so performance is good. As an (admittedly unscientific) test, I was just able to run the above query against the 1.35TB bigquery-public-data.wikipedia.wikidata table’s 2.35GB en_description field in an average of 2sec across 3 runs. (It turns out that there are 1,200 wikidata records whose English description matches the above query.)

Future Direction

Litecene’s primary goal is to define a portable, user-friendly search syntax that developers can use in their applications regardless of the data store their app uses. Therefore, the most important next steps are:

Improve user friendliness of the query language. For example, clear error messages and warnings would improve UX dramatically.
Add more data store transpilers. Adding support for PostgreSQL and ElasticSearch/OpenSearch are both high priorities.

Anyone interested can track the roadmap here. Please feel free to open and/or comment on issues if you have feedback! And contributions are always welcome.