How to do online prediction with BigQuery ML

Lak Lakshmanan
Towards Data Science
Aug 13, 2018


BigQuery ML is a way of doing machine learning directly in the petabyte-scale, interactive data warehouse on Google Cloud. You can train machine learning models on millions of rows in a few minutes without having to move data around.

Note: BigQuery ML now supports exporting the trained model as a TensorFlow SavedModel. So you can simply export the model and deploy it to Cloud AI Platform Prediction. That said, this article is still useful as a reminder of how to view the weights and to reiterate the point that BigQuery ML is an open system.

Having trained your model, though, you need to predict with it. Out of the box, BigQuery supports batch prediction, which is suitable for reporting and dashboard applications. However, BigQuery queries usually have a latency of 1–2 seconds, so the batch prediction capability cannot be used for online prediction (such as from a web or mobile application).

Sometimes your predictions need to return immediately

In this article, I’ll show you how to pull out the necessary weights and scaling parameters from the training output tables and compute the prediction yourself. This code could be wrapped up in a web application framework or wherever you want the prediction code to live.

The full code for this article is on GitHub.

Create a model

Let’s start by creating a simple prediction model to predict arrival delays of aircraft (for more details, see this article). I’ll use this model to illustrate the process.

CREATE OR REPLACE MODEL flights.arrdelay
OPTIONS
  (model_type='linear_reg', input_label_cols=['arr_delay']) AS
SELECT
  arr_delay,
  carrier,
  origin,
  dest,
  dep_delay,
  taxi_out,
  distance
FROM
  `cloud-training-demos.flights.tzcorr`
WHERE
  arr_delay IS NOT NULL

This took me about 6 minutes, trained on 6 million rows and 267 MB of data, and cost about $1.25. (BigQuery's free tier might cover this for you; to reduce the cost, use a smaller table.)

Batch prediction with the model

Once you have a trained model, batch prediction can be done within BigQuery itself. For example, to find the predicted arrival delays for a flight from DFW to LAX for a range of departure delays, you could run this query:

SELECT * FROM ml.PREDICT(MODEL flights.arrdelay, (
  SELECT
    'AA' as carrier,
    'DFW' as origin,
    'LAX' as dest,
    dep_delay,
    18 as taxi_out,
    1235 as distance
  FROM
    UNNEST(GENERATE_ARRAY(-3, 10)) as dep_delay
))

In the query above, I am hardcoding the input values for carrier, origin, etc. and using the GENERATE_ARRAY function to generate departure delays in the range -3 minutes to 10 minutes. This yields a table with predicted arrival delays for each departure delay:

Predicted arrival delay if a flight from Dallas to Los Angeles departs 3 minutes early (dep_delay=-3) to 10 minutes late (dep_delay=10)

Batch prediction is inexpensive. The above query processed 16 KB and cost 0.000008 cents.

While this prediction mechanism works for offline predictions, you cannot realistically use it for online prediction. If the predictions are to be displayed in response to user interactions in a website or mobile application, you cannot afford the 1–2 second latency of each BigQuery call. You typically want latencies on the order of a few milliseconds, so you need a faster inference solution.

Weights and scaling

Fortunately, BigQuery exposes all the necessary information to compute the prediction value yourself. You can embed this code directly into your application. I’m illustrating this in Python, but you can do this in pretty much any language you want.

You need to pull three pieces of information:

  • The weights for each of your numeric columns, which you can get using this query:

SELECT
  processed_input AS input,
  model.weight AS input_weight
FROM
  ml.WEIGHTS(MODEL flights.arrdelay) AS model

  • The scaling for each of your numeric columns, which you can get using this query:

SELECT
  input, min, max, mean, stddev
FROM
  ml.FEATURE_INFO(MODEL flights.arrdelay) AS model

  • The vocabulary and weights for each of your categorical columns, which you can get using this query (if you are not familiar with UNNEST, see this article):

SELECT
  processed_input AS input,
  model.weight AS input_weight,
  category.category AS category_name,
  category.weight AS category_weight
FROM
  ml.WEIGHTS(MODEL flights.arrdelay) AS model,
  UNNEST(category_weights) AS category

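To read the three query results into Python, you can run each query with the BigQuery client library and convert the result to a DataFrame. Here is a minimal sketch, assuming the google-cloud-bigquery package is installed and default application credentials are available:

from google.cloud import bigquery

client = bigquery.Client()

def query_to_df(sql):
    # run a query and materialize the result as a pandas DataFrame
    return client.query(sql).to_dataframe()

numeric_weights = query_to_df("""
  SELECT processed_input AS input, model.weight AS input_weight
  FROM ml.WEIGHTS(MODEL flights.arrdelay) AS model""")

scaling_df = query_to_df("""
  SELECT input, min, max, mean, stddev
  FROM ml.FEATURE_INFO(MODEL flights.arrdelay) AS model""")

categorical_weights = query_to_df("""
  SELECT processed_input AS input, model.weight AS input_weight,
         category.category AS category_name, category.weight AS category_weight
  FROM ml.WEIGHTS(MODEL flights.arrdelay) AS model,
       UNNEST(category_weights) AS category""")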
Assuming that you have read the results of all three of these queries into three separate pandas DataFrames, here is a function that will calculate the prediction. The version below is a sketch written to match the walkthrough and the printed contributions that follow; the exact code is in the GitHub repo:
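
import pandas as pd

# A sketch of the prediction function; the exact version is in the GitHub repo.
def compute_prediction(rowdict, numeric_weights, scaling_df, categorical_weights):
    pred = 0.0
    # numeric columns: scale the input value, then multiply by the column weight
    for column_name in numeric_weights['input'].unique():
        wt = numeric_weights.loc[
            numeric_weights['input'] == column_name, 'input_weight'].values[0]
        if pd.isnull(wt):
            continue  # categorical inputs have a NULL top-level weight; handled below
        if column_name != '__INTERCEPT__':
            # standard scaling, using the mean and stddev from ml.FEATURE_INFO
            meanv = scaling_df.loc[scaling_df['input'] == column_name, 'mean'].values[0]
            stddev = scaling_df.loc[scaling_df['input'] == column_name, 'stddev'].values[0]
            scaled_value = (rowdict[column_name] - meanv) / stddev
        else:
            scaled_value = 1.0  # the intercept is simply multiplied by 1
        contrib = wt * scaled_value
        print('col={} wt={} scaled_value={} contrib={}'.format(
            column_name, wt, scaled_value, contrib))
        pred = pred + contrib
    # categorical columns: look up the weight of the category the input takes
    for column_name in categorical_weights['input'].unique():
        cat = categorical_weights[categorical_weights['input'] == column_name]
        wt = cat.loc[cat['category_name'] == rowdict[column_name],
                     'category_weight'].values[0]
        print('col={} wt={} value={} contrib={}'.format(
            column_name, wt, rowdict[column_name], wt))
        pred = pred + wt
    return pred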

What I am doing is this: I walk through each of the numeric columns and find the weight associated with that column. Then, I pull out the column's mean and standard deviation and use them to scale the input value. The product of the weight and the scaled value is the contribution associated with this column. Then, I walk through the categorical columns. For each categorical column, there is a separate weight associated with each value that the column can take, so I find the weight associated with the input value; this becomes the contribution. The sum of all the contributions is the prediction.

The above code assumes that you trained a regression model. If you trained a classification model, you need to apply a logistic function to the prediction to get the probability (to avoid overflow, treat the probability for pred < -500 as zero):

prob = (1.0/(1 + np.exp(-pred)) if (-500 < pred) else 0)

Here is an example of predicting the arrival delay of a specific flight:

rowdict = {
  'carrier': 'AA',
  'origin': 'DFW',
  'dest': 'LAX',
  'dep_delay': -3,
  'taxi_out': 18,
  'distance': 1235
}
predicted_arrival_delay = compute_prediction(
    rowdict, numeric_weights, scaling_df, categorical_weights)

This yields the following columns and their contributions:

col=dep_delay wt=36.5569545237 scaled_value=-0.329782492822 contrib=-12.0558435928
col=taxi_out wt=8.15557957221 scaled_value=0.213461991601 contrib=1.74090625815
col=distance wt=-1.88324519311 scaled_value=0.672196648431 contrib=-1.26591110699
col=__INTERCEPT__ wt=1.09017737502 scaled_value=1.0 contrib=1.09017737502
col=carrier wt=-0.0548843604154 value=AA contrib=-0.0548843604154
col=origin wt=0.966535564037 value=DFW contrib=0.966535564037
col=dest wt=1.26816262538 value=LAX contrib=1.26816262538

The total of the contributions is -8.31 minutes, which matches the batch prediction value, confirming that the code is correct.
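
Since the prediction is now plain Python, you can serve it from whatever framework you like. For instance, a minimal Flask endpoint might look something like this (a sketch; Flask, the route, and the response shape are illustrative choices, not part of the original code):

from flask import Flask, jsonify, request

app = Flask(__name__)

# numeric_weights, scaling_df, and categorical_weights are loaded once at
# startup using the queries shown earlier; compute_prediction is defined above.

@app.route('/predict', methods=['POST'])
def predict():
    rowdict = request.get_json()  # a JSON body such as the rowdict above
    pred = compute_prediction(
        rowdict, numeric_weights, scaling_df, categorical_weights)
    return jsonify({'predicted_arrival_delay': pred})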

The full code for this article is on GitHub. Enjoy!
