
Choosing your search relevance evaluation metric

Ensuring results are relevant is tricky but critical to a good search experience. Choosing an evaluation metric to summarize the performance of the search engine can be equally challenging, because each metric tells you something different about how search is satisfying your users.

This post helps you appreciate exactly what each metric attempts to measure and when it can help (or hurt) your understanding of search performance. There is no single universal “best” metric, but depending on your situation there are good reasons to be choosy.

Relevance judgments 101

Judgments are how we quantify the quality of search results. Judgments can come from human experts, user interactions, or even crowd-sourcing, but regardless of the source, they measure how relevant a result item is to a given search query. Imagine we are searching our favorite video streaming service for “Rocky”. If the movie Rocky Horror Picture Show is returned, how relevant is that to our original search? You can see how getting good judgments can become tricky, but it is possible to get good judgments from any source. Discussing how we can ensure our judgments are good will have to wait for a future post.

Grading search relevance

The first step in picking a metric is deciding on the relevance grading scale you will use. There are two major types of scale: binary (relevant/not-relevant) and graded (degrees of relevance).

Binary scales are simpler and have been around longer. They assume all relevant documents are equally useful to the searcher. This is similar to other binary classification tasks and is often measured with a precision metric.

Graded scales provide a more nuanced view of relevance, where documents can have differing degrees of utility to the searcher. This creates additional considerations when choosing a metric; the most popular graded metrics are based on cumulative gain.

Evaluation is always done “at a position”. So if we are measuring precision at position 5, we only consider the documents at positions 1 through 5. Since this cut-off is ubiquitous across evaluation metrics, there is a shorthand notation: precision@5.

Metrics table

Scale | Metric | Measures | Drawbacks
Binary | Precision (P) | The relevance of the entire results set (e.g. a gridded results display) | Doesn’t account for position
Binary | Average Precision (AP) | Relevance to a user scanning results sequentially | Early (top-ranked) results have a very large impact
Graded | Cumulative Gain (CG) | Information gain from a results set | Same as precision: doesn’t factor in position
Graded | Discounted Cumulative Gain (DCG) | Information gain with positional weighting | Difficult to compare across queries
Graded | Normalized DCG (nDCG) | How close the results are to the best possible | No longer shows information gain

Binary metrics

Two of the most common metrics used with binary grading scales are: Precision and Average Precision.

Precision

Precision is calculated as the number of relevant documents divided by the total number of documents considered (those at or above the positional cut-off).

docs_returned <- c(1, 1, 0, 0, 1)
precision_at_5 <- sum(docs_returned) / length(docs_returned)
precision_at_5
## [1] 0.6

Remember, every metric only considers the documents at or above the positional cut-off. So if 5 additional documents were returned but the top 5 did not change, precision@5 would stay the same.

more_docs_returned <- c(1, 1, 0, 0, 1, 0, 0, 1, 1, 1)
docs_considered <- more_docs_returned[1:5]
precision_at_5 <- sum(docs_considered) / length(docs_considered)
precision_at_5
## [1] 0.6

The biggest downside of precision is that it doesn’t consider how the documents are ordered. Below, the two rankings are scored as equally relevant, despite the fact that the second ranking has its irrelevant documents in the top positions.

docs1 <- c(1, 1, 1, 0, 0)
docs2 <- c(0, 0, 1, 1, 1)
precision <- function(docs) sum(docs) / length(docs)
c("docs1" = precision(docs1),
  "docs2" = precision(docs2))
## docs1 docs2
##   0.6   0.6

This is counterintuitive because we know that users scan results, usually top-to-bottom and left-to-right. So if this were the classic 10 blue links layout, we would prefer docs1 because it puts the relevant results first. This might not be a big deal if you are working with a large grid of results (e.g. shopping for apparel), but generally it’s good for the metric to consider ranking.

Note: Going forward, all metrics will be @5 variants even though I won’t be explicit in naming them.

Average precision

Average precision addresses this shortcoming of precision and can discriminate between docs1 and docs2. Instead of computing precision once over the entire results set (documents 1-5), precision is computed at the position of each relevant document and the resulting values are averaged.

avg_precision <- function(docs) {
  vals_to_avg <- vector("numeric", length(docs))
  for (i in seq_along(docs)) {
    if (docs[i] == 1) {
      vals_to_avg[i] <- precision(docs[1:i])
    }
  }
  mean(vals_to_avg[vals_to_avg != 0]) # only computed at positions with relevant documents
}
c("docs1" = avg_precision(docs1),
  "docs2" = avg_precision(docs2))
##     docs1     docs2
## 1.0000000 0.4777778

The idea is that if a lot of irrelevant docs are returned early in the results (e.g. in the 1st, 2nd, and 3rd positions), those results are penalized. In other words, it rewards search engines that retrieve the relevant documents first. Average precision works well when results are presented in a list and putting the best results first is critical. Most users who find multiple not-relevant docs at the top of their search results will reformulate their query or abandon the session, believing that relevant results don’t exist.

This high impact of early results can become a problem if you are presenting a lot of results in a grid and expect users to peruse them. In that situation, a few early results with poor relevance grades probably won’t be a deal breaker for users if the bulk of the results set has high relevance grades. If you find yourself in this situation, precision is probably a better model for evaluating relevance.

Graded metrics

Because the world is not black and white, more and more relevance grading is done on a scale with finer gradations. This helps capture documents that are only partially relevant. A common grading scale, and the one used by the TREC News track, is 0-4:

0 – No useful information

1 – Some useful information

2 – Significant useful information

3 – Essential useful information

4 – Critical useful information

This does make the grading process more difficult, because getting agreement among graders is harder as the judgments become more subjective. But the benefit is a more realistic picture of relevance, which may justify the extra work.

Any graded scale can be converted to a binary scale by setting a cutoff value. So for the 0-4 scale above we could call the 3s and 4s relevant and everything else not-relevant.
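
As a minimal sketch (the vector of grades and the cutoff of 3 below are purely illustrative), the conversion is a one-liner:

graded_judgments <- c(4, 1, 3, 0, 2)
# grades at or above the cutoff become relevant (1), everything else not-relevant (0)
to_binary <- function(grades, cutoff = 3) as.integer(grades >= cutoff)
to_binary(graded_judgments)
## [1] 1 0 1 0 0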

Cumulative gain

One of the first metrics to leverage graded relevance was Cumulative Gain (CG). CG tries to capture the total information gained from a set of search results and is calculated as the sum of the relevance grades.

docsG1 <- c(4, 3, 2, 1, 0)
docsG2 <- rev(docsG1)
cumulative_gain <- function(docs) sum(docs)
c("docs1" = cumulative_gain(docsG1),
  "docs2" = cumulative_gain(docsG2))
## docs1 docs2
##    10    10

The main drawback of cumulative gain is the same as precision’s: it doesn’t care about ranking. Depending on the search result display this might not be a major factor, but generally you want to consider the ranking of results.

Discounted cumulative gain

Discounted Cumulative Gain (DCG) introduces a discount factor to CG as a way to weight documents by their rank. The idea is that we want high weights for early, highly ranked documents, because searchers are likely to inspect them, and low weights for documents far down the ranking that searchers are unlikely to ever see.

Discount examples

Rank | Grade | Discount 1 [1/rank] | Discount 2 [1/log2(rank + 1)] | Discount 1 × Grade | Discount 2 × Grade
1 | 4 | 1.000 | 1.000 | 4.000 | 4.000
2 | 3 | 0.500 | 0.631 | 1.500 | 1.893
3 | 2 | 0.333 | 0.500 | 0.667 | 1.000
4 | 1 | 0.250 | 0.431 | 0.250 | 0.431
5 | 1 | 0.200 | 0.387 | 0.200 | 0.387

The discount factor is commonly chosen as log2(rank + 1) and is used to divide the relevance grade. Using a logarithm for the position penalty makes the decay effect more gradual compared to using the position itself.

discounted_cumulative_gain <- function(docs) {
  scores_to_sum <- vector("numeric", length(docs))
  for (i in seq_along(docs)) {
    scores_to_sum[i] <- docs[i] / log(i + 1, base = 2)
  }
  sum(scores_to_sum)
}
c("docs1" = discounted_cumulative_gain(docsG1),
  "docs2" = discounted_cumulative_gain(docsG2))
##    docs1    docs2
## 7.323466 4.470371

An alternative formulation of DCG exists that emphasizes highly relevant documents by using (2^relevance_grade - 1) instead of the relevance grade directly. If you are using a binary relevance scale, both formulations compute the same value, but with a graded scale they differ.

alt_discounted_cumulative_gain <- function(docs) {
  scores_to_sum <- vector("numeric", length(docs))
  for (i in seq_along(docs)) {
    scores_to_sum[i] <- (2^docs[i] - 1) / log(i + 1, base = 2)
  }
  sum(scores_to_sum)
}
c("docs1" = alt_discounted_cumulative_gain(docsG1),
  "docs2" = alt_discounted_cumulative_gain(docsG2))
##    docs1    docs2
## 21.34718 10.94846

One potential drawback of DCG is that it conflates retrieval performance with the quality of the data available in your index. Imagine we have query1 and query2. Query1 has lots of relevant documents in the index while query2 does not.

query1 <- c(4, 4, 3, 3, 3)
query2 <- c(2, 1, 1, 1, 0)
c("docs1" = alt_discounted_cumulative_gain(query1),
  "docs2" = alt_discounted_cumulative_gain(query2))
##     docs1     docs2
## 33.686652  4.561606

If the search engine is performing as well as it possibly could for query2 (i.e. no higher-graded documents exist in the index that could have been returned), then we want to know that, instead of seeing query2 as a place where we could improve.

Normalized discounted cumulative gain

The goal of normalized discounted cumulative gain (nDCG) is to capture how well the search engine is doing relative to the best possible performance given its index. By dividing DCG by the ideal DCG (computed from the top N documents in the entire index, not just the results page), the value is scaled to between 0 and 1.

normalized_discounted_cumulative_gain <- function(docs) {
  real <- discounted_cumulative_gain(docs[1:5])
  ideal <- discounted_cumulative_gain(sort(docs, decreasing = TRUE)[1:5])
  real / ideal
}
c("docs1" = normalized_discounted_cumulative_gain(query1),
  "docs2" = normalized_discounted_cumulative_gain(query2))
## docs1 docs2
##     1     1

A value of 1 indicates the best possible performance, meaning no further gains can be squeezed out, which is the case for our perfectly ordered results in query1 and query2.

But anything less than 1 tells us there is room for improvement (e.g. the ordering of the results is not ideal).

query3 <- c(3, 2, 1, 4, 0)
normalized_discounted_cumulative_gain(query3)
## [1] 0.8854504

And the further the ordering is from ideal, the lower the value produced by nDCG.

query4 <- c(0, 1, 2, 3, 4)
normalized_discounted_cumulative_gain(query4)
## [1] 0.6104174

To highlight the fact that the ideal score is calculated from the entire index, imagine that the entire index contains 10 documents, while we still compute nDCG@5.

docs_all <- c(4, 3, 2, 1, 1, 0, 3, 4, 0, 0)
normalized_discounted_cumulative_gain(docs_all)
## [1] 0.7641958

We can see there is still room to improve search performance by pulling the documents at positions 7 and 8 up into the first five.

Flavors of ideal

While the formal definition of nDCG defines “ideal” as the best documents available in the entire index, there are other variants that deserve mentioning; a short sketch comparing them follows the table.

Flavor of ideal | How is it calculated? | What does it tell you?
Global (formal) | The best N documents from the entire index | How well the search engine is performing
Maximum | The maximum value of your grading scale (e.g. 4) repeated N times | How good the search results are (a scaled version of DCG)
Local | The best possible ordering of the top N search results | Nothing useful
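
To make the difference concrete, here is a rough sketch that reuses discounted_cumulative_gain from above; the grades for the results page and the index are invented for illustration:

results_page <- c(3, 2, 1, 0, 0)           # grades of the 5 results that were returned
index_grades <- c(3, 2, 1, 0, 0, 4, 4, 3)  # grades of every document in the (tiny) index

global_ideal  <- discounted_cumulative_gain(sort(index_grades, decreasing = TRUE)[1:5]) # best 5 docs anywhere in the index
maximum_ideal <- discounted_cumulative_gain(rep(4, 5))                                  # top grade repeated 5 times
local_ideal   <- discounted_cumulative_gain(sort(results_page, decreasing = TRUE))      # re-order only what was returned

c("global"  = discounted_cumulative_gain(results_page) / global_ideal,   # ~0.47: better documents exist in the index
  "maximum" = discounted_cumulative_gain(results_page) / maximum_ideal,  # ~0.40: a rescaled DCG
  "local"   = discounted_cumulative_gain(results_page) / local_ideal)    # 1: the returned docs are already in the best order

Notice how the local flavor reports a perfect score even though better documents exist in the index, which is why it tells you nothing useful.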

The best aspect of nDCG is the ability to compare relative performance across queries. Thanks to the normalization, every query is on a level playing field, a consistent 0-1 scale, so you end up measuring the search engine’s performance rather than the quality of the index.

I see a lot of confusion around the fact that nDCG reports values that are normalized to the best possible. Imagine a situation where your best is bad. Let’s say the best documents you have available for a given query only contain some relevant information, so a grade of 1. As long as the engine returns five documents scored 1, then the nDCG@5 is 1 or perfect. Your users likely won’t agree that these results are perfect, because they ultimately care about the quality of results! If you only want to measure retrieval engine performance, nDCG might be right for you, but always be aware of this potential blind spot!
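
A quick sketch of that blind spot, reusing the nDCG function from above (the run of grade-1 documents is hypothetical):

mediocre_results <- c(1, 1, 1, 1, 1)  # every document is only marginally relevant
normalized_discounted_cumulative_gain(mediocre_results)
## [1] 1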

Conclusion

There isn’t a single metric that will give you a perfect summary value for your search’s performance, but a combination of metrics can get you close. How search results are displayed (large-grid vs vertical-list) should influence which metric you choose, because accounting for position/rank makes less sense when results are shown in bulk and meant to be perused.

Also remember that these metrics will be used primarily as you iterate on your search engine, comparing different system configurations. So when you are testing whether a new configuration is really an improvement, you will be looking at the change in a metric for the same query across the two configurations.
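
As a rough sketch of that workflow (the binary judgments and configuration names are invented), you might compare average precision for the same query under two configurations, reusing avg_precision from earlier:

config_a <- c(1, 0, 1, 1, 0)  # judgments for the query's top 5 under the current configuration
config_b <- c(1, 1, 1, 0, 0)  # judgments for the same query under the candidate configuration
c("config_a" = avg_precision(config_a),
  "config_b" = avg_precision(config_b))
# config_b puts the relevant documents first, so its average precision is higher (1 vs ~0.81)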

Acknowledgements

Huge thanks to Edward Ribeiro for editing, bug-squashing, and porting all of the code examples to Python!