Extractive Explanations for Interpretable Text Ranking


Jurek Leonhardt, Koustav Rudra, Avishek Anand

Ranking should be explainable


Neural ranking in a nutshell:

(of course, the following is vastly simplified)

$ \underbrace{\phi(q, d)}_{\textbf{relevance}} = \operatorname{BERT} \underbrace{ (\texttt{[CLS]}\ q\ \texttt{[SEP]}\ d\ \texttt{[SEP]}) }_{\textbf{LLM inputs}} $
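
In code, this scoring step looks roughly as follows — a minimal sketch using the Hugging Face transformers library; the checkpoint name is just one example of a BERT-style cross-encoder, not the model used in this work:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Illustrative checkpoint; any BERT-style cross-encoder fine-tuned for
# relevance scoring works the same way.
name = "cross-encoder/ms-marco-MiniLM-L-6-v2"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

query = "how long to hold bow in yoga"
document = "Hold bow pose for 20 to 30 seconds while breathing evenly."

# The tokenizer builds the [CLS] q [SEP] d [SEP] input from the formula above.
inputs = tokenizer(query, document, return_tensors="pt", truncation=True)
with torch.no_grad():
    relevance = model(**inputs).logits.squeeze()  # phi(q, d)
print(float(relevance))
```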


LLMs (e.g., Transformer models such as BERT) are complex and hard to interpret or explain.

What if the trained model is biased? Racist? Sexist?

Extractive explanations


Example document for query: "how long to hold bow in yoga" 🤔


Why should we even show the ranking model the whole document?

Select-and-Rank


Assumption: $k$ sentences of a document are enough to estimate its relevance w.r.t. a query.


The Select-and-Rank paradigm.

A document $d$ is split into sentences $s_i$.


The selector assigns a score to each sentence $s_i$ w.r.t. the query $q$.


The ranker sees only the $k$ highest-scoring sentences.
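
A minimal sketch of this pipeline at inference time, with hard top-$k$ selection and hypothetical `selector`/`ranker` callables (training instead uses the relaxed sampling shown on the next slide):

```python
from typing import Callable, List

def select_and_rank(
    query: str,
    sentences: List[str],
    k: int,
    selector: Callable[[str, str], float],      # scores one sentence w.r.t. q
    ranker: Callable[[str, List[str]], float],  # scores q vs. selected text
) -> float:
    """Hard top-k variant of the pipeline; callables are stand-ins."""
    scores = [selector(query, s) for s in sentences]
    top_k = sorted(range(len(sentences)), key=scores.__getitem__, reverse=True)[:k]
    selected = [sentences[i] for i in sorted(top_k)]  # keep document order
    return ranker(query, selected)
```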

Select-and-Rank



The selector $\Psi$ computes a weight $w_i$ for each sentence:

$\left(w_1, \dots, w_{|d|} \right) = \Psi(q, d)$


End-to-end training: We draw a relaxed $k$-hot sample:¹

$\left(\hat{w}_1, \dots, \hat{w}_{|d|} \right) = \operatorname{SubsetSample}(w, k, \tau)$


The query-document relevance is computed by the ranker $\Phi$ using the selected sentences $\hat{d}$:

$\phi(q, d) = \Phi \left(q, \hat{d} \right)$

Each token in $\hat{d}$ is multiplied by the weight $\hat{w}_i$ of the sentence it belongs to in order to preserve the gradients.


¹ Sang Michael Xie and Stefano Ermon. Reparameterizable Subset Sampling via Continuous Relaxations. IJCAI 2019.
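
A minimal PyTorch sketch of this relaxed top-$k$ sampler, following the footnoted paper and assuming the selector outputs log-weights $w$:

```python
import torch

def subset_sample(w: torch.Tensor, k: int, tau: float) -> torch.Tensor:
    """Relaxed k-hot sample over sentence log-weights w (shape: [|d|])."""
    # Reparameterization: perturb the log-weights with Gumbel noise.
    gumbel = -torch.log(-torch.log(torch.rand_like(w).clamp_min(1e-10)))
    keys = w + gumbel
    khot = torch.zeros_like(w)
    onehot_approx = torch.zeros_like(w)
    for _ in range(k):
        # Softly mask out sentences that are already (approximately) selected.
        keys = keys + torch.log(torch.clamp(1.0 - onehot_approx, min=1e-20))
        onehot_approx = torch.softmax(keys / tau, dim=-1)
        khot = khot + onehot_approx
    return khot  # entries in [0, 1], summing to k
```

As $\tau \to 0$ the sample approaches a discrete $k$-hot vector; at inference time one can simply take the $k$ highest-scoring sentences.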

Selectors


Two selector variants: S&R-Lin and S&R-Attn.

Ranker


$ \underbrace{\phi(q, d)}_{\textbf{relevance}} = \operatorname{BERT} (\texttt{[CLS]}, \underbrace{q_1, \dots, q_n}_{\textbf{input query}}, \texttt{[SEP]}, \underbrace{\hat{d}_1, \dots, \hat{d}_m}_{\textbf{selected sentences}}, \texttt{[SEP]}) $
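
A sketch (not the authors' exact implementation) of how the ranker can consume the selected sentences with the gradient-preserving token weighting from the previous slide; it assumes tokenizing each sentence separately aligns with tokenizing the joined text, which typically holds for WordPiece:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

def rank(query: str, selected: list[str], w_hat: torch.Tensor) -> torch.Tensor:
    """Score q against the selected sentences, scaling each sentence's
    token embeddings by its relaxed weight so gradients reach the selector."""
    enc = tok(query, " ".join(selected), return_tensors="pt")
    embeds = bert.get_input_embeddings()(enc["input_ids"])  # [1, len, dim]

    # Weight 1 for [CLS], query, and [SEP] tokens; w_hat_i for sentence s_i.
    weights = torch.ones(enc["input_ids"].shape[1])
    pos = len(tok(query)["input_ids"])  # first document token position
    for w_i, sent in zip(w_hat, selected):
        n = len(tok(sent, add_special_tokens=False)["input_ids"])
        weights[pos : pos + n] = w_i
        pos += n

    out = bert(inputs_embeds=embeds * weights.view(1, -1, 1),
               attention_mask=enc["attention_mask"],
               token_type_ids=enc["token_type_ids"])
    return out.last_hidden_state[:, 0]  # [CLS] vector -> relevance head
```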

Select-and-Rank example


Example selections on FEVER. Highlighted sentences contain the answer.

BEIR benchmark


Results of Select-and-Rank models on the BEIR benchmark.

BEIR benchmark


There is a trade-off between the number of sentences $k$ and the effectiveness:


Performance on FEVER.

Performance on SciFact.

Comprehensiveness


Comprehensiveness measures the quality of rationales:

How well does the ranking model perform using the document without the selected sentences?
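
One way to probe this in code, assuming a hypothetical scoring callable `phi`: re-score the document with the rationale removed and compare against the full-document score.

```python
from typing import Callable, List, Set

def complement_score(
    phi: Callable[[str, List[str]], float],  # hypothetical ranking callable
    query: str,
    sentences: List[str],
    selected: Set[int],                      # indices chosen by the selector
) -> float:
    """Score the document with the selected sentences removed; a large
    drop vs. phi(query, sentences) indicates a comprehensive rationale."""
    rest = [s for i, s in enumerate(sentences) if i not in selected]
    return phi(query, rest)
```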


Ranking performance on TREC-DL-Doc’19 using $k = 20$, where $N$ sentences are removed (leaving $k-N$ sentences).

Faithfulness


Faithfulness measures the degree to which the explanations represent the model's reasoning:

How well do the selected sentences represent the document they originate from?


We perform a user study to determine the utility of Select-and-Rank for humans.

The user study interface.

Faithfulness


240 query-document pairs from 30 queries judged by 80 users (4 judgments per instance).


Accuracy of relevance judgments.

Time taken to complete relevance judgments.

Application: Detecting label leakage


Unaltered relevant document

Relevant document with label leakage

Application: Detecting label leakage


Documents where the leakage sentence has been selected.

Distribution of the ranks of the leakage sentence.

Summary


  • We proposed Select-and-Rank, a ranking framework that is interpretable by design.
  • We showed how Select-and-Rank can be used to explain ranking decisions across a large number of tasks.
  • We performed a user study to highlight the utility of our extractive explanations to humans.

What's next?


Efficiency

Select-and-Rank requires a full BERT forward pass for each query-document pair at query time.

Can we change the architecture to allow for pre-computations?

Granularity

Select-and-Rank operates strictly on a sentence level.

Can we make the extractive explanations more fine-grained?