Extractive Explanations for Interpretable Text Ranking


Jurek Leonhardt, Koustav Rudra, Avishek Anand

Ranking should be explainable


Neural ranking in a nutshell:

(of course, the following is vastly simplified)

$ \underbrace{\phi(q, d)}_{\textbf{relevance}} = \operatorname{BERT} \underbrace{ (\texttt{[CLS]}\ q\ \texttt{[SEP]}\ d\ \texttt{[SEP]}) }_{\textbf{LLM inputs}} $
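
In code, this scoring step looks roughly as follows — a minimal sketch using the Hugging Face transformers library; the checkpoint name is just one example of a BERT-style cross-encoder, not the model used in this work:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Illustrative checkpoint; any BERT-style cross-encoder fine-tuned for
# relevance scoring works the same way.
name = "cross-encoder/ms-marco-MiniLM-L-6-v2"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

query = "how long to hold bow in yoga"
document = "Hold bow pose for 20 to 30 seconds while breathing evenly."

# The tokenizer builds the [CLS] q [SEP] d [SEP] input from the formula above.
inputs = tokenizer(query, document, return_tensors="pt", truncation=True)
with torch.no_grad():
    relevance = model(**inputs).logits.squeeze()  # phi(q, d)
print(float(relevance))
```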


LLMs (e.g., Transformer models such as BERT) are complex and hard to interpret or explain.

What if the trained model is biased? Racist? Sexist?

Extractive explanations


Example document for query: "how long to hold bow in yoga" 🤔


Why should we even show the ranking model the whole document?

Select-and-Rank


Assumption: $k$ sentences of a document are enough to estimate its relevance w.r.t. a query.


The Select-and-Rank paradigm.

A document $d$ is split into sentences $s_i$.


The selector assigns a score to each sentence $s_i$ w.r.t. the query $q$.


The ranker sees only the $k$ highest-scoring sentences.
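
A minimal sketch of this pipeline at inference time, with hard top-$k$ selection and hypothetical `selector`/`ranker` callables (training instead uses the relaxed sampling shown on the next slide):

```python
from typing import Callable, List

def select_and_rank(
    query: str,
    sentences: List[str],
    k: int,
    selector: Callable[[str, str], float],      # scores one sentence w.r.t. q
    ranker: Callable[[str, List[str]], float],  # scores q vs. selected text
) -> float:
    """Hard top-k variant of the pipeline; callables are stand-ins."""
    scores = [selector(query, s) for s in sentences]
    top_k = sorted(range(len(sentences)), key=scores.__getitem__, reverse=True)[:k]
    selected = [sentences[i] for i in sorted(top_k)]  # keep document order
    return ranker(query, selected)
```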

Select-and-Rank



The selector $\Psi$ computes a weight $w_i$ for each sentence:

$\left(w_1, \dots, w_{|d|} \right) = \Psi(q, d)$


End-to-end training: We draw a relaxed $k$-hot sample:¹

$\left(\hat{w}_1, \dots, \hat{w}_{|d|} \right) = \operatorname{SubsetSample}(w, k, \tau)$


The query-document relevance is computed by the ranker $\Phi$ using the selected sentences $\hat{d}$:

$\phi(q, d) = \Phi \left(q, \hat{d} \right)$

Each token in $\hat{d}$ is multiplied by the weight $\hat{w}_i$ of the sentence it belongs to in order to preserve the gradients.


¹ Sang Michael Xie and Stefano Ermon. Reparameterizable Subset Sampling via Continuous Relaxations. IJCAI 2019.
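
A minimal PyTorch sketch of this relaxed top-$k$ sampler, following the footnoted paper and assuming the selector outputs log-weights $w$:

```python
import torch

def subset_sample(w: torch.Tensor, k: int, tau: float) -> torch.Tensor:
    """Relaxed k-hot sample over sentence log-weights w (shape: [|d|])."""
    # Reparameterization: perturb the log-weights with Gumbel noise.
    gumbel = -torch.log(-torch.log(torch.rand_like(w).clamp_min(1e-10)))
    keys = w + gumbel
    khot = torch.zeros_like(w)
    onehot_approx = torch.zeros_like(w)
    for _ in range(k):
        # Softly mask out sentences that are already (approximately) selected.
        keys = keys + torch.log(torch.clamp(1.0 - onehot_approx, min=1e-20))
        onehot_approx = torch.softmax(keys / tau, dim=-1)
        khot = khot + onehot_approx
    return khot  # entries in [0, 1], summing to k
```

As $\tau \to 0$ the sample approaches a discrete $k$-hot vector; at inference time one can simply take the $k$ highest-scoring sentences.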

Selectors


Two selector variants: S&R-Lin and S&R-Attn.

Ranker


$ \underbrace{\phi(q, d)}_{\textbf{relevance}} = \operatorname{BERT} (\texttt{[CLS]}, \underbrace{q_1, \dots, q_n}_{\textbf{input query}}, \texttt{[SEP]}, \underbrace{\hat{d}_1, \dots, \hat{d}_m}_{\textbf{selected sentences}}, \texttt{[SEP]}) $
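
A sketch (not the authors' exact implementation) of how the ranker can consume the selected sentences with the gradient-preserving token weighting from the previous slide; it assumes tokenizing each sentence separately aligns with tokenizing the joined text, which typically holds for WordPiece:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

def rank(query: str, selected: list[str], w_hat: torch.Tensor) -> torch.Tensor:
    """Score q against the selected sentences, scaling each sentence's
    token embeddings by its relaxed weight so gradients reach the selector."""
    enc = tok(query, " ".join(selected), return_tensors="pt")
    embeds = bert.get_input_embeddings()(enc["input_ids"])  # [1, len, dim]

    # Weight 1 for [CLS], query, and [SEP] tokens; w_hat_i for sentence s_i.
    weights = torch.ones(enc["input_ids"].shape[1])
    pos = len(tok(query)["input_ids"])  # first document token position
    for w_i, sent in zip(w_hat, selected):
        n = len(tok(sent, add_special_tokens=False)["input_ids"])
        weights[pos : pos + n] = w_i
        pos += n

    out = bert(inputs_embeds=embeds * weights.view(1, -1, 1),
               attention_mask=enc["attention_mask"],
               token_type_ids=enc["token_type_ids"])
    return out.last_hidden_state[:, 0]  # [CLS] vector -> relevance head
```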

Select-and-Rank example


Example selections on FEVER. Highlighted sentences contain the answer.

BEIR benchmark


Results of Select-and-Rank models on the BEIR benchmark.

BEIR benchmark


There is a trade-off between the number of sentences $k$ and the effectiveness:


Performance on FEVER.

Performance on SciFact.

Comprehensiveness


Comprehensiveness measures the quality of rationales:

How well does the ranking model perform using the document without the selected sentences?
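
One way to probe this in code, assuming a hypothetical scoring callable `phi`: re-score the document with the rationale removed and compare against the full-document score.

```python
from typing import Callable, List, Set

def complement_score(
    phi: Callable[[str, List[str]], float],  # hypothetical ranking callable
    query: str,
    sentences: List[str],
    selected: Set[int],                      # indices chosen by the selector
) -> float:
    """Score the document with the selected sentences removed; a large
    drop vs. phi(query, sentences) indicates a comprehensive rationale."""
    rest = [s for i, s in enumerate(sentences) if i not in selected]
    return phi(query, rest)
```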


Ranking performance on TREC-DL-Doc’19 using $k = 20$, where $N$ sentences are removed (leaving $k-N$ sentences).

Faithfulness


Faithfulness measures the degree to which the explanations represent the model's reasoning:

How well do the selected sentences represent the document they originate from?


We perform a user study to determine the utility of Select-and-Rank for humans.

The user study interface.

Faithfulness


240 query-document pairs from 30 queries judged by 80 users (4 judgments per instance).


Accuracy of relevance judgments.

Time taken to complete relevance judgments.

Application: Detecting label leakage


Unaltered relevant document

Relevant document with label leakage

Application: Detecting label leakage


Documents where the leakage sentence has been selected.

Distribution of the ranks of the leakage sentence.

Summary


  • We proposed Select-and-Rank, a ranking framework that is interpretable by design.
  • We showed how Select-and-Rank can be used to explain ranking decisions across a large number of tasks.
  • We performed a user study to highlight the utility of our extractive explanations to humans.

What's next?


Efficiency

Select-and-Rank requires a full BERT forward pass for each query-document pair at query time.

Can we change the architecture to allow for pre-computations?

Granularity

Select-and-Rank operates strictly on a sentence level.

Can we make the extractive explanations more fine-grained?