Jurek Leonhardt, Koustav Rudra, Avishek Anand
Neural ranking in a nutshell:
(of course, the following is vastly simplified)
$ \underbrace{\phi(q, d)}_{\textbf{relevance}} = \operatorname{BERT} \underbrace{ (\texttt{[CLS]}\ q\ \texttt{[SEP]}\ d\ \texttt{[SEP]}) }_{\textbf{LLM inputs}} $
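A minimal sketch of this cross-encoder setup with Hugging Face Transformers; the checkpoint name is only an example, and any BERT-style relevance classifier works the same way.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Example checkpoint (assumption): any BERT-style cross-encoder can be substituted.
MODEL_NAME = "cross-encoder/ms-marco-MiniLM-L-6-v2"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()

query = "how long to hold bow in yoga"
document = "Hold bow pose for up to 30 seconds, then release slowly."

# The tokenizer builds "[CLS] query [SEP] document [SEP]" for us.
inputs = tokenizer(query, document, return_tensors="pt", truncation=True)
with torch.no_grad():
    relevance = model(**inputs).logits.squeeze(-1)  # phi(q, d)
print(float(relevance))
```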
LLMs (e.g., Transformer-based models such as BERT) are complex and hard to interpret or explain.
What if the trained model is biased? Racist? Sexist?
Example document for query: "how long to hold bow in yoga" 🤔
Why should we even show the ranking model the whole document?
Assumption: $k$ sentences of a document are enough to estimate its relevance w.r.t. a query.
The Select-and-Rank paradigm.
A document $d$ is split into sentences $s_i$.
The selector assigns a score to each sentence $s_i$ w.r.t. the query $q$.
The ranker sees only the $k$ highest scoring sentences.
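A minimal sketch of this pipeline at inference time; the naive sentence splitter and the `selector` and `ranker` callables below are placeholders, not the paper's exact components.

```python
from typing import Callable, List

def select_and_rank(
    query: str,
    document: str,
    selector: Callable[[str, List[str]], List[float]],  # Psi: one score per sentence
    ranker: Callable[[str, List[str]], float],           # Phi: relevance from selected sentences
    k: int = 20,
) -> float:
    # 1. Split the document into sentences s_1, ..., s_|d| (naive split for illustration).
    sentences = [s.strip() for s in document.split(".") if s.strip()]
    # 2. The selector assigns a weight to each sentence w.r.t. the query.
    weights = selector(query, sentences)
    # 3. The ranker only sees the k highest-scoring sentences (kept in document order).
    top_k = sorted(range(len(sentences)), key=lambda i: weights[i], reverse=True)[:k]
    selected = [sentences[i] for i in sorted(top_k)]
    return ranker(query, selected)
```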
The Select-and-Rank paradigm.
The selector $\Psi$ computes a weight $w_i$ for each sentence:
$\left(w_1, \dots, w_{|d|} \right) = \Psi(q, d)$
End-to-end training: We draw a relaxed $k$-hot sample:¹
$\left(\hat{w}_1, \dots, \hat{w}_{|d|} \right) = \operatorname{SubsetSample}(w, k, \tau)$
The query-document relevance is computed by the ranker $\Phi$ using the selected sentences $\hat{d}$:
$\phi(q, d) = \Phi \left(q, \hat{d} \right)$
Each token in $\hat{d}$ is multiplied by the weight $\hat{w}_i$ of its sentence in order to preserve the gradients.
¹ Sang Michael Xie and Stefano Ermon. Reparameterizable Subset Sampling via Continuous Relaxations. IJCAI 2019.
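A PyTorch sketch of the relaxed top-$k$ sampling from the reference above; the numerical clamps are implementation assumptions.

```python
import torch

def relaxed_subset_sample(log_w: torch.Tensor, k: int, tau: float) -> torch.Tensor:
    """Draw a relaxed k-hot sample over the |d| sentence weights (Xie & Ermon, 2019).

    log_w: log selector weights, shape (num_sentences,).
    Returns a vector in [0, 1]^{num_sentences} summing to k, differentiable w.r.t. log_w.
    """
    # Perturb the log-weights with Gumbel noise (Gumbel-max trick).
    u = torch.rand_like(log_w).clamp_(1e-10, 1 - 1e-10)
    keys = log_w - torch.log(-torch.log(u))

    khot = torch.zeros_like(keys)
    alpha = keys
    for _ in range(k):
        # Relaxed arg-max over the sentences not yet (softly) selected.
        a = torch.softmax(alpha / tau, dim=-1)
        khot = khot + a
        # Suppress sentences that have already been (softly) selected.
        alpha = alpha + torch.log1p(-a.clamp(max=1 - 1e-6))
    return khot
```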
Two selector architectures: S&R-Lin (linear) and S&R-Attn (attention-based).
$ \underbrace{\phi(q, d)}_{\textbf{relevance}} = \operatorname{BERT} (\texttt{[CLS]}, \underbrace{q_1, \dots, q_n}_{\textbf{input query}}, \texttt{[SEP]}, \underbrace{\hat{d}_1, \dots, \hat{d}_m}_{\textbf{selected sentences}}, \texttt{[SEP]}) $
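A sketch of how the ranker input could be assembled so that gradients reach the selector: the word embeddings of every token in a selected sentence are scaled by that sentence's relaxed weight $\hat{w}_i$ before the BERT forward pass. All function and variable names here are illustrative, not the paper's code.

```python
import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

def rank_selected(query: str, sentences: list, w_hat: torch.Tensor) -> torch.Tensor:
    """Phi(q, d_hat): BERT over [CLS] q [SEP] s_1 ... s_k [SEP],
    with every token of sentence i scaled by its relaxed weight w_hat[i]."""
    # Tokenize the query and each selected sentence separately so we know
    # which tokens belong to which sentence.
    q_ids = tokenizer(query, add_special_tokens=False)["input_ids"]
    s_ids = [tokenizer(s, add_special_tokens=False)["input_ids"] for s in sentences]

    cls, sep = tokenizer.cls_token_id, tokenizer.sep_token_id
    input_ids = [cls] + q_ids + [sep] + [t for ids in s_ids for t in ids] + [sep]
    input_ids = torch.tensor([input_ids])

    # Per-token scaling: 1 for [CLS]/query/[SEP], w_hat[i] for tokens of sentence i.
    # Built with tensor ops so the gradient path through w_hat is preserved.
    head = torch.ones(len(q_ids) + 2)
    sent_scales = torch.cat([w_hat[i].repeat(len(ids)) for i, ids in enumerate(s_ids)])
    scale = torch.cat([head, sent_scales, torch.ones(1)]).view(1, -1, 1)

    # Scale the word embeddings, then run BERT on the scaled embeddings.
    word_emb = bert.embeddings.word_embeddings(input_ids)
    out = bert(inputs_embeds=word_emb * scale)
    return out.last_hidden_state[:, 0]  # [CLS] representation -> relevance head
```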
Example selections on FEVER. Highlighted sentences contain the answer.
Results of Select-and-Rank models on the BEIR benchmark.
There is a trade-off between the number of sentences $k$ and the effectiveness:
Performance on FEVER.
Performance on SciFact.
Comprehensiveness measures the quality of rationales:
How well does the ranking model perform using the document without the selected sentences?
Ranking performance on TREC-DL-Doc’19 using $k = 20$, where $N$ sentences are removed (leaving $k-N$ sentences).
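One possible way to compute such a comprehensiveness-style score for a single query-document pair; the removal order (highest-weighted first) and the `ranker` placeholder are assumptions for illustration.

```python
def comprehensiveness(query, selected, weights, ranker, n_removed):
    """Score drop when the n highest-weighted selected sentences are removed.

    A large drop means the selected sentences were important to the model's
    relevance estimate; `ranker` is a placeholder scoring function.
    """
    full_score = ranker(query, selected)
    order = sorted(range(len(selected)), key=lambda i: weights[i], reverse=True)
    keep = sorted(order[n_removed:])  # drop the n top-weighted sentences
    reduced_score = ranker(query, [selected[i] for i in keep])
    return full_score - reduced_score
```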
Faithfulness measures the degree to which the explanations represent the model's reasoning:
How well do the selected sentences represent the document they originate from?
We perform a user study to determine the utility of Select-and-Rank for humans.
The user study interface.
240 query-document pairs from 30 queries judged by 80 users (4 judgments per instance).
Accuracy of relevance judgments.
Time taken to complete relevance judgments.
Unaltered relevant document
Relevant document with label leakage
Documents where the leakage sentence has been selected.
Distribution of the ranks of the leakage sentence.
Select-and-Rank requires a full BERT forward pass for each query-document pair at query time.
Can we change the architecture to allow for pre-computations?
Select-and-Rank operates strictly at the sentence level.
Can we make the extractive explanations more fine-grained?