Jurek Leonhardt, Koustav Rudra, Avishek Anand
Neural ranking in a nutshell:
(of course, the following is vastly simplified)
$ \underbrace{\phi(q, d)}_{\textbf{relevance}} = \operatorname{BERT} \underbrace{ (\texttt{[CLS]}\ q\ \texttt{[SEP]}\ d\ \texttt{[SEP]}) }_{\textbf{LLM inputs}} $
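As a minimal illustration, such a cross-encoder score could be computed along the following lines; the `bert-base-uncased` checkpoint and the single-logit classification head are assumptions made for this sketch, not part of the original setup.

```python
# Minimal cross-encoder sketch: score phi(q, d) with a BERT-style model.
# Checkpoint and single-logit head are illustrative assumptions.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=1  # one output: the relevance score
)

def relevance(query: str, document: str) -> float:
    # Builds "[CLS] q [SEP] d [SEP]" and returns the model's scalar output.
    inputs = tokenizer(query, document, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits  # shape: (1, 1)
    return logits.item()

print(relevance("what is neural ranking?", "Neural ranking models score documents with BERT."))
```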
LLMs (e.g., Transformer-based models such as BERT) are complex and hard to interpret or explain.
What if the trained model is biased? Racist? Sexist?
Why should we even show the ranking model the whole document?
Assumption: $k$ sentences of a document are enough to estimate its relevance w.r.t. a query.
A document $d$ is split into sentences $s_i$.
The selector assigns a score to each sentence $s_i$ w.r.t. the query $q$.
The ranker sees only the $k$ highest scoring sentences.
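A sketch of this selection step at inference time; `selector_score` and `ranker_score` are hypothetical stand-ins for the trained selector and ranker, and the period-based sentence splitter is a deliberate simplification.

```python
# Select-and-Rank at inference time (sketch).
from typing import Callable, List, Tuple

def select_and_rank(
    query: str,
    document: str,
    k: int,
    selector_score: Callable[[str, str], float],
    ranker_score: Callable[[str, List[str]], float],
) -> Tuple[float, List[str]]:
    # Naive sentence splitting; a proper sentence tokenizer would be used in practice.
    sentences = [s.strip() for s in document.split(".") if s.strip()]
    # Selector: one score per sentence w.r.t. the query.
    weights = [selector_score(query, s) for s in sentences]
    # Keep only the k highest-scoring sentences; they also serve as the explanation.
    top = sorted(range(len(sentences)), key=lambda i: weights[i], reverse=True)[:k]
    selected = [sentences[i] for i in sorted(top)]  # restore document order
    # Ranker: relevance is computed from the selected sentences only.
    return ranker_score(query, selected), selected
```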
The selector $\Psi$ computes a weight $w_i$ for each sentence:
$\left(w_1, \dots, w_{|d|} \right) = \Psi(q, d)$
End-to-end training: we draw a relaxed $k$-hot sample [1] (sketched below):
$\left(\hat{w}_1, \dots, \hat{w}_{|d|} \right) = \operatorname{SubsetSample}(w, k, \tau)$
The query-document relevance is computed by the ranker $\Phi$ using the selected sentences $\hat{d}$:
$\phi(q, d) = \Phi \left(q, \hat{d} \right)$
Each token in $\hat{d}$ is multiplied by its corresponding weight $\hat{w}_i$ to preserve the gradients (one possible realization is sketched after the equation below).
[1] Sang Michael Xie and Stefano Ermon. Reparameterizable Subset Sampling via Continuous Relaxations. IJCAI 2019.
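A possible implementation of $\operatorname{SubsetSample}$, sketched after the continuous top-$k$ relaxation of Xie and Ermon [1]: Gumbel noise perturbs the log-weights, and $k$ successive softmax steps produce a relaxed $k$-hot vector that stays differentiable w.r.t. $w$. The clamping constants and the assumption of positive weights are illustrative choices, not necessarily the original implementation.

```python
import torch

def subset_sample(w: torch.Tensor, k: int, tau: float) -> torch.Tensor:
    """Relaxed k-hot sample over positive selector weights w (shape: [n]).

    Sketch of the continuous subset-sampling relaxation of Xie & Ermon [1]:
    perturb the log-weights with Gumbel noise, then apply k successive
    softmax steps while discounting mass that has already been assigned.
    The result sums to k and is differentiable w.r.t. w.
    """
    u = torch.rand_like(w).clamp(1e-9, 1 - 1e-9)
    gumbel = -torch.log(-torch.log(u))
    alpha = torch.log(w.clamp_min(1e-12)) + gumbel  # perturbed log-weights
    khot = torch.zeros_like(w)
    for _ in range(k):
        a = torch.softmax(alpha / tau, dim=-1)
        khot = khot + a
        # Down-weight items that already received probability mass.
        alpha = alpha + torch.log((1.0 - a).clamp_min(1e-12))
    return khot  # relaxed weights (\hat{w}_1, ..., \hat{w}_n)

# Toy usage: four sentence weights, a relaxed (roughly 2-hot) selection.
w = torch.tensor([0.1, 2.0, 0.5, 1.2], requires_grad=True)
w_hat = subset_sample(w, k=2, tau=0.5)
```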
$ \underbrace{\phi(q, d)}_{\textbf{relevance}} = \operatorname{BERT} (\texttt{[CLS]}, \underbrace{q_1, \dots, q_n}_{\textbf{input query}}, \texttt{[SEP]}, \underbrace{\hat{d}_1, \dots, \hat{d}_m}_{\textbf{selected sentences}}, \texttt{[SEP]}) $
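One possible realization of the gradient-preserving weighting described above (a sketch, not necessarily the exact implementation): expand each relaxed sentence weight $\hat{w}_i$ to its tokens, scale the ranker's input embeddings, and feed them via `inputs_embeds`. Giving query and special tokens a weight of 1 is an assumption made here.

```python
import torch
from typing import List
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
ranker = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=1
)

def rank_with_weights(query: str, sentences: List[str], w_hat: torch.Tensor) -> torch.Tensor:
    """phi(q, d): score the query against the selected sentences, scaling each
    sentence's token embeddings by its relaxed weight so that gradients reach
    the selector. Query and special tokens keep weight 1 (an assumption)."""
    cls_id, sep_id = tokenizer.cls_token_id, tokenizer.sep_token_id
    q_ids = tokenizer(query, add_special_tokens=False)["input_ids"]

    ids = [cls_id] + q_ids + [sep_id]
    tok_w = [torch.ones(len(ids))]            # weight 1 for [CLS], query, [SEP]
    for sent, w_i in zip(sentences, w_hat):
        s_ids = tokenizer(sent, add_special_tokens=False)["input_ids"]
        ids += s_ids
        tok_w.append(w_i.expand(len(s_ids)))  # one weight per sentence token
    ids.append(sep_id)
    tok_w.append(torch.ones(1))               # trailing [SEP]

    input_ids = torch.tensor([ids])
    token_weights = torch.cat(tok_w).view(1, -1, 1)

    # Scale the word embeddings and pass them via inputs_embeds so the
    # multiplication stays inside the computation graph.
    embeds = ranker.get_input_embeddings()(input_ids) * token_weights
    out = ranker(inputs_embeds=embeds, attention_mask=torch.ones_like(input_ids))
    return out.logits.squeeze()  # relevance score, differentiable w.r.t. w_hat
```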
There is a trade-off between the number of selected sentences $k$ and ranking effectiveness.
Comprehensiveness measures the quality of rationales:
How well does the ranking model perform using the document without the selected sentences? (See the sketch below.)
Faithfulness measures the degree to which the explanations represent the model's reasoning:
How well do the selected sentences represent the document they originate from?
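A sketch of how comprehensiveness can be computed as the drop in model score once the selected sentences are removed; the `score` callable wrapping the ranking model and the simple score-difference formulation are assumptions for this example.

```python
from typing import Callable, List

def comprehensiveness(
    query: str,
    sentences: List[str],
    selected_idx: List[int],
    score: Callable[[str, List[str]], float],
) -> float:
    """Drop in model score when the selected sentences (the rationale) are
    removed from the document; a larger drop indicates a more comprehensive
    rationale. `score` is a hypothetical wrapper around the ranking model."""
    full = score(query, sentences)
    rest = [s for i, s in enumerate(sentences) if i not in set(selected_idx)]
    return full - score(query, rest)
```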
We perform a user study to determine the utility of Select-and-Rank for humans.
240 query-document pairs from 30 queries judged by 80 users (4 judgments per instance).
Select-and-Rank requires a BERT forward pass for each document.
Can we change the architecture to allow for pre-computations?
Select-and-Rank operates strictly on a sentence level.
Can we make the extractive explanations more fine-grained?