Jurek Leonhardt
PhD thesis defense, December 13th, 2023
L3S Research Center, Hannover, Germany
L3S Research Center
www.l3s.de
L3S Research Center: Home
L3S is a German research center internationally renowned for its work in the field of digital transformation and the development of AI methods.
Wikipedia
de.wikipedia.org › wiki › L3S
L3S
The L3S Research Center is a research institution, founded in 2001, for basic and application-oriented research on ...
de.linkedin.com › company
L3S Research Center
The L3S Research Center, founded in 2001, develops methods and technologies for digital transformation and researches the effects of digitalisation.
Given an input query $q$, return a list of $k$ documents $d_i$ from a corpus, ordered by their relevance with respect to $q$.
$d_1$: L3S Research Center (www.l3s.de)
$d_2$: Wikipedia (de.wikipedia.org › wiki › L3S)
$d_3$: de.linkedin.com › company
. . .
$d_k$
Ranking for the query "L3S Research Center".
Web search → Ad-hoc retrieval (this thesis)
Personal assistants → Question answering
Online shopping, video streaming → Recommender systems
Thus, ranking models need to be efficient...
Data from: Scells, Zhuang, and Zuccon. “Reduce, Reuse, Recycle: Green Information Retrieval Research”. SIGIR 2022.
But state-of-the-art neural models (often) aren't.
Neural ranking in a nutshell:
(of course, the following is vastly simplified)
$ \underbrace{\phi(q, d)}_{\text{relevance}} = \operatorname{BERT} \underbrace{ (\texttt{[CLS]}\ q\ \texttt{[SEP]}\ d\ \texttt{[SEP]}) }_{\text{LLM inputs}} $
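For illustration, a minimal sketch of this cross-encoder scoring scheme with the Hugging Face transformers library; the checkpoint name is a placeholder, and a real re-ranker would first be fine-tuned on query-document relevance labels:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Any BERT-style checkpoint works as a starting point (untrained head here).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=1)

def relevance(query: str, document: str) -> float:
    # The tokenizer builds the input [CLS] query [SEP] document [SEP] automatically.
    inputs = tokenizer(query, document, truncation=True, return_tensors="pt")
    with torch.no_grad():
        return model(**inputs).logits.item()

print(relevance("L3S Research Center", "L3S is a German research center ..."))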
LLMs (e.g., Transformer-based models such as BERT) are complex and hard to interpret or explain.
What if the trained model is biased? Racist? Sexist?
The big picture
Sparse retrieval (mostly) uses lexical matching for ad-hoc retrieval.
Sparse retrieval example.
Lexical matching is fast and efficient and explainable by design, but it is limited to exact term matches.
Semantic matching employs representation learning in order to capture the semantics of queries and documents.
Cross-encoder
Dual-encoders
Neural semantic models are often computationally expensive and not explainable.
Retrieve-and-re-rank is a telescoping setting that has two stages.
📚 Corpus → (1) Lexical retrieval → Candidates → (2) Semantic re-ranking → Final ranking: #1 📄 D371 (0.95), #2 📄 D222 (0.85), #3 📄 D984 (0.71)
Dense retrieval uses semantic models directly for retrieval.
Queries and documents are embedded in a common vector space using dual-encoder models.
Document representations are pre-computed during the indexing stage.
Retrieval is performed as an (approximate) nearest neighbor search operation.
The vector space is often $768$-dimensional.
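A minimal sketch of the retrieval step over pre-computed document vectors, using plain NumPy and exhaustive (exact) search; the vectors here are random placeholders, and a real system would use an approximate nearest neighbor index (e.g., FAISS or HNSW):

import numpy as np

# Pre-computed during indexing: one 768-dimensional vector per document
# (random placeholders standing in for the document encoder output).
doc_ids = ["D371", "D222", "D984"]
doc_vectors = np.random.randn(len(doc_ids), 768).astype(np.float32)

def retrieve(query_vector: np.ndarray, k: int = 2):
    # Dot-product similarity between the query vector and every document vector;
    # exhaustive search stands in for the (approximate) nearest neighbor search.
    scores = doc_vectors @ query_vector
    top = np.argsort(-scores)[:k]
    return [(doc_ids[i], float(scores[i])) for i in top]

query_vector = np.random.randn(768).astype(np.float32)  # output of the query encoder
print(retrieve(query_vector))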
Jurek Leonhardt, Koustav Rudra, Megha Khosla, Abhijit Anand, and Avishek Anand
Efficient Neural Ranking Using Forward Indexes, WWW 2022 (full paper)
Jurek Leonhardt, Henrik Müller, Koustav Rudra, Megha Khosla, Abhijit Anand, and Avishek Anand
Efficient Neural Ranking Using Forward Indexes and Lightweight Encoders, TOIS (journal paper, just accepted)
Interpolating lexical retrieval and semantic re-ranking scores improves the performance:
$\phi (q, d) = \alpha \cdot \underbrace{\phi_S (q, d)}_{\text{retrieval}} + (1 - \alpha) \cdot \underbrace{\phi_D(q, d)}_{\text{re-ranking}}$
Our approach: Efficient re-ranking with dual-encoders.
A Fast-Forward index is a look-up table for pre-computed document representations:
$\phi_D^{FF} (q, d) = \zeta (q) \cdot \eta^{FF} (d)$
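A minimal sketch of a Fast-Forward re-ranking step with score interpolation, assuming the document representations $\eta^{FF}(d)$ have been pre-computed and stored in a look-up table keyed by document ID (names and placeholder vectors are illustrative):

import numpy as np

# Fast-Forward index: document ID -> pre-computed representation eta_FF(d)
ff_index = {"D371": np.random.randn(768), "D222": np.random.randn(768)}

def rerank(query_vector, candidates, sparse_scores, alpha=0.5):
    # candidates: document IDs returned by the lexical retriever
    # sparse_scores: their lexical scores phi_S(q, d)
    results = []
    for doc_id in candidates:
        phi_d = float(np.dot(query_vector, ff_index[doc_id]))      # zeta(q) . eta_FF(d)
        phi = alpha * sparse_scores[doc_id] + (1 - alpha) * phi_d  # interpolation
        results.append((doc_id, phi))
    return sorted(results, key=lambda x: x[1], reverse=True)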
The look-up operation is much more efficient than actual re-ranking:
TCT-ColBERT on TREC-DL-Doc’19.
Coming up: Techniques to improve efficiency even further.
Look-up operation in a coalesced index.
Long documents are often chunked into passages, so they have multiple representations.
Sequential coalescing merges similar representations of subsequent passages within a document.
TCT-ColBERT on TREC-DL-Doc’19 with sequential coalescing.
Many tasks require only a small number of documents (e.g., the top-$10$).
Early stopping exploits correlations of retriever and re-ranker to limit the number of documents to be re-ranked.
ANCE on TREC-DL-Psg’19 with early stopping.
Queries and documents have different characteristics in terms of length and complexity.
The distribution of query and passage lengths in the MS MARCO corpus.
Intuition: Encoders should be adapted to these characteristics.
Queries are short and concise; a simple model should be able to represent them.
Self-attention-based query encoder.
Hyperparameters: number of layers $L$, number of attention heads $A$, hidden dimension $H$.
BERT$_{\text{base}}$: $L=12$, $A=12$, $H=768$. Can we go lower?
Latency and ranking performance. Latency measured with batch size $256$.
Lightweight query encoders are much faster and can achieve good performance.
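For illustration, a sketch of how a lightweight query encoder with fewer layers, fewer heads, and a smaller hidden size could be instantiated with the transformers library; the exact hyperparameter values and the [CLS] pooling are assumptions, the model is untrained here, and if $H$ differs from the document encoder a projection into the shared vector space is additionally needed:

from transformers import BertConfig, BertModel, AutoTokenizer
import torch

# A much smaller configuration than BERT-base (L=12, A=12, H=768).
config = BertConfig(num_hidden_layers=2, num_attention_heads=4,
                    hidden_size=256, intermediate_size=1024)
query_encoder = BertModel(config)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def encode_query(query: str) -> torch.Tensor:
    inputs = tokenizer(query, return_tensors="pt")
    with torch.no_grad():
        outputs = query_encoder(**inputs)
    # Use the [CLS] vector as the query representation (one possible choice).
    return outputs.last_hidden_state[:, 0]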
Documents are often long; encoding (indexing) is expensive.
Complex encoders are required to capture the semantics of the documents.
Can we shorten documents instead?
[...] This is a web article which is unnecessarily long and contains ads, repetitions, and, finally, some useful content. [...]
Example document.
Selective document encoder.
Selective document encoders dynamically shorten batches of input documents during inference.
Each document token is assigned a score, and the lowest scoring tokens are removed.
The degree to which documents are shortened is controlled by a hyperparameter $p \in [0,1]$.
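A minimal sketch of the token-removal step, assuming each document token has already been assigned a score (the scoring model itself is omitted); $p$ controls the fraction of tokens removed:

import numpy as np

def shorten(tokens, scores, p):
    # Keep the highest-scoring (1 - p) fraction of tokens, preserving their order.
    keep = max(1, int(round(len(tokens) * (1 - p))))
    top = sorted(np.argsort(-np.asarray(scores))[:keep])
    return [tokens[i] for i in top]

tokens = ["this", "is", "a", "web", "article", "with", "ads", "and", "useful", "content"]
scores = [0.1, 0.1, 0.1, 0.6, 0.7, 0.1, 0.05, 0.1, 0.9, 0.8]  # illustrative scores
print(shorten(tokens, scores, p=0.5))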
Indexing latency and ranking performance on MSM-Psg-Dev. Latency measured with batch size $256$.
Up to 50% of the document tokens can be removed without a difference in performance.
Jurek Leonhardt, Fabian Beringer, and Avishek Anand
Exploiting Sentence-Level Representations for Passage Ranking, LWDA 2021 (workshop paper)
LLM-based cross-attention ranker.
Cross-attention re-rankers (e.g., BERT) allow for query-document attention, which can be helpful for complex queries (e.g., question answering):
$\phi_{\operatorname{BERT}}(q, d) = \operatorname{BERT}_{\texttt{[CLS]}}(\texttt{[CLS]}\ q\ \texttt{[SEP]}\ d\ \texttt{[SEP]})$
Oftentimes, only the $\texttt{[CLS]}$ output is used, and the other outputs are discarded.
Can we utilize the contextualized sentence representations that BERT outputs?
Dynamic memory networks¹ (DMNs) were proposed to enable cross-sentence reasoning for question answering.
DMNs operate on all query and document tokens:
$\phi_{\operatorname{DMN}}(q, d) = \operatorname{DMN}\left(q_1, \dots, q_{|q|}, d_1, \dots, d_{|d|}\right)$,
where $q_i$ and $d_i$ are word embeddings.
What if we used contextual BERT representations instead?
¹ Ankit Kumar et al. “Ask Me Anything: Dynamic Memory Networks for Natural Language Processing”. PMLR 2016.
BERT-DMN architecture.
BERT-DMNlite architecture.
BERT-DMN shows improvements over BERT.
BERT-DMNlite performance is comparable to BERT.
Performance on passage ranking datasets.
Training batches per second.
BERT-DMNlite reduces the number of trainable parameters roughly from 110M to 3M.
For each query-document pair, BERT outputs can be cached after the first epoch.
Jurek Leonhardt, Koustav Rudra, and Avishek Anand
Extractive Explanations for Interpretable Text Ranking, TOIS 2023 (journal paper)
Example document for query: "how long to hold bow in yoga" 🤔
Why should we even show the ranking model the whole document?
Assumption: $k$ sentences of a document are enough to estimate its relevance w.r.t. a query.
The Select-and-Rank paradigm.
A document $d$ is split into sentences $s_i$.
The selector assigns a score to each sentence $s_i$ w.r.t. the query $q$.
The ranker sees only the $k$ highest scoring sentences.
The Select-and-Rank paradigm.
The selector $\Psi$ computes a weight $w_i$ for each sentence:
$\left(w_1, \dots, w_{|d|} \right) = \Psi(q, d)$
End-to-end training: We draw a relaxed $k$-hot sample:¹
$\left(\hat{w}_1, \dots, \hat{w}_{|d|} \right) = \operatorname{SubsetSample}(w, k, \tau)$
The query-document relevance is computed by the ranker $\Phi$ using the selected sentences $\hat{d}$:
$\phi(q, d) = \Phi \left(q, \hat{d} \right)$
Each token in $\hat{d}$ is multiplied by its corresponding $\hat{w_i}$ in order to preserve the gradients.
¹ Sang Michael Xie and Stefano Ermon. “Reparameterizable Subset Sampling via Continuous Relaxations”. IJCAI 2019.
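At inference time the relaxed sampling is not needed; a minimal sketch of the Select-and-Rank forward pass with hard top-$k$ selection, where the selector and ranker models are placeholders passed in as functions:

import numpy as np

def select_and_rank(query, sentences, selector, ranker, k=20):
    # Selector: assigns one weight per sentence w.r.t. the query.
    weights = np.asarray([selector(query, s) for s in sentences])
    # Keep the k highest-scoring sentences, in their original order.
    top = sorted(np.argsort(-weights)[:k])
    selected = [sentences[i] for i in top]
    # Ranker: sees only the selected sentences; the selection doubles as an explanation.
    score = ranker(query, " ".join(selected))
    return score, selected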
Results of Select-and-Rank models on the BEIR benchmark.
There is a trade-off between the number of sentences $k$ and the effectiveness:
Performance on Fever.
Performance on SciFact.
Faithfulness measures the degree to which the explanations represent the model's reasoning:
How well do the selected sentences represent the document they originate from?
We perform a user study to determine the utility of Select-and-Rank for humans.
The user study interface.
240 query-document pairs from 30 queries judged by 80 users (4 judgments per instance).
Accuracy of relevance judgments.
Time taken to complete relevance judgments.
Jurek Leonhardt, Avishek Anand, and Megha Khosla
Boilerplate Removal Using a Neural Sequence Labeling Model, WWW 2020 (demo paper)
https://en.wikipedia.org/wiki/The_Web_Conference
The Web Conference
The Web Conference (formerly known as International World Wide Web Conference, abbreviated as WWW) is a yearly international academic conference on the topic of the future direction of the World Wide Web. The first conference of many was held and organized by Robert Cailliau in 1994 at CERN in Geneva, Switzerland. The conference has been organized by the International World Wide Web Conference Committee (IW3C2), also founded by Robert Cailliau and colleague Joseph Hardin, every year since. The conference’s location rotates among North America, Europe, and Asia and its events usually span a period of five days. [...]
Boilerplate removal is the process of extracting content in plain text from a web page.
Example use cases:
An example web page and its elements.
Task: Classify each leaf node in the HTML tree as content or not content.
Assumption: The position of a leaf node within the HTML tree is important for classification.
We therefore feed the sequence of leaf representations into the model and classify each element.
A web page is represented as a sequence of its leaf nodes ($\texttt{\#text}$) in the HTML tree.
Example web page and its representations.
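A minimal sketch of turning a web page into the sequence of its text leaf nodes using BeautifulSoup; the representation here (words plus parent HTML tag) is a simplification of the actual feature set:

from bs4 import BeautifulSoup

def leaf_nodes(html: str):
    # Return the sequence of #text leaf nodes with their parent HTML tag.
    soup = BeautifulSoup(html, "html.parser")
    nodes = []
    for text in soup.find_all(string=True):
        if text.parent.name in ("script", "style"):
            continue  # scripts and styles are never content
        content = text.strip()
        if content:
            nodes.append({"tag": text.parent.name, "words": content.split()})
    return nodes

html = "<html><body><h1>The Web Conference</h1><p>a yearly conference ...</p></body></html>"
print(leaf_nodes(html))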
The BoilerNet architecture.
The BoilerNet model consists of an embedding of each leaf node (based on its words and HTML tags) and a recurrent (LSTM-based) sequence labeling network that classifies each node as content or boilerplate.
On CleanEval, BoilerNet performs similarly to Web2Text, whose features were manually optimized for this dataset.
Performance on CleanEval.
On GoogleTrends-2017 (our dataset), BoilerNet outperforms competitors.
Performance on GoogleTrends-2017.
In this thesis, we focused on making neural ranking more efficient (Fast-Forward indexes, lightweight and selective encoders) and more interpretable (Select-and-Rank, extractive explanations), along with supporting tools such as BoilerNet.
Recall the trade-off between efficiency, explainability, and effectiveness.
Can we achieve efficiency and explainability?
Lexical matching estimates the relevance of query-document pairs based on overlapping terms and their frequencies (tf).
Query: "What is the meaning of life?" 🤔
Hybrid retrieval uses a sparse retriever to retrieve $K^q_S$ and a dense retriever to retrieve $K^q_D$ for a query $q$.
The documents are ranked using a combination of the scores.
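A minimal sketch of this combination, assuming two score dictionaries keyed by document ID and treating missing scores as zero (a simplification; other normalization or back-filling strategies are possible):

def hybrid_rank(sparse_scores, dense_scores, alpha=0.5, k=10):
    # K_S^q and K_D^q: candidate sets from the sparse and dense retrievers
    candidates = set(sparse_scores) | set(dense_scores)
    combined = {
        d: alpha * sparse_scores.get(d, 0.0) + (1 - alpha) * dense_scores.get(d, 0.0)
        for d in candidates
    }
    return sorted(combined.items(), key=lambda x: x[1], reverse=True)[:k]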
Long documents are often chunked into passages and scored using a $\max$ operation:
$\phi_D^{FF} (q, d) = \max_{p_i \in d} (\zeta (q) \cdot \eta^{FF} (p_i))$
Sparse retrieval is fast, but not competitive on its own.
Dense retrieval shows good nDCG, but recall suffers and it is slow.
Re-ranking using dense retrieval models works well.
Score interpolation further helps:
$\phi (q, d) = \alpha \cdot \phi_S (q, d) + (1 - \alpha) \cdot \phi_D(q, d)$
Results on TREC-DL-Doc’19.
All passage representations ($p_0, \dots, p_6$) of a document in the embedding space; subsequent similar representations are grouped into sets $\mathcal{A}$ and averaged, yielding the coalesced representations $\hat{p}_1, \dots, \hat{p}_4$.
import numpy as np

def cosine_distance(a, b):
    # Distance between two vectors; cosine distance is one possible choice.
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def coalesce(P, delta):
    # Sequential coalescing: merge representations of subsequent, similar passages.
    P_new = []              # coalesced representations
    A = []                  # current group of passage vectors
    A_avg = None            # running average of the current group
    first_iteration = True
    for p in P:
        if first_iteration:
            first_iteration = False
        elif cosine_distance(p, A_avg) >= delta:
            # p is too dissimilar: emit the merged vector and start a new group
            P_new.append(A_avg)
            A = []
        A.append(p)
        A_avg = np.mean(A, axis=0)
    P_new.append(A_avg)
    return P_new
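For example (placeholder vectors, illustrative threshold), two nearly identical subsequent passages collapse into a single representation:

passages = [np.array([1.0, 0.0]), np.array([0.99, 0.01]), np.array([0.0, 1.0])]
print(len(coalesce(passages, delta=0.2)))  # 2: the first two passages are merged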
The early stopping approach illustrated.
We set $\alpha = 0.5$:
$\phi (q, d) = 0.5 \cdot \phi_S (q, d) + 0.5 \cdot \phi_D (q, d)$
Stop the computation once the highest possible score $\phi_{\text{max}} (q, d)$ is too low to make a difference.
$\phi_D$ is not normalized. We approximate its maximum as the highest observed score:
$\phi_{\text{max}} (q, \text{D224}) = 0.5 \cdot 0.73 + 0.5 \cdot 0.67 = 0.70$
Ranked documents with columns $\phi_S$, $\phi_D$, and interpolated score $\phi$ (sorted by $\phi$, descending).
$\phi_{\text{max}} (q, \text{D224}) = 0.5 \cdot 0.73 + 0.5 \cdot 0.67 = 0.70 > 0.68$: the highest possible score can still make a difference, so D224 is re-ranked.
$\phi_{\text{max}} (q, \text{D105}) = 0.5 \cdot 0.49 + 0.5 \cdot 0.71 = 0.60 < 0.72$: the highest possible score is too low to make a difference, so the computation stops.
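A sketch of the corresponding re-ranking loop, assuming the candidates arrive sorted by descending $\phi_S$ and approximating the dense maximum by the highest dense score observed so far; function and variable names are illustrative:

import heapq

def rerank_early_stopping(candidates, dense_score, k, alpha=0.5):
    # candidates: (doc_id, sparse_score) pairs, sorted by sparse score, descending
    top_k = []            # min-heap of the k best interpolated scores so far
    best_dense = None     # highest observed dense score (approximates the maximum)
    results = []
    for doc_id, s_sparse in candidates:
        if best_dense is not None and len(top_k) == k:
            # Highest score this (and any later) document could possibly reach:
            upper_bound = alpha * s_sparse + (1 - alpha) * best_dense
            if upper_bound < top_k[0]:   # too low to enter the top-k: stop
                break
        s_dense = dense_score(doc_id)    # Fast-Forward look-up
        best_dense = s_dense if best_dense is None else max(best_dense, s_dense)
        score = alpha * s_sparse + (1 - alpha) * s_dense
        results.append((doc_id, score))
        heapq.heappush(top_k, score)
        if len(top_k) > k:
            heapq.heappop(top_k)
    return sorted(results, key=lambda x: x[1], reverse=True)[:k]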
Document ranking results.
Results on TREC-DL-Psg’19.
Passage ranking results at retrieval depth $k_S = 5000$.
Query encoding latency in seconds.
nDCG@10 on MSM-Psg-Dev.
nDCG@10 on TREC-DL-Psg’19.
nDCG@10 on TREC-DL-Psg’20.
Selective document encoders incur a small performance hit, but decrease the indexing latency notably:
Encoding, batch size $256$.
MSM-Psg-Dev, $L=12$.
MSM-Psg-Dev, $L=0$.
The plot shows the average cosine similarity of BERT representations.
BERT-DMN shows much lower diffusion.
This suggests that fine-tuning vanilla BERT merely “aggregates” results.
The diffusion of information.
Ranking performance with $k = 20$ selected sentences.
Comprehensiveness measures the quality of rationales:
How well does the ranking model perform using the document without the selected sentences?
Ranking performance on TREC-DL-Doc’19 using $k = 20$, where $N$ sentences are removed (leaving $k-N$ sentences).
Example rankings from TREC-DL-Doc’19 with the most relevant selected sentences. Suffix (+/-) indicates the relevance.
Example selections on Fever. Highlighted sentences contain the answer.
Unaltered relevant document
Relevant document with label leakage
Documents where the leakage sentence has been selected.
Distribution of the ranks of the leakage sentence.