Jurek Leonhardt
PhD thesis defense, December 13th, 2023
L3S Research Center, Hannover, Germany
L3S Research Center
www.l3s.de
L3S Research Center: Home
L3S is a German research center internationally renowned for its work in the field of digital transformation and the development of AI methods.
Wikipedia
de.wikipedia.org › wiki › L3S
L3S
The L3S Research Center is a research institution founded in 2001 that conducts basic and application-oriented research on ...
de.linkedin.com › company
L3S Research Center
The L3S Research Center, founded in 2001, develops methods and technologies for digital transformation and researches the effects of digitalisation.
Given an input query $q$, return a list of $k$ documents $d_i$ from a corpus, ordered by their relevance with respect to $q$.
$d_1$
L3S Research Center
www.l3s.de
$d_2$
Wikipedia
de.wikipedia.org › wiki › L3S
$d_3$
de.linkedin.com › company
. . .
$d_k$
Web search → ad-hoc retrieval (this thesis)
Personal assistants → question answering
Online shopping, video streaming → recommender systems
thus...
Data from: Scells, Zhuang, and Zuccon. “Reduce, Reuse, Recycle: Green Information Retrieval Research”. SIGIR 2022.
But state-of-the-art neural models (often) aren't.
Neural ranking in a nutshell:
(of course, the following is vastly simplified)
$ \underbrace{\phi(q, d)}_{\text{relevance}} = \operatorname{BERT} \underbrace{ (\texttt{[CLS]}\ q\ \texttt{[SEP]}\ d\ \texttt{[SEP]}) }_{\text{LLM inputs}} $
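For illustration, a minimal sketch of this scoring with the Hugging Face transformers library; bert-base-uncased with a fresh one-dimensional classification head is an assumption here, and the model would have to be fine-tuned on relevance labels before the scores mean anything:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Assumed setup: plain BERT with a 1-dimensional classification head (untrained).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=1)

q = "what is the L3S Research Center?"
d = "L3S is a German research center internationally renowned for its work ..."
inputs = tokenizer(q, d, return_tensors="pt", truncation=True)  # [CLS] q [SEP] d [SEP]
with torch.no_grad():
    relevance = model(**inputs).logits.squeeze().item()         # phi(q, d)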
LLMs (e.g., Transformer models such as BERT) are complex and hard to interpret/explain.
What if the trained model is biased? Racist? Sexist?
The big picture
Sparse retrieval (mostly) uses lexical matching for ad-hoc retrieval.
Lexical matching is fast, efficient, and explainable by design, but limited to exact term matches.
Semantic matching employs representation learning in order to capture the semantics of queries and documents.
Neural semantic models are often computationally expensive and not explainable.
Retrieve-and-re-rank is a telescoping setting with two stages:
📚 Corpus → (1) Lexical retrieval → Candidates → (2) Semantic re-ranking → Final ranking: #1 📄 D371 (0.95), #2 📄 D222 (0.85), #3 📄 D984 (0.71), ...
Dense retrieval uses semantic models directly for retrieval.
Documents and queries are embedded in a common vector space using dual-encoder models.
Document representations are pre-computed during the indexing stage.
Retrieval is performed as an (approximate) nearest neighbor search operation.
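As a rough sketch, with random placeholder vectors standing in for the encoder outputs and an exact search instead of the approximate nearest-neighbor indexes used in practice:

import numpy as np

# Placeholder embeddings; in practice these come from the dual-encoder model.
doc_embeddings = np.random.rand(10_000, 768)   # eta(d), pre-computed at indexing time
query_embedding = np.random.rand(768)          # zeta(q), computed at query time

scores = doc_embeddings @ query_embedding      # dot-product similarity
top_10 = np.argsort(-scores)[:10]              # (exact) nearest-neighbor search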
Jurek Leonhardt, Koustav Rudra, Megha Khosla, Abhijit Anand, and Avishek Anand
Efficient Neural Ranking Using Forward Indexes, WWW 2022 (full paper)
Jurek Leonhardt, Henrik Müller, Koustav Rudra, Megha Khosla, Abhijit Anand, and Avishek Anand
Efficient Neural Ranking Using Forward Indexes and Lightweight Encoders, TOIS (journal paper, just accepted)
Interpolating lexical retrieval and semantic re-ranking scores improves performance:
$\phi (q, d) = \alpha \cdot \underbrace{\phi_S (q, d)}_{\text{retrieval}} + (1 - \alpha) \cdot \underbrace{\phi_D(q, d)}_{\text{re-ranking}}$
Our approach: efficient re-ranking with dual encoders.
A Fast-Forward index is a look-up table for pre-computed document representations:
$\phi_D^{FF} (q, d) = \zeta (q) \cdot \eta^{FF} (d)$
The look-up operation is much more efficient than actual re-ranking:
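A minimal sketch of this look-up-based re-ranking with interpolation, modeling the Fast-Forward index as a plain dictionary (names and data structures are illustrative, not the actual implementation):

import numpy as np

def rerank_fast_forward(q_emb, candidates, ff_index, alpha=0.5):
    """candidates: (doc_id, lexical score) pairs; ff_index: doc_id -> pre-computed embedding."""
    scored = []
    for doc_id, s_lexical in candidates:
        s_dense = float(np.dot(q_emb, ff_index[doc_id]))  # phi_D^FF(q, d): a look-up, no encoding
        scored.append((doc_id, alpha * s_lexical + (1 - alpha) * s_dense))
    return sorted(scored, key=lambda x: x[1], reverse=True)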
Coming up: Techniques to improve efficiency even further.
Long documents are often chunked into passages, so they have multiple representations.
Sequential coalescing merges similar representations of subsequent passages within a document.
Many tasks require only a small number of documents (e.g., the top-$10$).
Early stopping exploits correlations of retriever and re-ranker to limit the number of documents to be re-ranked.
Queries and documents have different characteristics in terms of length and complexity.
Intuition: Encoders should be adapted to these characteristics.
Queries are short and concise; a simple model should be able to represent them.
Hyperparameters:
BERT$_\text{base}$: $L=12$ layers, $A=12$ attention heads, $H=768$ hidden dimensions. Can we go lower?
Lightweight query encoders are much faster and can achieve good performance.
Documents are often long; encoding (indexing) is expensive.
Complex encoders are required to capture the semantics of the documents.
Can we shorten documents instead?
[...] This is a web article which is unnecessarily long and contains ads, repetitions, and, finally, some useful content. [...]
Selective document encoders dynamically shorten batches of input documents during inference.
Each document token is assigned a score, and the lowest scoring tokens are removed.
The degree to which documents are shortened is controlled by a hyperparameter $p \in [0,1]$.
Up to 50% of the document tokens can be removed without a difference in performance.
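A simplified sketch of the shortening step for a single document (the token scorer itself is omitted; shapes and names are assumptions):

import torch

def shorten(token_embs, token_scores, p=0.5):
    """Keep the (1 - p) fraction of tokens with the highest scores, in original order."""
    keep = max(1, round((1 - p) * token_embs.size(0)))
    idx = torch.topk(token_scores, keep).indices.sort().values  # indices of retained tokens
    return token_embs[idx]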
Jurek Leonhardt, Fabian Beringer, and Avishek Anand
Exploiting Sentence-Level Representations for Passage Ranking, LWDA 2021 (workshop paper)
Cross-attention re-rankers (e.g., BERT) allow for query-document attention, which can be helpful for complex queries (e.g., question answering):
$\phi_{\operatorname{BERT}}(q, d) = \operatorname{BERT}_\texttt{[CLS]} (\texttt{[CLS]}\ q\ \texttt{[SEP]}\ d\ \texttt{[SEP]})$
Often, only the $\texttt{[CLS]}$ output is used; the other outputs are discarded.
Can we utilize the contextualized sentence representations that BERT outputs?
Dynamic memory networks1 (DMNs) were proposed to enable cross-sentence reasoning for question answering.
DMNs operate on all query and document tokens:
$\phi_{\operatorname{DMN}}(q, d) = \operatorname{DMN}\left(q_1, \dots, q_{|q|}, d_1, \dots, d_{|d|}\right)$,
where $q_i$ and $d_i$ are word embeddings.
What if we used contextual BERT representations instead?
1Ankit Kumar et al. “Ask Me Anything: Dynamic Memory Networks for Natural Language Processing”. PMLR 2016.
BERT-DMN shows improvements over BERT.
BERT-DMN$_\text{lite}$ achieves performance comparable to BERT.
BERT-DMN$_\text{lite}$ reduces the number of trainable parameters from roughly 110M to 3M.
For each query-document pair, BERT outputs can be cached after the first epoch.
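Conceptually, the cache can be a simple mapping from query-document pairs to the frozen BERT outputs (a sketch; the actual training code differs):

import torch

cache = {}  # (query, document) -> BERT token representations

def cached_bert_outputs(model, tokenizer, q, d):
    if (q, d) not in cache:
        inputs = tokenizer(q, d, return_tensors="pt", truncation=True)
        with torch.no_grad():  # BERT stays frozen, so its outputs never change
            cache[(q, d)] = model(**inputs).last_hidden_state
    return cache[(q, d)]       # reused from the second epoch on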
Jurek Leonhardt, Koustav Rudra, and Avishek Anand
Extractive Explanations for Interpretable Text Ranking, TOIS 2023 (journal paper)
Why should we even show the ranking model the whole document?
Assumption: $k$ sentences of a document are enough to estimate its relevance w.r.t. a query.
A document $d$ is split into sentences $s_i$.
The selector assigns a score to each sentence $s_i$ w.r.t. the query $q$.
The ranker sees only the $k$ highest scoring sentences.
The selector $\Psi$ computes a weight $w_i$ for each sentence:
$\left(w_1, \dots, w_{|d|} \right) = \Psi(q, d)$
End-to-end training: We draw a relaxed $k$-hot sample:1
$\left(\hat{w}_1, \dots, \hat{w}_{|d|} \right) = \operatorname{SubsetSample}(w, k, \tau)$
The query-document relevance is computed by the ranker $\Phi$ using the selected sentences $\hat{d}$:
$\phi(q, d) = \Phi \left(q, \hat{d} \right)$
Each token in $\hat{d}$ is multiplied by its corresponding $\hat{w}_i$ in order to preserve the gradients.
1Sang Michael Xie and Stefano Ermon. Reparameterizable Subset Sampling via Continuous Relaxations. IJCAI 2019.
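A simplified sketch of the selection step (hard top-$k$ selection as used at inference time, instead of the relaxed sample; tensor shapes and helper names are assumptions):

import torch

def select_and_weight(sentence_token_embs, w_hat, k=3):
    """sentence_token_embs: list of (len_i, dim) tensors, one per sentence; w_hat: selector weights."""
    idx = torch.topk(w_hat, k).indices.sort().values.tolist()    # k highest-weighted sentences, in order
    weighted = [sentence_token_embs[i] * w_hat[i] for i in idx]  # multiply tokens by w_hat_i to keep gradients
    return torch.cat(weighted, dim=0)                            # tokens of d_hat, passed to the ranker Phi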
There is a trade-off between the number of sentences $k$ and the effectiveness:
Faithfulness measures the degree to which the explanations represent the model's reasoning:
How well do the selected sentences represent the document they originate from?
We perform a user study to determine the utility of Select-and-Rank for humans.
240 query-document pairs from 30 queries judged by 80 users (4 judgments per instance).
Jurek Leonhardt, Avishek Anand, and Megha Khosla
Boilerplate Removal Using a Neural Sequence Labeling Model, WWW 2020 (demo paper)
The Web Conference
The Web Conference (formerly known as International World Wide Web Conference, abbreviated as WWW) is a yearly international academic conference on the topic of the future direction of the World Wide Web. The first conference of many was held and organized by Robert Cailliau in 1994 at CERN in Geneva, Switzerland. The conference has been organized by the International World Wide Web Conference Committee (IW3C2), also founded by Robert Cailliau and colleague Joseph Hardin, every year since. The conference’s location rotates among North America, Europe, and Asia and its events usually span a period of five days. [...]
Boilerplate removal is the process of extracting content in plain text from a web page.
Example use cases:
Task: Classify each leaf node in the HTML tree as content or not content.
Assumption: The position of a leaf node within the HTML tree is important for classification.
We therefore feed the sequence of leaf representations into the model and classify each element.
A web page is represented as a sequence of its leaf nodes ($\texttt{\#text}$) in the HTML tree.
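For illustration, such a leaf-node sequence could be extracted with BeautifulSoup as follows (a sketch of the pre-processing, not the exact BoilerNet pipeline):

from bs4 import BeautifulSoup

def leaf_nodes(html):
    """Return the web page as a sequence of (parent tag, text) leaf nodes."""
    soup = BeautifulSoup(html, "html.parser")
    leaves = []
    for node in soup.find_all(string=True):
        text = node.strip()
        if text and node.parent.name not in ("script", "style"):
            leaves.append((node.parent.name, text))
    return leaves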
The BoilerNet model consists of
On CleanEval, BoilerNet performs similarly to Web2Text, whose features were manually optimized for this dataset.
On GoogleTrends-2017 (our dataset), BoilerNet outperforms competitors.
In this thesis, we focused on:
Recall the trade-off between efficiency, explainability, and effectiveness.
Can we achieve efficiency and explainability?
Lexical matching estimates the relevance of query-document pairs based on overlapping terms and their frequencies (tf).
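As a toy illustration (ignoring the idf and length-normalization components that scoring functions like BM25 add), such a score can be as simple as summing term frequencies:

from collections import Counter

def tf_score(query, document):
    """Toy lexical matching: sum of the term frequencies of the query terms in the document."""
    doc_tf = Counter(document.lower().split())
    return sum(doc_tf[t] for t in query.lower().split())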
Hybrid retrieval uses a sparse retriever to retrieve $K^q_S$ and a dense retriever to retrieve $K^q_D$ for a query $q$.
The documents are ranked using a combination of the scores.
Long documents are often chunked into passages and scored using a $\max$ operation:
$\phi_D^{FF} (q, d) = \max_{p_i \in d} (\zeta (q) \cdot \eta^{FF} (p_i))$
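A minimal sketch of this $\max$-based scoring, assuming the passage representations are already pre-computed:

import numpy as np

def score_document(q_emb, passage_embs):
    """passage_embs: (num_passages, dim) array of pre-computed representations of one document."""
    return float(np.max(passage_embs @ q_emb))  # max over per-passage scores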
Sparse retrieval is fast, but not competitive on its own.
Dense retrieval shows good nDCG, but recall suffers and it is slow.
Re-ranking using dense retrieval models works well.
Score interpolation further helps:
$\phi (q, d) = \alpha \cdot \phi_S (q, d) + (1 - \alpha) \cdot \phi_D(q, d)$
(Figure: sequential coalescing of the passage representations $p_0, \dots, p_6$ of a document; similar consecutive representations are averaged ($\mathcal{A}$) into the coalesced representations $\hat{p}_1, \dots, \hat{p}_4$.)
import numpy as np

def coalesce(P, delta):
    """Sequentially merge similar representations of consecutive passages (sequential coalescing)."""
    P_new = []     # coalesced representations
    A = []         # current group of consecutive, similar passage vectors
    A_avg = None   # running average of the current group
    for p in P:
        # start a new group once p is too far from the current group's average;
        # the distance measure is an assumption here (cosine distance)
        if A and cosine_distance(p, A_avg) >= delta:
            P_new.append(A_avg)
            A = []
        A.append(p)
        A_avg = np.mean(A, axis=0)
    P_new.append(A_avg)
    return P_new

def cosine_distance(u, v):
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
We set $\alpha = 0.5$:
$\phi (q, d) = 0.5 \cdot \phi_S (q, d) + 0.5 \cdot \phi_D (q, d)$
Stop the computation once the highest possible score $\phi_{\text{max}} (q, d)$ is too low to make a difference.
$\phi_D$ is not normalized. We approximate its maximum as the highest observed score:
$\phi_{\text{max}} (q, \text{D224}) = 0.5 \cdot 0.73 + 0.5 \cdot 0.67 = 0.70$
(Table: candidate documents with their lexical scores $\phi_S$, dense scores $\phi_D$, and interpolated scores $\phi$, sorted by $\phi$ in descending order.)
$\phi_{\text{max}} (q, \text{D224}) = 0.5 \cdot 0.73 + 0.5 \cdot 0.67 = 0.70 > 0.68$, so the score is not too low to make a difference; re-ranking continues.
$\phi_{\text{max}} (q, \text{D105}) = 0.5 \cdot 0.49 + 0.5 \cdot 0.71 = 0.60 < 0.72$, so the score is too low to make a difference; the computation stops.
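Putting interpolation and early stopping together, a sketch in code; it assumes candidates arrive sorted by descending lexical score and approximates the maximum dense score by the highest one observed so far (names and data structures are illustrative):

import numpy as np

def rerank_with_early_stopping(q_emb, candidates, ff_index, k=10, alpha=0.5):
    """candidates: (doc_id, lexical score) pairs sorted by lexical score, descending."""
    top, best_dense = [], 0.0
    for doc_id, s_lexical in candidates:
        # upper bound on the interpolated score of this and all remaining documents
        phi_max = alpha * s_lexical + (1 - alpha) * best_dense
        if len(top) == k and phi_max <= top[-1][1]:
            break  # too low to make a difference: stop re-ranking
        s_dense = float(np.dot(q_emb, ff_index[doc_id]))
        best_dense = max(best_dense, s_dense)
        top.append((doc_id, alpha * s_lexical + (1 - alpha) * s_dense))
        top = sorted(top, key=lambda x: x[1], reverse=True)[:k]
    return top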
Selective document encoders incur a small performance hit, but decrease the indexing latency notably:
The plot shows the average cosine similarity of BERT representations.
BERT-DMN shows much lower diffusion.
This suggests that fine-tuning vanilla BERT merely “aggregates” results.
Comprehensiveness measures the quality of rationales:
How well does the ranking model perform using the document without the selected sentences?