Introduction to Boolean Retrieval
Bag-of-Words (BOW)
Assumption
A document is a collection of words
e.g.
- Doc1: Mary married John
- Doc2: John married Mary
- These two documents are the same under BoW assumption
We will use the BoW assumption throughout the IR part (word order is ignored). The NLP part will cover approaches that take ordering into account.
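The Doc1/Doc2 example above can be checked directly: under BoW, a document reduces to a multiset of its words, so the two documents compare equal.

```python
from collections import Counter

# Under the bag-of-words assumption, a document is just a multiset of terms,
# so word order is discarded and these two documents become identical.
doc1 = "Mary married John".split()
doc2 = "John married Mary".split()

print(Counter(doc1) == Counter(doc2))  # True
```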
Field (Zone) in Document
- A document is semi-structured data
- e.g., Title, Author, Published Date, Body
- Someone may want to limit the search scope to a certain field
Basic Field Index
As the figure shows, the title and body text are indexed separately; each index entry points to the IDs of the documents containing the term.
Field in Posting
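A minimal sketch of the field-in-posting idea, using a hypothetical toy index (the terms, documents, and `search` helper are assumptions for illustration): each posting stores the field alongside the document ID, so a query can be restricted to one field.

```python
# Hypothetical posting lists where each posting is a (doc_id, field) pair.
index = {
    "retrieval": [(1, "title"), (1, "body"), (2, "body")],
    "boolean":   [(2, "title")],
}

def search(term, field=None):
    """Return matching doc IDs, optionally restricted to one field."""
    postings = index.get(term, [])
    if field is None:
        return sorted({doc for doc, f in postings})
    return sorted({doc for doc, f in postings if f == field})

print(search("retrieval"))           # [1, 2]
print(search("retrieval", "title"))  # [1]
```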
Limitations of Boolean Retrieval
- Thus far, our queries have all been Boolean: documents either match or they don’t.
- Good for expert users with precise understanding of their needs and the collection
- Not good for the majority of users
— Most users are incapable of writing Boolean queries
— Or they can, but they think it’s too much work
- Boolean queries often result in either too few or too many results
— Query1: “bluetooth pairing iphone” => 100,000 hits
— Query2: “bluetooth pairing iphone sony mdr-xb50” => 0 hits
Ranked Retrieval
Concept
Given a query, rank documents according to some criterion so that the “best” results appear early in the result list displayed to the user. (Simply put: we estimate how relevant each document is to the query.)
The goal of ranked retrieval is to find a scoring function
$Score(d, q)$, where $d$ is a document and $q$ is a query.
Advantage
When a system produces a ranked result set, large result sets are not a problem: we can simply show the top K results (say, 10 or so) without overwhelming the user.
Weighted Fields Approach
- Advanced search is for experts
– Still, the majority of users use a set of keywords as a query
- The importance of a term is not the same everywhere
– Terms in the headline of a news article are more important than terms in the main text.
Scoring with Weighted Fields
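One common way to realize this idea is weighted zone (field) scoring: assign each field a weight, and score a document by summing the weights of the fields that contain the query term. The field names and weights below are assumptions for illustration, not values from the source.

```python
# Assumed field weights (summing to 1): title matches count more than body matches.
FIELD_WEIGHTS = {"title": 0.6, "body": 0.4}

def weighted_zone_score(doc_fields, term):
    """doc_fields maps field name -> list of tokens in that field.
    Score = sum of weights of the fields that contain the term."""
    return sum(w for field, w in FIELD_WEIGHTS.items()
               if term in doc_fields.get(field, []))

doc = {"title": ["ranked", "retrieval"],
       "body":  ["boolean", "retrieval", "basics"]}

print(weighted_zone_score(doc, "retrieval"))  # in title and body -> 0.6 + 0.4
print(weighted_zone_score(doc, "boolean"))    # body only -> 0.4
```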
Rank by Term Frequency
Definition
$tf_{t,d}$ is the number of occurrences of term t in document d.
- Rank based on the frequency of query terms in documents
- Let q be a set of query terms (t1, t2, …, tm); the term frequency score of document d given query q is
$Score(d, q) = \sum_{i=1}^{m} tf_{t_i, d}$
TF Rank Example
- If our query is “car insurance”, then score of each document is:
– Score(doc1, q) = $tf_{car,doc1}$ + $tf_{insurance,doc1}$ = 3
– Score(doc2, q) = $tf_{car,doc2}$ + $tf_{insurance,doc2}$ = 5
- The rank of doc2 is higher than that of doc1
- Every query term has an equal importance
– What if insurance is more important than car?
Importance of Terms
- In reality, every term has a different weight
– e.g., A collection of documents on the auto industry is likely to have the term car in almost every document.
- How to mitigate the effect of terms that occur too often in the collection?
– Use document frequency of term.
Document Frequency (DF)
Definition
Document Frequency $df_{t}$ : the number of documents in the collection that contain term t.
- df is a good way to measure an importance of a term.
– High frequency => not important (like stopwords)
– Low frequency => important
- Why not collection frequency? (The total number of occurrences of a term in the collection.) As the figure shows, although the cf of the two terms is the same, their df differs greatly: try may appear in most of the documents, while insurance appears only in relevant documents, possibly many times in each.
Inverse Document Frequency (IDF)
Definition
Let $df_{t}$ be the number of documents in the collection that contain a term t. The IDF can be defined as follows:
$idf_{t} = \log_{10}(N / df_{t})$, where $N$ is the total number of documents.
- The idf of a rare term is high, whereas the idf of a frequent term is likely to be low.
– E.g., Let N = 100, $df_{car}$ = 60, $df_{insurance}$ = 10
– $idf_{car} = \log_{10}(100/60) \approx 0.22$, $idf_{insurance} = \log_{10}(100/10) = 1$
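The idf computation is a one-liner; the numbers below reproduce the example (N = 100, df_car = 60, df_insurance = 10).

```python
import math

def idf(N, df_t):
    """Inverse document frequency: idf_t = log10(N / df_t)."""
    return math.log10(N / df_t)

print(round(idf(100, 60), 2))  # 0.22 -- frequent term, low idf
print(idf(100, 10))            # 1.0  -- rarer term, higher idf
```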
TF-IDF
Definition
The tf-idf weight of term t in document d is as follows:
$tf\text{-}idf_{t,d} = tf_{t,d} \times idf_{t}$
- With the tf-idf weighting scheme, the score of document d given query q = (t1, t2, …, tm) is
$Score(d, q) = \sum_{i=1}^{m} tf\text{-}idf_{t_i, d}$
Example
- Given query “car insurance”, then score of each document is:
– Score(doc1, q) = $tf\text{-}idf_{car,doc1}$ + $tf\text{-}idf_{insurance,doc1}$ = 2.22
– Score(doc2, q) = $tf\text{-}idf_{car,doc2}$ + $tf\text{-}idf_{insurance,doc2}$ = 1.1
- Unlike the tf-based scoring approach, the score of doc1 is greater than that of doc2.
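A sketch putting tf and idf together; the document tokens are assumptions chosen to match the counts implied by the example (doc1: tf_car = 1, tf_insurance = 2; doc2: tf_car = 5), with N = 100 and the df values from the idf example.

```python
import math
from collections import Counter

N = 100
df = {"car": 60, "insurance": 10}  # document frequencies from the idf example

docs = {
    "doc1": "car insurance insurance".split(),
    "doc2": "car car car car car auto".split(),
}

def tfidf_score(doc_tokens, query_terms):
    """Score(d, q) = sum over query terms of tf_{t,d} * log10(N / df_t)."""
    tf = Counter(doc_tokens)
    return sum(tf[t] * math.log10(N / df[t]) for t in query_terms)

q = ["car", "insurance"]
print(round(tfidf_score(docs["doc1"], q), 2))  # 2.22
print(round(tfidf_score(docs["doc2"], q), 2))  # 1.11 (the notes round to 1.1)
```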
Limitation of tf-idf scoring
- tf-idf still heavily relies on the frequency of terms.
Assume
– $tf_{car,doc1}$ = 20
– $tf_{car,doc2}$ = 1
– If our query contains car, the tf score of doc1 is 20 times that of doc2.
- Is there really a big difference between frequency 10 and 20?
- Score linearly increases with respect to frequency of term
- After a certain frequency, the absolute frequency isn’t important.
Sublinear tf scaling
- Use logarithmically weighted term frequency (wf):
$wf_{t,d} = 1 + \log_{10} tf_{t,d}$ if $tf_{t,d} > 0$, and $0$ otherwise
- Logarithmic term frequency version of tf-idf:
$wf\text{-}idf_{t,d} = wf_{t,d} \times idf_{t}$
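Sublinear scaling can be sketched as below (assuming base-10 logarithms, consistent with the idf definition in these notes). Note how doubling the raw frequency from 10 to 20 barely moves the weight.

```python
import math

def wf(tf):
    """Sublinear tf scaling: 1 + log10(tf) for tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0

print(wf(1))   # 1.0
print(wf(10))  # 2.0
print(wf(20))  # ~2.30 -- 20 occurrences score only slightly above 10
```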
Limitation of tf-idf/wf-idf scoring
- Assume we have document d
- We create a new document d’ by appending a copy of d to itself (d’ = d X 2).
- d’ is exactly as relevant as d for every query, yet their scores differ!
– $Score_{tf\text{-}idf}(d', q) \ge Score_{tf\text{-}idf}(d, q)$
– $Score_{wf\text{-}idf}(d', q) \ge Score_{wf\text{-}idf}(d, q)$
– Both scoring schemes prefer longer documents.
Maximum tf normalization
- Let $tf_{max}(d)$ be the maximum frequency of document d
- Normalized term frequency is defined as
$ntf_{t,d} = \alpha + (1 - \alpha)\,\dfrac{tf_{t,d}}{tf_{max}(d)}$, where $\alpha$ is a smoothing term with $0 < \alpha < 1$
- Maximum value of ntf is 1
- Minimum value of ntf is α
- This approach has limitations too (not covered here).
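A minimal sketch of maximum tf normalization. The choice α = 0.4 and the toy document are assumptions for illustration; the formula matches the definition above, so the most frequent term gets ntf = 1 and rare terms stay near α.

```python
from collections import Counter

def ntf(tf_td, tf_max, alpha=0.4):
    """Maximum tf normalization: alpha + (1 - alpha) * tf / tf_max."""
    return alpha + (1 - alpha) * tf_td / tf_max

doc = "car car car insurance auto".split()
tf = Counter(doc)
tf_max = max(tf.values())  # 3, from the term "car"

print(ntf(tf["car"], tf_max))        # 1.0 -- most frequent term
print(ntf(tf["insurance"], tf_max))  # 0.6 -- closer to alpha
```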
Document as Vectors
- Given a term-document matrix
– a document can be represented as a vector of length V
– V = size of vocabulary
- Document vectors:
– d1 = (1, 4), d2 = (7, 2), d3 = (6, 11)
Document Similarity in Vector Space
- Plot document vectors in vector space
- How to find similar documents in vector space?
– Distance from vector to vector
– Angle difference between vectors
Angle Difference
Cosine similarity:
$\cos(d_1, d_2) = \dfrac{d_1 \cdot d_2}{|d_1|\,|d_2|}$
- Standard way of quantifying similarity between documents
– 1 if directions of two vectors are the same
– 0 if directions of two vectors are orthogonal
Query as Document
- A query can be converted to a vector too
- Compute the similarity between the query and each document in the same way
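The whole pipeline can be sketched with the document vectors from the example above; the query vector is an assumption for illustration (a query containing a single occurrence of the second term).

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Document vectors from the example; q is a hypothetical query vector.
d1, d2, d3 = (1, 4), (7, 2), (6, 11)
q = (0, 1)

# Rank documents by similarity to the query, best first.
scores = sorted(((cosine(d, q), name)
                 for d, name in [(d1, "d1"), (d2, "d2"), (d3, "d3")]),
                reverse=True)
print([name for _, name in scores])  # ['d1', 'd3', 'd2']
```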