Introduction to Boolean Retrieval
Bag-of-Words (BOW)
Assumption
A document is a collection of words
e.g.
- Doc1: Mary married John
- Doc2: John married Mary
- These two documents are the same under BoW assumption
We will use the BoW assumption throughout the IR part (word order is ignored). The NLP part will cover approaches that take ordering into account.
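The Doc1/Doc2 example above can be checked directly: under BoW, a document reduces to a multiset of its words, so the two documents compare equal.

```python
from collections import Counter

# Under the bag-of-words assumption, a document is just a multiset of terms,
# so word order is discarded and these two documents become identical.
doc1 = "Mary married John".split()
doc2 = "John married Mary".split()

print(Counter(doc1) == Counter(doc2))  # True
```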
Field (Zone) in Document
- A document is semi-structured data
- e.g., Title, Author, Published Date, Body
- Someone may want to limit the search scope to a certain field
Basic Field Index
As the figure shows, the title and body text are indexed separately; each index entry points to the IDs of the documents containing the term.
Field in Posting
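A minimal sketch of the field-in-posting idea, using a hypothetical toy index (the terms, documents, and `search` helper are assumptions for illustration): each posting stores the field alongside the document ID, so a query can be restricted to one field.

```python
# Hypothetical posting lists where each posting is a (doc_id, field) pair.
index = {
    "retrieval": [(1, "title"), (1, "body"), (2, "body")],
    "boolean":   [(2, "title")],
}

def search(term, field=None):
    """Return matching doc IDs, optionally restricted to one field."""
    postings = index.get(term, [])
    if field is None:
        return sorted({doc for doc, f in postings})
    return sorted({doc for doc, f in postings if f == field})

print(search("retrieval"))           # [1, 2]
print(search("retrieval", "title"))  # [1]
```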
Limitations of Boolean Retrieval
- Thus far, our queries have all been Boolean: documents either match or they don’t.
- Good for expert users with precise understanding of their needs and the collection
- Not good for the majority of users
— Most users are incapable of writing Boolean queries
— Or they can, but they think it’s too much work
- Boolean queries often result in either too few or too many results
— Query1: “bluetooth pairing iphone” => 100,000 hits
— Query2: “bluetooth pairing iphone sony mdr-xb50” => 0 hits
Ranked Retrieval
Concept
Given a query, rank documents according to some criterion so that the “best” results appear early in the result list displayed to the user. (Simply put: we estimate how relevant each document is to the query.)
The goal of ranked retrieval is to find a scoring function
$Score(d, q)$, where $d$ is a document and $q$ is a query.
Advantage
When a system produces a ranked result set, large result sets are not a problem: we can simply show the top K results (say, 10 or so) without overwhelming the user.
Weighted Fields Approach
- Advanced search is for experts
– Still, the majority of users use a set of keywords as a query
- The importance of a term is not the same everywhere
– Terms in the headline of a news article are more important than terms in the main text.
Scoring with Weighted Fields
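One common way to realize this idea is weighted zone (field) scoring: assign each field a weight, and score a document by summing the weights of the fields that contain the query term. The field names and weights below are assumptions for illustration, not values from the source.

```python
# Assumed field weights (summing to 1): title matches count more than body matches.
FIELD_WEIGHTS = {"title": 0.6, "body": 0.4}

def weighted_zone_score(doc_fields, term):
    """doc_fields maps field name -> list of tokens in that field.
    Score = sum of weights of the fields that contain the term."""
    return sum(w for field, w in FIELD_WEIGHTS.items()
               if term in doc_fields.get(field, []))

doc = {"title": ["ranked", "retrieval"],
       "body":  ["boolean", "retrieval", "basics"]}

print(weighted_zone_score(doc, "retrieval"))  # in title and body -> 0.6 + 0.4
print(weighted_zone_score(doc, "boolean"))    # body only -> 0.4
```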
Rank by Term Frequency
Definition
$tf_{t,d}$ is the number of occurrences of term t in document d.
- Rank based on the frequency of query terms in documents
- Let q be a set of query terms (t1, t2, …, tm); the term frequency score of document d given query q is
$Score(d, q) = \sum_{i=1}^{m} tf_{t_i, d}$
TF Rank Example
- If our query is “car insurance”, then score of each document is:
– Score(doc1, q) = $tf_{car,doc1}$ + $tf_{insurance,doc1}$ = 3
– Score(doc2, q) = $tf_{car,doc2}$ + $tf_{insurance,doc2}$ = 5
- The rank of doc2 is higher than that of doc1
- Every query term has an equal importance
– What if insurance is more important than car?
Importance of Terms
- In reality, every term has a different weight
– e.g., A collection of documents on the auto industry is likely to have the term car in almost every document.
- How to mitigate the effect of terms that occur too often in the collection?
– Use document frequency of term.
Document Frequency (DF)
Definition
Document Frequency $df_{t}$ : the number of documents in the collection that contain term t.
- df is a good way to measure an importance of a term.
– High frequency => not important (like stopwords)
– Low frequency => important
- Why not collection frequency? (The total number of occurrences of a term in the collection.) As the figure shows, although the cf of the two terms is the same, their df differs greatly: try may appear in most of the documents, while insurance appears only in relevant documents, possibly many times in each.
Inverse Document Frequency (IDF)
Definition
Let $df_{t}$ be the number of documents in the collection that contain a term t. The IDF can be defined as follows:
$idf_{t} = \log_{10}(N / df_{t})$, where $N$ is the total number of documents.
- The idf of a rare term is high, whereas the idf of a frequent term is likely to be low.
– E.g., Let N = 100, $df_{car}$ = 60, $df_{insurance}$ = 10
– $idf_{car} = \log_{10}(100/60) \approx 0.22$, $idf_{insurance} = \log_{10}(100/10) = 1$
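The idf computation is a one-liner; the numbers below reproduce the example (N = 100, df_car = 60, df_insurance = 10).

```python
import math

def idf(N, df_t):
    """Inverse document frequency: idf_t = log10(N / df_t)."""
    return math.log10(N / df_t)

print(round(idf(100, 60), 2))  # 0.22 -- frequent term, low idf
print(idf(100, 10))            # 1.0  -- rarer term, higher idf
```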
TF-IDF
Definition
The tf-idf weight of term t in document d is as follows:
$tf\text{-}idf_{t,d} = tf_{t,d} \times idf_{t}$
- With the tf-idf weighting scheme, the score of document d given query q = (t1, t2, …, tm) is
$Score(d, q) = \sum_{i=1}^{m} tf\text{-}idf_{t_i, d}$
Example
- Given query “car insurance”, then score of each document is:
– Score(doc1, q) = $tf\text{-}idf_{car,doc1}$ + $tf\text{-}idf_{insurance,doc1}$ = 2.22
– Score(doc2, q) = $tf\text{-}idf_{car,doc2}$ + $tf\text{-}idf_{insurance,doc2}$ = 1.1
- Unlike the tf-based scoring approach, the score of doc1 is greater than that of doc2.
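A sketch putting tf and idf together; the document tokens are assumptions chosen to match the counts implied by the example (doc1: tf_car = 1, tf_insurance = 2; doc2: tf_car = 5), with N = 100 and the df values from the idf example.

```python
import math
from collections import Counter

N = 100
df = {"car": 60, "insurance": 10}  # document frequencies from the idf example

docs = {
    "doc1": "car insurance insurance".split(),
    "doc2": "car car car car car auto".split(),
}

def tfidf_score(doc_tokens, query_terms):
    """Score(d, q) = sum over query terms of tf_{t,d} * log10(N / df_t)."""
    tf = Counter(doc_tokens)
    return sum(tf[t] * math.log10(N / df[t]) for t in query_terms)

q = ["car", "insurance"]
print(round(tfidf_score(docs["doc1"], q), 2))  # 2.22
print(round(tfidf_score(docs["doc2"], q), 2))  # 1.11 (the notes round to 1.1)
```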
Limitation of tf-idf scoring
- tf-idf still heavily relies on the frequency of terms.
Assume
– $tf_{car,doc1}$ = 20
– $tf_{car,doc2}$ = 1
– If our query contains car, the tf score of doc1 is 20 times that of doc2.
- Is there really a big difference between frequency 10 and 20?
- Score linearly increases with respect to frequency of term
- After a certain frequency, the absolute frequency isn’t important.
Sublinear tf scaling
- Use logarithmically weighted term frequency (wf):
$wf_{t,d} = 1 + \log_{10} tf_{t,d}$ if $tf_{t,d} > 0$, and $0$ otherwise
- Logarithmic term frequency version of tf-idf:
$wf\text{-}idf_{t,d} = wf_{t,d} \times idf_{t}$
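Sublinear scaling can be sketched as below (assuming base-10 logarithms, consistent with the idf definition in these notes). Note how doubling the raw frequency from 10 to 20 barely moves the weight.

```python
import math

def wf(tf):
    """Sublinear tf scaling: 1 + log10(tf) for tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0

print(wf(1))   # 1.0
print(wf(10))  # 2.0
print(wf(20))  # ~2.30 -- 20 occurrences score only slightly above 10
```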
Limitation of tf-idf/wf-idf scoring
- Assume we have document d
- We create a new document d’ by appending a copy of d to itself (d’ = d X 2).
- d’ is exactly as relevant as d for every query, yet their scores differ!
– $Score_{tf\text{-}idf}(d', q) \ge Score_{tf\text{-}idf}(d, q)$
– $Score_{wf\text{-}idf}(d', q) \ge Score_{wf\text{-}idf}(d, q)$
– Both scoring schemes prefer longer documents.
Maximum tf normalization
- Let $tf_{max}(d)$ be the maximum frequency of document d
- Normalized term frequency is defined as
$ntf_{t,d} = \alpha + (1 - \alpha)\,\dfrac{tf_{t,d}}{tf_{max}(d)}$, where $\alpha$ is a smoothing term with $0 < \alpha < 1$
- Maximum value of ntf is 1
- Minimum value of ntf is α
- This approach has limitations too (not covered here).
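A minimal sketch of maximum tf normalization. The choice α = 0.4 and the toy document are assumptions for illustration; the formula matches the definition above, so the most frequent term gets ntf = 1 and rare terms stay near α.

```python
from collections import Counter

def ntf(tf_td, tf_max, alpha=0.4):
    """Maximum tf normalization: alpha + (1 - alpha) * tf / tf_max."""
    return alpha + (1 - alpha) * tf_td / tf_max

doc = "car car car insurance auto".split()
tf = Counter(doc)
tf_max = max(tf.values())  # 3, from the term "car"

print(ntf(tf["car"], tf_max))        # 1.0 -- most frequent term
print(ntf(tf["insurance"], tf_max))  # 0.6 -- closer to alpha
```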
Document as Vectors
- Given a term-document matrix
– a document can be represented as a vector of length V
– V = size of vocabulary
- Document vectors:
– d1 = (1, 4), d2 = (7, 2), d3 = (6, 11)
Document Similarity in Vector Space
- Plot document vectors in vector space
- How to find similar documents in vector space?
– Distance from vector to vector
– Angle difference between vectors
Angle Difference
Cosine similarity:
$\cos(d_1, d_2) = \dfrac{d_1 \cdot d_2}{|d_1|\,|d_2|}$
- Standard way of quantifying similarity between documents
– 1 if directions of two vectors are the same
– 0 if directions of two vectors are orthogonal
Query as Document
- A query can be converted to a vector too
- Compute the similarity between the query and each document in the same way
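The whole pipeline can be sketched with the document vectors from the example above; the query vector is an assumption for illustration (a query containing a single occurrence of the second term).

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Document vectors from the example; q is a hypothetical query vector.
d1, d2, d3 = (1, 4), (7, 2), (6, 11)
q = (0, 1)

# Rank documents by similarity to the query, best first.
scores = sorted(((cosine(d, q), name)
                 for d, name in [(d1, "d1"), (d2, "d2"), (d3, "d3")]),
                reverse=True)
print([name for _, name in scores])  # ['d1', 'd3', 'd2']
```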