Introduction to Information Retrieval

What is information retrieval

Concept

Google Defination
Information retrieval is the science of searching for information in a document, searching for documents themselves, and also searching for the metadata that describes data, and for databases of texts, images or sounds.

Textbook Defination
Textbook
Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).

Case

Web search
E-mail search
Searching your laptop
Corporate knowledge bases (企业知识库)
Image search, video search

Why we need information retrieval

Information overload

If there are too much information, it would make the person have difficulty to understand an issue and make wrong decision.

How to perform information retrieval

Basic Assumptions

Collection
A set of documents (Assume it as a static collection for the moment)
Goal
Retrieve documents with information that is relevant to the user’s information need and helps the user complete a task (User’s information need is often underspecified
)

Classic Search Model

In the process, we need to continously refine the query to obtain better result. Like first we search “CONNECT BLUETOOTH HEADPHONE”, then we find query “PAIRING BLUETOOTH HEADPHONE” is better. That is Query Refinement.

Key Objectives

Effectiveness

There are more than 130 trillion pages are indexed by Google. So effectiveness is important.

Accuracy

We need to find top 10 pages from 130 trillion pages.

IR vs NLP

IR	NLP
Computational approaches	Cognitive, symbolic and computational approaches
Statistical (shallow) understanding of language	Semantic (deep) understanding of language
Handle large scale problems	(often times) small scale problems

IR and NLP are getting closer

IR => NLP	NLP => IR
Larger data collections	Deep analysis of text documents and queries
Scalable/robust NLP techniques, e.g., translation models	Information extraction for structured IR tasks

Search with Boolean query

Procedures

Lookup query term in the dictionary
Retrieve the posting lists
Operation
- AND : intersect the posting lists
- OR : union the posting list
- NOT : diff the posting list
E.g., “obama” AND “healthcare” NOT “news” . It will intersect the querying result of “obama” and “healthcare”, then throw the parts containing of “news”.

Retrieval procedure in modern IR

Boolean model provides all the ranking candidates
- Locate documents satisfying Boolean condition
- E.g., “obama healthcare” -> “obama” OR “healthcare”
Rank candidates by relevance
- Older ranking candidate means that it will return old matching documents
Efficiency consideration
- Top-k retrieval (Google)

Term-document incidence matrices

We could see in the figure, the left column is the terms and the top row is the play. If the play contains the term, it will record as 1, and if not, it will record as 0. For example, we want to query Brutus AND Caesar BUT NOT Calpurnia. We will focus on these three terms as below.

So for each column, it should be 110. If it satisfies this, we will record as 1, else 0. So the record is 100100.

Efficiency

Bigger Collections
- 1 million documents
- Each 1,000 words long
Avg 6 bytes/word including spaces/punctuation
- 6GB of data in the documents.
Assume there are M = 500K distinct terms among these.
- Corresponds to a matrix with 500 billion entries
- But it has no more than one billion 1’s
- Extremely sparse matrix!

So this method is not very efficient.

Inverted Index

Concept

Inverted Index consists of a dictionary and postings

Dictionary : a set of unique terms
Posting : variable-size array that keeps the list of documents given terms
For each term t, we must store a list of all documents that contain t. (Identify each doc by a docID)
We need variable-size postings lists

Inverted Index Construction

Initial Stages of Text Processing

Brief Summary

Tokenization
- Cut character sequence into word tokens
- e.g. “I am a boy” => “I”, “am”, “a”, “boy”
Normalization
- Map text and query term to same form
- e.g. “U.S.A.” equals to “USA”
Stemming
- We may wish different forms of a root to match
- e.g. “authorize”, “authorization”
Stop words
- We may omit very common words (or not)
- e.g. “the”, “a”, “of”, “to”

Indexer Step

Token sequence
- Scan documents for indexable terms
- Keep list of (token, docID) pairs.
Sort tuples by terms (and then docID)
Merging
- Multiple term entries in a single document are merged
- Split into Dictionary and Postings
- Doc. frequency information is added.

Boolean Retrieval with Inverted Index

Easy to retrieve all documents containing term t
How to find documents containing BRUTUS AND CEASAR using postings?
- Linear time retrieval algorithm

A Step back

Tokenization
Stopwords
Normalization
Stemming and Lemmatization

I have introduced in the before blog Kaggle比赛第一次尝试

永缘空的博客

COMP4650(Document Analysis) Information Retrieval

Introduction to Information Retrieval

What is information retrieval

Concept

Case

Why we need information retrieval

Information overload

How to perform information retrieval

Basic Assumptions

Classic Search Model

Key Objectives

Effectiveness

Accuracy

IR vs NLP

IR and NLP are getting closer

Search with Boolean query

Procedures

Retrieval procedure in modern IR

Term-document incidence matrices

Efficiency

Inverted Index

Concept

Inverted Index Construction

Initial Stages of Text Processing

Brief Summary

Indexer Step

Boolean Retrieval with Inverted Index

A Step back