COMP4650 (Document Analysis): Information Retrieval

Introduction to Information Retrieval

What is information retrieval?

Concept

  • Google Definition
    Information retrieval is the science of searching for information in a document, searching for documents themselves, and also searching for the metadata that describes data, and for databases of texts, images or sounds.
  • Textbook Definition
    Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).

Case

  • Web search
  • E-mail search
  • Searching your laptop
  • Corporate knowledge bases
  • Image search, video search

Why do we need information retrieval?

Information overload

When there is too much information, it becomes difficult for a person to understand an issue, and wrong decisions become more likely.

How to perform information retrieval

Basic Assumptions

  • Collection
    A set of documents (assume a static collection for the moment)
  • Goal
    Retrieve documents with information that is relevant to the user’s information need and helps the user complete a task (the user’s information need is often underspecified)

Classic Search Model

In this process, we continuously refine the query to obtain better results. For example, we first search “CONNECT BLUETOOTH HEADPHONE”, then find that the query “PAIRING BLUETOOTH HEADPHONE” works better. This is query refinement.

Key Objectives

Effectiveness

More than 130 trillion pages are indexed by Google, so effectiveness is important.

Accuracy

We need to find the top 10 most relevant pages out of those 130 trillion.

IR vs NLP

| IR | NLP |
| --- | --- |
| Computational approaches | Cognitive, symbolic and computational approaches |
| Statistical (shallow) understanding of language | Semantic (deep) understanding of language |
| Handle large-scale problems | (Often) small-scale problems |

IR and NLP are getting closer

| IR => NLP | NLP => IR |
| --- | --- |
| Larger data collections | Deep analysis of text documents and queries |
| Scalable/robust NLP techniques, e.g., translation models | Information extraction for structured IR tasks |

Search with Boolean query

Procedures

  • Lookup query term in the dictionary
  • Retrieve the posting lists
  • Operation
    • AND : intersect the posting lists
    • OR : union the posting lists
    • NOT : diff the posting lists
  • E.g., “obama” AND “healthcare” NOT “news”: intersect the posting lists of “obama” and “healthcare”, then remove the documents that contain “news”.
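The three operations above can be sketched as set operations over posting lists. The posting lists and docIDs below are made up for illustration:

```python
# Sketch of Boolean operations over posting lists (hypothetical data).
# A posting list is a sorted list of docIDs; AND/OR/NOT map to
# intersection, union, and difference.

postings = {
    "obama":      [1, 2, 4, 11, 31, 45],
    "healthcare": [2, 4, 8, 16, 31, 45],
    "news":       [4, 16, 31],
}

def p_and(a, b):
    return sorted(set(a) & set(b))   # intersection

def p_or(a, b):
    return sorted(set(a) | set(b))   # union

def p_not(a, b):
    return sorted(set(a) - set(b))   # docs in a that are NOT in b

# "obama" AND "healthcare" NOT "news"
result = p_not(p_and(postings["obama"], postings["healthcare"]),
               postings["news"])
print(result)  # → [2, 45]
```

A production system would instead walk the sorted lists directly (see the merge algorithm later) rather than build Python sets, but the semantics are the same.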

Retrieval procedure in modern IR

  1. Boolean model provides all the ranking candidates
    • Locate documents satisfying Boolean condition
    • E.g., “obama healthcare” -> “obama” OR “healthcare”
  2. Rank candidates by relevance
    • Candidates are ordered by relevance, so the best-matching documents are returned first
  3. Efficiency consideration
    • Top-k retrieval (Google)

Term-document incidence matrices

As the figure shows, the left column lists the terms and the top row lists the plays. If a play contains a term, the entry is 1; otherwise it is 0. For example, to answer the query Brutus AND Caesar BUT NOT Calpurnia, we focus on the three rows for these terms.

For each play’s column, we check whether the pattern for (Brutus, Caesar, Calpurnia) is 110. If it is, the answer bit for that play is 1, else 0. The resulting answer vector is 100100.
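The bit-vector computation can be sketched directly, using the term vectors from the classic textbook figure (Brutus = 110100, Caesar = 110111, Calpurnia = 010000 over six plays):

```python
# Sketch of the incidence-matrix query as bitwise operations.
# Vectors are the ones from the figure, encoded as integers.

def bits(s):
    """Parse a row of the incidence matrix, e.g. '110100' -> int."""
    return int(s, 2)

N = 6  # number of plays (columns)
brutus    = bits("110100")
caesar    = bits("110111")
calpurnia = bits("010000")

# Brutus AND Caesar AND NOT Calpurnia
mask = (1 << N) - 1                      # keep only N bits after complement
answer = brutus & caesar & (~calpurnia & mask)
print(format(answer, f"0{N}b"))  # → 100100
```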

Efficiency

  • Bigger collections
    • 1 million documents
    • Each 1,000 words long
    • Avg 6 bytes/word including spaces/punctuation
    • => 6 GB of data in the documents
  • Assume there are M = 500K distinct terms among these.
    • Corresponds to a matrix with 500 billion entries
    • But it has no more than one billion 1’s
    • Extremely sparse matrix!

So this method is not very efficient.

Inverted Index

Concept

Inverted Index consists of a dictionary and postings

  • Dictionary : the set of unique terms
  • Postings list : a variable-size array that keeps the list of documents for a given term
  • For each term t, we must store a list of all documents that contain t. (Identify each doc by a docID)
  • We need variable-size postings lists
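The structure above can be sketched as a dictionary mapping each term to its posting list (the toy documents below are invented):

```python
# Minimal sketch of an inverted index: term -> sorted posting list of
# docIDs. Documents are hypothetical one-line texts.
from collections import defaultdict

docs = {
    1: "new home sales top forecasts",
    2: "home sales rise in july",
    3: "increase in home sales in july",
}

index = defaultdict(list)
for doc_id, text in sorted(docs.items()):   # ascending docID keeps lists sorted
    for term in set(text.split()):          # dedupe terms within a document
        index[term].append(doc_id)

print(sorted(index["home"]))  # → [1, 2, 3]
print(sorted(index["july"]))  # → [2, 3]
```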

Inverted Index Construction

Initial Stages of Text Processing

Brief Summary

  1. Tokenization
    • Cut character sequence into word tokens
    • e.g. “I am a boy” => “I”, “am”, “a”, “boy”
  2. Normalization
    • Map text and query term to same form
    • e.g. “U.S.A.” matches “USA”
  3. Stemming
    • We may wish different forms of a root to match
    • e.g. “authorize” and “authorization” reduce to the same stem
  4. Stop words
    • We may omit very common words (or not)
    • e.g. “the”, “a”, “of”, “to”
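The four stages above can be sketched as a small pipeline. The stop-word list and the suffix-stripping stemmer here are toy placeholders (a real system would use a proper stemmer such as Porter’s):

```python
# Rough sketch of the four preprocessing stages, stdlib only.
# STOPWORDS and the suffix list are toy choices, not a real configuration.
import re

STOPWORDS = {"the", "a", "an", "of", "to", "in", "i", "am"}

def stem(token):
    # toy stemmer: strip a few common suffixes
    for suffix in ("ization", "ations", "ation", "ize", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    text = text.lower().replace(".", "")                # normalization: U.S.A. -> usa
    tokens = re.findall(r"[a-z0-9]+", text)             # tokenization
    tokens = [t for t in tokens if t not in STOPWORDS]  # stop-word removal
    return [stem(t) for t in tokens]                    # stemming

print(preprocess("Authorization of the U.S.A. authorize"))
# → ['author', 'usa', 'author']
```

Note how “authorization” and “authorize” end up as the same stem, so a query with one form will match documents using the other.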

Indexer Step

  1. Token sequence
    • Scan documents for indexable terms
    • Keep list of (token, docID) pairs.
  2. Sort tuples by terms (and then docID)
  3. Merging
    • Multiple term entries in a single document are merged
    • Split into Dictionary and Postings
    • Doc. frequency information is added.
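The three indexer steps can be sketched end to end (the two toy documents are invented; real indexers process the pairs externally because they don’t fit in memory):

```python
# Sketch of the indexer: collect (term, docID) pairs, sort them,
# then merge into dictionary + postings with document frequencies.
docs = {1: "i did enact julius caesar", 2: "so let it be with caesar"}

# Step 1: token sequence as (term, docID) pairs
pairs = [(term, doc_id)
         for doc_id, text in docs.items()
         for term in text.split()]

# Step 2: sort by term, then by docID
pairs.sort()

# Step 3: merge duplicates; each term maps to its posting list
index = {}
for term, doc_id in pairs:
    postings = index.setdefault(term, [])
    if not postings or postings[-1] != doc_id:  # merge repeats within a doc
        postings.append(doc_id)

df = len(index["caesar"])        # document frequency = posting-list length
print(index["caesar"], df)  # → [1, 2] 2
```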

Boolean Retrieval with Inverted Index

  • Easy to retrieve all documents containing term t
  • How to find documents containing BRUTUS AND CAESAR using postings?
    • Linear time retrieval algorithm
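The linear-time algorithm is the classic two-pointer merge over sorted posting lists: repeatedly compare the current docIDs and advance the pointer with the smaller one. The posting lists below are made up for illustration:

```python
# Sketch of linear-time posting-list intersection ("merge" algorithm).
# Both inputs must be sorted by docID; runs in O(len(p1) + len(p2)).

def intersect(p1, p2):
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:        # docID in both lists -> keep it
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:       # advance the pointer at the smaller docID
            i += 1
        else:
            j += 1
    return answer

brutus = [1, 2, 4, 11, 31, 45, 173, 174]
caesar = [1, 2, 4, 5, 6, 16, 57, 132]
print(intersect(brutus, caesar))  # → [1, 2, 4]
```

This is why postings are kept sorted by docID: sortedness is what lets AND run in time linear in the list lengths instead of quadratic.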

A Step back

  • Tokenization
  • Stopwords
  • Normalization
  • Stemming and Lemmatization

These were introduced in an earlier blog post, Kaggle比赛第一次尝试 (First attempt at a Kaggle competition).
