Introduction to Information Retrieval
What is information retrieval
Concept
- Google Defination
Information retrieval is the science of searching for information in a document, searching for documents themselves, and also searching for the metadata that describes data, and for databases of texts, images or sounds.
- Textbook Defination
Textbook
Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
Case
- Web search
- E-mail search
- Searching your laptop
- Corporate knowledge bases (企业知识库)
- Image search, video search
Why we need information retrieval
Information overload
If there are too much information, it would make the person have difficulty to understand an issue and make wrong decision.
How to perform information retrieval
Basic Assumptions
- Collection
A set of documents (Assume it as a static collection for the moment) - Goal
Retrieve documents with information that is relevant to the user’s information need and helps the user complete a task (User’s information need is often underspecified
)
Classic Search Model
In the process, we need to continously refine the query to obtain better result. Like first we search “CONNECT BLUETOOTH HEADPHONE”, then we find query “PAIRING BLUETOOTH HEADPHONE” is better. That is Query Refinement.
Key Objectives
Effectiveness
There are more than 130 trillion pages are indexed by Google. So effectiveness is important.
Accuracy
We need to find top 10 pages from 130 trillion pages.
IR vs NLP
IR | NLP |
---|---|
Computational approaches | Cognitive, symbolic and computational approaches |
Statistical (shallow) understanding of language | Semantic (deep) understanding of language |
Handle large scale problems | (often times) small scale problems |
IR and NLP are getting closer
IR => NLP | NLP => IR |
---|---|
Larger data collections | Deep analysis of text documents and queries |
Scalable/robust NLP techniques, e.g., translation models | Information extraction for structured IR tasks |
Search with Boolean query
Procedures
- Lookup query term in the dictionary
- Retrieve the posting lists
- Operation
- AND : intersect the posting lists
- OR : union the posting list
- NOT : diff the posting list
- E.g., “obama” AND “healthcare” NOT “news” . It will intersect the querying result of “obama” and “healthcare”, then throw the parts containing of “news”.
Retrieval procedure in modern IR
- Boolean model provides all the ranking candidates
- Locate documents satisfying Boolean condition
- E.g., “obama healthcare” -> “obama” OR “healthcare”
- Rank candidates by relevance
- Older ranking candidate means that it will return old matching documents
- Efficiency consideration
- Top-k retrieval (Google)
Term-document incidence matrices
We could see in the figure, the left column is the terms and the top row is the play. If the play contains the term, it will record as 1, and if not, it will record as 0. For example, we want to query Brutus AND Caesar BUT NOT Calpurnia. We will focus on these three terms as below.
So for each column, it should be 110. If it satisfies this, we will record as 1, else 0. So the record is 100100.
Efficiency
- Bigger Collections
- 1 million documents
- Each 1,000 words long
- Avg 6 bytes/word including spaces/punctuation
- 6GB of data in the documents.
- Assume there are M = 500K distinct terms among these.
- Corresponds to a matrix with 500 billion entries
- But it has no more than one billion 1’s
- Extremely sparse matrix!
So this method is not very efficient.
Inverted Index
Concept
Inverted Index consists of a dictionary and postings
- Dictionary : a set of unique terms
- Posting : variable-size array that keeps the list of documents given terms
- For each term t, we must store a list of all documents that contain t. (Identify each doc by a docID)
- We need variable-size postings lists
Inverted Index Construction
Initial Stages of Text Processing
Brief Summary
- Tokenization
- Cut character sequence into word tokens
- e.g. “I am a boy” => “I”, “am”, “a”, “boy”
- Normalization
- Map text and query term to same form
- e.g. “U.S.A.” equals to “USA”
- Stemming
- We may wish different forms of a root to match
- e.g. “authorize”, “authorization”
- Stop words
- We may omit very common words (or not)
- e.g. “the”, “a”, “of”, “to”
Indexer Step
- Token sequence
- Scan documents for indexable terms
- Keep list of (token, docID) pairs.
- Sort tuples by terms (and then docID)
- Merging
- Multiple term entries in a single document are merged
- Split into Dictionary and Postings
- Doc. frequency information is added.
Boolean Retrieval with Inverted Index
- Easy to retrieve all documents containing term t
- How to find documents containing BRUTUS AND CEASAR using postings?
- Linear time retrieval algorithm
A Step back
- Tokenization
- Stopwords
- Normalization
- Stemming and Lemmatization
I have introduced in the before blog Kaggle比赛第一次尝试