Week 3: Muddiest Point
Nowadays there are many abbreviations, such as "you"-"u", "are"-"r", "please"-"plz", and "thanks"-"tks". The abbreviated forms are generally single characters or strings that are not meaningful words on their own. If we use indexing phrases to resolve the polysemy problem, the context itself often consists of abbreviated words as well, which makes things even more confusing, for example "r u goin 2 c the film b4 tmr?" How can these words be automatically indexed?
Friday, January 24, 2014
Week 3: Reading Notes
Chapter 4
4.1 The characteristics of computer hardware relevant to an information retrieval system:
1. Caching: keeping frequently used disk data in main memory.
2. Reading a single byte from disk can take as much time as reading an entire block, because operating systems generally read and write whole blocks.
3. Data transferred from disk to memory is handled by the system bus, not by the processor.
4.2 Blocked sort-based indexing. The algorithm stores inverted blocks in files f1, . . . , fn and the merged index in fmerged. In the last step, the algorithm simultaneously merges the blocks into one large merged index.
n ← 0
while (all documents have not been processed)
do n ← n + 1
   block ← PARSENEXTBLOCK()
   BSBI-INVERT(block)
   WRITEBLOCKTODISK(block, fn)
MERGEBLOCKS(f1, . . . , fn; fmerged)
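The block-merge step can be sketched as a toy in-memory version in Python (the function names bsbi_invert and merge_blocks are mine, and real BSBI reads the sorted runs from disk files rather than from lists):

```python
import heapq
from itertools import groupby

def bsbi_invert(block):
    """BSBI-INVERT: sort a block of (term, docID) pairs."""
    return sorted(block)

def merge_blocks(blocks):
    """MERGEBLOCKS: merge the sorted runs into one inverted index,
    a dict mapping each term to its sorted postings list."""
    index = {}
    # heapq.merge interleaves the pre-sorted runs; groupby drops duplicates
    for (term, doc_id), _ in groupby(heapq.merge(*blocks)):
        index.setdefault(term, []).append(doc_id)
    return index
```

Because each run is already sorted, the merge reads every run sequentially, which is what makes the disk-based version efficient.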
4.3 Single-pass in-memory indexing. The part of the algorithm that parses documents and turns them into a stream of term–docID pairs (tokens) has been omitted.
output_file = NEWFILE()
dictionary = NEWHASH()
while (free memory available)
do token ← next(token_stream)
   if term(token) ∉ dictionary
      then postings_list = ADDTODICTIONARY(dictionary, term(token))
      else postings_list = GETPOSTINGSLIST(dictionary, term(token))
   if full(postings_list)
      then postings_list = DOUBLEPOSTINGSLIST(dictionary, term(token))
   ADDTOPOSTINGSLIST(postings_list, docID(token))
sorted_terms ← SORTTERMS(dictionary)
WRITEBLOCKTODISK(sorted_terms, dictionary, output_file)
return output_file
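A toy in-memory rendering of this loop (Python dicts and lists stand in for NEWHASH and DOUBLEPOSTINGSLIST, a token cap stands in for "free memory available", and the name spimi_invert is mine):

```python
def spimi_invert(token_stream, max_tokens=1000):
    """One SPIMI block: postings accumulated directly in a hash,
    with no global sort of the token stream."""
    dictionary = {}
    for n, (term, doc_id) in enumerate(token_stream):
        if n >= max_tokens:  # stand-in for running out of free memory
            break
        # setdefault plays both ADDTODICTIONARY and GETPOSTINGSLIST
        postings_list = dictionary.setdefault(term, [])
        if not postings_list or postings_list[-1] != doc_id:
            postings_list.append(doc_id)  # lists grow automatically
    return sorted(dictionary.items())     # SORTTERMS before writing the block
```

The key difference from BSBI is visible here: tokens go straight into per-term postings lists, and only the dictionary is sorted, once, when the block is written out.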
4.4 Distributed indexing is an application of MapReduce, a general architecture for distributed computing. The map and reduce phases of MapReduce split up the computing job into chunks that standard machines can process in a short time.
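For index construction, the map phase (the parser) emits (term, docID) pairs for its document split, and the reduce phase (the inverter) collects all docIDs for a term into one postings list. A single-process sketch, not a real MapReduce runtime (function names are mine):

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Parser: emit (term, docID) pairs for one document."""
    return [(term, doc_id) for term in text.lower().split()]

def reduce_phase(pairs):
    """Inverter: collect the docIDs for each term into a postings list."""
    postings = defaultdict(list)
    for term, doc_id in sorted(set(pairs)):  # dedupe, then group by term
        postings[term].append(doc_id)
    return dict(postings)
```

In real MapReduce the pairs emitted by the mappers are partitioned by key across many inverter machines; the sketch collapses that shuffle into a single sort.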
4.5 Dynamic indexing
Logarithmic merging. Each token (termID, docID) is initially added to the in-memory index Z0 by LMERGEADDTOKEN. LOGARITHMICMERGE initializes Z0 and indexes.

Z0 ← MERGE(Z0, {token})
if |Z0| = n
then for i ← 0 to ∞
     do if Ii ∈ indexes
           then Zi+1 ← MERGE(Ii, Zi)
                (Zi+1 is a temporary index on disk.)
                indexes ← indexes − {Ii}
           else Ii ← Zi (Zi becomes the permanent index Ii.)
                indexes ← indexes ∪ {Ii}
                BREAK
     Z0 ← ∅
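A toy version of LMERGEADDTOKEN, with indexes held as an in-memory dict mapping level i to index Ii and a small n for demonstration (in reality Z0 and the Ii live on disk):

```python
def lmerge_add_token(indexes, z0, token, n=2):
    """Add one (termID, docID) token to Z0; once |Z0| = n, cascade
    the merge down levels 0, 1, ... until a free level is found."""
    z0 = sorted(set(z0 + [token]))        # Z0 <- MERGE(Z0, {token})
    if len(z0) < n:
        return indexes, z0
    z, i = z0, 0
    while i in indexes:                   # Ii exists: merge, carry to level i+1
        z = sorted(set(indexes.pop(i) + z))
        i += 1
    indexes[i] = z                        # first free level: Zi becomes Ii
    return indexes, []                    # Z0 <- empty
```

Each index Ii holds 2^i * n entries, so a token cascades upward the way a binary counter carries, which is what bounds the number of merges logarithmically.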
Chapter 5
5.1 Statistical properties of terms in IR:
Heaps’ law: a better way of getting a handle on the vocabulary size M:
M = kT^b
(T is the number of tokens in the collection; typically 30 ≤ k ≤ 100 and b ≈ 0.5. k is quite variable because vocabulary growth depends a lot on the nature of the collection and how it is processed.)
Zipf’s law: if t1 is the most common term in the collection, t2 is the next most common, and so on, then the collection frequency cfi of the ith most common term is proportional to 1/i:
cfi ∝ 1/i
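Both laws are easy to evaluate numerically; a small sketch (the constants k = 44 and b = 0.49 are the values the chapter fits to Reuters-RCV1, not universal ones, and Zipf's law is taken here as an exact equality rather than a proportionality):

```python
def heaps_vocab_size(T, k=44, b=0.49):
    """Heaps' law M = k * T**b: estimated vocabulary size for T tokens."""
    return k * T ** b

def zipf_cf(cf1, i):
    """Zipf's law with exact proportionality: cf_i = cf_1 / i."""
    return cf1 / i
```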
5.2 Dictionary data structures that achieve increasingly higher compression ratios:
1. The simplest data structure is to sort the vocabulary lexicographically and store it in an array of fixed-width entries.
2. Further compress the dictionary by grouping the terms in the string into blocks of size k and keeping a term pointer only for the first term of each block.
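A sketch of blocking with k = 4: the sorted terms are concatenated into one string with 1-byte length prefixes, and a pointer is kept only for the first term of each block (the exact layout here is illustrative; the saving is the k − 1 pointers dropped per block):

```python
def block_dictionary(sorted_terms, k=4):
    """Dictionary-as-a-string with blocking: returns the packed string
    and the term pointers (offsets of each block's first term)."""
    string, pointers = "", []
    for j, term in enumerate(sorted_terms):
        if j % k == 0:
            pointers.append(len(string))  # pointer to the block's first term
        string += chr(len(term)) + term   # 1-byte length prefix + the term
    return string, pointers
```

To look a term up, binary-search the pointers for the right block, then walk the length prefixes through at most k terms.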
5.3
1. VB (variable byte) encoding and decoding. The functions div and mod compute integer division and remainder after integer division, respectively. PREPEND adds an element to the beginning of a list, for example, PREPEND(⟨1, 2⟩, 3) = ⟨3, 1, 2⟩. EXTEND extends a list, for example, EXTEND(⟨1, 2⟩, ⟨3, 4⟩) = ⟨1, 2, 3, 4⟩.
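The VB scheme is short enough to write out in full, following the div/mod/PREPEND structure of the pseudocode:

```python
def vb_encode_number(n):
    """Variable byte encoding of one gap: 7 payload bits per byte,
    with the high (continuation) bit set only on the last byte."""
    bytes_ = []
    while True:
        bytes_.insert(0, n % 128)  # PREPEND(bytes, n mod 128)
        if n < 128:
            break
        n = n // 128               # n div 128
    bytes_[-1] += 128              # mark the final byte of this number
    return bytes_

def vb_decode(bytestream):
    """Decode a concatenated stream of VB-encoded gaps."""
    numbers, n = [], 0
    for byte in bytestream:
        if byte < 128:             # continuation byte: accumulate 7 bits
            n = 128 * n + byte
        else:                      # final byte: finish this number
            numbers.append(128 * n + (byte - 128))
            n = 0
    return numbers
```

For example, the gap 824 encodes to the two bytes 00000110 10111000, i.e. [6, 184].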
2. γ codes. They implement variable-length encoding by splitting the representation of a gap G into a pair of length and offset.
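A sketch of the encoder: the offset is G's binary representation with its leading 1 removed, and the length is the offset's length written in unary:

```python
def gamma_encode(g):
    """Gamma code of gap G >= 1 as a bit string: unary code for the
    offset's length, then the offset (binary G minus its leading 1)."""
    offset = bin(g)[3:]               # e.g. 13 = 0b1101 -> offset '101'
    length = "1" * len(offset) + "0"  # unary: len(offset) ones, then a 0
    return length + offset
```

So γ(13) = 1110101: unary length 1110, offset 101. The degenerate case G = 1 has an empty offset and encodes as the single bit 0.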
Friday, January 17, 2014
Week 2: Reading Notes
FOA
1.1
FOA (Finding Out About) is the empirical grounding for the theories referred to as Information Retrieval. There are three steps:
1. Asking a question;
similar to a QUERY: the user's information need.
2. Constructing an answer;
retrieval of documents from the corpus in response to the QUERY.
3. Assessing the answer;
relevance feedback on how relevant the provided answer is, that is, how well the descriptive features mentioned by users in their queries match the documents sharing those same features.
IES
1.1
What is Information Retrieval?
1. Web Search: implementing and evaluating relevance ranking algorithms under the variety of contexts and requirements of a Web search engine.
2. Other Search Applications: a desktop search engine provides search and browsing facilities for files stored on a local hard disk and possibly on disks connected over a local network. These need shorter index creation times and greater awareness of file formats.
3. Other IR Applications: storage, manipulation, and retrieval of human-language data; more about categorization and filtering.
1.2
Information Retrieval System
1. Basic Architecture
Information need (user) -> translated into a query for the search engine -> the search engine manipulates an inverted index for the document collection -> the search engine returns a ranked list of results (performing relevance ranking).
2. Update Documents
An update is viewed as a deletion of the old page plus an addition of the new page.
3. Perform Evaluation
PRP (Probability Ranking Principle): If an IR system’s response to each query is a ranking of the documents in the collection in order of decreasing probability of relevance, then the overall effectiveness of the system to its users will be maximized.
MIR
1.1
History of Information Retrieval: its adoption in the Library Science area, and how it is currently at the center of the stage alongside other technologies.
1.2
The IR problem: to retrieve all the documents that are relevant to a user query while retrieving as few non-relevant documents as possible.
1.3
High-level software architecture of an IR system. Additional module: crawling. The process of indexing, retrieval, and ranking of documents.
1.4 History of the Web and how the Web changed search.
Friday, January 10, 2014
Week 1: Muddiest Point
How can a search engine such as Google retrieve information across different languages based on one query?
Week 1: Reading Notes
Chapter 1 Section 1.2
1. How to build an inverted index, core step: sorting the list so that the terms are alphabetical.
1) Collect the documents to be indexed.
2) Tokenize the text, turning each document into a list of tokens.
3) Do linguistic preprocessing, producing a list of normalized tokens, which are the indexing terms.
4) Index the documents that each term occurs in by creating an inverted index, consisting of a dictionary and postings.
2. For an in-memory postings list, two good alternatives are singly linked lists or variable-length arrays.
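The four steps can be condensed into a few lines of Python (lowercasing stands in for the full linguistic preprocessing of step 3, and the function name is mine):

```python
def build_inverted_index(docs):
    """Build an inverted index from docs, a dict mapping docID -> text.
    Returns a dict mapping each term to its sorted postings list."""
    index = {}
    for doc_id, text in sorted(docs.items()):    # visit docs in docID order
        for term in set(text.lower().split()):   # each (term, doc) only once
            index.setdefault(term, []).append(doc_id)
    return index
```

Processing documents in docID order keeps every postings list sorted without an explicit sort, which is what makes list intersection for Boolean queries linear.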
Chapter 2
2.1 Convert the byte sequence into a linear sequence of characters, and choose a suitable size for the document units together with a proper way of dividing or aggregating files.
2.2 How to determine the vocabulary of terms:
1) Divide the character sequence of each document unit into tokens
2) Drop stop words
3) Canonicalize tokens so that matches occur despite superficial differences in the character sequences of the tokens
4) Stemming and lemmatization
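Steps 1–3 of determining the vocabulary might look like this (the tiny stop list and the regex tokenizer are illustrative only, and no real stemmer or lemmatizer is included):

```python
import re

STOP_WORDS = frozenset({"the", "a", "an", "of", "to"})

def terms_from_text(text):
    """Tokenize, case-fold, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())  # tokenize + case-fold
    return [t for t in tokens if t not in STOP_WORDS]  # drop stop words
```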
2.3 Extensions to postings list data structures and ways to increase efficiency.
Chapter 3
3.1 Search structures for dictionaries: Hashing and Search Trees.
3.2 When the user is uncertain of the spelling of a query term, or is aware of multiple spelling variants of a term and wants documents containing any of them, wildcard queries can be used.
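One standard structure for wildcard queries is the permuterm index: every rotation of term + $ (an end-of-term marker) is stored as a key pointing back to the term. A minimal sketch (the function name is mine):

```python
def permuterm_rotations(term):
    """All rotations of term + '$'; in a real index each rotation is a
    B-tree key whose value is the original term."""
    t = term + "$"
    return [t[i:] + t[:i] for i in range(len(t))]
```

A query like hel* is rewritten by appending $ and rotating the * to the end, giving $hel*, so it becomes the prefix lookup $hel against the stored rotations.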
3.3 Algorithms for spelling correction: Edit Distance, K-gram overlap
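Edit distance is computed with the usual dynamic-programming recurrence; a compact two-row sketch with unit costs for insertion, deletion, and substitution:

```python
def edit_distance(s1, s2):
    """Levenshtein distance between s1 and s2, keeping only two DP rows."""
    prev = list(range(len(s2) + 1))  # distances from "" to prefixes of s2
    for i, c1 in enumerate(s1, 1):
        curr = [i]                   # distance from s1[:i] to ""
        for j, c2 in enumerate(s2, 1):
            curr.append(min(prev[j] + 1,                # delete c1
                            curr[j - 1] + 1,            # insert c2
                            prev[j - 1] + (c1 != c2)))  # substitute
        prev = curr
    return prev[-1]
```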
3.4 Generate a “phonetic hash” for each term so that similar-sounding terms hash to the same value.
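The best-known phonetic hash is Soundex; a sketch of the classic 4-character variant (retain the first letter, map the remaining letters to digit classes, collapse runs of identical digits, drop the vowel class, and pad with zeros):

```python
# digit classes: vowels/h/w/y -> 0, b/f/p/v -> 1, c/g/j/k/q/s/x/z -> 2, ...
CODES = {c: d for d, letters in enumerate(
    ["aeiouhwy", "bfpv", "cgjkqsxz", "dt", "l", "mn", "r"]) for c in letters}

def soundex(term):
    """Classic 4-character Soundex hash, e.g. 'herman' -> 'H655'."""
    term = term.lower()
    digits = [str(CODES.get(c, 0)) for c in term]
    # collapse runs of identical digits (including the first letter's digit)
    collapsed = [d for i, d in enumerate(digits) if i == 0 or d != digits[i - 1]]
    code = "".join(d for d in collapsed[1:] if d != "0")  # drop the zeros
    return (term[0].upper() + code + "000")[:4]           # pad / truncate to 4
```

Similar-sounding names then collide as intended: herman and hermann both hash to H655.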