Friday, January 24, 2014

Week 3: Reading Notes


Chapter 4
4.1 Characteristics of the computer hardware that an information retrieval system runs on:
1. Caching— keeping frequently used disk data in main memory.
2. Reading a single byte from disk can take as much time as reading the entire block because operating systems generally read and write entire blocks.
3. Data transferred from disk to memory are handled by the system bus, not by the processor.

4.2 Blocked sort-based indexing. The algorithm stores inverted blocks in files f1, . . . , fn and the merged index in fmerged.
In the last step, the algorithm simultaneously merges the ten blocks into one large merged index.
BSBIINDEXCONSTRUCTION()
  n ← 0
  while (all documents have not been processed)
  do n ← n + 1
     block ← PARSENEXTBLOCK()
     BSBI-INVERT(block)
     WRITEBLOCKTODISK(block, fn)
  MERGEBLOCKS(f1, . . . , fn; fmerged)
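A minimal in-memory sketch of the two core steps, assuming each block of (term, docID) pairs fits in memory; Python lists stand in for the on-disk block files f1, . . . , fn:

```python
def bsbi_invert(pairs):
    """Sort a block of (term, docID) pairs and collect them into
    postings lists: one sorted list of docIDs per term."""
    block_index = {}
    for term, doc_id in sorted(pairs):
        postings = block_index.setdefault(term, [])
        if not postings or postings[-1] != doc_id:  # skip duplicate pairs
            postings.append(doc_id)
    return sorted(block_index.items())

def merge_blocks(blocks):
    """Merge inverted blocks into one index. Because documents are
    processed in order, each block's docIDs follow the previous
    block's, so same-term postings lists can simply be concatenated."""
    merged = {}
    for block in blocks:
        for term, postings in block:
            merged.setdefault(term, []).extend(postings)
    return sorted(merged.items())
```

The real algorithm streams small chunks of each block file through read buffers during the merge; the all-in-memory merge above is just a stand-in for that disk-based step.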

4.3 Single-pass in-memory indexing.
The part of the algorithm that parses documents and turns them into a stream of term–docID pairs (tokens) has been omitted.
SPIMI-INVERT(token_stream)
  output_file = NEWFILE()
  dictionary = NEWHASH()
  while (free memory available)
  do token ← next(token_stream)
     if term(token) ∉ dictionary
        then postings_list = ADDTODICTIONARY(dictionary, term(token))
        else postings_list = GETPOSTINGSLIST(dictionary, term(token))
     if full(postings_list)
        then postings_list = DOUBLEPOSTINGSLIST(dictionary, term(token))
     ADDTOPOSTINGSLIST(postings_list, docID(token))
  sorted_terms ← SORTTERMS(dictionary)
  WRITEBLOCKTODISK(sorted_terms, dictionary, output_file)
  return output_file
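A sketch of one SPIMI pass in Python. Python dicts and lists already give the hashing and dynamic growth that ADDTODICTIONARY and DOUBLEPOSTINGSLIST provide, so those steps collapse into setdefault/append; the "free memory available" test is stood in for by a token budget:

```python
def spimi_invert(token_stream, token_budget=1_000_000):
    """One SPIMI pass: consume tokens until the budget (a stand-in
    for 'free memory available') is exhausted or the stream ends,
    then emit one sorted block."""
    dictionary = {}
    for _ in range(token_budget):
        try:
            term, doc_id = next(token_stream)
        except StopIteration:
            break
        # Lists grow automatically, so no explicit DOUBLEPOSTINGSLIST.
        dictionary.setdefault(term, []).append(doc_id)
    # Sorting happens once per block, at write time -- not per token.
    return sorted(dictionary.items())
```

Note the key difference from BSBI: terms are kept unsorted while the block is being built, and each block is sorted only once, when it is written out.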

4.4 Distributed indexing is an application of MapReduce, a general architecture for distributed computing.
The map and reduce phases of MapReduce split up the computing job into chunks that standard machines can process in a short time.
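A toy single-machine sketch of the two phases for index construction (the function names and the driver are illustrative, not part of any real MapReduce API): mappers parse document splits into (term, docID) pairs, the shuffle groups pairs by term, and reducers build one postings list per term.

```python
from collections import defaultdict

def map_phase(split):
    """Parser: turn one split of (docID, text) pairs into
    (term, docID) pairs."""
    return [(term, doc_id)
            for doc_id, text in split
            for term in text.split()]

def reduce_phase(term, doc_ids):
    """Inverter: collect all docIDs for one term into a sorted
    postings list."""
    return term, sorted(doc_ids)

def run_job(splits):
    """Toy driver: run every mapper, shuffle output by term,
    then run the reducers. In real MapReduce each phase runs in
    parallel on many machines."""
    groups = defaultdict(list)
    for split in splits:
        for term, doc_id in map_phase(split):
            groups[term].append(doc_id)
    return sorted(reduce_phase(t, ids) for t, ids in groups.items())
```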

4.5 Dynamic indexing
Logarithmic merging. Each token (termID, docID) is initially added to the in-memory index Z0 by LMERGEADDTOKEN. LOGARITHMICMERGE initializes Z0 and indexes.
LMERGEADDTOKEN(indexes, Z0, token)
  Z0 ← MERGE(Z0, {token})
  if |Z0| = n
    then for i ← 0 to ∞
      do if Ii ∈ indexes
           then Zi+1 ← MERGE(Ii, Zi)
                (Zi+1 is a temporary index on disk.)
                indexes ← indexes − {Ii}
           else Ii ← Zi  (Zi becomes the permanent index Ii.)
                indexes ← indexes ∪ {Ii}
                BREAK
  Z0 ← ∅
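A runnable sketch of the cascade, assuming an index is just a sorted list of (term, docID) tokens and `indexes` is a dict mapping the level i to Ii. The structure behaves like a binary counter: flushing Z0 "carries" upward through occupied levels until a free one is found.

```python
def merge(a, b):
    """Merge two sorted token lists (stand-in for the disk merge)."""
    return sorted(a + b)

def lmerge_add_token(indexes, z0, token, n=4):
    """Add one token to the in-memory index Z0; once |Z0| = n,
    cascade-merge it down the levels I0, I1, ..."""
    z0.append(token)
    if len(z0) == n:
        z = sorted(z0)            # flush Z0 to "disk"
        i = 0
        while i in indexes:       # Ii occupied: merge and carry upward
            z = merge(indexes.pop(i), z)
            i += 1
        indexes[i] = z            # Ii free: Z becomes the permanent Ii
        z0.clear()
```

After k flushes, the occupied levels are exactly the set bits in the binary representation of k, which is why each token is processed O(log(T/n)) times overall.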

Chapter 5
5.1 Statistical properties of terms in IR:
Heaps’ law: a better way of getting a handle on M, the vocabulary size:
M = kT^b (T is the number of tokens; typical values are 30 ≤ k ≤ 100 and b ≈ 0.5, but k is quite variable because vocabulary growth depends a lot on the nature of the collection and how it is processed.)
Zipf’s law: if t1 is the most common term in the collection, t2 is the next most common, and so on, then the collection frequency cf_i of the ith most common term is proportional to 1/i:
cf_i ∝ 1/i
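A quick numeric check of Heaps’ law, using k = 44 and b = 0.49, the values the chapter reports as fitted for the Reuters-RCV1 collection (a different collection would need its own fit):

```python
def heaps_vocab_size(num_tokens, k=44.0, b=0.49):
    """Heaps' law estimate M = k * T**b of the vocabulary size.
    k=44, b=0.49 are the values fitted for Reuters-RCV1."""
    return k * num_tokens ** b

# Vocabulary grows sublinearly: 10x more tokens yields only about
# 10**0.49, i.e. roughly 3.1x, more distinct terms.
small = heaps_vocab_size(1_000_000)
large = heaps_vocab_size(10_000_000)
```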

5.2 Dictionary data structures that achieve increasingly higher compression ratios:
1. The simplest data structure is to sort the vocabulary lexicographically and store it in an array of fixed-width entries.
2. Store the vocabulary as one long string, then further compress the dictionary by grouping the terms in the string into blocks of size k and keeping a term pointer only for the first term of each block.
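A sketch of blocked storage under one simplifying assumption: each term is prefixed by a 1-byte length (so terms must be under 256 bytes), and one pointer is kept per block of k terms instead of per term.

```python
def build_blocked_dictionary(sorted_terms, k=4):
    """Pack lexicographically sorted terms into one byte string,
    each term prefixed by a 1-byte length; keep a pointer only to
    the first term of every block of k terms."""
    data, block_pointers = bytearray(), []
    for i, term in enumerate(sorted_terms):
        if i % k == 0:
            block_pointers.append(len(data))
        encoded = term.encode("utf-8")   # assumes len(encoded) < 256
        data.append(len(encoded))
        data.extend(encoded)
    return bytes(data), block_pointers

def term_at(data, block_pointers, k, idx):
    """Recover the idx-th term: jump to its block pointer, then scan
    forward over the length prefixes (the price of fewer pointers)."""
    pos = block_pointers[idx // k]
    for _ in range(idx % k):
        pos += 1 + data[pos]             # skip one term
    length = data[pos]
    return data[pos + 1 : pos + 1 + length].decode("utf-8")
```

The trade-off is visible in `term_at`: saving k − 1 pointers per block costs up to k − 1 extra scan steps during binary-search lookup within a block.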

5.3
1. VB encoding and decoding. The functions div and mod compute integer division and remainder after integer division, respectively. PREPEND adds an element to the beginning of a list, for example, PREPEND(⟨1, 2⟩, 3) = ⟨3, 1, 2⟩. EXTEND extends a list, for example, EXTEND(⟨1, 2⟩, ⟨3, 4⟩) = ⟨1, 2, 3, 4⟩.
2. γ codes implement variable-length encoding by splitting the representation of a gap G into a pair of length and offset.
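Both schemes can be sketched directly from the definitions above. VB encoding follows the mod/div/PREPEND recipe, storing 7 payload bits per byte with the high bit set on the terminating byte; the γ code of a gap G is the unary code of the offset's length followed by the offset, G's binary representation with the leading 1 removed.

```python
def vb_encode_number(n):
    """Variable-byte encode one gap: low 7 bits per byte, with the
    high bit set on the last (least significant) byte as terminator."""
    out = []
    while True:
        out.insert(0, n % 128)   # PREPEND(bytes, n mod 128)
        if n < 128:
            break
        n //= 128                # n <- n div 128
    out[-1] += 128               # mark the final byte of the gap
    return bytes(out)

def vb_decode(stream):
    """Decode a VB byte stream back into the list of gaps."""
    numbers, n = [], 0
    for byte in stream:
        if byte < 128:
            n = 128 * n + byte           # continuation byte
        else:
            numbers.append(128 * n + (byte - 128))
            n = 0
    return numbers

def gamma_encode(gap):
    """Gamma-encode a gap G >= 1 as unary(len(offset)) + offset,
    where the offset is G in binary with the leading 1 removed."""
    offset = bin(gap)[3:]                # drop the '0b1' prefix
    return "1" * len(offset) + "0" + offset
```

For example, the gap 824 VB-encodes to the two bytes (6, 184), since 824 = 6 × 128 + 56 and 56 + 128 = 184; the γ code of 13 (binary 1101, offset 101) is 1110101.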
