Friday, January 24, 2014

Week 3: Reading Notes


Chapter 4
4.1 Characteristics of the computer hardware that an information retrieval system runs on:
1. Caching— keeping frequently used disk data in main memory.
2. Reading a single byte from disk can take as much time as reading the entire block because operating systems generally read and write entire blocks.
3. Data transferred from disk to memory are handled by the system bus, not by the processor.

4.2 Blocked sort-based indexing. The algorithm stores inverted blocks in files f1, . . . , fn and the merged index in fmerged.
In the last step, the algorithm simultaneously merges the ten blocks into one large merged index.
BSBIINDEXCONSTRUCTION()
  n ← 0
  while (all documents have not been processed)
  do n ← n + 1
     block ← PARSENEXTBLOCK()
     BSBI-INVERT(block)
     WRITEBLOCKTODISK(block, fn)
  MERGEBLOCKS(f1, . . . , fn; fmerged)
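A minimal in-memory sketch of the two core steps, assuming each block of (term, docID) pairs fits in memory; Python lists stand in for the on-disk block files f1, . . . , fn:

```python
def bsbi_invert(pairs):
    """Sort a block of (term, docID) pairs and collect them into
    postings lists: one sorted list of docIDs per term."""
    block_index = {}
    for term, doc_id in sorted(pairs):
        postings = block_index.setdefault(term, [])
        if not postings or postings[-1] != doc_id:  # skip duplicate pairs
            postings.append(doc_id)
    return sorted(block_index.items())

def merge_blocks(blocks):
    """Merge inverted blocks into one index. Because documents are
    processed in order, each block's docIDs follow the previous
    block's, so same-term postings lists can simply be concatenated."""
    merged = {}
    for block in blocks:
        for term, postings in block:
            merged.setdefault(term, []).extend(postings)
    return sorted(merged.items())
```

The real algorithm streams small chunks of each block file through read buffers during the merge; the all-in-memory merge above is just a stand-in for that disk-based step.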

4.3 Single-pass in-memory indexing.
The part of the algorithm that parses documents and turns them into a stream of term–docID pairs (tokens) has been omitted.
SPIMI-INVERT(token_stream)
  output_file = NEWFILE()
  dictionary = NEWHASH()
  while (free memory available)
  do token ← next(token_stream)
     if term(token) ∉ dictionary
        then postings_list = ADDTODICTIONARY(dictionary, term(token))
        else postings_list = GETPOSTINGSLIST(dictionary, term(token))
     if full(postings_list)
        then postings_list = DOUBLEPOSTINGSLIST(dictionary, term(token))
     ADDTOPOSTINGSLIST(postings_list, docID(token))
  sorted_terms ← SORTTERMS(dictionary)
  WRITEBLOCKTODISK(sorted_terms, dictionary, output_file)
  return output_file
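A sketch of one SPIMI pass in Python. Python dicts and lists already give the hashing and dynamic growth that ADDTODICTIONARY and DOUBLEPOSTINGSLIST provide, so those steps collapse into setdefault/append; the "free memory available" test is stood in for by a token budget:

```python
def spimi_invert(token_stream, token_budget=1_000_000):
    """One SPIMI pass: consume tokens until the budget (a stand-in
    for 'free memory available') is exhausted or the stream ends,
    then emit one sorted block."""
    dictionary = {}
    for _ in range(token_budget):
        try:
            term, doc_id = next(token_stream)
        except StopIteration:
            break
        # Lists grow automatically, so no explicit DOUBLEPOSTINGSLIST.
        dictionary.setdefault(term, []).append(doc_id)
    # Sorting happens once per block, at write time -- not per token.
    return sorted(dictionary.items())
```

Note the key difference from BSBI: terms are kept unsorted while the block is being built, and each block is sorted only once, when it is written out.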

4.4 Distributed indexing is an application of MapReduce, a general architecture for distributed computing.
The map and reduce phases of MapReduce split up the computing job into chunks that standard machines can process in a short time.
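A toy single-machine sketch of the two phases for index construction (the function names and the driver are illustrative, not part of any real MapReduce API): mappers parse document splits into (term, docID) pairs, the shuffle groups pairs by term, and reducers build one postings list per term.

```python
from collections import defaultdict

def map_phase(split):
    """Parser: turn one split of (docID, text) pairs into
    (term, docID) pairs."""
    return [(term, doc_id)
            for doc_id, text in split
            for term in text.split()]

def reduce_phase(term, doc_ids):
    """Inverter: collect all docIDs for one term into a sorted
    postings list."""
    return term, sorted(doc_ids)

def run_job(splits):
    """Toy driver: run every mapper, shuffle output by term,
    then run the reducers. In real MapReduce each phase runs in
    parallel on many machines."""
    groups = defaultdict(list)
    for split in splits:
        for term, doc_id in map_phase(split):
            groups[term].append(doc_id)
    return sorted(reduce_phase(t, ids) for t, ids in groups.items())
```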

4.5 Dynamic indexing
Logarithmic merging. Each token (termID, docID) is initially added to the in-memory index Z0 by LMERGEADDTOKEN. LOGARITHMICMERGE initializes Z0 and indexes.
LMERGEADDTOKEN(indexes, Z0, token)
  Z0 ← MERGE(Z0, {token})
  if |Z0| = n
    then for i ← 0 to ∞
      do if Ii ∈ indexes
           then Zi+1 ← MERGE(Ii, Zi)
                (Zi+1 is a temporary index on disk.)
                indexes ← indexes − {Ii}
           else Ii ← Zi  (Zi becomes the permanent index Ii.)
                indexes ← indexes ∪ {Ii}
                BREAK
  Z0 ← ∅
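A runnable sketch of the cascade, assuming an index is just a sorted list of (term, docID) tokens and `indexes` is a dict mapping the level i to Ii. The structure behaves like a binary counter: flushing Z0 "carries" upward through occupied levels until a free one is found.

```python
def merge(a, b):
    """Merge two sorted token lists (stand-in for the disk merge)."""
    return sorted(a + b)

def lmerge_add_token(indexes, z0, token, n=4):
    """Add one token to the in-memory index Z0; once |Z0| = n,
    cascade-merge it down the levels I0, I1, ..."""
    z0.append(token)
    if len(z0) == n:
        z = sorted(z0)            # flush Z0 to "disk"
        i = 0
        while i in indexes:       # Ii occupied: merge and carry upward
            z = merge(indexes.pop(i), z)
            i += 1
        indexes[i] = z            # Ii free: Z becomes the permanent Ii
        z0.clear()
```

After k flushes, the occupied levels are exactly the set bits in the binary representation of k, which is why each token is processed O(log(T/n)) times overall.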

Chapter 5
5.1 Statistical properties of terms in IR:
Heaps’ law: a better way of getting a handle on M, the vocabulary size:
M = kT^b (T is the number of tokens; typical values are 30 ≤ k ≤ 100 and b ≈ 0.5, but k is quite variable because vocabulary growth depends a lot on the nature of the collection and how it is processed.)
Zipf’s law: if t1 is the most common term in the collection, t2 is the next most common, and so on, then the collection frequency cf_i of the ith most common term is proportional to 1/i:
cf_i ∝ 1/i
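A quick numeric check of Heaps’ law, using k = 44 and b = 0.49, the values the chapter reports as fitted for the Reuters-RCV1 collection (a different collection would need its own fit):

```python
def heaps_vocab_size(num_tokens, k=44.0, b=0.49):
    """Heaps' law estimate M = k * T**b of the vocabulary size.
    k=44, b=0.49 are the values fitted for Reuters-RCV1."""
    return k * num_tokens ** b

# Vocabulary grows sublinearly: 10x more tokens yields only about
# 10**0.49, i.e. roughly 3.1x, more distinct terms.
small = heaps_vocab_size(1_000_000)
large = heaps_vocab_size(10_000_000)
```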

5.2 Dictionary data structures that achieve increasingly higher compression ratios:
1. The simplest data structure is to sort the vocabulary lexicographically and store it in an array of fixed-width entries.
2. Store the vocabulary as one long string, then further compress the dictionary by grouping the terms in the string into blocks of size k and keeping a term pointer only for the first term of each block.
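A sketch of blocked storage under one simplifying assumption: each term is prefixed by a 1-byte length (so terms must be under 256 bytes), and one pointer is kept per block of k terms instead of per term.

```python
def build_blocked_dictionary(sorted_terms, k=4):
    """Pack lexicographically sorted terms into one byte string,
    each term prefixed by a 1-byte length; keep a pointer only to
    the first term of every block of k terms."""
    data, block_pointers = bytearray(), []
    for i, term in enumerate(sorted_terms):
        if i % k == 0:
            block_pointers.append(len(data))
        encoded = term.encode("utf-8")   # assumes len(encoded) < 256
        data.append(len(encoded))
        data.extend(encoded)
    return bytes(data), block_pointers

def term_at(data, block_pointers, k, idx):
    """Recover the idx-th term: jump to its block pointer, then scan
    forward over the length prefixes (the price of fewer pointers)."""
    pos = block_pointers[idx // k]
    for _ in range(idx % k):
        pos += 1 + data[pos]             # skip one term
    length = data[pos]
    return data[pos + 1 : pos + 1 + length].decode("utf-8")
```

The trade-off is visible in `term_at`: saving k − 1 pointers per block costs up to k − 1 extra scan steps during binary-search lookup within a block.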

5.3
1. VB encoding and decoding. The functions div and mod compute integer division and remainder after integer division, respectively. PREPEND adds an element to the beginning of a list, for example, PREPEND(⟨1, 2⟩, 3) = ⟨3, 1, 2⟩. EXTEND extends a list, for example, EXTEND(⟨1, 2⟩, ⟨3, 4⟩) = ⟨1, 2, 3, 4⟩.
2. γ codes implement variable-length encoding by splitting the representation of a gap G into a pair of length and offset.
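Both schemes can be sketched directly from the definitions above. VB encoding follows the mod/div/PREPEND recipe, storing 7 payload bits per byte with the high bit set on the terminating byte; the γ code of a gap G is the unary code of the offset's length followed by the offset, G's binary representation with the leading 1 removed.

```python
def vb_encode_number(n):
    """Variable-byte encode one gap: low 7 bits per byte, with the
    high bit set on the last (least significant) byte as terminator."""
    out = []
    while True:
        out.insert(0, n % 128)   # PREPEND(bytes, n mod 128)
        if n < 128:
            break
        n //= 128                # n <- n div 128
    out[-1] += 128               # mark the final byte of the gap
    return bytes(out)

def vb_decode(stream):
    """Decode a VB byte stream back into the list of gaps."""
    numbers, n = [], 0
    for byte in stream:
        if byte < 128:
            n = 128 * n + byte           # continuation byte
        else:
            numbers.append(128 * n + (byte - 128))
            n = 0
    return numbers

def gamma_encode(gap):
    """Gamma-encode a gap G >= 1 as unary(len(offset)) + offset,
    where the offset is G in binary with the leading 1 removed."""
    offset = bin(gap)[3:]                # drop the '0b1' prefix
    return "1" * len(offset) + "0" + offset
```

For example, the gap 824 VB-encodes to the two bytes (6, 184), since 824 = 6 × 128 + 56 and 56 + 128 = 184; the γ code of 13 (binary 1101, offset 101) is 1110101.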
