Chapter 4
4.1 The characteristics of computer hardware relevant to information retrieval systems:
1. Caching: keeping frequently used disk data in main memory.
2. Reading a single byte from disk can take as much time as reading an entire block, because operating systems generally read and write whole blocks.
3. Data transferred from disk to memory is handled by the system bus, not by the processor.
4.2 Blocked sort-based indexing (BSBI). The algorithm stores inverted blocks in files f1, . . . , fn and the merged index in fmerged. In the last step, it simultaneously merges all the blocks into one large merged index.
1  n ← 0
2  while (all documents have not been processed)
3  do n ← n + 1
4     block ← PARSENEXTBLOCK()
5     BSBI-INVERT(block)
6     WRITEBLOCKTODISK(block, fn)
7  MERGEBLOCKS(f1, . . . , fn; fmerged)
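As a rough Python sketch (not the book's implementation), the invert and merge steps might look as follows, with small in-memory lists standing in for the on-disk block files f1, . . . , fn; the function names only mirror the pseudocode:

```python
import heapq

def bsbi_invert(pairs):
    """Sort a block of (termID, docID) pairs and group them into an
    inverted block: a list of (termID, postings) entries sorted by termID."""
    block = {}
    for term_id, doc_id in pairs:
        block.setdefault(term_id, []).append(doc_id)
    return [(t, sorted(set(block[t]))) for t in sorted(block)]

def merge_blocks(blocks):
    """Merge sorted inverted blocks into one merged index, reading all
    blocks in parallel the way MERGEBLOCKS reads f1, . . . , fn."""
    merged = {}
    for term_id, postings in heapq.merge(*blocks):
        merged.setdefault(term_id, []).extend(postings)
    return [(t, sorted(merged[t])) for t in sorted(merged)]
```

Merging two blocks built from disjoint document ranges yields one postings list per term, as in the final merge step.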
4.3 Single-pass in-memory indexing (SPIMI).
The part of the algorithm that parses documents and turns them into a stream of term–docID pairs (tokens) has been omitted.
1  output_file = NEWFILE()
2  dictionary = NEWHASH()
3  while (free memory available)
4  do token ← next(token_stream)
5     if term(token) ∉ dictionary
6     then postings_list = ADDTODICTIONARY(dictionary, term(token))
7     else postings_list = GETPOSTINGSLIST(dictionary, term(token))
8     if full(postings_list)
9     then postings_list = DOUBLEPOSTINGSLIST(dictionary, term(token))
10    ADDTOPOSTINGSLIST(postings_list, docID(token))
11 sorted_terms ← SORTTERMS(dictionary)
12 WRITEBLOCKTODISK(sorted_terms, dictionary, output_file)
13 return output_file
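In Python, a dict and growable lists subsume NEWHASH, ADDTODICTIONARY/GETPOSTINGSLIST, and DOUBLEPOSTINGSLIST, so building one SPIMI block reduces to a short sketch (the free-memory check and the disk write are elided here):

```python
from collections import defaultdict

def spimi_invert(token_stream):
    """Build one SPIMI block: a term is added to the dictionary the
    first time it is seen, and its postings list grows as docIDs arrive
    (Python lists resize automatically, so no DOUBLEPOSTINGSLIST)."""
    dictionary = defaultdict(list)           # NEWHASH; missing term -> new list
    for term, doc_id in token_stream:
        dictionary[term].append(doc_id)      # ADDTOPOSTINGSLIST
    return sorted(dictionary.items())        # SORTTERMS, ready to write to disk
```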
4.4 Distributed indexing is an application
of MapReduce, a general architecture for distributed computing.
The map and reduce phases of MapReduce split up the computing job into chunks that standard machines can process in a short time.
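For index construction, the map phase parses each document split into (term, docID) pairs and the reduce phase collects each term's postings list. A toy single-machine sketch (the function names are mine, not a MapReduce API):

```python
from itertools import groupby
from operator import itemgetter

def map_phase(doc_id, text):
    """Map: parse one document into (term, docID) pairs."""
    return [(term, doc_id) for term in text.split()]

def reduce_phase(pairs):
    """Reduce: sort pairs by term and group them into postings lists,
    as the framework's shuffle/sort step would do between the phases."""
    pairs.sort()
    return {term: sorted({d for _, d in group})
            for term, group in groupby(pairs, key=itemgetter(0))}
```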
4.5 Dynamic indexing
Logarithmic merging. Each token (termID, docID) is initially added to the in-memory index Z0 by LMERGEADDTOKEN. LOGARITHMICMERGE initializes Z0 and the set indexes.
1  Z0 ← MERGE(Z0, {token})
2  if |Z0| = n
3  then for i ← 0 to ∞
4       do if Ii ∈ indexes
5          then Zi+1 ← MERGE(Ii, Zi)
6               (Zi+1 is a temporary index on disk.)
7               indexes ← indexes − {Ii}
8          else Ii ← Zi  (Zi becomes the permanent index Ii.)
9               indexes ← indexes ∪ {Ii}
10              BREAK
11      Z0 ← ∅
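A compact Python sketch of this routine, with a dict from level i to a sorted list standing in for the on-disk indexes Ii (names and signature are illustrative, not from the book):

```python
def lmerge_add_token(indexes, z0, token, n):
    """Logarithmic merging: add a token to in-memory index Z0; when Z0
    reaches size n, cascade-merge it upward through levels I0, I1, ...
    of sizes n, 2n, 4n, ... until an empty level is found."""
    z0.append(token)
    if len(z0) == n:
        z = sorted(z0)
        i = 0
        while i in indexes:                  # Ii exists: merge and move up
            z = sorted(indexes.pop(i) + z)   # Zi+1 <- MERGE(Ii, Zi)
            i += 1
        indexes[i] = z                       # Zi becomes the permanent Ii
        z0.clear()                           # Z0 <- empty
    return indexes, z0
```

Each token is thus merged O(log(T/n)) times in total, which is the source of the scheme's name.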
Chapter 5
5.1 Statistical properties of terms in IR:
Heaps’ law: a better way of getting a handle on the vocabulary size M:
M = k * T^b
(T is the number of tokens in the collection; typical values are 30 ≤ k ≤ 100 and b ≈ 0.5. k is quite variable because vocabulary growth depends a lot on the nature of the collection and how it is processed.)
Zipf’s law: if t1 is the most common term in the collection, t2 is the next most common, and so on, then the collection frequency cf_i of the i-th most common term is proportional to 1/i:
cf_i ∝ 1/i
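Both laws are one-liners to evaluate. The sketch below uses k = 44 and b = 0.49, the values the book fits to Reuters-RCV1; for any other collection the parameters must be re-fitted:

```python
def heaps_vocab(T, k=44, b=0.49):
    """Heaps' law: estimated vocabulary size M = k * T^b
    for a collection of T tokens."""
    return k * T ** b

def zipf_cf(cf1, i):
    """Zipf's law: collection frequency of the i-th most common term,
    given cf1 for the most common term (cf_i proportional to 1/i)."""
    return cf1 / i
```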
5.2 Dictionary data structures that achieve increasingly higher compression ratios:
1. The simplest data structure is to sort the vocabulary lexicographically and store it in an array of fixed-width entries.
2. Dictionary-as-a-string with blocking: further compress the dictionary by grouping the terms in the string into blocks of size k and keeping a term pointer only for the first term of each block.
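A sketch of scheme 2, assuming every term is shorter than 256 characters so a single length "byte" suffices; the term-pointer array shrinks by roughly a factor of k:

```python
def block_dictionary(terms, k=4):
    """Store the sorted vocabulary as one long string of
    length-prefixed terms, keeping a pointer (string offset)
    only for the first term of each block of k terms."""
    string, pointers = [], []
    pos = 0
    for i, term in enumerate(sorted(terms)):
        if i % k == 0:
            pointers.append(pos)             # block boundary: keep a pointer
        entry = chr(len(term)) + term        # length prefix, then the term
        string.append(entry)
        pos += len(entry)
    return "".join(string), pointers
```

Lookup within a block scans forward from the block's pointer, decoding one length prefix at a time.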
5.3 Postings file compression
1. VB (variable byte) encoding and decoding. The functions div and mod compute integer division and remainder after integer division, respectively. PREPEND adds an element to the beginning of a list, for example, PREPEND(⟨1, 2⟩, 3) = ⟨3, 1, 2⟩. EXTEND extends a list, for example, EXTEND(⟨1, 2⟩, ⟨3, 4⟩) = ⟨1, 2, 3, 4⟩.
2. γ codes implement variable-length encoding by splitting the representation of a gap G into a pair of length and offset.
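VB encoding in Python, following the div/mod/PREPEND description of scheme 1 above (a sketch over lists of byte values rather than a real bytestream):

```python
def vb_encode_number(n):
    """Variable-byte encode one gap: 7 data bits per byte; the high bit
    is set only on the last (least significant) byte as a terminator."""
    out = []
    while True:
        out.insert(0, n % 128)   # PREPEND(out, n mod 128)
        if n < 128:
            break
        n //= 128                # n div 128
    out[-1] += 128               # set the continuation bit on the last byte
    return out

def vb_decode(bytestream):
    """Decode a VB-encoded stream back into the list of gaps."""
    numbers, n = [], 0
    for byte in bytestream:
        if byte < 128:                          # not the last byte of a gap
            n = 128 * n + byte
        else:                                   # last byte: emit the number
            numbers.append(128 * n + (byte - 128))
            n = 0
    return numbers
```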
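γ codes over bit strings, assuming G ≥ 1, with "1"*length followed by a terminating "0" as the unary length part (a sketch; real implementations pack the bits):

```python
def gamma_encode(g):
    """Gamma-encode a gap G >= 1: offset is G's binary representation
    with the leading 1 removed; length is len(offset) in unary."""
    offset = bin(g)[3:]                     # bin(13) = '0b1101' -> '101'
    return "1" * len(offset) + "0" + offset

def gamma_decode(code):
    """Decode a single gamma codeword back to the gap."""
    length = code.index("0")                # unary part ends at the first 0
    return int("1" + code[length + 1:], 2)  # restore the implicit leading 1
```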