Chapter 1 Section 1.2
1. How to build an inverted index (the core step is sorting the list so that the terms are in alphabetical order):
1) Collect the documents to be indexed
2) Tokenize the text, turning each document into a list of tokens
3) Do linguistic preprocessing, producing a list of normalized tokens, which are the indexing terms
4) Index the documents that each term occurs in by creating an inverted index, consisting of a dictionary and postings.
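The four steps above can be sketched in a few lines of Python. This is only a minimal illustration, not a production indexer: the function name `build_inverted_index`, the sample documents, and the choice of whitespace splitting plus lowercasing as the "linguistic preprocessing" are all assumptions made for the example.

```python
def build_inverted_index(docs):
    """docs: dict mapping doc ID -> text. Returns dict mapping term -> sorted doc IDs."""
    # Steps 2-3: tokenize and normalize (assumed here: whitespace split + lowercasing).
    pairs = []
    for doc_id, text in docs.items():
        for token in text.split():
            pairs.append((token.lower(), doc_id))

    # Core step: sort the (term, doc ID) pairs by term, then by doc ID.
    pairs.sort()

    # Step 4: merge duplicate pairs and group doc IDs under each term.
    # The dict plays the role of the dictionary, the lists are the postings.
    index = {}
    for term, doc_id in pairs:
        postings = index.setdefault(term, [])
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)
    return index


# Step 1: collect the documents (two made-up examples).
docs = {1: "new home sales top forecasts", 2: "home sales rise in july"}
index = build_inverted_index(docs)
print(index["home"])   # [1, 2]
print(index["july"])   # [2]
```

Because the pairs are sorted before grouping, the terms enter the dictionary in alphabetical order and every postings list comes out sorted by doc ID.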
2. For an in-memory postings list, two good choices are singly linked lists and variable-length arrays.
Chapter 2
2.1 Convert the byte sequence into a linear sequence of characters, then choose a suitably sized document unit, together with a proper way of dividing or aggregating files.
2.2 How to determine the vocabulary of terms:
1) Given a character sequence and a defined document unit, divide the text into tokens
2) Drop stop words
3) Canonicalize tokens so that matches occur
despite superficial differences in the character sequences of the tokens
4) Stemming and lemmatization
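A rough sketch of these four steps as one pipeline follows. Everything in it is an assumption for illustration: the stop list is tiny, normalization is only case folding, and `toy_stem` is a crude suffix stripper, not the Porter stemmer or a real lemmatizer. Case folding is applied before the stop-word check so that "The" and "the" are treated alike.

```python
import re

STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to"}  # tiny illustrative list


def toy_stem(token):
    # Crude suffix stripping for illustration only; not the Porter algorithm.
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def terms(text):
    tokens = re.findall(r"[A-Za-z0-9]+", text)            # 1) tokenize
    tokens = [t.lower() for t in tokens]                  # 3) normalize (case folding)
    tokens = [t for t in tokens if t not in STOP_WORDS]   # 2) drop stop words
    return [toy_stem(t) for t in tokens]                  # 4) stem


print(terms("The cats are sleeping in the House"))
# ['cat', 'are', 'sleep', 'house']
```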
2.3 Extensions to postings list data structures, and ways to make postings list processing more efficient.
Chapter 3
3.1 Search structures for dictionaries: Hashing and Search Trees.
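To make the difference between the two structures concrete, here is a small sketch. A Python `dict` stands in for hashing, and a sorted list searched with `bisect` stands in for a search tree (an assumption: a real system would more likely use a balanced tree or B-tree). The vocabulary and the helper `prefix_range` are made up for the example. The point is that hashing only answers exact lookups, while a sorted structure can also return every term that shares a prefix.

```python
import bisect

vocabulary = ["sun", "moon", "month", "monday", "more", "sunday"]

# Hashing: constant-time exact lookup, but no notion of neighbouring terms.
hashed = {term: i for i, term in enumerate(vocabulary)}
print("moon" in hashed)   # True
print("mon" in hashed)    # False, a prefix is not a key

# Sorted order (stand-in for a search tree): terms with a common prefix are adjacent.
sorted_terms = sorted(vocabulary)


def prefix_range(prefix):
    """Return all terms starting with `prefix`, found by two binary searches."""
    lo = bisect.bisect_left(sorted_terms, prefix)
    # "\uffff" is used as a sentinel that sorts after any ordinary character.
    hi = bisect.bisect_left(sorted_terms, prefix + "\uffff")
    return sorted_terms[lo:hi]


print(prefix_range("mon"))  # ['monday', 'month']
```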
3.2 When the user is uncertain of the spelling of a query term, or is aware of multiple spelling variants of a term and wants to find the others, they can use a wildcard query.
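As a small illustration of what a wildcard query does, the sketch below expands a pattern against a made-up vocabulary using the standard library's `fnmatch.fnmatchcase`. The linear scan and the helper name `expand_wildcard` are assumptions chosen for brevity; a real engine would use a dedicated index structure rather than testing every term.

```python
from fnmatch import fnmatchcase


def expand_wildcard(pattern, vocabulary):
    """Return the vocabulary terms matching a pattern where * matches any string."""
    return sorted(t for t in vocabulary if fnmatchcase(t, pattern))


vocab = {"color", "colour", "colder", "collar", "cooler"}
print(expand_wildcard("colo*r", vocab))  # ['color', 'colour']
```

The expanded terms would then be looked up in the index as an OR of ordinary term queries.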
3.3 Algorithms for spelling correction: edit distance and k-gram overlap.
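Both ideas fit in a short sketch. The assumptions here are: edit distance means Levenshtein distance with unit costs for insert, delete, and substitute; k-grams are taken over the term padded with a `$` boundary marker; and overlap is measured with the Jaccard coefficient. The function names are made up for the example.

```python
def edit_distance(s, t):
    """Levenshtein distance via dynamic programming, keeping one row at a time."""
    prev = list(range(len(t) + 1))          # distance from "" to each prefix of t
    for i, cs in enumerate(s, start=1):
        curr = [i]                          # distance from s[:i] to ""
        for j, ct in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1,                # delete cs
                curr[j - 1] + 1,            # insert ct
                prev[j - 1] + (cs != ct),   # substitute, or copy if equal
            ))
        prev = curr
    return prev[-1]


def kgrams(term, k=2):
    """Set of k-grams of the term, with $ marking the start and end."""
    padded = "$" + term + "$"
    return {padded[i:i + k] for i in range(len(padded) - k + 1)}


def jaccard(a, b, k=2):
    """Jaccard overlap between the k-gram sets of two terms."""
    ga, gb = kgrams(a, k), kgrams(b, k)
    return len(ga & gb) / len(ga | gb)


print(edit_distance("kitten", "sitting"))   # 3
print(sorted(kgrams("bord")))               # ['$b', 'bo', 'd$', 'or', 'rd']
print(round(jaccard("bord", "board"), 3))   # 0.571
```

A common pattern is to use k-gram overlap as a cheap filter for candidate terms and then rank the survivors by edit distance.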
3.4 Generate a “phonetic hash” for each term so that similar-sounding terms hash to the same value.
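Soundex is the classic example of such a phonetic hash. The sketch below follows one simplified variant (keep the first letter, map the remaining letters to digits, collapse repeated digits, drop the zeros, pad to four characters). Details differ between Soundex implementations, so treat this as an illustration rather than a reference version.

```python
# Letter-to-digit table: vowels and H, W, Y map to 0 and are dropped later.
CODES = {}
for letters, digit in (
    ("AEIOUHWY", "0"), ("BFPV", "1"), ("CGJKQSXZ", "2"),
    ("DT", "3"), ("L", "4"), ("MN", "5"), ("R", "6"),
):
    for ch in letters:
        CODES[ch] = digit


def soundex(term):
    """Return a 4-character code: first letter followed by three digits."""
    word = [c for c in term.upper() if c in CODES]   # ignore non-letters
    if not word:
        return ""
    digits = [CODES[c] for c in word[1:]]            # first letter is kept as-is
    collapsed = []
    for d in digits:                                 # collapse runs of equal digits
        if not collapsed or collapsed[-1] != d:
            collapsed.append(d)
    code = "".join(d for d in collapsed if d != "0") # remove the zeros
    return (word[0] + code + "000")[:4]              # pad with zeros, keep 4 chars


print(soundex("Hermann"), soundex("Herman"))  # H655 H655
print(soundex("Robert"), soundex("Rupert"))   # R163 R163
```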