Friday, January 10, 2014

Week 1: Reading Notes

Chapter 1 Section 1.2
1. How to build an inverted index (the core step is sorting the list so that terms appear in alphabetical order):
1)   Collect the documents to be indexed
2)   Tokenize the text, turning each document into a list of tokens
3)   Do linguistic preprocessing, producing a list of normalized tokens, which are the indexing terms
4)   Index the documents that each term occurs in by creating an inverted index, consisting of a dictionary and postings.
2. For an in-memory postings list, two good alternatives are singly linked lists and variable-length arrays.
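The four steps above can be sketched in Python. This is a minimal illustration, not the book's implementation: whitespace splitting stands in for real tokenization, lowercasing stands in for linguistic preprocessing, and postings are kept as sorted Python lists (i.e., the variable-length-array option).

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Build an inverted index from a dict of {doc_id: text}."""
    index = defaultdict(set)                   # term -> set of doc ids
    for doc_id, text in docs.items():          # 1) collect the documents
        tokens = text.split()                  # 2) tokenize (toy version)
        terms = [t.lower() for t in tokens]    # 3) normalize (toy version)
        for term in terms:                     # 4) build dictionary + postings
            index[term].add(doc_id)
    # The core sorting step: terms in alphabetical order, each with a
    # sorted postings list of document ids.
    return {term: sorted(ids) for term, ids in sorted(index.items())}

docs = {1: "new home sales", 2: "home sales rise", 3: "rise in new home sales"}
index = build_inverted_index(docs)
print(index["home"])  # [1, 2, 3]
print(index["rise"])  # [2, 3]
```

Sets are used while collecting so duplicate occurrences within a document produce a single posting; they are converted to sorted lists at the end because postings-list intersection relies on sorted order.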

Chapter 2
2.1 Convert the byte sequence into a linear sequence of characters, and choose a suitable document unit size together with a proper way of dividing or aggregating files.
2.2 How to determine the vocabulary of terms:
1) Divide the character sequence within each document unit into tokens
2) Drop stop words
3) Canonicalize tokens so that matches occur despite superficial differences in the character sequences of the tokens
4) Stemming and lemmatization
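The four steps can be chained into one toy pipeline. The pieces here are deliberately crude stand-ins: a regex tokenizer, a tiny hand-picked stop list, case-folding as the only canonicalization, and a naive suffix-stripper in place of a real stemmer such as Porter's.

```python
import re

STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to"}  # toy stop list

def suffix_stem(token):
    """Crude suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def terms_from(text):
    tokens = re.findall(r"[a-zA-Z]+", text)              # 1) tokenize
    tokens = [t.lower() for t in tokens]                 # 3) canonicalize (case-fold)
    tokens = [t for t in tokens if t not in STOP_WORDS]  # 2) drop stop words
    return [suffix_stem(t) for t in tokens]              # 4) stem

print(terms_from("The indexing of documents"))  # ['index', 'document']
```

Note that case-folding runs before stop-word removal here so that "The" is caught by the lowercase stop list; real systems order and tune these stages per language and collection.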
2.3 Extensions to the postings list data structure and ways to increase query-processing efficiency.
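One such extension is skip pointers, which let the intersection of two sorted postings lists jump over stretches of documents that cannot match. A sketch, with the common sqrt(n) skip spacing simulated as a fixed stride over a plain list (a real index would store the pointers inside the list structure):

```python
import math

def intersect_with_skips(p1, p2):
    """Intersect two sorted postings lists using simulated skip pointers."""
    skip1 = max(1, int(math.sqrt(len(p1))))  # common heuristic: sqrt(n) skips
    skip2 = max(1, int(math.sqrt(len(p2))))
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # Follow skip pointers while the skip target doesn't overshoot.
            if i + skip1 < len(p1) and p1[i + skip1] <= p2[j]:
                while i + skip1 < len(p1) and p1[i + skip1] <= p2[j]:
                    i += skip1
            else:
                i += 1
        else:
            if j + skip2 < len(p2) and p2[j + skip2] <= p1[i]:
                while j + skip2 < len(p2) and p2[j + skip2] <= p1[i]:
                    j += skip2
            else:
                j += 1
    return answer

print(intersect_with_skips([2, 4, 8, 16, 19, 23, 28, 43],
                           [1, 2, 3, 5, 8, 41, 51, 60, 71]))  # [2, 8]
```

The worst case is still linear, but when one list is much shorter than the other, skips avoid visiting most of the longer list.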

Chapter 3
3.1 Search structures for dictionaries: Hashing and Search Trees.
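The trade-off between the two can be shown with Python stand-ins: a dict plays the role of a hash table (fast exact lookup, no ordering), while a sorted array with binary search stands in for a search tree (ordered, so it supports prefix/range scans). The term list below is made up for illustration.

```python
import bisect

# Toy dictionary of index terms.
terms = sorted(["boar", "board", "bore", "border", "lord", "morbid", "sordid"])
postings = {t: [] for t in terms}  # hash-based lookup: term -> postings list

# Hashing: O(1) average exact-match lookup, but terms have no order,
# so "all terms starting with bor" is not supported directly.
print("board" in postings)  # True

# Tree-like behaviour via a sorted array + binary search: ordering
# makes prefix range queries cheap.
def prefix_range(sorted_terms, prefix):
    lo = bisect.bisect_left(sorted_terms, prefix)
    hi = bisect.bisect_left(sorted_terms, prefix + "\uffff")
    return sorted_terms[lo:hi]

print(prefix_range(terms, "bor"))  # ['border', 'bore']
```

This is why dictionaries that must support wildcard or prefix queries use search trees (e.g., B-trees) rather than pure hashing.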
3.2 When the user is uncertain of the spelling of a query term, or is aware of multiple spelling variants of a term and wants to match all of them, a wildcard query can be used.
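One way to support general wildcard queries is a permuterm index: every rotation of term+"$" is stored as a key, so a query with a single "*" can be rotated until the "*" is at the end and then answered by prefix match. A sketch (the linear scan over keys stands in for the B-tree prefix lookup a real system would use; the vocabulary is made up):

```python
def permuterm_keys(term):
    """All rotations of term + '$' (end-of-term marker)."""
    t = term + "$"
    return [t[i:] + t[:i] for i in range(len(t))]

def build_permuterm(vocab):
    index = {}
    for term in vocab:
        for key in permuterm_keys(term):
            index[key] = term  # in reality, keys live in a B-tree
    return index

def wildcard_lookup(index, query):
    """Answer a query with a single '*', e.g. 'm*n'."""
    padded = query + "$"
    star = padded.index("*")
    rotated = padded[star + 1:] + padded[:star]  # 'm*n' -> 'n$m'
    # Prefix scan; a B-tree would do this as a range lookup.
    return sorted({t for k, t in index.items() if k.startswith(rotated)})

idx = build_permuterm(["man", "moon", "mean", "mole"])
print(wildcard_lookup(idx, "m*n"))  # ['man', 'mean', 'moon']
```

The cost is space: each term of length L contributes L+1 keys, which is why k-gram indexes are a common, more compact alternative.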
3.3 Algorithms for spelling correction: edit distance and k-gram overlap.
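Both techniques are short enough to sketch. Edit distance (Levenshtein) is the standard dynamic program over prefixes; k-gram overlap is measured here with the Jaccard coefficient over character bigrams, using "$" as a boundary marker.

```python
def edit_distance(s, t):
    """Levenshtein distance via dynamic programming, O(len(s) * len(t))."""
    m, n = len(s), len(t)
    # dp[i][j] = minimum edits to turn s[:i] into t[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i              # i deletions
    for j in range(n + 1):
        dp[0][j] = j              # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # delete from s
                           dp[i][j - 1] + 1,         # insert into s
                           dp[i - 1][j - 1] + cost)  # substitute / copy
    return dp[m][n]

def kgrams(term, k=2):
    """Set of character k-grams of term, with '$' boundary markers."""
    t = f"${term}${'' if k > 1 else ''}"
    return {t[i:i + k] for i in range(len(t) - k + 1)}

def jaccard(a, b):
    """Jaccard overlap of two k-gram sets, used to rank candidates."""
    return len(a & b) / len(a | b)

print(edit_distance("fast", "cats"))  # 3
print(round(jaccard(kgrams("november"), kgrams("december")), 2))  # 0.38
```

In a spelling corrector, k-gram overlap is typically used first to retrieve a small candidate set cheaply, and edit distance then ranks those candidates precisely.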

3.4 Generate a “phonetic hash” for each term so that similar-sounding terms hash to the same value.
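The classic phonetic hash is Soundex. A sketch of the simplified variant: keep the first letter, map the remaining letters to digit classes, collapse runs of identical digits, drop zeros, and pad or truncate to a four-character code.

```python
# Letter -> digit classes (vowels and h/w/y map to "0" and are dropped later).
SOUNDEX_MAP = {}
for digit, letters in [("0", "aeiouhwy"), ("1", "bfpv"), ("2", "cgjkqsxz"),
                       ("3", "dt"), ("4", "l"), ("5", "mn"), ("6", "r")]:
    for ch in letters:
        SOUNDEX_MAP[ch] = digit

def soundex(term):
    """Simplified Soundex: first letter + three digits, zero-padded."""
    term = term.lower()
    first, rest = term[0], term[1:]
    digits = [SOUNDEX_MAP[c] for c in rest if c in SOUNDEX_MAP]
    collapsed = []                       # remove consecutive duplicate digits
    for d in digits:
        if not collapsed or collapsed[-1] != d:
            collapsed.append(d)
    code = "".join(d for d in collapsed if d != "0")  # drop the zeros
    return (first.upper() + code + "000")[:4]

print(soundex("Herman"))   # H655
print(soundex("Hermann"))  # H655 -- same hash despite the spelling variant
```

Because both spellings hash to H655, a query for one surfaces documents containing the other, which is the point of phonetic matching for names.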
