Week 3: Muddiest Point
Nowadays there are many abbreviations, such as "you"-"u", "are"-"r", "please"-"plz", and "thanks"-"tks". The abbreviated forms are generally single characters or strings that are not meaningful words on their own. If we use indexing phrases to resolve the polysemy problem, the context itself often consists of abbreviated words as well, which makes things even more confusing, for example "r u goin 2 c the film b4 tmr?" How can these words be automatically indexed?
Friday, January 24, 2014
Week 3: Reading Notes
Chapter 4
4.1 The characteristics of computer hardware relevant to an information retrieval system:
1. Caching: keeping frequently used disk data in main memory.
2. Reading a single byte from disk can take as much time as reading an entire block, because operating systems generally read and write whole blocks.
3. Data transferred from disk to memory is handled by the system bus, not by the processor.
4.2 Blocked sort-based indexing. The algorithm stores inverted blocks in files f1, . . . , fn and the merged index in fmerged. In the last step, the algorithm simultaneously merges the blocks into one large merged index.
n ← 0
while (all documents have not been processed)
do n ← n + 1
   block ← PARSENEXTBLOCK()
   BSBI-INVERT(block)
   WRITEBLOCKTODISK(block, fn)
MERGEBLOCKS(f1, . . . , fn; fmerged)
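The block-merge step can be sketched as a toy in-memory version in Python (the function names bsbi_invert and merge_blocks are mine, and real BSBI reads the sorted runs from disk files rather than from lists):

```python
import heapq
from itertools import groupby

def bsbi_invert(block):
    """BSBI-INVERT: sort a block of (term, docID) pairs."""
    return sorted(block)

def merge_blocks(blocks):
    """MERGEBLOCKS: merge the sorted runs into one inverted index,
    a dict mapping each term to its sorted postings list."""
    index = {}
    # heapq.merge interleaves the pre-sorted runs; groupby drops duplicates
    for (term, doc_id), _ in groupby(heapq.merge(*blocks)):
        index.setdefault(term, []).append(doc_id)
    return index
```

Because each run is already sorted, the merge reads every run sequentially, which is what makes the disk-based version efficient.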
4.3 Single-pass in-memory indexing. The part of the algorithm that parses documents and turns them into a stream of term–docID pairs (tokens) has been omitted.
output_file = NEWFILE()
dictionary = NEWHASH()
while (free memory available)
do token ← next(token_stream)
   if term(token) ∉ dictionary
      then postings_list = ADDTODICTIONARY(dictionary, term(token))
      else postings_list = GETPOSTINGSLIST(dictionary, term(token))
   if full(postings_list)
      then postings_list = DOUBLEPOSTINGSLIST(dictionary, term(token))
   ADDTOPOSTINGSLIST(postings_list, docID(token))
sorted_terms ← SORTTERMS(dictionary)
WRITEBLOCKTODISK(sorted_terms, dictionary, output_file)
return output_file
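A toy in-memory rendering of this loop (Python dicts and lists stand in for NEWHASH and DOUBLEPOSTINGSLIST, a token cap stands in for "free memory available", and the name spimi_invert is mine):

```python
def spimi_invert(token_stream, max_tokens=1000):
    """One SPIMI block: postings accumulated directly in a hash,
    with no global sort of the token stream."""
    dictionary = {}
    for n, (term, doc_id) in enumerate(token_stream):
        if n >= max_tokens:  # stand-in for running out of free memory
            break
        # setdefault plays both ADDTODICTIONARY and GETPOSTINGSLIST
        postings_list = dictionary.setdefault(term, [])
        if not postings_list or postings_list[-1] != doc_id:
            postings_list.append(doc_id)  # lists grow automatically
    return sorted(dictionary.items())     # SORTTERMS before writing the block
```

The key difference from BSBI is visible here: tokens go straight into per-term postings lists, and only the dictionary is sorted, once, when the block is written out.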
4.4 Distributed indexing is an application of MapReduce, a general architecture for distributed computing. The map and reduce phases of MapReduce split up the computing job into chunks that standard machines can process in a short time.
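For index construction, the map phase (the parser) emits (term, docID) pairs for its document split, and the reduce phase (the inverter) collects all docIDs for a term into one postings list. A single-process sketch, not a real MapReduce runtime (function names are mine):

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Parser: emit (term, docID) pairs for one document."""
    return [(term, doc_id) for term in text.lower().split()]

def reduce_phase(pairs):
    """Inverter: collect the docIDs for each term into a postings list."""
    postings = defaultdict(list)
    for term, doc_id in sorted(set(pairs)):  # dedupe, then group by term
        postings[term].append(doc_id)
    return dict(postings)
```

In real MapReduce the pairs emitted by the mappers are partitioned by key across many inverter machines; the sketch collapses that shuffle into a single sort.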
4.5 Dynamic indexing
Logarithmic merging. Each token (termID, docID) is initially added to the in-memory index Z0 by LMERGEADDTOKEN. LOGARITHMICMERGE initializes Z0 and indexes.

Z0 ← MERGE(Z0, {token})
if |Z0| = n
then for i ← 0 to ∞
     do if Ii ∈ indexes
           then Zi+1 ← MERGE(Ii, Zi)
                (Zi+1 is a temporary index on disk.)
                indexes ← indexes − {Ii}
           else Ii ← Zi (Zi becomes the permanent index Ii.)
                indexes ← indexes ∪ {Ii}
                BREAK
     Z0 ← ∅
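A toy version of LMERGEADDTOKEN, with indexes held as an in-memory dict mapping level i to index Ii and a small n for demonstration (in reality Z0 and the Ii live on disk):

```python
def lmerge_add_token(indexes, z0, token, n=2):
    """Add one (termID, docID) token to Z0; once |Z0| = n, cascade
    the merge down levels 0, 1, ... until a free level is found."""
    z0 = sorted(set(z0 + [token]))        # Z0 <- MERGE(Z0, {token})
    if len(z0) < n:
        return indexes, z0
    z, i = z0, 0
    while i in indexes:                   # Ii exists: merge, carry to level i+1
        z = sorted(set(indexes.pop(i) + z))
        i += 1
    indexes[i] = z                        # first free level: Zi becomes Ii
    return indexes, []                    # Z0 <- empty
```

Each index Ii holds 2^i * n entries, so a token cascades upward the way a binary counter carries, which is what bounds the number of merges logarithmically.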
Chapter 5
5.1 Statistical properties of terms in IR:
Heaps’ law: a better way of getting a handle on the vocabulary size M:
M = kT^b
(T is the number of tokens in the collection; typically 30 ≤ k ≤ 100 and b ≈ 0.5. k is quite variable because vocabulary growth depends a lot on the nature of the collection and how it is processed.)
Zipf’s law: if t1 is the most common term in the collection, t2 is the next most common, and so on, then the collection frequency cfi of the ith most common term is proportional to 1/i:
cfi ∝ 1/i
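Both laws are easy to evaluate numerically; a small sketch (the constants k = 44 and b = 0.49 are the values the chapter fits to Reuters-RCV1, not universal ones, and Zipf's law is taken here as an exact equality rather than a proportionality):

```python
def heaps_vocab_size(T, k=44, b=0.49):
    """Heaps' law M = k * T**b: estimated vocabulary size for T tokens."""
    return k * T ** b

def zipf_cf(cf1, i):
    """Zipf's law with exact proportionality: cf_i = cf_1 / i."""
    return cf1 / i
```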
5.2 Dictionary data structures that achieve increasingly higher compression ratios:
1. The simplest data structure is to sort the vocabulary lexicographically and store it in an array of fixed-width entries.
2. Further compress the dictionary by grouping the terms in the string into blocks of size k and keeping a term pointer only for the first term of each block.
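A sketch of blocking with k = 4: the sorted terms are concatenated into one string with 1-byte length prefixes, and a pointer is kept only for the first term of each block (the exact layout here is illustrative; the saving is the k − 1 pointers dropped per block):

```python
def block_dictionary(sorted_terms, k=4):
    """Dictionary-as-a-string with blocking: returns the packed string
    and the term pointers (offsets of each block's first term)."""
    string, pointers = "", []
    for j, term in enumerate(sorted_terms):
        if j % k == 0:
            pointers.append(len(string))  # pointer to the block's first term
        string += chr(len(term)) + term   # 1-byte length prefix + the term
    return string, pointers
```

To look a term up, binary-search the pointers for the right block, then walk the length prefixes through at most k terms.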
5.3
1. VB (variable byte) encoding and decoding. The functions div and mod compute integer division and remainder after integer division, respectively. PREPEND adds an element to the beginning of a list, for example, PREPEND(⟨1, 2⟩, 3) = ⟨3, 1, 2⟩. EXTEND extends a list, for example, EXTEND(⟨1, 2⟩, ⟨3, 4⟩) = ⟨1, 2, 3, 4⟩.
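The VB scheme is short enough to write out in full, following the div/mod/PREPEND structure of the pseudocode:

```python
def vb_encode_number(n):
    """Variable byte encoding of one gap: 7 payload bits per byte,
    with the high (continuation) bit set only on the last byte."""
    bytes_ = []
    while True:
        bytes_.insert(0, n % 128)  # PREPEND(bytes, n mod 128)
        if n < 128:
            break
        n = n // 128               # n div 128
    bytes_[-1] += 128              # mark the final byte of this number
    return bytes_

def vb_decode(bytestream):
    """Decode a concatenated stream of VB-encoded gaps."""
    numbers, n = [], 0
    for byte in bytestream:
        if byte < 128:             # continuation byte: accumulate 7 bits
            n = 128 * n + byte
        else:                      # final byte: finish this number
            numbers.append(128 * n + (byte - 128))
            n = 0
    return numbers
```

For example, the gap 824 encodes to the two bytes 00000110 10111000, i.e. [6, 184].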
2. γ codes. They implement variable-length encoding by splitting the representation of a gap G into a pair of length and offset.
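A sketch of the encoder: the offset is G's binary representation with its leading 1 removed, and the length is the offset's length written in unary:

```python
def gamma_encode(g):
    """Gamma code of gap G >= 1 as a bit string: unary code for the
    offset's length, then the offset (binary G minus its leading 1)."""
    offset = bin(g)[3:]               # e.g. 13 = 0b1101 -> offset '101'
    length = "1" * len(offset) + "0"  # unary: len(offset) ones, then a 0
    return length + offset
```

So γ(13) = 1110101: unary length 1110, offset 101. The degenerate case G = 1 has an empty offset and encodes as the single bit 0.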
Friday, January 17, 2014
Week 2: Reading Notes
FOA
1.1
FOA (Finding Out About) is the empirical grounding for the theories referred to as Information Retrieval. There are three steps:
1. Asking a question;
similar to a QUERY: the user's information need.
2. Constructing an answer;
retrieval of documents from the corpus in response to the QUERY.
3. Assessing the answer;
relevance feedback on how relevant the provided answer is, that is, how well the descriptive features mentioned by users in their queries match the documents sharing those same features.
IES
1.1
What is Information Retrieval?
1. Web Search: implementing and evaluating relevance ranking algorithms under the variety of contexts and requirements of a Web search engine.
2. Other Search Applications: a desktop search engine provides search and browsing facilities for files stored on a local hard disk and possibly on disks connected over a local network. These need shorter index creation times and greater awareness of file formats.
3. Other IR Applications: storage, manipulation, and retrieval of human-language data; more about categorization and filtering.
1.2
Information Retrieval System
1. Basic Architecture
Information need (user) -> translated into a query for the search engine -> the search engine manipulates an inverted index for the document collection -> the search engine returns a ranked list of results (performing relevance ranking).
2. Update Documents
An update is viewed as a deletion of the old page plus an addition of the new page.
3. Perform Evaluation
PRP (Probability Ranking Principle): If an IR system’s response to each query is a ranking of the documents in the collection in order of decreasing probability of relevance, then the overall effectiveness of the system to its users will be maximized.
MIR
1.1
History of Information Retrieval: its adoption in the Library Science area, and how it is currently at the center of the stage alongside other technologies.
1.2
The IR problem: to retrieve all the documents that are relevant to a user query while retrieving as few non-relevant documents as possible.
1.3
High-level software architecture of an IR system. Additional module: crawling. The process of indexing, retrieval, and ranking of documents.
1.4 History of the Web and how the Web changed search.
Friday, January 10, 2014
Week 1: Muddiest Point
How can a search engine such as Google retrieve information across different languages based on one query?
Week 1: Reading Notes
Chapter 1 Section 1.2
1. How to build an inverted index, core step: sorting the list so that the terms are alphabetical.
1) Collect the documents to be indexed.
2) Tokenize the text, turning each document into a list of tokens.
3) Do linguistic preprocessing, producing a list of normalized tokens, which are the indexing terms.
4) Index the documents that each term occurs in by creating an inverted index, consisting of a dictionary and postings.
2. For an in-memory postings list, two good alternatives are singly linked lists or variable-length arrays.
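The four steps can be condensed into a few lines of Python (lowercasing stands in for the full linguistic preprocessing of step 3, and the function name is mine):

```python
def build_inverted_index(docs):
    """Build an inverted index from docs, a dict mapping docID -> text.
    Returns a dict mapping each term to its sorted postings list."""
    index = {}
    for doc_id, text in sorted(docs.items()):    # visit docs in docID order
        for term in set(text.lower().split()):   # each (term, doc) only once
            index.setdefault(term, []).append(doc_id)
    return index
```

Processing documents in docID order keeps every postings list sorted without an explicit sort, which is what makes list intersection for Boolean queries linear.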
Chapter 2
2.1 Convert the byte sequence into a linear sequence of characters, and choose a suitable size for the document units together with a proper way of dividing or aggregating files.
2.2 How to determine the vocabulary of terms:
1) Divide the character sequence of each document unit into tokens
2) Drop stop words
3) Canonicalize tokens so that matches occur despite superficial differences in the character sequences of the tokens
4) Stemming and lemmatization
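Steps 1–3 of determining the vocabulary might look like this (the tiny stop list and the regex tokenizer are illustrative only, and no real stemmer or lemmatizer is included):

```python
import re

STOP_WORDS = frozenset({"the", "a", "an", "of", "to"})

def terms_from_text(text):
    """Tokenize, case-fold, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())  # tokenize + case-fold
    return [t for t in tokens if t not in STOP_WORDS]  # drop stop words
```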
2.3 Extensions to postings list data structures and ways to increase efficiency.
Chapter 3
3.1 Search structures for dictionaries: Hashing and Search Trees.
3.2 When the user is uncertain of the spelling of a query term, or is aware of multiple spelling variants of a term and wants documents containing any of them, wildcard queries can be used.
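One standard structure for wildcard queries is the permuterm index: every rotation of term + $ (an end-of-term marker) is stored as a key pointing back to the term. A minimal sketch (the function name is mine):

```python
def permuterm_rotations(term):
    """All rotations of term + '$'; in a real index each rotation is a
    B-tree key whose value is the original term."""
    t = term + "$"
    return [t[i:] + t[:i] for i in range(len(t))]
```

A query like hel* is rewritten by appending $ and rotating the * to the end, giving $hel*, so it becomes the prefix lookup $hel against the stored rotations.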
3.3 Algorithms for spelling correction: Edit Distance, K-gram overlap
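Edit distance is computed with the usual dynamic-programming recurrence; a compact two-row sketch with unit costs for insertion, deletion, and substitution:

```python
def edit_distance(s1, s2):
    """Levenshtein distance between s1 and s2, keeping only two DP rows."""
    prev = list(range(len(s2) + 1))  # distances from "" to prefixes of s2
    for i, c1 in enumerate(s1, 1):
        curr = [i]                   # distance from s1[:i] to ""
        for j, c2 in enumerate(s2, 1):
            curr.append(min(prev[j] + 1,                # delete c1
                            curr[j - 1] + 1,            # insert c2
                            prev[j - 1] + (c1 != c2)))  # substitute
        prev = curr
    return prev[-1]
```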
3.4 Generate a “phonetic hash” for each term so that similar-sounding terms hash to the same value.
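The best-known phonetic hash is Soundex; a sketch of the classic 4-character variant (retain the first letter, map the remaining letters to digit classes, collapse runs of identical digits, drop the vowel class, and pad with zeros):

```python
# digit classes: vowels/h/w/y -> 0, b/f/p/v -> 1, c/g/j/k/q/s/x/z -> 2, ...
CODES = {c: d for d, letters in enumerate(
    ["aeiouhwy", "bfpv", "cgjkqsxz", "dt", "l", "mn", "r"]) for c in letters}

def soundex(term):
    """Classic 4-character Soundex hash, e.g. 'herman' -> 'H655'."""
    term = term.lower()
    digits = [str(CODES.get(c, 0)) for c in term]
    # collapse runs of identical digits (including the first letter's digit)
    collapsed = [d for i, d in enumerate(digits) if i == 0 or d != digits[i - 1]]
    code = "".join(d for d in collapsed[1:] if d != "0")  # drop the zeros
    return (term[0].upper() + code + "000")[:4]           # pad / truncate to 4
```

Similar-sounding names then collide as intended: herman and hermann both hash to H655.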