Monday, February 10, 2014

Week 5: Reading Notes


Chapter 11
11.1
Probability theory:
P(A,B) = P(A∩B) = P(A|B)P(B) = P(B|A)P(A)
Bayes’ Rule:
P(A|B) = P(B|A)P(A)/P(B)

Odds:
O(A) = P(A)/P(Ā) = P(A)/(1 − P(A))
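The identities above can be checked numerically. A quick sketch (the probabilities below are made up for illustration):

```python
def bayes(p_b_given_a, p_a, p_b):
    """P(A|B) = P(B|A) P(A) / P(B)."""
    return p_b_given_a * p_a / p_b

def odds(p):
    """O(A) = P(A) / (1 - P(A))."""
    return p / (1.0 - p)

# Joint consistency: P(A,B) = P(A|B)P(B) = P(B|A)P(A)
p_a, p_b, p_b_given_a = 0.3, 0.5, 0.8
p_a_given_b = bayes(p_b_given_a, p_a, p_b)   # ~0.48
joint_1 = p_a_given_b * p_b                  # P(A|B) P(B)
joint_2 = p_b_given_a * p_a                  # P(B|A) P(A)
assert abs(joint_1 - joint_2) < 1e-12

print(odds(0.75))   # 3.0
```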


11.2
PRP with retrieval costs
C1—cost of not retrieving a relevant document
C0—cost of retrieval of a nonrelevant document
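A minimal sketch of how these costs could drive a retrieval decision, assuming a simple expected-cost comparison (my reading of the cost-based PRP; the exact inequality in the chapter may be stated differently):

```python
# C1: cost of NOT retrieving a relevant document
# C0: cost of retrieving a nonrelevant document
def should_retrieve(p_relevant, c1, c0):
    """Retrieve when the expected cost of retrieving, c0 * P(nonrelevant),
    is no larger than the expected cost of skipping, c1 * P(relevant)."""
    return c0 * (1.0 - p_relevant) <= c1 * p_relevant

# With equal costs this reduces to P(relevant) >= 0.5
print(should_retrieve(0.6, c1=1.0, c0=1.0))  # True
print(should_retrieve(0.4, c1=1.0, c0=1.0))  # False
# A high miss cost c1 lowers the retrieval threshold
print(should_retrieve(0.2, c1=4.0, c0=1.0))  # True
```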

11.3
BIM—Binary Independence Model, equivalent to the multivariate Bernoulli Naïve Bayes model
Assumption: the relevance of each document is independent of the relevance of other documents, and terms occur independently of one another.


Assumption: relevant documents are a very small percentage of the collection, so it is plausible to approximate statistics for nonrelevant documents by statistics from the whole collection. Then ut (the probability of term occurrence in nonrelevant documents for a query) is dft/N, and
log[(1 − ut)/ut] = log[(N − dft)/dft] ≈ log(N/dft)
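The approximation log[(N − dft)/dft] ≈ log(N/dft) is easy to check numerically; it is tight when dft is a small fraction of N and drifts as dft grows (the collection size and document frequencies below are made up):

```python
import math

N = 1_000_000  # hypothetical collection size

# Exact vs. approximated term weight for the nonrelevant-document model
for df in (10, 1_000, 100_000):
    exact = math.log((N - df) / df)
    approx = math.log(N / df)
    print(df, round(exact, 4), round(approx, 4))
```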

11.4 BIM assumption:
• a Boolean representation of documents/queries/relevance
• term independence
• terms not in the query don’t affect the outcome
• document relevance values are independent


But some of these assumptions could be relaxed, such as term independence: some terms usually appear together as pairs, like Hong and Kong, and are strongly dependent.
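A minimal sketch of BIM-style scoring under the assumptions above: the retrieval status value is a sum of log odds ratios over query terms present in the document, with ut ≈ dft/N and a constant pt in the absence of relevance information (both simplifications, not the book's full derivation):

```python
import math

def bim_rsv(query_terms, doc_terms, df, N, p_t=0.5):
    """Retrieval status value: sum of log odds ratios over query terms
    that occur in the document; u_t is approximated by df_t / N."""
    score = 0.0
    for t in query_terms:
        if t in doc_terms:
            u_t = df[t] / N
            score += math.log((p_t * (1 - u_t)) / (u_t * (1 - p_t)))
    return score

# Toy collection statistics (made up for illustration)
N = 1000
df = {"hong": 20, "kong": 22, "the": 900}

print(bim_rsv(["hong", "kong"], {"hong", "kong", "harbour"}, df, N))  # rare terms: high score
print(bim_rsv(["the"], {"the", "cat"}, df, N))                        # common term: negative score
```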

Chapter 12
12.1
A language model is a function that puts a probability measure over strings drawn from some vocabulary.
There are many kinds of language models, such as the bigram language model and the unigram language model (the multinomial unigram language model and the multiple-Bernoulli model).

12.2
The query likelihood model
Based on the Bayes rule P(d|q) = P(q|d)P(d)/P(q)
Documents are ranked by the probability that a query would be observed as a random sample from the respective document model.
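A minimal sketch of query-likelihood ranking with a maximum-likelihood unigram document model (no smoothing yet; a query term absent from the document drives the score to zero, which is the sparsity problem smoothing later addresses):

```python
from collections import Counter

def query_likelihood(query, doc_tokens):
    """P(q | M_d) under an MLE unigram model of the document:
    the product over query terms of tf(t, d) / |d|."""
    tf = Counter(doc_tokens)
    n = len(doc_tokens)
    p = 1.0
    for t in query:
        p *= tf[t] / n
    return p

d1 = "click go the shears boys click click click".split()
d2 = "metal shears are sharp".split()

q = ["shears", "click"]
# Rank documents by P(q | M_d); with a uniform prior P(d) drops out
scores = sorted([("d1", query_likelihood(q, d1)), ("d2", query_likelihood(q, d2))],
                key=lambda x: x[1], reverse=True)
print(scores)  # d1 ranks above d2, which scores zero
```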


Term frequency is directly represented in tf-idf models, but recent work has recognized the importance of document length normalization.
