Chapter 11
11.1
Probability theory:
P(A, B) = P(A ∩ B) =
P(A|B)P(B) = P(B|A)P(A)
Odds: O(A) = P(A)/P(¬A) = P(A)/(1 − P(A))
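A minimal sketch of these identities in code; the joint-probability numbers are made up purely for illustration:

```python
# Hypothetical probabilities for two binary events A and B,
# chosen only to illustrate the identities above (not from the text).
p_a_and_b = 0.12   # P(A ∩ B)
p_b = 0.30         # P(B)
p_a = 0.20         # P(A)

# Conditional probability: P(A|B) = P(A ∩ B) / P(B)
p_a_given_b = p_a_and_b / p_b

# Bayes' rule: P(B|A) = P(A|B) P(B) / P(A)
p_b_given_a = p_a_given_b * p_b / p_a

# Odds of A: O(A) = P(A) / (1 - P(A))
odds_a = p_a / (1 - p_a)

print(p_a_given_b)  # 0.4
print(p_b_given_a)  # 0.6
print(odds_a)       # 0.25
```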
11.2
PRP with retrieval costs
11.3
BIM (Binary Independence Model),
the same as the multivariate Bernoulli Naïve Bayes model.
Assumption: the relevance
of each document is independent of the relevance of other documents, and terms
are independent of each other.
Assumption: relevant
documents are a very small percentage of the collection, so it is plausible to
approximate statistics for nonrelevant documents by statistics
from the whole collection. Then ut (the probability of term occurrence in
nonrelevant documents for a query) is dft/N and
log[(1 − ut)/ut] = log[(N − dft)/dft] ≈ log(N/dft)
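The approximation can be checked numerically; this sketch compares the exact BIM weight log[(N − dft)/dft] with the idf-like form log(N/dft), using an assumed collection size and document frequency:

```python
import math

# BIM term weight under the approximation u_t ≈ df_t / N:
# log[(1 - u_t)/u_t] = log[(N - df_t)/df_t]
def bim_weight(df_t: int, N: int) -> float:
    return math.log((N - df_t) / df_t)

# The familiar idf form, log(N / df_t)
def idf(df_t: int, N: int) -> float:
    return math.log(N / df_t)

# For a rare term (df_t much smaller than N) the two are nearly equal.
N, df_t = 1_000_000, 100
print(bim_weight(df_t, N))  # ≈ 9.21
print(idf(df_t, N))         # ≈ 9.21
```

The gap between the two grows as dft approaches N, which is why the approximation is stated for terms that are rare in the collection.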
11.4 BIM assumptions:
• a Boolean representation
of documents/queries/relevance
• term independence
• terms not in the query
don’t affect the outcome
• document relevance values
are independent
But some of these assumptions can be
relaxed, for example term independence: some phrases usually appear as
term pairs, like "Hong Kong", whose terms are strongly dependent.
Chapter 12
A language model is a
function that puts a probability measure over strings drawn from some
vocabulary.
There are many kinds of
language models, such as the bigram language model and the unigram language model
(the multinomial unigram language model and the multiple-Bernoulli model).
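A minimal sketch of a multinomial unigram language model, estimated from one toy document by maximum likelihood; the text is illustrative only:

```python
from collections import Counter

# Toy document; the unigram model is just the normalized term counts.
doc = "click go the shears boys click click click".split()
counts = Counter(doc)
total = len(doc)

def p_term(t: str) -> float:
    # MLE: P(t|M_d) = count of t in d / length of d
    return counts[t] / total

def p_string(terms) -> float:
    # Unigram assumption: terms are generated independently,
    # so a string's probability is the product of its term probabilities.
    p = 1.0
    for t in terms:
        p *= p_term(t)
    return p

print(p_term("click"))               # 0.5 (4 of 8 tokens)
print(p_string(["shears", "boys"]))  # 0.125 * 0.125 = 0.015625
```

A bigram model would instead condition each term on the previous one; the multiple-Bernoulli variant models only presence/absence of each vocabulary term.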
12.2
The query likelihood model
Based on Bayes' rule: P(d|q)
= P(q|d)P(d)/P(q)
Documents are ranked by the
probability that a query would be observed as a random sample from the
respective document model.
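The ranking idea above can be sketched as follows, using Jelinek–Mercer smoothing (a mixture of the document and collection models); the toy documents, query, and lambda value are assumptions for illustration:

```python
from collections import Counter

# Two toy documents (illustrative only).
docs = {
    "d1": "information retrieval models probability ranking".split(),
    "d2": "language models for information retrieval".split(),
}
collection = [t for d in docs.values() for t in d]
cf = Counter(collection)   # collection frequencies
C = len(collection)
LAMBDA = 0.5               # mixing weight between document and collection models

def score(query: str, doc_tokens) -> float:
    tf = Counter(doc_tokens)
    L = len(doc_tokens)
    p = 1.0
    for t in query.split():
        # Smoothed P(t|d): lambda * P_mle(t|d) + (1 - lambda) * P_mle(t|collection)
        p *= LAMBDA * tf[t] / L + (1 - LAMBDA) * cf[t] / C
    return p

# Rank documents by P(q|d); with a uniform prior P(d), this ranks by P(d|q).
ranked = sorted(docs, key=lambda d: score("language models", docs[d]), reverse=True)
print(ranked)  # d2 should rank above d1 for this query
```

Smoothing is what keeps a single missing query term from zeroing out the whole document score; it also implicitly performs the idf-like discounting of common terms.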
Term frequency is directly
represented in tf-idf models, but recent work has recognized the importance of
document length normalization.