Monday, February 10, 2014

Week 5: Reading Notes


Chapter 11
11.1
Probability theory:
P(A,B) = P(A∩B) = P(A|B)P(B) = P(B|A)P(A)
Bayes’ Rule:
P(A|B) = P(B|A)P(A)/P(B)

Odds:
O(A) = P(A)/P(Ā) = P(A)/(1 − P(A))
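The identities above can be checked numerically. A quick sketch (the probabilities below are made up for illustration):

```python
def bayes(p_b_given_a, p_a, p_b):
    """P(A|B) = P(B|A) P(A) / P(B)."""
    return p_b_given_a * p_a / p_b

def odds(p):
    """O(A) = P(A) / (1 - P(A))."""
    return p / (1.0 - p)

# Joint consistency: P(A,B) = P(A|B)P(B) = P(B|A)P(A)
p_a, p_b, p_b_given_a = 0.3, 0.5, 0.8
p_a_given_b = bayes(p_b_given_a, p_a, p_b)   # ~0.48
joint_1 = p_a_given_b * p_b                  # P(A|B) P(B)
joint_2 = p_b_given_a * p_a                  # P(B|A) P(A)
assert abs(joint_1 - joint_2) < 1e-12

print(odds(0.75))   # 3.0
```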


11.2
PRP with retrieval costs
C1—cost of not retrieving a relevant document
C0—cost of retrieval of a nonrelevant document
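A minimal sketch of how these costs could drive a retrieval decision, assuming a simple expected-cost comparison (my reading of the cost-based PRP; the exact inequality in the chapter may be stated differently):

```python
# C1: cost of NOT retrieving a relevant document
# C0: cost of retrieving a nonrelevant document
def should_retrieve(p_relevant, c1, c0):
    """Retrieve when the expected cost of retrieving, c0 * P(nonrelevant),
    is no larger than the expected cost of skipping, c1 * P(relevant)."""
    return c0 * (1.0 - p_relevant) <= c1 * p_relevant

# With equal costs this reduces to P(relevant) >= 0.5
print(should_retrieve(0.6, c1=1.0, c0=1.0))  # True
print(should_retrieve(0.4, c1=1.0, c0=1.0))  # False
# A high miss cost c1 lowers the retrieval threshold
print(should_retrieve(0.2, c1=4.0, c0=1.0))  # True
```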

11.3
BIM—Binary Independence Model, equivalent to the multivariate Bernoulli Naïve Bayes model
Assumption: the relevance of each document is independent of the relevance of other documents, and terms occur independently of one another.


Assumption: relevant documents are a very small percentage of the collection, so it is plausible to approximate statistics for nonrelevant documents by statistics from the whole collection. Then ut (the probability of term occurrence in nonrelevant documents for a query) is dft/N, and
log[(1 − ut)/ut] = log[(N − dft)/dft] ≈ log(N/dft)
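The approximation log[(N − dft)/dft] ≈ log(N/dft) is easy to check numerically; it is tight when dft is a small fraction of N and drifts as dft grows (the collection size and document frequencies below are made up):

```python
import math

N = 1_000_000  # hypothetical collection size

# Exact vs. approximated term weight for the nonrelevant-document model
for df in (10, 1_000, 100_000):
    exact = math.log((N - df) / df)
    approx = math.log(N / df)
    print(df, round(exact, 4), round(approx, 4))
```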

11.4 BIM assumption:
• a Boolean representation of documents/queries/relevance
• term independence
• terms not in the query don’t affect the outcome
• document relevance values are independent


But some of these assumptions could be relaxed, such as term independence: some terms usually appear together as pairs, like Hong and Kong, and are strongly dependent.
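A minimal sketch of BIM-style scoring under the assumptions above: the retrieval status value is a sum of log odds ratios over query terms present in the document, with ut ≈ dft/N and a constant pt in the absence of relevance information (both simplifications, not the book's full derivation):

```python
import math

def bim_rsv(query_terms, doc_terms, df, N, p_t=0.5):
    """Retrieval status value: sum of log odds ratios over query terms
    that occur in the document; u_t is approximated by df_t / N."""
    score = 0.0
    for t in query_terms:
        if t in doc_terms:
            u_t = df[t] / N
            score += math.log((p_t * (1 - u_t)) / (u_t * (1 - p_t)))
    return score

# Toy collection statistics (made up for illustration)
N = 1000
df = {"hong": 20, "kong": 22, "the": 900}

print(bim_rsv(["hong", "kong"], {"hong", "kong", "harbour"}, df, N))  # rare terms: high score
print(bim_rsv(["the"], {"the", "cat"}, df, N))                        # common term: negative score
```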

Chapter 12
12.1
A language model is a function that puts a probability measure over strings drawn from some vocabulary.
There are many kinds of language models, such as the bigram language model and the unigram language model (the multinomial unigram language model and the multiple-Bernoulli model).

12.2
The query likelihood model
Based on the Bayes rule P(d|q) = P(q|d)P(d)/P(q)
Documents are ranked by the probability that a query would be observed as a random sample from the respective document model.
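A minimal sketch of query-likelihood ranking with a maximum-likelihood unigram document model (no smoothing yet; a query term absent from the document drives the score to zero, which is the sparsity problem smoothing later addresses):

```python
from collections import Counter

def query_likelihood(query, doc_tokens):
    """P(q | M_d) under an MLE unigram model of the document:
    the product over query terms of tf(t, d) / |d|."""
    tf = Counter(doc_tokens)
    n = len(doc_tokens)
    p = 1.0
    for t in query:
        p *= tf[t] / n
    return p

d1 = "click go the shears boys click click click".split()
d2 = "metal shears are sharp".split()

q = ["shears", "click"]
# Rank documents by P(q | M_d); with a uniform prior P(d) drops out
scores = sorted([("d1", query_likelihood(q, d1)), ("d2", query_likelihood(q, d2))],
                key=lambda x: x[1], reverse=True)
print(scores)  # d1 ranks above d2, which scores zero
```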


Term frequency is directly represented in tf-idf models, but recent work has recognized the importance of document length normalization.
