Text classification and Naive Bayes
1. The text classification problem
TRAINMULTINOMIALNB(C, D)
  V ← EXTRACTVOCABULARY(D)
  N ← COUNTDOCS(D)
  for each c ∈ C
  do N_c ← COUNTDOCSINCLASS(D, c)
     prior[c] ← N_c / N
     text_c ← CONCATENATETEXTOFALLDOCSINCLASS(D, c)
     for each t ∈ V
     do T_ct ← COUNTTOKENSOFTERM(text_c, t)
     for each t ∈ V
     do condprob[t][c] ← (T_ct + 1) / Σ_t′ (T_ct′ + 1)
  return V, prior, condprob

APPLYMULTINOMIALNB(C, V, prior, condprob, d)
  W ← EXTRACTTOKENSFROMDOC(V, d)
  for each c ∈ C
  do score[c] ← log prior[c]
     for each t ∈ W
     do score[c] += log condprob[t][c]
  return argmax_{c∈C} score[c]
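
To make the pseudocode concrete, here is a minimal Python sketch of both procedures (the function names, data layout, and toy data are my own illustration, not code from the chapter). Training uses add-one (Laplace) smoothing exactly as in the condprob line above, and application sums log probabilities so that the product of many small probabilities does not underflow.

from collections import Counter, defaultdict
import math

def train_multinomial_nb(classes, docs):
    """docs: list of (class_label, token_list) pairs."""
    vocab = {t for _, tokens in docs for t in tokens}
    n_docs = len(docs)
    log_prior, log_condprob = {}, defaultdict(dict)
    for c in classes:
        class_docs = [tokens for label, tokens in docs if label == c]
        log_prior[c] = math.log(len(class_docs) / n_docs)
        # Concatenate the text of all docs in class c and count term tokens.
        t_ct = Counter(t for tokens in class_docs for t in tokens)
        # Add-one smoothing: denominator is the sum over t' of (T_ct' + 1).
        denom = sum(t_ct.values()) + len(vocab)
        for t in vocab:
            log_condprob[t][c] = math.log((t_ct[t] + 1) / denom)
    return vocab, log_prior, log_condprob

def apply_multinomial_nb(classes, vocab, log_prior, log_condprob, tokens):
    scores = {}
    for c in classes:
        scores[c] = log_prior[c]
        for t in tokens:
            if t in vocab:  # out-of-vocabulary terms are ignored
                scores[c] += log_condprob[t][c]
    return max(scores, key=scores.get)

# Toy data in the style of the book's China example:
docs = [("china", "chinese beijing chinese".split()),
        ("china", "chinese chinese shanghai".split()),
        ("china", "chinese macao".split()),
        ("japan", "tokyo japan chinese".split())]
model = train_multinomial_nb(["china", "japan"], docs)
print(apply_multinomial_nb(["china", "japan"], *model,
                           "chinese chinese chinese tokyo japan".split()))  # -> china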
2. Relation to multinomial unigram language model
3. The Bernoulli model
4. Properties of Naive Bayes
4.1 A variant of the multinomial model
5. Feature selection
Feature selection is the process of selecting a subset of the terms occurring in the training set and using only this subset as features in text classification.
5.1 Mutual information
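
For a term t and class c, the chapter scores a term's utility by the expected mutual information of the term-occurrence and class-membership indicator variables, computed from four document counts: N11 (docs in c containing t), N10 (docs containing t but not in c), N01 (docs in c without t), and N00 (docs with neither). Below is a minimal sketch of MI-based selection; the helper names and the counts-dictionary layout are illustrative assumptions, not from the text.

import math

def mutual_information(n11, n10, n01, n00):
    """I(U;C) from document counts for one (term, class) pair."""
    n = n11 + n10 + n01 + n00
    n1_, n0_ = n11 + n10, n01 + n00  # docs containing / not containing the term
    n_1, n_0 = n11 + n01, n10 + n00  # docs in / not in the class
    def part(n_ij, n_i, n_j):
        # By convention, a zero count contributes 0 to the sum.
        return 0.0 if n_ij == 0 else (n_ij / n) * math.log2(n * n_ij / (n_i * n_j))
    return (part(n11, n1_, n_1) + part(n01, n0_, n_1) +
            part(n10, n1_, n_0) + part(n00, n0_, n_0))

def select_features(term_counts, k):
    """term_counts: term -> (n11, n10, n01, n00) for a fixed class.
    Returns the k terms with the highest mutual information."""
    return sorted(term_counts,
                  key=lambda t: mutual_information(*term_counts[t]),
                  reverse=True)[:k]

High-MI terms are those whose presence or absence tells us the most about membership in the class; the classifier for that class is then trained and applied using only the selected terms.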