Text classification and Naive Bayes
1. The text classification problem
TRAINMULTINOMIALNB(C, D)
  V ← EXTRACTVOCABULARY(D)
  N ← COUNTDOCS(D)
  for each c ∈ C
  do N_c ← COUNTDOCSINCLASS(D, c)
     prior[c] ← N_c / N
     text_c ← CONCATENATETEXTOFALLDOCSINCLASS(D, c)
     for each t ∈ V
     do T_ct ← COUNTTOKENSOFTERM(text_c, t)
     for each t ∈ V
     do condprob[t][c] ← (T_ct + 1) / Σ_t′ (T_ct′ + 1)
  return V, prior, condprob

APPLYMULTINOMIALNB(C, V, prior, condprob, d)
  W ← EXTRACTTOKENSFROMDOC(V, d)
  for each c ∈ C
  do score[c] ← log prior[c]
     for each t ∈ W
     do score[c] += log condprob[t][c]
  return argmax_{c∈C} score[c]
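
To make the pseudocode concrete, here is a minimal Python sketch of both procedures (the function names, data layout, and toy data are my own illustration, not code from the chapter). Training uses add-one (Laplace) smoothing exactly as in the condprob line above, and application sums log probabilities so that the product of many small probabilities does not underflow.

from collections import Counter, defaultdict
import math

def train_multinomial_nb(classes, docs):
    """docs: list of (class_label, token_list) pairs."""
    vocab = {t for _, tokens in docs for t in tokens}
    n_docs = len(docs)
    log_prior, log_condprob = {}, defaultdict(dict)
    for c in classes:
        class_docs = [tokens for label, tokens in docs if label == c]
        log_prior[c] = math.log(len(class_docs) / n_docs)
        # Concatenate the text of all docs in class c and count term tokens.
        t_ct = Counter(t for tokens in class_docs for t in tokens)
        # Add-one smoothing: denominator is the sum over t' of (T_ct' + 1).
        denom = sum(t_ct.values()) + len(vocab)
        for t in vocab:
            log_condprob[t][c] = math.log((t_ct[t] + 1) / denom)
    return vocab, log_prior, log_condprob

def apply_multinomial_nb(classes, vocab, log_prior, log_condprob, tokens):
    scores = {}
    for c in classes:
        scores[c] = log_prior[c]
        for t in tokens:
            if t in vocab:  # out-of-vocabulary terms are ignored
                scores[c] += log_condprob[t][c]
    return max(scores, key=scores.get)

# Toy data in the style of the book's China example:
docs = [("china", "chinese beijing chinese".split()),
        ("china", "chinese chinese shanghai".split()),
        ("china", "chinese macao".split()),
        ("japan", "tokyo japan chinese".split())]
model = train_multinomial_nb(["china", "japan"], docs)
print(apply_multinomial_nb(["china", "japan"], *model,
                           "chinese chinese chinese tokyo japan".split()))  # -> china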
2. Relation to multinomial unigram language model
3. The Bernoulli model
4. Properties of Naive Bayes
4.1 A variant of the multinomial model
5. Feature selection
Feature selection is the process of selecting a subset of the terms occurring in the training set and using only this subset as features in text classification.
5.1 Mutual information
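
For a term t and class c, the chapter scores a term's utility by the expected mutual information of the term-occurrence and class-membership indicator variables, computed from four document counts: N11 (docs in c containing t), N10 (docs containing t but not in c), N01 (docs in c without t), and N00 (docs with neither). Below is a minimal sketch of MI-based selection; the helper names and the counts-dictionary layout are illustrative assumptions, not from the text.

import math

def mutual_information(n11, n10, n01, n00):
    """I(U;C) from document counts for one (term, class) pair."""
    n = n11 + n10 + n01 + n00
    n1_, n0_ = n11 + n10, n01 + n00  # docs containing / not containing the term
    n_1, n_0 = n11 + n01, n10 + n00  # docs in / not in the class
    def part(n_ij, n_i, n_j):
        # By convention, a zero count contributes 0 to the sum.
        return 0.0 if n_ij == 0 else (n_ij / n) * math.log2(n * n_ij / (n_i * n_j))
    return (part(n11, n1_, n_1) + part(n01, n0_, n_1) +
            part(n10, n1_, n_0) + part(n00, n0_, n_0))

def select_features(term_counts, k):
    """term_counts: term -> (n11, n10, n01, n00) for a fixed class.
    Returns the k terms with the highest mutual information."""
    return sorted(term_counts,
                  key=lambda t: mutual_information(*term_counts[t]),
                  reverse=True)[:k]

High-MI terms are those whose presence or absence tells us the most about membership in the class; the classifier for that class is then trained and applied using only the selected terms.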