show the code of Naive Bayes

Bayes' rule turns a probability we cannot compute directly into probabilities that are easy to compute, and then combines them. A few assumptions are required along the way, and even when those assumptions do not strictly hold, the classifier usually still performs well.

  • discrete data

  • continuous data

Naive Bayes - discrete data (a spam-filter style classifier)

This example is a post-classification problem: given the words of a comment, decide whether the post is abusive. The same approach, scaled up, gives a spam filter.

There are two strategies for counting words: (1) record only whether a word appears (the set-of-words model), or (2) record how many times it appears (the bag-of-words model).

  • From the texts, build a list of all words that occur (the vocabulary).

  • For each text, record whether each vocabulary word appears: 1 = present, 0 = absent.

  • Compute the frequency of every word, where w denotes the word vector and ci the post class.

  • Training the model means computing, for each class, the frequency of every word.

  • Testing the model means taking the word vector of a new input and, using the per-class word frequencies computed during training, returning the class with the highest probability.

get data

In [1]:
import re
def loadDataSet():
    postingList=[
        'my dog has flea problems help please',
        'mybe not take him to dog park stupid',
        'my dalmation is so cute I love him',
        'stop posting stupid worthless garbage',
        'mr licks ate my steak how to stop him',
        'quit buying worthless dog food stupid'
    ]
    classVec=[0,1,0,1,0,1] # abusive post? 1 = yes, 0 = no
    postingList2=[]
    for s in postingList:
        postingList2.append( re.split(' ',s) )
    return postingList2, classVec
# test
loadDataSet()
Out[1]:
([['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
  ['mybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
  ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
  ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
  ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
  ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']],
 [0, 1, 0, 1, 0, 1])

get word vector

In [2]:
def createVocabList(dataSet):
    vocabSet=set()
    for words in dataSet:
        vocabSet=vocabSet | set(words) # union of the two sets
    return list(vocabSet) # set to list

def setOfWords2Vec(vocabList, inputSet):
    rsVect=[0]*len(vocabList)
    for word in inputSet:
        if word in vocabList:
            rsVect[vocabList.index(word)]=1
        else:
            print('the word: %s is not in my Vocabulary!' % word) # should not be reached for the training posts
    return rsVect

# test
posts,tags=loadDataSet()
vocabList=createVocabList(posts)
print(vocabList)
['how', 'park', 'ate', 'buying', 'problems', 'dalmation', 'is', 'mybe', 'dog', 'not', 'him', 'posting', 'stop', 'flea', 'worthless', 'take', 'please', 'to', 'quit', 'my', 'so', 'cute', 'I', 'garbage', 'food', 'steak', 'licks', 'stupid', 'love', 'mr', 'has', 'help']
In [3]:
print( setOfWords2Vec(vocabList, posts[0]) )
print( setOfWords2Vec(vocabList, posts[2]) )
[0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
[0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0]
In [4]:
# get word vector matrix
train_X=[]
for post in posts:
    train_X.append(setOfWords2Vec(vocabList, post))
print(train_X)
[[0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1], [0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0], [0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0], [1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0], [0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0]]

get prob from word vector

1. From the word vectors we know whether each word appears in a document, and we know each document's class.

  • p(ci|w) = p(w|ci)*p(ci) / p(w)
  • So classifying an input sentence, i.e. computing p(ci|w), reduces to computing
  • the class prior p(ci) and the within-class word frequencies p(w|ci).

2. Computing p(w|ci) = p(w0,w1,...,wn|ci) means counting how often every word occurs within that class.

  • Here the naive Bayes assumption comes in: the features are conditionally independent, so
  • p(w|ci) = p(w0,w1,...,wn|ci) = p(w0|ci)p(w1|ci)...p(wn|ci)
  • which greatly simplifies the computation.
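A quick worked example on the toy posts above: 3 of the 6 posts are labeled abusive, so p(c1) = 3/6 = 0.5; the three abusive posts contain 19 word occurrences in total, of which 'stupid' accounts for 3, so p('stupid'|c1) = 3/19 ≈ 0.158. These are exactly the values that trainNB0 below reports.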
In [5]:
import numpy as np
def trainNB0(train_X, train_Y):
    numTrainDocs=len(train_X)
    # vocabulary size (number of distinct words)
    numWords=len(train_X[0])
    # prior probability of class 1 (abusive)
    pAbusive=sum(train_Y)/float(numTrainDocs)
    
    p0Num=np.zeros(numWords); p1Num=np.zeros(numWords);
    p0Denom=0.0; p1Denom=0.0;
    for i in range(numTrainDocs):
        if train_Y[i]==1:
            p1Num += train_X[i] # per-word counts within this class
            p1Denom += sum(train_X[i]) # total word count within this class
        else:
            p0Num += train_X[i]
            p0Denom += sum(train_X[i])
    p1Vect=p1Num/p1Denom # counts to frequencies
    p0Vect=p0Num/p0Denom
    return p0Vect, p1Vect, pAbusive
# test
pV0,pV1,pAb=trainNB0(train_X, train_Y=tags)
In [6]:
print(pV0)
print(pV1)
print(pAb)
[0.04166667 0.         0.04166667 0.         0.04166667 0.04166667
 0.04166667 0.         0.04166667 0.         0.08333333 0.
 0.04166667 0.04166667 0.         0.         0.04166667 0.04166667
 0.         0.125      0.04166667 0.04166667 0.04166667 0.
 0.         0.04166667 0.04166667 0.         0.04166667 0.04166667
 0.04166667 0.04166667]
[0.         0.05263158 0.         0.05263158 0.         0.
 0.         0.05263158 0.10526316 0.05263158 0.05263158 0.05263158
 0.05263158 0.         0.10526316 0.05263158 0.         0.05263158
 0.05263158 0.         0.         0.         0.         0.05263158
 0.05263158 0.         0.         0.15789474 0.         0.
 0.         0.        ]
0.5

check

In [7]:
# 'cute' appears once in class 0 and never in class 1
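# a quick sanity check against the frequency vectors printed above:
idx=vocabList.index('cute')
print(pV0[idx], pV1[idx]) # expect 1/24 ≈ 0.0417 for class 0 and 0.0 for class 1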

adjust the classifier for practical use

  • No per-word probability may be zero, otherwise the whole product p(w0|ci)p(w1|ci)... becomes 0; the code below therefore initializes the word counts to 1 and the denominators to 2 (Laplace smoothing).

  • Also, ln(x) is strictly increasing for x > 0, so comparing ln(f(x)) gives the same ranking as comparing f(x); we therefore replace the product of probabilities with a sum of logs, which also avoids numerical underflow.
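A minimal illustration of the underflow problem that the log-sum avoids (a made-up example, not part of the post data):

import numpy as np
probs = np.full(1000, 0.01)   # 1000 small probabilities
print(np.prod(probs))         # 0.0, the product underflows
print(np.sum(np.log(probs)))  # about -4605.2, the log-sum stays well-behaved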

In [8]:
# train the model: add Laplace smoothing and work with log probabilities

import numpy as np
def trainNB1(train_X, train_Y):
    numTrainDocs=len(train_X)
    # vocabulary size
    numWords=len(train_X[0])
    # prior probability of class 1 (abusive)
    pAbusive=sum(train_Y)/float(numTrainDocs)
    
    p0Num=np.ones(numWords); p1Num=np.ones(numWords);
    p0Denom=2.0; p1Denom=2.0;
    for i in range(numTrainDocs):
        if train_Y[i]==1:
            p1Num += train_X[i] # per-word counts within this class
            p1Denom += sum(train_X[i]) # total word count within this class
        else:
            p0Num += train_X[i]
            p0Denom += sum(train_X[i])
    p1Vect=np.log(p1Num/p1Denom) # counts to log frequencies
    p0Vect=np.log(p0Num/p0Denom)
    return p0Vect, p1Vect, pAbusive
# test
pV0,pV1,pAb=trainNB1(train_X, train_Y=tags)
print(pV0)
print(pV1)
print(pAb)
[-2.56494936 -3.25809654 -2.56494936 -3.25809654 -2.56494936 -2.56494936
 -2.56494936 -3.25809654 -2.56494936 -3.25809654 -2.15948425 -3.25809654
 -2.56494936 -2.56494936 -3.25809654 -3.25809654 -2.56494936 -2.56494936
 -3.25809654 -1.87180218 -2.56494936 -2.56494936 -2.56494936 -3.25809654
 -3.25809654 -2.56494936 -2.56494936 -3.25809654 -2.56494936 -2.56494936
 -2.56494936 -2.56494936]
[-3.04452244 -2.35137526 -3.04452244 -2.35137526 -3.04452244 -3.04452244
 -3.04452244 -2.35137526 -1.94591015 -2.35137526 -2.35137526 -2.35137526
 -2.35137526 -3.04452244 -1.94591015 -2.35137526 -3.04452244 -2.35137526
 -2.35137526 -3.04452244 -3.04452244 -3.04452244 -3.04452244 -2.35137526
 -2.35137526 -3.04452244 -3.04452244 -1.65822808 -3.04452244 -3.04452244
 -3.04452244 -3.04452244]
0.5

test

In [9]:
def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    p1=sum(vec2Classify*p1Vec) + np.log(pClass1)
    p0=sum(vec2Classify*p0Vec) + np.log(1.0-pClass1)
    print(p0, p1)
    if p1>p0:
        return 1
    else:
        return 0
# test
In [10]:
entryPost=['love', 'my', 'dalmation']
entryVect=np.array(setOfWords2Vec(vocabList, entryPost))
print( entryVect )
classifyNB(entryVect, pV0, pV1, pAb)
[0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0]
-7.694848072384611 -9.826714493730215
Out[10]:
0
In [11]:
entryPost=['stupid', 'garbage']
entryVect=np.array(setOfWords2Vec(vocabList, entryPost))
print( entryVect )
classifyNB(entryVect, pV0, pV1, pAb)
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0]
-7.20934025660291 -4.702750514326955
Out[11]:
1

set-of-words model (whether a word appears) vs. bag-of-words model (how many times it appears)

In [12]:
def bagOfWords2Vec(vocabList, inputSet):
    rsVec=[0]*len(vocabList)
    for word in inputSet:
        if word in vocabList:
            rsVec[vocabList.index(word)] += 1
    return rsVec
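The difference between the two encodings shows up as soon as a word repeats. A quick check with a made-up sentence, using the two functions defined above:

sample=['stupid', 'stupid', 'dog']
print(setOfWords2Vec(vocabList, sample)) # the 'stupid' position stays 1
print(bagOfWords2Vec(vocabList, sample)) # the 'stupid' position becomes 2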
In [13]:
# train the model (bag-of-words run; same as trainNB1)
import numpy as np
def trainNB2(train_X, train_Y):
    numTrainDocs=len(train_X)
    # vocabulary size
    numWords=len(train_X[0])
    # prior probability of class 1 (abusive)
    pAbusive=sum(train_Y)/float(numTrainDocs)
    
    p0Num=np.ones(numWords); p1Num=np.ones(numWords);
    p0Denom=2.0; p1Denom=2.0;
    for i in range(numTrainDocs):
        if train_Y[i]==1:
            p1Num += train_X[i] # per-word counts within this class
            p1Denom += sum(train_X[i]) # total word count within this class
        else:
            p0Num += train_X[i]
            p0Denom += sum(train_X[i])
    p1Vect=np.log(p1Num/p1Denom) # counts to log frequencies
    p0Vect=np.log(p0Num/p0Denom)
    return p0Vect, p1Vect, pAbusive
# test

# get word vector matrix
train_X2=[]
for post in posts:
    train_X2.append(bagOfWords2Vec(vocabList, post))
print(train_X2)


pV0_2,pV1_2,pAb_2=trainNB2(train_X2, train_Y=tags)
print(pV0_2)
print(pV1_2)
print(pAb_2)
[[0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1], [0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0], [0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0], [1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0], [0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0]]
[-2.56494936 -3.25809654 -2.56494936 -3.25809654 -2.56494936 -2.56494936
 -2.56494936 -3.25809654 -2.56494936 -3.25809654 -2.15948425 -3.25809654
 -2.56494936 -2.56494936 -3.25809654 -3.25809654 -2.56494936 -2.56494936
 -3.25809654 -1.87180218 -2.56494936 -2.56494936 -2.56494936 -3.25809654
 -3.25809654 -2.56494936 -2.56494936 -3.25809654 -2.56494936 -2.56494936
 -2.56494936 -2.56494936]
[-3.04452244 -2.35137526 -3.04452244 -2.35137526 -3.04452244 -3.04452244
 -3.04452244 -2.35137526 -1.94591015 -2.35137526 -2.35137526 -2.35137526
 -2.35137526 -3.04452244 -1.94591015 -2.35137526 -3.04452244 -2.35137526
 -2.35137526 -3.04452244 -3.04452244 -3.04452244 -3.04452244 -2.35137526
 -2.35137526 -3.04452244 -3.04452244 -1.65822808 -3.04452244 -3.04452244
 -3.04452244 -3.04452244]
0.5
In [14]:
entryPost=['love', 'my', 'dalmation']
entryVect=np.array(bagOfWords2Vec(vocabList, entryPost))
print( entryVect )
classifyNB(entryVect, pV0_2, pV1_2, pAb_2)
[0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0]
-7.694848072384611 -9.826714493730215
Out[14]:
0
In [15]:
entryPost=['stupid', 'garbage']
entryVect=np.array(bagOfWords2Vec(vocabList, entryPost))
print( entryVect )
classifyNB(entryVect, pV0_2, pV1_2, pAb_2)
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0]
-7.20934025660291 -4.702750514326955
Out[15]:
1

Next questions: how do we handle more than two classes (n > 2)?

And how do we handle continuous features?

   Gender   Height (ft)   Weight (lbs)   Foot size (in)
   male     6             180            12
   male     5.92          190            11
   male     5.58          170            12
   male     5.92          165            10
   female   5             100            6
   female   5.5           150            8
   female   5.42          130            7
   female   5.75          150            9

Given a person who is 6 ft tall, weighs 130 lbs, and has an 8 in foot, is this person male or female?
P(ci|w) = P(w|ci)*P(ci) / P(w); we can ignore P(w), because it is the same constant for every class.
P(ci) is easy to compute.
By the naive Bayes assumption the features are conditionally independent: p(w|ci) = p(w0|ci)p(w1|ci)p(w2|ci)...p(wn|ci).


Compute the likelihoods from a probability density function

Assume every feature is normally distributed, X ~ N(mu, sigma^2); then for each class and each feature we only need to estimate these two parameters (mean and variance).
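The class-conditional likelihood of a feature value x is then read from the normal density, which is exactly what the snippets below compute:

p(x|ci) = 1/sqrt(2*pi*sigma^2) * exp( -(x-mu)^2 / (2*sigma^2) )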

# R
heights=c(6,5.92,5.58,5.92)
mean2=mean(heights); mean2 # 5.855
var2=var(heights); var2 # 0.03503333
1/sqrt(2*3.1415*0.035)*exp(-(6-5.855)^2/(2*0.035))

## js
1/Math.sqrt(2*3.1415*0.035)*Math.exp(-((6-5.855)**2)/(2*0.035)) // 1.579206773964085
In [16]:
1/np.sqrt(2*3.1415*0.035)*np.exp(-((6-5.855)**2)/(2*0.035))
Out[16]:
1.579206773964085

load data

In [17]:
def loadDataSet2():
    dataSet=[ # columns: height (ft), weight (lbs), foot size (in)
        [6,180,12],
        [5.92,190,11],
        [5.58,170,12],
        [5.92,165,10],
        [5,100,6],
        [5.5,150,8],
        [5.42,130,7],
        [5.75,150,9]
    ]
    tags=[1,1,1,1, 0,0,0,0] # tags: 1 = male, 0 = female
    return dataSet, tags
dataSet, tags=loadDataSet2()
print(dataSet)
print(tags)
[[6, 180, 12], [5.92, 190, 11], [5.58, 170, 12], [5.92, 165, 10], [5, 100, 6], [5.5, 150, 8], [5.42, 130, 7], [5.75, 150, 9]]
[1, 1, 1, 1, 0, 0, 0, 0]

get paras: trainBayes()

In [18]:
import numpy as np
def trainBayes(dataSet, tags):
    paras={}
    uniqTag=set(tags)
    print('uniqTag=',uniqTag)
    for tag in uniqTag:
        paras[tag]=[]
        dataOfThisTag=[]
        for i in range(len(tags)):
            if tags[i]==tag:
                dataOfThisTag.append(dataSet[i])
        dataOfThisTag=np.array(dataOfThisTag)
        # calc per-column mean and sample variance (ddof=1)
        paras[tag].append( dataOfThisTag.mean(axis=0))
        paras[tag].append( dataOfThisTag.var(axis=0, ddof=1))
        paras[tag].append( dataOfThisTag.shape[0] )
    return paras
# test
paras=trainBayes(dataSet, tags)
paras
# class 0 (female): per-column means, per-column variances, sample count
# class 1 (male): per-column means, per-column variances, sample count
uniqTag= {0, 1}
Out[18]:
{0: [array([  5.4175, 132.5   ,   7.5   ]),
  array([9.72250000e-02, 5.58333333e+02, 1.66666667e+00]),
  4],
 1: [array([  5.855, 176.25 ,  11.25 ]),
  array([3.50333333e-02, 1.22916667e+02, 9.16666667e-01]),
  4]}

calc prob for a new entry

P(height=6|male) x P(weight=130|male) x P(foot=8|male) x P(male)
    = 6.1984 x 10^-9

P(height=6|female) x P(weight=130|female) x P(foot=8|female) x P(female)
    = 5.3778 x 10^-4

The probability for female is nearly 10,000 times higher than for male, so we classify this person as female.
In [19]:
def predictBayes(newEntry, paras):
    newEntry=np.array(newEntry)
    uniqTag=set(paras.keys())
    rsDict={}
    totalbyGroup={} # sample count per class, used for the class prior
    for tag in uniqTag:
        rsDict[tag]=[]
        totalbyGroup[tag]=paras[tag][-1]
        # print(tag, paras[tag])
        for col in range(len(paras[tag][0])):
            mean1=paras[tag][0][col]
            var1=paras[tag][1][col]
            
            X=newEntry[col]
            p=1/np.sqrt(2*3.1415*var1)*np.exp(-((X-mean1)**2)/(2*var1))
            rsDict[tag].append(p)
    #
    #print(rsDict)
    # compute the (unnormalized) log posterior for each class
    postProbs={}
    for tag in uniqTag:
        p0=totalbyGroup[tag]/sum(totalbyGroup.values())
        p= np.sum(np.log( np.array( rsDict[tag] ) ) )+ np.log(np.array(p0)) # log space: turn the product into a sum
        print(tag, p)
        # keep the class with the largest log posterior and return it
        if 'max' not in postProbs:
            postProbs['max']=p
            postProbs['tag']=tag
        elif p>postProbs['max']:
            postProbs['max']=p
            postProbs['tag']=tag
    return postProbs['tag']
# test
predictBayes([6, 130, 8],paras)
0 -7.527996461433017
1 -18.89914470021889
Out[19]:
0
In [20]:
a0=np.exp(-7.527996461433017) # female (class 0)
a1=np.exp(-18.89914470021889) # male (class 1)
print(a0, a1, a0/a1)
0.0005378147104813586 6.197346005195689e-09 86781.4561314584

validation

In [21]:
# the data set is too small, so we just validate on the original (training) data
In [22]:
i=-1
for item in dataSet:
    i=i+1
    pred=predictBayes(item,paras)
    print('>>>>>>>>>>',item, '; Pred=',pred, '; Actual=', tags[i], '; ', pred==tags[i])
    #
0 -15.542921834567345
1 -4.800531449063267
>>>>>>>>>> [6, 180, 12] ; Pred= 1 ; Actual= 1 ;  True
0 -13.636833095951348
1 -4.9998969370642765
>>>>>>>>>> [5.92, 190, 11] ; Pred= 1 ; Actual= 1 ;  True
0 -13.172573780545862
1 -5.681484213984493
>>>>>>>>>> [5.58, 170, 12] ; Pred= 1 ; Actual= 1 ;  True
0 -9.82190772281702
1 -5.563841467110501
>>>>>>>>>> [5.92, 165, 10] ; Pred= 1 ; Actual= 1 ;  True
0 -8.219747784529547
1 -53.254230985347604
>>>>>>>>>> [5, 100, 6] ; Pred= 0 ; Actual= 0 ;  True
0 -6.086702033597909
1 -14.499412403294258
>>>>>>>>>> [5.5, 150, 8] ; Pred= 0 ; Actual= 0 ;  True
0 -5.783074887763696
1 -25.390624675999543
>>>>>>>>>> [5.42, 130, 7] ; Pred= 0 ; Actual= 0 ;  True
0 -7.220258217706933
1 -9.858118397585406
>>>>>>>>>> [5.75, 150, 9] ; Pred= 0 ; Actual= 0 ;  True

test on iris

In [23]:
import os
os.getcwd()
Out[23]:
'G:\\ML_MachineLearning\\NB'
In [24]:
import pandas as pd
def loadDataSet3():
    return pd.read_csv('../iris_data/iris.csv', index_col=0)
iris=loadDataSet3()
iris.head()
Out[24]:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
In [44]:
import numpy as np
def splitData(df, test_ratio):
    # indices range over [0, n); randomly pick x of them without replacement
    n=df.shape[0]
    x=round(n*test_ratio)
    index = np.random.choice(np.arange(n), size=x, replace=False)
    #
    test_index = np.array(index)
    train_index = np.delete(np.arange(n), test_index)
    return df.iloc[train_index,],df.iloc[test_index,]
np.random.seed(1)
train_set, test_set=splitData(iris, 0.2)
print(train_set.shape)
print(test_set.shape)
(120, 5)
(30, 5)
In [50]:
def np2vec(npArray):
    # convert a DataFrame into a list of feature rows and a list of species labels
    arrayVec=[]
    for i in range(npArray.shape[0]):
        arrayVec.append([npArray.iloc[i,0], npArray.iloc[i,1],npArray.iloc[i,2],npArray.iloc[i,3]])
    #
    tags2=[]
    for item in npArray['Species']:
        tags2.append(item)
    return arrayVec,tags2
#test
trainX, trainY=np2vec(train_set)
testX, testY=np2vec(test_set)
In [51]:
paras=trainBayes(trainX, tags=trainY)
paras
uniqTag= {'versicolor', 'virginica', 'setosa'}
Out[51]:
{'setosa': [array([4.96153846, 3.36666667, 1.46666667, 0.23333333]),
  array([0.11032389, 0.13385965, 0.02596491, 0.01122807]),
  39],
 'versicolor': [array([5.94594595, 2.73243243, 4.22972973, 1.30540541]),
  array([0.28255255, 0.10947447, 0.21936937, 0.04052553]),
  37],
 'virginica': [array([6.525     , 2.95227273, 5.53409091, 2.02045455]),
  array([0.39540698, 0.09418076, 0.31020613, 0.08073467]),
  44]}
In [52]:
predictBayes([4.9, 3.1, 1.5, 0.1],paras)
versicolor -38.21661432381203
setosa 0.28235339339129295
virginica -53.71715459357284
Out[52]:
'setosa'
In [59]:
i=-1
rsArr=[]
j=0
for item in testX:
    i=i+1
    pred=predictBayes(item,paras)
    rs=(pred==testY[i])
    rsArr.append(rs)
    if rs==True:
        j+=1
    print('>>>>>>>>>>',i,item, '; Pred=',pred, '; Actual=', testY[i], '; ', rs, '\n')
print(i+1, j, round(j/(i+1)*100, 2), '%') # i+1 = number of test samples
versicolor -44.12681589706359
setosa -4.725057243408674
virginica -58.4842691139446
>>>>>>>>>> 0 [5.8, 4.0, 1.2, 0.2] ; Pred= setosa ; Actual= setosa ;  True 

versicolor -6.23331524726756
setosa -80.23697354330952
virginica -20.44157443119337
>>>>>>>>>> 1 [5.1, 2.5, 3.0, 1.1] ; Pred= versicolor ; Actual= versicolor ;  True 

versicolor -2.01334834711614
setosa -237.59636387441716
virginica -5.666667398840877
>>>>>>>>>> 2 [6.6, 3.0, 4.4, 1.4] ; Pred= versicolor ; Actual= versicolor ;  True 

versicolor -37.18421270508493
setosa -2.3274555821143568
virginica -52.71759580830226
>>>>>>>>>> 3 [5.4, 3.9, 1.3, 0.4] ; Pred= setosa ; Actual= setosa ;  True 

versicolor -29.40310036990041
setosa -646.109605523668
virginica -8.607352700945972
>>>>>>>>>> 4 [7.9, 3.8, 6.4, 2.0] ; Pred= virginica ; Actual= virginica ;  True 

versicolor -4.0208001079355515
setosa -291.250757714522
virginica -4.112394673839194
>>>>>>>>>> 5 [6.3, 3.3, 4.7, 1.6] ; Pred= versicolor ; Actual= versicolor ;  True 

versicolor -16.911779620304813
setosa -460.325929832495
virginica -2.2716282172237214
>>>>>>>>>> 6 [6.9, 3.1, 5.1, 2.3] ; Pred= virginica ; Actual= virginica ;  True 

versicolor -29.70967746466003
setosa -4.26305828835927
virginica -45.1224477481749
>>>>>>>>>> 7 [5.1, 3.8, 1.9, 0.4] ; Pred= setosa ; Actual= setosa ;  True 

versicolor -35.33656523332123
setosa 0.572611985453209
virginica -51.19848902773319
>>>>>>>>>> 8 [4.7, 3.2, 1.6, 0.2] ; Pred= setosa ; Actual= setosa ;  True 

versicolor -20.494030134778374
setosa -551.055960649682
virginica -2.222214324223158
>>>>>>>>>> 9 [6.9, 3.2, 5.7, 2.3] ; Pred= virginica ; Actual= virginica ;  True 

versicolor -0.9718411960002352
setosa -196.66539022178353
virginica -8.69338570847679
>>>>>>>>>> 10 [5.6, 2.7, 4.2, 1.3] ; Pred= versicolor ; Actual= versicolor ;  True 

versicolor -32.20679997407876
setosa -2.840969095627875
virginica -47.515776108518644
>>>>>>>>>> 11 [5.4, 3.9, 1.7, 0.4] ; Pred= setosa ; Actual= setosa ;  True 

versicolor -17.585265145646748
setosa -553.4969291329096
virginica -1.875424229112653
>>>>>>>>>> 12 [7.1, 3.0, 5.9, 2.1] ; Pred= virginica ; Actual= virginica ;  True 

versicolor -2.7498929609395093
setosa -256.7347481527813
virginica -4.936968939002526
>>>>>>>>>> 13 [6.4, 3.2, 4.5, 1.5] ; Pred= versicolor ; Actual= versicolor ;  True 

versicolor -1.51998631771603
setosa -252.95417148174593
virginica -4.954447188880095
>>>>>>>>>> 14 [6.0, 2.9, 4.5, 1.5] ; Pred= versicolor ; Actual= versicolor ;  True 

versicolor -40.62014736000626
setosa -0.7390381509216919
virginica -56.6466648556547
>>>>>>>>>> 15 [4.4, 3.2, 1.3, 0.2] ; Pred= setosa ; Actual= setosa ;  True 

versicolor -1.128043327160095
setosa -169.2010211660469
virginica -10.475953236298665
>>>>>>>>>> 16 [5.8, 2.6, 4.0, 1.2] ; Pred= versicolor ; Actual= versicolor ;  True 

versicolor -1.9253351799537741
setosa -249.60290018423612
virginica -5.685455590035668
>>>>>>>>>> 17 [5.6, 3.0, 4.5, 1.5] ; Pred= versicolor ; Actual= versicolor ;  True 

versicolor -30.413552751053114
setosa -0.7556196570674242
virginica -46.34802054571772
>>>>>>>>>> 18 [5.4, 3.4, 1.5, 0.4] ; Pred= setosa ; Actual= setosa ;  True 

versicolor -39.332698888393494
setosa -0.15111228928023512
virginica -55.258412720860115
>>>>>>>>>> 19 [5.0, 3.2, 1.2, 0.2] ; Pred= setosa ; Actual= setosa ;  True 

versicolor -1.3880543088253825
setosa -209.43701918241192
virginica -9.419562441919206
>>>>>>>>>> 20 [5.5, 2.6, 4.4, 1.2] ; Pred= versicolor ; Actual= versicolor ;  True 

versicolor -2.2409904036757404
setosa -248.6267533952453
virginica -6.203908758687335
>>>>>>>>>> 21 [5.4, 3.0, 4.5, 1.5] ; Pred= versicolor ; Actual= versicolor ;  True 

versicolor -5.359437586202645
setosa -349.0227263062297
virginica -2.33683306650311
>>>>>>>>>> 22 [6.7, 3.0, 5.0, 1.7] ; Pred= virginica ; Actual= versicolor ;  False 

versicolor -37.06244522290436
setosa 0.5722622258804022
virginica -52.951342710986815
>>>>>>>>>> 23 [5.0, 3.5, 1.3, 0.3] ; Pred= setosa ; Actual= setosa ;  True 

versicolor -14.695334112175255
setosa -526.4818759887595
virginica -2.743074702763753
>>>>>>>>>> 24 [7.2, 3.2, 6.0, 1.8] ; Pred= virginica ; Actual= virginica ;  True 

versicolor -0.9194932042391687
setosa -186.4942859392572
virginica -8.703508224656774
>>>>>>>>>> 25 [5.7, 2.8, 4.1, 1.3] ; Pred= versicolor ; Actual= versicolor ;  True 

versicolor -44.268411809321776
setosa -2.6648383351163787
virginica -58.85557175039779
>>>>>>>>>> 26 [5.5, 4.2, 1.4, 0.2] ; Pred= setosa ; Actual= setosa ;  True 

versicolor -36.67994830038669
setosa 0.3705988062353285
virginica -52.135441518064816
>>>>>>>>>> 27 [5.1, 3.8, 1.5, 0.3] ; Pred= setosa ; Actual= setosa ;  True 

versicolor -1.4568748191287013
setosa -248.62595103038046
virginica -6.831996600019302
>>>>>>>>>> 28 [6.1, 2.8, 4.7, 1.2] ; Pred= versicolor ; Actual= versicolor ;  True 

versicolor -6.935750313052443
setosa -373.65440225642624
virginica -2.889856103657743
>>>>>>>>>> 29 [6.3, 2.5, 5.0, 1.9] ; Pred= virginica ; Actual= virginica ;  True 

30 29 96.67 %

29 of the 30 test samples are classified correctly (one versicolor is predicted as virginica), an accuracy of about 96.7%.