Naive Bayes Algorithm with a Python Implementation

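Concepts:

 Bayes' theorem: Bayesian theory is named after Thomas Bayes, an 18th-century theologian. In general, the probability of event A given that event B has occurred is not the same as the probability of event B given that event A has occurred. The two are nevertheless related in a definite way, and Bayes' theorem is the statement of that relationship.

  Naive Bayes: the naive Bayes method is a classification method based on Bayes' theorem and the assumption that the features are conditionally independent given the class. Given a training dataset, it first learns the joint probability distribution of inputs and outputs under that independence assumption; then, for a given input x, it uses Bayes' theorem to find the output y with the largest posterior probability (maximum a posteriori, MAP).

Put simply, given a dataset and a new, unclassified sample, the method finds the samples in the dataset whose features match those of the new sample, computes the probability of each class from them, and assigns the new sample to the class with the highest probability.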
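
The decision rule described above can be written compactly. The notation here is introduced only for illustration and is not from the original post: x_1, ..., x_n are the feature values of the new sample, y ranges over the classes, and \hat{y} is the predicted class. The denominator P(x_1, ..., x_n) is dropped because it is the same for every class, and the product form follows from the conditional independence assumption.

```latex
\hat{y} = \arg\max_{y} P(y \mid x_1, \dots, x_n)
        = \arg\max_{y} P(y) \prod_{i=1}^{n} P(x_i \mid y)
```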

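Formula:

P(h | d) = P(d | h) * P(h) / P(d)

Where:
P(h | d): the probability of hypothesis h given the data d, called the posterior probability
P(d | h): the probability of the data d given that hypothesis h is true
P(h): the probability that hypothesis h is true regardless of the data, called the prior probability of h
P(d): the probability of the data d, regardless of the hypothesis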
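
As a quick numeric check of the formula, the sketch below computes a posterior from a prior and two likelihoods. The function name `posterior` and all the numbers are hypothetical, chosen only for illustration; P(d) is obtained from the law of total probability over "h is true" and "h is false".

```python
def posterior(prior_h, likelihood_d_given_h, likelihood_d_given_not_h):
    """Return P(h | d) = P(d | h) * P(h) / P(d)."""
    # P(d) summed over the two cases: h true and h false
    evidence = (likelihood_d_given_h * prior_h
                + likelihood_d_given_not_h * (1.0 - prior_h))
    return likelihood_d_given_h * prior_h / evidence

# Hypothetical values: P(h) = 0.01, P(d | h) = 0.9, P(d | not h) = 0.05
p = posterior(prior_h=0.01, likelihood_d_given_h=0.9, likelihood_d_given_not_h=0.05)
print(round(p, 4))  # 0.1538
```

Even with a high likelihood P(d | h), the posterior stays modest here because the prior P(h) is small.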

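Implementation steps:

1 Handle data: load the data and split it into training data and test data
2 Summarize data: summarize the statistics of the training data so that probabilities can be computed and predictions made later
3 Predict: make predictions from the given test data and the summarized training data
4 Evaluate accuracy: use the test data to evaluate the accuracy of the predictions

Code: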

# Example of Naive Bayes implemented from Scratch in Python
import csv
import random
import math

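def loadCsv(filename):
        # Python 3: open in text mode; newline='' is what the csv module expects
        with open(filename, newline='') as f:
                # skip blank rows so float() never sees an empty record
                dataset = [row for row in csv.reader(f) if row]
        for i in range(len(dataset)):
                dataset[i] = [float(x) for x in dataset[i]]
        return dataset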

def splitDataset(dataset, splitRatio):
        trainSize = int(len(dataset) * splitRatio)
        trainSet = []
        copy = list(dataset)
        while len(trainSet) < trainSize:
                index = random.randrange(len(copy))
                trainSet.append(copy.pop(index))
        return [trainSet, copy]

def separateByClass(dataset):
        separated = {}
        for i in range(len(dataset)):
                vector = dataset[i]
                if (vector[-1] not in separated):
                        separated[vector[-1]] = []
                separated[vector[-1]].append(vector)
        return separated

def mean(numbers):
        return sum(numbers)/float(len(numbers))

def stdev(numbers):
        avg = mean(numbers)
        variance = sum([pow(x-avg,2) for x in numbers])/float(len(numbers)-1)
        return math.sqrt(variance)

def summarize(dataset):
        summaries = [(mean(attribute), stdev(attribute)) for attribute in zip(*dataset)]
        del summaries[-1]
        return summaries

def summarizeByClass(dataset):
        separated = separateByClass(dataset)
        summaries = {}
        for classValue, instances in separated.items():
                summaries[classValue] = summarize(instances)
        return summaries

def calculateProbability(x, mean, stdev):
        exponent = math.exp(-(math.pow(x-mean,2)/(2*math.pow(stdev,2))))
        return (1 / (math.sqrt(2*math.pi) * stdev)) * exponent

def calculateClassProbabilities(summaries, inputVector):
        probabilities = {}
        for classValue, classSummaries in summaries.items():
                probabilities[classValue] = 1
                for i in range(len(classSummaries)):
                        mean, stdev = classSummaries[i]
                        x = inputVector[i]
                        probabilities[classValue] *= calculateProbability(x, mean, stdev)
        return probabilities

def predict(summaries, inputVector):
        probabilities = calculateClassProbabilities(summaries, inputVector)
        bestLabel, bestProb = None, -1
        for classValue, probability in probabilities.items():
                if bestLabel is None or probability > bestProb:
                        bestProb = probability
                        bestLabel = classValue
        return bestLabel

def getPredictions(summaries, testSet):
        predictions = []
        for i in range(len(testSet)):
                result = predict(summaries, testSet[i])
                predictions.append(result)
        return predictions

def getAccuracy(testSet, predictions):
        correct = 0
        for i in range(len(testSet)):
                if testSet[i][-1] == predictions[i]:
                        correct += 1
        return (correct/float(len(testSet))) * 100.0

def main():
        filename = 'pima-indians-diabetes.data.csv'
        splitRatio = 0.67
        dataset = loadCsv(filename)
        trainingSet, testSet = splitDataset(dataset, splitRatio)
        print('Split {0} rows into train={1} and test={2} rows'.format(len(dataset), len(trainingSet), len(testSet)))
        # prepare model
        summaries = summarizeByClass(trainingSet)
        # test model
        predictions = getPredictions(summaries, testSet)
        accuracy = getAccuracy(testSet, predictions)
        print('Accuracy: {0}%'.format(accuracy))

if __name__ == '__main__':
        main()
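
The listing above multiplies raw Gaussian densities and does not weight them by the class prior P(h) from the formula. The sketch below is a variant, not the original author's code: it adds the log prior and sums log densities, which avoids numeric underflow when there are many features. The names `gaussian_log_pdf`, `fit`, `predict_log` and the `toy` data are hypothetical and exist only for this illustration.

```python
import math

def gaussian_log_pdf(x, mean, stdev):
    """Log of the Gaussian density N(x; mean, stdev^2)."""
    return (-math.log(stdev * math.sqrt(2 * math.pi))
            - (x - mean) ** 2 / (2 * stdev ** 2))

def fit(rows):
    """rows: feature values with the class label last.
    Returns {label: (log_prior, [(mean, stdev), ...])}."""
    by_class = {}
    for row in rows:
        by_class.setdefault(row[-1], []).append(row[:-1])
    model = {}
    for label, feats in by_class.items():
        stats = []
        for column in zip(*feats):
            m = sum(column) / len(column)
            # sample variance (N - 1), matching stdev() in the listing
            var = sum((v - m) ** 2 for v in column) / (len(column) - 1)
            stats.append((m, math.sqrt(var)))
        # class prior estimated as the class frequency in the training rows
        model[label] = (math.log(len(feats) / len(rows)), stats)
    return model

def predict_log(model, x):
    """Return the label with the largest log prior + summed log likelihood."""
    scores = {}
    for label, (log_prior, stats) in model.items():
        scores[label] = log_prior + sum(
            gaussian_log_pdf(v, m, s) for v, (m, s) in zip(x, stats))
    return max(scores, key=scores.get)

# Hypothetical toy data: two features, label 0 or 1 in the last column
toy = [
    [1.0, 2.1, 0], [1.2, 1.9, 0], [0.8, 2.0, 0],
    [3.0, 4.1, 1], [3.2, 3.9, 1], [2.8, 4.0, 1],
]
model = fit(toy)
label_a = predict_log(model, [1.1, 2.0])
label_b = predict_log(model, [3.1, 4.0])
print(label_a, label_b)  # 0 1
```

Like the listing above, this sketch assumes every feature has a non-zero standard deviation within each class; a constant feature would cause a division by zero.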



pima-indians-diabetes.data.csv can be downloaded from:

https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv 

 

References:

1 https://en.wikipedia.org/wiki/Naive_Bayes_classifier

2 https://machinelearningmastery.com/naive-bayes-classifier-scratch-python/

3 https://machinelearningmastery.com/naive-bayes-for-machine-learning/



Author: 虚生
Source: https://www.cnblogs.com/dylancao/

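Bayesian classification is a family of statistical classification methods that use probability and statistics to assign classes. In many settings, the naive Bayes (Naïve Bayes, NB) classifier is comparable to decision tree and neural network classifiers; it can be applied to large databases, and it is simple, fast, and often accurate.
Because naive Bayes assumes that the effect of an attribute value on a given class is independent of the values of the other attributes, and this assumption often does not hold in practice, its classification accuracy may suffer. This has led to many Bayesian classification algorithms that relax the independence assumption, such as TAN (tree augmented Bayes network).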