Python
20230415_Breast Cancer Prediction with Logistic Regression
Song Wei
April 15, 2023, 21:32
Scikit-learn:
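Scikit-learn (sklearn for short) is a Python library for machine learning. It includes algorithms for classification, regression, and clustering, and provides tools for model selection, data preprocessing, and feature extraction. Two features stand out:
Data preprocessing and feature extraction: Scikit-learn provides tools such as standardization, normalization, feature selection, and feature extraction. Preprocessing the data with these tools can improve a model's accuracy and efficiency.
A rich library of machine learning algorithms: Scikit-learn covers classification, regression, clustering, dimensionality reduction, and more, which is enough for a wide range of data analysis and modeling tasks. It includes classic algorithms such as support vector machines, decision trees, random forests, and k-nearest neighbors.
Logistic regression is a classic binary classification model and can be used to predict breast cancer.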
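As an illustration of the preprocessing tools mentioned above, here is a minimal sketch of standardization with StandardScaler. The small array X is made-up toy data, not taken from the breast cancer dataset; it only shows that each column ends up with zero mean and unit variance regardless of its original scale.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical): two columns on very different scales
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

scaler = StandardScaler()
# fit_transform learns each column's mean and standard deviation,
# then rescales the column to zero mean and unit variance
X_scaled = scaler.fit_transform(X)

print(X_scaled)
print("column means:", X_scaled.mean(axis=0))
print("column stds:", X_scaled.std(axis=0))
```

Models that are fitted iteratively, such as logistic regression, tend to converge more easily on standardized features, which is one reason this step is often placed in front of them.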
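Test dataset:
sklearn.datasets bundles several standard datasets, such as the iris, wine, and breast cancer datasets. They are loaded through a family of load functions; for example, sklearn.datasets.load_iris loads the iris dataset. Each load function returns an object of type sklearn.utils.Bunch, whose most important members are data and target, holding the features and the labels of the dataset respectively. The breast cancer dataset (Breast Cancer Data Set) contains 569 samples: 212 malignant cases (target 0) and 357 benign cases (target 1). Each sample has 30 numeric features, whose names are stored in the feature_names member.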
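The Bunch members described above can be inspected directly. The following is a small sketch that prints the shape of the data, the class names, the number of samples per class, and the feature names; the variable name class_counts is only illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

# data is a 2-D array: one row per sample, one column per feature
print(cancer.data.shape)    # (569, 30)

# target_names maps the integer labels 0 and 1 to class names
print(cancer.target_names)  # ['malignant' 'benign']

# Count how many samples carry each label
counts = np.bincount(cancer.target)
class_counts = {str(name): int(n) for name, n in zip(cancer.target_names, counts)}
print(class_counts)         # {'malignant': 212, 'benign': 357}

# Names of the 30 features
for name in cancer.feature_names:
    print(name)
```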
Breast cancer prediction with logistic regression:
# Load the breast cancer dataset
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
# Build and train the logistic regression model
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
# Hold out 20% of the samples for testing; no random_state is set, so the split and the scores vary from run to run
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, test_size=0.2)
model = LogisticRegression()
model.fit(X_train, y_train)
# score() returns the mean accuracy on the given data
train_score = model.score(X_train, y_train)
print("train_score:", train_score)
print("=================================================================")
test_score = model.score(X_test, y_test)
print("test_score:", test_score)
print("=================================================================")
print('train score: {train_score:.6f}; test score: {test_score:.6f}'.format(train_score=train_score, test_score=test_score))
print("=================================================================")
# Model evaluation
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
y_pred = model.predict(X_test)
#print(y_pred)
accuracy_score_value = accuracy_score(y_test, y_pred)
print("accuracy:", accuracy_score_value)
print("=================================================================")
# recall_score and precision_score treat label 1 (benign) as the positive class by default
recall_score_value = recall_score(y_test, y_pred)
print("recall:", recall_score_value)
print("=================================================================")
precision_score_value = precision_score(y_test, y_pred)
print("precision:", precision_score_value)
print("=================================================================")
# Per-class precision, recall, and F1 in a single table
classification_report_value = classification_report(y_test, y_pred)
print(classification_report_value)
print("=================================================================")
print("ok")
Code output:
train_score: 0.9428571428571428
=================================================================
test_score: 0.9385964912280702
=================================================================
train score: 0.942857; test score: 0.938596
=================================================================
accuracy: 0.9385964912280702
=================================================================
recall: 0.9605263157894737
=================================================================
precision: 0.948051948051948
=================================================================
              precision    recall  f1-score   support

           0       0.92      0.89      0.91        38
           1       0.95      0.96      0.95        76

    accuracy                           0.94       114
   macro avg       0.93      0.93      0.93       114
weighted avg       0.94      0.94      0.94       114

=================================================================
ok
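The recall and precision values printed above use scikit-learn's default pos_label=1, which in this dataset is the benign class. In the report, the malignant class (label 0) has a recall of 0.89, which is usually the number of more interest when the goal is to avoid missing a cancer. The sketch below shows one way to get that value directly, together with the confusion matrix. The settings random_state=0 and max_iter=10000 are assumptions added here for repeatability and convergence; they are not part of the original run, so the numbers will differ from the output above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, recall_score
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
# random_state=0 is an assumption, used only to make the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.2, random_state=0
)

# max_iter is raised (assumption) so the solver has room to converge on unscaled features
model = LogisticRegression(max_iter=10000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are true classes, columns are predicted classes, in label order [0, 1]
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
print(cm)

# Recall for the malignant class: correctly detected malignant cases / all malignant cases
malignant_recall = recall_score(y_test, y_pred, pos_label=0)
print("malignant recall:", malignant_recall)
```

In the confusion matrix, cm[0, 1] counts malignant samples predicted as benign, which are the missed cancers that malignant recall penalizes.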
Appendix:
http://renpeter.cn/2022/06/02/3%E5%A4%A7%E6%A0%91%E6%A8%A1%E5%9E%8B%E5%AE%9E%E6%88%98%E4%B9%B3%E8%85%BA%E7%99%8C%E9%A2%84%E6%B5%8B%E5%88%86%E7%B1%BB.html
Tags:
bioinfo