Table of Contents
- 1. Importing the Data
- 2. Data Processing
- 2.1 Diagnosis Proportion
- 2.2 Correlation Analysis
- 2.3 Age vs. Diagnosis
- 3. Feature Selection
- 4. Building the Dataset
- 4.1 Train/Test Split and Standardization
- 4.2 Building the DataLoaders
- 5. Building the Model
- 6. Training Setup
- 6.1 Training Function
- 6.2 Test Function
- 6.3 Hyperparameters
- 7. Model Training
- 8. Model Evaluation
- 8.1 Result Curves
- 8.2 Confusion Matrix
- 9. Summary
- 🍨 This post is a study-log entry from the 🔗365天深度学习训练营 (365-day deep learning training camp)
- 🍖 Original author: K同学啊
1. Importing the Data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
plt.rcParams["font.sans-serif"] = ["Microsoft YaHei"] # 显示中文
plt.rcParams['axes.unicode_minus'] = False # 显示负号
data_df = pd.read_csv("alzheimers_disease_data.csv")
data_df.head()
PatientID | Age | Gender | Ethnicity | EducationLevel | BMI | Smoking | AlcoholConsumption | PhysicalActivity | DietQuality | ... | MemoryComplaints | BehavioralProblems | ADL | Confusion | Disorientation | PersonalityChanges | DifficultyCompletingTasks | Forgetfulness | Diagnosis | DoctorInCharge | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 4751 | 73 | 0 | 0 | 2 | 22.927749 | 0 | 13.297218 | 6.327112 | 1.347214 | ... | 0 | 0 | 1.725883 | 0 | 0 | 0 | 1 | 0 | 0 | XXXConfid |
1 | 4752 | 89 | 0 | 0 | 0 | 26.827681 | 0 | 4.542524 | 7.619885 | 0.518767 | ... | 0 | 0 | 2.592424 | 0 | 0 | 0 | 0 | 1 | 0 | XXXConfid |
2 | 4753 | 73 | 0 | 3 | 1 | 17.795882 | 0 | 19.555085 | 7.844988 | 1.826335 | ... | 0 | 0 | 7.119548 | 0 | 1 | 0 | 1 | 0 | 0 | XXXConfid |
3 | 4754 | 74 | 1 | 0 | 1 | 33.800817 | 1 | 12.209266 | 8.428001 | 7.435604 | ... | 0 | 1 | 6.481226 | 0 | 0 | 0 | 0 | 0 | 0 | XXXConfid |
4 | 4755 | 89 | 0 | 0 | 0 | 20.716974 | 0 | 18.454356 | 6.310461 | 0.795498 | ... | 0 | 0 | 0.014691 | 0 | 0 | 1 | 1 | 0 | 0 | XXXConfid |
5 rows × 35 columns
# rename the columns to Chinese labels
data_df.rename(columns={
    "Age": "年龄", "Gender": "性别", "Ethnicity": "种族", "EducationLevel": "教育水平",
    "BMI": "身体质量指数(BMI)", "Smoking": "吸烟状况", "AlcoholConsumption": "酒精摄入量",
    "PhysicalActivity": "体育活动时间", "DietQuality": "饮食质量评分", "SleepQuality": "睡眠质量评分",
    "FamilyHistoryAlzheimers": "家族阿尔茨海默病史", "CardiovascularDisease": "心血管疾病",
    "Diabetes": "糖尿病", "Depression": "抑郁症史", "HeadInjury": "头部受伤", "Hypertension": "高血压",
    "SystolicBP": "收缩压", "DiastolicBP": "舒张压", "CholesterolTotal": "胆固醇总量",
    "CholesterolLDL": "低密度脂蛋白胆固醇(LDL)", "CholesterolHDL": "高密度脂蛋白胆固醇(HDL)",
    "CholesterolTriglycerides": "甘油三酯", "MMSE": "简易精神状态检查(MMSE)得分",
    "FunctionalAssessment": "功能评估得分", "MemoryComplaints": "记忆抱怨", "BehavioralProblems": "行为问题",
    "ADL": "日常生活活动(ADL)得分", "Confusion": "混乱与定向障碍", "Disorientation": "迷失方向",
    "PersonalityChanges": "人格变化", "DifficultyCompletingTasks": "完成任务困难",
    "Forgetfulness": "健忘", "Diagnosis": "诊断状态", "DoctorInCharge": "主诊医生"
}, inplace=True)
data_df.columns
Index(['PatientID', '年龄', '性别', '种族', '教育水平', '身体质量指数(BMI)', '吸烟状况', '酒精摄入量',
'体育活动时间', '饮食质量评分', '睡眠质量评分', '家族阿尔茨海默病史', '心血管疾病', '糖尿病', '抑郁症史',
'头部受伤', '高血压', '收缩压', '舒张压', '胆固醇总量', '低密度脂蛋白胆固醇(LDL)',
'高密度脂蛋白胆固醇(HDL)', '甘油三酯', '简易精神状态检查(MMSE)得分', '功能评估得分', '记忆抱怨', '行为问题',
'日常生活活动(ADL)得分', '混乱与定向障碍', '迷失方向', '人格变化', '完成任务困难', '健忘', '诊断状态',
'主诊医生'],
dtype='object')
2. Data Processing
data_df.isnull().sum()
PatientID 0
年龄 0
性别 0
种族 0
教育水平 0
身体质量指数(BMI) 0
吸烟状况 0
酒精摄入量 0
体育活动时间 0
饮食质量评分 0
睡眠质量评分 0
家族阿尔茨海默病史 0
心血管疾病 0
糖尿病 0
抑郁症史 0
头部受伤 0
高血压 0
收缩压 0
舒张压 0
胆固醇总量 0
低密度脂蛋白胆固醇(LDL) 0
高密度脂蛋白胆固醇(HDL) 0
甘油三酯 0
简易精神状态检查(MMSE)得分 0
功能评估得分 0
记忆抱怨 0
行为问题 0
日常生活活动(ADL)得分 0
混乱与定向障碍 0
迷失方向 0
人格变化 0
完成任务困难 0
健忘 0
诊断状态 0
主诊医生 0
dtype: int64
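Every column comes back with zero missing values. Before encoding, a couple of quick extra integrity checks can also be worth running; a minimal sketch, not part of the original pipeline:

# extra data-quality checks
print("Duplicate rows:", data_df.duplicated().sum())    # expect 0
print(data_df.select_dtypes(include="object").columns)  # non-numeric columns that still need encoding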
from sklearn.preprocessing import LabelEncoder

# create a LabelEncoder instance
label_encoder = LabelEncoder()

# label-encode the only non-numeric column
data_df['主诊医生'] = label_encoder.fit_transform(data_df['主诊医生'])
data_df.head()
PatientID | 年龄 | 性别 | 种族 | 教育水平 | 身体质量指数(BMI) | 吸烟状况 | 酒精摄入量 | 体育活动时间 | 饮食质量评分 | ... | 记忆抱怨 | 行为问题 | 日常生活活动(ADL)得分 | 混乱与定向障碍 | 迷失方向 | 人格变化 | 完成任务困难 | 健忘 | 诊断状态 | 主诊医生 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 4751 | 73 | 0 | 0 | 2 | 22.927749 | 0 | 13.297218 | 6.327112 | 1.347214 | ... | 0 | 0 | 1.725883 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
1 | 4752 | 89 | 0 | 0 | 0 | 26.827681 | 0 | 4.542524 | 7.619885 | 0.518767 | ... | 0 | 0 | 2.592424 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
2 | 4753 | 73 | 0 | 3 | 1 | 17.795882 | 0 | 19.555085 | 7.844988 | 1.826335 | ... | 0 | 0 | 7.119548 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
3 | 4754 | 74 | 1 | 0 | 1 | 33.800817 | 1 | 12.209266 | 8.428001 | 7.435604 | ... | 0 | 1 | 6.481226 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 4755 | 89 | 0 | 0 | 0 | 20.716974 | 0 | 18.454356 | 6.310461 | 0.795498 | ... | 0 | 0 | 0.014691 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
5 rows × 35 columns
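To see exactly what LabelEncoder learned, its classes_ attribute holds the fitted mapping; a quick sketch (here the masked doctor name maps to a single code):

# inspect the fitted label mapping: original value -> integer code
mapping = dict(zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_)))
print(mapping)  # e.g. {'XXXConfid': 0} if the column held one masked value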
2.1 Diagnosis Proportion
# count diagnosed vs. not diagnosed
counts = data_df["诊断状态"].value_counts()
# convert counts to percentages
sizes = counts / counts.sum() * 100
# draw a donut chart
fig, ax = plt.subplots()
wedges, texts, autotexts = ax.pie(sizes, labels=sizes.index, autopct='%1.2f%%', startangle=90, wedgeprops=dict(width=0.3))
plt.title("患病占比(1患病,0没有患病)")
plt.show()
2.2 Correlation Analysis
plt.figure(figsize=(40, 35))
sns.heatmap(data_df.corr(), annot=True, fmt=".2f")
plt.show()
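A 34×34 heatmap is hard to read at a glance. As a complement, a small sketch that ranks features by absolute correlation with the diagnosis label:

# rank features by |Pearson correlation| with the diagnosis column
corr_with_target = data_df.corr()['诊断状态'].drop('诊断状态')
print(corr_with_target.abs().sort_values(ascending=False).head(10))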
2.3 Age vs. Diagnosis
data_df['年龄'].min(), data_df['年龄'].max()
(np.int64(60), np.int64(90))
# count diagnosed and not-diagnosed patients at each age
grouped = data_df.groupby('年龄').agg({'诊断状态': ['sum', 'size']})  # sum = diagnosed count, size = group size
grouped.columns = ['患病', '总人数']
grouped['不患病'] = grouped['总人数'] - grouped['患病']  # not-diagnosed count
# plotting style
sns.set(style="whitegrid")
plt.figure(figsize=(12, 5))
# x-axis labels (the ages), converted to strings for display
x = grouped.index.astype(str)
# stacked bars: diagnosed counts sit on top of the not-diagnosed counts
plt.bar(x, grouped["不患病"], 0.35, label="不患病", color='skyblue')
plt.bar(x, grouped["患病"], 0.35, bottom=grouped["不患病"], label="患病", color='salmon')
# title and axis labels
plt.title("患病年龄分布", fontproperties='Microsoft YaHei')
plt.xlabel("年龄", fontproperties='Microsoft YaHei')
plt.ylabel("人数", fontproperties='Microsoft YaHei')
# apply the same font to the legend
plt.legend(prop={'family': 'Microsoft YaHei'})
# render
plt.tight_layout()
plt.show()
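Raw counts can mask how strongly each age group is affected. A short sketch (reusing the `grouped` frame above) that plots the prevalence, 患病 / 总人数, per age:

# prevalence per age: diagnosed / total
grouped['患病率'] = grouped['患病'] / grouped['总人数']
plt.figure(figsize=(12, 3))
plt.plot(grouped.index, grouped['患病率'], marker='o', color='salmon')
plt.title("各年龄患病率", fontproperties='Microsoft YaHei')
plt.xlabel("年龄", fontproperties='Microsoft YaHei')
plt.ylabel("患病率", fontproperties='Microsoft YaHei')
plt.tight_layout()
plt.show()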
3. Feature Selection
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
data = data_df.copy()
X = data_df.iloc[:, 1:-2]   # all features except PatientID and the last two columns (诊断状态, 主诊医生)
y = data_df.iloc[:, -2]     # 诊断状态 is the label
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# standardize the features
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# fit a decision tree as a baseline and to obtain feature importances
tree = DecisionTreeClassifier()
tree.fit(X_train, y_train)
pred = tree.predict(X_test)

reporter = classification_report(y_test, pred)
print(reporter)
              precision    recall  f1-score   support

           0       0.91      0.92      0.92       277
           1       0.85      0.84      0.85       153

    accuracy                           0.89       430
   macro avg       0.88      0.88      0.88       430
weighted avg       0.89      0.89      0.89       430
# plot the feature importances from the decision tree
feature_importances = tree.feature_importances_
features_df = pd.DataFrame({'特征': X.columns, '重要度': feature_importances})
features_df.sort_values(by='重要度', ascending=False, inplace=True)

plt.figure(figsize=(20, 10))
sns.barplot(x='重要度', y='特征', data=features_df)
plt.xlabel('重要度')
plt.ylabel('特征')
plt.title('决策树特征重要度')
plt.show()
from sklearn.feature_selection import RFE
# use RFE (recursive feature elimination) to select features
rfe_selector = RFE(estimator=tree, n_features_to_select=20)  # keep the top 20 features
rfe_selector.fit(X, y)
X_new = rfe_selector.transform(X)
feature_names = np.array(X.columns)
selected_feature_names = feature_names[rfe_selector.support_]
print(selected_feature_names)
['年龄' '种族' '教育水平' '身体质量指数(BMI)' '酒精摄入量' '体育活动时间' '饮食质量评分' '睡眠质量评分' '心血管疾病'
 '收缩压' '舒张压' '胆固醇总量' '低密度脂蛋白胆固醇(LDL)' '高密度脂蛋白胆固醇(HDL)' '甘油三酯'
 '简易精神状态检查(MMSE)得分' '功能评估得分' '记忆抱怨' '行为问题' '日常生活活动(ADL)得分']
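RFE also records the full elimination order in its `ranking_` attribute (1 = kept; larger ranks were eliminated in earlier rounds). A sketch to inspect which features were dropped and when:

# 1 = selected; larger ranks were eliminated earlier
ranking = pd.DataFrame({'特征': feature_names, '排名': rfe_selector.ranking_})
print(ranking.sort_values('排名'))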
4. Building the Dataset
4.1 Train/Test Split and Standardization
feature_selection = ['年龄', '种族','教育水平','身体质量指数(BMI)', '酒精摄入量', '体育活动时间', '饮食质量评分', '睡眠质量评分', '心血管疾病',
'收缩压', '舒张压', '胆固醇总量', '低密度脂蛋白胆固醇(LDL)', '高密度脂蛋白胆固醇(HDL)', '甘油三酯',
'简易精神状态检查(MMSE)得分', '功能评估得分', '记忆抱怨', '行为问题', '日常生活活动(ADL)得分']
X = data_df[feature_selection]
# Standardization really only suits continuous features, not categorical ones. Since 种族 (ethnicity) is the only categorical feature selected here, I'm taking a small shortcut and scaling everything together.
sc = StandardScaler()
X = sc.fit_transform(X)
X = torch.tensor(np.array(X), dtype=torch.float32)
y = torch.tensor(np.array(y), dtype=torch.long)
# split into train and test sets (same ratio and seed as in section 3)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train.shape, y_train.shape
(torch.Size([1719, 20]), torch.Size([1719]))
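The split above is purely random. Since the classes are imbalanced (roughly 65/35, judging by the support counts in the section 3 report), an optional stratified variant would keep the label ratio identical in both sets; a sketch, not what produced the results below:

# optional: stratified split (not used for the results reported below)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y.numpy())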
4.2 Building the DataLoaders
batch_size = 32

train_dl = DataLoader(
    TensorDataset(X_train, y_train),
    batch_size=batch_size,
    shuffle=True
)
test_dl = DataLoader(
    TensorDataset(X_test, y_test),
    batch_size=batch_size,
    shuffle=False
)
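A quick sanity check that one batch comes out of the loader with the expected shapes:

# peek at a single batch: features (batch, 20), labels (batch,)
xb, yb = next(iter(train_dl))
print(xb.shape, yb.shape)  # torch.Size([32, 20]) torch.Size([32])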
5. Building the Model
class Rnn_Model(nn.Module):
    def __init__(self):
        super().__init__()
        # single-layer RNN over a length-1 "sequence" per sample
        self.rnn = nn.RNN(input_size=20, hidden_size=200, num_layers=1, batch_first=True)
        self.fc1 = nn.Linear(200, 50)
        self.fc2 = nn.Linear(50, 2)

    def forward(self, x):
        # x: (batch, 20). Without a time axis, nn.RNN would treat the whole
        # batch as one unbatched sequence and leak hidden state across samples,
        # so insert a length-1 sequence dimension first.
        x = x.unsqueeze(1)           # (batch, 1, 20)
        out, hidden1 = self.rnn(x)   # out: (batch, 1, 200)
        out = out[:, -1, :]          # take the only timestep: (batch, 200)
        out = self.fc1(out)
        out = self.fc2(out)
        return out
# the dataset is small, so CPU is sufficient
device = "cpu"
model = Rnn_Model().to(device)
model
Rnn_Model(
(rnn): RNN(20, 200, batch_first=True)
(fc1): Linear(in_features=200, out_features=50, bias=True)
(fc2): Linear(in_features=50, out_features=2, bias=True)
)
model(torch.randn(32, 20)).shape
torch.Size([32, 2])
6. Training Setup
6.1 Training Function
def train(data, model, loss_fn, opt):
    size = len(data.dataset)   # number of samples
    batch_num = len(data)      # number of batches
    train_loss, train_acc = 0.0, 0.0

    for X, y in data:
        X, y = X.to(device), y.to(device)
        pred = model(X)
        loss = loss_fn(pred, y)

        # backpropagation
        opt.zero_grad()   # clear accumulated gradients
        loss.backward()   # compute gradients
        opt.step()        # update the parameters

        train_loss += loss.item()
        train_acc += (pred.argmax(1) == y).type(torch.float).sum().item()

    train_loss /= batch_num
    train_acc /= size
    return train_acc, train_loss
6.2 Test Function
def test(data, model, loss_fn):
    size = len(data.dataset)
    batch_num = len(data)
    test_loss, test_acc = 0.0, 0.0

    with torch.no_grad():   # no gradients needed for evaluation
        for X, y in data:
            X, y = X.to(device), y.to(device)
            pred = model(X)
            loss = loss_fn(pred, y)
            test_loss += loss.item()
            test_acc += (pred.argmax(1) == y).type(torch.float).sum().item()

    test_loss /= batch_num
    test_acc /= size
    return test_acc, test_loss
6.3 Hyperparameters
loss_fn = nn.CrossEntropyLoss()   # loss function
learn_lr = 1e-4                   # learning rate
optimizer = torch.optim.Adam(model.parameters(), lr=learn_lr)
7. Model Training
train_acc = []
train_loss = []
test_acc = []
test_loss = []
epochs = 50

for i in range(epochs):
    model.train()
    epoch_train_acc, epoch_train_loss = train(train_dl, model, loss_fn, optimizer)

    model.eval()
    epoch_test_acc, epoch_test_loss = test(test_dl, model, loss_fn)

    train_acc.append(epoch_train_acc)
    train_loss.append(epoch_train_loss)
    test_acc.append(epoch_test_acc)
    test_loss.append(epoch_test_loss)

    # log progress
    template = ('Epoch:{:2d}, Train_acc:{:.1f}%, Train_loss:{:.3f}, Test_acc:{:.1f}%, Test_loss:{:.3f}')
    print(template.format(i + 1, epoch_train_acc*100, epoch_train_loss, epoch_test_acc*100, epoch_test_loss))

print("Done")
Epoch: 1, Train_acc:57.9%, Train_loss:0.675, Test_acc:66.0%, Test_loss:0.608
Epoch: 2, Train_acc:67.2%, Train_loss:0.589, Test_acc:68.8%, Test_loss:0.556
Epoch: 3, Train_acc:75.1%, Train_loss:0.540, Test_acc:75.1%, Test_loss:0.506
Epoch: 4, Train_acc:79.1%, Train_loss:0.485, Test_acc:82.1%, Test_loss:0.460
Epoch: 5, Train_acc:83.0%, Train_loss:0.442, Test_acc:81.4%, Test_loss:0.427
Epoch: 6, Train_acc:83.5%, Train_loss:0.411, Test_acc:84.2%, Test_loss:0.407
Epoch: 7, Train_acc:83.3%, Train_loss:0.395, Test_acc:82.8%, Test_loss:0.400
Epoch: 8, Train_acc:84.1%, Train_loss:0.383, Test_acc:84.0%, Test_loss:0.396
Epoch: 9, Train_acc:84.1%, Train_loss:0.380, Test_acc:84.0%, Test_loss:0.394
Epoch:10, Train_acc:83.9%, Train_loss:0.375, Test_acc:84.0%, Test_loss:0.395
Epoch:11, Train_acc:84.5%, Train_loss:0.375, Test_acc:84.4%, Test_loss:0.396
Epoch:12, Train_acc:84.5%, Train_loss:0.374, Test_acc:83.5%, Test_loss:0.399
Epoch:13, Train_acc:83.7%, Train_loss:0.373, Test_acc:83.0%, Test_loss:0.401
Epoch:14, Train_acc:84.3%, Train_loss:0.372, Test_acc:84.0%, Test_loss:0.402
Epoch:15, Train_acc:84.1%, Train_loss:0.375, Test_acc:83.3%, Test_loss:0.400
Epoch:16, Train_acc:84.6%, Train_loss:0.370, Test_acc:83.0%, Test_loss:0.404
Epoch:17, Train_acc:84.2%, Train_loss:0.371, Test_acc:83.0%, Test_loss:0.406
Epoch:18, Train_acc:84.3%, Train_loss:0.377, Test_acc:83.3%, Test_loss:0.401
Epoch:19, Train_acc:84.8%, Train_loss:0.371, Test_acc:83.0%, Test_loss:0.402
Epoch:20, Train_acc:84.8%, Train_loss:0.372, Test_acc:83.5%, Test_loss:0.402
Epoch:21, Train_acc:84.9%, Train_loss:0.374, Test_acc:83.7%, Test_loss:0.399
Epoch:22, Train_acc:85.2%, Train_loss:0.369, Test_acc:83.7%, Test_loss:0.401
Epoch:23, Train_acc:84.7%, Train_loss:0.374, Test_acc:84.4%, Test_loss:0.401
Epoch:24, Train_acc:84.2%, Train_loss:0.371, Test_acc:84.2%, Test_loss:0.398
Epoch:25, Train_acc:84.3%, Train_loss:0.370, Test_acc:83.7%, Test_loss:0.399
Epoch:26, Train_acc:84.8%, Train_loss:0.373, Test_acc:83.7%, Test_loss:0.398
Epoch:27, Train_acc:84.6%, Train_loss:0.373, Test_acc:83.7%, Test_loss:0.395
Epoch:28, Train_acc:85.1%, Train_loss:0.372, Test_acc:83.5%, Test_loss:0.397
Epoch:29, Train_acc:84.4%, Train_loss:0.373, Test_acc:84.4%, Test_loss:0.399
Epoch:30, Train_acc:85.0%, Train_loss:0.371, Test_acc:83.7%, Test_loss:0.401
Epoch:31, Train_acc:84.7%, Train_loss:0.372, Test_acc:83.7%, Test_loss:0.401
Epoch:32, Train_acc:84.5%, Train_loss:0.372, Test_acc:84.0%, Test_loss:0.400
Epoch:33, Train_acc:84.4%, Train_loss:0.369, Test_acc:83.5%, Test_loss:0.397
Epoch:34, Train_acc:84.7%, Train_loss:0.369, Test_acc:83.7%, Test_loss:0.401
Epoch:35, Train_acc:84.6%, Train_loss:0.372, Test_acc:83.3%, Test_loss:0.396
Epoch:36, Train_acc:84.8%, Train_loss:0.370, Test_acc:83.3%, Test_loss:0.396
Epoch:37, Train_acc:84.8%, Train_loss:0.371, Test_acc:83.5%, Test_loss:0.399
Epoch:38, Train_acc:84.9%, Train_loss:0.369, Test_acc:83.5%, Test_loss:0.398
Epoch:39, Train_acc:85.0%, Train_loss:0.370, Test_acc:83.0%, Test_loss:0.395
Epoch:40, Train_acc:83.8%, Train_loss:0.371, Test_acc:83.7%, Test_loss:0.394
Epoch:41, Train_acc:84.6%, Train_loss:0.370, Test_acc:83.7%, Test_loss:0.394
Epoch:42, Train_acc:85.1%, Train_loss:0.370, Test_acc:84.2%, Test_loss:0.392
Epoch:43, Train_acc:84.4%, Train_loss:0.371, Test_acc:84.0%, Test_loss:0.393
Epoch:44, Train_acc:84.5%, Train_loss:0.372, Test_acc:84.7%, Test_loss:0.396
Epoch:45, Train_acc:85.3%, Train_loss:0.372, Test_acc:84.2%, Test_loss:0.396
Epoch:46, Train_acc:85.0%, Train_loss:0.368, Test_acc:84.4%, Test_loss:0.397
Epoch:47, Train_acc:85.0%, Train_loss:0.372, Test_acc:84.0%, Test_loss:0.395
Epoch:48, Train_acc:84.5%, Train_loss:0.370, Test_acc:84.4%, Test_loss:0.394
Epoch:49, Train_acc:85.1%, Train_loss:0.368, Test_acc:84.2%, Test_loss:0.400
Epoch:50, Train_acc:84.9%, Train_loss:0.370, Test_acc:84.2%, Test_loss:0.397
Done
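The loop above does not persist the trained weights. A minimal sketch for saving and restoring them (the file name model.pth is my choice):

# save the trained weights, then reload them into a fresh model
torch.save(model.state_dict(), "model.pth")
restored = Rnn_Model().to(device)
restored.load_state_dict(torch.load("model.pth", map_location=device))
restored.eval()  # switch to eval mode before inference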
8. Model Evaluation
8.1 Result Curves
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")  # suppress warnings
from datetime import datetime
current_time = datetime.now()      # current time for the check-in screenshot

epochs_range = range(epochs)

plt.figure(figsize=(12, 3))
plt.subplot(1, 2, 1)
plt.plot(epochs_range, train_acc, label='Training Accuracy')
plt.plot(epochs_range, test_acc, label='Test Accuracy')
plt.legend(loc='lower right')
plt.title('Training and Test Accuracy')
plt.xlabel(current_time)  # include the timestamp when checking in, otherwise the code screenshot is invalid

plt.subplot(1, 2, 2)
plt.plot(epochs_range, train_loss, label='Training Loss')
plt.plot(epochs_range, test_loss, label='Test Loss')
plt.legend(loc='upper right')
plt.title('Training and Test Loss')
plt.show()
8.2 Confusion Matrix
from sklearn.metrics import confusion_matrix

# predict on the test set (eval mode, no gradient tracking)
model.eval()
with torch.no_grad():
    pred = model(X_test.to(device)).argmax(1).cpu().numpy()

# compute the confusion matrix
cm = confusion_matrix(y_test, pred)

# plot it as a heatmap
plt.figure(figsize=(6, 5))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
plt.title("Confusion Matrix")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.tight_layout()
plt.show()
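The confusion matrix can be summarized with per-class precision and recall, mirroring the decision-tree report from section 3; a short sketch reusing the `pred` array computed above (classification_report was already imported there):

# per-class precision / recall / F1 for the RNN classifier
print(classification_report(y_test, pred, target_names=['Not diagnosed', 'Diagnosed']))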
9. Summary
This week I built on last week's work and refined the Alzheimer's disease diagnosis model, adding RFE (recursive feature elimination) for feature selection. Working through the code gave me a better understanding of the model and of how to apply this feature-selection method.