DAY 23 Training
- DAY23 Machine Learning Pipeline (pipeline)
  - Basic Concepts
    - Transformer
    - Estimator
    - Pipeline
  - Code Demonstration
    - Code without a pipeline
    - Pipeline code walkthrough
      - Importing libraries and loading data
      - Separating features and labels, splitting the dataset
      - Defining the preprocessing steps
      - Building the complete pipeline
      - Training and evaluating with the Pipeline
  - Summary
DAY23 Machine Learning Pipeline (pipeline)
In machine learning, the "pipeline" is a crucial concept: it gives us an efficient, structured way to organize and execute a machine learning workflow. Today I will walk through the concept and application of machine learning pipelines, and demonstrate with code how to build a complete pipeline.
Basic Concepts
Transformer
A transformer is an estimator used for data preprocessing and feature extraction. It implements a fit method, which learns its parameters from the data, and a transform method, which applies them. Common transformers include scalers (such as StandardScaler and MinMaxScaler), feature selectors (such as SelectKBest), dimensionality reducers (such as PCA), and feature extractors (such as CountVectorizer and TfidfVectorizer).
from sklearn.preprocessing import StandardScaler
# Initialize the transformer
scaler = StandardScaler()
# Learn the scaling parameters from the training data
scaler.fit(X_train)
# Apply the learned parameters to both the training and test data
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
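On the training split, fit and transform are often combined into a single call; a minimal sketch of the common shorthand (using the same scaler as above):
# fit_transform learns the scaling parameters and applies them in one step (training data only)
X_train_scaled = scaler.fit_transform(X_train)
# The test data is still transformed with the parameters learned from the training data
X_test_scaled = scaler.transform(X_test)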
Estimator
An estimator is an object or class that implements a machine learning algorithm. It is used to fit the data (fit) and to make predictions (predict). Common estimators include classifiers, regressors, and clusterers.
from sklearn.linear_model import LinearRegression
# Create a regressor
model = LinearRegression()
# Train the model on the training set
model.fit(X_train_scaled, y_train)
# Predict on the test set
y_pred = model.predict(X_test_scaled)
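Once fitted, most estimators also expose a score method for a quick evaluation; a minimal sketch (for LinearRegression, score returns the R² on the given data):
# R^2 of the fitted regressor on the test set
r2 = model.score(X_test_scaled, y_test)
print(f"R^2 on the test set: {r2:.4f}")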
Pipeline
A pipeline is a mechanism that chains multiple transformers and an estimator together in sequence, forming a complete data-processing and model-training workflow. The Pipeline class is used to organize and connect the individual transformers and estimators.
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
# Build a simple pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),              # standardize the data
    ('classifier', RandomForestClassifier())   # classifier
])
# Train the pipeline on the training set
pipeline.fit(X_train, y_train)
# Predict on the test set
y_pred = pipeline.predict(X_test)
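After fitting, the individual steps of the pipeline remain accessible through named_steps, and the whole pipeline can be scored directly; a minimal sketch:
# The fitted scaler inside the pipeline, with the per-feature means it learned
print(pipeline.named_steps['scaler'].mean_)
# Accuracy of the whole pipeline on the test set
print(pipeline.score(X_test, y_test))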
Code Demonstration
Code without a pipeline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
# Set a font that supports Chinese (so Chinese characters render correctly in plots)
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False
data = pd.read_csv('data.csv')
# First select the string (object) columns
discrete_features = data.select_dtypes(include=['object']).columns.tolist()
# Label-encode Home Ownership
home_ownership_mapping = {
    'Own Home': 1,
    'Rent': 2,
    'Have Mortgage': 3,
    'Home Mortgage': 4
}
data['Home Ownership'] = data['Home Ownership'].map(home_ownership_mapping)
# Label-encode Years in current job
years_in_job_mapping = {
    '< 1 year': 1,
    '1 year': 2,
    '2 years': 3,
    '3 years': 4,
    '4 years': 5,
    '5 years': 6,
    '6 years': 7,
    '7 years': 8,
    '8 years': 9,
    '9 years': 10,
    '10+ years': 11
}
data['Years in current job'] = data['Years in current job'].map(years_in_job_mapping)
# One-hot encode Purpose
data = pd.get_dummies(data, columns=['Purpose'])
# Re-read the raw data to identify the newly created dummy columns, then cast them to int
data2 = pd.read_csv("data.csv")
list_final = []
for i in data.columns:
    if i not in data2.columns:
        list_final.append(i)
for i in list_final:
    data[i] = data[i].astype(int)
# Map Term to 0/1
term_mapping = {
    'Short Term': 0,
    'Long Term': 1
}
data['Term'] = data['Term'].map(term_mapping)
data.rename(columns={'Term': 'Long Term'}, inplace=True)
continuous_features = data.select_dtypes(include=['int64', 'float64']).columns.tolist()
# Fill missing values in the continuous features with the mode (most frequent value)
for feature in continuous_features:
    mode_value = data[feature].mode()[0]
    data[feature].fillna(mode_value, inplace=True)
from sklearn.model_selection import train_test_split
X = data.drop(['Credit Default'], axis=1)
y = data['Credit Default']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import classification_report, confusion_matrix
import warnings
warnings.filterwarnings("ignore")
# --- 1. Random forest with default parameters ---
print("--- 1. Default random forest (train -> test) ---")
import time
start_time = time.time()
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train, y_train)
rf_pred = rf_model.predict(X_test)
end_time = time.time()
print(f"训练与预测耗时: {end_time - start_time:.4f} 秒")
print("\n默认随机森林 在测试集上的分类报告:")
print(classification_report(y_test, rf_pred))
print("默认随机森林 在测试集上的混淆矩阵:")
print(confusion_matrix(y_test, rf_pred))
Pipeline code walkthrough
Importing libraries and loading data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import time
import warnings
warnings.filterwarnings("ignore")
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
data = pd.read_csv('data.csv')
print("原始数据加载完成,形状为:", data.shape)
分离特征和标签,划分数据集
y = data['Credit Default']
X = data.drop(['Credit Default'], axis=1)
print("\n特征和标签分离完成。")
print("特征 X 的形状:", X.shape)
print("标签 y 的形状:", y.shape)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print("\n数据集划分完成 (预处理之前)。")
print("X_train 形状:", X_train.shape)
print("X_test 形状:", X_test.shape)
print("y_train 形状:", y_train.shape)
print("y_test 形状:", y_test.shape)
定义预处理步骤
# Identify the string (object) and numeric columns
object_cols = X.select_dtypes(include=['object']).columns.tolist()
numeric_cols = X.select_dtypes(exclude=['object']).columns.tolist()
# Ordinal features and their category order
ordinal_features = ['Home Ownership', 'Years in current job', 'Term']
ordinal_categories = [
    ['Own Home', 'Rent', 'Have Mortgage', 'Home Mortgage'],
    ['< 1 year', '1 year', '2 years', '3 years', '4 years', '5 years', '6 years', '7 years', '8 years', '9 years', '10+ years'],
    ['Short Term', 'Long Term']
]
# Impute with the most frequent value, then ordinal-encode
ordinal_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OrdinalEncoder(categories=ordinal_categories, handle_unknown='use_encoded_value', unknown_value=-1))
])
# Nominal (unordered categorical) features: impute, then one-hot encode
nominal_features = ['Purpose']
nominal_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])
# Remaining columns are treated as continuous: impute, then standardize
continuous_features = [f for f in X.columns if f not in ordinal_features + nominal_features]
continuous_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('scaler', StandardScaler())
])
# Combine the three per-column pipelines into a single preprocessor
preprocessor = ColumnTransformer(
    transformers=[
        ('ordinal', ordinal_transformer, ordinal_features),
        ('nominal', nominal_transformer, nominal_features),
        ('continuous', continuous_transformer, continuous_features)
    ],
    remainder='passthrough'
)
print("\nColumnTransformer (preprocessor) defined.")
Building the complete pipeline
# Chain the preprocessor and the classifier into one pipeline
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42))
])
print("\nComplete Pipeline defined.")
Training and evaluating with the Pipeline
print("\n--- 1. Default random forest (train -> test) ---")
start_time = time.time()
pipeline.fit(X_train, y_train)
pipeline_pred = pipeline.predict(X_test)
end_time = time.time()
print(f"训练与预测耗时: {end_time - start_time:.4f} 秒")
print("\n默认随机森林 在测试集上的分类报告:")
print(classification_report(y_test, pipeline_pred))
print("默认随机森林 在测试集上的混淆矩阵:")
print(confusion_matrix(y_test, pipeline_pred))
Summary
By using a pipeline, we can wrap the entire machine learning workflow into a single, concise object, improving the readability and maintainability of the code. A pipeline also helps prevent data leakage, since preprocessing parameters are learned only from the training data, and it simplifies hyperparameter tuning, making the results more reliable and reproducible.
In real projects, pipelines can be used to build complex machine learning workflows and improve working efficiency. I hope today's notes help you better understand and use machine learning pipelines.
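As an example of how a pipeline simplifies hyperparameter tuning, parameters of any step can be addressed as '<step name>__<parameter>' and searched with GridSearchCV; a minimal sketch (the grid values here are illustrative, not tuned):
from sklearn.model_selection import GridSearchCV
# Parameters of a pipeline step are addressed as '<step name>__<parameter>'
param_grid = {
    'classifier__n_estimators': [100, 200],
    'classifier__max_depth': [None, 10],
}
# Each CV fold re-fits the preprocessor on its own training split, so there is no leakage
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='f1', n_jobs=-1)
grid_search.fit(X_train, y_train)
print("Best parameters:", grid_search.best_params_)
print("Best cross-validated F1:", grid_search.best_score_)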
@浙大疏锦行