机器学习-人与机器生数据的区分模型测试-数据处理

机器学习-人与机器生数据的区分模型测试-数据处理 - 续

news2026/5/15 18:45:59

这里继续机器学习-人与机器生数据的区分模型测试-数据处理1的内容

查看数据中1的情况

#查看数据1的分布情况
one_ratio_list = []
for col in data.columns:
    if col == 'city' or col == 'target' or col == 'city2':  # 跳过第一列
        continue
    else:
        one_ratio = data[col].mean()  # 计算1值占比
        print(f"{col}: {one_ratio}")
        one_ratio_list.append(one_ratio)


plt.figure(figsize=(8,4))
sns.histplot(one_ratio_list, bins=20, kde=True)
plt.title('Histogram of 1-Value Proportion Distribution')
plt.xlabel('Proportion of 1 value')
plt.show()

可以看每个区间的具体分布
在这里插入图片描述

应用Apriori算法挖掘频繁项集

查看数据组合有没有意义

# 数据预处理管道
def preprocess_for_apriori(data):
    """
    对输入的数据进行预处理，使其适合 Apriori 算法。
    Apriori 算法要求输入数据为二元数据（仅包含 0 和 1）。

    参数:
    data (pandas.DataFrame): 输入的原始数据，需要转换为适合 Apriori 算法的格式。

    返回:
    pandas.DataFrame: 经过预处理的二元数据，仅包含有效二元字段。
    """
    # 类型转换与验证
    # 将输入数据转换为整数类型，确保数据为数值型
    data_binary = data.astype(int)
    
    # 过滤无效字段
    # 找出所有元素仅为 0 或 1 的列，Apriori 算法要求输入为二元数据
    valid_cols = data_binary.columns[data_binary.isin([0,1]).all()]
    # 从转换后的二进制数据中选取有效列
    data_valid = data_binary[valid_cols]
    
    # 最终验证
    # 确保处理后的数据至少有一个有效二元字段，若没有则抛出异常
    assert data_valid.shape[1] > 0, "无有效二元字段可用"
    return data_valid

执行数据预处理

try:
    data_preprocessed = preprocess_for_apriori(data_clean)
    print(f"有效字段数量: {len(data_preprocessed.columns)}")
    
    # Apriori算法执行
    frequent_itemsets = apriori(data_preprocessed, 
                               min_support=0.05,
                               use_colnames=True,
                               low_memory=True)  # 启用内存优化
    
    if not frequent_itemsets.empty:
        print("Top10高频组合:")
        print(frequent_itemsets.sort_values('support', ascending=False).head(10))
    else:
        print("未找到满足支持度的频繁项集，尝试降低min_support值")
        
except Exception as e:
    print("处理失败:", str(e))



# 逐步降低阈值测试  
for support in [0.05, 0.03, 0.01]:  
    frequent_itemsets = apriori(data_preprocessed, min_support=support)  
    if not frequent_itemsets.empty:  
        print(f"min_support={support}时找到项集")  
        break