机器学习-数据集划分和特征工程

一.数据集划分

API函数：

sklearn.model_selection.train_test_split(*arrays，**options)

参数：

- arrays：多个数组，可以是列表，numpy数组，也可以是dataframe数据框等

- options：（包含以下参数）

- shuffle = True 默认随机抽取

- random_state=x，随机数种子，x是哪个都行，就是固定随机抽取的规则，保证每次都一样

- train_size=x，就是训练集的比例，默认是0.75，和test_size两者选一个就行

- stratify,如果数据集是多分类，则需要指定，比如是二分类，则指定为y（分层划分，这个留到后面再讲）

可以传入多个数据集，返回train与test数据集

# 进行API导入
from sklearn.model_selection  import train_test_split

# 随手创建两个数据集
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [10,20,30,40,50,60,70,80,90,100]

# 调用train_test_split函数，将数据集分为训练集和测试集
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42, shuffle=True)

# 打印训练集和测试集
print(x_train, x_test, y_train, y_test)

[6, 1, 8, 3, 10, 5, 4, 7] [9, 2] [60, 10, 80, 30, 100, 50, 40, 70] [90, 20]

可以看到，两个数据集以完全相同的方式进行划分，在函数中都是用相同的下标进行抽取，使用的是相同的划分规则。

而在日常工作中，我们一般都会像上面一样传入两个数据集，一个就是特征数据集，一个就是标签数据集。我们以鸢尾花为例：

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
# 获取特征
data = iris.data
# 获取标签
target = iris.target

# 进行划分
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.3, random_state=42)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(105, 4) (45, 4) (105,) (45,)

这里取出来的每一整条特征数据都完美对应自己的标签，这也为后续的模型训练奠定了基础。

二.特征工程

2.1 概念

就是对特征进行相关的处理

一般使用pandas来进行数据清洗和数据处理、使用sklearn来进行特征工程

特征工程是将任意数据(如文本或图像)转换为可用于机器学习的数字特征,比如:字典特征提取(特征离散化)、文本特征提取、图像特征提取。

2.2 API

实例化转换器对象（transfer），转换器类有很多：

DictVectorize	字典特征提取
CountVectorizer	文本特征提取
TfidfVectorizer	TF-IDF文本特征词的重要程度特征提取
MinMaxScaler	归一化
StandardScaler	标准化
VarianceThreshold	底方差过滤降维
PCA	主成分分析降维

转换器对象调用 fit_transform() 进行特定转换, fit用于计算数据，transform进行最终转换。fit_transform()可以使用fit()和transform()代替：

data_new = transfer.fit_transform(data)

可写成

transfer.fit(data)

data_new = transfer.transform(data)

2.3 DictVectorizer 字典列表特征提取

讲之前先了解一个概念，稀疏矩阵：

稀疏矩阵是指一个矩阵中大部分元素为零，只有少数元素是非零的矩阵。由于稀疏矩阵中零元素非常多，存储和处理稀疏矩阵时，通常会采用特殊的存储格式三元组表，以节省内存空间并提高计算效率。

比如：

(0,0) 10

(0,1) 20

(2,0) 90

(2,20) 8

(8,0) 70

表示除了列出的索引位置有值, 其余全是0

同样也有非稀疏矩阵（稠密矩阵），与稀疏矩阵相反。

API：

sklearn.feature_extraction.DictVectorizer(sparse=True)

参数：

sparse：默认以稀疏矩阵表示，为True时返回稀疏矩阵，为False时返回稠密矩阵。

这个API会返回一个字典列表特征提取器对象，再通过前面的fit_transform方法，就可以将字典列表转换为稀疏矩阵或稠密矩阵。

# 导入相关库
from sklearn.feature_extraction import DictVectorizer

# 创建一个简单的字典列表数据
data = [
    {'city': '北京', 'temperature': 10},
    {'city': '上海', 'temperature': 20},
    {'city': '广州', 'temperature': 30},
    {'city': '广州', 'temperature': 29}]

# 创建一个DictVectorizer对象
vec = DictVectorizer(sparse=False)

# 转换字典列表数据为特征矩阵
data_new = vec.fit_transform(data)

# 看看转换后的特征名称
print(vec.get_feature_names_out())

# 打印特征矩阵
print(data_new)

['city=上海' 'city=北京' 'city=广州' 'temperature']
[[ 0. 1. 0. 10.]
[ 1. 0. 0. 20.]
[ 0. 0. 1. 30.]
[ 0. 0. 1. 29.]]

可以看到原先的数据集中含有中文特殊字符，在进行转化时转换器就把每个包含中文的键值对当作一个特征名保存。而打印出的特征矩阵中，其中每一行每个数字对应的位置就是特征名列表中的位置，为1则代表是该特征，为0则代表不是该特征。

再来看看三元组的表示：

# 创建一个DictVectorizer对象
vec = DictVectorizer(sparse=True)

# 转换字典列表数据为特征矩阵
data_new = vec.fit_transform(data)

# 看看转换后的特征名称
print(vec.get_feature_names_out())

# 打印特征矩阵
print(data_new)

['city=上海' 'city=北京' 'city=广州' 'temperature']
<Compressed Sparse Row sparse matrix of dtype 'float64'
   with 8 stored elements and shape (4, 4)>
Coords   Values
(0, 1)   1.0
(0, 3)   10.0
(1, 0)   1.0
(1, 3)   20.0
(2, 2)   1.0
(2, 3)   30.0
(3, 2)   1.0
(3, 3)   29.0

三元组(稀疏矩阵)显示的就是除了0以外的所有数值的索引，也就可以根据索引可以还原之前的特征矩阵。

2.4 CountVectorizer 文本特征提取

API：

sklearn.feature_extraction.text.CountVectorizer()

参数：

- stop_words : 又叫黑名单，其中存储的是停用词即不会被提取的词，值类型为list

# 导入相关库
from sklearn.feature_extraction.text import CountVectorizer

# 创建一个简单的英文句子
data = ["I love machine learning.", "I love coding.", "I love both"]

# 创建一个CountVectorizer对象
cv = CountVectorizer(stop_words=["machine", "learning", "coding"])
# 转换句子为词频矩阵
data_new1 = cv.fit_transform(data)
# 查看特征名
print(cv.get_feature_names_out())
# 查看词频矩阵(三元组)
print(data_new1)
# 转化为特征矩阵
print(data_new1.toarray())

['both' 'love']
<Compressed Sparse Row sparse matrix of dtype 'int64'
   with 4 stored elements and shape (3, 2)>
Coords   Values
(0, 1)   1
(1, 1)   1
(2, 1)   1
(2, 0)   1
[[0 1]
[0 1]
[1 1]]

可以看到，其将每个英文单词提取出来作为特征，矩阵中的数字就是出现的次数(频率)。我们也可以不提取某些词，而且这个函数内部也隐式地去掉了一些特殊字符。

但，我们一般用的都是中文，而这又是特殊字符，能正常使用吗？

# 导入相关库
from sklearn.feature_extraction.text import CountVectorizer

# 创建一个简单的中文句子
data = ["第一个句子","第二个句子"]

# 创建一个CountVectorizer对象
cv = CountVectorizer()
# 转换句子为词频矩阵
data_new1 = cv.fit_transform(data)
# 查看特征名
print(cv.get_feature_names_out())
# 查看词频矩阵(三元组)
print(data_new1)
# 转化为特征矩阵
print(data_new1.toarray())

['第一个句子' '第二个句子']
<Compressed Sparse Row sparse matrix of dtype 'int64'
   with 2 stored elements and shape (2, 2)>
Coords   Values
(0, 0)   1
(1, 1)   1
[[1 0]
[0 1]]

很明显，这个地方有大问题，由于英文单词天生就用空格分隔，所以可以直接提取。但中文有自己的分词规则，我们需要用其他分词工具进行分词，比如jieba。

使用下面的命令进行下载：

pip install jieba

# 导入
from sklearn.feature_extraction.text import CountVectorizer
import jieba

data = "第一个句子"

# 用jieba分词，返回一个可迭代对象
data = jieba.cut(data)
data = list(data)
print(data)

# 再将分词结果转换为字符串
data = " ".join(data)
print(data)

# 用CountVectorizer进行特征提取
vec = CountVectorizer()
data_new = vec.fit_transform([data])

print(vec.get_feature_names_out())
print(data_new.toarray())

['第一个', '句子']
第一个句子
['句子' '第一个']
[[1 1]]

# 进行多个句子的提取
from sklearn.feature_extraction.text import CountVectorizer
import jieba

data = ["第一个句子","第二个句子"]

# 写一个函数完成分词和转化字符串
def cut_words(str):
    return " ".join(list(jieba.cut(str)))

# 给data里的句子分词
data = [cut_words(i) for i in data]

# 用CountVectorizer进行特征提取
vec = CountVectorizer()
data_new = vec.fit_transform(data)

# 打印特征提取后的结果
print(vec.get_feature_names_out())
print(data_new.toarray())

['句子' '第一个' '第二个']
[[1 1 0]
[1 0 1]]

2.5 TfidfVectorizer TF-IDF文本特征词的重要程度特征提取

词频(Term Frequency, TF), 表示一个词在当前篇文章中的重要性

逆文档频率(Inverse Document Frequency, IDF), 反映了词在整个文档集合中的稀有程度

API：

sklearn.feature_extraction.text.TfidfVectorizer()

参数：

- stop_words：表示词特征黑名单

使用示例：

# 导入相关库
from sklearn.feature_extraction.text import TfidfVectorizer
import jieba

# 自定义数据集
data = ["第一个句子","第二个句子"]
def cut_words(str):
    return " ".join(list(jieba.cut(str)))

# 给data里的句子分词
data = [cut_words(i) for i in data]

# 创建TF-IDF向量化器
tool = TfidfVectorizer()

# 拟合并转换数据集
data_new = tool.fit_transform(data)

# 输出结果
print(tool.get_feature_names_out())
print(data_new.toarray())

Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\Clocky7\AppData\Local\Temp\jieba.cache
Loading model cost 0.529 seconds.
Prefix dict has been built successfully.

['句子' '第一个' '第二个']
[[0.57973867 0.81480247 0. ]
[0.57973867 0. 0.81480247]]

可以看到，有几个句子（文档）就有几个列表，其中的数据表明对应词语在当前文档中的重要程度。

扩展：

在sklearn库中 TF-IDF算法做了一些细节的优化：

- TfidfVectorizer 中，TF 默认是：直接使用一个词在文档中出现的次数也就是CountVectorizer的结果

- TfidfVectorizer 中，IDF 的默认计算公式是：

- 而且机器学习中还会进行归一化（L2归一化）处理

下面是手动实现TF-IDF算法的代码：

# 手动实现tfidf向量(跟上面的api实现出一样的效果)
import numpy as  np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize
import jieba

data = ["第一个句子","第二个句子"]
def cut_words(str):
    return " ".join(list(jieba.cut(str)))

data = [cut_words(i) for i in data]

def tfidf(x):
    cv = CountVectorizer()
    # 获取词频
    tf = cv.fit_transform(x).toarray()
    # 计算idf
    idf = np.log((len(tf)+1)/(np.sum(tf!=0,axis=0)+1))+1
    tfidf = tf*idf
    # 归一化
    tfidf = normalize(tfidf, norm='l2', axis=1)
    return tfidf

data_new = tfidf(data)
print(data_new)