Experiment: How Missing-Value Handling Affects XGBoost Model Performance

2020-09-27

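Background

The feature matrix fed to an XGBoost model often contains missing values. They mainly come from two sources:

Missing raw data: for example, some fields in the original transaction message are empty, or hold abnormal values that cannot be used.
Feature-engineering logic that does not apply to a given transaction: for example, a feature checks whether a card has previously made three consecutive transactions of the same amount (1 if yes, 0 if no), but the card actually has fewer than three historical transactions.

Common ways to handle these missing values include:

Replace them all with 0. The caveat is that 0 may also be a legitimate feature value.
Replace them all with a value that never occurs in the dataset. For example, if every feature value is non-negative, use -1 to represent a missing value.
Replace them all with NaN. This is XGBoost's default representation of a missing value, and it amounts to letting the model handle missing values automatically.
Replace them with the mean or median. This is only meaningful for continuous features such as an amount.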
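
As a minimal sketch of the four options above, the snippet below applies each one to a tiny made-up matrix using NumPy. The matrix and the helper names `fill_constant` and `fill_column_mean` are illustrative assumptions, not part of the experiment later in this post.

```python
import numpy as np

# Toy feature matrix (hypothetical data); np.nan marks the missing entries.
# Column 0 is a 0/1 flag, column 1 is a continuous amount.
X = np.array([
    [1.0, 200.0],
    [np.nan, 50.0],
    [0.0, np.nan],
    [1.0, 350.0],
])

def fill_constant(X, value):
    """Replace every missing entry with one constant, e.g. 0 or -1."""
    return np.where(np.isnan(X), value, X)

def fill_column_mean(X):
    """Replace each missing entry with the mean of the observed values in its column."""
    col_means = np.nanmean(X, axis=0)
    return np.where(np.isnan(X), col_means, X)

X_zero = fill_constant(X, 0.0)       # option: replace with 0
X_sentinel = fill_constant(X, -1.0)  # option: replace with a value absent from the data
X_nan = X.copy()                     # option: keep NaN and let the model handle it
X_mean = fill_column_mean(X)         # option: replace with the column mean

print(X_zero)
print(X_sentinel)
print(X_mean)
```

After zero filling, the filled entry in the second row and the genuine 0 in the third row of the first column are indistinguishable, which is the caveat noted for zero replacement. The mean fill also turns the 0/1 flag into a fractional value that no real sample has.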

On how missing values should be handled in XGBoost's input, Tianqi Chen, the creator of XGBoost, wrote on GitHub:

xgboost naturally accepts sparse feature format, you can directly feed data in as sparse matrix, and only contains non-missing value.

i.e. features that are not presented in the sparse feature matrix are treated as 'missing'. XGBoost will handle it internally and you do not need to do anything on it.

Internally, XGBoost will automatically learn what is the best direction to go when a value is missing. Equivalently, this can be viewed as automatically "learn" what is the best imputation value for missing values based on reduction on training loss.

Normally, it is fine that you treat missing and zero all as zero:)

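Chen's description tells us the following:

XGBoost handles missing values natively. The model learns from them and finds the best default direction for missing values at each split (the algorithm is called sparsity-aware split finding).
If you feed XGBoost a sparse matrix, it automatically treats the entries absent from the matrix as missing.
Alternatively, you can set the model's `missing` parameter to a value (NaN by default), and the model will treat that value as missing and handle it automatically.
In general, Chen considers it fine to use 0 to represent missing values even when 0 also occurs as a normal value.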
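
To make the idea of a learned default direction concrete, here is a simplified sketch of sparsity-aware split finding for a single feature. For each candidate threshold it computes the split gain twice, once with all missing rows sent left and once with them sent right, and keeps the better option. It is an illustration under stated assumptions (exhaustive thresholds, logistic-loss gradients at an initial prediction of 0.5, an L2 regularization term `lam`, and hypothetical function names), not XGBoost's actual implementation.

```python
import math

def score(g_sum, h_sum, lam=1.0):
    """Structure score of a leaf from its summed gradients and hessians."""
    return g_sum * g_sum / (h_sum + lam)

def best_split_with_default(values, grads, hess, lam=1.0):
    """Return (gain, threshold, default_direction) for one feature.

    values: feature values, with float('nan') marking a missing entry.
    Rows with value < threshold go left; missing rows follow default_direction.
    """
    present = [(v, g, h) for v, g, h in zip(values, grads, hess) if not math.isnan(v)]
    g_miss = sum(g for v, g in zip(values, grads) if math.isnan(v))
    h_miss = sum(h for v, h in zip(values, hess) if math.isnan(v))
    g_total, h_total = sum(grads), sum(hess)
    parent = score(g_total, h_total, lam)

    best = (0.0, None, None)
    for t in sorted({v for v, _, _ in present}):
        g_left = sum(g for v, g, _ in present if v < t)
        h_left = sum(h for v, _, h in present if v < t)
        # Try both default directions for the missing rows.
        for direction in ("left", "right"):
            gl = g_left + (g_miss if direction == "left" else 0.0)
            hl = h_left + (h_miss if direction == "left" else 0.0)
            gr, hr = g_total - gl, h_total - hl
            gain = score(gl, hl, lam) + score(gr, hr, lam) - parent
            if gain > best[0]:
                best = (gain, t, direction)
    return best

# Hypothetical data: the two missing rows share the label of the large values.
nan = float("nan")
values = [1.0, 2.0, nan, nan, 5.0, 6.0]
labels = [0, 0, 1, 1, 1, 1]
# Logistic loss at prediction p = 0.5: gradient = p - y, hessian = p * (1 - p).
grads = [0.5 - y for y in labels]
hess = [0.25 for _ in labels]

gain, threshold, direction = best_split_with_default(values, grads, hess)
print(gain, threshold, direction)
```

In this toy case the missing rows are sent right, alongside the samples that share their label, which is the sense in which the default direction acts like a learned imputation.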

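Experiment

The experiments below compare how different representations of missing values affect model performance.

Dataset

Below is an excerpt of the dataset (the file `horse-colic.csv` loaded by the experiment code):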

2	1	530101	38.5	66	28	3	3	?	2	5	4	4	?	?	?	3	5	45	8.4	?	?	2	2	11300	0	0	2
1	1	534817	39.2	88	20	?	?	4	1	3	4	2	?	?	?	4	2	50	85	2	2	3	2	2208	0	0	2
2	1	530334	38.3	40	24	1	1	3	1	3	3	1	?	?	?	1	1	33	6.7	?	?	1	2	0	0	0	1
1	9	5290409	39.1	164	84	4	1	6	2	2	4	4	1	2	5	3	?	48	7.2	3	5.3	2	1	2208	0	0	1
2	1	530255	37.3	104	35	?	?	6	2	?	?	?	?	?	?	?	?	74	7.4	?	?	2	2	4300	0	0	2
2	1	528355	?	?	?	2	1	3	1	2	3	2	2	1	?	3	3	?	?	?	?	1	2	0	0	0	2
1	1	526802	37.9	48	16	1	1	1	1	3	3	3	1	1	?	3	5	37	7	?	?	1	1	3124	0	0	2
1	1	529607	?	60	?	3	?	?	1	?	4	2	2	1	?	3	4	44	8.3	?	?	2	1	2208	0	0	2
2	1	530051	?	80	36	3	4	3	1	4	4	4	2	1	?	3	5	38	6.2	?	?	3	1	3205	0	0	2
2	9	5299629	38.3	90	?	1	?	1	1	5	3	1	2	1	?	3	?	40	6.2	1	2.2	1	2	0	0	0	1
1	1	528548	38.1	66	12	3	3	5	1	3	3	1	2	1	3	2	5	44	6	2	3.6	1	1	2124	0	0	1
2	1	527927	39.1	72	52	2	?	2	1	2	1	2	1	1	?	4	4	50	7.8	?	?	1	1	2111	0	0	2
1	1	528031	37.2	42	12	2	1	1	1	3	3	3	3	1	?	4	5	?	7	?	?	1	2	4124	0	0	2
2	9	5291329	38	92	28	1	1	2	1	1	3	2	3	?	7.2	1	1	37	6.1	1	?	2	2	0	0	0	1
1	1	534917	38.2	76	28	3	1	1	1	3	4	1	2	2	?	4	4	46	81	1	2	1	1	2112	0	0	2
1	1	530233	37.6	96	48	3	1	4	1	5	3	3	2	3	4.5	4	?	45	6.8	?	?	2	1	3207	0	0	2
1	9	5301219	?	128	36	3	3	4	2	4	4	3	3	?	?	4	5	53	7.8	3	4.7	2	2	1400	0	0	1

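Each row of the dataset is one sample. The last column is the class label, and all preceding columns are features. The features include both continuous numeric features and discrete categorical ones.

A question mark "?" in the dataset indicates a missing value.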
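
As an aside, pandas can turn the "?" markers into NaN at load time through the `na_values` argument of `read_csv`, so the columns come out numeric without a separate replacement step. The sketch below uses the first ten columns of the first three rows of the excerpt as an in-memory sample. It is an alternative to the loading code used in this experiment, not a copy of it.

```python
import io

import pandas as pd

# First ten columns of the first three rows of the excerpt above.
sample = """\
2 1 530101 38.5 66 28 3 3 ? 2
1 1 534817 39.2 88 20 ? ? 4 1
2 1 530334 38.3 40 24 1 1 3 1
"""

# na_values="?" makes pandas parse every "?" as NaN while reading.
df = pd.read_csv(io.StringIO(sample), sep=r"\s+", header=None, na_values="?")

# Count the missing entries in each column.
missing_per_column = df.isna().sum()
print(missing_per_column)
```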

The following code runs the missing-value replacement experiment:

# binary classification, missing data
import numpy as np
from pandas import read_csv
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
# load data
# sep=r"\s+" splits on whitespace; delim_whitespace is deprecated in recent pandas
dataframe = read_csv("horse-colic.csv", sep=r"\s+", header=None)
dataset = dataframe.values
# split data into X and y
X = dataset[:,0:27]
Y = dataset[:,27]
# set missing values to 0
X[X == '?'] = 0
# convert to numeric
X = X.astype('float32')
# encode Y class values as integers
label_encoder = LabelEncoder()
label_encoder = label_encoder.fit(Y)
label_encoded_y = label_encoder.transform(Y)
# split data into train and test sets
seed = 7
test_size = 0.33
X_train, X_test, y_train, y_test = train_test_split(X, label_encoded_y, test_size=test_size, random_state=seed)
# fit model on training data
model = XGBClassifier()
model.fit(X_train, y_train)
print(model)
# make predictions for test data
predictions = model.predict(X_test)
# evaluate predictions
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))

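In the code, the statement `X[X == '?'] = 0` determines how missing values are replaced.

Running the experiment with different replacement values gives the following results:

Missing value replaced with   Prediction accuracy
np.nan                        85.86%
0                             83.84%
-1                            83.84%
1                             79.80%
Column mean                   79.80%

As the results show, values that destroy the original meaning of the data (such as 1, which is easily confused with legitimate feature values, or the result of averaging a categorical feature) give the worst model performance.

Using 0, or a value that does not occur in the dataset such as -1, gives better predictions.

The best predictions come from np.nan, presumably because of XGBoost's ability to handle missing values automatically, as Chen described. The XGBoost model has a `missing` parameter whose default value is np.nan. Setting this parameter and letting the model handle missing values appears to work better than replacing them ourselves.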

To verify this, use -1 as the replacement value and also pass -1 to the model as the `missing` parameter.

X[X == '?'] = -1
..........
model = XGBClassifier(missing=-1)

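Running it again yields the highest accuracy, 85.86%.

Conclusion

XGBoost can handle missing values on its own. In general, the best results come from using np.nan for missing values directly, or from using a value that never occurs in the dataset and passing it to the model through the `missing` parameter.

Using a value that does not occur in the dataset without setting `missing`, or using 0 directly, performs somewhat worse. The gap is small, though, so both are worth trying on a specific problem.

Using a value that destroys the original meaning of the dataset degrades model performance and should be avoided.

References: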

Jason Brownlee, XGBoost With Python

What are the ways of treatng missing values in XGboost?

How to use missing parameter of XGBRegressor of scikit-learn

About sparse matrices