XGBoost natively accepts sparse feature formats: you can feed the data in directly as a sparse matrix that contains only the non-missing values.

That is, features that are not present in the sparse feature matrix are treated as 'missing'. XGBoost handles this internally, so you do not need to do anything about it.
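As a rough illustration of "absent means missing", the sketch below uses plain Python rather than XGBoost's own data loader. Each row stores only the features that were observed, as a `{feature_index: value}` mapping, and the hypothetical helper `densify` expands it so that every absent feature becomes NaN rather than 0. The helper name and the toy values are assumptions made for this example.

```python
import math

def densify(sparse_row, n_features):
    """Expand {feature_index: value} into a dense list; absent features become NaN."""
    return [sparse_row.get(j, math.nan) for j in range(n_features)]

# Toy rows: only observed features are stored.
rows = [
    {0: 38.5, 1: 66.0, 3: 3.0},  # feature 2 was not recorded
    {1: 88.0, 2: 20.0},          # features 0 and 3 were not recorded
]
dense = [densify(row, n_features=4) for row in rows]

for row in dense:
    print(row)
```

The point of the sketch is only the distinction between "not stored" and "stored as zero"; with a real sparse matrix no such expansion is needed.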

Internally, XGBoost automatically learns the best direction to take when a value is missing. Equivalently, this can be viewed as automatically "learning" the best imputation value for missing values, based on the reduction in training loss.
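The idea of a learned default direction can be sketched in a few lines. This is a simplified illustration, not XGBoost's actual implementation: for one candidate threshold it sums the gradient statistics of the rows going left, going right, and missing, then evaluates the split gain twice (missing rows sent left, missing rows sent right) and keeps the better direction. The gain uses the regularized form from the XGBoost paper without the complexity penalty; the function names, the toy data, and the choice of squared-error gradients at a constant prediction of 0.5 are assumptions for the example.

```python
import math

def split_gain(g_left, h_left, g_right, h_right, lam=1.0):
    """Gain of a split from summed gradients (g) and Hessians (h)."""
    def score(g, h):
        return g * g / (h + lam)
    parent = score(g_left + g_right, h_left + h_right)
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right) - parent)

def best_default_direction(values, grads, hess, threshold, lam=1.0):
    """Evaluate the split with missing rows sent left, then right; keep the better one."""
    g = {"left": 0.0, "right": 0.0, "missing": 0.0}
    h = {"left": 0.0, "right": 0.0, "missing": 0.0}
    for v, gi, hi in zip(values, grads, hess):
        if math.isnan(v):
            side = "missing"
        elif v < threshold:
            side = "left"
        else:
            side = "right"
        g[side] += gi
        h[side] += hi
    gain_if_left = split_gain(g["left"] + g["missing"], h["left"] + h["missing"],
                              g["right"], h["right"], lam)
    gain_if_right = split_gain(g["left"], h["left"],
                               g["right"] + g["missing"], h["right"] + h["missing"], lam)
    if gain_if_left >= gain_if_right:
        return "left", gain_if_left
    return "right", gain_if_right

# Toy data: the two rows with a missing value have the same label as the low-valued rows.
values = [1.0, 2.0, math.nan, 8.0, 9.0, math.nan]
labels = [0, 0, 0, 1, 1, 0]
grads = [0.5 - y for y in labels]  # squared-error gradient at a constant prediction of 0.5
hess = [1.0] * len(labels)

direction, gain = best_default_direction(values, grads, hess, threshold=5.0)
print(direction, round(gain, 4))
```

In this toy case the missing rows resemble the left-hand group, so sending them left yields the larger gain. That is the sense in which the direction acts like a learned imputation.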

Normally, it is fine to treat both missing values and zeros as zero :)






2	1	530101	38.5	66	28	3	3	?	2	5	4	4	?	?	?	3	5	45	8.4	?	?	2	2	11300	0	0	2
1	1	534817	39.2	88	20	?	?	4	1	3	4	2	?	?	?	4	2	50	85	2	2	3	2	2208	0	0	2
2	1	530334	38.3	40	24	1	1	3	1	3	3	1	?	?	?	1	1	33	6.7	?	?	1	2	0	0	0	1
1	9	5290409	39.1	164	84	4	1	6	2	2	4	4	1	2	5	3	?	48	7.2	3	5.3	2	1	2208	0	0	1
2	1	530255	37.3	104	35	?	?	6	2	?	?	?	?	?	?	?	?	74	7.4	?	?	2	2	4300	0	0	2
2	1	528355	?	?	?	2	1	3	1	2	3	2	2	1	?	3	3	?	?	?	?	1	2	0	0	0	2
1	1	526802	37.9	48	16	1	1	1	1	3	3	3	1	1	?	3	5	37	7	?	?	1	1	3124	0	0	2
1	1	529607	?	60	?	3	?	?	1	?	4	2	2	1	?	3	4	44	8.3	?	?	2	1	2208	0	0	2
2	1	530051	?	80	36	3	4	3	1	4	4	4	2	1	?	3	5	38	6.2	?	?	3	1	3205	0	0	2
2	9	5299629	38.3	90	?	1	?	1	1	5	3	1	2	1	?	3	?	40	6.2	1	2.2	1	2	0	0	0	1
1	1	528548	38.1	66	12	3	3	5	1	3	3	1	2	1	3	2	5	44	6	2	3.6	1	1	2124	0	0	1
2	1	527927	39.1	72	52	2	?	2	1	2	1	2	1	1	?	4	4	50	7.8	?	?	1	1	2111	0	0	2
1	1	528031	37.2	42	12	2	1	1	1	3	3	3	3	1	?	4	5	?	7	?	?	1	2	4124	0	0	2
2	9	5291329	38	92	28	1	1	2	1	1	3	2	3	?	7.2	1	1	37	6.1	1	?	2	2	0	0	0	1
1	1	534917	38.2	76	28	3	1	1	1	3	4	1	2	2	?	4	4	46	81	1	2	1	1	2112	0	0	2
1	1	530233	37.6	96	48	3	1	4	1	5	3	3	2	3	4.5	4	?	45	6.8	?	?	2	1	3207	0	0	2
1	9	5301219	?	128	36	3	3	4	2	4	4	3	3	?	?	4	5	53	7.8	3	4.7	2	2	1400	0	0	1




# binary classification, missing data
import numpy as np
from pandas import read_csv
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
# load data
# sep=r"\s+" splits on any whitespace (delim_whitespace is deprecated in recent pandas)
dataframe = read_csv("horse-colic.csv", sep=r"\s+", header=None)
dataset = dataframe.values
# split data into X and y
X = dataset[:,0:27]
Y = dataset[:,27]
# set missing values to 0
X[X == '?'] = 0
# convert to numeric
X = X.astype('float32')
# encode Y class values as integers
label_encoder = LabelEncoder()
label_encoder = label_encoder.fit(Y)
label_encoded_y = label_encoder.transform(Y)
# split data into train and test sets
seed = 7
test_size = 0.33
X_train, X_test, y_train, y_test = train_test_split(X, label_encoded_y, test_size=test_size, random_state=seed)
# fit model on training data
model = XGBClassifier()
model.fit(X_train, y_train)
# make predictions for test data
predictions = model.predict(X_test)
# evaluate predictions
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))

In the code, the statement `X[X == '?'] = 0` determines how missing values are replaced.


Missing values replaced with    Prediction accuracy
np.nan                          85.86%
0                               83.84%
-1                              83.84%
1                               79.80%
Column means                    79.80%
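The post does not show how each row of the table was produced, so the following is only one plausible way to prepare the inputs, written with the standard library and toy values. The helpers `to_float_rows`, `fill_constant` and `fill_column_mean` are hypothetical names for this sketch: the first turns the `?` marker into NaN, the other two replace NaN with a constant or with the mean of the observed values in each column.

```python
import math
from statistics import fmean

def to_float_rows(raw_rows, marker="?"):
    """Parse string cells to floats, turning the missing marker into NaN."""
    return [[math.nan if cell == marker else float(cell) for cell in row]
            for row in raw_rows]

def fill_constant(rows, value):
    """Replace every NaN with a fixed value such as 0, -1 or 1."""
    return [[value if math.isnan(x) else x for x in row] for row in rows]

def fill_column_mean(rows):
    """Replace every NaN with the mean of the observed values in its column."""
    means = []
    for column in zip(*rows):
        observed = [x for x in column if not math.isnan(x)]
        means.append(fmean(observed) if observed else math.nan)
    return [[means[j] if math.isnan(x) else x for j, x in enumerate(row)]
            for row in rows]

# Toy input in the same spirit as the data sample, with "?" marking a missing cell.
raw = [["38.5", "66", "?"],
       ["39.2", "?", "4"],
       ["?", "40", "3"]]

nan_rows = to_float_rows(raw)          # corresponds to the np.nan row of the table
zero_filled = fill_constant(nan_rows, 0.0)
mean_filled = fill_column_mean(nan_rows)
print(mean_filled)
```

Keeping the NaN version as the starting point makes it easy to compare strategies, because every other variant is derived from it.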



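The replacement value that gives the best predictions is `np.nan`, presumably because, as Tianqi Chen points out, XGBoost can handle missing values automatically. The XGBoost model has a `missing` parameter whose default value is `np.nan`. Setting this parameter and letting the model handle the missing values appears to work better than replacing them ourselves.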


X[X == '?'] = -1
model = XGBClassifier(missing=-1)



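XGBoost itself is able to handle missing values. In general, the best results come from either using `np.nan` directly as the missing value, or using a value that does not occur anywhere in the dataset and passing it to the model through the `missing` parameter.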
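The condition that the sentinel "does not occur in the dataset" is easy to get wrong, since a value such as 0 may also be a legitimate observation. A minimal sketch of a check, using only the standard library; the helpers `sentinel_is_unused` and `encode_missing` are hypothetical names introduced for this example, and rows are assumed to hold floats with NaN marking missing cells.

```python
import math

def sentinel_is_unused(rows, sentinel):
    """Return True if no observed (non-NaN) cell already equals the sentinel."""
    return all(x != sentinel
               for row in rows
               for x in row
               if not math.isnan(x))

def encode_missing(rows, sentinel):
    """Replace NaN with the sentinel, refusing sentinels that collide with real data."""
    if not sentinel_is_unused(rows, sentinel):
        raise ValueError(f"sentinel {sentinel!r} also appears as a real value")
    return [[sentinel if math.isnan(x) else x for x in row] for row in rows]

# Toy rows: 0.0 is a real observation here, so it would be a poor sentinel.
rows = [[1.0, math.nan],
        [0.0, 2.0]]

print(sentinel_is_unused(rows, 0.0))   # 0.0 collides with a real value
print(sentinel_is_unused(rows, -1.0))  # -1.0 does not occur in the data
encoded = encode_missing(rows, -1.0)
print(encoded)
```

The encoded data would then be paired with `XGBClassifier(missing=-1)`, as in the snippet above, so that the model treats exactly those cells as missing.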




Jason Brownlee, XGBoost With Python

What are the ways of treatng missing values in XGboost?

How to use missing parameter of XGBRegressor of scikit-learn
