2020-09-26
XGBoost is an implementation of Gradient Boosting proposed by Tianqi Chen and others in 2015. The X in the name stands for Extreme: XGBoost aims to push the use of resources and computing power for Gradient Boosting to the limit.
The name xgboost, though, actually refers to the engineering goal to push the limit of computations resources for boosted tree algorithms. Which is the reason why many people use xgboost.
—— Tianqi Chen
Although XGBoost's greatest strength is its optimization of computation speed and resource usage, it also improves on the Gradient Boosting algorithm itself, mainly in how the model is optimized:
Traditional Gradient Boosting uses only first-order derivative information when performing gradient descent. XGBoost instead applies a second-order Taylor expansion to the objective function, bringing in second-order derivative information and deriving a new way to compute the gain of a decision-tree split. The new method optimizes the objective function more effectively and precisely.
The detailed derivation can be found in this article.
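The second-order machinery ultimately yields a closed-form gain for each candidate split. The sketch below illustrates it in plain numpy (the function name split_gain and the toy data are my own; lam and gamma stand for the λ and γ regularization parameters from the XGBoost paper, and g_i, h_i are the first and second derivatives of the loss for sample i):

```python
import numpy as np

def split_gain(g, h, left_mask, lam=1.0, gamma=0.0):
    # Sum of gradients/hessians falling into the left and right children.
    GL, HL = g[left_mask].sum(), h[left_mask].sum()
    GR, HR = g[~left_mask].sum(), h[~left_mask].sum()
    # XGBoost's split gain: improvement of the regularized second-order
    # objective when one leaf is split into two, minus the complexity cost.
    return 0.5 * (GL**2 / (HL + lam)
                  + GR**2 / (HR + lam)
                  - (GL + GR)**2 / (HL + HR + lam)) - gamma

# For squared-error loss, g_i = pred_i - y_i and h_i = 1.
y = np.array([1.0, 1.0, 0.0, 0.0])
pred = np.zeros(4)              # current model predicts 0 for every sample
g, h = pred - y, np.ones(4)

# A split that separates the two classes should score higher
# than one that mixes them.
good = split_gain(g, h, np.array([True, True, False, False]))
bad = split_gain(g, h, np.array([True, False, True, False]))
print(good, bad)
```

With these toy numbers the class-separating split yields a positive gain while the mixed split yields a negative one, so the tree builder would choose the former.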
Below is an example program that trains a simple model with XGBoost in a Python environment.
The input is a CSV file in which each row is one record: the first 8 columns hold the record's features, and the last column takes one of two values, 0 or 1, indicating which of the two classes the sample belongs to.
from numpy import loadtxt
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Dataset: https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=',')
print(dataset)

X = dataset[:, 0:8]  # first 8 columns are the features
Y = dataset[:, 8]    # last column is the 0/1 class label

# Hold out a third of the data for evaluation
seed = 7
test_size = 0.33
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=test_size, random_state=seed)

model = XGBClassifier()
model.fit(X_train, y_train)

predictions = model.predict(X_test)
print(predictions)

accuracy = accuracy_score(y_test, predictions)
print(accuracy * 100.0)
The output is as follows:
[[ 6. 148. 72. ... 0.627 50. 1. ]
 [ 1. 85. 66. ... 0.351 31. 0. ]
 [ 8. 183. 64. ... 0.672 32. 1. ]
 ...
 [ 5. 121. 72. ... 0.245 30. 0. ]
 [ 1. 126. 60. ... 0.349 47. 1. ]
 [ 1. 93. 70. ... 0.315 23. 0. ]]
[0. 1. 1. 0. 1. 1. 0. 0. 1. 0. 1. 0. 1. 1. 0. 0. 0. 1. 0. 0. 0. 0. 1. 1.
 0. 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 1. 1. 0. 1. 0. 1. 1. 1. 0. 0. 0. 1. 0.
 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 0. 1. 1. 1. 1. 1.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 1. 0. 1. 0. 1. 0. 0. 1.
 1. 0. 1. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 1. 0. 1. 0. 0. 1. 1. 0. 0. 0. 1.
 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0.
 0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 1. 0. 1. 0. 0. 0. 0. 0. 1. 0. 1. 0.
 0. 1. 0. 1. 0. 0. 1. 0. 1. 0. 1. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 0.
 1. 1. 0. 0. 1. 0. 0. 1. 0. 1. 1. 1. 0. 0. 1. 0. 0. 0. 1. 1. 0. 1. 0. 0.
 0. 1. 0. 0. 0. 0. 1. 0. 1. 1. 1. 1. 1. 0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0.
 1. 1. 0. 0. 1. 0. 1. 0. 0. 0. 0. 1. 1. 1.]
77.95275590551181
As we can see, the model's prediction accuracy in this example is 77.9%.