Using UCI Chess Data to Train Decision Tree and Random Forest Classifiers

On May 11, 1997, Deep Blue shocked the world by becoming the first computer to beat a reigning world champion in a six-game match under standard time controls. The recipe behind Deep Blue is no longer a secret, and it can be summarized by the following formula:

DEEP BLUE := Minimax Algorithm + Alpha-Beta Pruning + Progressive Deepening + Parallel Computing + Opening Book + Special-Purpose End-Game Logic + Uneven Tree Development
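The search components in that formula can be illustrated with a short sketch of minimax with alpha-beta pruning. The toy game tree and its leaf values below are my own illustration, not anything from Deep Blue itself:

```python
# Minimal minimax with alpha-beta pruning on a hand-built game tree.
# A node is either a number (leaf evaluation) or a list of child nodes.

def alphabeta(node, maximizing, alpha=float("-inf"), beta=float("inf")):
    """Return the minimax value of `node`, skipping branches that
    cannot change the final decision (the alpha-beta cutoffs)."""
    if isinstance(node, (int, float)):      # leaf: static evaluation
        return node
    if maximizing:
        value = float("-inf")
        for child in node:
            value = max(value, alphabeta(child, False, alpha, beta))
            alpha = max(alpha, value)
            if alpha >= beta:               # beta cutoff
                break
        return value
    else:
        value = float("inf")
        for child in node:
            value = min(value, alphabeta(child, True, alpha, beta))
            beta = min(beta, value)
            if alpha >= beta:               # alpha cutoff
                break
        return value

# Classic textbook tree: the maximizing root chooses among three
# minimizing subtrees; the best achievable value is 3.
tree = [[3, 12, 8], [2, 4, 6], [14, 5, 2]]
print(alphabeta(tree, True))                # -> 3
```

Pruning is what makes the difference in practice: once the second subtree yields a value of 2 under a guaranteed 3, its remaining children are never evaluated.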


Although Deep Blue's computing power was boosted by parallel hardware, it still relied on pre-computed data to avoid redoing, during the game, work that could be finished before it began. Deep Blue encoded that knowledge by hand; in my analysis, I instead used machine learning to train a classifier that decides whether the King-Rook side or the King-Pawn side wins in a given game state. The data comes from the UCI Machine Learning Repository: Chess (King-Rook vs. King-Pawn) Data Set. It contains 3196 instances with 36 attributes each, and every instance is labeled 'win' or 'nowin'.
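The 36 attributes in this dataset are categorical (mostly boolean flags), so they need a numeric encoding before most learners can use them. A minimal per-column encoding sketch, using two made-up rows rather than real instances from the dataset:

```python
# Build one value -> integer codebook per column, learned from the data.
def encode_columns(rows):
    """Replace each categorical value with a per-column integer code."""
    codebooks = [{} for _ in rows[0]]
    encoded = []
    for row in rows:
        out = []
        for j, v in enumerate(row):
            # setdefault assigns the next free code on first sight of a value
            out.append(codebooks[j].setdefault(v, len(codebooks[j])))
        encoded.append(out)
    return encoded, codebooks

# Illustrative rows only (attribute flags plus a class label), not
# actual instances from the UCI file.
sample = [['f', 't', 'n', 'win'],
          ['t', 'f', 'w', 'nowin']]
encoded, books = encode_columns(sample)
print(encoded)   # -> [[0, 0, 0, 0], [1, 1, 1, 1]]
```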


I built the decision tree in Python using the ID3 (Iterative Dichotomiser 3) algorithm; the implementation can be found on my GitHub. Using 300 instances as a test set and the rest as the training set, my implementation achieved an accuracy of 97.67%. With WEKA and the same train-test split, the accuracy was 99.3333%. WEKA's accuracy is better because it applies early stopping and pruning to counter overfitting. In an overfitting situation, a statistical model describes random error or noise instead of the real patterns we are trying to extract from the dataset. Chasing high accuracy on the training instances, and taking too many features (both crucial and trivial ones) into consideration, can both cause overfitting. Pruning and early stopping counter overfitting to some extent, but random forests offer a better solution.
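ID3 grows the tree by repeatedly splitting on the attribute with the highest information gain. A minimal sketch of that selection criterion on a toy dataset (this is an illustration, not my full implementation):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label list, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Entropy reduction from splitting `rows` on column `attr` --
    the quantity ID3 maximizes at each node."""
    n = len(labels)
    split = {}
    for row, label in zip(rows, labels):
        split.setdefault(row[attr], []).append(label)
    remainder = sum(len(sub) / n * entropy(sub) for sub in split.values())
    return entropy(labels) - remainder

# Toy example: attribute 0 separates the classes perfectly,
# attribute 1 carries no information, so ID3 would split on 0.
rows   = [['t', 'a'], ['t', 'b'], ['f', 'a'], ['f', 'b']]
labels = ['win', 'win', 'nowin', 'nowin']
print(information_gain(rows, labels, 0))   # -> 1.0
print(information_gain(rows, labels, 1))   # -> 0.0
```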

A random forest creates multiple decision trees by bootstrap aggregating (bagging). With WEKA, I analyzed how the number of features and the number of training instances influence accuracy. Using WEKA's random forest on the same split (2896 training instances, the last 300 as the test set), the accuracy is 98.6667%, which is lower than the single decision tree. This is surprising, since a random forest is usually an improvement over a single decision tree. It may be caused by aggregating with a plain majority vote rather than weighing the trees' class probabilities: if the randomness introduced into the model is not handled properly, accuracy can be compromised.
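Bootstrap aggregating can be sketched with the simplest possible base learner, a one-feature decision stump. The toy data, stump learner, and ensemble size below are all illustrative:

```python
import random
from collections import Counter

def train_stump(xs, ys):
    """Pick the threshold on a 1-D feature that minimizes training error
    for the rule `predict 1 iff x >= threshold`."""
    best = None
    for t in sorted(set(xs)):
        preds = [1 if x >= t else 0 for x in xs]
        err = sum(p != y for p, y in zip(preds, ys))
        if best is None or err < best[1]:
            best = (t, err)
    return best[0]

def bagged_predict(x, stumps):
    """Majority vote over the ensemble -- the aggregation step of bagging."""
    votes = [1 if x >= t else 0 for t in stumps]
    return Counter(votes).most_common(1)[0][0]

random.seed(0)
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 0, 1, 1, 1, 1]          # true boundary between 4 and 5

# Bootstrap aggregating: each stump is trained on a resample of the
# training data drawn with replacement.
stumps = []
for _ in range(25):
    sample = [random.randrange(len(xs)) for _ in xs]
    stumps.append(train_stump([xs[i] for i in sample],
                              [ys[i] for i in sample]))

print(bagged_predict(2, stumps), bagged_predict(7, stumps))  # -> 0 1
```

Individual stumps vary with their bootstrap samples, but the majority vote recovers the true boundary; a real random forest does the same with full decision trees and random feature subsets.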

Now we adjust the size of the training set to 1000, 1500, 2000, 2500, and 2896 instances. (As noted above, WEKA's edge over my own implementation likely comes from pruning the decision tree, stopping early when a node contains fewer instances than a threshold, and internal parameter tuning.)

Number of instances    Prediction accuracy
1000                   60.33%
1500                   98.00%
2000                   99.33%
2500                   99.00%
2896                   98.67%

Accuracy-Instance Chart

The chart indicates that prediction accuracy tends to increase as more instances are used for training: with more instances, the sampling bias of the training set is compensated more effectively by the randomness in the model. The accuracy does not rise monotonically, though. As the data shows, with only 2000 instances the accuracy is 99.33%, higher than the accuracy obtained with all 2896 instances. Using more instances produces a model that fits the existing instances better, but overfitting can compromise its accuracy on unseen instances.

Now we adjust the number of features to 5, 10, 15, 20, 25, and 30.

Number of features    Prediction accuracy
Accuracy-Feature Chart

Prediction accuracy mostly rises as more features are involved in the random forest. It does not always go up, because some features are more significant than others; the number of features is clearly not the only measure of a dataset's value. Features vary in how useful they are as splitting points for the tree model, and the randomness introduced into the random forest is not the best way to deal with that. That is why we use dimensionality-reduction and feature-selection methods such as PCA and SVD, which will be covered in my future posts.
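As a small preview of those future posts, here is a minimal PCA-via-SVD sketch on made-up 2-D data: the first singular direction of the centered data matrix captures almost all of the variance, so one derived feature can stand in for two raw ones.

```python
import numpy as np

# Illustrative points that mostly vary along the diagonal.
X = np.array([[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.8]])
Xc = X - X.mean(axis=0)                 # center the data first

# SVD of the centered matrix: rows of Vt are the principal directions.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Fraction of total variance captured by the first component.
explained = S[0] ** 2 / np.sum(S ** 2)
print(round(float(explained), 3))       # -> 0.995

# Projecting onto the first direction gives a 1-D replacement feature.
projected = Xc @ Vt[0]
```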






Arthur Zhang

A master's student at WUSTL, Department of Electrical and Systems Engineering.
