Until now, my machine learning tutorials have been boring tales and theory. In this post, I'll share my first hands-on coding experience in an AI competition: I tried my basic machine learning skills on Numerai.
Github source code: https://github.com/andela-ysanni/numer.ai
According to the competition’s website, Numerai is a global artificial intelligence tournament to predict the stock market. Numerai is a bit like Kaggle, but with a clean and tidy dataset: you download the data, build a model, and upload your predictions. It’s rather hard to find a contest where you can just apply whatever methods you fancy without much data cleaning and feature engineering. In this tournament, you can do exactly that.
I had recently started Udacity's Intro to Machine Learning course and had some basic knowledge of supervised machine learning algorithms using scikit-learn. I was scared at first: what does a newbie like me know about competing online where there’s a leaderboard? What were the chances of me not ending up at the bottom of it? Anyway, I took the bull by the horns.
Packages Used in this project
Pandas is a Python package for data structures and data analysis, and NumPy is a package for working with large multi-dimensional arrays and matrices. You can install both with the pip install command. We will also import some modules from the sklearn library, which provides simple and efficient tools for data mining and data analysis, including supervised and unsupervised algorithms.
Overview of the Data
For this tournament, we have two datasets: the training data and the tournament (test) data. I’d recommend opening the datasets in Numbers (it comes with macOS by default) to see what they look like; otherwise, thank goodness for Microsoft Excel. You can also open them in a text editor like Sublime or Atom.
I used the pandas function “read_csv” to parse the data into a DataFrame object. read_csv takes the file path and some other optional parameters.
import pandas as pd

training_data = pd.read_csv('numerai_training_data.csv')
tournament_data = pd.read_csv('numerai_tournament_data.csv')
The training dataset has 22 columns. 21 of them are our features, ranging from feature 1 to feature 21, while the last column is the target: a 1 or 0 value that will be used to train our classifier. We have about 96321 rows.
The tournament dataset is our test set, and it also has 22 columns. Column 1 is t_id, the target id, and the remaining 21 columns are our feature values.
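If you'd rather not open the file in a spreadsheet, pandas can report the same layout directly. The sketch below runs on a tiny made-up DataFrame rather than the real CSV, and the feature column names (feature1 to feature21) are my assumption about the header; only the target column name is confirmed by the code later in this post.

```python
import numpy as np
import pandas as pd

# Stand-in for numerai_training_data.csv: 21 feature columns plus a 0/1 target.
# The feature column names are assumed for illustration.
rng = np.random.default_rng(0)
feature_names = [f"feature{i}" for i in range(1, 22)]
toy_training_data = pd.DataFrame(rng.random((10, 21)), columns=feature_names)
toy_training_data["target"] = rng.integers(0, 2, size=10)

print(toy_training_data.shape)                      # (rows, columns)
print(list(toy_training_data.columns[:3]))          # first few column names
print(toy_training_data["target"].value_counts())   # how many 0s and 1s
print(toy_training_data.head())                     # first five rows
```

The same four calls work unchanged on the DataFrame returned by read_csv.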
Cross-validation is primarily a way of measuring the predictive performance of a statistical model on an independent data set (retrieved from http://robjhyndman.com). One way to measure the predictive ability of a model is to test it on a set of data that was not used for training. Data miners call this a “test set”, and the data used for estimation is the “training set”.
The major purpose of validation is to avoid overfitting. Overfitting occurs when a machine learning algorithm, such as a classifier, identifies not only the signal in a dataset, but the noise as well. Fitting the noise means the model is too sensitive to quirks of the dataset that don’t really mean anything. The practical outcome of overfitting is that a classifier which appears to perform well on its training data may perform poorly, possibly very badly, on new data from the same problem.
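Here is a small sketch of what overfitting looks like in practice. It uses made-up random data and a decision tree, which is not the classifier used in this post; I picked it only because an unrestricted tree memorises its training data very readily. The labels are pure noise, so there is nothing real to learn.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Pure noise: the labels have no relationship to the features.
rng = np.random.default_rng(0)
X = rng.random((1000, 21))
y = rng.integers(0, 2, size=1000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# An unrestricted decision tree can memorise every training row.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_accuracy = tree.score(X_train, y_train)
test_accuracy = tree.score(X_test, y_test)
print("train accuracy:", train_accuracy)
print("test accuracy:", test_accuracy)
```

The tree scores perfectly on the rows it has seen and roughly coin-flip on the rows it has not, which is exactly the gap a held-out test set is meant to expose.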
To develop our classifier, we split our dataset in two, using 70 percent of the data to train the algorithm. We then run the classifier on the remaining 30 percent, so far unseen, and record the results. Strictly speaking this is a single holdout split rather than full cross-validation, but it serves the same purpose. Here is our version:
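# train_test_split lives in sklearn.model_selection; older scikit-learn
# releases exposed it through the since-removed sklearn.cross_validation module.
from sklearn.model_selection import train_test_split

# First 21 columns are the features, 'target' is the label.
features_train, features_test, labels_train, labels_test = train_test_split(
    training_data.iloc[:, 0:21], training_data['target'], test_size=0.3, random_state=0)
I’ve used the train_test_split function from sklearn to set aside 30 percent of the training data as a test set. Let me explain the parameters: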
- train_test_split takes our features (the training data without the target column), followed by the array of target values.
- test_size is the fraction of the data to hold out, 0.3 here for our 30 percent.
- random_state is an integer seed for the pseudo-random number generator used for the random sampling, so the same value gives the same split every time.
train_test_split returns four arrays: features_train and labels_train, which hold 70 percent of the data, and features_test and labels_test, which hold the remaining 30 percent.
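To see the 70/30 ratio and the effect of random_state for yourself, here is a minimal sketch on a made-up array of 100 rows (not the tournament data), using the train_test_split import from sklearn.model_selection.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in: 100 rows, 21 features, and a 0/1 label per row.
rng = np.random.default_rng(0)
X = rng.random((100, 21))
y = rng.integers(0, 2, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
print(X_train.shape, X_test.shape)  # 70 rows for training, 30 for testing
print(y_train.shape, y_test.shape)

# Same random_state, same split: the rows land in the same place again.
X_train_again, _, _, _ = train_test_split(X, y, test_size=0.3, random_state=0)
same_split = np.array_equal(X_train, X_train_again)
print(same_split)
```

Fixing random_state is what makes a score comparable between runs; without it, every run would be graded on a different 30 percent.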
Implementation and fitting of the Classifier
In this project, the task is binary classification: the output variable, also known as our target, is expected to be 1 or 0. I will be using SVC (Support Vector Classification) as the classifier. A Support Vector Machine (SVM) is a discriminative classifier formally defined by a separating hyperplane. In other words, given labeled training data (supervised learning), the algorithm outputs an optimal hyperplane that categorizes new examples.
The advantages of support vector machines are:
- Effective in high dimensional spaces.
- Still effective in cases where the number of dimensions is greater than the number of samples.
- Uses a subset of training points in the decision function (called support vectors), so it is also memory efficient.
- Versatile: different Kernel functions can be specified for the decision function. Common kernels are provided, but it is also possible to specify custom kernels.
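To make the hyperplane and the support vectors a little more concrete, here is a small sketch on made-up 2D data (two blobs of points, nothing to do with the Numerai data). I've assumed a linear kernel here purely so the hyperplane can be read off as a weight vector w and an offset b; with other kernels the boundary is not a straight line in the original feature space.

```python
import numpy as np
from sklearn.svm import SVC

# Two well-separated blobs in 2D, 50 points each.
rng = np.random.default_rng(0)
class_0 = rng.normal(loc=[-2.0, -2.0], scale=0.5, size=(50, 2))
class_1 = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(50, 2))
X = np.vstack([class_0, class_1])
y = np.array([0] * 50 + [1] * 50)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# For a linear kernel the hyperplane is w . x + b = 0.
w = clf.coef_[0]
b = clf.intercept_[0]
manual_scores = X @ w + b

# Only the support vectors take part in the decision function.
n_support_vectors = clf.support_vectors_.shape[0]
print("w =", w, "b =", b)
print("support vectors:", n_support_vectors, "out of", len(X))
```

Points with a positive score fall on the class 1 side of the hyperplane and points with a negative score on the class 0 side, and only a handful of the 100 training points end up as support vectors.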
from sklearn.svm import SVC

clf = SVC(C=1.0).fit(features_train, labels_train)
C here controls the cost of misclassification on the training data. A large C gives you low bias and high variance: low bias because you penalize misclassification heavily. A small C gives you higher bias and lower variance.
The .fit() method fits the SVM model to the given training data, which is features_train together with labels_train.
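The effect of C is easier to feel on a problem small enough to train in a blink. The sketch below is an illustration on synthetic data from sklearn's make_classification (an assumption of mine, not the tournament data), with 20 percent of the labels deliberately flipped so there is noise to overfit.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Noisy toy problem: flip_y randomly flips 20 percent of the labels.
X, y = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0,
    flip_y=0.2, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Map each C to (training accuracy, test accuracy).
results = {}
for C in (0.01, 1.0, 1000.0):
    clf = SVC(C=C).fit(X_train, y_train)
    results[C] = (clf.score(X_train, y_train), clf.score(X_test, y_test))
    print(f"C={C}: train={results[C][0]:.3f} test={results[C][1]:.3f}")
```

Watch the gap between the two numbers as C grows: a large C pushes the training score up, but the test score does not necessarily follow, which is the high-variance side of the trade-off.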
Next, we use the classifier we trained to make predictions on the held-out 30 percent of the dataset.
predictions = clf.predict(features_test)
The predict() method takes an array of samples and returns the predicted class for each one.
Accuracy is the fraction of the model's predictions that match the true labels. I will be measuring it with sklearn's accuracy_score function, which returns that fraction for the given true labels and predictions.
from sklearn.metrics import accuracy_score

# accuracy_score expects the true labels first, then the predictions.
accuracy = accuracy_score(labels_test, predictions)
accuracy_score() takes two arrays: the true target values from the test data, followed by the predictions we made earlier.
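Under the hood, accuracy is just counting matches. Here is a minimal sketch with eight made-up labels and predictions (not results from the tournament data), computed by hand and then checked against accuracy_score.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels and predictions, just to show the arithmetic.
true_labels = np.array([1, 0, 1, 1, 0, 0, 1, 0])
predicted = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Accuracy = number of matching positions / total number of predictions.
manual_accuracy = np.mean(predicted == true_labels)
sklearn_accuracy = accuracy_score(true_labels, predicted)

print(manual_accuracy)   # 6 of the 8 positions match
print(sklearn_accuracy)
```

Six of the eight predictions are right, so both lines print 0.75.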
The accuracy score here is 0.514361849391. That's pretty low, so I decided to raise C to 100.0 for better classification and higher variance. This took almost forever (15 minutes) for the classifier to run, but gave a score of 0.518133997785, which is just a bit higher than the previous score. It’s still low, but I’m glad to know it’s a bit above 0.5. Fair enough for a newbie.
The classifier I used is pretty slow: it takes about 10 minutes when C=1.0 and 15 minutes when C=100.0 for a slightly better score. This will not scale if our dataset triples in size. Why? The implementation is based on libsvm, so the fit time complexity is more than quadratic in the number of samples, which makes it hard to scale to datasets with more than a few tens of thousands of samples.
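If you want to watch the fit time grow on your own machine, a rough sketch like the one below will do. It uses random made-up data with 21 columns and much smaller row counts than the tournament set, so the absolute timings say nothing about the 10 and 15 minutes above; the interesting part is how the time changes as the number of rows doubles.

```python
import time

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Map each row count to the number of seconds one fit took.
timings = {}
for n_samples in (500, 1000, 2000):
    # Random features and labels, 21 columns like the tournament data.
    X = rng.random((n_samples, 21))
    y = rng.integers(0, 2, size=n_samples)

    start = time.perf_counter()
    SVC(C=1.0).fit(X, y)
    timings[n_samples] = time.perf_counter() - start
    print(f"{n_samples} rows: {timings[n_samples]:.3f} s")
```

Timings on small inputs are noisy, so run it a few times before reading too much into any single number.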
Improvement and Conclusion
In my next post, I will talk about how I fine-tuned the parameters of my classifier and how I switched to another algorithm to improve my results. I hope you've enjoyed this tutorial so far. Feel free to drop comments and questions in the comment section. Ciao!