How I predicted the stock market in the Numerai ML tournament: Final tutorial


Recap

This is a follow-up to my last post on the Numerai challenge; in case you missed it, you can always refer back to it here. We concluded that we needed a better classifier, one that would train on our dataset faster and also raise our accuracy score above the initial value we got with SVC().

Github source code: https://github.com/andela-ysanni/numer.ai

Cross Validation

I will be skipping cross validation and the overview of the data, since we already covered those in the last post. This is how our cross validation code looks, in case you need a refresher.

features_train, features_test, labels_train, labels_test = cross_validation.train_test_split(training_data.iloc[:,0:21], training_data['target'], test_size=0.3, random_state=0)

Using RandomForestClassifier

So the better classifier I chose to use over SVC() is RandomForestClassifier. What is a random forest? A random forest (or random decision forest) is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. Each decision tree is constructed from a random subset of the training data; the forest builds many such trees and combines them to get a more accurate and stable prediction. You can read more in the scikit-learn documentation.

[Figure: example of how RandomForestClassifier works, combining many decision trees into one prediction]
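To make the averaging idea concrete, here is a minimal sketch, using hypothetical settings on the train/test split from the last post, comparing a single decision tree to a forest of 25 trees:

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# one decision tree, trained on the full training set
tree = DecisionTreeClassifier(min_samples_leaf=150, random_state=0)
tree.fit(features_train, labels_train)
print(tree.score(features_test, labels_test))    # accuracy of a single tree

# a forest of 25 trees, each fit on a random sub-sample, predictions averaged
forest = RandomForestClassifier(n_estimators=25, min_samples_leaf=150, random_state=0)
forest.fit(features_train, labels_train)
print(forest.score(features_test, labels_test))  # accuracy of the averaged ensemble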

Hyperparameters

In machine learning, the term hyperparameter is used to distinguish certain settings from standard model parameters. A machine learning model is the definition of a mathematical formula with a number of parameters that need to be learned from the data; that is the crux of machine learning: fitting a model to the data. This is done through a process known as model training. In other words, by training a model on existing data, we are able to fit the model parameters.

However, there is another kind of parameter that cannot be learned directly from the regular training process. These parameters express “higher-level” properties of the model, such as its complexity or how fast it should learn. They are called hyperparameters, and they are usually fixed before the actual training process begins.

So, how are hyperparameters decided? It is done by setting different values for those hyperparameters, training different models, and deciding which ones work best by testing them.

So, to summarise, hyperparameters:

  • Define higher-level concepts about the model such as complexity, or capacity to learn.
  • Cannot be learned directly from the data in the standard model training process and need to be predefined.
  • Can be decided by setting different values, training different models, and choosing the values that test best.

Some examples of hyperparameters:

  • Number of leaves or depth of a tree
  • Number of latent factors in a matrix factorization
  • Learning rate (in many models)
  • Number of hidden layers in a deep neural network
  • Number of clusters in k-means clustering
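To make the distinction concrete, here is a minimal sketch (on the train/test split from earlier, with hypothetical settings): the arguments passed to an estimator's constructor are hyperparameters, while everything learned during fit() is a model parameter.

from sklearn.tree import DecisionTreeClassifier

# hyperparameters: fixed before training begins
clf = DecisionTreeClassifier(max_depth=3, min_samples_leaf=150)

# parameters: learned from the data during training
clf.fit(features_train, labels_train)
print(clf.tree_.node_count)      # size of the tree that was learned
print(clf.feature_importances_)  # learned importance of each feature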

Tuning the hyper-parameters of an estimator

Hyperparameter optimization is the process of searching for good hyperparameter values; the settings you choose for the algorithm are referred to as hyperparameters, whereas the coefficients found by the machine learning algorithm itself are referred to as parameters. Phrased as a search problem, you can use different search strategies to find a good and robust parameter or set of parameters for an algorithm on a given problem. Two simple and easy search strategies are grid search and random search, which can be applied to any type of algorithm. In this post, I will be talking about using grid search with our RandomForestClassifier.

Grid search means you have a set of models (which differ from each other in their parameter values, which lie on a grid). You then train each of the models, evaluate it using cross-validation, and select the one that performed best.

To give a concrete example, if you’re using a random forest classifier, you could use different values for n_estimators, max_features, min_samples_leaf, etc. So, for example, you could have a grid with the following values for (n_estimators, min_samples_leaf): ([15, 20, 25], [150, 180, 200]). Grid search would train a RandomForestClassifier for each of the 3 × 3 = 9 combinations of (n_estimators, min_samples_leaf) values, evaluate each one using cross-validation, and select the one that did best.
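Under the hood, that is all grid search does: loop over every combination in the grid and cross-validate each candidate. A rough, hand-rolled sketch of the idea (with the hypothetical values above) might look like this:

from itertools import product
from sklearn import cross_validation
from sklearn.ensemble import RandomForestClassifier

best_score, best_params = 0.0, None
for n_est, min_leaf in product([15, 20, 25], [150, 180, 200]):
    clf = RandomForestClassifier(n_estimators=n_est, min_samples_leaf=min_leaf, random_state=0)
    # mean accuracy over 3 cross-validation folds on the training split
    score = cross_validation.cross_val_score(clf, features_train, labels_train, cv=3).mean()
    if score > best_score:
        best_score, best_params = score, (n_est, min_leaf)
print(best_params, best_score)

GridSearchCV automates exactly this loop and, by default, refits the best combination on the full training data, so we don't have to write it ourselves.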

Like old times, we import our classifier from sklearn, but this time we fit our model using GridSearchCV:

from sklearn.ensemble import RandomForestClassifier as RFC
from sklearn.grid_search import GridSearchCV as GS

# split the data 70:30 into training and test sets
features_train, features_test, labels_train, labels_test = cross_validation.train_test_split(training_data.iloc[:,0:21], training_data['target'], test_size=0.3, random_state=0)

# setting the range of values for our parameters
parameters = {
    'n_estimators': [20, 25],
    'random_state': [0],
    'max_features': [2],
    'min_samples_leaf': [150, 200, 250]
}

# implementing my classifier
model = RFC()
grid = GS(estimator=model, param_grid=parameters)
grid.fit(features_train, labels_train)
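If you are curious which combination won the search, GridSearchCV stores it along with its cross-validated score:

# inspect the winning hyperparameter combination and its cross-validated score
print(grid.best_params_)
print(grid.best_score_)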

Explanation of the parameters used above

n_estimators is the number of trees to build before averaging their predictions. The higher n_estimators is, the longer the classifier takes to run, simply because more trees have to be trained.

random_state takes an integer seed for the pseudo-random number generator used for random sampling.

max_features is the maximum number of features the random forest is allowed to consider when looking for the best split in an individual decision tree.

min_samples_leaf: if you have built a decision tree before, you can appreciate the importance of the minimum leaf size. A leaf is an end node of a decision tree, and min_samples_leaf is the minimum number of samples required in a leaf. A smaller leaf makes the model more prone to capturing noise in the training data, so you should try multiple leaf sizes to find the optimum for your use case.

The next question to ask is whether this makes our model run faster and score higher than the SVC() we tried previously. Well, we are about to find out :)

Making Predictions

Making predictions works the same way as before: the predict() method takes an array and performs classification on it.

prob_predictions_class_test = grid.predict(features_test)

Predicting Class Probabilities

The predict_proba() method takes an array and predicts class probabilities for it, computed as the mean predicted class probabilities of the trees in the forest. The class probability of a single tree is the fraction of samples of the same class in a leaf.

prob_predictions_test = grid.predict_proba(features_test)

This method returns an array with the probability of class 0 and class 1 for each row, and you’ll notice the two float values in each row sum to 1, e.g. 0.51778569 + 0.48221431 = 1.

[Screenshot: predict_proba() output, an array of class 0 and class 1 probabilities]
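As a quick sanity check (just a sketch), we can confirm that every row of the returned array sums to 1:

import numpy as np

# each row is [P(class 0), P(class 1)], so the rows should sum to 1
print(np.allclose(prob_predictions_test.sum(axis=1), 1.0))  # True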

Accuracy Score and Logloss

I explained accuracy_score() in the last post, but not log loss, a.k.a. logistic loss or cross-entropy loss. Log loss is used when we have a {0,1} response, because with a {0,1} response the best models give us values in terms of probabilities. In simple words, log loss measures the UNCERTAINTY of your model's probabilities by comparing them to the true labels. You can have a closer look here. On to the code:

from sklearn.metrics import accuracy_score, log_loss
prob_predictions_class_test = grid.predict(features_test)
prob_predictions_test = grid.predict_proba(features_test)
logloss = log_loss(labels_test, prob_predictions_test)
accuracy = accuracy_score(labels_test, prob_predictions_class_test, normalize=True, sample_weight=None)

Both accuracy_score() and log_loss() return a float value. Our accuracy score here is 0.523913344408, which is higher than the 0.514361849391 we previously got with SVC(). The log loss is 0.691353938387. Prediction took only about two minutes, against roughly ten minutes for SVC(). Comparing this log loss to the leaderboard at the time of writing, we took 151st place out of 372 members. Who knew amateurs like us could rise to that level :)
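For intuition about what that log loss number means, here is a minimal hand-rolled sketch of the binary log loss formula, using toy labels and probabilities rather than our real data; note that always predicting 0.5 gives ln 2 ≈ 0.693:

import numpy as np

def binary_log_loss(y_true, p_one, eps=1e-15):
    # average negative log-probability assigned to the true class
    p = np.clip(np.asarray(p_one, dtype=float), eps, 1 - eps)
    y = np.asarray(y_true, dtype=float)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

print(binary_log_loss([1, 0, 1], [0.9, 0.1, 0.8]))  # confident and right: small loss
print(binary_log_loss([1, 0, 1], [0.5, 0.5, 0.5]))  # always 0.5: ln(2) ~ 0.693
print(binary_log_loss([1, 0, 1], [0.1, 0.9, 0.2]))  # confident and wrong: large loss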

List Comprehension

In this particular challenge, one of the rules (https://numer.ai/rules) for submission is to make predictions based on the probability, estimated by our model, of each observation being of class 1. This means we have to extract the class probabilities for class 1.

tournament_data = pd.read_csv('../datasets/numerai_tournament_data.csv')

# predict class probabilities for the tournament set
prob_predictions_tournament = grid.predict_proba(tournament_data.iloc[:,1:22])

# extract the probability of being in a class 1
probability_class_of_one = np.array([x[1] for x in prob_predictions_tournament[:]])

The code above predicts the class probabilities for the tournament data set and, at the same time, extracts the class 1 probability for each row into an array using the np.array() method.
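As an aside, predict_proba() returns a two-dimensional NumPy array, so the same extraction can be written as a column slice, with no Python loop:

# equivalent, vectorised version of the list comprehension above
probability_class_of_one = prob_predictions_tournament[:, 1]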

Saving our probabilities and Final Submission

The format of our prediction upload should be a CSV file with two columns: t_id and probability. The t_id comes from the tournament data set, while the probability column is the probability, estimated by our model, of the observation being of class 1, which is our “probability_class_of_one”.

To write our values to a CSV file, we are going to make use of numpy's savetxt() method; check out the documentation here (http://docs.scipy.org/doc/numpy/reference/generated/numpy.savetxt.html).

# extract the t_id column from the tournament data set
t_id = tournament_data['t_id']

To save our t_id:

np.savetxt(
   '../t_id.csv',          # file name
   t_id,                   # array to save
   fmt='%d',               # integer formatting
   delimiter=',',          # column delimiter
   newline='\n',           # new line character
   header='t_id')          # file header

…and lastly our probability value:

np.savetxt(
   '../probability.csv',        # file name
   probability_class_of_one,    # array to save
   fmt='%.2f',                  # formatting, two decimal places
   delimiter=',',               # column delimiter
   newline='\n',                # new line character
   header='probability')        # file header

This writes and saves our t_id and probability values to files named "t_id.csv" and "probability.csv". I concatenated both files, and here is a peek at my final submission.

[Screenshot: final submission CSV with t_id and probability columns]
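As an alternative sketch, rather than writing two files and concatenating them by hand, pandas (already imported as pd for reading the CSVs) can write both columns to a single, hypothetical ../predictions.csv in one go:

# build a two-column frame and write a single submission file
submission = pd.DataFrame({'t_id': t_id, 'probability': probability_class_of_one})
submission.to_csv('../predictions.csv', columns=['t_id', 'probability'], index=False)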

Conclusion

Finally, we know how grid search and random forests work, and we know about hyperparameters. I could have looked for ways to move from 151st position to maybe 50th, but I decided not to continue; for me, the submission wasn't so much about scaling the leaderboard as about the learning process. Also, the data sets given for this tournament are encrypted, and it would be hard for a newbie like me to apply feature engineering to build a better model than what we have. You might be wondering what feature engineering is all about; well, fingers crossed until my next post. Ciao!!!

Did you enjoy reading this? Recommend, share, comment or ask questions.
