Kaggle Competition
Machine Learning from Disaster
(Written January 2021)
I came across Kaggle many times early on in my searches and readings. It sounded very appealing: "...find and publish data sets, explore and build models in a web-based data-science environment, work with other data scientists and machine learning engineers, and enter competitions to solve data science challenges."
I decided it was time to give it a go with their introductory competition: "use machine learning to create a model that predicts which passengers survived the Titanic shipwreck."
I would be able to see how my model's predictions ranked in comparison with other participants, but more importantly I was going to use this as an opportunity to learn from others how to improve my model.
As an afterthought, I decided to also try Google's AutoML and see how its automatically generated model would score.
Many of us have watched the Titanic movie and know that women and children were prioritized for the lifeboats and that those poor souls in the lower classes were not. The data presented in the competition included name, age, gender, socio-economic class, etc.
Using what I've been learning from the Big Data Analytics coursework at York University, I explored the data for completeness and insights, and made decisions on how to impute the missing data.
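As a rough illustration of the kind of imputation described, here is a minimal pandas sketch. The column names (Age, Fare, Embarked) come from the Titanic dataset, but the rows below are toy values and the median/mode strategy is just one common choice, not necessarily the exact one used in my submission.

```python
import pandas as pd

# Toy rows shaped like the Titanic training data (the real data is Kaggle's train.csv).
df = pd.DataFrame({
    "Age": [22.0, None, 35.0, None, 54.0],
    "Embarked": ["S", "C", None, "S", "S"],
    "Fare": [7.25, 71.28, 8.05, None, 51.86],
})

# One common strategy: median for numeric columns, mode for categoricals.
df["Age"] = df["Age"].fillna(df["Age"].median())
df["Fare"] = df["Fare"].fillna(df["Fare"].median())
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

print(df.isna().sum().sum())  # 0 — no missing values remain
```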
I utilized scikit-learn as the machine learning library, using its feature-selection utilities to select the optimal number of features, fitted with the Random Forest algorithm.
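One way to combine feature selection with a Random Forest in scikit-learn is recursive feature elimination with cross-validation (RFECV), which drops features one at a time and keeps the count that cross-validates best. The sketch below uses a synthetic dataset as a stand-in for the engineered Titanic features; the exact selector and hyperparameters in my submission may have differed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

# Synthetic stand-in for the prepared Titanic features.
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=4, random_state=0)

# RFECV recursively eliminates features, scoring each subset by cross-validation.
selector = RFECV(RandomForestClassifier(n_estimators=50, random_state=0), cv=3)
selector.fit(X, y)

print(selector.n_features_)  # optimal number of features found
print(selector.support_)     # boolean mask of the columns that were kept
```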
I believed RF would yield the highest accuracy without over-fitting. I understand that it's often used in industries such as banking / capital markets, medicine, and marketing. Part of its appeal is that its results can be explained to a client, a trade-off that can be overlooked when pursuing other models that may yield better raw results.
My submission scored so-so: enough to put me somewhere around the 50th percentile in the competition's rankings.
Now what?
I wanted to try other models and improve my score. I read what other folks had tried or recommended and decided to give it another shot with XGBoost and AdaBoost. Admittedly, I did not know about these models beforehand, but found many articles, including this one, to be quite informative:
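Of the two, AdaBoost ships with scikit-learn, so trying it takes only a few lines; the sketch below again uses synthetic data in place of the Titanic features. XGBoost's XGBClassifier exposes the same fit/predict interface, so swapping it in is a one-line change once the xgboost package is installed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the prepared Titanic features.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# AdaBoost fits a sequence of weak learners, re-weighting misclassified rows
# so later learners focus on the hard cases.
ada = AdaBoostClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(ada, X, y, cv=5)
print(scores.mean())  # cross-validated accuracy
```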
With XGBoost, I was able to increase my score enough to place in the top 32% of the competition.
It was at this point I came across another article on Kaggle about Google AutoML, a machine-learning tool that would automatically prepare the data, build and train the model, and make predictions. More on that over here
Sign me up!