Automatic Machine Learning with DataRobot

Hi, this time I want to share my experiments with DataRobot, an automated machine learning product that promises to put machine learning techniques to work with a few mouse clicks.

Let’s see it in action with a very simple dataset, the so-called “Titanic: Machine Learning from Disaster” competition on Kaggle (extract):

“The sinking of the RMS Titanic is one of the most infamous shipwrecks in history.  On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.”

The data looks like this:

Screen Shot 2018-04-24 at 9.21.21 PM

and we can see that for each passenger we have some data (Name, Sex, Age, Port of Embarkation, Ticket Class), while the Survived column indicates whether the passenger survived the sinking of the Titanic. Our objective is to discover, from all the information we have about the passengers, whether there are key influencers of survival and what they are.
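To make the layout concrete, here is a minimal sketch in plain Python of how such a file separates into input columns and the column to predict. The column names follow Kaggle's train.csv; the three passengers are made up for illustration, and the helper names (`load_rows`, `split_features_target`) are my own, not part of DataRobot.

```python
import csv
import io

# Illustrative rows only: made-up passengers in the column layout of Kaggle's train.csv.
SAMPLE_CSV = """PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Doe, Mr. John",male,30,0,0,A1,8.05,,S
2,1,1,"Roe, Mrs. Jane",female,35,1,0,B2,80.0,C10,C
3,1,2,"Poe, Miss. Ann",female,,0,0,C3,13.0,,Q
"""

TARGET = "Survived"  # the column we want to predict


def load_rows(text):
    """Parse CSV text into a list of dicts, one per passenger."""
    return list(csv.DictReader(io.StringIO(text)))


def split_features_target(rows, target=TARGET):
    """Separate the column to predict from the remaining columns."""
    features = [{k: v for k, v in row.items() if k != target} for row in rows]
    labels = [int(row[target]) for row in rows]
    return features, labels


rows = load_rows(SAMPLE_CSV)
features, labels = split_features_target(rows)
print(sorted(features[0].keys()))  # every column except the target
print(labels)                      # one 0/1 survival flag per passenger
```

Note that the third passenger has an empty Age: missing values like this are part of the raw data that a tool has to cope with when the file is used as is.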

The DataRobot interface is very simple at the start: it just asks us to upload the data we have:

Screen Shot 2018-04-24 at 9.33.03 PM

Once the upload is done, DataRobot asks us which column we want to “predict”:

Screen Shot 2018-04-24 at 9.35.49 PM

and once we select our target column we can press the big “Start” button.

Here DataRobot analyzes our file and works out which models are suitable for this data. Once that is done, it automatically starts “training” all those models in parallel, according to the processing power we select (the “workers” setting).

Screen Shot 2018-04-24 at 9.38.15 PM

Screen Shot 2018-04-24 at 9.41.30 PM

Once all the models are trained, the leaderboard shows in first place the best of them according to the metric that DataRobot picks for the problem we are trying to solve.
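The idea of a leaderboard can be sketched in a few lines of plain Python: score every candidate on held-out data with one metric and sort. This is only an illustration of the concept, not DataRobot's implementation. Log loss is used here as an example metric (this post does not record which metric DataRobot picked), and the model names and probabilities are hypothetical.

```python
import math


def log_loss(y_true, y_prob, eps=1e-15):
    """Average negative log-likelihood for binary labels; lower is better."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


def leaderboard(y_true, predictions_by_model):
    """Return (model_name, score) pairs sorted from best to worst."""
    scored = [(name, log_loss(y_true, probs))
              for name, probs in predictions_by_model.items()]
    return sorted(scored, key=lambda pair: pair[1])


# Hypothetical validation labels and predicted survival probabilities.
y_valid = [1, 0, 1, 0]
candidates = {
    "model_a": [0.9, 0.2, 0.8, 0.1],
    "model_b": [0.6, 0.4, 0.6, 0.4],
    "coin_flip": [0.5, 0.5, 0.5, 0.5],
}

board = leaderboard(y_valid, candidates)
for name, score in board:
    print(f"{name}: {score:.4f}")
```

The confident and correct model lands in first place, while the coin flip scores ln 2 and ends up last. A different metric could produce a different ordering, which is why the choice of metric matters.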

Once we have selected the “best model”, we can explore its key findings, such as these:

Screen Shot 2018-04-24 at 9.51.48 PM

Screen Shot 2018-04-24 at 10.10.09 PM

In other words, women in first class had a high chance of survival, while men in third class were really at risk of not surviving.
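A finding like this can be cross-checked by hand with a simple group-by on the training file. The sketch below computes the share of survivors per (Sex, Pclass) group in plain Python. The function name is my own and the rows are toy values invented for illustration, not the real Titanic data, so the printed rates say nothing about the actual dataset.

```python
from collections import defaultdict


def survival_rate_by_group(rows, keys=("Sex", "Pclass")):
    """Map each group, e.g. ('female', 1), to its share of survivors."""
    counts = defaultdict(lambda: [0, 0])  # group -> [survivors, total]
    for row in rows:
        group = tuple(row[k] for k in keys)
        counts[group][0] += row["Survived"]
        counts[group][1] += 1
    return {group: survived / total
            for group, (survived, total) in counts.items()}


# Toy rows, invented for illustration; not the real Titanic data.
toy_rows = [
    {"Sex": "female", "Pclass": 1, "Survived": 1},
    {"Sex": "female", "Pclass": 1, "Survived": 1},
    {"Sex": "male", "Pclass": 3, "Survived": 0},
    {"Sex": "male", "Pclass": 3, "Survived": 1},
    {"Sex": "male", "Pclass": 3, "Survived": 0},
]

rates = survival_rate_by_group(toy_rows)
for group, rate in sorted(rates.items()):
    print(group, round(rate, 2))
```

Running the same function over the rows of the real training file would give the per-group rates to compare with what the model insights show.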

Using the predict feature of DataRobot, we tested the accuracy of this model against the external Kaggle “test” file by uploading the predictions obtained from DataRobot to Kaggle, and here is the result:

Screen Shot 2018-04-24 at 9.55.48 PM

which is not bad at all: given that 9408 data scientists participated in this competition, this means I am in the top 18% globally!
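For anyone who wants to reproduce the upload step: Kaggle expects a CSV with a PassengerId column and a 0/1 Survived column. Below is a minimal sketch of turning predicted survival probabilities into that format. The exact layout of DataRobot's prediction download is not shown in this post, so the (id, probability) pairs, the 0.5 threshold and the function name are assumptions made for illustration.

```python
import csv
import io


def to_submission(predictions, threshold=0.5):
    """Turn (passenger_id, survival_probability) pairs into Kaggle submission CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["PassengerId", "Survived"])
    for passenger_id, probability in predictions:
        # Kaggle wants a hard 0/1 label, so the probability is thresholded.
        writer.writerow([passenger_id, int(probability >= threshold)])
    return buffer.getvalue()


# Hypothetical predictions, not actual DataRobot output.
example = [(892, 0.12), (893, 0.47), (894, 0.81)]
submission_text = to_submission(example)
print(submission_text)
```

The resulting text can be saved to a file and uploaded on the competition's submission page.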

The pros: I did not touch the data (no extra features, no normalization, no removal of columns such as IDs, etc.); DataRobot analyzed it as is. I used all the default settings without touching any “advanced option”.

The cons: DataRobot lacks data preparation functionality (you can try products like Alteryx or Trifacta in combination with DataRobot), which means that we have to use at least two products to manage end to end a data science experiment that typically involves operations like joins across multiple tables and files, complex aggregations, subqueries, etc.

Finally, while we have to admit that 80% of the work in a data science experiment is data collection, source access, data preparation, cleaning, etc., DataRobot can still unlock several quick wins and opportunities in all the IT departments where analysts and developers are real experts in those activities and have the business knowledge of the data, but lack data science skills.