Automated Machine Learning with H2O Driverless AI

Hi everyone, this time I want to evaluate another automated machine learning tool, H2O Driverless AI, and compare it with DataRobot (of course, only a very lightweight comparison has been done).

The first great feature of H2O Driverless AI is that you can have it (almost) instantly: as long as you have an Amazon, Google or Azure account, you can spin up an H2O Driverless AI instance quite easily, as described here:


You can choose to bring your own license (you can ask for a 21-day evaluation, as I did) or pay for the license as part of the hourly cost of the VM at your cloud provider.

Once you have your VM up and running, my suggestion is to update it to the latest Docker image of H2O Driverless AI, as described in the how-to:

sudo h2oai update

and then connect to the UI.

Once connected, you can upload the datasets you like directly from the UI and run ML experiments on them, choosing which column you want to predict/analyze and which metric you want to measure your model with (AUC, etc.).

Here is an experiment running on 4 GPUs.


Once the experiment is finished, the model interpretation page lets you understand the key influencers in your dataset for the target column you wanted to analyze/predict.


A few days ago I did some tests with DataRobot on Kaggle competitions, so I tried to perform the same experiments with H2O, and the results are….

Titanic competition (metric: accuracy, higher is better):

DataRobot: 0.79904 (best)

H2O: 0.78947

House Prices regression (metric: RMSE, lower is better):

DataRobot: 0.12566 (best)

H2O: 0.13378

As you can see, DataRobot leads on both, but the H2O results are not far behind!

Talking instead about model understanding and explainability of the results in “human” terms, I see DataRobot offering more varied and meaningful visualizations than H2O. Additionally, with DataRobot you can decide yourself which of the many models you want to use, not only the winning one (there are cases where you want to use one that is less accurate but has a higher evaluation/inference speed), while with H2O you have no choice but to use the only one surviving the automatic evaluation process.

H2O, however, is more accessible for testing and trying out, and it offers GPU acceleration, which is a very nice bonus, especially on large datasets.

Happy AutoML!


Text Analytics with Facebook FastText

I recently had to work on a very interesting challenge: imagine a very large collection of phrases/statements, and you want to derive from each of them the key ask/term.

An example can be a large collection of emails: scan the subjects and understand the key topic of each one. So, for an email with the subject “Dear dad the printer is no more working….”, we can say that the key term is “printer”.


If you try to apply generic key-phrase algorithms, they can work pretty well in a very general context, but if your context is specific they will miss several key terms that belong to the dictionary of your domain.


I successfully used Facebook fastText for this supervised classification task; here is what you need to make it work:

  1. An Ubuntu Linux 16.04 virtual machine (free on a MacBook with Parallels Lite)
  2. fastText, downloaded and compiled as described here
  3. A training set with statements and corresponding labels
  4. A validation set with statements and corresponding labels

So yes, you need to manually label a good amount of statements to make fastText “understand” your dataset well.

Of course, you can speed up the process by transforming each statement into an array of words and massively assigning labels where it makes sense (Python pandas or just plain old SQL can help you here); see the sketch below.
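
Here is a minimal sketch of that idea, assuming a hypothetical data/statements.csv with a short_statement column and a small keyword-to-label dictionary (all names are just for illustration):

import pandas as pd

# hypothetical input: a csv of unlabeled statements
statements = pd.read_csv("data/statements.csv")  # column: short_statement

# your domain dictionary: word found in the statement -> label to assign
keyword_to_label = {"printer": "printer", "wifi": "wifi"}

with open("data/train.csv", "w") as out:
    for s in statements["short_statement"]:
        words = str(s).lower().split()  # the statement as an array of words
        found = sorted({keyword_to_label[w] for w in words if w in keyword_to_label})
        if found:  # keep only the statements we could auto-label
            out.write(" ".join("__label__" + l for l in found) + " " + str(s) + "\n")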

Let’s see how to build the training set:

Create a file called train.csv and insert each statement on its own line, in the following format: __label__labelname here write your statement.

Let’s try an example with the email subject used before:

__label__printer Dear dad the printer is no more working

You can also have multiple labels for the same statement; let’s see this example:

__label__printer __label__wifi Dear dad the printer and the wifi are dead

The validation set is another file, valid.csv, filled in exactly the same way; of course, you have to follow the usual good practices to split your labeled dataset correctly into a training set and a validation set (see the sketch below).
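
A minimal way to do that split, assuming all your labeled lines already sit in one file, data/labeled.csv (a name chosen just for this sketch), in the __label__ format shown above:

import random

# read all labeled lines ("__label__x some statement") and shuffle them
with open("data/labeled.csv") as f:
    lines = [l for l in f if l.strip()]
random.seed(42)  # reproducible split
random.shuffle(lines)

# hold out 20% of the lines for validation
split = int(len(lines) * 0.8)
with open("data/train.csv", "w") as f:
    f.writelines(lines[:split])
with open("data/valid.csv", "w") as f:
    f.writelines(lines[split:])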

In order to start the training with fastText, you have to type the following command:

./fasttext supervised -input data/train.csv -output result/model_1 -lr 1.0 -epoch 25 -wordNgrams 2

This assumes that your terminal is inside the fasttext folder and the training file is in a subfolder called data, while the resulting model will be saved in a result folder.

I also added some optional arguments (learning rate, epochs, word n-grams) to improve the precision in my specific case; you can look at those options here.

Once the training is done (you will understand why it is called fastText here!), you can check the precision of the model this way (fastText prints the number of validation examples together with precision and recall at 1, P@1 and R@1):

./fasttext test result/model_1.bin data/valid.csv

In my case I obtained a good 83% 🙂


If you want to test your model manually (typing sentences and obtaining the corresponding labels), you can try the following command, where the final “-” makes fastText read from standard input:

./fasttext predict result/model_1.bin -

fastText also has Python wrappers, like the one I used; you can leverage the wrapper to perform massive scoring like I did here:


from fastText import load_model
import pandas as pd

# load the trained supervised model and score every row of a csv
fastm = load_model('result/model_1.bin')
k = 1  # number of labels to predict per statement
df = pd.read_csv("data/datatoscore.csv")
df.insert(2, 'label', '')
for index, row in df.iterrows():
    labels, probabilities = fastm.predict(str(row["short_statement"]), k)
    for w, f in zip(labels, probabilities):
        # write back with df.at: mutating `row` would not change the dataframe
        df.at[index, 'label'] = w
df.to_csv("data/finalresult.csv", index=False)

You can improve the entire process in many different ways. For example, you can use unsupervised training to obtain word vectors for your dataset, use this “dictionary” as the base for your list of labels, and use nearest neighbors to find similar words that can be grouped into single labels when doing the supervised training, as sketched below.
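
A minimal sketch of that idea with the fastText Python bindings (the file name and the example word are just for illustration; the nearest neighbors are computed by hand with cosine similarity over the learned vectors):

import numpy as np
import fastText

# learn word vectors from the raw, unlabeled statements
model = fastText.train_unsupervised('data/all_statements.txt', model='skipgram')

def nearest_neighbors(word, topn=5):
    # rank the whole vocabulary by cosine similarity to `word`
    target = model.get_word_vector(word)
    target = target / np.linalg.norm(target)
    scored = []
    for w in model.get_words():
        if w == word:
            continue
        v = model.get_word_vector(w)
        scored.append((w, float(np.dot(target, v / (np.linalg.norm(v) + 1e-9)))))
    return sorted(scored, key=lambda t: -t[1])[:topn]

# e.g. find words to group under the __label__printer label
print(nearest_neighbors("printer"))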


Let’s dig in our email!

Like many of you, even though we are almost in 2018, I still work A LOT with email, and recently I asked myself the following question: what if I could leverage analytics and also machine learning to get a better understanding of my emails?


Here is a quick way to understand who is inspiring you the most


and who instead are the ones spreading a bit more negativity in your daily job 🙂


You will need (if you want to process ALL your emails in one shot!):

  1. Windows 7/8/10
  2. Outlook 2013 or 2016
  3. Access 2013 or 2016
  4. An Azure Subscription
  5. An Azure Data Lake Store and Data Lake Analytics account
  6. PowerBI Desktop or any other Visualization Tool you like (Tableau or simply Excel)

Step 1: Link MS Access tables to your Outlook folders as explained here.

Step 2: Export your emails from Access to csv files.

Step 3: Upload those files to your data lake store.

Step 4: Process the fields containing text data with the U-SQL cognitive extensions and derive the sentiment and key phrases of each email.

Step 5: With PowerBI Desktop you can access the output data sitting in the data lake store as described here.

Step 6: Find the senders with the highest average sentiment and the ones with the lowest 🙂 (a quick pandas alternative is sketched below).
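
If you prefer a quick local script to a PowerBI report for this last step, here is a minimal pandas sketch, assuming the Data Lake Analytics job output is a csv with sender and sentiment columns (file and column names are just for illustration):

import pandas as pd

# one row per email with the sentiment score computed by the U-SQL job
emails = pd.read_csv("emails_scored.csv")  # columns: sender, sentiment

# average sentiment per sender, most positive first
ranking = (emails.groupby("sender")["sentiment"]
                 .agg(["mean", "count"])
                 .sort_values("mean", ascending=False))

print(ranking.head(10))  # who inspires you the most
print(ranking.tail(10))  # who spreads a bit more negativity :)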


If you are worried about leaving your emails in the cloud: after obtaining the sentiment and key phrases, you can download this latest output and remove all the data from the data lake store, using this (local) file as input for PowerBI Desktop.

In addition to this, I would also suggest performing a one-way hash of the sender email address and uploading to the data lake store the emails with this hashed field instead of the real sender, as sketched below.
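
A minimal sketch of that anonymization step, assuming the exported csv has a sender column (file and column names are just for illustration); the hash-to-sender mapping stays on your machine for the final re-join:

import hashlib
import pandas as pd

emails = pd.read_csv("emails_export.csv")  # column: sender

def one_way_hash(address):
    # sha256 is one-way: the cloud only ever sees the digest
    return hashlib.sha256(address.strip().lower().encode("utf-8")).hexdigest()

emails["sender_hash"] = emails["sender"].map(one_way_hash)

# keep the hash -> sender mapping locally so you can re-associate senders later
emails[["sender_hash", "sender"]].drop_duplicates().to_csv("mapping_local.csv", index=False)

# upload only the anonymized version to the data lake store
emails.drop(columns=["sender"]).to_csv("emails_anonymized.csv", index=False)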


Once you have the data lake analytics job results, you can download them and join them locally in Access (or via the local mapping file) to associate each email with its original sender again.

 

AI is progressing at incredible speed!

Several people tend to think that all the new AI technologies, like convolutional neural networks, recurrent neural networks, generative adversarial networks, etc., are used mainly by tech giants like Google, Microsoft, etc. In reality, many enterprises, like Zalando, Instacart and many others, are already leveraging deep learning in production. Well-known deep learning frameworks like Keras, TensorFlow, CNTK, Caffe2, etc. are now finally reaching a larger audience.

Big data engines like Spark are finally able to drive deep learning workloads as well, and the first steps are being taken to make large deep neural network models fit inside small-CPU, low-memory, occasionally connected IoT devices.

Finally, new hardware has been built specifically for deep learning:

  1. https://www.microsoft.com/en-us/research/blog/microsoft-unveils-project-brainwave/
  2. https://cloud.google.com/blog/big-data/2017/05/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu
  3. https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/

But the AI space never sleeps, and we are already seeing new solutions, frameworks and architectures arriving that are currently under development:

Ray: a new distributed execution framework that aims to replace the well-known Spark!

PyTorch and the fast.ai framework, which want to compete with and beat all the existing deep learning frameworks.

To overcome one of the biggest problems in deep learning, the amount of training data required, a new framework called Snorkel has been designed to create brand new training data with little human interaction.

Finally, to help create better integration and performance of deep learning models with the applications that consume them, a new prediction serving system, Clipper, has been designed.

The speed of AI evolution is incredible; be prepared to see much more than this in the near future!

My Top 2 Microsoft Build 2017 Sessions

Let’s start with Number 1: the visionary cloud that is arriving, compute nodes combined with FPGA neurons that act as hardware microservices, communicating and changing their internal code while directly attached to the Azure network, like a global neural network. Do you want to know more? Click here and explore the content of this presentation with the new Video Indexer AI, jumping directly to the portions of the video you like and searching for words, concepts and people appearing in it.

We can then look here at Number 2 (go to 1:01:54):

Matt Velloso teaching a robot (Zenbo) how to recognize the images it sees, using the Microsoft Bot Framework and the new Custom Image Recognition Service.

Do you want to explore more?

Go here to Channel 9 and have fun exploring all the massive updates that have been released!

Image recognition for everyone

Warning: I’m NOT a data scientist, but I am a huge fan of cool technology!

Today I want to write about a new functionality that amazes me and that can help you literally do “magic” things that you might think are exclusive to super data scientists, experts in deep learning and frameworks like TensorFlow, CNTK, Caffe, etc.

Imagine the following: someone trains huge neural networks (think of them as mini brains) for weeks/months, using a lot of GPUs, on thousands and thousands of images.

These mini brains are then used to classify images and say something like: a car is present, a human is present, a cat, etc. Now, one of the “bad things” about neural networks is that usually you cannot understand how they really work internally and what the “thinking process” of a neural network is.


However, the latest studies on neural networks have found a way to “extract” this knowledge, and Microsoft has just delivered (this April) this knowledge, or better, these models.

Now I want to show you an example of how to do this.

Let’s grab some images of the Simpsons:


and some other images of the Flintstones:


For example, 13 images of Simpsons cartoons and 11 of the Flintstones. Let’s build a program that, given a new image that is not part of the two image sets, can predict whether it is a Simpsons or a Flintstones image. I’ve chosen cartoons, but you can apply this to any images you want to process (watches? consoles? vacation places? etc.).

The idea is the following: I take the images I have and feed them to the “model” that has been trained. The result of this process will be, for each image, a collection of “numbers” that are the representation of that image according to the neural network. An analogy to understand this: our DNA is a small fraction of ourselves, but it can “represent” us; these “numbers” are the DNA of the image.

Now that we have each image represented by a simple array of numbers, we can use a “normal” machine learning technique like a linear classifier (e.g., logistic regression) to leverage this simplified representation of the image and learn how to classify them; a sketch of the idea follows.
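
The article’s sample code is in R with its own image featurizer; here is a rough sketch of the same featurize-then-classify idea in Python, using a Keras pre-trained network as the featurizer and scikit-learn for the linear classifier (the simpsons/ and flintstones/ folders are hypothetical):

import numpy as np
from pathlib import Path
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input
from tensorflow.keras.preprocessing import image

# the pre-trained network without its classification head: a pure image featurizer
featurizer = ResNet50(weights="imagenet", include_top=False, pooling="avg")

def dna_of(path):
    # the 2048-number "DNA" of one image, according to the network
    img = image.load_img(path, target_size=(224, 224))
    x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    return featurizer.predict(x)[0]

X, y = [], []
for label, folder in enumerate(["simpsons", "flintstones"]):  # hypothetical folders
    for p in Path(folder).glob("*.jpg"):
        X.append(dna_of(str(p)))
        y.append(label)

# 80% to train the simple linear classifier, 20% to score it
X_tr, X_te, y_tr, y_te = train_test_split(
    np.array(X), np.array(y), test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("accuracy:", clf.score(X_te, y_te))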

Applying the sample R code described in the article to only this small set of images (13 and 11 respectively), using 80% for training and 20% for scoring, I obtained the following result:


A good 75% on a very small number of images!