Hi everyone, yes I’m back!
This time we are going to set up a Big Data playground on Azure that can be really useful for any Python/PySpark data scientist.
Typically, what you get out of the box on Azure for this task is a Spark HDInsight cluster (i.e. Hadoop on Azure in Platform-as-a-Service mode) connected to Azure Blob Storage (where the data is stored) and running PySpark Jupyter notebooks.
It’s a fully managed cluster that you can start in a few clicks, and it gives you all the Big Data power you need to crunch billions of rows of data. Cluster node configuration, libraries, networking, etc. are all handled automatically, so you can focus on solving your business problems without worrying about IT tasks like checking whether the cluster is alive and healthy; Microsoft does that for you.
Now, one key thing data scientists ask for is freedom: they want to install and update libraries and try new open source packages, but at the same time they don’t want to manage “a cluster” like an IT department.
In order to satisfy these two requirements we need some extra pieces in our playground, and one key component is the Azure Linux Data Science Virtual Machine.
The Linux Data Science Virtual Machine is the Swiss Army knife for all data science needs; here you can get an idea of all the incredible tasks you can accomplish with this product.
In this case I’m really interested in these capabilities:
- It’s a VM so data scientists can add/update all the libraries they need
- Jupyter and Spark are already installed on it so data scientists can use it to play locally and experiment on small data before going “Chuck Norris mode” on HDInsight
But something is still missing: as a data scientist I would love to work in one unified environment, accessing all my data and switching with a simple click from local to “cluster” mode without changing anything in my code or my configuration.
Hmmm… seems impossible; some magic is needed here!
Wait a minute, did you say “magic”? I think we have that kind of magic :-), it’s Spark magic!
In fact, we can use the local Jupyter and Spark environment by default and, when we need the power of the cluster, use sparkmagic to run the very same code on the cluster simply by changing the notebook kernel!
In order to complete the setup we need to do the following:
- Enable the Linux DSVM to connect, via local Spark, to Azure Blob Storage (adding libraries, conf files and settings)
- Add sparkmagic to the Linux DSVM (adding libraries, conf files and settings) to connect from a local Jupyter notebook to the HDInsight cluster using Livy
Here are the detailed instructions:
Step 1: enable access to Azure Blob Storage from your local Spark programs (ensure you run these commands as root):
cd $SPARK_HOME/conf
cp spark-defaults.conf.template spark-defaults.conf
cat >> spark-defaults.conf <<EOF
spark.jars /dsvm/tools/spark/current/jars/azure-storage-4.4.0.jar,/dsvm/tools/spark/current/jars/hadoop-azure-2.7.3.jar
EOF
If you don’t have a core-site.xml file in the $SPARK_HOME/conf directory, run the following:
cat >> core-site.xml <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.AbstractFileSystem.wasb.impl</name>
    <value>org.apache.hadoop.fs.azure.Wasb</value>
  </property>
  <property>
    <name>fs.azure.account.key.YOURSTORAGEACCOUNT.blob.core.windows.net</name>
    <value>YOURSTORAGEACCOUNTKEY</value>
  </property>
</configuration>
EOF
Otherwise, just copy the two <property> sections above into your existing core-site.xml file. In both cases, replace YOURSTORAGEACCOUNT and YOURSTORAGEACCOUNTKEY with the actual name and key of your Azure storage account.
Once these steps are done, you should be able to access the blob from your Spark program with a wasb://YourContainer@YOURSTORAGEACCOUNT.blob.core.windows.net/YourBlob URL in the read API.
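For example, reading blob data from the local PySpark environment looks roughly like this (a minimal sketch; the container, account and file names are placeholders you have to replace with your own):

```python
def wasb_url(container, account, path):
    """Build a wasb:// URL for a blob inside an Azure Storage container."""
    return "wasb://{0}@{1}.blob.core.windows.net/{2}".format(container, account, path)

# With the jars and core-site.xml in place, the local Spark session can then
# read directly from Blob Storage, e.g.:
#
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.getOrCreate()
#   df = spark.read.csv(wasb_url("mycontainer", "mystorageaccount", "data/sales.csv"),
#                       header=True)
#   df.show(5)
```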
Step 2: enable the local Jupyter notebook with remote Spark execution on HDInsight (assuming the default Python is 3.5, as it comes on the Linux DSVM):
sudo /anaconda/envs/py35/bin/pip install sparkmagic
cd /anaconda/envs/py35/lib/python3.5/site-packages
sudo /anaconda/envs/py35/bin/jupyter-kernelspec install sparkmagic/kernels/pyspark3kernel
sudo /anaconda/envs/py35/bin/jupyter-kernelspec install sparkmagic/kernels/sparkkernel
sudo /anaconda/envs/py35/bin/jupyter-kernelspec install sparkmagic/kernels/sparkrkernel
Then, in your /home/{YourLinuxUsername}/ folder:
- create a folder called .sparkmagic and, inside it, a file called config.json
- write into that file the configuration values of your HDInsight cluster (Livy endpoints and authentication) as described here:
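For reference, a minimal config.json could look like the following (a sketch based on sparkmagic’s example configuration; the cluster name, username and password are placeholders for your HDInsight values):

```json
{
  "kernel_python_credentials": {
    "username": "admin",
    "password": "{YOUR CLUSTER PASSWORD}",
    "url": "https://{YOURCLUSTER}.azurehdinsight.net/livy"
  },
  "kernel_scala_credentials": {
    "username": "admin",
    "password": "{YOUR CLUSTER PASSWORD}",
    "url": "https://{YOURCLUSTER}.azurehdinsight.net/livy"
  },
  "kernel_r_credentials": {
    "username": "admin",
    "password": "{YOUR CLUSTER PASSWORD}",
    "url": "https://{YOURCLUSTER}.azurehdinsight.net/livy"
  }
}
```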
At this point, going back to Jupyter should allow you to run your notebooks against the HDInsight cluster using the PySpark3, Spark and SparkR kernels, and you can switch from local to remote kernel execution with one click!
Of course, some security features have to be improved (passwords in clear text!), but the community is already working on this (see here the support for base64 encoding) and, of course, you can get the sparkmagic code from GitHub, add the encryption support you need and contribute it back to the community!
Have fun with Spark and Spark Magic!
UPDATE: here are instructions on how to also connect to Azure Data Lake Store!
- Download this package and extract just these two libraries: azure-data-lake-store-sdk-2.0.11.jar and hadoop-azure-datalake-3.0.0-alpha2.jar
- Copy these libraries to /home/{YourLinuxUsername}/Desktop/DSVM tools/spark/spark-2.0.2/jars/
- Add their paths to the spark.jars list inside spark-defaults.conf, as we did before
- Go here and, after you have created your AAD application, note down the Client ID, Client Secret and Tenant ID
- Add the following properties to your core-site.xml, replacing the placeholders with the values obtained in the previous step:

<property>
  <name>dfs.adls.oauth2.access.token.provider.type</name>
  <value>ClientCredential</value>
</property>
<property>
  <name>dfs.adls.oauth2.refresh.url</name>
  <value>https://login.microsoftonline.com/{YOUR TENANT ID}/oauth2/token</value>
</property>
<property>
  <name>dfs.adls.oauth2.client.id</name>
  <value>{YOUR CLIENT ID}</value>
</property>
<property>
  <name>dfs.adls.oauth2.credential</name>
  <value>{YOUR CLIENT SECRET}</value>
</property>
<property>
  <name>fs.adl.impl</name>
  <value>org.apache.hadoop.fs.adl.AdlFileSystem</value>
</property>
<property>
  <name>fs.AbstractFileSystem.adl.impl</name>
  <value>org.apache.hadoop.fs.adl.Adl</value>
</property>
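After these changes, the local Spark environment can read from Data Lake Store with adl:// URLs; a minimal sketch (the account and path names are placeholders):

```python
def adl_url(account, path):
    """Build an adl:// URL for a path in an Azure Data Lake Store account."""
    return "adl://{0}.azuredatalakestore.net/{1}".format(account, path)

# For example, with a local PySpark session:
#
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.getOrCreate()
#   df = spark.read.parquet(adl_url("mydatalakestore", "clickstream/2017/01"))
```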