My Top 2 Microsoft Build 2017 Sessions

Let’s start with Number 1: this is the visionary cloud that is arriving, with compute nodes combined with FPGA neurons that act as hardware microservices, communicating and changing their internal code while directly attached to the Azure network, like a global neural network. Do you want to know more? Click here and explore the content of this presentation with the new video indexing AI, jumping directly to the portions of the video that you like and searching for words, concepts and people appearing in the video.

Then we can look at Number 2 here (go to 1:01:54):

Matt Velloso teaching a robot (Zenbo) how to recognize the images it sees, using the Microsoft Bot Framework and the new Custom Image Recognition Service.

Do you want to explore more?

Go here on Channel 9 and have fun exploring all the massive updates that have been released!

Pyspark safely on Data Lake Store and Azure Storage Blob

Hi, I’m working on several projects that require access to cloud storage (in this case Azure Data Lake Store and Azure Blob Storage) from pyspark running on Jupyter, without having all the Jupyter users access that storage with the same credentials stored in the core-site.xml configuration file of the Spark cluster.



I started my investigation by looking at the SparkSession that comes with Spark 2.0, especially at commands like spark.conf.set("spark.sql.shuffle.partitions", 6), but I discovered that these commands do not work at the Hadoop settings level: they are limited to the Spark runtime parameters.

I then turned my attention to SparkContext, and in particular to HadoopConfiguration, which seemed promising but is missing from the pyspark implementation…

Finally I was able to find this excellent Stack Overflow post that points out how to leverage the HadoopConfiguration functionality from pyspark.

So in a nutshell you can have the core-site.xml defined as follows:


So, as you can see, we do not store any credentials here.

Let’s see how to access an Azure Storage Blob container with a shared access signature (SAS). A SAS can be created for one specific container (think of it as a folder), which gives you a fairly fine-grained security model on the Azure Storage account without sharing the Azure Blob Storage access keys.

If you love Python, here is some code that an admin can use to quickly generate SAS signatures that last for 24 hours:

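# Requires the legacy azure-storage SDK, which provides BlockBlobService
from azure.storage.blob import BlockBlobService, ContainerPermissions
from datetime import datetime, timedelta

account_name = "ACCOUNT_NAME"
account_key = "ADMIN_KEY"
container_name = "CONTAINER_NAME"

block_blob_service = BlockBlobService(account_name=account_name, account_key=account_key)

# Read-only SAS for the container, valid for the next 24 hours
sas_url = block_blob_service.generate_container_shared_access_signature(
    container_name,
    ContainerPermissions.READ,
    datetime.utcnow() + timedelta(hours=24),
)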


You will obtain something like this:


You can refer to this link to understand the structure.

OK, once the Azure storage admin provides us with the signature, we can use this SAS signature to safely access the files in the Azure Storage Blob container directly:

sc._jsc.hadoopConfiguration().set("", "PUT_YOUR_SIGNATURE")
from pyspark.sql.types import *

# Load the data. We use the sample HVAC.csv file of the HDInsight samples
hvacText = sc.textFile("wasbs://")

# Create the schema
hvacSchema = StructType([StructField("date", StringType(), False), StructField("time", StringType(), False), StructField("targettemp", IntegerType(), False), StructField("actualtemp", IntegerType(), False), StructField("buildingID", StringType(), False)])

# Parse the data in hvacText, skipping the header row
hvac = hvacText.map(lambda s: s.split(",")).filter(lambda s: s[0] != "Date").map(lambda s: (str(s[0]), str(s[1]), int(s[2]), int(s[3]), str(s[6])))

# Create a data frame
hvacdf = sqlContext.createDataFrame(hvac, hvacSchema)

# Register the data frame as a table to run queries against
hvacdf.registerTempTable("hvac")
from pyspark.sql import HiveContext
hive_context = HiveContext(sc)
bank = hive_context.table("hvac")
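
The property name passed to hadoopConfiguration().set() and the wasbs:// path both follow a fixed pattern in the hadoop-azure connector. Here is a minimal sketch of how the two strings can be built, assuming the usual fs.azure.sas.<container>.<account>.blob.core.windows.net key format and the public Azure endpoint suffix. The helper names sas_conf_key and wasbs_url, and the container, account and path values, are hypothetical and not part of pyspark:

```python
# Hypothetical helpers. The key and URL patterns are assumptions based on
# the hadoop-azure connector conventions for the public Azure cloud.
BLOB_SUFFIX = "blob.core.windows.net"


def sas_conf_key(container, account):
    """Hadoop property name that holds a container-level SAS token."""
    return "fs.azure.sas.{0}.{1}.{2}".format(container, account, BLOB_SUFFIX)


def wasbs_url(container, account, path):
    """wasbs:// URL of a blob inside the given container."""
    return "wasbs://{0}@{1}.{2}/{3}".format(
        container, account, BLOB_SUFFIX, path.lstrip("/"))


print(sas_conf_key("mycontainer", "myaccount"))
print(wasbs_url("mycontainer", "myaccount", "/data/sample.csv"))
```

In a notebook the first value would be used as the key in sc._jsc.hadoopConfiguration().set(...) and the second as the argument of sc.textFile(...).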

The same idea can be applied to Data Lake Store. Assuming that you have your Data Lake credentials set up as described here, you can access Data Lake Store safely in this way:

sc._jsc.hadoopConfiguration().set("dfs.adls.oauth2.refresh.url", "")
sc._jsc.hadoopConfiguration().set("", "PUT_YOUR_CLIENT_ID")
sc._jsc.hadoopConfiguration().set("dfs.adls.oauth2.credential", "PUT_YOUR_SECRET")

from pyspark.sql.types import *

# Load the data. The path below assumes Data Lake Store is default storage for the Spark cluster
hvacText = sc.textFile("adl://")

# Create the schema
hvacSchema = StructType([StructField("date", StringType(), False), StructField("time", StringType(), False), StructField("targettemp", IntegerType(), False), StructField("actualtemp", IntegerType(), False), StructField("buildingID", StringType(), False)])

# Parse the data in hvacText, skipping the header row
hvac = hvacText.map(lambda s: s.split(",")).filter(lambda s: s[0] != "Date").map(lambda s: (str(s[0]), str(s[1]), int(s[2]), int(s[3]), str(s[6])))

# Create a data frame
hvacdf = sqlContext.createDataFrame(hvac, hvacSchema)

# Register the data frame as a table to run queries against
hvacdf.registerTempTable("hvac")
from pyspark.sql import HiveContext
hive_context = HiveContext(sc)
bank = hive_context.table("hvac")
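
To keep the per-session Data Lake settings in one place, they can be collected in a dictionary and applied in a loop. This is only a sketch under assumptions: the dfs.adls.oauth2.client.id key and the login.microsoftonline.com token URL are what I would expect for the hadoop-azure-datalake library of that period, and the function names adls_oauth_settings and apply_settings are hypothetical:

```python
# Hypothetical helpers. The client.id key and the token URL pattern are
# assumptions; check them against your hadoop-azure-datalake version.
def adls_oauth_settings(tenant_id, client_id, client_secret):
    """Return the Hadoop settings for service-principal access to ADLS."""
    return {
        "dfs.adls.oauth2.access.token.provider.type": "ClientCredential",
        "dfs.adls.oauth2.refresh.url":
            "https://login.microsoftonline.com/{0}/oauth2/token".format(tenant_id),
        "dfs.adls.oauth2.client.id": client_id,
        "dfs.adls.oauth2.credential": client_secret,
    }


def apply_settings(hadoop_conf, settings):
    """Copy each setting into any object that exposes set(key, value)."""
    for key, value in sorted(settings.items()):
        hadoop_conf.set(key, value)
```

In a notebook this would be called as apply_settings(sc._jsc.hadoopConfiguration(), adls_oauth_settings("TENANT_ID", "PUT_YOUR_CLIENT_ID", "PUT_YOUR_SECRET")), so each user supplies their own credentials for their own session.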

Happy coding with pyspark and Azure!

Data scientists wanna have fun!

Hi everyone, yes I’m back!

This time we are going to set up a Big Data playground on Azure that can be really useful for any Python/pyspark data scientist.

Typically, what you get out of the box on Azure for this task is a Spark HDInsight cluster (i.e. Hadoop on Azure in Platform as a Service mode) connected to Azure Blob Storage (where the data is stored) and running pyspark Jupyter notebooks.

It’s a fully managed cluster that you can start in a few clicks and that gives you all the Big Data power you need to crunch billions of rows of data. Cluster node configuration, libraries, networking and so on are all set up automatically for you, so you only have to think about solving your business problems without worrying about IT tasks like “check if the cluster is alive” or “check if the cluster is OK”: Microsoft will do this for you.

Now, one key ask that data scientists have is “freedom!”. In other words, they want to install and update libraries and try new open source packages, but at the same time they don’t want to manage “a cluster” like an IT department.

In order to satisfy these two requirements we need some extra pieces in our playground, and one key component is the Azure Linux Data Science Virtual Machine.

The Linux Data Science Virtual Machine is the Swiss Army knife for all data science needs; here you can get an idea of all the incredible tasks you can accomplish with this product.

In this case I’m really interested in these capabilities:

  • It’s a VM so data scientists can add/update all the libraries they need
  • Jupyter and Spark are already installed on it so data scientists can use it to play locally and experiment on small data before going “Chuck Norris mode” on HDInsight

But there is something missing here… As a data scientist I would love to work in one unified environment, accessing all my data and switching with a simple click from local to “cluster” mode without changing anything in my code or my configuration.

Uhmmm…. seems impossible, here some magic is needed !

Wait a minute , did you say “magic”? I think we have that kind of magic :-), it’s spark magic!

In fact we can use the local Jupyter and Spark environment by default and, when we need the power of the cluster, we can use spark magic to run the same code on the cluster simply by changing the kernel of the notebook!

In order to complete the setup we need to do the following:

  1. Add to the Linux DS VM the possibility to connect , via local spark, to azure blob storage (adding libraries, conf files and settings)
  2. Add to the Linux DS VM spark magic (adding libraries, conf files and settings) to connect from local Jupyter notebook to the HDInsight cluster using Livy

Here are the detailed instructions:

Step 1  to start using Azure blob from your Spark program (ensure you run these commands as root):

cd $SPARK_HOME/conf
cp spark-defaults.conf.template spark-defaults.conf
cat >> spark-defaults.conf <<EOF
spark.jars                 /dsvm/tools/spark/current/jars/azure-storage-4.4.0.jar,/dsvm/tools/spark/current/jars/hadoop-azure-2.7.3.jar
EOF

If you don’t have a core-site.xml in the $SPARK_HOME/conf directory, run the following:

cat >> core-site.xml <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
</property>
<property>
</property>
</configuration>
EOF

Otherwise, just copy and paste the two <property> sections above into your core-site.xml file. Replace the placeholders with the actual name of your Azure storage account and the storage account key.

Once you do these steps, you should be able to access the blob from your Spark program with the wasb:// URL in the read API.
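
If you prefer to generate the file rather than edit it by hand, a Hadoop configuration file is just a list of name/value properties, so it can be rendered from a dictionary. The sketch below is an illustration only: render_core_site is a hypothetical helper, and the fs.azure.account.key.<account>.blob.core.windows.net property name is my assumption of the hadoop-azure convention for a storage account key, not something taken from the listing above:

```python
import xml.etree.ElementTree as ET


def render_core_site(properties):
    """Render a dict of Hadoop properties as core-site.xml text."""
    root = ET.Element("configuration")
    for name, value in sorted(properties.items()):
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    body = ET.tostring(root, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


# Assumed property name pattern for a storage account key (hadoop-azure).
account = "ACCOUNT_NAME"
props = {
    "fs.azure.account.key.{0}.blob.core.windows.net".format(account): "ACCOUNT_KEY",
}
print(render_core_site(props))
```

The printed text can be saved as $SPARK_HOME/conf/core-site.xml. Keep in mind that anyone who can read that file can read the account key.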

Step 2: enable the local Jupyter notebook with remote Spark execution on HDInsight (assuming that the default Python is 3.5, as it comes with the Linux DS VM):

sudo /anaconda/envs/py35/bin/pip install sparkmagic

cd /anaconda/envs/py35/lib/python3.5/site-packages

sudo /anaconda/envs/py35/bin/jupyter-kernelspec install sparkmagic/kernels/pyspark3kernel

sudo /anaconda/envs/py35/bin/jupyter-kernelspec install sparkmagic/kernels/sparkkernel

sudo /anaconda/envs/py35/bin/jupyter-kernelspec install sparkmagic/kernels/sparkrkernel


In your /home/{YourLinuxUsername}/ folder:

  1. create a folder called .sparkmagic and create a file called config.json
  2. Write in the file the configuration values of HDInsight (livy endpoints and auth) as described here :
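
As an illustration of these two steps, the sketch below builds a minimal configuration and writes it to .sparkmagic/config.json. It rests on assumptions: the kernel_python_credentials, kernel_scala_credentials and kernel_r_credentials sections follow the example configuration shipped with sparkmagic, your installed version may expect additional fields, and the function names and the cluster URL are placeholders of mine:

```python
import json
import os


def build_sparkmagic_config(livy_url, username, password):
    """Build a minimal config using the same Livy endpoint for each kernel."""
    credentials = {"username": username, "password": password, "url": livy_url}
    return {
        "kernel_python_credentials": dict(credentials),
        "kernel_scala_credentials": dict(credentials),
        "kernel_r_credentials": dict(credentials),
    }


def write_sparkmagic_config(config, home_dir):
    """Write config.json under <home_dir>/.sparkmagic and return its path."""
    folder = os.path.join(home_dir, ".sparkmagic")
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, "config.json")
    with open(path, "w") as handle:
        json.dump(config, handle, indent=2)
    return path
```

A call such as write_sparkmagic_config(build_sparkmagic_config("https://CLUSTER_NAME.azurehdinsight.net/livy", "admin", "PASSWORD"), os.path.expanduser("~")) would create the file for the current user. The password ends up in clear text, so restrict the file permissions.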

At this point, going back to Jupyter should allow you to run your notebook against the HDInsight cluster using the PySpark3, Spark and SparkR kernels, and you can switch from local kernel to remote kernel execution with one click!

Of course some security features have to be improved (passwords in clear text!), but the community is already working on this (see here the support for base64 encoding) and, of course, you can get the spark magic code from git, add the encryption support you need and contribute it back to the community!

Have fun with Spark and Spark Magic!

UPDATE: here are the instructions on how to connect to Azure Data Lake Store as well!

  1. Download this package and extract just these two libraries: azure-data-lake-store-sdk-2.0.11.jar, hadoop-azure-datalake-3.0.0-alpha2.jar
  2. Copy these libraries here: “/home/{YourLinuxUsername}/Desktop/DSVM tools/spark/spark-2.0.2/jars/”
  3. Add their paths to the list of library paths inside spark-defaults.conf, as we have done before
  4. Go here and, after you have created your AAD application, note down the Client ID, Client Secret and Tenant ID
  5. Add the following properties to your core-site.xml, replacing the values with the ones you obtained in the previous step:

    <property><name>dfs.adls.oauth2.access.token.provider.type</name><value>ClientCredential</value></property>
    <property><name>dfs.adls.oauth2.refresh.url</name><value>{YOUR TENANT ID}/oauth2/token</value></property>
    <property><name></name><value>{YOUR CLIENT ID}</value></property>
    <property><name>dfs.adls.oauth2.credential</name><value>{YOUR SECRET ID}</value></property>



Integrating Azure API App protected by Azure Active Directory with Salesforce using Oauth

This time I had to face a new integration challenge: on Salesforce Service Cloud, in order to offer a personalized service to customers requesting assistance, I added a call to an Azure app that exposes all the information the company has about the customer across all the touch points (web, mobile, etc.). Apex coding is quite straightforward when dealing with simple HTTP calls and the exchange of JSON objects; it becomes trickier when you have to deal with authentication.

In my specific case, the token-based authentication I had to put in place consists of the following steps:

  1. Identify the URL that accepts our authentication request and returns the authentication token
  2. Compose the authentication request with all the parameters needed to define the requester identity and the target audience
  3. Retrieve the token and build all the subsequent requests to the target app with this token in the header
  4. Nice to have: cache the token in order to reuse it for multiple Apex calls, and refresh it before it expires or on request.

Basically all the info we need is contained in this single Microsoft page.

So before even touching a single line of code, we have to register the calling and called applications in Azure Active Directory (this gives both an “identity”).

This step should already be done for the Azure API app when you protect it with AAD using the portal (write down the client ID of the AAD registered app), while for the caller (Salesforce) just register a simple app of your choice in AAD.

When you do this step, write down the client ID and the client secret that the portal gives you.

Now you need your tenant ID, which is specific to your Azure subscription. There are multiple ways of retrieving this parameter, as stated here; for me the PowerShell option worked.

Once you have these 4 parameters you can build a POST request in this way:

Endpoint:<tenant id>/oauth2/token


Content-Type: application/x-www-form-urlencoded

Request body:

grant_type=client_credentials&client_id=<client  id of salesforce app>&client_secret=<client secret of the salesforce app>&resource=<client id of the azure API app>
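
For readers who want to try the token request outside Salesforce first, the same POST can be sketched with the Python standard library. The login.microsoftonline.com host is my assumption for the part of the endpoint that precedes <tenant id>/oauth2/token, and build_token_request is a hypothetical helper name. The function only builds the request and does not send it:

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_token_request(tenant_id, client_id, client_secret, resource):
    """Build (but do not send) the client-credentials token request."""
    # Assumed Azure AD host; only "<tenant id>/oauth2/token" is given above.
    endpoint = "https://login.microsoftonline.com/{0}/oauth2/token".format(tenant_id)
    # urlencode escapes characters such as '&' or '=' inside the secret.
    body = urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "resource": resource,
    }).encode("ascii")
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    return Request(endpoint, data=body, headers=headers, method="POST")
```

Passing the returned object to urllib.request.urlopen would perform the call. Encoding the body with urlencode rather than plain string concatenation matters when the client secret contains reserved characters.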

If everything goes as expected you will receive this JSON response:






"resource":"<client id of the azure API app>"


Now you are finally ready to call the Azure API app endpoint. The only additional thing you have to do is add a header with the following content to the HTTP request:

Authorization: Bearer <access_token coming from the previous request>

This should be sufficient to complete our scenario (by the way, do not forget to add both the token endpoint URL and https://<UrlOfYourApiApp> to the authorized remote URLs list of your Salesforce setup).

Using the expiry data of the token you can figure out how long it will last (usually 24h) and you can set up your cache strategy.
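
One possible cache strategy, sketched in Python rather than Apex to keep it short: store the token together with its expiry time and fetch a new one only when the stored one is missing, about to expire, or explicitly invalidated. The TokenCache class, the five-minute safety margin and the fetch_token callable (returning the token and its expiry as epoch seconds) are all assumptions made for the illustration:

```python
import time


class TokenCache(object):
    """Cache a bearer token and refresh it shortly before it expires."""

    def __init__(self, fetch_token, margin_seconds=300, clock=time.time):
        self._fetch_token = fetch_token  # callable returning (token, expires_on)
        self._margin = margin_seconds    # refresh this long before expiry
        self._clock = clock              # injectable so it can be tested
        self._token = None
        self._expires_on = 0.0

    def get(self, force_refresh=False):
        """Return a valid token, fetching a new one only when needed."""
        expired = self._clock() >= self._expires_on - self._margin
        if force_refresh or self._token is None or expired:
            self._token, self._expires_on = self._fetch_token()
        return self._token

    def authorization_header(self):
        """Header to add to every request sent to the protected API."""
        return {"Authorization": "Bearer " + self.get()}
```

The force_refresh flag covers the "on request" case, for example after the API answers with a 401 because the token was revoked earlier than expected.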

Happy integration then!


Here are some Apex snippets that implement what was explained.

public with sharing class AADAuthentication {
    private static String TokenUrl='';
    private static String grant_type='client_credentials';
    private static String client_id='putyourSdfcAppClientId';
    private static String client_secret='putyourSdfcAppClientSecret';
    private static String resource='putyourAzureApiAppClientId';
    private static String JSonTestUrl='putyourAzureApiUrlyouwanttoaccess';

    // Gets a token, then calls the protected API with the token in the header
    public static String AuthAndAccess() {
        String responseText='';
        String accessToken=getAuthToken();
        HttpRequest req = new HttpRequest();
        req.setMethod('GET');
        req.setEndpoint(JSonTestUrl);
        req.setHeader('Authorization', 'Bearer '+accessToken);
        Http http = new Http();
        try {
            HTTPResponse res = http.send(req);
            responseText = res.getBody();
            System.debug('COMPLETE RESPONSE: '+responseText);
        } catch(System.CalloutException e) {
            System.debug(e.getMessage());
        }
        return responseText;
    }

    // Posts the client credentials to the token endpoint and returns the access token
    public static String getAuthToken() {
        String responseText='';
        HttpRequest req = new HttpRequest();
        req.setMethod('POST');
        req.setEndpoint(TokenUrl);
        req.setHeader('Content-Type', 'application/x-www-form-urlencoded');
        String requestString='grant_type='+grant_type+'&client_id='+client_id+'&client_secret='+client_secret+'&resource='+resource;
        req.setBody(requestString);
        Http http = new Http();
        try {
            HTTPResponse res = http.send(req);
            responseText = res.getBody();
            System.debug('COMPLETE RESPONSE: '+responseText);
        } catch(System.CalloutException e) {
            System.debug(e.getMessage());
        }

        JSONParser parser = JSON.createParser(responseText);
        String accessToken='';
        while (parser.nextToken() != null) {
            if ((parser.getCurrentToken() == JSONToken.FIELD_NAME) &&
                (parser.getText() == 'access_token')) {
                // Move from the field name to its value
                parser.nextToken();
                accessToken = parser.getText();
                break;
            }
        }
        System.debug('accessToken: '+accessToken);
        return accessToken;
    }
}



Tips on cloud solutions integration with Azure Logic Apps

The cloud paradigm is a well established reality: old CRM systems are now replaced by Salesforce solutions, HR and finance systems by Workday, old Exchange servers and SharePoint intranets by Office 365, not to mention entire datacenters migrated to Amazon. At the same time, all the effort that was spent on premises to get these systems to communicate with each other (the glorious days of ESB solutions with TIBCO, WebSphere, BEA WebLogic, BizTalk, etc.) seems kind of lost in the new cloud world. Why?

Well, each cloud vendor (Salesforce first, I think) tries to position its own app store (AppExchange) with pre-built connectors or apps that quickly allow companies to integrate with other cloud platforms in a way that is not only cost effective but, most of all, supported by the app vendor and not by a system integrator (so paid for by the license).

In theory this should work; in practice we see a lot of goodwill from niche players on these apps, but little or no commitment from the big vendors.

Luckily, however, the best cloud solutions already provide rich and secure APIs to enable integration. It’s only a matter of connecting the dots, and here several “integration” cloud vendors are already positioning themselves: Informatica Cloud, Dell, SnapLogic, MuleSoft, etc. The Gartner quadrant for Integration Platform as a Service (iPaaS) represents the situation well.

But while Gartner produced the report in March 2015, Microsoft released a new kind of app on the Azure platform called the Azure Logic App.

What is the strength of this “service” compared to the others? Well, in my opinion it is that it sits on a proven, “de facto” standard platform that can scale as much as you want; it also gives you the possibility of writing your own connectors as Azure API apps; finally, you can create your integration workflows right in the browser, “no client attached!”. Of course it has many other benefits, but for me these three are so compelling that I decided to give it a try and start developing some POCs around it.

What can you do with them? Basically you can connect any system with any other system right from your browser!

Do you want to tweet something when a document is uploaded to your Dropbox? You can do it!

Do you want to define a complex workflow that starts from a message on the Service Bus, calls a BAPI on SAP and ends inside an Oracle stored procedure? It’s right there!

There is a huge list of connectors that you can use , and each of them can be combined with others in so many interesting ways!

Connectors have actions and triggers! Triggers, as the word says, are useful to react to an event (a new tweet with a specific word, a new lead on Salesforce, etc.) and they can be used in a push or pull fashion: either I’m interested in this event, so the connector will notify me when it happens, or I’m interested in this data, so I will periodically call the connector to check if there is new data.

Actions are simply methods that can be executed on the connector (send an email, run a query, write a post on Facebook, etc.).

An azure logic app is a workflow where you combine all these connectors using triggers and actions to achieve your goals.
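
To make the trigger and action vocabulary concrete, here is a toy model in Python. It is not the Logic Apps API, only a sketch of the idea under my own naming: PollingTrigger plays the pull-style trigger that reports new items, and run_workflow passes each new item through a chain of actions, where every action receives the output of the previous one:

```python
class PollingTrigger(object):
    """Pull-style trigger: returns only the items not seen in earlier polls."""

    def __init__(self, read_items):
        self._read_items = read_items  # callable returning the current items
        self._seen = set()

    def poll(self):
        fresh = [item for item in self._read_items() if item not in self._seen]
        self._seen.update(fresh)
        return fresh


def run_workflow(trigger, actions):
    """Feed each new trigger item through the chain of actions, in order."""
    results = []
    for item in trigger.poll():
        value = item
        for action in actions:
            value = action(value)
        results.append(value)
    return results


# Toy usage: a "folder" trigger followed by two actions.
folder = ["report.pdf"]
trigger = PollingTrigger(lambda: list(folder))
actions = [str.upper, lambda name: "New document: " + name]
print(run_workflow(trigger, actions))
```

Running the workflow a second time without adding anything to the folder produces no results, which is the behaviour you want from a pull trigger that is called periodically.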

How do they communicate with each other? I mean, how do I tell connector B, which is linked to A, to perform an action using A’s data? It is super simple! When you link two connectors, every action in the target connector that requires data shows pick lists where you can easily pick the source data! This is possible because each connector automatically describes its API schema using Swagger (this really rocks!).

And do you want to know the best part? If you write your own connector with Visual Studio, it will automatically generate the Swagger metadata for you! So in just 10 minutes you can have your brand new connector ready to use!

Added bonus: you automatically get a testing API made by Swagger!

The Azure website is full of references to quickly ramp up on this technology, so instead of a full tutorial I want to give you some useful tips for your Logic Apps journey.

Tip 1: You will see that published connectors ask you for some configuration values (Package Settings), and only after that does the connector become usable in your logic app. I tried with no success to do the same in Visual Studio with a custom API app, and the best I was able to find is that you can simulate this only if you use a deploy script from GitHub (look at the azuredeploy.json file, some examples here). At this stage, in fact, the deploy setup screen lets you set some configuration values that will never change once your Azure API app is published. The way this is done is to map deploy parameters to app settings like this:

"properties": {
  "name": "[parameters('siteName')]",
  "gatewaySiteName": "[parameters('gatewayName')]",
  "serverFarmId": "[resourceId('Microsoft.Web/serverfarms', parameters('svcPlanName'))]",
  "siteConfig": {
    "appSettings": [
      {
        "name": "clientId",
        "value": "[parameters('MS_client_id')]"
      },
      {
        "name": "clientSecret",
        "value": "[parameters('MS_client_secret')]"
      }
    ]
  }
}

Then you can use the usual ConfigurationManager.AppSettings to read these values in the code.

I guess this will be fixed (with the possibility of defining package settings) when publishing on the marketplace becomes available.

Tip 2: If you store credentials inside your custom API app, please note that by default API apps are published as public… So if this particular API app reads from your health IoT device, anyone in the world who knows or discovers the API address can call this API and read your health data! So set security to internal and not public!

Tip 3: The browser designer can sometimes be unstable and produce not exactly what you were hoping for, so always check the code view as well!
Tip 4: Azure API Apps have application settings that are editable on the Azure portal like normal Azure “Web” Apps, but they are hidden!
Look at this blog that saved me!

That’s it!