Extract text from documents at scale in the Azure Data Lake

Hi! Across all the content posted at Build 2017, I was really impressed by this presentation, where you can learn how to lift and shift almost any runtime to Azure Data Lake Analytics for large-scale processing.

Following those recommendations, I built a custom Extractor and Processor for U-SQL that leverages the Tika.NET extractor to pull out, as plain text, the content stored in files like PDF, DOCX, etc.

The idea is to solve the following business scenario: you have a large collection of documents (PDFs, XLS, PPTs, and so on; Azure Data Lake Store capacity is unlimited) and you want to quickly understand the information stored in all of them without having to create and deploy your own cluster, in pure PaaS mode.

Here is a sample of code built around the Visual Studio project template for U-SQL Applications.

As you can see in this demo, we limited the maximum size of the extracted content to 128 KB in order to comply with this ADLA limit. The limit can be bypassed by working on byte arrays.
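Just to make the idea concrete, here is what the core extraction logic looks like when sketched in Python with the tika package; this is only an illustration of the same logic, not the actual C# Tika.NET extractor that gets registered in U-SQL, and the file names are placeholders.

# Illustration only: the real extractor is C# code built on Tika.NET and plugged
# into U-SQL as a custom extractor. This Python sketch shows the same core logic:
# extract plain text from a document and cap the output at 128 KB.
from tika import parser  # pip install tika; it spins up a local Tika server on first use

MAX_BYTES = 128 * 1024  # the 128 KB limit mentioned above


def extract_text(path):
    """Extract plain text from a PDF/DOCX/PPTX file, truncated to 128 KB."""
    parsed = parser.from_file(path)                    # {'content': ..., 'metadata': ...}
    text = (parsed.get("content") or "").strip()
    return text.encode("utf-8")[:MAX_BYTES].decode("utf-8", errors="ignore")


print(extract_text("sample.pdf")[:500])                # quick smoke test on a local file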

I then uploaded all the DLL binaries to the Data Lake Store and registered them as assemblies.


Then I launched a U-SQL job to extract the text stored in a collection of PDF documents, specifying 22 AUs.

In less than two minutes the whole collection of documents was parsed into a single CSV file.

Now that the information is in text format, we can use the Azure Data Lake topics and keywords extensions to quickly understand what kind of information is stored inside our document collection.

The final result, showing how keywords are linked to documents, can be visualized in Power BI with several nice visuals.


Clicking on a keyword, we can immediately see which documents are linked to it.


Another way to visualize this is with a word cloud, where we can see, this time for a specific document, which keywords are most representative of it.

If you are interested in the solution and want to know more, send me a message on my Twitter handle.

Enterprise Ready Jupyter Hub on the Linux Data Science VM

Hi everyone, this time we want to try to improve the data science experience of the Azure Linux Data Science VM with a special focus on accounts and authentication.

Data scientists, like all employees in major enterprises, have an Active Directory account, the one they use every day to log on to their laptops and to access email. When instead they have to use data science exploration tools like Jupyter notebooks, they have to remember specific local accounts created just for that purpose, and of course the IT department is loaded with extra procedures and checks to manage these additional accounts on top of the usual ones in Active Directory.

We will try to improve this experience by explaining how to set up a Linux Data Science VM joined to a managed domain, with Jupyter Hub authentication working against the very same domain.

The things to set up are the following:

  1. An Azure Active Directory, which usually mirrors automatically the on-premises Active Directory structure and content
  2. Azure Active Directory Domain Services with its own Classic VNET
  3. Another Resource Manager VNET where one or more Linux DS VMs will be deployed
  4. A peering between the two VNETs
  5. The packages needed for the Linux OS to join a managed domain
  6. The authentication module for Jupyter Hub that makes authentication happen against the managed domain

Why all these components and complexity?

Well, Azure Active Directory works mainly with the OAuth protocol, while OS authentication works with Kerberos tickets, which require an "old fashioned" managed domain; Domain Services is a way to have such a domain completely managed by Azure. In addition, Domain Services also gives us LDAP protocol support, which is exactly what we need for Jupyter Hub.

The two VNETs are needed because Domain Services still requires a "classic" VNET, while the modern Linux DS VMs are deployed with Resource Manager templates. The peering between the two guarantees that they can reach each other even though they are separate.


Now we will demonstrate step by step how to set up all the necessary components.

Step N.1 Create an Azure Active Directory

Go to https://manage.windowsazure.com/ and click the +New button in the bottom-left corner, then click App Services > Active Directory > Directory, and finally Custom Create; here choose the name, domain name and country.

Pay attention to the country choice, because it determines which datacenter your Active Directory will live in.

Once done, you should have something like mytestdomain.onmicrosoft.com.

Step N.2 Create Azure Active Directory Domain Services with its own Classic VNET

Here simply follow this great Microsoft step-by-step tutorial, completing all 5 tasks. If you do not import users from the on-premises AD, do not forget to add at least one user, change that user's password, and add it to the AAD DC Administrators group.

Step N.3 Create a Resource Manager based VNET

Here simply go to the new portal.azure.com and create a normal VNET, paying attention to choose an address space that does not overlap with the one of the previous VNET (so if you chose 10.1.0.0/24 for the classic VNET, pick 10.2.0.0/24 for the new one).

Step N.4 Define the peering between the two VNETs

Go to portal.azure.com, open the new VNET you have just created and enable the peering.


Step N.5 Deploy and Configure the Linux DS VM

Again from portal.azure.com, add a new Linux Data Science VM (CentOS version), and during the configuration pay attention to pick as its VNET the latest one you created (the ARM-based one).

Once the VM is up, install the needed packages with this command:

yum install sssd realmd oddjob oddjob-mkhomedir adcli samba-common samba-common-tools krb5-workstation openldap-clients policycoreutils-python -y

Now edit /etc/resolv.conf and set up name resolution, putting in the domain name and the IP address of the Azure Domain Services (one of the two).

Here is an example:

search mytestdomain.onmicrosoft.com
nameserver 10.0.0.4

Now join the domain with this command (change the user to the admin defined in Step 2):

realm join --user=administrator mytestdomain.onmicrosoft.com

Check that everything is OK with the command realm list.

Now modify /etc/sssd/sssd.conf, changing these two lines in the following way:

use_fully_qualified_names = False
fallback_homedir = /home/%u

and restart the sssd daemon with systemctl restart sssd.

Now try to log in via SSH with the simple domain username (without @mytestdomain.onmicrosoft.com) and password, and everything should work.

Step N.6 Configure Jupyter Hub

Add the ldap connector with pip:

pip install jupyterhub-ldapauthenticator

Edit the Jupyter Hub configuration file in the following way (change the IP address and other parameters accordingly):

c.JupyterHub.authenticator_class = 'ldapauthenticator.LDAPAuthenticator'
c.LDAPAuthenticator.server_address = '10.0.0.4'
c.LDAPAuthenticator.bind_dn_template = 'CN={username},OU=AADDC Users,DC=mytestdomain,DC=onmicrosoft,DC=com'
c.LDAPAuthenticator.lookup_dn = True
c.LDAPAuthenticator.user_search_base = 'DC=mytestdomain,DC=onmicrosoft,DC=com'
c.LDAPAuthenticator.user_attribute = 'sAMAccountName'
c.LDAPAuthenticator.server_port = 389
c.LDAPAuthenticator.use_ssl = False
c.Spawner.cmd = ['/anaconda/envs/py35/bin/jupyterhub-singleuser']
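Before moving on, you can sanity-check the LDAP bind itself from Python, independently of Jupyter Hub. A minimal sketch, assuming the ldap3 package and the example values used above (server address, DN template, and a test user of your own):

# Minimal LDAP bind check against the managed domain (assumes: pip install ldap3,
# the example IP/DN used above, and a real test user/password of yours).
from ldap3 import Server, Connection

server = Server("10.0.0.4", port=389, use_ssl=False)
user_dn = "CN=myuser,OU=AADDC Users,DC=mytestdomain,DC=onmicrosoft,DC=com"

conn = Connection(server, user=user_dn, password="mypassword")
print("bind ok" if conn.bind() else conn.result)   # prints the LDAP error details on failure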

Now, to troubleshoot and verify that everything works, kill the jupyterhub processes running by default on the Linux DS VM and try the following command (sudo is needed to launch Jupyter Hub in multi-user mode):

sudo /anaconda/envs/py35/bin/jupyterhub -f /path/toconfigfile/jupyterhub_config.py --no-ssl --log-level=DEBUG

Now try to authenticate at localhost:8000 with the domain username (without @mytestdomain.onmicrosoft.com) and password, and you should be able to log on to Jupyter with your AAD credentials.

 

 

Image recognition for everyone

Warning: I'm NOT a data scientist, but I'm a huge fan of cool technology!

Today I want to write about a new functionality that amazes me and that can help you do "magic" things you might think are exclusive to expert data scientists versed in deep learning and frameworks like TensorFlow, CNTK, Caffe, etc.

Imagine the following: someone trains huge neural networks (think of them as mini brains) for weeks or months, using a lot of GPUs, on thousands and thousands of images.

These mini brains are then used to classify images and say something like: a car is present, a human is present, a cat, etc. Now, one of the "bad things" about neural networks is that usually you cannot understand how they really work internally and what their "thinking process" is.


However, recent studies on neural networks have found a way to "extract" this knowledge, and Microsoft has just delivered (in April) this knowledge, or rather these pre-trained models.

Now I want to show you an example of how to use this.

Let's grab some images of the Simpsons, and some other images of the Flintstones.

For example, 13 images of the Simpsons cartoon and 11 of the Flintstones. Let's build a program that, given a new image not part of the two sets, can predict whether it is a Simpsons or a Flintstones image. I've chosen cartoons, but you can apply this to any images you want to process (watches? consoles? vacation places? etc.).

The idea is the following: I take the images I have and feed them to the pre-trained "model". The result of this process is, for each image, a collection of "numbers" that are the representation of that image according to the neural network. An analogy to understand this: our DNA is a small fraction of ourselves but it can "represent" us, so these "numbers" are the DNA of the image.

Now that each image is represented by a simple array of numbers, we can use a "normal" machine learning technique like linear regression to leverage this simplified representation and learn how to classify the images.
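To make this concrete, here is a rough Python sketch of the same idea; it is not the R code used in the article, and the torchvision model, the file lists, and the choice of logistic regression as the simple classifier are my own illustrative assumptions:

# Rough sketch of featurization + a simple classifier (not the article's R code).
# Assumes torch/torchvision, Pillow and scikit-learn; paths and labels are illustrative.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image
from sklearn.linear_model import LogisticRegression

resnet = models.resnet18(pretrained=True)
resnet.fc = torch.nn.Identity()   # drop the last layer: the 512-number output is the image "DNA"
resnet.eval()

preprocess = T.Compose([T.Resize(256), T.CenterCrop(224), T.ToTensor(),
                        T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])])

def featurize(path):
    with torch.no_grad():
        return resnet(preprocess(Image.open(path).convert("RGB")).unsqueeze(0)).squeeze().numpy()

simpsons_paths = ["simpsons/01.jpg", "simpsons/02.jpg"]           # ...your 13 images
flintstones_paths = ["flintstones/01.jpg", "flintstones/02.jpg"]  # ...your 11 images

X = [featurize(p) for p in simpsons_paths + flintstones_paths]
y = [0] * len(simpsons_paths) + [1] * len(flintstones_paths)      # 0 = Simpsons, 1 = Flintstones

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict([featurize("new_image.jpg")]))                  # which cartoon is the new image?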

Applying the sample R code described in the article to this small set of images (13 and 11 respectively), using 80% for training and 20% for scoring, we obtained the following result:


A good 75% on a very small number of images!

 

 

E2E Product Info Bot Demo

Hi! This time I wanted to investigate the possibility of using a chat bot to help any e-commerce website answer product-related questions (and, if possible, even more complicated ones), and also, from a back-end perspective, to track what is happening: how our bot is performing, which products are the most requested, how consumers feel about the products, and of course all of this in real time!

So let's start from the basics: what is the product we want to sell?

Since I'm a huge Fallout fan and I'm playing Fallout 4 right now, I want to introduce you to the fantastic world of… Nuka Cola!

Yeah, you will love it! The smell, the flavor and the radiation will make it the next most-wanted beverage in the world!

We have different types of Nuka Cola (Wild, Orange, etc.) and we want to explain the characteristics of each variant to the users of our bot.

I will leverage several technologies, mainly Microsoft based, but you can achieve the same with many other bot/analytics technologies (check here).

Let’s start with the bot itself that we will build with Microsoft Bot Framework.

You need at least: a Hotmail/Live/Outlook.com account, an Azure subscription linked to it, and a developer account for each channel you want to use for your bot (a Facebook developer account, for example).

I reused a VM that already had Visual Studio 2015 and, after upgrading it to the latest patch level, I installed the Bot Framework Visual Studio project template and the Bot Framework Emulator to test the bot locally.

The procedure is well explained here.

Now that we have a bot running locally, we want to add "some intelligence", right?

The intelligent service that will help us build a bot able to understand human language is LUIS, where we will define what our bot is able to understand and which concepts it can distill from a message.

First we need to create a new LUIS application (which will later become a simple API endpoint that we have to call) and define at least three things: intents, action parameters and entities. These concepts are easily explained with our example:

Intent: I want to understand which type of Nuka Cola (the action parameter) the consumer is interested in.

Action Parameter: the parameter NukaColaType has to match an Entity.

Entity: List of all possible Nuka Cola variants

So what we will do on LUIS is the following:

  1. Create an Entity called NukaColaVariants
  2. Create an Intent called ProductInfoRequest
  3. Define inside ProductInfoRequest a mandatory Action Parameter called NukaColaType that matches the Entity NukaColaVariants

Here is a screenshot:

And now? Well, now the magic happens! Click on new utterances and start typing possible requests that your Nuka Cola customers will probably make, like:

“Tell me more about nuka cola dark” or

“What about nuka cola orange?”

After you record an utterance, you define its intent and then highlight the part of the phrase containing the action parameter and the matching entity.

In this way, utterance after utterance, LUIS will learn automatically; more importantly, it will be smart enough to understand something like "how does nuka cola quartz taste like?" even if we never typed that exact utterance into LUIS.

You can also look at the LUIS errors and help it understand phrases it was not able to classify properly.

Once the LUIS model looks reasonably OK (I inserted 3-4 different utterances to start), you can publish the LUIS app; this gives you a full API URL containing your application ID and subscription key.
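If you want to sanity-check the published endpoint outside the bot, it is just an HTTP call. A minimal Python sketch, assuming the v2.0 endpoint format of the time and placeholder region/IDs/keys (adapt them to whatever the publish page actually gives you):

# Minimal sketch of calling a published LUIS endpoint directly (placeholder region,
# app ID and key; the exact URL format depends on your LUIS region/version).
import requests

luis_url = "https://westeurope.api.cognitive.microsoft.com/luis/v2.0/apps/<your-app-id>"
params = {"subscription-key": "<your-key>", "q": "how does nuka cola quartz taste like?"}

result = requests.get(luis_url, params=params).json()
print(result["topScoringIntent"])   # e.g. {'intent': 'ProductInfoRequest', 'score': 0.97}
print(result["entities"])           # the NukaColaVariants entity distilled from the phrase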

Now, bringing all this "intelligence" into our bot program will be super complicated, right? API calls, object mappings, parameters in, parameters out, try/catch…

Actually, it is super easy using the LuisDialog class; check here or here for the details!

Our bot is now answering amazingly in the emulator, but we want to use it live, right?

So from VS2015 we publish the bot to our Azure subscription, register it on the bot developer portal and add/configure the channels we want (I opted for Telegram and Facebook).

All the steps are described here.


Do you want to try?

Go to the nukabot page on Facebook to chat with my bot, or try the Telegram bot.

OK, now we have our bot answering on Facebook and Telegram like a champion, but if this were really a bot answering hundreds of requests per hour we could not watch all the chats to get an idea of what is happening.

We need some analytics that processes the information in real time and gives us an idea of what is going on.

I created this Power BI dashboard:

Thanks to this dashboard I can see in real time what is going on and how my customers feel about my Nuka Cola products!

Let’s see how to do that.

The first step is Logic Apps:

What we do here is create a listening endpoint that accepts calls from our bot, calls the Text Analytics API on Azure to understand the sentiment of the text typed by the customer, and finally sends all the information to an event hub.

In order to send information from Logic Apps to an event hub, you have to deploy the EventHubApi app published here using the magic "Deploy to Azure" button, and then discover the API with the "Show APIs for App Services in the same region" option when you add an action.

You can start a free trial of the Text Analytics API simply by clicking this link (provided you are logged in to your Azure subscription).

Since it is essentially a simple HTTP endpoint processing JSON inputs and outputs, you can use a simple HTTP connector.
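For clarity, this is essentially the call the Logic App makes through its HTTP step; a hedged Python sketch against the Text Analytics v2.0 sentiment endpoint, with a placeholder region and key:

# The same sentiment call the Logic App's HTTP step performs, shown from Python
# (assumes the Text Analytics v2.0 sentiment endpoint; region and key are placeholders).
import requests

endpoint = "https://westeurope.api.cognitive.microsoft.com/text/analytics/v2.0/sentiment"
headers = {"Ocp-Apim-Subscription-Key": "<your-text-analytics-key>"}
body = {"documents": [{"id": "1", "language": "en",
                       "text": "Tell me more about nuka cola dark, I love it!"}]}

score = requests.post(endpoint, headers=headers, json=body).json()["documents"][0]["score"]
print(score)   # 0.0 = very negative, 1.0 = very positive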

A Stream Analytics job will process the information coming from the event hub in real time and populate the data in Power BI.

Stream Analytics jobs simply take one input (in this case the event hub), transform it using a query, and put the results into an output (in this case a Power BI dataset).

The query we used is the following:

SELECT
    NukaColaType,
    Channel,
    System.Timestamp as time,
    COUNT(*) AS [Count],
    AVG(score) AS [Score]
INTO
    OutputBI
FROM
    InputFromHub TIMESTAMP BY [timestamp]
GROUP BY
    NukaColaType,
    Channel,
    TumblingWindow(second, 5)

So every 5 seconds we take all the messages, group them by Nuka Cola type and channel, and compute the count of events and the average sentiment score.

That’s what we will see in real time on the dashboard.

Why did we add Logic Apps to this? Could we call the event hub directly from the bot? Why use the score only for analytics, when it can and should be used to provide better feedback to the consumer in real time?

These and many others are great questions; some answers:

  1. I like Logic Apps because it exposes a single endpoint, and behind the scenes, with zero code, I can create "monster" workflows that do as many things as I want
  2. Using the score in real time is a great idea, but I still cannot think of a way to have LUIS and sentiment playing together nicely (I probably have to study more)

Now some clarifications related to data and data retention:

  1. I do not store any user identifier (my scope is only to understand if the bot is responding well; I do not care who is actually writing)
  2. All the data is stored in Power BI at an aggregated level (as you have seen with the query)
  3. The detailed data in the event hub is cleared automatically every day, without any backup policy.

So, at the end of the day, I only observe stats about bot responses, and I can look into the LUIS errors to improve the bot's answers.

Extending Salesforce Search with Azure Search using Logic Apps

Hi! This time we want to bring the power of Azure Search inside the Salesforce platform, with the objective of directly indexing Salesforce entities.

Why? Isn't Salesforce search enough? Of course Salesforce's search capabilities are great, but we can make them even greater with things like fuzzy search, suggestions, etc. that Azure Search offers out of the box.

How to achieve this? We can build a minimal project that quickly demonstrates how to add this functionality with ZERO impact on existing Salesforce entities (no additional triggers or changes to your existing code), just a couple of additional Visualforce pages if you want.

The idea is to use Azure Logic Apps to detect record changes in Salesforce objects and refresh the Azure Search index when that happens.

Step 1: Create the Azure Search Index Definition

We can use the REST API or the Azure portal directly; let's see the raw REST API call:


POST https://yoursearchservice.search.windows.net/indexes?api-version=2015-02-28
api-key: [yourApiKey]
Content-Type: application/json

{
  "name": "myindex",
  "fields": [
    {"name": "id", "type": "Edm.String", "key": true, "searchable": false},
    {"name": "FirstName", "type": "Edm.String"},
    {"name": "LastName", "type": "Edm.String"},
    {"name": "Email", "type": "Edm.String"}
  ],
  "suggesters": [
    {
      "name": "sg",
      "searchMode": "analyzingInfixMatching",
      "sourceFields": ["LastName"]
    }
  ]
}

As you can see, it is a really simple example with just the Salesforce ID, first name, last name and email indexed (we plan to index Contact objects). We add the suggester on the last name, so you can verify that, even if you misspell a surname, Azure Search will still give you a good hint about the person you wanted to find.
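You can test the suggester from outside Salesforce with a plain HTTP call; here is a small Python sketch against the suggest API of the index defined above (service name and api-key are the placeholders used earlier, and the misspelled surname is just an example):

# Quick test of the suggester defined above, run from Python rather than Apex.
# Service name and api-key are placeholders; fuzzy=true is what tolerates misspellings.
import requests

url = "https://yoursearchservice.search.windows.net/indexes/myindex/docs/suggest"
params = {"api-version": "2015-02-28", "suggesterName": "sg",
          "search": "Macdonal",                      # surname misspelled on purpose
          "fuzzy": "true", "$select": "FirstName,LastName,Email"}
headers = {"api-key": "[yourApiKey]"}

for hit in requests.get(url, params=params, headers=headers).json()["value"]:
    print(hit["@search.text"], "->", hit["LastName"])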

Step 2: Create an Azure Logic App

Here is another simple process to put in place, with some minor complications due to the fact that the Search REST API still does not expose a good Swagger definition, so we have to go direct with HTTP. We first log in to Salesforce with the Salesforce connector, then we pick Contact as the object to watch.


Here is the code view of the Azure Logic App:


{
  "$schema": "https://schema.management.azure.com/providers/Microsoft.Logic/schemas/2016-06-01/workflowdefinition.json#",
  "actions": {
    "HTTP": {
      "inputs": {
        "body": "{'value': [{'@search.action': 'upload','id': '@{triggerBody()?['Id']}','FirstName': '@{triggerBody()?['FirstName']}','LastName': '@{triggerBody()['LastName']}','Email': '@{triggerBody()?['Email']}' }]}",
        "headers": {
          "Content-Type": "application/json",
          "api-key": "yourapikey"
        },
        "method": "POST",
        "uri": "https://yoursearchservice.search.windows.net/indexes/myindex/docs/index?api-version=2015-02-28"
      },
      "runAfter": {},
      "type": "Http"
    }
  },
  "contentVersion": "1.0.0.0",
  "outputs": {},
  "parameters": {
    "$connections": {
      "defaultValue": {},
      "type": "Object"
    }
  },
  "triggers": {
    "When_an_object_is_modified": {
      "inputs": {
        "host": {
          "api": {
            "runtimeUrl": "https://logic-apis-westeurope.azure-apim.net/apim/salesforce"
          },
          "connection": {
            "name": "@parameters('$connections')['salesforce']['connectionId']"
          }
        },
        "method": "get",
        "path": "/datasets/default/tables/@{encodeURIComponent(encodeURIComponent('Contact'))}/onupdateditems"
      },
      "recurrence": {
        "frequency": "Minute",
        "interval": 1
      },
      "splitOn": "@triggerBody()?.value",
      "type": "ApiConnection"
    }
  }
}

So what are we doing here?
Every minute we check whether any contact has changed, and if so we update the Azure Search index with the new information.
Note the splitOn option we get out of the box with this kind of trigger: if more than one record changes during the minute, it invokes Azure Search automatically for each of those records without us doing anything more!

Step 3: Enjoy it!

Now you can build any Visualforce page you like to call the Azure Search API and leverage the incredible features this technology offers!

Try the Azure Search suggestions API with some Ajax to propose suggestions while the user types; your call center users will love it!

Tip: you can call Azure Search using plain HTTP requests with the API key in the header, and use the JSON deserialization utilities available in Apex to convert the responses into objects that you can map to the Visualforce tags.
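For reference, the search call itself looks like this (sketched in Python for brevity; inside Salesforce it would be the equivalent Apex HTTP callout with the JSON response deserialized into Apex objects):

# The plain search call (Python sketch; in Salesforce this becomes an Apex callout).
import requests

url = "https://yoursearchservice.search.windows.net/indexes/myindex/docs"
params = {"api-version": "2015-02-28", "search": "smith", "$top": "10"}
headers = {"api-key": "[yourApiKey]"}

for doc in requests.get(url, params=params, headers=headers).json()["value"]:
    print(doc["FirstName"], doc["LastName"], doc["Email"])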

 

The Customer Paradigm Executive

A fun article this time!

Contact me or comment if you want to apply 😂😂😂 for the role!

Customer Paradigm Executive

Candidates must be able to mesh e-business users and grow proactive networks and evolve 24/7 architectures as well.

This position requires you to syndicate innovative infomediaries and morph them with viral metrics in order to synergize back-end convergence.

The objective is to deliver brand wearable eyeballs and architect granular eyeballs deployed to B2B markets incentivizing leading-edge e-business and disintermediating 24/7 relationships.

You will also have to aggregate revolutionary communities and whiteboard end-to-end systems in order to orchestrate dynamic convergence and effectively monetize efficient interfaces that can morph scalable e-markets.

Indexing SQL Data Warehouse with Azure Search and consume it with Salesforce

SQL Data Warehouse is one of the brand new PaaS services that Microsoft offers as part of its Cortana Analytics suite. If we want to describe it quickly, we can say that it is the Azure equivalent of Amazon Redshift, with some differences of course, but in essence it is a cloud MPP database that can be scaled quickly on demand, can ingest terabytes of data and, leveraging its multi-node architecture, can process and query this data in seconds.

One of the limitations of these MPP systems is that they cannot handle a high number of concurrent queries: usually the maximum is around 20-30, and when that number is reached new queries are parked in queues (the query queue processing logic is more complicated than this, but that is the essence). This is fine if you plan to expose your data warehouse only to a few analysts, but it becomes a problem if you want to make the data available to other endpoints and consumers.

An example: imagine that you store in your DWH all your customer data, interactions and orders collected across multiple systems, and you want your call center agents to access this information in order to correctly support your customers. You want a quick-response search service called by several clients to look up customer data, and while this is a task that a classic relational database can perform with no problems, as said it is a problem for an MPP database.

Luckily, the Azure PaaS offering has multiple ways to solve this problem. One way is to use Azure Search to index the SQL Data Warehouse customer data and offer the Azure Search API to the call center systems as the service providing the search capabilities needed.

Azure Search also has the Indexer functionality, which with pure configuration can automatically index a portion of a database. Looking at the Azure Search documentation it seems that SQL Data Warehouse is not supported, but trying the Azure SQL indexer I had no problems performing the configuration. Following the mentioned documentation, I was able to schedule the indexing process and, using the High Water Mark change detection policy (I had a timestamp field), to process the data progressively as it was updated or added to the DWH.
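In raw REST terms, the configuration boils down to a data source with a high water mark policy plus an indexer; here is a hedged Python sketch (the service name, table, column, index name and api-version are all illustrative assumptions, and the portal wizard produces the equivalent):

# Sketch of the same setup through the REST API (names, table, column and api-version
# are illustrative; the portal wizard produces the equivalent objects).
import requests

svc = "https://yoursearchservice.search.windows.net"
headers = {"api-key": "[yourApiKey]", "Content-Type": "application/json"}

datasource = {
    "name": "dwh-customers",
    "type": "azuresql",   # SQL Data Warehouse is reached through the Azure SQL indexer
    "credentials": {"connectionString": "<your DWH connection string>"},
    "container": {"name": "dbo.Customers"},
    "dataChangeDetectionPolicy": {
        "@odata.type": "#Microsoft.Azure.Search.HighWaterMarkChangeDetectionPolicy",
        "highWaterMarkColumnName": "LastModified"   # the timestamp field mentioned above
    }
}
indexer = {"name": "dwh-customers-indexer", "dataSourceName": "dwh-customers",
           "targetIndexName": "customers", "schedule": {"interval": "PT1H"}}

requests.post(svc + "/datasources?api-version=2016-09-01", json=datasource, headers=headers)
requests.post(svc + "/indexers?api-version=2016-09-01", json=indexer, headers=headers)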

If you need to launch indexing on demand, you can leverage the Azure Search REST API and call the Run Indexer operation; combined with an Azure Logic App, this lets you update the index whenever you need according to your workflow design (event based, driven by another schedule, etc.).
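The on-demand run is a single POST; a short sketch reusing the hypothetical indexer name from the previous snippet:

# On-demand run of the indexer (single POST; indexer name reused from the sketch above).
import requests

resp = requests.post(
    "https://yoursearchservice.search.windows.net/indexers/dwh-customers-indexer/run"
    "?api-version=2016-09-01",
    headers={"api-key": "[yourApiKey]"},
)
print(resp.status_code)   # 202 Accepted means the run has been queued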

Consuming this service from call center software like Salesforce Service Cloud can be done with callouts to the Azure Search REST API; a good example is this one, just change the JSON generation part so that instead of querying Salesforce accounts you actually call the Azure Search REST API. You actually have to make two calls: one for autocomplete (the suggestions API), which provides suggestions (so even if you make a typo the fuzzy search will help you find the right customer), and one to the search API, which gives you the customer detail data coming from the index.

One important thing to note: Salesforce callouts have concurrency limits, but these can be worked around with some tricks as described here.