Digital Marketing in practice (Part 2)

In Part 1 we defined our landscape and the first challenges encountered in managing known customers.

We now want to investigate how to bring new customers to our e-commerce site and, more importantly, how to target, among potentially 7 billion people, the ones that have the highest chances of buying our fantastic product.

Ideally we would like to place advertisements on:

  1. search engines (Google, Bing, etc.)
  2. social networks (Facebook, Twitter, Snapchat, etc.)
  3. big content websites (msn.com, Yahoo, news websites, etc.)
  4. inside famous apps that can host advertisements
  5. etc.

How do we contact all these different “information publishers”, and how can we create a single “campaign” targeting all these “channels”?

Here we have to go into the DMP, DSP and SSP world and see how these platforms can help us reach our objectives.

Let’s explain this with an example: go now to this Yahoo page https://www.yahoo.com/news/ and you should see, almost immediately, an advertisement at the top of the page like this:

How and why was this advertisement placed there?

The “publishers”, like Yahoo, have a so-called “inventory” of places where ads can be positioned on each page, typically varying by time of day or day of the week. So they use a platform called an SSP (supply-side platform) to broadcast to the entire world the following message: “I have these open slots/positions on my pages, who pays the highest amount to buy them and place their own ads?”

On the other side of the fence there is another platform, called a DSP (demand-side platform), where “marketers” can say the following: “I want to pay this amount of money to place my banner on Yahoo pages”.

The process where “supply” and “demand” meet is called RTB, real-time bidding: thanks to it, the advertisement appearing on the Yahoo website is decided in real time. If you want to understand this more in depth, look at articles like this one; the key point is that in this way new customers can reach our e-commerce site by clicking on the banner.
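
To make the mechanics a bit more concrete, here is a minimal, purely illustrative Python sketch of a second-price auction, the kind of mechanism RTB exchanges have traditionally run; the bidder names and prices are invented, and real systems exchange OpenRTB bid requests/responses with many more fields:

# Purely illustrative sketch of a second-price auction (invented bidders and prices).
def run_auction(ad_slot, bids):
    """Highest bidder wins the impression and pays the second-highest price."""
    if not bids:
        return None
    ranked = sorted(bids, key=lambda b: b["cpm"], reverse=True)
    winner = ranked[0]
    clearing_price = ranked[1]["cpm"] if len(ranked) > 1 else winner["cpm"]
    return {"slot": ad_slot, "winner": winner["dsp"], "price_cpm": clearing_price}

bids = [
    {"dsp": "dsp_a", "cpm": 2.10},  # our DSP bidding to show the TheToy banner
    {"dsp": "dsp_b", "cpm": 1.75},
    {"dsp": "dsp_c", "cpm": 1.40},
]

print(run_auction("news_homepage_top_banner", bids))
# -> {'slot': 'news_homepage_top_banner', 'winner': 'dsp_a', 'price_cpm': 1.75}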

But now another question comes up: is this banner the same for all visitors? And if it is displayed to 1 million, 10 million or 100 million visitors, what is the price we have to pay?

This is the right time to explain the concept of an audience: before going to the DSP to search for “open inventory”, we first want to define which visitors or anonymous customers we want to target in our campaign; this way we can have an idea of how many of them could, in theory, see the banner.

But if these customers are “unknown”, how do I target them? Here the final piece of the puzzle comes into play: the DMP (data management platform). With a DMP we can “purchase” (or, better, rent) from third parties anonymous customer profiles based on browser cookies or smartphone device IDs, and pick only the ones that we consider the best.

So, for example, we select them using simple filters on the profile attributes.
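
Conceptually, the selection boils down to simple predicates over the rented profile attributes; here is a purely illustrative Python sketch (the attribute names, values and thresholds are invented):

# Purely illustrative audience selection over rented third-party profiles.
third_party_profiles = [
    {"cookie_id": "c1", "age_range": "25-34", "interests": {"toys", "gaming"}, "country": "IT"},
    {"cookie_id": "c2", "age_range": "45-54", "interests": {"finance"}, "country": "IT"},
    {"cookie_id": "c3", "age_range": "25-34", "interests": {"toys", "parenting"}, "country": "FR"},
]

def in_audience(profile):
    """Keep toy/parenting-interested people, aged 25-44, in the countries we ship to."""
    return bool(
        profile["age_range"] in {"25-34", "35-44"}
        and profile["interests"] & {"toys", "parenting"}
        and profile["country"] in {"IT", "FR", "DE"}
    )

audience = [p["cookie_id"] for p in third_party_profiles if in_audience(p)]
print(audience)  # ['c1', 'c3'] -> the estimated audience size drives the campaign cost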

Once our audience is prepared, we have an idea of how many potential customers we can reach and, with that, an estimate of the money we will spend if all of them see our banner and hopefully click on it.
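
To give a feel for the numbers (the figures here are purely hypothetical): display advertising is usually priced per thousand impressions (CPM), so reaching an audience of 5 million profiles once each at an assumed CPM of $2 would cost roughly 5,000,000 / 1,000 × $2 = $10,000.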

Now this is a pretty straightforward process, but it is not really super optimized…

In fact, we already have customers on our e-commerce site (the so-called first-party data) and we already know who “our best customers” are; it would make sense to find, on the DMP platform, the potential customers that are very similar to them, right?

This process exists and it is called look-alike modeling:
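
As a minimal sketch of the idea (not the DMP’s actual algorithm, and assuming both our best customers and the rented profiles have already been mapped to the same numeric features), a nearest-neighbour search can rank third-party profiles by similarity to our seed customers:

# Minimal look-alike sketch: rank anonymous profiles by similarity to our best customers.
# Feature values are invented; real DMPs use far richer signals and proprietary models.
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Each row: [avg_order_value, sessions_per_month, toy_category_affinity]
best_customers = np.array([
    [120.0, 8.0, 0.90],
    [ 95.0, 6.0, 0.80],
    [150.0, 9.0, 0.95],
])

anonymous_profiles = np.array([
    [110.0, 7.0, 0.85],   # looks a lot like our best customers
    [ 20.0, 1.0, 0.05],   # does not
    [140.0, 10.0, 0.90],
])

nn = NearestNeighbors(n_neighbors=1).fit(best_customers)
distances, _ = nn.kneighbors(anonymous_profiles)

# Smaller distance = more "look-alike"; keep the closest profiles for the campaign.
ranking = distances.ravel().argsort()
print(ranking)  # indices of the most similar anonymous profiles first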

So now we know that we also have to integrate the DMP world (which is already a world of its own, with connections to DSPs and third-party data providers) into our digital marketing landscape, in at least two directions:

  1. unknown -> known: campaign integration
  2. known -> unknown: look-alike modeling and the on-boarding process (bringing our first-party data into the DMP)

Be patient now 🙂 .

We will start designing our digital marketing landscape and architecture in part 3.

Digital Marketing in practice (Part 1)

You often hear these buzzwords: digital transformation, digital marketing, etc., but what do they really mean from a technology point of view?

I will try to give you my point of view (mainly looking at the architectures and components involved), and I hope this can be useful to a large audience.

In order to understand the digital marketing ecosystem, imagine that we are the founders of an innovative company that is launching an incredible new mini robot aimed mainly at kids; let’s call our product TheToy:

Now that we have a product to sell we need to prepare the following:

  1. An e-commerce website to sell this product
  2. A payment provider that helps us accept any credit card/bitcoin/bank transfer/PayPal/etc.
  3. A delivery/supply chain provider that helps us bring the goods to our customers
  4. Speaking of customers, a CRM system with an integrated call center
  5. …

This is of course just an oversimplified view of the needs of an e-commerce initiative, but it is already interesting because it lets us evaluate, for example, the following questions:

  • Our IT-related “stuff” (e-commerce, CRM, etc.) needs servers: where do I buy, host and maintain these servers?
  • Do I build the e-commerce website, the payment provider, the CRM, etc. from scratch, hiring some developers?

Since our focus should be creating a fantastic product and keeping customers happy, not being an IT company, we can make, for example, one of the following choices:

A) Pick a cloud provider that gives us all the necessary “virtual hardware” and bandwidth, pick an e-commerce software package, install/configure a huge amount of stuff, and hopefully end up with something (again, high level) like this:

B) Pick a software-as-a-service e-commerce provider that already has all of this in place, where we just have to upload our product catalog and start our e-commerce site immediately:

Now, the choice is not as simple as it seems, because every time we pick “the shortest route” there is a price to pay, not only in terms of the pricing of the solution, but also in terms of the functionality we may want later.

As we said, there are other pieces of the puzzle, like our CRM: not only do we have to pick the CRM that best serves our needs and budget, but it also has to be “somehow integrated” with our e-commerce site…

Before we dig into this as well, let’s imagine for a moment that, magically, our e-commerce, CRM, delivery/supply chain, etc. are already in place and we are selling/delivering our products successfully. What if we have a new product/offer to sell and we want to notify our customers?

This process of contacting customers in order to sell/advertise a new product/offer is called a “marketing campaign”, and of course we need a tool for that 🙂: a marketing campaign automation tool that helps us create targeted campaigns (we want to reach the right consumer for the right product…) and deliver those messages in multiple ways:

  • emails
  • SMS
  • push notifications, if we also have an e-commerce app
  • social accounts/pages (by the way, we need to set up those accounts too!)
  • personalized pages and messages on the e-commerce website, advertising the new product only to the “right customers”
  • personalized CRM responses (when a target customer calls, the CRM agent proposes the new offer/product only to these customers)
  • etc.

So now things start to get a bit more complicated: we need an e-commerce site integrated with the CRM, both integrated with a marketing campaign automation tool.

To have even more fun, let’s also consider the following: we said that we want to produce “targeted campaigns”, which means that we want to leverage all our customer data to target only the “right” ones…

What data can we leverage?

Some ideas:

  • The web logs of the e-commerce site (Google Analytics data, for example)
  • The orders of the e-commerce site
  • The calls/cases of the CRM
  • The responses/interactions with our previous campaigns
  • The interactions on our social channels
  • etc…

This, even at the small scale of our little e-commerce initiative, looks like a small data warehouse project, and if we plan to find the right customers for the right offer, it is also a data science project involving machine learning…

So we need an e-commerce site integrated with a CRM, integrated with a marketing campaign automation tool, integrated with a big data engine with machine learning capabilities.

In Part 2 we will start to look at another dimension of the digital marketing landscape: the unknown customers who are waiting to purchase our product but don’t know it exists, and whom we don’t know how to reach 🙂

Extract text from documents at scale in the Azure Data Lake

Hi, across all the content posted from Build 2017, I was really impressed by this presentation, where you can learn how to lift and shift almost any runtime to Azure Data Lake Analytics for large-scale processing.

Following those recommendations, I built a custom Extractor and Processor for U-SQL, leveraging the tika.net extractor, in order to extract as plain text all the content stored in files like PDF, DOCX, etc.

The idea is to address the following business scenario: you have a large collection of documents (Azure Data Lake Store capacity is unlimited): PDFs, XLS and PPT files, etc., and you want to quickly understand the information stored in all those documents without having to create/deploy your own cluster, in pure PaaS mode.

The sample code is built around the Visual Studio project template for U-SQL Applications.

In this demo we limited the maximum size of the extracted content to 128 KB in order to comply with this ADLA limit. The limit can be bypassed by working on byte arrays.

Next, I uploaded all the DLL binaries to the Data Lake Store and registered them as assemblies.


Then I launched a U-SQL job to extract the text from a collection of PDF documents, specifying 22 AUs (analytics units).

And in less than 2 minutes I had my whole collection of documents parsed into a single CSV file.

Now that the information is in text format, we can use the Azure Data Lake topics and keywords extensions to quickly understand what kind of information is stored inside our document collection.

The final result, which shows how keywords are linked to documents, can be visualized in Power BI with several nice visualizations.


And by clicking on one keyword we can immediately see which documents are linked to it.


Another way to visualize this is with a word cloud, where, this time for a specific document, we see which keywords are most representative of that document.

If you are interested in the solution and want to know more, send me a message on my Twitter handle.

My Top 2 Microsoft Build 2017 Sessions

Let’s start with Number 1: this is the visionary cloud that is arriving, compute nodes combined with FPGA “neurons” that act as hardware microservices, communicating and changing their internal code while attached directly to the Azure network, like a global neural network. Do you want to know more? Click here and explore the content of this presentation with the new video indexing AI, jumping directly to the portions of the video you like and searching for words, concepts and people appearing in it.

We can then look here at Number 2 (go to 1:01:54):

Matt Velloso teaching a robot (Zenbo) how to recognize the images it sees, using the Microsoft Bot Framework and the new custom image recognition service.

Do you want to explore more?

Go to Channel 9 and have fun exploring all the massive updates that have been released!

Enterprise Ready Jupyter Hub on the Linux Data Science VM

Hi everyone, this time we want to try to improve the data science experience of the Azure Linux Data Science VM with a special focus on accounts and authentication.

Data scientists, like all employees in large enterprises, have an Active Directory account, the one they use every day to log on to their laptops and access email. But when they have to use data science exploration tools like Jupyter notebooks, they have to remember specific local accounts created just for this purpose, and the IT department gets loaded with extra procedures and checks to manage these additional accounts on top of the usual Active Directory ones.

We will try to improve this experience by explaining how to set up a Linux Data Science VM joined to a managed domain, with JupyterHub authentication working against that very same domain.

The things to set up are the following:

  1. An Azure Active Directory, which usually mirrors the on-premises Active Directory structure and content automatically
  2. Azure Active Directory Domain Services with its own Classic VNET
  3. Another Resource Manager VNET where one or more Linux DS VMs will be deployed
  4. A peering between the two VNETs
  5. The packages needed for the Linux OS to join a managed domain
  6. The authentication module for Jupyter Hub that makes authentication happen against the managed domain

Why all these components and complexity?

Well, Azure Active Directory works mainly with the OAuth protocol, while OS authentication works with Kerberos tickets, which require an “old-fashioned” managed domain; Azure AD Domain Services is a way to have such a domain completely managed by Azure. In addition, Domain Services also gives us LDAP support, which is exactly what we need for JupyterHub.

The two VNETs are needed because Domain Services still requires a “classic” VNET, while the modern Linux DS VMs are deployed with Resource Manager templates. The peering between the two guarantees that they can see each other even though they are separate.


Now we will demonstrate step by step how to setup all the necessary components.

Step N. 1 Create an Azure Active Directory

Go to https://manage.windowsazure.com/ and click the +New button in the bottom-left corner, then go to App Services > Active Directory > Directory, and finally click Custom Create; here choose the name, domain name and country.

Pay attention to the country choice, because it determines in which datacenter your Active Directory will live.

Once done you should have something like mytestdomain.onmicrosoft.com.

Step N. 2 Create Azure Active Directory Domain Services with its own Classic VNET

Here simply follow this great Microsoft step-by-step tutorial, completing all 5 tasks. Do not forget, if you do not import from an on-premises AD, to add at least one user, change that user’s password and add it to the AAD DC Administrators group.

Step N.3 Create a Resource Manager based VNET

Here simply go to the new portal.azure.com and create a normal VNET, paying attention to choose an address space that does not overlap with the one of the previous VNET (so if you have chosen 10.1.0.0/24 for the classic VNET, pick 10.2.0.0/24 for the new one).

Step N.4 Define the peering between the two VNETs

Go to portal.azure.com, open the new VNET that you have just created, and enable the peering:


Step N. 5 Deploy and Configure the Linux DS VM

Again from portal.azure.com, add a new Linux Data Science VM (CentOS version) and, during the configuration, pay attention to pick as the VNET the latest one you created (the ARM-based one).

Once the VM is up, install the needed packages with this command:

yum install sssd realmd oddjob oddjob-mkhomedir adcli samba-common samba-common-tools krb5-workstation openldap-clients policycoreutils-python -y

Now go to /etc/resolv.conf and set up name resolution by putting there the domain name and the IP address of the Azure Domain Services (one of the two addresses).

Here is an example:

search mytestdomain.onmicrosoft.com
nameserver 10.0.0.4

Now join the domain with this command (change the user to the admin defined in Step 2):

realm join --user=administrator mytestdomain.onmicrosoft.com

Check that everything is OK with the command realm list.

Now modify /etc/sssd/sssd.conf, changing these two lines in the following way:

use_fully_qualified_names = False
fallback_homedir = /home/%u

and restart the sssd daemon with this command: systemctl restart sssd.

Now try to log in via SSH with the plain domain username (without @mytestdomain.onmicrosoft.com) and password, and everything should work.

Step N. 6 Configure JupyterHub

Add the LDAP connector with pip:

pip install jupyterhub-ldapauthenticator

Configure the JupyterHub configuration file in the following way (change the IP address and other parameters accordingly):

c.JupyterHub.authenticator_class = 'ldapauthenticator.LDAPAuthenticator'
c.LDAPAuthenticator.server_address = '10.0.0.4'
c.LDAPAuthenticator.bind_dn_template = 'CN={username},OU=AADDC Users,DC=mytestdomain,DC=onmicrosoft,DC=com'
c.LDAPAuthenticator.lookup_dn = True
c.LDAPAuthenticator.user_search_base = 'DC=mytestdomain,DC=onmicrosoft,DC=com'
c.LDAPAuthenticator.user_attribute = 'sAMAccountName'
c.LDAPAuthenticator.server_port = 389
c.LDAPAuthenticator.use_ssl = False
c.Spawner.cmd = ['/anaconda/envs/py35/bin/jupyterhub-singleuser']

Now, to troubleshoot and verify that everything works, kill the jupyterhub processes running by default on the Linux DS VM and try the following command (sudo is needed to launch JupyterHub in multi-user mode):

sudo /anaconda/envs/py35/bin/jupyterhub -f /path/toconfigfile/jupyterhub_config.py --no-ssl --log-level=DEBUG

Now try to authenticate by going to localhost:8000 with the domain username (without @mytestdomain.onmicrosoft.com) and password, and you should be able to log on to Jupyter with your AAD credentials.

Image recognition for everyone

Warning: I’m NOT a data scientist, but I am a huge fan of cool technology!

Today I want to write about a new functionality that amazes me and that can help you do “magic” things you might think are exclusive to expert data scientists versed in deep learning and frameworks like TensorFlow, CNTK, Caffe, etc.

Imagine the following: someone trains huge neural networks (think of them as mini brains) for weeks or months, using a lot of GPUs, on thousands and thousands of images.

These mini brains are then used to classify images and say something like: a car is present, a human is present, a cat, etc. Now, one of the “bad things” about neural networks is that usually you cannot understand how they really work internally and what their “thinking process” is.


However, recent studies on neural networks have found a way to “extract” this knowledge, and Microsoft has just delivered, this April, this knowledge, or better, these pre-trained models.

Now I want to show you an example of how to do this.

Let’s grab some images of the Simpsons:


and some other images of the Flintstones:


For example, 13 images of the Simpsons cartoon and 11 of the Flintstones. Let’s build a program that, given a new image that is not part of the two sets, can predict whether it is a Simpsons or a Flintstones image. I’ve chosen cartoons, but you can apply this to any images you want to process (watches? consoles? vacation places? etc.).

The idea is the following: I take the images I have and feed them to the pre-trained “model”. The result of this process is, for each image, a collection of “numbers” that are the representation of that image according to the neural network. An analogy to understand this: our DNA is a small fraction of ourselves but it can “represent” us, so these “numbers” are the DNA of the image.

Now that we have each image represented by a simple array of numbers, we can use a “normal” machine learning technique, like a simple linear model, to leverage this simplified representation and learn how to classify the images.
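
As a rough sketch of the same idea in Python (the original article uses R; here I assume the per-image feature vectors have already been extracted by the pre-trained network and saved to hypothetical features.npy / labels.npy files, and I use scikit-learn’s logistic regression as the simple linear classifier):

# Minimal sketch: classify images from pre-computed deep-learning feature vectors.
# Assumes features.npy / labels.npy were produced beforehand by the pre-trained
# featurization model (one feature row per image; label 0 = Simpsons, 1 = Flintstones).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

features = np.load("features.npy")   # shape: (n_images, n_deep_features)
labels = np.load("labels.npy")       # shape: (n_images,)

# Same split as in the post: 80% for training, 20% for scoring.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=42, stratify=labels
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("accuracy on the held-out images:", clf.score(X_test, y_test))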

Applying the sample R code described in the article to this small set of images (13 and 11 respectively), using 80% for training and 20% for scoring, we obtained the following result:


A good 75% accuracy on a very small number of images!

Pyspark safely on Data Lake Store and Azure Storage Blob

Hi, I’m working on several projects where it is required to access cloud storage (in this case Azure Data Lake Store and Azure Blob Storage) from pyspark running on Jupyter, while avoiding that all the Jupyter users access these storage accounts with the same credentials stored inside the core-site.xml configuration file of the Spark cluster.


I started my investigation looking at the SparkSession that comes with Spark 2.0, especially at commands like spark.conf.set("spark.sql.shuffle.partitions", 6), but I discovered that these commands do not work at the Hadoop settings level; they are limited to the Spark runtime parameters.

I then moved my attention to the SparkContext and in particular to its hadoopConfiguration, which seemed promising but is missing from the pyspark implementation…

Finally I was able to find this excellent Stackoverflow post that points out how to leverage the HadoopConfiguration functionality from pyspark.

So in a nutshell you can have the core-site.xml defined as follows:

[core-site.xml screenshot: no storage credentials configured in the file]

As you can see, we do not store any credentials there.

Let’s see how to access an Azure Storage blob container with a shared access signature (SAS), which can be created to grant access to one specific container (imagine it like a folder), giving an almost fine-grained security model on the Azure Storage account without sharing the Azure Blob Storage access keys.

If you love Python, here is some code that an admin can use to quickly generate a SAS signature that lasts for 24 hours:

from azure.storage.blob import (
    BlockBlobService,
    ContainerPermissions
)
from datetime import datetime, timedelta

account_name = "ACCOUNT_NAME"
account_key = "ADMIN_KEY"
CONTAINER_NAME = "CONTAINER_NAME"

# Connect with the account (admin) key: only the storage admin runs this part.
block_blob_service = BlockBlobService(account_name=account_name, account_key=account_key)

# Generate a read-only SAS token for the container, valid for the next 24 hours.
sas_token = block_blob_service.generate_container_shared_access_signature(
    CONTAINER_NAME,
    ContainerPermissions.READ,
    datetime.utcnow() + timedelta(hours=24),
)

print(sas_token)

You will obtain something like this:

sv=2015-04-05&st=2015-04-29T22%3A18%3A26Z&se=2015-04-30T02%3A23%3A26Z&sr=b&sp=rw&sip=168.1.5.60-168.1.5.70&spr=https&sig=Z%2FRHIX5Xcg0Mq2rqI3OlWTjEg2tYkboXr1P9ZUXDtkk%3D

You can refer to this link to understand the structure.
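
For a quick intuition of that structure, you can split the token into its query parameters; here is a small, purely illustrative snippet that decodes the sample token above (the field descriptions reflect the standard SAS parameter meanings):

# Illustrative: decode the fields of the sample SAS token shown above.
from urllib.parse import parse_qs

sas = ("sv=2015-04-05&st=2015-04-29T22%3A18%3A26Z&se=2015-04-30T02%3A23%3A26Z"
       "&sr=b&sp=rw&sip=168.1.5.60-168.1.5.70&spr=https"
       "&sig=Z%2FRHIX5Xcg0Mq2rqI3OlWTjEg2tYkboXr1P9ZUXDtkk%3D")

meaning = {
    "sv": "storage service version",
    "st": "start time",
    "se": "expiry time",
    "sr": "resource type (b = blob, c = container)",
    "sp": "permissions (r = read, w = write, ...)",
    "sip": "allowed IP range",
    "spr": "allowed protocol",
    "sig": "signature",
}

for key, values in parse_qs(sas).items():
    print(f"{key:>4} = {values[0]:<45} # {meaning.get(key, 'other')}")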

OK, now, once the Azure Storage admin provides us with the signature, we can use this SAS token to safely access the files in the Azure Storage blob container directly:

 
sc._jsc.hadoopConfiguration().set("fs.azure.sas.PUT_YOUR_CONTAINER_NAME.PUT_YOUR_ACCOUNT_NAME.blob.core.windows.net", "PUT_YOUR_SIGNATURE")

from pyspark.sql.types import *

# Load the data. We use the sample HVAC.csv file from the HDInsight samples
hvacText = sc.textFile("wasbs://PUT_YOUR_CONTAINER_NAME@PUT_YOUR_ACCOUNT_NAME.blob.core.windows.net/HVAC.csv")

# Create the schema
hvacSchema = StructType([
    StructField("date", StringType(), False),
    StructField("time", StringType(), False),
    StructField("targettemp", IntegerType(), False),
    StructField("actualtemp", IntegerType(), False),
    StructField("buildingID", StringType(), False)
])

# Parse the data in hvacText, skipping the header row
hvac = hvacText.map(lambda s: s.split(",")).filter(lambda s: s[0] != "Date").map(lambda s: (str(s[0]), str(s[1]), int(s[2]), int(s[3]), str(s[6])))

# Create a data frame
hvacdf = sqlContext.createDataFrame(hvac, hvacSchema)

# Register the data frame as a table to run queries against
hvacdf.registerTempTable("hvac")

# Query the registered table through the HiveContext
from pyspark.sql import HiveContext
hive_context = HiveContext(sc)
hvac_table = hive_context.table("hvac")
hvac_table.show()

The same idea can be applied to Data Lake Store. Assuming that you have your Data Lake credentials set up as described here, you can access Data Lake Store safely in this way:

 
sc._jsc.hadoopConfiguration().set("dfs.adls.oauth2.refresh.url", "https://login.microsoftonline.com/PUT_YOUR_TENANT_ID/oauth2/token")
sc._jsc.hadoopConfiguration().set("dfs.adls.oauth2.client.id", "PUT_YOUR_CLIENT_ID")
sc._jsc.hadoopConfiguration().set("dfs.adls.oauth2.credential", "PUT_YOUR_SECRET")

from pyspark.sql.types import *

# Load the data. The path below assumes Data Lake Store is the default storage for the Spark cluster
hvacText = sc.textFile("adl://YOURDATALAKEACCOUNT.azuredatalakestore.net/Samples/Data/HVAC.csv")

# Create the schema
hvacSchema = StructType([
    StructField("date", StringType(), False),
    StructField("time", StringType(), False),
    StructField("targettemp", IntegerType(), False),
    StructField("actualtemp", IntegerType(), False),
    StructField("buildingID", StringType(), False)
])

# Parse the data in hvacText, skipping the header row
hvac = hvacText.map(lambda s: s.split(",")).filter(lambda s: s[0] != "Date").map(lambda s: (str(s[0]), str(s[1]), int(s[2]), int(s[3]), str(s[6])))

# Create a data frame
hvacdf = sqlContext.createDataFrame(hvac, hvacSchema)

# Register the data frame as a table to run queries against
hvacdf.registerTempTable("hvac")

# Query the registered table through the HiveContext
from pyspark.sql import HiveContext
hive_context = HiveContext(sc)
hvac_table = hive_context.table("hvac")
hvac_table.show()

Happy coding with pyspark and Azure!