Text Analytics with Facebook FastText

I recently I had to work a very interesting challenge, imagine a very large collection of phrases/statements and you want to derive from each of them the key ask/term .

An example can be a large collection of emails, scan the subjects and understand what was the key topic of the email, so for example an email with the following subject : “Dear dad the printer is no more working….” we can say that the key is “printer“.

can27t

If you try to apply generic key phrases algorithms , they can work pretty well in a very general context, but if your context is specific they will miss several key terms that are part of the dictionary of your context.

897

I successfully used facebook fasttext for this supervised classification task, and here is what you need to make it work :

  1. A Virtual Machine Ubuntu Linux 16.04 (is free on a Macbook with Parallels Lite)
  2. Download and compile fast text as described here
  3. A training set with statements and corresponding labels
  4. A validation set with statements and corresponding labels

So yes you need to manually label a good amount of statements to make fasttext “understand” well your dataset. 

Of course you can speed up the process transforming each statement into an array of the words and massively assign those labels where it makes sense ( python pandas or just old SQL can help you here).

Let’s see how to build the training set:

create a file called train.csv and here insert each statement in a line in the following format __label__labelname here write your statement.

Let’s make an example with the email subject used before:

__label__printer Dear dad the printer is no more working

You can have also multiple labels for the same statement, let’s see this example:

__label__printer __label__wifi Dear dad the printer and the wifi are dead

The validation set can be another file called validation.csv filled exactly in the same way, and of course you have to follow the usual good practices to split correctly your labeled dataset into the training dataset and validation dataset.

In order to start the training with fasttext you have to type the following command:

./fasttext supervised -input data/train.csv -output result/model_1 -lr 1.0 -epoch 25 -wordNgrams 2

this assumes that you are with terminal inside the fasttext folder and the training file is inside a subfolder called data , while the resulting model will be saved in a result folder.

I added also some other optional arguments to improve the precision in my specific case, you can look at those options here.

Once the training is done (you will understand why is called fasttext here!) , you can check the precision of the model in this way:

./fasttext test result/model_1.bin data/valid.csv

In my case I obtained a good 83% 🙂

aaeaaqaaaaaaaaitaaaajde5ngnjngm3lwqwnwmtndm5yi04nzfmlta1mdeyngfjmzuxzg

If you want to test your model manually (typing sentences and obtaining the corresponding labels) , you can try the following command:

./fasttext predict result/model_1.bin –

Fasttext has also python wrappers , like this one I used and you can leverage this wrapper to perform a massive scoring like I did here:


from fastText import load_model
import pandas as pd
fastm=load_model('/result/model_1.bin')
k = 1
df=pd.read_csv("data/datatoscore.csv")
df.insert(2,'label','')
for index, row in df.iterrows():
labels, probabilities = fastm.predict(str(row["short_statement"]), k)
for w, f in zip(labels, probabilities):
row["label"]=w
df.to_csv("data/finalresult.csv")

You can improve the entire process in many different ways, for example you can use the unsupervised training to obtain word vectors for you dataset , use this “dictionary” as base for your list of labels and use the nearest neighbor to find similar words that can grouped into single labels when doing the supervised training.

fb

Annunci

Let’s dig in our email!

As many of you, even if we are almost in 2018, I still work A LOT using emails and recently I was asking myself the following question what if I can leverage analytics and also machine learning to have a better understanding of my emails?

text-analytics

Here is a quick way to understand who is inspiring you more

positive-attitude

and who are instead the ones spreading a bit more negativity in your daily job 🙂

workplace-negativity-56a0f2bc5f9b58eba4b5761e

You will need (if you want to process ALL your emails in one shot!) :

  1. Windows 7/8/10
  2. Outlook 2013 or 2016
  3. Access 2013 or 2016
  4. An Azure Subscription
  5. A data lake store and analytics account
  6. PowerBI Desktop or any other Visualization Tool you like (Tableau or simply Excel)

Step 1 : Link MS Access Tables to your Outlook folders as explained here

Step 2: Export from Access to csv files your emails.

Step 3: Upload those files to your data lake store.

Step 4: Process the fields containing text data with the U-SQL cognitive extensions and derive sentiment and key phrases of each email

Step 5: With PowerBI Desktop you can access the output data sitting into the data lake store as described here

Step 6: Find the senders with highest average sentiment and the ones with the lowest one 🙂 .

job-well-done-clipart-1

If you are worried about leaving your emails in the cloud, after obtaining the sentiment and key phrases , you can download this latest output and remove all the data from data lake store , using this (local) file as input for power bi desktop.

In addition to this I would also suggest to perform a one way hash of the sender email address and upload to the data lake store account the emails with this hashed field instead of the real sender.

wekk4

Once you have the data lake analytics job results you can download them and join locally in Access to associate again each email to the original sender.

 

From Windows 10 to Mac OS Sierra without admin privileges

Hi everyone, lately thanks to my manager and my new employer I was able to switch from a Windows 10 laptop to a shiny mac book pro  and I want to share with you some tips and tricks that probably you will encounter if you will do the same. First let’s start with the basics: why I have chosen to switch? Well I always (since 2009) had only Apple devices at home and I always loved the consistency and the “stability” of the Apple devices, but I never had the opportunity to actually “work” with a Mac , so this is also a learning for me. If you actually never used a Mac the first obstacle will be shortcuts like CTRL+C and CTRL+V , the mouse clicks (actually the right click on the track pad), the scrolling with 2 fingers on the trackpad and now the shiny and mysterious touch bar. Passed this first shock, you will quickly get used to the magic search experience of spotlight, the backup for dummies of time machine and the well known experience of the App Store.

Now let’s focus on the work related stuff: you can finally have on a mac also office 2016 but it is miles and miles away from the functionalities and easy of use of Office 2016 on windows, not super evident differences but if you use office professionally you will quickly find the missing pieces.

Solution ? Go the App Store, purchase Parallels Lite and enjoy Linux and Windows Virtual Machines. You will have VMs without being admin because Parallels Lite uses the native hypervisor available on Mac since Yosemite.

Thanks to this I was able to have back also several “life saving” applications that I use daily like PowerBi Desktop, SQL Server Management Studio and Visual Studio 2017. To be honest they have their versions in the mac world but the functionalities that are missing in those versions are too numerous to live only with that.

So I ended up having a windows 10 VM full of software, so why don’t use directly windows? Well , with the windows VM i can exactly use windows for the apps that are running great on that platform and if the system starts to be unstable I can still normally work on my mac without losing my work while windows does his own “things” 🙂 .

When needed I leverage an ubuntu VM with docker  and vs code with the same segregation of duties principle (main OS fast and stable, guest OS with rich and dedicated software).

Now I work several times in this way : sql server hosted on linux, I do import/export of external data easily with Sql server management studio from windows and I run pyspark notebooks on docker accessing the same data and finally I do visualizations with power bi desktop on windows.

In case, like me, you have strict policies around admin accounts , I want to share with you this: do you remember the concept of portable apps in windows? Well on the mac you can do the same with some (not all) the applications that are outside the App Store (you can install almost all the apps in the App Store without admin privileges).

The technique to have an application on mac “portable” is simply the double extraction of the pkg files and Payload files to one folder that you can access (like your desktop), you can check the details here and here and basically run those applications from the locations that you like.

The exceptions will be :

  1. Applications not signed by a recognized and well know developer or software house
  2. Applications that on start up will ask you to install additional services
  3. Applications that before being launched require the registration of specific libraries/frameworks

There are cases (like azure machine learning workbech ) where the installer it’s actually writing everything in you user account folders but the last step will be the copy of the UI app to the Applications folder and this will fail if you are not admin. The solution is to look a bit inside the installer folders and find inside the json files the location of the downloaded packages . Once you find the URL of the missing one (use the installer error message to help you to find the package he was not able to copy) , download it locally and execute the app from any location, it should work without problems.

 

Jazoon 2017 AI meet Developers Conference Review

Hi I had the opportunity to participate to this conference in Zurich on the 27 October 2017 and attend to the following sessions:

  • Build Your Intelligent Enterprise with SAP Machine Learning
  • Applied AI: Real-World Use Cases for Microsoft’s Azure Cognitive Services
  • Run Deep Learning models in the browser with JavaScript and ConvNetJS
  • Using messaging and AI to build novel user interfaces for work
  • JVM based DeepLearning on IoT data with Apache Spark
  • Apache Spark for Machine Learning on Large Data Sets
  • Anatomy of an open source voice assistant
  • Building products with TensorFlow

Most of the sessions have been recorded and they are available here:

https://www.youtube.com/channel/UC9kq7rpecrCX7S_ptuA20OA

The first session has been a more a sales/pre-recorded demos presentation of SAP capabilities in terms of AI mainly in their cloud:

1

But with some interesting ideas like the Brand Impact Video analyzer that computes how much airtime is filled by specific brands inside a video:

2

And another good use case representation is the defective product automatic recognition using image similarity distance API:

3

The second session has been around the new AI capabilities offered by Microsoft and divided into two parts:

Capabilities for data scientists that want to build their python models

  • Azure Machine Learning Workbench that is an electron based desktop app that mainly accelerates the data preparation tasks using “a learn by example” engine that creates on the fly data preparation code.

4

  • Azure Notebooks a free but limited Cloud Based Jupyter Notebook environment to share and re-use models/notebooks

5

  • Azure Data Science Virtual Machine a pre-built VM with all the most common DS packages (TensorFlow, Caffe, R, Python, etc..)

6

Capabilities (i.e. Face/Age/Sentiment/OCR/Hand written detection) for developers that want to consume Microsoft pre-trained models calling directly Microsoft Cognitive API

7

8

The third session has been more an “educational presentation” around deep learning, and how at high level a deep learning system work, however we have seen in this talk some interesting topics:

  • The existence of several pre-trained models that can be used as is especially for featurization purposes and/or for transfer learning

9

  • How to visualize neural networks with web sites like http://playground.tensorflow.org
  • A significant amount of demos that can show case DNN applications that can run directly in the browser

The fourth session has been one also an interesting session, because the speaker clearly explained the current possibilities and limits of the current application development landscape and in particular of the enterprise bots.

10

Key take away: Bots are far from being smart and people don’t want to type text.

Suggested approach bots are new apps that are reaching their “customers” in the channels that they already use (slack for example) and those new apps using the context and channel functionalities have to extend and at the same time simplify the IT landscape.

11

Example: bot in a slack channel that notifies manager of an approval request and the manager can approve/deny directly in slack without leaving the app.

The fourth and the fifth talk have been rather technical/educational on specific frameworks (IBM System ML for Spark) and on models portability (PMML) with some good points around hyper parameter tuning using a spark cluster in iterative mode and DNN auto encoders.

12

13

The sixth talk has been about the open source voice assistant MyCroft and the related open source device schemas.

The session has been principally made on live demos showcasing several open source libraries that can be used to create a device with Alexa like capabilities:

  • Pocketsphinx for speechrecognition
  • Padatious for NLP intent detection
  • Mimic for text to speech
  • Adapt Intent parser

14

The last session was on tensor flow but also in general experiences around AI coming from Google, like how ML is used today:

15

And how Machine Learning is fundamental today with quotes like this:

  • Remember in 2010, when the hype was mobile-first? Hype was right. Machine Learning is similarly hyped now. Don’t get left behind
  • You must consider the user journey, the entire system. If users touch multiple components to solve a problem, transition must be seamless

Other pieces of advice where around talent research and maintain/grow/spread ML inside your organization :

How to hire ML experts:

  1. don’t ask a Quant to figure out your business model
  2. design autonomy
  3. $$$ for compute & data acquisition
  4. Never done!

How to Grow ML practice:

  1. Find ML Ninja (SWE + PM)
  2. Do Project incubation
  3. Do ML office hours / consulting

How to spread the knowledge:

  1. Build ML guidelines
  2. Perform internal training
  3. Do open sourcing

And on ML algorithms project prioritization and execution:

  1. Pick algorithms based on the success metrics & data you can get
  2. Pick a simple one and invest 50% of time into building quality evaluation of the model
  3. Build an experiment framework for eval & release process
  4. Feedback loop

Overall the quality has been good even if I was really disappointed to discover in the morning that one the most interesting session (with the legendary George Hotz!) has been cancelled.

Helping Troy Hunt for fun and profit

Hi everyone,  I’m a huge fan of the security expert Troy Hunt and of his incredible “free!” service haveibeenpwned.com (if you don’t know it, please use it now! to test if your email accounts are compromised ! ).

Troy-Hunt-Profile-Photo

Now Troy has created a contest where you can actually win a shiny Lenovo laptop, if you create something “new” that can help people to be more aware of the security risks related to pwned accounts.

I decided to participate and my idea is the following, helping all the people that have gmail (and Hotmail/outlook/office 365 in alpha version!) accounts to verify if their friends, colleagues and family members have their email accounts compromised.

I uploaded the code and executables here and I strongly suggest you to read ENTIRELY the readme instructions to understand how the tool works, what are expected results and what you can do.

Regardless if I win the laptop or not, I already won because I was able, thanks to this tool, to alert my wife and some of my friends of the danger and to have the right “push” to convince them to setup two-factor authentication.

If you want to donate , for this effort please donate directly to Troy here, he deserves a good beer !

 

A massive adventure ends…

Hi everyone my journey in Microsoft ends and I want to use this opportunity to revisit many of the good moments I spent in this great company .

The first days I joined it was like entering in a new solar system where you have to learn all the names of the planets, moons, orbits at warp speed

but at the same time I had all the support of my manager , my team mates and my mentor !

Very quickly it was time to enter in action with my team and start working with clients, partners, developers, colleagues from different continents, teams and specializations.

We worked on labs and workshops helping customers and partners to discover and ramp up on Microsoft Advanced Analytics capabilities

and with customer engagements helping our clients to exploit all the possibilities that the Azure Cloud can provide to help them in reaching their goals

I certainly i cannot forget the incredible Microsoft Tech Ready event in Seattle:

and all the talented colleagues I had the opportunity to work with.

Finally I want to say a massive thank you to my v-team , I will miss you!!

As an adventure ends I am really excited to bring my passion and curiosity into a new one joining a customer .

I can’t wait to learn more, improve and be part of the massive digital transformation journey that is in front of us.

 

 

 

AI is progressing at incredible speed!

Several people tend to think that all the new AI technologies like Convolutional neural networks, Recurrent Neural Networks, Generative adversarial networks,etc.. are used mainly in tech giants like Google , Microsoft , etc.. , in reality many enterprises are already leveraging deep learning in production like Zalando, Instacart and many others . Well known deep learning frameworks like Keras, Tensorflow, CNTK, Caffe2, etc.. are now finally reaching a larger audience.

Big data engines like Spark are finally able to pilot also deep learning workloads and also the first steps to make large deep neural networks models fit inside small cpu/low memory, occasionally connected IOT devices are coming.

Finally new hardware has been built specifically for deep learning :

  1. https://www.microsoft.com/en-us/research/blog/microsoft-unveils-project-brainwave/
  2. https://cloud.google.com/blog/big-data/2017/05/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu
  3. https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/

But the AI space never sleeps and we are already seeing arriving new solutions/frameworks and architectures that are actually under development:

Ray: the new distributed execution framework that aims to replace the well known Spark!

Pytorch and fast.ai framework that wants to compete and beat all the existing deep learning frameworks

To overcome one of the biggest problems on deep learning : amount of training data, snorkel a new framework has been designed to create brand new training data with little human interaction

Finally to help to create a better integration and performance of the deep learning models with the applications that want to consume those models a new prediction serving system Clipper has been designed .

The speed of AI evolution is incredible and be prepared to see much more than this in the near future!