Text Analytics with Facebook FastText

I recently had to work on a very interesting challenge: imagine a very large collection of phrases/statements, and you want to derive from each of them the key ask/term.

An example can be a large collection of emails: scan the subjects and understand the key topic of each message. So, for an email with the following subject: “Dear dad the printer is no more working….” we can say that the key is “printer“.

If you try to apply generic key-phrase algorithms, they can work pretty well in a very general context, but if your context is specific they will miss several key terms that are part of the dictionary of your domain.

I successfully used Facebook fastText for this supervised classification task, and here is what you need to make it work:

  1. A virtual machine with Ubuntu Linux 16.04 (free on a MacBook with Parallels Lite)
  2. Download and compile fastText as described here
  3. A training set with statements and corresponding labels
  4. A validation set with statements and corresponding labels

So yes, you need to manually label a good amount of statements to make fastText “understand” your dataset well.

Of course you can speed up the process by transforming each statement into an array of words and massively assigning labels where it makes sense (Python pandas or just good old SQL can help you here), as in the sketch below.
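
For instance, here is a minimal pandas sketch of this bulk-labeling idea; it is purely hypothetical: the file names, the short_statement column and the keyword-to-label mapping are my assumptions, not part of the original workflow.

import pandas as pd

# Hypothetical mapping from keywords to labels
keyword_labels = {"printer": "printer", "wifi": "wifi"}

df = pd.read_csv("data/statements.csv")  # assumed column: short_statement

def assign_labels(statement):
    # Split the statement into words and collect every matching label
    words = str(statement).lower().split()
    found = {keyword_labels[w] for w in words if w in keyword_labels}
    return " ".join("__label__" + label for label in sorted(found))

df["labels"] = df["short_statement"].apply(assign_labels)
df.to_csv("data/labeled_statements.csv", index=False)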

Let’s see how to build the training set:

Create a file called train.csv and insert each statement on its own line in the following format: __label__labelname followed by your statement.

Let’s see an example with the email subject used before:

__label__printer Dear dad the printer is no more working

You can also have multiple labels for the same statement, as in this example:

__label__printer __label__wifi Dear dad the printer and the wifi are dead

The validation set can be another file, called valid.csv, filled in exactly the same way; of course you have to follow the usual good practices to split your labeled dataset correctly into a training set and a validation set, for example as sketched below.
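
A quick way to do the split, assuming all the labeled statements are already in the fastText format (one “__label__… statement” per line) in a single file:

from sklearn.model_selection import train_test_split

# Read the labeled statements, one per line
with open("data/labeled.txt") as f:
    lines = [line.strip() for line in f if line.strip()]

# Hold out 20% of the statements for validation
train, valid = train_test_split(lines, test_size=0.2, random_state=42)

with open("data/train.csv", "w") as f:
    f.write("\n".join(train))
with open("data/valid.csv", "w") as f:
    f.write("\n".join(valid))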

In order to start the training with fastText you have to type the following command:

./fasttext supervised -input data/train.csv -output result/model_1 -lr 1.0 -epoch 25 -wordNgrams 2

This assumes that your terminal is inside the fastText folder and that the training file is inside a subfolder called data, while the resulting model will be saved in a result folder.

I also added some optional arguments to improve the precision in my specific case; you can look at those options here.

Once the training is done (you will understand why it’s called fastText here!), you can check the precision of the model this way:

./fasttext test result/model_1.bin data/valid.csv

In my case I obtained a good 83% 🙂

If you want to test your model manually (typing sentences and obtaining the corresponding labels), you can try the following command, where the trailing “-” makes fastText read sentences from standard input:

./fasttext predict result/model_1.bin -

fastText also has Python wrappers, like this one that I used, and you can leverage such a wrapper to perform massive scoring like I did here:


from fastText import load_model
import pandas as pd

# Load the supervised model trained earlier
fastm = load_model('result/model_1.bin')
k = 1  # number of labels to predict per statement

df = pd.read_csv("data/datatoscore.csv")
df.insert(2, 'label', '')

for index, row in df.iterrows():
    labels, probabilities = fastm.predict(str(row["short_statement"]), k)
    for w, f in zip(labels, probabilities):
        # Write through df.at: assigning to the iterrows() row copy
        # would not persist the change back into the DataFrame
        df.at[index, 'label'] = w

df.to_csv("data/finalresult.csv", index=False)

You can improve the entire process in many different ways; for example, you can use the unsupervised training to obtain word vectors for your dataset, use this “dictionary” as the base for your list of labels, and use the nearest neighbors to find similar words that can be grouped into single labels when doing the supervised training.
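
A sketch of those two steps with the standard fastText commands (the file names are placeholders): the first command learns word vectors from the raw, unlabeled statements, and the second opens an interactive nearest-neighbor console on the learned vectors.

./fasttext skipgram -input data/statements.txt -output result/vectors

./fasttext nn result/vectors.bin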

Let’s dig into our emails!

Like many of you, even though we are almost in 2018, I still work A LOT with email, and recently I asked myself the following question: what if I could leverage analytics and machine learning to get a better understanding of my emails?

Here is a quick way to understand who is inspiring you the most, and who instead are the ones spreading a bit more negativity in your daily job 🙂

Here is what you will need (if you want to process ALL your emails in one shot!):

  1. Windows 7/8/10
  2. Outlook 2013 or 2016
  3. Access 2013 or 2016
  4. An Azure Subscription
  5. A Data Lake Store and a Data Lake Analytics account
  6. PowerBI Desktop or any other Visualization Tool you like (Tableau or simply Excel)

Step 1: Link MS Access tables to your Outlook folders as explained here

Step 2: Export your emails from Access to CSV files.

Step 3: Upload those files to your data lake store.

Step 4: Process the fields containing text data with the U-SQL cognitive extensions and derive the sentiment and key phrases of each email, as sketched below.
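
As a rough sketch of this step, based on the public U-SQL cognitive samples (the assembly names, the expected Text column and the produced columns are assumptions that may differ depending on the version of the extensions installed in your account):

REFERENCE ASSEMBLY [TextCommon];
REFERENCE ASSEMBLY [TextSentiment];
REFERENCE ASSEMBLY [TextKeyPhrase];

@emails =
    EXTRACT Sender string, Subject string, Text string
    FROM "/input/emails.csv"
    USING Extractors.Csv(skipFirstNRows : 1);

// Sentiment of each email body (the analyzer reads the Text column)
@sentiment =
    PROCESS @emails
    PRODUCE Sender, Subject, Sentiment string, Conf double
    READONLY Sender, Subject
    USING new Cognition.Text.SentimentAnalyzer(true);

// Key phrases of each email body
@keyphrases =
    PROCESS @emails
    PRODUCE Sender, Subject, KeyPhrase string
    READONLY Sender, Subject
    USING new Cognition.Text.KeyPhraseExtractor();

OUTPUT @sentiment TO "/output/sentiment.csv" USING Outputters.Csv(outputHeader : true);
OUTPUT @keyphrases TO "/output/keyphrases.csv" USING Outputters.Csv(outputHeader : true);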

Step 5: With PowerBI Desktop you can access the output data sitting in the data lake store, as described here

Step 6: Find the senders with the highest average sentiment and the ones with the lowest 🙂

If you are worried about leaving your emails in the cloud: after obtaining the sentiment and key phrases, you can download this final output and remove all the data from the data lake store, using this (local) file as input for PowerBI Desktop.

In addition to this, I would also suggest performing a one-way hash of the sender email address and uploading to the data lake store account the emails with this hashed field instead of the real sender.
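
A minimal Python sketch of the idea (the file name and the sender column are my assumptions):

import hashlib
import pandas as pd

df = pd.read_csv("emails.csv")  # assumed column: sender

# Replace each sender address with a one-way SHA-256 hash
# before the file is uploaded to the data lake store
df["sender"] = df["sender"].apply(
    lambda s: hashlib.sha256(str(s).strip().lower().encode("utf-8")).hexdigest())

df.to_csv("emails_hashed.csv", index=False)

Keep the original sender-to-hash mapping in a local table, so you can join the results back to the real senders afterwards.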

Once you have the data lake analytics job results, you can download them and join them locally in Access to associate each email with the original sender again.

From Windows 10 to macOS Sierra without admin privileges

Hi everyone, lately, thanks to my manager and my new employer, I was able to switch from a Windows 10 laptop to a shiny MacBook Pro, and I want to share with you some tips and tricks for the issues you will probably encounter if you do the same. First let’s start with the basics: why did I choose to switch? Well, I have always (since 2009) had only Apple devices at home, and I have always loved the consistency and the “stability” of Apple devices, but I never had the opportunity to actually “work” with a Mac, so this is also a learning experience for me.

If you have actually never used a Mac, the first obstacles will be shortcuts like CTRL+C and CTRL+V, the mouse clicks (actually the right click on the trackpad), the scrolling with two fingers on the trackpad, and now the shiny and mysterious Touch Bar. Past this first shock, you will quickly get used to the magic search experience of Spotlight, the backup-for-dummies of Time Machine and the well-known experience of the App Store.

Now let’s focus on the work-related stuff: you can finally have Office 2016 on a Mac too, but it is miles and miles away from the functionality and ease of use of Office 2016 on Windows; the differences are not super evident, but if you use Office professionally you will quickly find the missing pieces.

The solution? Go to the App Store, purchase Parallels Lite and enjoy Linux and Windows virtual machines. You will have VMs without being an admin, because Parallels Lite uses the native hypervisor available on the Mac since Yosemite.

Thanks to this I was also able to get back several “life-saving” applications that I use daily, like PowerBI Desktop, SQL Server Management Studio and Visual Studio 2017. To be honest, they have their own versions in the Mac world, but the functionalities missing from those versions are too numerous to live with them alone.

So I ended up having a Windows 10 VM full of software; so why not use Windows directly? Well, with the Windows VM I can use Windows exactly for the apps that run great on that platform, and if the system becomes unstable I can still work normally on my Mac without losing my work while Windows does its own “things” 🙂.

When needed I leverage an Ubuntu VM with Docker and VS Code, following the same segregation-of-duties principle (main OS fast and stable, guest OS with rich and dedicated software).

I now often work this way: SQL Server hosted on Linux, easy import/export of external data with SQL Server Management Studio from Windows, PySpark notebooks running on Docker and accessing the same data, and finally visualizations with PowerBI Desktop on Windows.

In case, like me, you have strict policies around admin accounts, I want to share this: do you remember the concept of portable apps on Windows? Well, on the Mac you can do the same with some (not all) of the applications distributed outside the App Store (you can install almost all the apps in the App Store without admin privileges).

The technique to make a Mac application “portable” is simply the double extraction of the pkg file and its Payload archive to a folder that you can access (like your desktop); you can check the details here and here, and then basically run those applications from the locations that you like, for example:
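
A sketch of the double extraction from the terminal (the package and component names are placeholders; the Payload is a gzip-compressed cpio archive):

pkgutil --expand MyApp.pkg ~/Desktop/expanded
cd ~/Desktop/expanded/MyApp.pkg
cat Payload | gunzip -dc | cpio -i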

The exceptions will be:

  1. Applications not signed by a recognized and well-known developer or software house
  2. Applications that on start up will ask you to install additional services
  3. Applications that before being launched require the registration of specific libraries/frameworks

There are cases (like Azure Machine Learning Workbench) where the installer actually writes everything into your user account folders, but the last step is the copy of the UI app to the Applications folder, and this will fail if you are not an admin. The solution is to look a bit inside the installer folders and find, inside the JSON files, the locations of the downloaded packages. Once you find the URL of the missing one (use the installer error message to help you find the package it was not able to copy), download it locally and execute the app from any location; it should work without problems.