
I recently had to work on a very interesting challenge: imagine a very large collection of phrases/statements from which you want to derive the key ask/term of each one.

An example is a large collection of emails: you scan the subjects to understand the key topic of each email. For an email with the subject “Dear dad the printer is no more working….”, we can say that the key is “printer”.

Generic key-phrase algorithms can work pretty well in a very general context, but if your context is specific they will miss several key terms that belong to the dictionary of your context.

I successfully used Facebook fastText for this supervised classification task. Here is what you need to make it work:

An Ubuntu Linux 16.04 virtual machine (free on a MacBook with Parallels Lite)
fastText, downloaded and compiled as described here
A training set with statements and corresponding labels
A validation set with statements and corresponding labels

So yes, you need to manually label a good number of statements to make fasttext “understand” your dataset well.

Of course, you can speed up the process by transforming each statement into an array of words and bulk-assigning labels where it makes sense (Python pandas or just plain old SQL can help you here).
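As a rough illustration of that bulk-labeling idea, here is a minimal sketch in plain Python (the same logic carries over to pandas or SQL). The keyword-to-label mapping and the helper names are made up for this example; in practice they would come from your own domain vocabulary, and the suggested labels still deserve a manual check.

```python
import re

# Hypothetical keyword -> label mapping, invented for this example.
KEYWORD_TO_LABEL = {
    "printer": "printer",
    "wifi": "wifi",
    "wi-fi": "wifi",
}

def tokenize(statement):
    """Lowercase a statement and split it into word tokens."""
    return re.findall(r"[a-z0-9\-]+", statement.lower())

def suggest_labels(statement):
    """Return the sorted labels whose keywords appear in the statement."""
    tokens = tokenize(statement)
    return sorted({KEYWORD_TO_LABEL[t] for t in tokens if t in KEYWORD_TO_LABEL})

statements = [
    "Dear dad the printer is no more working",
    "Dear dad the printer and the wifi are dead",
    "Lunch on Sunday?",
]

# Statements with no matching keyword get an empty list and need manual labeling.
suggested = {s: suggest_labels(s) for s in statements}
for statement, labels in suggested.items():
    print(labels, statement)
```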

Let’s see how to build the training set:

Create a file called train.csv and put one statement per line in the following format: __label__labelname followed by your statement.

Let’s make an example with the email subject used before:

__label__printer Dear dad the printer is no more working

You can also have multiple labels for the same statement, as in this example:

__label__printer __label__wifi Dear dad the printer and the wifi are dead
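If your labeled statements already live in a Python structure, lines in this format can be generated rather than typed by hand. The sketch below assumes the data is a list of (labels, statement) pairs; the function name is hypothetical and not part of fasttext.

```python
def to_fasttext_line(labels, statement):
    """Build one training line: '__label__x __label__y statement text'."""
    prefix = " ".join("__label__" + label for label in labels)
    # Collapse newlines and repeated spaces so each statement stays on one line.
    text = " ".join(statement.split())
    return prefix + " " + text

labeled = [
    (["printer"], "Dear dad the printer is no more working"),
    (["printer", "wifi"], "Dear dad the printer and the wifi are dead"),
]

lines = [to_fasttext_line(labels, statement) for labels, statement in labeled]
for line in lines:
    print(line)
```

Writing `lines` to train.csv, one per line, gives a file in the layout described above.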

The validation set can be another file called validation.csv, filled in exactly the same way. Of course, you have to follow the usual good practices to split your labeled dataset correctly into a training set and a validation set.
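One simple way to do the split is a shuffled hold-out, sketched below. The 20% validation fraction and the fixed seed are assumptions made for the example, not values from my experiment, and the function name is hypothetical.

```python
import random

def split_dataset(lines, validation_fraction=0.2, seed=42):
    """Shuffle labeled lines and split them into (train, validation) lists."""
    shuffled = list(lines)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_validation = int(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]

# Placeholder lines standing in for a real labeled dataset.
all_lines = ["__label__printer statement number %d" % i for i in range(10)]
train_lines, validation_lines = split_dataset(all_lines)
print(len(train_lines), len(validation_lines))
```

If some labels are rare, it is worth checking that each of them still appears in the training part after the split.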

In order to start the training with fasttext you have to type the following command:

./fasttext supervised -input data/train.csv -output result/model_1 -lr 1.0 -epoch 25 -wordNgrams 2

This assumes that your terminal is inside the fasttext folder and that the training file is in a subfolder called data, while the resulting model will be saved in a result folder.

I also added some optional arguments to improve the precision in my specific case; you can look at those options here.

Once the training is done (this is where you will understand why it is called fasttext!), you can check the precision of the model like this:

./fasttext test result/model_1.bin data/validation.csv

In my case I obtained a good 83% 🙂
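The test command reports the number of examples together with precision and recall at 1 (P@1 and R@1). To make those two numbers concrete, here is a small sketch that computes them by hand on made-up predictions; the function name and the toy data are assumptions for illustration only.

```python
def precision_recall_at_k(true_labels, predicted_labels, k=1):
    """Compute precision@k and recall@k over a list of examples.

    true_labels: one set of correct labels per example.
    predicted_labels: one ranked list of predicted labels per example.
    """
    correct = 0
    n_predicted = 0
    n_true = 0
    for truth, predicted in zip(true_labels, predicted_labels):
        top_k = predicted[:k]
        correct += sum(1 for label in top_k if label in truth)
        n_predicted += len(top_k)
        n_true += len(truth)
    return correct / n_predicted, correct / n_true

# Toy data: three examples, the second one carries two true labels.
truth = [{"printer"}, {"printer", "wifi"}, {"wifi"}]
predictions = [["printer"], ["wifi"], ["printer"]]

precision, recall = precision_recall_at_k(truth, predictions, k=1)
print(precision, recall)
```

Precision divides the correct predictions by the number of predictions made, while recall divides them by the number of true labels, which is why multi-label statements pull recall at 1 down.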

If you want to test your model manually (typing sentences and getting back the corresponding labels), you can try the following command:

./fasttext predict result/model_1.bin -

Fasttext also has Python wrappers, like this one that I used, and you can leverage the wrapper to score a large dataset in bulk, like I did here:

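# Newer releases of the official bindings are imported as `fasttext` instead of `fastText`
from fastText import load_model
import pandas as pd

# Adjust the path to wherever model_1.bin was saved
fastm = load_model('/result/model_1.bin')
k = 1  # number of labels to return per statement

df = pd.read_csv("data/datatoscore.csv")
df.insert(2, 'label', '')

for index, row in df.iterrows():
    labels, probabilities = fastm.predict(str(row["short_statement"]), k)
    # Write through df.at: assigning to `row` would only change a copy
    df.at[index, "label"] = labels[0]

df.to_csv("data/finalresult.csv")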

You can improve the entire process in many different ways. For example, you can use unsupervised training to obtain word vectors for your dataset, use this “dictionary” as the basis for your list of labels, and use nearest neighbors to find similar words that can be grouped into single labels for the supervised training.
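To show what the nearest-neighbor step amounts to, here is a minimal sketch using cosine similarity. The three-dimensional vectors are invented for the example and merely stand in for the vectors that unsupervised training would produce; the function names are hypothetical too.

```python
import math

# Made-up toy vectors standing in for learned word vectors.
WORD_VECTORS = {
    "printer": [0.9, 0.1, 0.0],
    "toner": [0.8, 0.2, 0.1],
    "wifi": [0.1, 0.9, 0.0],
    "router": [0.2, 0.8, 0.1],
}

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_neighbors(word, vectors, k=1):
    """Return the k words whose vectors are most similar to the given word."""
    scored = [
        (cosine(vectors[word], vector), other)
        for other, vector in vectors.items()
        if other != word
    ]
    return [other for _, other in sorted(scored, reverse=True)[:k]]

print(nearest_neighbors("printer", WORD_VECTORS))
print(nearest_neighbors("wifi", WORD_VECTORS))
```

Words that end up as close neighbors are candidates for sharing a single label in the supervised training set.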

Alberto De Marco @albertod

Hi, I am Alberto De Marco and I write this blog. I am mainly interested in security and ML/big data tech, but also in some other collateral stuff.

Text Analytics with Facebook FastText