Extract text from documents at scale in the Azure Data Lake

Hi across all the content posted on Build 2017 , I was really impressed by this presentation where you can learn how to lift and shift almost any runtime to Azure Data Lake Analytics for large scale processing.

Following those recommendations I built a custom Extractor and Processor for USQL leveraging tika.net extractor in order to extract in a text format all the content stored in files like pdf, docx, etc…

The idea is to solve the following business scenario: you have a large collection of docs (azure data lake store capacity is unlimited)  : pdfs, xls, ppts, etc.. and you want to quickly understand the information stored in all those documents without having to create/deploy your own cluster but in pure PaaS mode.

Here a sample of code built around the Visual Studio Project template for U-SQL Applications.

ADLA4As you can see in this demo we limited the max size of the extracted content to 128Kb in order to comply to this ADLA limit. This limit can be bypassed working on byte arrays.

Now I uploaded all the dll binaries to the data lake stored and registered them as assembly


Then I launched a U-SQL command to actually take text data stored in a collection of pdf documents specifying 22 AU .adla2

And in less than 2 min I have my collection of documents parsed inside one single csv file .adla.jpg

Now that the information is in text format we can use the Azure Data Lake topics and keywords extensions to understand quickly what kind of information is stored inside our document collection.

The final result that shows how keywords are linked to documents can be visualized in Power Bi with several nice visualizations


And clicking on one keyword we can see immediately which documents are linked to it


Another way to visualize this , it is with word cloud


where we see for a specific document this time what are the keywords most representative of the document.

If you are interested into the solution and you want to know more send me a message on my twitter handle.