Hi everyone, during one of my projects I’ve been asked the following question:
I’m actually storing my IOT sensor’s data in Azure Data Lake for analysis and feature engineering , but currently I still have very few devices, so not a big amount of data and I’m not able to understand how much fast will be my queries and my transformations when with much more devices and months/years of sensor data my data lake will reach do over several terabytes.
Well in that case let’s generate quickly those terabytes of data using U-SQL capabilities!
Let’s assume that our data resembles the following:
deviceId, timestamp, sensorValue, …….
so we have for each IOT device a unique identifier called deviceId and let’s assume is a composition of numbers and letters, we have a timestamp indicating the time at millisecond precision, where the IOT event was generated and finally we have the values of the sensors in that moment (temperature, speed, etc..).
The idea is the following give a real deviceId, generate N “synthetic deviceIds” that have all the same data of the original device . So if we have , for example , 5 real deviceId each with 100.000.000 records (500.000.000 records in total), if we generate 1000 synthetic deviceIds for each real deviceId we will have 1000x5x100.000.000 additional records so 500.000.000.000 records.
But we can expand the amount of synthetic data even more playing with time, for example, if our real data has events only for 2017, we can duplicate the entire dataset for all the years starting from 2006 to 2016 and have 5.000.000.000.000 records.
Here some sample C# code that generates the synthetic deviceIds:
note the GetArraysOfSyntheticDevices function that will be executed into the U-SQL script.
Before using it we have to register the assembly into our DataLake account and database (in my case the master one):
DROP ASSEMBLY master.[Microsoft.DataGenUtils];
CREATE ASSEMBLY master.[Microsoft.DataGenUtils] FROM @”location of dll”;
Now we can read the original IOT data and create the additional data:
REFERENCE ASSEMBLY master.[Microsoft.DataGenUtils];
//Let’s have the distinct list of all the real DeviceIds
@t1 =SELECT DISTINCT
deviceid AS deviceid
//Let’s calculate for each deviceId an array of 1000 synthetic devices
Microsoft.DataGenUtils.SyntheticData.GetArrayOfSynteticDevices(deviceid, 1000) AS SyntheticDevices
//Let’s assign to each array of synthetic devices the same data of the corresponding real device
@t3 = SELECT a.SyntheticDevices,
FROM @t0 AS de INNER JOIN @t2 AS a ON de.deviceid== a.deviceid;
//Let’s use the explode function to expand the array to records
emp AS deviceid,
FROM @t3 AS de
EXPLODE(de.SyntheticDevices) AS dp(emp);
//Now we can write the expanded data
Once you have the expanded data for the entire 2017 you can just use c# DateTime functions that add Years, Months or days to a specific date, applied that to timeofevent column and write the new data in a new folder (for example SyntethicData/2016, SyntethicData/2015 etc…).