Indexing XML files with Elasticsearch and C#

Lately I’ve been struggling with some integration issues, and each time I had to reverse engineer workflows, troubleshoot code, and search inside the logs of the enterprise service bus to find the message xyz containing the customer data of client abc. Since these logs are also stored in an encrypted format, I had to write some code to decrypt them on the fly, search inside the file contents, move on to the next log, and so on…

Basically, a single search for one customer was taking 20–30 minutes…

So I started to look at solutions like Elasticsearch or Solr that solve exactly this kind of problem, and since I had already worked with Elasticsearch in the past, I went in that direction. The classic combo is Logstash & Elasticsearch & Kibana: Logstash parses and transforms the incoming log files and sends them to Elasticsearch, where they are indexed, and with Kibana you can quickly build nice dashboards on the indexed content.

This time, however, I faced a new challenge: instead of classic web logs and ready-to-use Logstash transformations (filters), I had to work with these huge XML log files stored inside the ESB, which also had several levels of nesting. Elasticsearch natively indexes JSON objects, not XML, so you have to manipulate the XML with a Logstash transformation. After reading a bit about the Logstash xml filter, I found (probably because I did not spend much time on it) that writing the right transformation for my case would take too long.

So I wrote some C# code to do it instead, choosing to leverage the NEST library (the Elasticsearch .NET client).

While looking through the NEST and Elasticsearch documentation, I also discovered that objects nested inside other objects are not as easily searchable as root-level ones. So I decided to flatten the XML structure into a simple, flat C# class. To write the minimum amount of code, I first transformed the XML into a proper C# class; the fastest way I found is to use xsd.exe from the Windows SDK (look in C:\Program Files\Microsoft SDKs\Windows\v6.0A\bin) and obtain the XSD file from a single XML document:

xsd "C:\Users\UserA\Desktop\ESB.xml" /o:"C:\Users\UserA\Desktop"

You will obtain ‘C:\Users\UserA\Desktop\ESB.xsd’.

Now use xsd.exe again to generate the C# class:

xsd /c "C:\Users\UserA\Desktop\ESB.xsd" /o:"C:\Users\UserA\Desktop"

You will obtain ‘C:\Users\UserA\Desktop\ESB.cs’.

I manually created the flat C# class by simply copying & pasting the nested object properties of the generated class into the flat one. Since the property names are unchanged, we can later use reflection to automatically copy property values from the nested objects to the flat one:

public static void ReplaceValues(Object source, Object destination)
{
    PropertyInfo[] propertiesIncoming = source.GetType().GetProperties(BindingFlags.Public | BindingFlags.Instance);
    PropertyInfo[] propertiesDestination = destination.GetType().GetProperties(BindingFlags.Public | BindingFlags.Instance);
    //This is sample code: do not iterate and search
    //with LINQ like this on large collections!
    foreach (PropertyInfo p in propertiesIncoming)
    {
        //Only string properties are copied over
        if (p.PropertyType != typeof(string)) { continue; }
        PropertyInfo dest = propertiesDestination.FirstOrDefault(y => y.Name == p.Name);
        if (dest != null)
        {
            dest.SetValue(destination, p.GetValue(source));
        }
    }
}
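To see the copy in action outside the full ESB project, here is a minimal, self-contained sketch; the nested and flat classes below are hypothetical stand-ins for the generated ESB types, not the real ones:

```csharp
using System;
using System.Linq;
using System.Reflection;

// Hypothetical nested leaf class, standing in for a generated ESB level
public class Level3Sample
{
    public string City { get; set; }
    public string ZipCode { get; set; }
}

// Hypothetical flat class with the same property names
public class FlatSample
{
    public string City { get; set; }
    public string ZipCode { get; set; }
    public string Id { get; set; }
}

public static class Flattener
{
    // Same logic as ReplaceValues above: copy string properties by name
    public static void ReplaceValues(object source, object destination)
    {
        PropertyInfo[] src = source.GetType().GetProperties(BindingFlags.Public | BindingFlags.Instance);
        PropertyInfo[] dst = destination.GetType().GetProperties(BindingFlags.Public | BindingFlags.Instance);
        foreach (PropertyInfo p in src)
        {
            if (p.PropertyType != typeof(string)) continue;
            PropertyInfo dest = dst.FirstOrDefault(y => y.Name == p.Name);
            if (dest != null) dest.SetValue(destination, p.GetValue(source));
        }
    }
}

public class Program
{
    public static void Main()
    {
        var nested = new Level3Sample { City = "New York", ZipCode = "10001" };
        var flat = new FlatSample();
        Flattener.ReplaceValues(nested, flat);
        Console.WriteLine(flat.City); // prints "New York"
    }
}
```

Because matching is done purely by property name, any rename in the generated class must be mirrored in the flat one.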

Once we have the flat C# object, we can index it very quickly with the NEST client;
here is a sample that takes one XML file and indexes its contents.

//ESBObj is the ESB.cs class type
XmlSerializer serializer = new XmlSerializer(typeof(ESBObj));
//Even with large xml files this deserialization happens really quickly
ESBObj resultingMessage = (ESBObj)serializer.Deserialize(new XmlTextReader(this.openFileDialog1.FileName));
//Here we use the NEST library and we connect to the local node
var node = new Uri("http://localhost:9200");
//we specify that we want to work on the esb_index
var settings = new ConnectionSettings(
    node,
    defaultIndex: "esb_index"
);
//let's connect
var client = new ElasticClient(settings);

//here we fill the flat objects using the ESBObj levels
FlatObj tempObj = null;
int progressive = 0;
//sample code here, this can be largely improved using reflection again
foreach (var level1 in resultingMessage.Level1Objects)
{
    foreach (var level2 in level1.Level2Objects)
    {
        foreach (var level3 in level2.Level3Objects)
        {
            tempObj = new FlatObj();
            progressive++;
            ReplaceValues(resultingMessage, tempObj);
            ReplaceValues(level1, tempObj);
            ReplaceValues(level2, tempObj);
            ReplaceValues(level3, tempObj);
            //Before indexing we assign a progressive Id to each object
            //so that each one has a unique id on Elasticsearch,
            //which uses this id to uniquely identify each object on the index
            tempObj.Id = progressive.ToString();
            //This is the indexing call
            var index = client.Index(tempObj);
        }
    }
}

Now we want to search the contents of the index; however, this turned out to be trickier than I thought, probably because it was my first time using the NEST library. Luckily, I had also installed some Elasticsearch plug-ins, and one of these was ElasticHQ, a nice front-end for Elasticsearch. Looking inside the JSON requests of the queries issued by ElasticHQ, I was able to find the right query to issue using NEST's raw mode (where you pass the commands directly instead of letting the NEST library build them for you).

This is some sample code that “should” work, a search with City=New York, but in my case it returned no results. The likely culprit is that a term query is not analyzed: with the default mapping, the standard analyzer indexes “New York” as the lowercase tokens “new” and “york”, so the exact term “New York” never matches.

var searchResults = client.Search<FlatObj>(s => s
                .From(0)
                .Size(50)
                .Query(q => q
                     .Term(p => p.City, "New York")
                )
                );

Here instead is how I made it work (and this way it automatically searches across all the properties!):


//In searchbox we type what we want to find 
//we can type here anything and elastic search will search on all 
//flattened properties!!!
string searchVal = @"{""filtered"": {""query"": {""query_string"": {""query"": """ + this.searchbox.Text;
searchVal=searchVal+@"""}}}}";
var client = new ElasticClient(settings);
var searchResults = client.Search<FlatObj>(s => s
            .From(0)
            .Size(50)            
            .QueryRaw(searchVal)           
            );
//Since the Documents collection is IEnumerable we can bind it on the fly
//and see the results in a grid!
this.dataGridView1.DataSource = searchResults.Documents;
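One caveat about the raw query above: concatenating the searchbox text straight into the JSON body breaks as soon as the user types a double quote or a backslash. A minimal sketch of a safer builder (the escaping below covers only the common JSON string escapes, and `QueryBuilder` is a hypothetical helper name):

```csharp
using System;
using System.Text;

public static class QueryBuilder
{
    // Escape the characters that would break a JSON string literal
    public static string EscapeJson(string value)
    {
        var sb = new StringBuilder();
        foreach (char c in value)
        {
            switch (c)
            {
                case '"':  sb.Append("\\\""); break;
                case '\\': sb.Append("\\\\"); break;
                case '\n': sb.Append("\\n");  break;
                case '\r': sb.Append("\\r");  break;
                case '\t': sb.Append("\\t");  break;
                default:   sb.Append(c);      break;
            }
        }
        return sb.ToString();
    }

    // Build the same filtered query_string body as in the post,
    // but with the user input escaped first
    public static string BuildQuery(string userInput)
    {
        return @"{""filtered"": {""query"": {""query_string"": {""query"": """
               + EscapeJson(userInput)
               + @"""}}}}";
    }
}

public class Program
{
    public static void Main()
    {
        // The escaped text can now be passed safely to QueryRaw
        Console.WriteLine(QueryBuilder.BuildQuery("City:\"New York\""));
    }
}
```

The resulting string can be fed to `QueryRaw` exactly as in the snippet above.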

In the end, my quick POC indexed a 14 MB XML log file in 200 ms, and each search issued to the Elasticsearch node finds any possible content in 50–100 ms. The only thing that worries me is the index size (5 MB); I want to see how much it grows with several files.


Integration with Adobe Marketing Cloud (aka Neolane)

CRM landscapes are often made of disparate interacting systems: apart from the CRM itself, you can have marketing segmentation tools, data warehouses, e-commerce sites, online presence sites, etc. In this scenario, one of the leading marketing cloud solutions is Adobe's, with its recent acquisition of Neolane. At its most basic (Neolane experts, please forgive me!), Neolane is a tool to segment your customers (young & low salary, old & high spending, single women with kids, etc.), create personalized campaigns for them, and deliver commercial offers, discounts, and so on. With Neolane you can track the results of these campaigns (for example, the clicked links in a campaign email) and optimize your next campaigns.

From an integration perspective, the classic way to import/export data with Neolane is through files and workflows, but Neolane also offers other means to extract or update information on the system via its API.

The Neolane API consists of a single endpoint, usually listening here:

https://neolaneserver/nl/jsp/soaprouter.jsp

This service accepts only POST requests and, through standard SOAP message exchanges, lets you extract or change the data you want on Neolane.

However, in order to access this resource, you have to authenticate against Neolane. Once authenticated, Neolane gives us a session token and a security token, and using both we can finally query Neolane.

Theoretically you can also query the SOAP service passing a user/password combination, but this security option is disabled by default at the configuration level.

For our purposes we will limit our analysis to these two SOAP services:

1) session (Logon method)

2) query (ExecuteQuery method)

Now, in order to make the API calls, we can proceed in two ways:

A) call the page https://neolaneserver/nl/jsp/schemawsdl.jsp?schema= for each service, passing the service name as a parameter, to obtain the WSDL (you have to authenticate, so go with your browser and save the WSDL to disk). For session you make the call with the parameter xtk:session; for the query service, with the parameter xtk:queryDef. Once you have the WSDL, you can use the tool of your choice to generate the classes for the integration (for .NET, wsdl.exe).

B) instead of having proxy classes that handle the burden of SOAP communication for you, call https://neolaneserver/nl/jsp/soaprouter.jsp directly and modify the request headers according to the SOAP action (service) that you want to call.

We will go with option B for two reasons:

1) the authentication service (session) requires credentials passed inside the HTTP header of the request, and doing this with .NET is pretty cumbersome: it involves modifying the proxy classes, and you will lose those changes if you regenerate them from the WSDL;

2) the generated proxy classes, in my case at least, are not really that good… Just to give you an example:
this is the WSDL section for the Logon method of the session service

 <s:element name="Logon">
  <s:complexType>
    <s:sequence>
       <s:element maxOccurs="1" minOccurs="1" name="sessiontoken" type="s:string"/>
       <s:element maxOccurs="1" minOccurs="1" name="strLogin" type="s:string"/>
       <s:element maxOccurs="1" minOccurs="1" name="strPassword" type="s:string"/>
       <s:element maxOccurs="1" minOccurs="1" name="elemParameters" type="tns:Element"/>
    </s:sequence>
  </s:complexType>
 </s:element>
  <s:element name="LogonResponse">
   <s:complexType>
     <s:sequence>
      <s:element maxOccurs="1" minOccurs="1" name="pstrSessionToken" type="s:string"/>
      <s:element maxOccurs="1" minOccurs="1" name="pSessionInfo" type="tns:Element"/>
      <s:element maxOccurs="1" minOccurs="1" name="pstrSecurityToken" type="s:string"/>
     </s:sequence>
   </s:complexType>
</s:element>

and this is the generated method

[System.Web.Services.Protocols.SoapDocumentMethodAttribute("xtk:session#Logon", RequestNamespace="urn:xtk:session", ResponseNamespace="urn:xtk:session", Use=System.Web.Services.Description.SoapBindingUse.Literal, ParameterStyle=System.Web.Services.Protocols.SoapParameterStyle.Wrapped)]
    [return: System.Xml.Serialization.XmlElementAttribute("pstrSessionToken")]
public string Logon(string sessiontoken, string strLogin, string strPassword, System.Xml.XmlElement elemParameters, out System.Xml.XmlElement pSessionInfo, out string pstrSecurityToken) {
        object[] results = this.Invoke("Logon", new object[] {
                    sessiontoken,
                    strLogin,
                    strPassword,
                    elemParameters});
        pSessionInfo = ((System.Xml.XmlElement)(results[1]));
        pstrSecurityToken = ((string)(results[2]));
        return ((string)(results[0]));
    }

so basically it requires you to pass parameters by reference (the out keyword) while in reality there is no need for it.

So let’s go with classic HttpWebRequest:

 //Create the web request to the soaprouter page
HttpWebRequest req = (HttpWebRequest)WebRequest.Create("https://neolaneserver/nl/jsp/soaprouter.jsp");

req.Method = "POST";
req.ContentType = "text/xml; charset=utf-8";
//Add to the headers the requested Service (session) that we want to call
req.Headers.Add("SOAPAction", "xtk:session#Logon");
//Here for testing purposes username and password are hardcoded, but you should acquire them in a secure way!
string userName = "username";
string pass = "password";
//We craft the soap envelope creating a session Logon request
string body = "<soapenv:Envelope xmlns:soapenv=\"http://schemas.xmlsoap.org/soap/envelope/\" xmlns:urn=\"urn:xtk:session\">" +
                "<soapenv:Header/><soapenv:Body><urn:Logon>" +
                "<urn:sessiontoken/>" +
                "<urn:strLogin>" + userName + "</urn:strLogin>" +
                "<urn:strPassword>" + pass + "</urn:strPassword>" +
                "<urn:elemParameters/>" +
            "</urn:Logon></soapenv:Body></soapenv:Envelope>";

//We write the body to a byteArray to be passed with the Request Stream
byte[] byteArray = Encoding.UTF8.GetBytes(body);

// Set the ContentLength property of the WebRequest.
req.ContentLength = byteArray.Length;
// Get the request stream.
Stream dataStreamInput = req.GetRequestStream();
// Write the data to the request stream.
dataStreamInput.Write(byteArray, 0, byteArray.Length);
// Close the Stream object.
dataStreamInput.Close();

var response = req.GetResponse();

Stream dataStream = response.GetResponseStream();
// Open the stream using a StreamReader for easy access.
StreamReader reader = new StreamReader(dataStream);
// Read the content.
string responseFromServer = reader.ReadToEnd();
// Display the content on the console.
Console.Write(responseFromServer);
// Clean up the streams and the response.
reader.Close();
response.Close();

//Manually parsing the response with an XMLDoc
System.Xml.XmlDocument xResponse = new XmlDocument();
xResponse.LoadXml(responseFromServer);
// We parse the response manually. This is again for testing purposes
XmlNode respx = xResponse.DocumentElement.FirstChild.FirstChild;

string sessionToken = respx.FirstChild.InnerText;
string securityToken = respx.LastChild.InnerText;
// We have done the login now we can actually do a query on Neolane
HttpWebRequest reqData = (HttpWebRequest)WebRequest.Create("https://neolaneserver/nl/jsp/soaprouter.jsp");
reqData.ContentType = "text/xml; charset=utf-8";
//Add to the headers the requested Service (ExecuteQuery) that we want to call
reqData.Headers.Add("SOAPAction", "xtk:queryDef#ExecuteQuery");
//Add to the headers the security and session token
reqData.Headers.Add("X-Security-Token", securityToken);
reqData.Headers.Add("cookie", "__sessiontoken=" + sessionToken);
reqData.Method = "POST";
// We write a SQL-like query to Neolane using XML syntax. Basically we are asking: SELECT email, firstname, lastname FROM recipient WHERE email='test@test.com'
string bodyData = "<soapenv:Envelope xmlns:soapenv=\"http://schemas.xmlsoap.org/soap/envelope/\" xmlns:urn=\"urn:xtk:queryDef\">" +
                "<soapenv:Header/><soapenv:Body><urn:ExecuteQuery><urn:sessiontoken/><urn:entity>" +
                "<queryDef operation=\"select\" schema=\"nms:recipient\">" +
                    "<select><node expr=\"@email\"/><node expr=\"@lastName\"/><node expr=\"@firstName\"/></select>" +
                    "<where><condition expr=\"@email = 'test@test.com'\"/></where>" +
                "</queryDef>" +
            "</urn:entity></urn:ExecuteQuery></soapenv:Body></soapenv:Envelope>";

byte[] byteArrayData = Encoding.UTF8.GetBytes(bodyData);

// Set the ContentLength property of the WebRequest.
reqData.ContentLength = byteArrayData.Length;
// Get the request stream.
Stream dataStreamInputData = reqData.GetRequestStream();
// Write the data to the request stream.
dataStreamInputData.Write(byteArrayData, 0, byteArrayData.Length);
// Close the Stream object.
dataStreamInputData.Close();

var responseData = reqData.GetResponse();

Stream dataStreamData = responseData.GetResponseStream();
// Open the stream using a StreamReader for easy access.
StreamReader readerData = new StreamReader(dataStreamData);
// Read the content.
string responseFromServerData = readerData.ReadToEnd();
// Display the content. Here we will see the query results in form of recipient collection
Console.Write(responseFromServerData);
// Clean up the streams and the response.
readerData.Close();
responseData.Close();
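The query results come back as a recipient collection inside the SOAP envelope; the exact element names depend on your Neolane schema, so the fragment below is a hypothetical example of the shape, used only to show one robust way to pull the attributes out with XmlDocument:

```csharp
using System;
using System.Xml;

public class Program
{
    public static void Main()
    {
        // Hypothetical ExecuteQuery response fragment; real element names
        // depend on your Neolane schema and query definition
        string responseFromServerData =
            "<SOAP-ENV:Envelope xmlns:SOAP-ENV=\"http://schemas.xmlsoap.org/soap/envelope/\">" +
            "<SOAP-ENV:Body><ExecuteQueryResponse xmlns=\"urn:xtk:queryDef\">" +
            "<pdomOutput><recipient-collection>" +
            "<recipient email=\"test@test.com\" firstName=\"John\" lastName=\"Doe\"/>" +
            "</recipient-collection></pdomOutput>" +
            "</ExecuteQueryResponse></SOAP-ENV:Body></SOAP-ENV:Envelope>";

        var doc = new XmlDocument();
        doc.LoadXml(responseFromServerData);

        // Selecting by tag name avoids hard-coding the nesting depth,
        // unlike chains of FirstChild as in the Logon parsing above
        foreach (XmlNode node in doc.GetElementsByTagName("recipient"))
        {
            Console.WriteLine("{0} {1} <{2}>",
                node.Attributes["firstName"].Value,
                node.Attributes["lastName"].Value,
                node.Attributes["email"].Value);
        }
    }
}
```

For production use you would still want namespace-aware XPath (XmlNamespaceManager) and null checks on the attributes.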

Using this technique you can simply pull data from, or push data to, Neolane without having to touch the Neolane system itself.

UPDATE: here you can see how to write to Neolane!