Indexing XML files with Elasticsearch and C#

Lately I’ve been struggling with some integration issues, and each time I had to reverse engineer workflows, troubleshoot code and search inside the logs of the enterprise service bus to find the message xyz containing the customer data of client abc. Since these logs are also stored in an encrypted format, I had to write some code to decrypt them on the fly, search inside the file contents, move on to the next log, and so on…

Basically, a single search for one customer was taking 20-30 minutes…

So I started to look at solutions like Elasticsearch or Solr that solve exactly this kind of problem, and since I had already worked with Elasticsearch in the past I went in that direction. The classic combo is Logstash & Elasticsearch & Kibana: Logstash is used to parse and transform the incoming log files and send them to Elasticsearch, where they are indexed, and with Kibana you can quickly build nice dashboards on the indexed content.

This time, however, I had to face a new challenge: instead of having classic web logs and ready-to-use Logstash transformations (filters), I had to work on these huge XML log files stored inside the ESB, which also had several levels of nesting. Elasticsearch natively supports indexing JSON objects, not XML, so you have to manipulate the XML with a Logstash transformation. After reading a bit about the Logstash xml filter I found that (probably because I did not spend much time on it) it would take too long to write the right transformation for my case.

So I started to write some C# code to do it, and I chose to leverage the NEST library (the Elasticsearch .NET client).

While looking through the NEST and Elasticsearch documentation I also discovered that objects nested inside other objects are not as easily searchable as the root ones. So I decided to flatten the XML structure into a simple, flat C# class. To write the minimum amount of code for this, I first transformed the XML into a proper C# class; the fastest way I found is to use xsd.exe from the Windows SDK (look in C:\Program Files\Microsoft SDKs\Windows\v6.0A\bin) and obtain the XSD file from a single XML document:

xsd "C:\Users\UserA\Desktop\ESB.xml" /o:"C:\Users\UserA\Desktop"

You will obtain ‘C:\Users\UserA\Desktop\ESB.xsd’.

Now use xsd.exe again to generate the C# class:

xsd /c "C:\Users\UserA\Desktop\ESB.xsd" /o:"C:\Users\UserA\Desktop"

You will obtain ‘C:\Users\UserA\Desktop\ESB.cs’.
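
To give an idea of the shapes involved, the generated nested class and the hand-made flat class look roughly like the sketch below. The property names here are just placeholders consistent with the rest of this post, not the real ESB schema, and the actual xsd.exe output is more verbose (serialization attributes, backing fields, etc.):

//Hypothetical sketch of the nested class generated by xsd.exe
public class ESBObj
{
    public string MessageId { get; set; }          //placeholder field
    public Level1Obj[] Level1Objects { get; set; }
}

public class Level1Obj
{
    public string City { get; set; }               //placeholder field
    public Level2Obj[] Level2Objects { get; set; }
}

//...Level2Obj and Level3Obj follow the same pattern...

//Hand-made flat class: one string property per field we want to search,
//copied & pasted from every nesting level, plus an Id for Elasticsearch
public class FlatObj
{
    public string Id { get; set; }
    public string MessageId { get; set; }
    public string City { get; set; }
    //...all the other string properties from the nested levels...
}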

I manually created the flat C# class by simply copying & pasting the nested object properties of the generated C# class into the flat one. Since the property names are unchanged, we can later use reflection to automatically copy the property values from the nested objects to the flat one:

public static void ReplaceValues(Object source, Object destination)
{
    PropertyInfo[] propertiesIncoming = source.GetType().GetProperties(BindingFlags.Public | BindingFlags.Instance);
    PropertyInfo[] propertiesDestination = destination.GetType().GetProperties(BindingFlags.Public | BindingFlags.Instance);
    //This is sample code, do not iterate like this
    //and use linq search with large collections!!!
    foreach (PropertyInfo p in propertiesIncoming)
    {
        if (p.PropertyType != typeof(string)) { continue; }
        PropertyInfo dest = propertiesDestination.Where(y => y.Name == p.Name).FirstOrDefault();
        if (dest != null)
        {
            dest.SetValue(destination, p.GetValue(source));
        }
    }
}

So once we have the flat C# object we can index it very quickly with the NEST client.
Here is a sample that takes one XML file and indexes its contents.

//ESBObj is the ESB.cs class type
XmlSerializer serializer = new XmlSerializer(typeof(ESBObj));
//Even with large xml files this deserialization happens really quickly
ESBObj resultingMessage = (ESBObj)serializer.Deserialize(new XmlTextReader(this.openFileDialog1.FileName));
//Here we use the NEST library and we connect to the local node
var node = new Uri("http://localhost:9200");
//we specify that we want to work on the esb_index
var settings = new ConnectionSettings(
    node,
    defaultIndex: "esb_index"
);
//let's connect
var client = new ElasticClient(settings);

//here we fill the flat objects using the ESBObj levels
FlatObj tempObj = null;
int progressive = 0;
//sample code here, this can be largely improved using reflection again
foreach (var level1 in resultingMessage.Level1Objects)
{
    foreach (var level2 in level1.Level2Objects)
    {
        foreach (var level3 in level2.Level3Objects)
        {
            tempObj = new FlatObj();
            progressive++;
            ReplaceValues(resultingMessage, tempObj);
            ReplaceValues(level1, tempObj);
            ReplaceValues(level2, tempObj);
            ReplaceValues(level3, tempObj);
            //Before indexing we assign a progressive Id to each object
            //in order to have a unique id on Elasticsearch:
            //Elasticsearch uses this id to uniquely identify each object
            //in the index
            tempObj.Id = progressive.ToString();
            //This is the indexing call
            var index = client.Index(tempObj);
        }
    }
}
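
As the comment above says, this loop leaves room for improvement: one HTTP call per document gets slow with many records. NEST also exposes an IndexMany helper that sends a single bulk request, so a possible variation (just a sketch, not what I used in the POC) is to collect the flat objects first and index them in one shot:

//Sketch: collect the flat objects inside the nested loops
//instead of calling client.Index(tempObj) for each one...
var batch = new List<FlatObj>();
//    ...
//    batch.Add(tempObj);
//    ...
//...and then send everything to Elasticsearch in a single bulk call
var bulkResponse = client.IndexMany(batch);
if (!bulkResponse.IsValid)
{
    //inspect the response for per-document errors here
}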

Now we want to search the index for these contents; however, this turned out to be trickier than I thought, probably because it was my first time using the NEST library. Luckily I had also installed some Elasticsearch plug-ins, and one of them was ElasticHQ, a nice front-end for Elasticsearch. By looking at the JSON requests of the queries issued by ElasticHQ, I was able to find the right query to send using NEST's raw mode (where you pass the commands directly instead of letting the NEST library build them for you).

This is some sample code that “should” work, a search with City=New York, but in my case no results…

var searchResults = client.Search<FlatObj>(s => s
    .From(0)
    .Size(50)
    .Query(q => q
        .Term(p => p.City, "New York")
    )
);
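
In hindsight, the likely culprit is that a term query is not analyzed, while the indexed text is (the default analyzer lowercases and tokenizes it), so the exact term "New York" never exists in the index as a single token. A match query, which analyzes the search text the same way, would probably have worked without dropping to raw mode; here is a sketch assuming the NEST 1.x fluent syntax used in the rest of this post:

//Sketch: same search, but with a match query whose input gets analyzed
//just like the indexed field
var matchResults = client.Search<FlatObj>(s => s
    .From(0)
    .Size(50)
    .Query(q => q
        .Match(m => m
            .OnField(p => p.City)
            .Query("New York")
        )
    )
);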

Here instead is how I made it work (and this way it automatically searches on all the properties!):


//In searchbox we type what we want to find:
//we can type anything here and Elasticsearch will search on all
//the flattened properties!!!
string searchVal = @"{""filtered"": {""query"": {""query_string"": {""query"": """ + this.searchbox.Text;
searchVal = searchVal + @"""}}}}";
var client = new ElasticClient(settings);
var searchResults = client.Search<FlatObj>(s => s
    .From(0)
    .Size(50)
    .QueryRaw(searchVal)
);
//Since the Documents collection is IEnumerable we can bind it on the fly
//and see the results in a grid!
this.dataGridView1.DataSource = searchResults.Documents;

So in the end my quick POC indexes a 14 MB XML log file in about 200 ms and finds any content inside it in 50-100 ms per search issued to the Elasticsearch node. The only thing that scares me is the index size (5 MB), and I want to see how much it grows with several files.
