Session Analyzer traffic non productionStarting a long time ago, we wanted to reproduce workload in a non-production environment, and there were different attempts to achieve that goal (Query Playback is just one of them). But there is another point of view, where you need to write your own workload to do so.

Both Have Pros and Cons

Reproduce Workload:

Pros:

  • Simple to implement
  • Ready to go

Cons:

  • Need to rebuild the environment each time

Custom Scripts:

Pros:

  • Possible to have a more realistic workload
  • You can reuse the environment
  • You can use Sysbench that allows you to change several options in your test like increasing threads or limiting throughput

Cons:

  • You need to invest a lot of time to create the scripts to have a realistic workload

Is it Possible to Have the Best of Both Worlds?

My idea is simple; use the slow query log to get a usable template script of workload. This might sound simple, but it requires defining the steps needed:

  • We need a slow query reader
  • We need to identify each query template
  • We need to keep track of the queries executed by session, as we want to simulate the sessions

Another two important aspects will be to collect the data in the queries to create variables that will be filled up with valid functions and each session have “variables” that repeat across the whole execution. 

Slow Query Reader

It is not complex, but we need to split the query into three pieces, so a query like this:

Will have this query template:

This data template:

And the real data will be:

Analyzing the Data Per Query

As we are going to have queries that execute the same query template with the same data template, we can collect the real data per position and process. For instance, if we have all these queries executed:

We need to extract all this integer value to determine that in column1, values will be between 2004 and 2005, column2 values between 01 and 12, and for column3, between 01 and 27.

The same analysis should be done to consider:

  • Integers and if they have some specific distribution
  • Alpha or Alphanumerics value and their length
  • Hexadecimal

Analyzing the Workload

As we want to reproduce the real workload, the queries need to be sent in the same order that they were executed by the application and we need to process the variables. For instance, if we have a session executing:

We need to identify the 10 as an app-level variable for the session that could be inside other values and strings. With this information, we can develop a session template, and we need to summarize this information as most of the time the applications execute the same queries in the same order but with different data.

Merging the Information Collected

Once we have the functions that generate the values to fill up the queries templates and the sessions template, we are able to simulate the workload. Sysbench will be able to get the data in files that are generated dynamically and send it to the database. 

What’s Next?

I worked on a tool that does it, but perl and bash were not the best choices, so, now I’m already working on a tool written in C and using GLib. Hopefully, it will be published soon.