Collecting SQL Server performance monitor data

SQL Server provides information for troubleshooting performance. Ed Pollack demonstrates collecting SQL Server performance monitor data.

SQL Server performance monitor data is a vast set of diverse metrics that provide a window into many areas of SQL Server performance. To get the most out of this data, it needs to be collected, stored, and maintained over a long period of time. This article explains a method of collecting SQL Server performance monitor data.

Given the choice, it is always better to be proactive and plan ahead rather than be forced to respond to emergencies when resources run low, or worse, run out.

By regularly sampling and retaining this data, long-term trends and capacity planning models can be constructed that allow an organization to make smart decisions about its resources with plenty of time to spare. This article walks through a process that collects performance monitor data, allowing it to be used for these purposes.

How should performance counter data be stored?

Since performance counters are not persisted in SQL Server, it falls to a developer or administrator to regularly poll dm_os_performance_counters, collect metrics, and store them in a permanent location.

The data can be stored in a similar fashion as it is returned in the view: one row per counter per sample, but I personally find it far easier to work with this data when it is pivoted so that there is a single column per counter per sample. It creates a smaller data set and one that benefits more from compression and columnstore indexes.

For the remainder of this article, a selection of performance counters be used to demonstrate their storage, collection, and usage. Assume that the list of counters used in any organization will vary from these 12 (possibly by a lot):

  • Page life expectancy (Static value. Can be sampled at any point in time)
  • Page reads (Cumulative value. Must be diffed between two points in time)
  • Page writes (Cumulative value. Must be diffed between two points in time)
  • Lazy writes (Cumulative value. Must be diffed between two points in time)
  • Page lookups (Cumulative value. Must be diffed between two points in time)
  • Target pages (Static value. Can be sampled at any point in time)
  • Logins (Cumulative value. Must be diffed between two points in time)
  • Logouts (Cumulative value. Must be diffed between two points in time)
  • Temp Tables Created (Cumulative value. Must be diffed between two points in time)
  • Batch Requests (Cumulative value. Must be diffed between two points in time)
  • Compilations (Cumulative value. Must be diffed between two points in time)
  • Recompilations (Cumulative value. Must be diffed between two points in time)

Note that the collection method is included in the list above to aid later in this article.

A table will be created to store this data that contains a row per sample time and a column per counter. This table includes a clustered columnstore index as it could get quite large and would benefit greatly from the added compression and rowgroup/segment elimination offered by it when performing analytics:

Note that while the counter values stored in dm_os_performance_counters are BIGINTs, DECIMAL(18,4) is used for many counters to support the math needed to calculate per-second numbers, which provides more value than raw numbers would. Units are included in all counters to make it easier to understand what they mean and how to use them.

If metrics are to be collected and aggregated across many servers, then adding a database server column would be an easy way to facilitate that. An additional non-clustered index on that column (plus the sample time) would be valuable if there were a need to perform calculations based on server rather than time.

Collecting performance counter data

With a table created to accept data from dm_os_performance_counters, a process can be built that regularly collects the data and stores it here. For the sake of brevity, this section will cover the key components to data collection only. The full script can be downloaded, reviewed, and tested at the reader’s leisure.

One key concern in this data collection is differentiating between the collection of static and cumulative counters. Most of the counters chosen to demo are cumulative, and thus the only way to measure them is to take the current sample, compare it to the last sample, and perform the necessary math to determine the metric over that time span. There are a handful of ways to manage this, but for simplicity and reliability, a new table will be created:

In addition, little history needs to be saved to this table, therefore it will not grow large and a standard clustered primary key is adequate. Realistically, only one previous sample is required to calculate what is needed for our process, but to play it safe and keep retention straightforward, this example will retain a single day of data.

In an effort to structure this project, the steps needed to collect this data are as follows:

  1. Collect a snapshot of all counters of interest into a temporary table.
  2. Determine the last sample time for the cumulative counters.
  3. Join the new counter data to the previous cumulative counter data.
  4. Insert the results into dbo.performance_counters.
  5. Insert the raw cumulative data into dbo.performance_counters_staging.
  6. Remove any staging data older than a day.

Note that there is no retention on dbo.performance_counters. This is intentional as it is built as a table to store performance counter data long-term. Retention may be added at the end of the process to limit storage to a reasonable time frame if needed. It’s advisable to keep at least 1-2 years of data, if it is to be used for capacity planning or other long term trending. Given the table’s small size, retaining it forever would not be expensive.

The process can run as often as is prudent, and that frequency would be determined by an organization and the environment being monitored. For an average database server, I’d consider starting at samples every minute and adjust more/less often as needed.

Collect a Snapshot

To reduce the performance impact of querying system views, there is a benefit in collecting counter data in a single query and putting it into a temporary table:

The syntax looks a bit messy but performs as well as a PIVOT or some other method would. The goal is to hit dm_os_performance_counters a single time and not need to revisit it for different subsets of counters. The T-SQL above will generate a single row of data with all of the requested metrics placed into the temporary table.

Determine the Last Sample Time for Cumulative Counters

For the cumulative counters, it is necessary to compare the values just collected against the last sample taken. If no previous sample has been taken, then this sample time will be NULL and results for these counters will not be collected until the next time this process runs.

This is straightforward and checks the staging table for the last sample time and stores it in a variable for use soon. The remainder of this process will be nested within an IF statement, ensuring that the first run of this code will not generate any performance counter data. Instead, staging data will be populated and a full data set generated on the next run.

This includes the deletion of any staging data more than a day older than the last sample time, which ensures this table does not get overly large. An administrator may wish to keep the raw numbers for telemetry purposes for a longer period of time, and if so, this retention can be adjusted accordingly here.

Calculate Metrics and Insert into Performance Counter Table

For the two point-in-time metrics, the work is simple, but for the cumulative counters, some logic needs to be added to compare with the previous value and also calculate the per-second metric.

Since a service restart will reset cumulative counters, it is necessary to check if the current value is larger than the previous value. If so, then subtract to get the difference. If the current value is less, then simply use that value as the counter value.

The counter rate per second is calculated by dividing the amount that the counter incremented by since the last sample by the number of seconds since the last sample.

Populate Raw Cumulative Counter Data

This final INSERT ensures that a snapshot of cumulative counters is retained for posterity.

Putting it All Together

When the entire process is pieced together and executed for the first time, it will place data into the staging table only:

collecting sql server performance data into a staging table

These represent raw, cumulative counters for ten of the twelve metrics we are going to track.

When the same process is executed again, the staging table gets a second row, and the primary counter table is populated with its first complete row:

The results of two collections of SQL Server performance data

With this and each subsequent run, a row will be placed in dbo.performance_counters (the bottom result set) that contains both the point-in-time and cumulative counters. The calculation that converts raw/cumulative numbers into rates can be observed with the following example from the data above:

17:25:26.783: Page Lookup Count (Cumulative): 70515235

17:28:05.320: Page Lookup Count (Cumulative): 70521221

Difference: 5986 Page Lookups

Seconds Between Samples: 159

Page Lookups Per Second = 5986 / 159 = 37.65

The final step to collecting this data is to schedule a job or task that executes this code regularly, adding a row to dbo.performance_counters on each run.

The script included in this article contains the full stored procedure code for dbo.populate_performance_counters, which takes all of the code above and packages it into a single stored procedure for easy use by a developer or administrator.

This process is meant for customization and is of most value when tailored to the environment that is to use it. If multiple servers are to be monitored, then run the stored procedure presented here on each server and then (if needed) have a centralized process that collects this data regularly and stores it for central analysis.

The only alteration to the table structure for the centralized table would be the addition of a column to store the source database server name and possibly an index to support searches on that column.

Customization

Since dm_os_performance_counters contains a wide variety of performance counters, it is natural that only a small subset will be useful to any one person, and that specific subset will vary from organization to organization. For the purpose of this article, a selection of what I considered to be some of the more useful performance counters were chosen.

Adding or removing counters can follow the patterns illustrated in this code and should not be difficult. To add a counter, follow a process similar to this:

  1. Add a column to dbo.performance_counters for the new counter.
  2. If the counter is cumulative, then also add it to dbo.performance_counters_staging.

Within the stored procedure:

  1. Add the counter to #performance_counters.
  2. Add the counter to the INSERT/SELECT into #performance_counters
  3. Add the counter to the INSERT into dbo.performance_counters.
  4. If the counter is cumulative, then also add it to the INSERT into dbo.performance_counters_staging.

To remove a counter, follow a similar process to what is outlined above, but instead of adding a counter, remove it instead.

While no database-scoped counters were included in the examples above, they can be added via similar processes. If the number of databases to be tracked is small and finite, then adding them to the tables already presented here makes sense. If the number of databases is large or variable, then creating a separate table for them is a more scalable solution.

In the latter scenario, the database name would be added as a column in a new table and a row would be inserted per sample time per database. A separate table for database-scoped counters is the easiest way to avoid confusion between server and database metrics and a need to differentiate between them whenever querying one single table.

Monitoring vs. capacity planning and analytics

Before wrapping up this discussion of performance monitor data, it is important to differentiate between two functional uses of these metrics:

  • Monitoring and Alerting
  • Analytics and Capacity Planning

The code in this article captures performance monitor metrics for long-term storage and is ideally used for capacity planning and analytics. It can help identify long-term trends and allow developers and administrators to plan ahead, avoiding resource crunches and emergencies.

For example, a long-term downward trend of page life expectancy indicates an impending memory shortage. This may be caused by code, increased usage, more data, or something else. Seeing the long-term trend and identifying this challenge with months to spare would allow for additional memory to be purchased and allocated. If a release were to blame, it would provide the time needed to research and resolve those code changes before becoming a systemic problem. If the cause were unclear, time would be available to research and find it before it is too late.

While this data can be used for monitoring, it is not in of itself a monitoring tool. Having a reliable SQL Server monitoring tool is an invaluable resource that can:

  • Alert in the event of immediate problems.
  • Help identify code or resource problems.
  • Let you know when resources run low at certain times of the day.
  • Correlate an app problem to the code or configuration causing it.

Always consider the tools used for monitoring database servers. If you are in need of a tool for the job, consider Redgate’s SQL Monitor. The time (and sleep!) saved by having a reliable monitoring tool is significant; therefore it is worth the time needed to test and implement one.

Collecting SQL Server performance monitor data

Performance counter data is exceptionally useful, but its value cannot be fully realized until it is reliably collected and stored for future use. The more data that is retained, the more value it has. Trending IO for a week may be useful, but being able to evaluate it over a full year (or longer) will provide far more perspective and the ability to make better decisions that can stretch far into the future.

The value in a process like this is versatility, and by customizing the data and usage, a developer or administrator can learn a great deal about their database servers and the underlying databases. No single set of metrics applies to everyone and by tailoring a data set to an organization’s specific needs, an ideal set of data can be maintained that allows for database servers to be effectively analyzed and planned for over time. As conditions change, metrics can be added or removed, ensuring that even as data evolves, the ability to understand it is not compromised.

If you like this article, you might also like SQL Server performance monitor data: Introduction and usage