Page Life Expectancy Doesn’t Mean Jack, and You Should Stop Looking At It.

Monitoring
26 Comments

Page Life Expectancy is a Perfmon counter that’s supposed to track how long pages stay in memory before they get flushed out to make room for other pages that are needed instead.

Paul Randal has blogged this a few times, and rather than rehash ’em, I’d rather point you to a couple of his roundups:

And layer in a few of my own observations for context:

Page Life Expectancy goes up 1 for each second that you’re not under memory pressure. Restart your SQL Server instance and watch PLE: it starts at 1, and goes up by 1 for each second of uptime. 5 minutes of uptime = PLE 300. For the first 5 minutes, it’s not like your server’s under memory pressure – it just woke up, for crying out loud. Give him a minute. Well, 15-20, I guess, because PLE is near useless during that time span.

Page Life Expectancy drops can be triggered by confusing operations. By default, any one running query can get a memory grant the size of 25% of your buffer pool. Run a few of those queries at the same time, and your buffer pool gets drained – but PLE doesn’t necessarily drop. However, the instant an unrelated query runs and needs to get data that isn’t cached in RAM, your PLE will drop catastrophically. Which queries are at fault? The queries getting large grants, or the queries doing reads? “Yes.”

Page Life Expectancy is a lagging indicator. Lagging indicators are something that tell you about an emergency long after the emergency has happened, and lagging indicators don’t recover quickly after the emergency is over. When you combine the above two problems – PLE only rising 1 per second, and PLE dropping at times that aren’t necessarily tied to the dropping buffer pool – then if you’re alerting based on low PLE numbers, you could have already missed the emergency. When you go log in and look for what long-running queries are running, it’s already too late. Instead, you should be using leading indicators: things that tell you a problem is coming up.

With that stuff in mind, I’ve removed the Page Life Expectancy warnings from sp_BlitzFirst altogether. I think they’re distracting y’all from the real leading indicator, wait stats – and of course, sp_BlitzFirst shows that, so I’d rather focus your attention there.

Previous Post
What Was the Worst Database Decision You Ever Made?
Next Post
How Doug Lowing Gets the Most Out of His Live Class Season Pass

26 Comments. Leave new

  • Say it isn’t so!!!!!
    Next thing you’ll say is I shouldn’t be looking at disk queue length and dividing it by the number of spindles in my RAID-5 array

    Reply
  • I do wish “best practice” came with a BBE date.

    Reply
  • Grzegorz ?yp
    June 17, 2020 8:08 am

    What should be used instead PLE to see memory pressure problem?

    Reply
  • “Page Life Expectancy Low” Finding is removed? Please no.

    I have inherited a lot of old apps and procedures from former sql admins which I will never understand. So when I find a “Page Life Expectancy Low” issue I thorow RAM at the server.

    Reply
    • Roman – instead, look for high PAGEIOLATCH or RESOURCE_SEMAPHORE waits, and throw memory at it in those situations. It’ll be way more cost-effective.

      Reply
      • Thanks, for the hint. I get both to together right now. But I have never took a look at PAGEIOLATCH as the priority is much lower
        Priority FindingsGroup Finding
        50 Server Performance Page Life Expectancy Low
        200 Wait Stats PAGEIOLATCH_SH

        Reply
      • John Bougeois
        March 12, 2024 3:43 pm

        I had a situation where I was experiencing with high RESOURCE_SEMAPHORE waits. The issue was Indexing. I had been reorganizing indexes and when I changed it to index rebuild online, it solved the issue. Reorganizing just was no longer doing the job. But be careful with Index rebuild, especially in sql server standard. It can cause table locking and blocking.

        Reply
    • Jeffrey Mergler
      June 17, 2020 9:38 am

      @Roman, you can use Tim Radney’s script if you feel the need to query PLE: http://timradney.com/2013/03/08/what-is-page-life-expectancy-ple-in-sql-server/

      That said, I will now start honing in on waitstats as brent mentions above.

      Reply
  • Jeffrey Mergler
    June 17, 2020 9:35 am

    No complaints here, I didn’t even notice PLE was in BlitzFirst and there are plenty of scripts which give you PLE.

    Reply
  • Vince Thomas
    June 17, 2020 11:01 am

    I agree. When we switched over to very fast SSD drives a couple years ago, I stopped worrying about PLE.

    Reply
  • I remember, back in the day, it was all “buffer cache hit ratio is great”, then “no, BCHR is garbage – use PLE – its great”. Now “PLE is garbage use waitstats”. Am I starting to sound old??? ?

    Reply
  • I take the point about PLE being a lagging indicator but so are wait stats aren’t they?

    We have PLE checks but they look at the direction of travel not the values. If it is going up or flat with ‘high enough’ value we don’t generate alerts. If it is flat with a low value or falling then we need to take a look.

    Reply
  • The good guys of Solarwinds should read this. 🙂

    Reply
    • Well, don’t blame monitoring tools: when I worked for Quest, we had that same discussion, but the product management team said, “We can’t remove that metric. Too many DBAs *THINK* it’s a useful metric, and if we don’t show it, they’ll think our tool sucks.” So I get why the vendors have to put it on there.

      Reply
      • Haha yeah i don’t ? Still love Solarwinds!

        Reply
      • I like Page Life Expectancy over long periods of time to understand the general load patterns on a server. Agree, it is a lagging indicator, but lagging indicators are often good “acausal metrics”. Acausality is a weird topic for most engineers to wrap their heads around, but it’s how airplanes remain stable in flight – wings of the plane get input from a sensor at the nose cone of the plane on possible lift conditions, so the wings know milliseconds before it happens what the wind conditions will be.

        In the same way, if you’re coming in as a consultant and trying to understand load patterns, page life expectancy is a really cheap way to learn something about a server that has no prior diagnostic data available – assuming they didn’t just recycle the box.

        With cloud databases becoming more common, eventually it will fade away completely, since you will just have all the useful metrics available to you.

        And, back in the 2000s when Trust In The Rust spinning disks were still common, page life expectancy was somewhat useful for a different reason, but back then I preferred to just watch query cache metric in Windows PerfMon along with CPU activity and see if my query was parallelizing perfectly and how much of the cache was hit or missed.

        Reply
  • Haha yeah i don’t 🙂 Still love Solarwinds!

    Reply
  • Heyo Brent, ironically timed article because I was just asking about this on StackExchange two weeks ago. One of the first things I looked into we’re wait stats for any indication of memory pressure. I noticed my top 2 wait stats were: “MEMORY_ALLOCATION_EXT” and “RESERVED_MEMORY_ALLOCATION_EXT”. Are these any indication of memory pressure or I should be looking for other specific wait stats? (These two were my highest wait types by an order of magnitude more than the next highest.) Thanks.

    Reply
  • I just came across this post from someone who recommended looking at it. 😛 [deleted] Kinda old, but it pops up on Google right away. 🙂

    Reply
  • I must break down and confess I use this to indicate that I have done a good job.

    Some 3 years ago the Page life expectancy on our SQL Server were always between 6.000 and 12.000.
    We then installed some more memory and it went up between 40.000 and 50.000

    I then had the BI and .Net team start using a redundant database for their work.
    Their last queries on the production database were performed 2 days ago and the PLE has been climbing since.
    It is now showing 141.000 and we have never had a more stable production system.

    Am I wrong for using this as one (of many) indicators that things are better now than 3 years ago.

    Reply

Leave a Reply

Your email address will not be published. Required fields are marked *

Fill out this field
Fill out this field
Please enter a valid email address.