Troubleshooting MongoDB with Percona Monitoring and Management

Percona Monitoring and Management (PMM) is an open-source tool developed by Percona that allows you to monitor and manage your MongoDB, MySQL, and PostgreSQL databases. This blog will give you an overview of troubleshooting your MongoDB deployments with PMM.

Let’s start with a basic understanding of the architecture of PMM. PMM has two main architectural components:

  1. PMM Client – a client that lives on each database host in your environment. It collects server metrics, system metrics, and the database metrics used for Query Analytics.
  2. PMM Server – the central part of PMM that the clients report all of their metric data to. It also presents dashboards, graphs, and tables of that data in its web interface so you can visualize your metrics.

For more details on the architecture of PMM, check out our docs.
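Since the PMM Client connects to each mongod to collect its metrics, it needs a MongoDB user with monitoring privileges. As a minimal sketch in the mongo shell – the user name, password, and exact role list here are illustrative, so check the PMM documentation for the privileges your PMM version expects:

// Create a monitoring user in the admin database (name, password, and roles are examples only).
db.getSiblingDB("admin").createUser({
  user: "pmm",     // hypothetical user name
  pwd: "secret",   // replace with a strong password
  roles: [
    { role: "clusterMonitor", db: "admin" },  // read serverStatus and other metrics
    { role: "read", db: "local" }             // read the oplog
  ]
})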

Query Analytics

PMM Query Analytics (“QAN”) allows you to analyze MongoDB query performance over periods of time. In the screenshot below, you can see that the longest-running query was against the testData collection.

Percona Monitoring and Management query analytics
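For MongoDB, QAN gets its query data from the database profiler, so profiling needs to be enabled on the databases you want to analyze. A minimal sketch in the mongo shell, using the mpg database from the example below (level 2 profiles every operation and adds overhead; level 1 only records operations slower than the slowms threshold):

// Enable the profiler for a single database.
db.getSiblingDB("mpg").setProfilingLevel(2)

// Confirm the current profiler settings for that database.
db.getSiblingDB("mpg").getProfilingStatus()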

If we drill deeper by clicking on the query in PMM, we can see exactly what was run. In this case, the query was searching the testData collection of the mpg database for records where the value of x is 987544.

Percona Monitoring and Management

This is very helpful in determining what each query is doing, how often it is running, and which queries make up the bulk of your load.

The output is in the form reported by db.currentOp(), and I agree it may not be clear at a glance what the application-side (or mongo shell) command was. This is a limitation of the MongoDB API in general – the drivers send the request with perfect functional accuracy, but it does not necessarily resemble what the user typed (or programmed). With that understanding, and focusing first on what the “command” field contains, it is not too hard to picture a likely original form. For example, the query above could have been sent by running “use mpg; db.testData.find({“x”: { “$lte”: …, “$gt”: … }}).skip(0)” in the shell. The trailing “.skip(0)” is optional, as it is 0 by default.
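If you want to see this command-form shape directly on the server, db.currentOp() accepts a filter document, so you can narrow it to the namespace in question. A quick sketch (the namespace matches the example above; only operations that are currently in progress will show up):

// List in-progress operations against the mpg.testData namespace.
db.currentOp({ "ns": "mpg.testData" })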

Additionally, you can see the full explain plan for your query, just as you would by adding .explain() to it. In the example below, we can see that the query did a full collection scan on the mpg.testData collection, so we should think about adding an index on the ‘x’ field to improve the performance of this query.
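As a rough sketch of how you could confirm the collection scan and add the index by hand – the database, collection, and value are taken from the example above:

// A COLLSCAN stage in the winning plan confirms the full collection scan.
db.getSiblingDB("mpg").testData.find({ x: 987544 }).explain("executionStats")

// Add an ascending index on x so the same query can use an index scan instead.
db.getSiblingDB("mpg").testData.createIndex({ x: 1 })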

Metrics Monitor

Metrics Monitor allows you to monitor, alert on, and visualize different metrics related to your database as a whole, its internal metrics, and the systems it runs on.

Overall System Performance View

The first view that is helpful is the overall system performance view. Here you can see, at a high level, how much CPU and memory are being used, the amount of reads and writes to disk, network bandwidth, the number of database connections, database queries per second, RAM, and the uptime for both the host and the database. This view can often lead you to the problematic node(s) if you’re experiencing issues, and it also gives you a high-level picture of the overall health of your monitored environment.

Percona Monitoring and Management system overview

WiredTiger Metrics

Next, we’ll start digging into some of the database internal metrics that are helpful for troubleshooting MongoDB. These metrics mostly come from the WiredTiger storage engine, which has been MongoDB’s default storage engine since MongoDB 3.2. In addition to the metrics I cover, there are more documented here.

The WiredTiger storage engine uses tickets as a way to handle concurrency. By default, WiredTiger has 128 read tickets and 128 write tickets. PMM allows you to alert when your available tickets are getting low, and you can correlate with other metrics to see why so many tickets are being utilized. The graph sample below shows a low-load situation – only about one ticket out of 128 was checked out at any time.

Percona Monitoring and Management wiredtiger
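The ticket counters that PMM graphs are also exposed by serverStatus, so you can spot-check them by hand. A quick sketch (the exact location of this section can vary between MongoDB versions):

// "out" is the number of tickets in use; "available" plus "out" should add up
// to the configured 128 tickets for each of read and write.
db.serverStatus().wiredTiger.concurrentTransactions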

One thing that can cause you to use a large number of tickets is high checkpoint times. WiredTiger, by default, does a full checkpoint at least every 60 seconds; this is controlled by the WiredTiger parameter checkpoint=(wait=60). Checkpointing flushes all the dirty pages to disk. (By the way, ‘dirty’ is not as bad as it sounds – it’s just a storage engine term meaning ‘not yet committed to disk’.) High checkpoint times can lead to more tickets being in use.
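To check checkpoint behavior outside of PMM, serverStatus exposes the relevant WiredTiger statistics. A rough sketch – the stat names shown here are assumptions that may vary slightly between MongoDB versions:

// Dirty data currently waiting to be checkpointed, in bytes.
db.serverStatus().wiredTiger.cache["tracked dirty bytes in the cache"]

// Duration of the most recent checkpoint, in milliseconds.
db.serverStatus().wiredTiger.transaction["transaction checkpoint most recent time (msecs)"]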

Finally, we have the WiredTiger cache activity metrics. WiredTiger cache activity indicates how much data is being read into or written from the cache. These metrics can help you baseline your normal cache activity, so you can notice when a large amount of data is being read into the cache – perhaps from a poorly tuned query – or a lot of data is being written from it.

WiredTiger Cache Activity
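The cache activity counters behind this graph can also be read straight from serverStatus. A quick sketch (again, stat names may differ slightly between versions):

var cache = db.serverStatus().wiredTiger.cache
cache["maximum bytes configured"]       // configured WiredTiger cache size
cache["bytes currently in the cache"]   // current cache usage
cache["bytes read into cache"]          // cumulative bytes read into the cache
cache["bytes written from cache"]       // cumulative bytes written out of the cache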

Database Metrics

PMM also has database metrics that are not WiredTiger-specific. Here we can see the uptime for the node, queries per second, latency, connections, and the number of cursors. These are higher-level metrics that can be indicative of a larger problem, such as connection storms, storage latency, or excessive queries per second, and they can help you home in on potential issues with your database.

Percona Monitoring and Management database metrics
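Most of these higher-level numbers come from serverStatus as well, so they are easy to sanity-check from the mongo shell. A quick sketch:

var s = db.serverStatus()
s.uptime           // seconds since this mongod started
s.opcounters       // inserts, queries, updates, deletes, getmores, and commands since startup
s.connections      // current and available connections
s.metrics.cursor   // open and timed-out cursors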

Node Overview Metrics

System metrics can point you towards an issue at the OS level that may or may not correlate with your database. CPU, CPU saturation, core usage, disk I/O, swap activity, and network traffic are some of the metrics that can help you find issues that start at the OS level or below. Metrics beyond those shown below can be found in our documentation.

Node Overview Metrics

Takeaways

In this blog, we’ve discussed how PMM can help you troubleshoot your MongoDB deployment. Whether you’re looking at WiredTiger-specific metrics, system-level metrics, or database-level metrics, PMM has you covered. Thanks for reading!

Additional Resources:

Download Percona Monitoring and Management

PMM for MongoDB Quick Start Guide

PMM Blog Topics

MongoDB Blog Topics

Comments
Kay Agahd

Hi Mike, thanks for the intro! I have two questions though:
1) You wrote: “the example above could have been sent by running “db.testData.find({“x”: { “$lte”: …, “$gt”: … })”. My question is why you think a range query was executed – I rather think it was an equality condition where x=987544 (at least this is what your screenshot shows).
2) How long does PMM store the slow operations? Are they stored only in the capped collection system.profile, whose size is only 1 MB by default? That would mean that PMM could show only the very latest slow operations. How is this implemented in PMM?

You may be interested in the following open sourced project:
https://github.com/idealo/mongodb-slow-operations-profiler

It collects slow operations from one or more MongoDB systems in order to visualize and analyze them. It might inspire you or your team to add some of its functionality to PMM. That would be great!

Mike Grayson

Kay,

1) You’re correct, it was an equality condition, not a range. I changed my screenshot example and neglected to update my query – good catch :). It should be db.testData.find({“x”: “97544”})
2) PMM can store slow operations for 8 days by default; it’s configurable, though, and can be longer if necessary:
https://www.percona.com/doc/percona-monitoring-and-management/faq.html#how-to-control-data-retention-for-pmm
I’ll take a look at the project you shared, thank you very much for sharing!

Kay Agahd

Thanks Mike! However, you still wrote “the example above could have been sent by running “use mpg; db.testData.find({“x”: { “$lte”: …, “$gt”: … }).skip(0)” while your screenshot shows an equality query. 🙂