The case against averages
One of the main arguments is that most tools focus on the average, even though that is rarely what users experience. Today’s websites issue many requests per page, so it is very likely that at least one of those requests is slower than average.
Gil Tene even claims that most users “experience the 99.9%’lie once in ten page view attempts”. This is based on the assumption that most pages request more than 40 additional resources. The same article also provides data suggesting that the average page today issues even more requests than that.
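To get a feeling for why the number of requests per page matters so much, here is a minimal Python sketch. It assumes that requests are independent of each other and that each one has the same chance of landing above a given percentile, which real traffic only approximates. The function name and parameters are ours, chosen for illustration.

```python
def prob_slow_request(percentile, requests):
    """Chance that at least one of `requests` requests is slower than
    the given percentile, assuming the requests are independent."""
    if not 0 <= percentile <= 100:
        raise ValueError("percentile must be between 0 and 100")
    if requests < 0:
        raise ValueError("requests must not be negative")
    # Probability that a single request stays at or below the percentile.
    fast = percentile / 100
    # All requests are fast with probability fast ** requests,
    # so at least one is slow with the complementary probability.
    return 1 - fast ** requests


if __name__ == "__main__":
    # How the chance of hitting the 99th percentile grows with page size.
    for n in (1, 10, 40, 100):
        print(n, round(prob_slow_request(99, n), 3))
```

Under these assumptions, a page that issues 40 requests already has roughly a one in three chance of containing at least one request slower than the 99th percentile.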
The “99.9%’lie” (that is, the 99.9th percentile) is the value that 99.9% of measurements fall below. This means that only 0.1% of requests take longer than the 99.9th percentile, yet most users who regularly use your app will often have to wait longer than this. This number can also differ vastly from the average, as we will see later.
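If you want to see how far a percentile can drift from the average, the following Python sketch computes both for a small sample. It uses the simple nearest-rank definition of a percentile, which is only one of several common definitions and not necessarily the one PryIn uses. The sample durations are made up for illustration.

```python
import math
import statistics


def percentile(values, p):
    """Nearest-rank percentile: the smallest value that has at least
    p percent of all values at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(values)
    # Rank is 1-based; multiply before dividing to limit rounding error.
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical request durations in milliseconds:
# 98 fast requests, one slightly slower one and one extreme outlier.
durations = [100] * 98 + [150, 5000]

if __name__ == "__main__":
    print("average:", statistics.mean(durations))
    for p in (95, 99, 100):
        print(f"{p}th percentile:", percentile(durations, p))
```

In this sample the 95th percentile equals the typical duration, while the single outlier only becomes visible at the very top of the distribution and pulls the average above what almost every request actually took.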
Percentiles on PryIn
To not hide anything from you, PryIn lets you freely decide what percentile you want to look at in addition to the average - whether that’s the 80th, 95th, 99th or even 99.9999th percentile. You don’t have to pick that percentile when data is imported, only when you’re looking at it.
And those percentiles really can paint a very different picture. Consider the following charts taken from PryIn, all showing data for the same period and the same controller (taken from PryIn monitoring its staging instance). The only difference is the percentile we’re looking at:
95th percentile data on PryIn
99th percentile data on PryIn
99.99th percentile data on PryIn
The 95th and even the 99th percentile actually look OK, while the 99.99th percentile makes it clear that there are outliers taking more than 10 times longer than the average.
How PryIn notifies you about outliers
To handle these outliers, PryIn lets you define “Alert Thresholds” - durations after which you consider traces to be taking too long. We believe that beyond some threshold, the actual time a trace takes doesn’t matter that much anymore. What matters more is how often such bad traces happen. Your users won’t care whether they wait 13 or 15 seconds to log in. Both are too long. It is probably important to you whether that happens to 1 or to 1000 users a day, though.
PryIn sends you a daily summary of exactly that: traces that took longer than the thresholds you defined. You can define those thresholds on multiple levels: global thresholds (“no request should take longer than 100ms”) or more granular ones (“StatisticsController#show should not take longer than 800ms”).
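The idea of counting slow traces per threshold can be sketched in a few lines of Python. This is not PryIn’s implementation or API, just an illustration of the concept. All names, the trace format and the second controller are assumptions; only the two threshold values are taken from the examples above.

```python
from collections import Counter

# Hypothetical configuration, mirroring the two example thresholds.
GLOBAL_THRESHOLD_MS = 100
CONTROLLER_THRESHOLDS_MS = {"StatisticsController#show": 800}


def threshold_for(action):
    """Use the granular threshold if one exists, else the global one."""
    return CONTROLLER_THRESHOLDS_MS.get(action, GLOBAL_THRESHOLD_MS)


def slow_trace_counts(traces):
    """Count, per action, how many traces exceeded their threshold.
    `traces` is an iterable of (action, duration_ms) pairs."""
    counts = Counter()
    for action, duration_ms in traces:
        if duration_ms > threshold_for(action):
            counts[action] += 1
    return counts


# Made-up traces for one day.
traces = [
    ("StatisticsController#show", 750),
    ("StatisticsController#show", 900),
    ("SessionController#create", 13000),
    ("SessionController#create", 15000),
    ("SessionController#create", 80),
]

if __name__ == "__main__":
    for action, count in slow_trace_counts(traces).items():
        print(f"{action}: {count} slow trace(s)")
```

Note that the 13 and 15 second logins count the same here: each is simply one trace over the threshold, which matches the point that frequency matters more than the exact duration.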
Give it a try and create your free account at PryIn.