metrics Archives - ISbyR (Infrequent Smarts by Reshetnikov) - https://isbyr.com/tag/metrics/

Predicting multiple metrics in Splunk
Thu, 15 Dec 2022 - https://isbyr.com/predicting-multiple-metrics-in-splunk/

The post Predicting multiple metrics in Splunk appeared first on ISbyR.

Splunk has a predict command that can be used to predict future values of a metric based on historical values. This is not Machine Learning or Artificial Intelligence functionality, but plain old statistical analysis.

So if we have a single metric, we can produce a nice prediction for the future (over a definable span) based on historical results, but predicting multiple metrics in Splunk is not as straightforward.

For example, I want to see a prediction for a used_pct metric for the next 30 days, based on the maximum daily reading for the last 30 days.

Easy (for a single metric/dimension):

Just get the metric (field) you want to predict into a timechart function and add the predict command, followed by the field name of interest and the future_timespan parameter:

| mstats max("latest_capacity|used_space") as used_space, max("latest_capacity|total_capacity") as total_capacity WHERE "index"="telegraf_metrics" "name"="vmware_vrops_datastore" sourcename=bla_vsan01 span=1d BY sourcename 
| eval used_pct=round(used_space/total_capacity*100,2) 
| timechart max(used_pct) by sourcename useother=false span=1d
| predict bla_vsan01 future_timespan=30

The resulting graph shows the historical reading for the last 30 days (the wiggly line) and the predicted values: a straight line for the average prediction, surrounded by a “fan” of upper and lower 95th percentile predictions.

Splunk single metric prediction timechart

BUT! “we don’t know how many storage instances we will have! we can’t just hardcode all the fields in the predict command or create a panel/search for each!”

I know, I know.

Predicting multiple metrics

Here it becomes a bit tricky. Remember that you need to specify every field you want to predict as part of the predict command: | predict MY_FIELD1, MY_FIELD2 ....

The way to deal with it is the map command https://docs.splunk.com/Documentation/Splunk/9.0.1/SearchReference/Map.

What it does is execute an enclosed, parametrised search once for each input result, substituting the provided parameters.

| mstats max("latest_capacity|used_space") as used_space WHERE "index"="telegraf_metrics" "name"="vmware_vrops_datastore" sourcename != "*local*" span=1d BY sourcename 
| stats values(sourcename) as sourcename 
| mvexpand sourcename 
| map 
    [| mstats max("latest_capacity|used_space") as used_space, max("latest_capacity|total_capacity") as total_capacity WHERE "index"="telegraf_metrics" "name"="vmware_vrops_datastore" sourcename=$sourcename$ span=1d by sourcename 
    | eval used_pct=round(used_space/total_capacity*100,2) 
    | timechart max(used_pct) span=1d by sourcename useother=false limit=0 
    | predict $sourcename$ future_timespan=30 
        ] maxsearches=30

The first part of the SPL (from | mstats up to | map) prepares the list of sourcenames (storages) that will be passed in place of the $sourcename$ parameter.

The second part is the same SPL as before, with the literal sourcename replaced by the token.

Here is the visualisation of the above SPL using the “Line Chart” option:

Splunk predict visualisation of a single metric

I personally find the same visualisation a bit more readable when using an “Area Chart”:

The actual measurements are visualised as shaded areas, while the predictions are drawn as lines.

Some thoughts about putting it all in a dashboard

  • Save the search as a saved search that runs hourly, or maybe even daily. Then update the SPL to load results from the saved search by passing <USER>:<APP>:<SAVED_SEARCH_NAME> to the loadjob command:

| loadjob savedsearch="admin:my_app:vmware_vrops_datastore usage prediction"

  • Add a dropdown selector to allow us either to see all storages or to focus on a single storage at a time, and update the SPL with tokens:

| fields + _time $tkn_datastore$, prediction($tkn_datastore$), lower95(prediction($tkn_datastore$)), upper95(prediction($tkn_datastore$))

  • Add static “threshold” lines by adding the below to the SPL:

| eval WARN=70, ERROR=90

The resulting SPL:

| loadjob savedsearch="admin:my_app:vmware_vrops_datastore usage prediction"
| fields + _time $tkn_datastore$, prediction($tkn_datastore$), lower95(prediction($tkn_datastore$)), upper95(prediction($tkn_datastore$))
| eval WARN=70, ERROR=90
  • And maybe add an annotation “tick” to show where “now” is, indicating that everything before it (to the left) is collected data and everything after it (to the right) is a prediction. That is done by adding the following <search type="annotation"> section to the XML of the panel:
<search type="annotation">
  <query>
| makeresults 
| bin _time span=1d 
| eval annotation_label = "Now" , annotation_color = "#808080"
| table _time annotation_label annotation_color
  </query>
  <earliest>-24h@h</earliest>
  <latest>now</latest>
</search>
Splunk annotation snippet

If you have too many storages the graph might become unreadable, so one option is to pre-calculate the predictions first and then chart only the metrics that are predicted to cross a threshold.
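One hedged sketch of that pre-filtering idea, reusing the saved search from above. The 70% threshold is an arbitrary example value, and the prediction(*) field-name pattern produced by predict should be checked against your own results:

| loadjob savedsearch="admin:my_app:vmware_vrops_datastore usage prediction"
| fields _time, prediction(*)
| untable _time series value
| eventstats max(value) as max_predicted by series
| where max_predicted > 70
| xyseries _time series value

This keeps only the series whose maximum predicted value exceeds the threshold, then pivots the data back into a chartable form with xyseries.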

Another option is to show all our “final” predictions in a table format (instead of the timechart):

| loadjob savedsearch="admin:my_app:vmware_vrops_datastore usage prediction" 
| fields - upper*, lower* 
| stats latest(*) as current_* 
| transpose 
| rename "column" AS m, "row 1" AS r 
| eval r = round(r,2) 
| rex field=m "(?:(current_prediction\((?<datastore_name_predict>.*)\))|(current_(?<datastore_name_current>.*)))" 
| eval datastore_name=coalesce(datastore_name_predict, datastore_name_current), r_type = if(isnull(datastore_name_predict),"current","prediction") 
| stats values(r) as r by datastore_name, r_type 
| xyseries datastore_name, r_type, r 
| eval WARN=40, ERROR=60 
| eval status = if(prediction > ERROR,"ERROR",if(prediction > WARN, "WARN","OK")) 
| sort - prediction 
| fields - ERROR, WARN 
| table datastore_name, current, prediction, status 
| rename datastore_name AS "Datastore", current as "Last Value", prediction as "Predicted Value", status as "Predicted Status"
Splunk predict visualisation of multiple metrics in a table format

More posts about Splunk

Splunk Failed to apply rollup policy to index… Summary span… cannot be cron scheduled
Wed, 14 Dec 2022 - https://isbyr.com/splunk-failed-to-apply-rollup-policy-to-index-summary-span-cannot-be-cron-scheduled/

I started playing with Splunk metrics rollups, but then I tried to step out of the box and got a “Failed to apply rollup policy to index=’…’. Summary span=’7d’ cannot be cron scheduled” error.

Splunk metrics rollup configuration

While you can configure the rollup policies in the Splunk UI, you can also do it using Splunk’s REST API or using Splunk configuration files (metric_rollups.conf).
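As a minimal sketch of the configuration-file route (the index and rollup-index names here are hypothetical, and the exact setting names should be double-checked against the metric_rollups.conf spec for your Splunk version):

# metric_rollups.conf - minimal sketch with hypothetical index names
[index:my_metrics_index]
defaultAggregation = avg
rollup.1.rollupIndex = my_metrics_rollup
rollup.1.span = 1h

Here rollup.1.span is the setting that is restricted to the values listed below.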

Here is the interesting part: your rollup span value is limited to the values available in the UI drop-down, no matter which configuration method you use.

The options that you can use are:

Minute Span | Hourly Span | Daily Span | Notes
1m          |             |            | requires limits.conf update
2m          |             |            | requires limits.conf update
3m          |             |            | requires limits.conf update
5m          |             |            |
6m          |             |            |
10m         |             |            |
12m         |             |            |
20m         |             |            |
30m         |             |            |
60m         | 1h          |            |
            | 3h          |            |
            | 4h          |            |
            | 6h          |            |
            | 8h          |            |
            | 12h         |            |
            | 24h         | 1d         |
Splunk rollup spans allowed for metrics rollups

If you try to use any other value, you will see the following error in splunkd.log:

12-12-2022 14:04:08.645 +1100 ERROR MetricsRollupPolicy [12181677 TcpChannelThread] - Failed to apply rollup policy to index='...'. Summary span='7d' cannot be cron scheduled.

This is not (as of 15/12/2022) a documented “feature”.

Other posts about Splunk

Plotting Splunk with the same metric and dimension names shows NULL
Wed, 05 Oct 2022 - https://isbyr.com/plotting-splunk-with-the-same-metric-and-dimension-names-shows-null/

When you plot a Splunk metric split by a dimension that has the same name as the metric itself, the graph will show NULL instead of the dimension values.

Splunk timechart visualisation: a breakdown by a dimension with the same name as the metric shows NULL

The Problem

Let’s rewind a little.

Below is the payload that is sent to Splunk HEC and you will notice that there are 2 “statuses”:

  • "status": "success" – which is one of the dimensions and it can represent a collector/monitor status
  • "metric_name:status": 0 – which is the actual metric value that was collected by the collector/monitor
{
    "time": 1664970920,
    "event": "metric",
    "host": "host_5.splunk.com",
    "index": "d_telegraf_metrics",
    "fields": {
        "collector": "collector_a",
        "status": "success",
        "metric_name:query_time_seconds": 10.869,
        "metric_name:status": 0
    }
}

In a perfect world you would probably rename one of these to avoid confusing the end user in Splunk, but we do not always live in a perfect world.

As a result, we end up with NULLs in the graphs 🙁

The Solution

Luckily for us, Splunk’s search language (SPL) is very powerful and flexible, and with two little modifications to the “original” SPL (produced by the Metrics Analyzer) we can solve the issue.

All you need to do is :

  1. Instead of prestats=true, rename the metric function result using the as keyword.
  2. Update the avg function in the timechart command to use the renamed field name.

Original SPL:

original Splunk SPL that was causing NULL in graphs

Revised SPL:

fixed Splunk SPL that shows the breakdown by dimension with the same name as the metric
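Since the SPL above lives in screenshots, here is a hedged sketch of what the revised search could look like, built from the field names in the payload above (index d_telegraf_metrics, a metric named status, and a dimension also named status); the aggregation function and span are assumptions:

| mstats avg(status) as status_value WHERE index="d_telegraf_metrics" span=1m BY status
| timechart avg(status_value) span=1m BY status

Because the aggregated metric now lives in status_value, the split-by status in the timechart unambiguously refers to the dimension, which removes the NULL series.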

The Result

Splunk fixed timechart visualisation

More posts about Splunk
