Multivariate Outlier Detection
This method detects anomalies across multiple time series. An anomaly is flagged when the observed values deviate from the predicted steady-state behavior of the system. That steady-state behavior is predicted by modeling the linear interdependencies between the time series with a vector autoregression (VAR) model. The approach is especially suited to multivariate anomalies: deviations that are small in each individual series but persistent across a large number of time series.
In addition to identifying an anomalous event, the method provides utilities that flag the specific time series affected, which supports high-level root cause analysis. For more details about the approach, please refer to this note.
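As a rough illustration of the idea (not the Kats implementation), the sketch below fits a first-order VAR by ordinary least squares with NumPy, predicts each point from the previous one, and scores each time step by the sum of squared standardized residuals. The function names `fit_var1` and `residual_scores`, the lag order of 1, the scoring formula, and the synthetic data are all assumptions made for this example.

```python
import numpy as np


def fit_var1(train):
    """Least-squares fit of x_t = c + B x_{t-1} + e_t on the rows of `train`."""
    past = np.hstack([np.ones((len(train) - 1, 1)), train[:-1]])  # [1, x_{t-1}]
    coef, *_ = np.linalg.lstsq(past, train[1:], rcond=None)
    resid = train[1:] - past @ coef
    return coef, resid.std(axis=0, ddof=1)


def residual_scores(data, n_train):
    """Score every point after the training window by its one-step-ahead residual."""
    coef, sigma = fit_var1(data[:n_train])
    past = np.hstack([np.ones((len(data) - n_train, 1)), data[n_train - 1:-1]])
    z = (data[n_train:] - past @ coef) / sigma  # per-series standardized residuals
    overall = (z ** 2).sum(axis=1)              # one aggregate score per time step
    return overall, z


# Synthetic demo (made-up data): ten weakly coupled series, with a
# simultaneous shift injected into every series at t = 300.
rng = np.random.default_rng(0)
n, k, n_train = 400, 10, 200
A = 0.3 * np.eye(k) + 0.01
data = np.zeros((n, k))
for t in range(1, n):
    data[t] = A @ data[t - 1] + rng.normal(size=k)
data[300] += 4.0

overall, z = residual_scores(data, n_train)
peak = n_train + int(np.argmax(overall))  # time index of the largest overall score
print("largest overall score at t =", peak)
```

The per-series values in `z` at the peak play the role of individual scores: the series with the largest magnitudes are the natural candidates for a closer look.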
API:
class MultivariateAnomalyDetector(data, params, training_days)
Parameters:
data: TimeSeriesData - Note that the data should be deseasonalized and detrended prior to detection
params: [VARParams](https://fb.quip.com/iYpgAq8zh1x4) instance initialized with appropriate parameters for training the VAR model
training_days: Initial number of days (can be a fraction) to use for training the model. The data points in this initial training window are therefore excluded from the results.
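Because the detector expects deseasonalized and detrended input, some preprocessing is usually needed first. The sketch below shows one simple way to do it, assuming a linear trend and a known, fixed seasonal period: both are fitted jointly by least squares and subtracted. The helper `detrend_and_deseasonalize` is hypothetical and not part of Kats, and other decomposition methods may suit real data better.

```python
import numpy as np


def detrend_and_deseasonalize(values, period):
    """Subtract a jointly fitted linear trend and per-phase seasonal mean.

    `values` is a 1-D series or a 2-D array with one column per metric.
    """
    values = np.asarray(values, dtype=float)
    t = np.arange(len(values))
    seasonal = np.eye(period)[t % period]    # one-hot indicator of the seasonal phase
    design = np.column_stack([t, seasonal])  # linear trend + one level per phase
    coef, *_ = np.linalg.lstsq(design, values, rcond=None)
    return values - design @ coef


# Made-up hourly series: trend + daily cycle + one injected spike at index 100.
t = np.arange(240)
raw = 10 + 0.5 * t + 3 * np.sin(2 * np.pi * t / 24)
raw[100] += 5.0

clean = detrend_and_deseasonalize(raw, period=24)
print("largest remaining deviation at index", int(np.argmax(np.abs(clean))))
```

After this step the trend and daily cycle are gone and the spike dominates what remains, which is the kind of input the VAR model is meant to see.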
Methods
detector():
# fit VAR model and calculate overall and individual anomaly scores
Returns:
DataFrame with one column for the overall anomaly score and one column for the
individual anomaly score of each time series
plot():
# Plot the time series metrics and the overall anomaly score at each time step.
# Useful for choosing a threshold on the overall anomaly score
get_anomaly_timepoints(threshold):
# Helper function that returns the list of time instants at which an anomaly
# was detected, based on the chosen threshold
Args:
threshold: Threshold on the overall anomaly score
get_anomalous_metrics(t, top_k):
# Helper function to get 'top_k' time series that were affected at the
# identified anomalous time instant 't'
Args:
t: Anomalous time instant
top_k: Number of highest ranked time series to display
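The two helper methods amount to a threshold on the overall score followed by a ranking of the individual scores. The sketch below illustrates that logic on plain dictionaries. The function names, the strict `>` comparison, the assumption that a larger individual score means a more affected metric, and all the numbers are invented for this example and are not taken from the Kats source.

```python
def anomaly_timepoints(overall_scores, threshold):
    """Return the time instants whose overall score exceeds `threshold`."""
    return [t for t, score in overall_scores.items() if score > threshold]


def top_k_metrics(individual_scores, t, top_k):
    """Return the `top_k` (metric, score) pairs at time `t`, highest score first."""
    ranked = sorted(individual_scores[t].items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_k]


# Toy scores (made up): overall score per time instant, and per-metric
# scores for the instants of interest.
overall_scores = {"t0": 5.2, "t1": 61.0, "t2": 12.4, "t3": 47.5}
individual_scores = {
    "t1": {"metric_a": 1.2, "metric_b": 6.5, "metric_c": 4.1},
    "t3": {"metric_a": 5.0, "metric_b": 0.7, "metric_c": 2.2},
}

anomalies = anomaly_timepoints(overall_scores, threshold=40)
print(anomalies)                                         # ['t1', 't3']
print(top_k_metrics(individual_scores, anomalies[0], 2)) # [('metric_b', 6.5), ('metric_c', 4.1)]
```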
Example
We use CDN working set size data to illustrate this multivariate anomaly detection approach below:
import pandas as pd
from infrastrategy.kats.consts import TimeSeriesData
from infrastrategy.kats.models.var import VARParams
from infrastrategy.kats.detectors.outlierDetection import (
    MultivariateAnomalyDetector
)
# read data and convert to TimeSeriesData structure
DATA_multi = pd.read_csv("../data/cdn_working_set.csv")
TSData_multi = TimeSeriesData(DATA_multi)
# select parameters to use for VAR modeling
params = VARParams(maxlags=3)
# detect anomalies in a rolling fashion
d = MultivariateAnomalyDetector(TSData_multi, params, training_days=3)
anomaly_score_df = d.detector()
# choose a threshold based on plot of anomaly scores for various anomalies
d.plot()
# get time instants for identified anomalous events
threshold = 40
anomalies = d.get_anomaly_timepoints(threshold)
# get the top 5 anomalous metrics at one of these instants
d.get_anomalous_metrics(anomalies[0], top_k=5)
Output of plot: