TsFeatures
What is TsFeatures
In time series analysis, we often need to characterize a time series using a set of meaningful features. Example features include strength of seasonality, strength of trend, spikiness, amount of level shift, presence of flat segments, ACF/PACF based features, linearity, Hurst exponent, and the ARCH statistic. These features are typically used to identify similar or outlying time series among a large number of samples. They also play a crucial role in many downstream projects, including (1) “meta-learning”, i.e., choosing the best forecasting model based on characteristics of the input time series; (2) time series classification and clustering analysis; (3) nowcasting algorithms for better short-term forecasting; and (4) the Forecaster’s Guardrail project, which aims to compare users’ forecast accuracy within the same cluster. TsFeatures is the module in Kats that computes such features. A similar package is openly available in R (https://pkg.robjhyndman.com/tsfeatures/index.html). Our implementation follows the feature definitions in the R package.
What features are included
The module currently supports the following features:
- Length: length of time series data. The appropriate forecast method for a time series depends on how many observations are available. For example, shorter series tend to need simple models such as a random walk. On the other hand, for longer time series, we have enough information to be able to estimate a number of parameters.
- Mean/Variance (2): mean/variance of time series.
- Spectral entropy: Shannon entropy of the spectral density function. A high value indicates a noisy and unpredictable time series.
- Lumpiness: variance of the variances within non-overlapping windows.
- Stability: variance of the means within non-overlapping windows.
- Trend/Seasonal strength (2): measures of trend and seasonality of a time series based on an STL decomposition. Trend strength is the variance explained by STL trend term, and seasonal strength is the variance explained by STL seasonality terms.
- Spikiness: variance of leave-one-out variances of STL remainder after an STL decomposition.
- Peak/Trough (2): location of peak/trough in the seasonal component after an STL decomposition.
- Flat spots: presence of flat segments. It’s computed by dividing the sample space of a time series into nbins equal-sized intervals, and computing the maximum run length within any single interval.
- Level shift based features (2): these two features are computed over sliding (overlapping) windows. level_shift_size finds the largest mean shift between two consecutive windows, and level_shift_idx finds the location of this largest mean shift.
- Hurst Exponent: used as a measure of the long-term memory of a time series. It relates to the autocorrelations of the time series, and the rate at which these decrease as the lag between pairs of values increases. The goal of the Hurst Exponent is to provide a scalar value that helps identify whether a series is mean reverting, random walking, or trending. A value of H near 0 indicates a highly mean-reverting series, while a value of H near 1 indicates a strongly trending series.
- ACF based features (7): summarizes the strength of a relationship between an observation in a time series with observations at prior time steps. The autocorrelation is used for finding repeating patterns, such as the presence of a periodic signal obscured by noise. AC is comprised of both the direct correlation and indirect correlations. These indirect correlations are a linear function of the correlation of the observation, with observations at intervening time steps. We compute ACs of the series, the differenced series, and the twice-differenced series, and then provide a vector comprising the first AC in each case, the sum of squares of the first 5 ACs in each case, and the AC at the first seasonal lag.
- PACF based features (4): summarizes the strength of the relationship between an observation in a time series and observations at prior time steps, with the relationships of intervening observations (indirect correlation) removed. We compute PACs of the series, the differenced series, and the twice-differenced series, and then provide a vector comprising the sum of squares of the first 5 PACs in each case, and the PAC at the first seasonal lag.
- First min AC: the time of the first minimum in the autocorrelation function.
- First zero AC: the time of the first zero crossing of the autocorrelation function.
- Linearity: R square from a fitted linear regression.
- Standard deviation of the first derivative of the time series
- Crossing Points: the number of times a time series crosses the median line.
- Binarize mean: converts time series into a binarized version: time series values above its mean are given 1, and those below the mean are 0, and then returns the average value of the binarized vector.
- ARCH statistic: Lagrange multiplier test statistic from Engle’s Test for Autoregressive Conditional Heteroscedasticity (ARCH). It measures the heterogeneity of the time series.
- Histogram mode: measures the mode of the data vector using histograms with a given number of bins.
- KPSS unit root statistic: a test statistic from the KPSS test, whose null hypothesis is that an observable time series is stationary around a deterministic trend. A linear trend and lag one are used here.
- Holt Parameters (2): estimates the smoothing parameter for the level (alpha) and the smoothing parameter for the trend (beta) of Holt’s linear trend method.
- Holt-Winter’s Parameters (3): estimates the smoothing parameters for the level (alpha), the trend (beta), and the additive seasonal component (gamma) of the Holt-Winters method.
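Several of the window-based and counting features above are simple enough to sketch directly. The following is a minimal illustration written from the bullet descriptions, using only the Python standard library; it is not the Kats implementation, which may differ in details such as windowing, normalization, or tie handling. All function names are hypothetical, and consecutive windows for the level shift are assumed to be sliding windows offset by one step.

```python
from statistics import fmean, median, pvariance


def tiled_windows(values, window_size):
    """Split values into consecutive non-overlapping windows."""
    return [values[i:i + window_size] for i in range(0, len(values), window_size)]


def lumpiness(values, window_size):
    """Variance of the per-window variances."""
    return pvariance([pvariance(w) for w in tiled_windows(values, window_size)])


def stability(values, window_size):
    """Variance of the per-window means."""
    return pvariance([fmean(w) for w in tiled_windows(values, window_size)])


def level_shift(values, window_size):
    """Return (index, size) of the largest mean change between
    consecutive sliding windows (assumed offset: one step)."""
    means = [fmean(values[i:i + window_size])
             for i in range(len(values) - window_size + 1)]
    shifts = [abs(b - a) for a, b in zip(means, means[1:])]
    idx = max(range(len(shifts)), key=shifts.__getitem__)
    return idx, shifts[idx]


def crossing_points(values):
    """Number of times the series crosses its median."""
    med = median(values)
    above = [v > med for v in values]
    return sum(a != b for a, b in zip(above, above[1:]))


def binarize_mean(values):
    """Share of observations strictly above the series mean."""
    mean = fmean(values)
    return fmean([1.0 if v > mean else 0.0 for v in values])


def flat_spots(values, nbins):
    """Longest run of consecutive observations falling in the same
    one of nbins equal-width intervals of the value range."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / nbins or 1.0  # avoid dividing by zero on constant input
    bins = [min(int((v - lo) / width), nbins - 1) for v in values]
    longest = run = 1
    for prev, cur in zip(bins, bins[1:]):
        run = run + 1 if cur == prev else 1
        longest = max(longest, run)
    return longest


# A toy step-like series, made up for illustration.
series = [0.0, 0.0, 0.0, 0.0, 4.0, 8.0, 8.0, 8.0]
print(lumpiness(series, 2), stability(series, 2))
print(level_shift(series, 2), crossing_points(series))
print(binarize_mean(series), flat_spots(series, 4))
```

On this toy series the level shift is located where the step occurs, and the flat-spot count equals the length of the initial constant run.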
Hyperparameters
- window_size: used for calculating Level shift based features, Lumpiness, and Stability.
- spectral_freq: used for calculating Spectral entropy.
- stl_period: used for calculating Trend/Seasonal strength, Spikiness, Peak, Trough, Holt-Winter’s Parameters, and ACF/PACF based features.
- nbins: used for calculating Flat spots and Histogram mode.
- lag_size: used for calculating Hurst Exponent.
- acfpacf_lag: used for calculating ACF/PACF based features.
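To make the role of acfpacf_lag and stl_period concrete, here is a rough sketch of how the seven ACF based features could be assembled. It assumes that acfpacf_lag is the number of leading autocorrelations whose squares are summed and that stl_period is used as the first seasonal lag; both are readings of the descriptions above rather than confirmed behavior of the module. The helper names are hypothetical, and the output keys simply mirror the column names used elsewhere on this page.

```python
def autocorr(values, lag):
    """Sample autocorrelation of values at the given lag."""
    n = len(values)
    mean = sum(values) / n
    denom = sum((v - mean) ** 2 for v in values)
    if denom == 0 or lag >= n:
        return 0.0
    num = sum((values[t] - mean) * (values[t + lag] - mean) for t in range(n - lag))
    return num / denom


def diff(values):
    """First difference of a series."""
    return [b - a for a, b in zip(values, values[1:])]


def acf_features(values, acfpacf_lag, stl_period):
    """ACF features of the series, its first difference, and its second difference."""
    d1 = diff(values)
    d2 = diff(d1)
    feats = {}
    for name, series in (("y", values), ("diff1y", d1), ("diff2y", d2)):
        acs = [autocorr(series, k) for k in range(1, acfpacf_lag + 1)]
        feats[f"{name}_acf1"] = acs[0]                                # first AC
        feats[f"{name}_acf{acfpacf_lag}"] = sum(a * a for a in acs)   # sum of squared ACs
    feats["seas_acf1"] = autocorr(values, stl_period)                 # AC at the seasonal lag
    return feats


# A synthetic series with period 4, made up for illustration.
periodic = [0.0, 1.0, 0.0, -1.0] * 10
feats = acf_features(periodic, acfpacf_lag=5, stl_period=4)
for key, value in feats.items():
    print(key, round(value, 4))
```

For this periodic input the autocorrelation at the seasonal lag is close to 1 while the lag-one autocorrelation is 0, which is the pattern the seasonal feature is meant to pick up.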
Examples
- Hurst Exponent: a time series can be characterized as follows based on its Hurst Exponent:
  - H < 0.5: the time series is mean reverting
  - H = 0.5: the time series is a Geometric Brownian Motion
  - H > 0.5: the time series is trending
The following figure shows three synthetic time series: a Geometric Brownian Motion, a mean-reverting series, and a trending series. Their Hurst Exponent outputs are 0.4171, 0.0022, and 0.7185, respectively, which is consistent with our expectation.
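The behavior described above can be reproduced with a small standalone sketch. It uses one common estimator, the slope of log(standard deviation of lagged differences) against log(lag), with lags assumed to run from 2 to lag_size; the Kats implementation may differ, and the three synthetic series here are generated for illustration only (white noise, a plain random walk, and a cumulated random walk), so the values will not match the numbers quoted above.

```python
import math
import random
from statistics import pstdev


def hurst_exponent(values, lag_size):
    """Estimate H as the least-squares slope of log(std of lagged
    differences) on log(lag). Fails on constant input (log of zero)."""
    xs, ys = [], []
    for lag in range(2, lag_size + 1):
        diffs = [values[t + lag] - values[t] for t in range(len(values) - lag)]
        xs.append(math.log(lag))
        ys.append(math.log(pstdev(diffs)))
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den


def cumsum(values):
    """Running total of a sequence."""
    total, out = 0.0, []
    for v in values:
        total += v
        out.append(total)
    return out


rng = random.Random(0)                 # fixed seed so runs are repeatable
mean_reverting = [rng.gauss(0, 1) for _ in range(2000)]   # white noise
random_walk = cumsum(mean_reverting)   # increments are uncorrelated
trending = cumsum(random_walk)         # increments are strongly persistent

for name, series in (("mean reverting", mean_reverting),
                     ("random walk", random_walk),
                     ("trending", trending)):
    print(name, round(hurst_exponent(series, lag_size=20), 3))
```

With this estimator the white-noise series lands near 0, the random walk near 0.5, and the cumulated random walk near 1, matching the three regimes in the list.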
How can I use TsFeatures
TsFeatures is available through Kats’ Bento kernel. The following code provides an example.
import pandas as pd
from infrastrategy.kats.consts import TimeSeriesData
from infrastrategy.kats.tsfeatures.tsfeatures import TsFeatures

# df_data is assumed to be an existing pandas DataFrame
# with a 'time' column and a 'value' column.

# Calculate features for one time series
ts_data = TimeSeriesData(df_data[['time', 'value']])
model = TsFeatures()
features = model.transform(ts_data)
What is Spark Transformer
Many use cases of TsFeatures involve computations across a large number of time series. A Spark transformer has been created for TsFeatures extraction, which makes it possible to analyze large batches of time series in parallel. The transformer can be called directly from Daiquery, or from dataswarm pipelines using the HiveQLOperator. Example dataswarm code:
features_table = HiveQLOperator(
dep_list=[
wait_for_ts_table
],
migration_mode='spark',
spark_opts={
'spark.sql.transform.enabled': 'true',
'spark.sql.shuffle.partitions': '500'
},
hive_query="""
CREATE TABLE IF NOT EXISTS {output_table} (
id BIGINT,
length double,
mean double,
var double,
entropy double,
lumpiness double,
stability double,
flat_spots double,
hurst double,
std1st_der double,
crossing_points double,
binarize_mean double,
unitroot_kpss double,
heterogeneity double,
histogram_mode double,
linearity double,
trend_strength double,
seasonality_strength double,
spikiness double,
peak double,
trough double,
level_shift_idx double,
level_shift_size double,
y_acf1 double,
y_acf5 double,
diff1y_acf1 double,
diff1y_acf5 double,
diff2y_acf1 double,
diff2y_acf5 double,
y_pacf5 double,
diff1y_pacf5 double,
diff2y_pacf5 double,
seas_acf1 double,
seas_pacf1 double,
firstmin_ac double,
firstzero_ac double,
holt_alpha double,
holt_beta double,
hw_alpha double,
hw_beta double,
hw_gamma double
)
PARTITIONED BY (ds STRING)
TBLPROPERTIES (
'oncall' = 'kats_dev',
'noUII' = '1',
'RETENTION' = '7'
);
ADD FILE <FBPACKAGE:kats.tsfeatures.transformer:LATEST>/transformer;
INSERT OVERWRITE TABLE {output_table}
PARTITION(ds='<DATEID>')
SELECT TRANSFORM(id, FB_MAKE_JSON_OBJ(ts_values))
USING 'transformer window_size=20 \
input_schema=id,values \
output_schema=id,length,mean,var,entropy,lumpiness,stability,flat_spots,hurst,std1st_der,crossing_points,binarize_mean,unitroot_kpss,heterogeneity,histogram_mode,linearity,trend_strength,seasonality_strength,spikiness,peak,trough,level_shift_idx,level_shift_size,y_acf1,y_acf5,diff1y_acf1,diff1y_acf5,diff2y_acf1,diff2y_acf5,y_pacf5,diff1y_pacf5,diff2y_pacf5,seas_acf1,seas_pacf1,firstmin_ac,firstzero_ac,holt_alpha,holt_beta,hw_alpha,hw_beta,hw_gamma'
AS
id BIGINT,
length double,
mean double,
var double,
entropy double,
lumpiness double,
stability double,
flat_spots double,
hurst double,
std1st_der double,
crossing_points double,
binarize_mean double,
unitroot_kpss double,
heterogeneity double,
histogram_mode double,
linearity double,
trend_strength double,
seasonality_strength double,
spikiness double,
peak double,
trough double,
level_shift_idx double,
level_shift_size double,
y_acf1 double,
y_acf5 double,
diff1y_acf1 double,
diff1y_acf5 double,
diff2y_acf1 double,
diff2y_acf5 double,
y_pacf5 double,
diff1y_pacf5 double,
diff2y_pacf5 double,
seas_acf1 double,
seas_pacf1 double,
firstmin_ac double,
firstzero_ac double,
holt_alpha double,
holt_beta double,
hw_alpha double,
hw_beta double,
hw_gamma double
FROM (
SELECT id, ts_values
FROM tmp_tsfeatures_ts_values_test:infrastructure
WHERE ds = '<DATEID>'
LIMIT 10000
)
""".format(
output_table='tsfeatures_test'
),
)
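For readers unfamiliar with the TRANSFORM clause: by default it pipes each input row to the script as one tab-separated line on standard input, and reads tab-separated lines back from standard output as the result rows. The sketch below shows the general shape of such a script for three of the output columns only. It is not the actual transformer; the function name is hypothetical, and the payload is assumed to be a JSON array of numbers, which may not match what FB_MAKE_JSON_OBJ produces.

```python
import json


def transform_lines(lines):
    """Map 'id<TAB>json_array' rows to 'id<TAB>length<TAB>mean<TAB>var' rows.

    Assumption: the second field is a JSON array of numbers.
    """
    for line in lines:
        row_id, payload = line.rstrip("\n").split("\t", 1)
        values = [float(v) for v in json.loads(payload)]
        if not values:
            continue  # skip empty series rather than divide by zero
        n = len(values)
        mean = sum(values) / n
        var = sum((v - mean) ** 2 for v in values) / n  # population variance
        yield "\t".join([row_id, str(float(n)), str(mean), str(var)])


# In a real script the rows would be read from sys.stdin and each result
# printed to stdout; an in-memory list stands in for stdin here.
rows = ["7\t[1, 2, 3, 4]\n", "8\t[]\n"]
for out in transform_lines(rows):
    print(out)
```

Because every row is processed independently, the same script can be run on many partitions at once, which is what allows the feature extraction to scale across a large number of series.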