Dataswarm Operators
CusumOperator
This Dataswarm operator performs CUSUM detection.
Steps:
- Operator grabs data from an existing Hive table using `history_query`. The query must retrieve two columns, one called `time` with the timestamps and the other called `value` or `y` with the variable values.
- Operator performs CUSUM changepoint detection on the data. Default params are assumed; custom ones can be supplied. See `regressionDetection.CusumDetector` for more details.
- Operator uploads data to the table specified by `output_table`. Each row in the output table corresponds to a changepoint. The column headers are the following: `changepoint_found`, `direction`, `mu0`, `mu1`, `changetime`, `stable_changepoint`, `delta`, `llr_int`, `llr`, `p_value`, `regression_detected`. For more details on the columns, see `regressionDetection.CusumDetector`.
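To make the detection step more concrete, here is a minimal sketch of a single mean-shift CUSUM estimate in plain Python. It is not the implementation in `regressionDetection.CusumDetector`; the function name `cusum_changepoint` and the toy data are assumptions for illustration, and only a few of the output columns (`mu0`, `mu1`, `delta`, `direction`) are mimicked.

```python
from statistics import mean


def cusum_changepoint(values):
    """Estimate one mean-shift changepoint with a basic CUSUM statistic.

    Hypothetical helper for illustration; the real detector does more
    (iteration, likelihood-ratio test, p-value, and so on).
    """
    overall = mean(values)
    # Cumulative sum of deviations from the overall mean.
    cusum, running = [], 0.0
    for v in values:
        running += v - overall
        cusum.append(running)
    # The changepoint estimate is the index where |CUSUM| peaks.
    # The last index is excluded so both segments are non-empty.
    idx = max(range(len(values) - 1), key=lambda i: abs(cusum[i]))
    mu0 = mean(values[: idx + 1])   # mean before the change
    mu1 = mean(values[idx + 1 :])   # mean after the change
    delta = mu1 - mu0
    return {
        "changepoint": idx,
        "mu0": mu0,
        "mu1": mu1,
        "delta": delta,
        "direction": "increase" if delta > 0 else "decrease",
    }


# Rows shaped like the history_query result: (time, value).
rows = [(f"2021-01-{d:02d}", 10.0) for d in range(1, 11)] + [
    (f"2021-01-{d:02d}", 20.0) for d in range(11, 21)
]
result = cusum_changepoint([value for _, value in rows])
```

For this toy series the sketch reports a changepoint at index 9 (the last point before the shift), with `mu0` of 10 and `mu1` of 20.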
API:
# Abstract Parent Class
class CusumOperator(BashOperator):
    def __init__(
        self,
        user,
        schedule,
        dep_list,
        owner,
        history_query,
        history_namespace,
        output_table,
        output_namespace,
        ds_partition="<DATEID>",
        retention=90,
        datetime_format="%Y-%m-%d",
        cusum_params=None,
    ):
Parameters (in addition to parent class):
history_query: str. SQL query to pull data on which to perform CUSUM detection. Must retrieve only two columns named time and value.
history_namespace: str. Namespace for history_query.
output_table: str. Output table to write results to.
output_namespace: str. Namespace for output_table.
ds_partition (optional): str. ds partition associated with the uploaded output. Default "<DATEID>".
retention (optional): int. How long output data will be retained for. Default 90.
datetime_format (optional): str. Datetime format for output data. Default "%Y-%m-%d".
cusum_params (optional): dict. Custom params for CUSUM detection. Default None.
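The two-column requirement on `history_query` and the meaning of `datetime_format` can be checked locally before scheduling anything. The sketch below is an assumption-laden illustration: `check_history_columns` is a hypothetical helper, not part of the operator, and it accepts `y` as an alternative to `value` because the steps above mention both names.

```python
from datetime import datetime


def check_history_columns(columns):
    """Return True if the query result has exactly the two expected columns.

    Hypothetical helper; the operator's own validation may differ.
    """
    names = set(columns)
    has_value = bool(names & {"value", "y"})
    return len(columns) == 2 and "time" in names and has_value


ok = check_history_columns(["time", "value"])
too_many = check_history_columns(["time", "value", "extra"])

# The default datetime_format is a standard strftime/strptime pattern.
datetime_format = "%Y-%m-%d"
formatted = datetime(2054, 5, 1).strftime(datetime_format)
parsed = datetime.strptime(formatted, datetime_format)
```

A `SELECT *` query passes such a check only if the underlying table has exactly these two columns, so listing the columns explicitly in the query is the safer choice.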
Example:
#!/usr/bin/env python3
from dataswarm.operators import GlobalDefaults
from dataswarm_extension_kats.cusumoperator import CusumOperator

GlobalDefaults.set(
    user="rohanfb",
    oncall="kats_dev",
    secure_group="kats",
    schedule="@never",
    partition="ds=<DATEID>",
    depends_on_past=True,
    num_retries=3,
    task_tags=["python-version-3"],
)

history_query = """
SELECT
    *
FROM test_cusum_kats_dev
"""
history_namespace = "di"

cusum_params = {
    "threshold": 0.01,
    "max_iter": 10,
    "delta_std_ratio": 1.0,
    "min_abs_change": 0,
    "start_point": None,
    "change_directions": None,
    "interest_window": [0, 100],
    "magnitude_quantile": None,
    "magnitude_ratio": 1.3,
    "magnitude_comparable_day": 0.5,
}

wait_cusum_detector = CusumOperator(
    dep_list=[],
    history_query=history_query,
    history_namespace=history_namespace,
    output_table="test_cusum_operator_dev",
    output_namespace="di",
    retention=30,
    cusum_params=cusum_params,
    owner="kats",
)
Output Table Result:
changepoint_found | direction | changepoint | mu0 | mu1 | changetime | stable_changepoint | delta | llr_int | llr | p_value | regression_detected |
---|---|---|---|---|---|---|---|---|---|---|---|
TRUE | increase | 64 | 175.23077 | 366.74684 | 5/1/2054 | TRUE | 191.51607 | 100.33933 | 145.3254 | 0 | TRUE |
TRUE | decrease | 98 | 217.05051 | 419.44444 | 3/1/2057 | FALSE | 202.39394 | -106.11898 | 137.72771 | 0 | FALSE |
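As a small sketch of how a downstream consumer might use these rows, the snippet below keeps only the changepoints flagged as regressions. The rows are copied from a subset of the columns in the table above; representing them as Python dicts is an assumption for illustration and says nothing about how the output table is actually read.

```python
# Subset of the output columns, with values taken from the table above.
rows = [
    {"changepoint": 64, "direction": "increase",
     "delta": 191.51607, "regression_detected": True},
    {"changepoint": 98, "direction": "decrease",
     "delta": 202.39394, "regression_detected": False},
]

# Keep only the changepoints the detector flagged as regressions.
regressions = [row for row in rows if row["regression_detected"]]
flagged_changepoints = [row["changepoint"] for row in regressions]
```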