This document is a set of guidelines for Dataswarm pipelines.
Its purpose is to document behaviors and practices that experienced employees generally follow but that are either undocumented or hidden in official documentation. Please feel free to edit.
Python3
Any new pipelines should be Python3.
- Please review this Wiki page for the steps necessary for creating Python3 pipelines.
- More details here. (8/1/2019)
- Python2 -> Python3 Guide
Conventions for Maintainability
After a Dataswarm pipeline starts running, we still need to monitor it via the Dataswarm dashboard and occasionally backfill it. Remember, you may be maintaining these pipelines for years, so it behooves you to invest the effort.
To make monitoring and backfilling easier, most programmers and the documentation recommend and follow two practices:
- Set the schedule to `@daily` or `@hourly` and use one of the `WaitFor` operators. This convention keeps the dashboard simple, with one column per daily task. If one uses cron strings, the dashboard becomes cluttered. For example, if one has an `@daily` pipeline and another that runs at 5am, the dashboard looks like a checkerboard.
- Ensure that the temporal day being analyzed, the scheduled day, and the partition being inserted into all agree. This makes backfilling easier. If there is something wrong with the input data between November 1 and November 8, I will need to use the `maintainer` with arguments `2018-11-01` and `2018-11-08`, and replace the partitions with the same range of `ds`-stamps. Here is a post that discusses this more.
These two conventions are designed to protect pipeline maintainers from making mistakes. Most pipelines follow them by using `WaitForHiveOperator`s and waiting for one day.
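As a rough illustration of both conventions, a daily pipeline might look something like the sketch below. This is only a sketch and has not been verified against the real Dataswarm API: the `schedule` and `table` arguments and the source table name are assumptions; only `GlobalDefaults.set`, `WaitForHiveOperator`, and `partition='ds=<DATEID>'` come from this document.

```python
# Hedged sketch -- argument names other than `partition` are assumptions.
from dataswarm.operators import GlobalDefaults, WaitForHiveOperator

GlobalDefaults.set(
    schedule='@daily',        # convention 1: @daily/@hourly rather than a cron string
    partition='ds=<DATEID>',  # convention 2: insert into the partition for the scheduled day
)

# Wait for the source data for the same <DATEID> before running downstream tasks,
# so the analyzed day, the scheduled day, and the inserted partition all agree.
wait_for_source = WaitForHiveOperator(
    table='some_source_table',  # hypothetical table name
    partition='ds=<DATEID>',
)
```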
For pipelines that ingest ODS counters and SQL tables, there is nothing to wait on that is directly tied to the data source. Workarounds include:
- Waiting on the `SleepUntilOperator` in the `dataswarm.operators.infrastructure` module.
- Waiting on `dim_all_employees:hr`.
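As a hedged sketch of the second workaround (the way of referencing the `hr` namespace and the argument names are assumptions, not confirmed by this document), it might look roughly like:

```python
# Hedged sketch: a stand-in wait for sources (ODS counters, SQL tables) that have
# no directly tied signal. The table/namespace format below is an assumption.
wait_for_employees = WaitForHiveOperator(
    table='dim_all_employees:hr',
    partition='ds=<DATEID>',
)
```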
Always wrap the names and retentions of ALL tables created by your pipeline in the `<TABLE:...>` and `<RETENTION:...>` macros, respectively. This includes tables intended to be temporary. This serves two purposes:
- It keeps test and production tables separate, allowing testing without disturbing production runs.
- Shorter retentions save storage space and protect data privacy.
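For example (a hedged sketch; the table names, the retention value, and the exact macro syntax are assumptions to be checked against the Dataswarm documentation):

```python
# Hedged sketch: every table the pipeline creates is wrapped in <TABLE:...> and
# given a retention via <RETENTION:...>, including tables intended to be temporary.
OUTPUT_TABLE = '<TABLE:my_team_daily_metrics>'            # hypothetical table name
OUTPUT_RETENTION = '<RETENTION:90>'                       # hypothetical retention
STAGING_TABLE = '<TABLE:my_team_daily_metrics_staging>'   # hypothetical temporary table
```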
Python
The prototypical daily Dataswarm pipeline has the following structure. `GlobalDefaults` is at the top to make it easy to find the owner.
#!/usr/bin/env python3
# ...imports: time, datetime, ...
from dataswarm.operators import (
    GlobalDefaults,
    # ...other operators...
)

GlobalDefaults.set(
    # ...user and other arguments...
    partition='ds=<DATEID>',
    task_tags=["python-version-3"],
)

# ...variables containing parameters, table names, ...
# ...functions...
# ...exported tasks...
The line `partition='ds=<DATEID>'` is one way to follow the second convention above. Note that some other operators use different arguments; see below.
Other recommended conventions include:
- Place all exported variables together so that it is easier for the reader to figure out what is being run.
- For each input table, check whether a corresponding signal table exists. If so, make sure you wait on the signal table.
  - A signal table is a Hive table whose purpose is to be "waited on" by a `WaitForHiveOperator`. A Dataswarm pipeline waits on a signal table to ensure a set of partitions has landed in a related multiple-partitioned data table, instead of waiting directly on the partitions in the data table. Typically, if the table containing the data is named `x`, the signal table will be named `x_signal`.
  - More documentation can be found at this Wiki.
- Ensure that the name of each source table appears only once as a string literal. This is typically achieved either by using `GlobalDefaults.add_macros` or by defining variables containing the table names and using Python's string formatting (see the sketch after the tasks example below).
- Anything that is not an exported variable (i.e., not a task, or a list or dictionary of tasks) and exists at module level should have a name that starts with an underscore (`_`). For example, one might have a function called `_get_day_of_week(date)`.
- Use the schematized operators `(Hive|Presto)InsertOperatorWithSchema` instead of the non-schematized ones. For example, prefer `PrestoInsertOperatorWithSchema` to `PrestoInsertOperator`.
- Functions can be used to generate tasks, and should be when you have many similar tasks or want to combine redundant arguments.
- If you have many similar tasks, consider grouping them into a module-level `dict` or `list` and generating the tasks programmatically. The group can then be treated like a single task that completes when all referenced sub-tasks complete. See the `dep_list` section of this page.
tasks = {
    'stage1': {'task1': Operator(), 'task2': Operator()},
    'stage2': {'task1': Operator(), 'task2': Operator()},
}

A = Operator(dep_list=[tasks])
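For the earlier bullets about writing each source table name only once and prefixing non-exported module-level names with an underscore, a minimal sketch (the table names, query, and variables are made up for illustration) could look like:

```python
# Hedged sketch: each source table name appears exactly once as a string literal,
# and non-exported module-level names start with an underscore.
_SOURCE_TABLE = 'some_source_table'       # hypothetical source table
_OUTPUT_TABLE = '<TABLE:daily_summary>'   # hypothetical output table, wrapped in <TABLE:...>

_QUERY = '''
SELECT ds, COUNT(*) AS row_cnt
FROM {source}
WHERE ds = '<DATEID>'
GROUP BY ds
'''.format(source=_SOURCE_TABLE)
```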
Watch Out!
- Operators tend to use one of two arguments for specifying partitions. One uses no quotes around values and forward-slash separators; the other quotes values and separates them with commas.
  - `partition`: used by `WaitForHiveOperator` and `PrestoInsertOperator`
    - `partition="ds=<DATEID>"`
    - `partition="ds=<DATEID>/interface=ANDROID"`
  - `hive_partition`: used by MySQLToHive
    - `hive_partition="ds='<DATEID>'"`
    - `hive_partition="ds='<DATEID>', ts='<TSUTC>'"`
SQL
General SQL/Presto
- Cut-and-paste your query into Daiquery and click .
- You will probably be reviewing Chronos logs when debugging. Try to make the query readable in these logs.
- In `SELECT` statements with a join, explicitly label each column with its source table. Avoid using `*`.
  - Do not use `USING` clauses. This clause obscures which table a field came from and causes unintuitive behavior. Do you know when `c1 IS NULL` in `SELECT a.*, c1 FROM a LEFT JOIN b USING (c1)`?
- In Presto, consistently use either `WITH...AS` or subqueries.
  - When using outer joins, put filtering into subqueries and joining logic into the `ON` clause.
  - "If your brain hurts when reasoning where a filter should go, or when trying to understand a query written in such a way, chances are you should be using subqueries instead." Outer Join Differences.
- Review and try to follow the joining best practices.
- Review the table creation section of the Hitchhiker Guide to Presto page for the details of table creation.
Partitioning
- For Hive tables, partition columns should be of type `VARCHAR`/`STRING`. Other types will not raise errors at table creation but will cause havoc with other internal tools.
- Presto: in a multi-partitioned table that has a `ds` partition, the `ds` field should be listed first, e.g., `PARTITIONED_BY = ARRAY['ds','content_type']`. Many internal tools expect this ordering. See this post.
- In `ds`-partitioned tables, the field `ds` contains the day in the format `2018-06-30`.
- In `ds`/`ts`-partitioned tables, the `ts` field specifies the hour in the format `2018-06-04+14:00:99`. Note the postfix `:99`.
  - Timestamps in this format can be generated in a Dataswarm pipeline using `ts='<TSUTC>'`. Refer to this page for the exact format.
  - The `ds` partition is the date in Menlo Park while the `ts` is the time in UTC. For example, in one table, `ds` partition `2018-07-02` contained `ts` partitions between `2018-07-02+07:00:99` and `2018-07-03+06:00:99`.
- Use Dataswarm macros to create time-valued partition values. Avoid custom formats like `"06182018"`. It seems that some tools sort time-based partitions lexicographically to find the temporally latest one. See this page.
Table Names
- Don't include `tbl` in the name of a table.
- Use collective or plural forms. E.g., `dim_all_users`, not `dim_all_user`.
- Try to follow a naming convention. Ask your team. Existing naming conventions include:
  - Entity Tables: Each row corresponds to one entity in some data model. Each `ds` partition contains all the entities that have ever existed up to that `ds`, with updated attributes. See `dim_all_users`.
  - Fact Tables: Each row contains an event with a timestamp. Each `ds` partition contains the events that occurred during a particular day.
  - Cumulative Fact Tables: Each row is an event with a timestamp. Each `ds` partition contains the history of all events that ever happened up to that day.
  - Temporary Tables: Intermediate tables not intended to outlive one run of a particular pipeline. These are usually created with the Dataswarm macro `TMP_TABLE`.
  - Sampled Tables: A sampled version of another table. E.g., `dim_all_users_sampled`.
  - Staging Tables: Intermediary tables that should not be used by other pipelines.
  - Aggregation Tables: Tables containing data aggregated at a single entity level (e.g., user, page, group, etc.).
  - Cube Tables: Tables containing a union of aggregations. For instance, revenue at the zip code, state, and country levels.
Column Names
- Names should document the contents of the field.
  - Column names are singular. For instance, `user_id`, not `user_ids`.
  - Include the unit of the field. For example, `duration_sec`, not `duration`. Append `_utc` to the names of fields containing timestamps specified in UTC.
  - Field names should not be SQL keywords or common function names. Avoid `count` and `sum`. In addition to not being descriptive, such terms can cause a valid query to become invalid after being formatted in Daiquery.
- Column names should consist of lowercase letters and underscores.
  - Avoid escaping column names if possible. That is, no `"duration_sec"` (Presto) or `` `duration_sec` `` (SQL).
External Documentation
Dataswarm
- Official Wiki
- List of operators
- Dynamic Partitions
  - This document explains how to use a single `PrestoInsertOperator` to modify multiple partitions. As of 2019-03-07, Hive partitions are read-only. You may need to explicitly write a `DELETE` statement (or statements). It wasn't clear that the `PrestoInsertOperator` handles this case correctly.
- Official Best Practices
- Things to do
- Things NOT to do. Read this once.
- Style Guide
- Don't do computation in module scope.
Presto and Spark
- Spark Documentation
- Presto Documentation
- More details on window functions. From Transact-SQL, but seems to be correct.