Kats Sample Looks Like this

This document is a set of guidelines for Dataswarm pipelines.

Its purpose is to document behaviors and practices that experienced employees generally follow but that are either undocumented or hidden deep in official documentation. Please feel free to edit.

Python3

Any new pipelines should be Python3.

  • Please review this Wiki page for the steps necessary for creating Python3 pipelines.
  • More details here. (8/1/2019)
  • Python2 -> Python3 Guide

Conventions for Maintainability

After a Dataswarm pipeline starts running, we still need to monitor it on the Dataswarm dashboard and occasionally backfill it. Remember, you may be maintaining these pipelines for years, so it behooves you to invest the effort up front.

To make monitoring and backfilling easier, most programmers and the documentation recommend and follow two practices:

  • Set the schedule to @daily or @hourly and use one of the WaitFor operators. This convention keeps the dashboard simple, with one column per daily task. If one uses cron strings, the dashboard becomes cluttered: for example, with one @daily pipeline and another that runs at 5am, the dashboard looks like a checkerboard.
  • Ensure that the temporal day being analyzed, the scheduled day, and the partition being inserted into all agree. This makes backfilling easier. If there is something wrong with the input data between November 1 and November 8, you need only run the maintainer with arguments 2018-11-01 and 2018-11-08 and replace the partitions with the same range of ds-stamps. Here is a post that discusses this in more detail.

These two conventions are designed to protect pipeline maintainers from making mistakes. Most pipelines follow them by using WaitForHiveOperators and waiting for one day.
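To make the two conventions concrete, here is a minimal sketch of a daily wait-and-insert pair. The operator names come from this document, but the argument names other than partition and dep_list (table, select) are assumptions and may differ from the real Dataswarm API, and the table names are hypothetical.

# Minimal sketch only. Argument names other than partition and dep_list are
# assumptions, and all table names are hypothetical.
wait_for_input = WaitForHiveOperator(
    table='my_input_table_signal',   # wait on the signal table, not the data table
    partition='ds=<DATEID>',         # the scheduled day and the waited-on partition agree
)

publish = PrestoInsertOperatorWithSchema(
    dep_list=[wait_for_input],
    table='<TABLE:my_daily_output>',
    partition='ds=<DATEID>',         # ...and so does the partition being inserted into
    select="SELECT ... FROM my_input_table WHERE ds = '<DATEID>'",
)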

For pipelines that ingest ODS counters and SQL tables, there is nothing to wait on that is directly tied to the data source. Workarounds include:

  • Use the SleepUntilOperator from the dataswarm.operators.infrastructure module.
  • Wait on dim_all_employees:hr.

Always wrap the names and retentions of ALL tables created by your pipeline in the <TABLE:...> and <RETENTION:...> macros, respectively. This includes tables intended to be temporary. This serves two purposes:

  • It allows testing without disturbing production runs.
  • Shorter retentions save storage space and protect data privacy.
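For example, a pipeline might define its table names and retentions as module-level constants. The names below are hypothetical, the retention unit is assumed to be days, and where the macros are ultimately consumed depends on the operators used.

# Hypothetical names; the <TABLE:...> macro lets test runs target a separate
# namespace, and <RETENTION:...> caps how long each table is kept (unit assumed
# to be days).
_OUTPUT_TABLE = '<TABLE:my_daily_output>'
_OUTPUT_RETENTION = '<RETENTION:90>'

_TMP_TABLE = '<TABLE:my_daily_output_tmp>'   # temporary tables get the macros too
_TMP_RETENTION = '<RETENTION:3>'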

Python

The prototypical daily Dataswarm pipeline has the following structure. GlobalDefaults is set at the top to make it easy to find the owner.

#!/usr/bin/env python3
# ...imports: time, datetime, ...

from dataswarm.operators import (
    GlobalDefaults,
    # ...other operators...
)

GlobalDefaults.set(
    # ...user and other arguments...
    partition='ds=<DATEID>',
    task_tags=["python-version-3"],
)

# ...variables containing parameters, table names, ...
# ...functions...
# ...exported tasks...

The line partition='ds=<DATEID>' is one way to follow the second convention above. Note that some other operators use different argument names; see the Watch Out! section below.

Other recommended conventions include:

  • Place all exported variables together so that it is easier for the reader to figure out what is being run.
  • For each input table, check whether a corresponding signal table exists. If so, wait on the signal table.
    • A signal table is a Hive table whose purpose is to be "waited on" by a WaitForHiveOperator. A Dataswarm pipeline waits on a signal table to ensure that a set of partitions has landed in a related multiple-partitioned data table, instead of waiting directly on the partitions in the data table. Typically, if the table containing the data is named x, the signal table is named x_signal.
    • More documentation can be found at this Wiki.
  • Ensure that the name of each source table appears only once as a string literal. This is typically achieved either by using GlobalDefaults.add_macros or by defining variables containing the table names and using Python's string formatting.
  • Anything that is not an exported variable (i.e., not a task, or a list or dictionary of tasks) and exists at module level should have a name that starts with an underscore. For example, one might have a function called _get_day_of_week(date).
  • Use the schematized operators (Hive|Presto)InsertOperatorWithSchema instead of the non-schematized ones. For example, prefer PrestoInsertOperatorWithSchema to PrestoInsertOperator.
  • Functions can be used to generate tasks, and should be when you have many similar tasks or want to factor out redundant arguments (see the sketch after the code block below).
  • If you have many similar tasks, consider grouping them into a module-level dict or list and generating the tasks programmatically. Such a group can then be treated like a single task that completes when all referenced sub-tasks complete. See the dep_list section of this page.
tasks = {
    'stage1': {'task1': Operator(), 'task2': Operator()},
    'stage2': {'task1': Operator(), 'task2': Operator()},
}

A = Operator(dep_list=[tasks])
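As an illustration of the last two points, here is a sketch of a hypothetical task-generating function. The helper _make_wait_task and the table names are made up for the example, and the WaitForHiveOperator argument names are assumptions.

# Sketch only: _make_wait_task and the table names are hypothetical, and the
# WaitForHiveOperator argument names are assumptions.
_INPUT_SIGNAL_TABLES = ['table_a_signal', 'table_b_signal', 'table_c_signal']

def _make_wait_task(table_name):
    # Factors out the arguments shared by every wait task.
    return WaitForHiveOperator(
        table=table_name,
        partition='ds=<DATEID>',
    )

wait_tasks = {name: _make_wait_task(name) for name in _INPUT_SIGNAL_TABLES}

# The whole dict behaves like one dependency: load runs after every wait finishes.
load = Operator(dep_list=[wait_tasks])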

Watch Out!

  • Operators tend to use one of two arguments for specifying partitions. One uses no quotes and forward-slash separators; the other uses quotes and comma separators.
    • partition: Used by WaitForHiveOperator and PrestoInsertOperator
      • partition="ds=<DATEID>"
      • partition="ds=<DATEID>/interface=ANDROID"
    • hive_partition: Used by MySQLToHive
      • hive_partition="ds='<DATEID>'",
      • hive_partition="ds='<DATEID>', ts='<TSUTC>'",
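The difference is easiest to see with the two styles side by side. The sketch below shows only the partition arguments; everything else each operator needs is omitted.

# Sketch only; all other required arguments are omitted.
wait = WaitForHiveOperator(
    partition='ds=<DATEID>/interface=ANDROID',       # no quotes, '/' separator
)

ingest = MySQLToHive(
    hive_partition="ds='<DATEID>', ts='<TSUTC>'",    # quoted values, ',' separator
)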

SQL

General SQL/Presto

  • Cut-and-paste your query into Daiquery and format it there.
    • You will probably be reviewing Chronos logs when debugging. Try to make the query readable in these logs.
  • In a SELECT statement with a join, explicitly qualify each column with its source table. Avoid using *.
    • Do not use USING clauses. This clause obscures which table a field came from and causes unintuitive behavior. Do you remember when c1 IS NULL in SELECT a.*, c1 FROM a LEFT JOIN b USING (c1)?
  • In Presto, consistently use one of WITH...AS or subqueries.
    • When using outer joins, put filtering into subqueries and joining logic into the ON-clause (see the sketch after this list).
    • "If your brain hurts when reasoning where a filter should go, or when trying to understand a query written in such a way, chances are you should be using subqueries instead." Outer Join Differences.
  • Review and try to follow the joining best practices.
  • Review the table creation section of the Hitchhiker's Guide to Presto page for the details of table creation.
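Putting several of these points together, here is a sketch of the kind of query string a pipeline might pass to an insert operator: every column is qualified with its source table, the join uses an explicit ON clause instead of USING, and the filter on the outer-joined side lives in a subquery. The table and column names other than dim_all_users are hypothetical.

# Hypothetical query: qualified columns, an explicit ON clause, and filtering on
# the outer-joined side pushed into a subquery.
_DAILY_QUERY = """
SELECT
    users.user_id,
    users.country,
    events.event_time_utc
FROM dim_all_users users
LEFT OUTER JOIN (
    SELECT user_id, event_time_utc
    FROM my_fact_events
    WHERE ds = '<DATEID>'
) events
ON users.user_id = events.user_id
WHERE users.ds = '<DATEID>'
"""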

Partitioning

  • For Hive tables, partition columns should be of type VARCHAR/STRING. Other types will not raise errors at table creation but will cause havoc with other internal tools.
  • Presto: In a multi-partitioned table that has a ds partition, the ds field should be listed first, e.g., PARTITIONED_BY = ARRAY['ds','content_type']. Many internal tools expect this ordering. See this post.
  • In ds-partitioned tables, the field ds contains the day in the format 2018-06-30.
  • In ds/ts-partitioned tables, the ts field specifies the hour in the format 2018-06-04+14:00:99. Note the postfix :99.
    • Timestamps of this format can be generated in a Dataswarm pipeline using ts='<TSUTC>'. Refer to this page for the exact format.
    • The ds partition is the date in Menlo Park while the ts is the time in UTC. For example, in one table, ds-partition 2018-07-02 contained ts partitions between 2018-07-02+07:00:99 and 2018-07-03+06:00:99.
  • Use Dataswarm macros to create time-valued partition values. Avoid custom formats like "06182018". Some tools appear to sort time-based partitions lexicographically to find the temporally latest one. See this page.
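A minimal create statement consistent with these points might look like the following. The table and columns are hypothetical, and WITH (partitioned_by = ...) is standard Presto Hive-connector syntax that may differ in the internal setup.

# Hypothetical table: partition columns are VARCHAR, declared last, with 'ds'
# listed first in partitioned_by.
_CREATE_QUERY = """
CREATE TABLE IF NOT EXISTS <TABLE:my_daily_output> (
    user_id BIGINT,
    duration_sec BIGINT,
    ds VARCHAR,
    content_type VARCHAR
)
WITH (partitioned_by = ARRAY['ds', 'content_type'])
"""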

Table Names

  • Don't include tbl in the name of a table.
  • Use collective or plural forms. E.g., dim_all_users, not dim_all_user.
  • Try to follow a naming convention. Ask your team. Existing naming conventions include:
    • Entity Tables: Each row corresponds to one entity in some data model. Each ds partition contains all the entities that have ever existed up to that ds, with updated attributes. See dim_all_users.
    • Fact Tables: Each row contains an event with a timestamp. Each ds partition contains the events that occurred during a particular day.
    • Cumulative Fact Tables: Each row is an event with a timestamp. Each ds partition contains the history of all events that ever happened up to that day.
    • Temporary Tables: Intermediate tables not intended to outlive one run of a particular pipeline. These are usually created with the Dataswarm macro TMP_TABLE.
    • Sampled Tables: A sampled version of another table. E.g., dim_all_users_sampled.
    • Staging Tables: Intermediate tables that should not be used by other pipelines.
    • Aggregation Tables: Tables containing data aggregated at a single entity level (e.g., user, page, group, etc.).
    • Cube Tables: Tables containing a union of aggregations. For instance, revenue at the zip code, state, and country levels.

Column Names

  • Names should document the contents of the field.
    • Column names are singular. For instance, user_id, not user_ids.
    • Include the unit of the field. For example, duration_sec, not duration. Append _utc to the names of fields containing timestamps specified in UTC.
    • Field names should not be SQL keywords or common functions. Avoid count and sum. In addition to not being descriptive, such terms can cause a valid query to become invalid after being formatted in Daiquery.
  • Column names should be lowercase letters and underscores.
    • Avoid escaping column names if possible. That is, no "duration_sec" (Presto) or `duration_sec` (SQL).

External Documentation

Dataswarm

  • Official Wiki
  • List of operators
    • Important arguments
  • Dynamic Partitions
    • This document explains how to use a single PrestoInsertOperator to modify multiple partitions. As of 2019-03-07, Hive partitions are read-only, so you may need to write explicit DELETE statement(s). It is not clear whether the PrestoInsertOperator handles this case correctly.
  • Official Best Practices
    • Things to do
    • Things NOT to do. Read this once.
    • Style Guide
    • Don't do computation in module scope.

Presto and Spark

  • Spark Documentation
  • Presto Documentation
    • More details on window functions. Written for Transact-SQL, but seems to be correct.

SQL

  • Sample queries
  • sqlstyle.guide
  • Data_Engineering_Best_Practices
  • HQL Style Guide