A library for making stability analysis simple, following the veridical data-science framework.
Why use vflow?
Using vflow's simple wrappers enables many best practices for data science and makes writing pipelines easy.
| Stability | Computation | Reproducibility |
|---|---|---|
| Replace a single function (e.g. preprocessing) with a set of functions and easily assess the stability of downstream results | Automatic parallelization and caching throughout the pipeline | Automatic experiment tracking and saving |
Here we show a simple example of an entire data-science pipeline with several perturbations (e.g. different data subsamples, models, and metrics), written using vflow.
```python
import sklearn.datasets
import sklearn.linear_model
import sklearn.model_selection
import sklearn.tree
import sklearn.utils
from sklearn.metrics import accuracy_score, balanced_accuracy_score

from vflow import init_args, Vset

# initialize data
X, y = sklearn.datasets.make_classification()
X_train, X_test, y_train, y_test = init_args(
    sklearn.model_selection.train_test_split(X, y),
    names=['X_train', 'X_test', 'y_train', 'y_test']  # optionally name the args
)

# subsample data
subsampling_funcs = [sklearn.utils.resample for _ in range(3)]
subsampling_set = Vset(name='subsampling',
                       modules=subsampling_funcs,
                       output_matching=True)
X_trains, y_trains = subsampling_set(X_train, y_train)

# fit models
models = [
    sklearn.linear_model.LogisticRegression(),
    sklearn.tree.DecisionTreeClassifier(),
]
modeling_set = Vset(name='modeling',
                    modules=models,
                    module_keys=['LR', 'DT'])
modeling_set.fit(X_trains, y_trains)
preds_test = modeling_set.predict(X_test)

# get metrics
binary_metrics_set = Vset(name='binary_metrics',
                          modules=[accuracy_score, balanced_accuracy_score],
                          module_keys=['Acc', 'Bal_Acc'])
binary_metrics = binary_metrics_set.evaluate(preds_test, y_test)
```
Once we've written this pipeline, we can easily measure the stability of metrics (e.g. "Accuracy") to our choice of subsampling or model.
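To make this concrete, here is a minimal sketch in plain Python of what "measuring stability" means: group a metric's values by model across subsamples and compare their spread. Note this uses an illustrative dictionary layout and hand-picked numbers, not vflow's actual output format or helpers (such as `dict_to_df` or `perturbation_stats`).

```python
from statistics import mean, stdev

# Illustrative results: accuracy for each (model, subsample) perturbation.
# This key/value layout is hypothetical, not vflow's actual output format.
binary_metrics = {
    ('LR', 'subsampling_0'): 0.90,
    ('LR', 'subsampling_1'): 0.88,
    ('LR', 'subsampling_2'): 0.92,
    ('DT', 'subsampling_0'): 0.84,
    ('DT', 'subsampling_1'): 0.78,
    ('DT', 'subsampling_2'): 0.90,
}

def stability_summary(results):
    """Group metric values by model; report mean and std across subsamples."""
    by_model = {}
    for (model, _subsample), value in results.items():
        by_model.setdefault(model, []).append(value)
    return {m: (mean(vs), stdev(vs)) for m, vs in by_model.items()}

summary = stability_summary(binary_metrics)
# LR and DT have the same mean accuracy here, but LR's smaller std across
# subsamples indicates it is the more stable choice.
```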
Documentation
See the docs for reference on the API.
Notebook examples (note that some of these require more dependencies than just those required for vflow; to install all, use the notebooks dependencies in the setup.py file).
Installation
Install with `pip install vflow` (see here for help). For the dev version (unstable), clone the repo and run `python setup.py develop` from the repo directory.
References
- interface: easily build on scikit-learn and dvc (data version control)
- computation: integration with ray and caching with joblib
- tracking: mlflow
- pull requests very welcome! (see contributing.md)
Source code:

```python
"""
.. include:: ../readme.md
"""
from .vfunc import *
from .vset import *
from .pipeline import *
from .convert import init_args, dict_to_df, compute_interval, perturbation_stats
from .helpers import *
```
Sub-modules
- `vflow.convert`: Useful functions for converting between different types (dicts, lists, tuples, etc.)
- `vflow.helpers`: User-facing helper functions included at `import vflow`
- `vflow.pipeline`: Class that stores the entire pipeline of steps in a data-science workflow
- `vflow.subkey`
- `vflow.vfunc`: A perturbation that can be used as a step in a pipeline
- `vflow.vset`: Set of modules to be parallelized over in a pipeline. Function arguments are each a list
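Conceptually, a set of modules applied over lists of arguments amounts to a Cartesian-product map: every module runs on every combination of inputs, and each result is tagged with where it came from. The sketch below is a simplification in plain Python for intuition only, not vflow's actual implementation (real Vsets also track keys, support output matching, and cache results).

```python
from itertools import product

def apply_vset(modules, *arg_lists):
    """Apply each named module to every combination of the argument lists,
    keying each result by (module name, input-combination index)."""
    return {
        (name, idx): func(*args)
        for name, func in modules.items()
        for idx, args in enumerate(product(*arg_lists))
    }

# two "modules" (perturbations of one pipeline step) over two candidate inputs
results = apply_vset(
    {'add1': lambda x: x + 1, 'times2': lambda x: x * 2},
    [10, 20],
)
# results == {('add1', 0): 11, ('add1', 1): 21,
#             ('times2', 0): 20, ('times2', 1): 40}
```

Keeping the provenance of each result in its key is what later makes it possible to group outcomes by perturbation and assess stability.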