Follow your inner conductor: A comparison of file versus abstract dependency-based orchestration

TL;DR: GNU Make and Apache Airflow are two DAG orchestration tools. Make gets the job done and is, without a doubt, simpler than Airflow. However, conforming to purely file-dependency-based orchestration (Make) could require refactoring that Airflow would not. It’s worth learning how to use Airflow (or another abstract orchestration tool), even if you have no immediate need for most of its features. Every quarter or so, I deliberately slow down to learn and, if I find it worthwhile, incorporate a new way of working....

February 3, 2024 · Aaron Slowey

Histograms? SQL has you covered

TL;DR: Binning data with SQL to plot histograms may seem an odd alternative to using a DataFrame method such as df.hist() in Pandas, but it can be done, and delegating the task to your data warehouse can save space and time when data sets are large and continually updating. I extend other tutorials with a bin scalar that adapts to multiple groups within a table. SQL frequency distributions: Some, perhaps a majority, of data scientists are more comfortable programming in Python, and particularly pandas, than SQL....
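As a rough illustration of binning in SQL, here is a minimal sketch that uses Python's built-in sqlite3 module in place of a data warehouse. The table name `measurements`, its columns, and the fixed `BIN_WIDTH` are hypothetical, and the sketch uses one bin width for all groups; it does not reproduce the post's adaptive bin scalar.

```python
import sqlite3

# In-memory SQLite database standing in for a data warehouse (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (grp TEXT, value REAL)")
rows = [("a", 1.0), ("a", 2.5), ("a", 7.0), ("b", 12.0), ("b", 14.0), ("b", 31.0)]
conn.executemany("INSERT INTO measurements VALUES (?, ?)", rows)

BIN_WIDTH = 5.0  # hypothetical fixed width shared by every group

# Truncating value / width to an integer gives the bin number for
# non-negative values; multiplying back gives the bin's lower edge.
query = """
SELECT grp,
       CAST(value / :w AS INTEGER) * :w AS bin_start,
       COUNT(*) AS n
FROM measurements
GROUP BY grp, bin_start
ORDER BY grp, bin_start
"""
histogram = conn.execute(query, {"w": BIN_WIDTH}).fetchall()
for grp, bin_start, n in histogram:
    print(grp, bin_start, n)
```

Only the per-bin counts leave the database, which is where the space and time savings for large tables would come from.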

March 5, 2023 · Aaron Slowey

Explainable insights from sequence regression

Business considerations: Linear models are among the most explainable, and yet producing insights salient to business problems is not trivial. Adopt a simple formulation, and a direct interpretation of parameters will require multiple footnotes to bridge the gap between what is meaningful in terms of learned associations and what can confirm or alter a manager’s point of view and strategy. Adopt a more complex formulation and, well, you must have amazing infrastructure....

January 8, 2023 · Aaron Slowey

Handle nonsensical operations to avoid downstream errors

When attempting to log-transform an array of values with NumPy, keep in mind: Given negative numbers and zeroes, NumPy will output NaN and -inf, respectively, along with a RuntimeWarning. Such values can cause downstream processing to fail or behave unexpectedly. numpy.log provides an argument to handle this situation. How that argument affects numpy.log’s behavior depends on whether the output goes to a preexisting container or if that container is created on the fly....
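The excerpt does not name the argument; a minimal sketch, assuming it is the `where` parameter that numpy.log shares with other ufuncs, shows how the result depends on the output container. The array `values` and the NaN fill are illustrative choices.

```python
import numpy as np

values = np.array([-1.0, 0.0, 1.0, 10.0])
mask = values > 0  # positions where the logarithm is defined

# Case 1: preexisting container. Positions where mask is False are left
# untouched, so they keep the fill value chosen here (NaN).
out = np.full_like(values, np.nan)
np.log(values, out=out, where=mask)

# Case 2: container created on the fly. Positions where mask is False are
# uninitialized memory, so only the masked-in entries are safe to read.
risky = np.log(values, where=mask)
safe_part = risky[mask]
```

Pre-filling `out` makes the placeholder for invalid inputs explicit instead of leaving it to whatever happens to be in memory.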

January 5, 2023 · Aaron Slowey

Slice well

In this post, I briefly review a few methods to select rows and/or columns of a DataFrame that satisfy one or more criteria. I then introduce two additional requirements that arise frequently in practice: slicing with previously unknown criteria and managing serialization and deserialization to recover the desired data structure. Lever multiIndexes: I often find pandas’ multiIndex to be helpful, although I do not observe it used very often. With a multi-indexed DataFrame, pandas’ ....
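For readers unfamiliar with multi-indexed DataFrames, here is a minimal sketch of three common ways to slice one. The `region`/`year` levels and the `sales` column are hypothetical, and these are not necessarily the methods the post goes on to describe.

```python
import pandas as pd

# Hypothetical two-level index: region x year.
index = pd.MultiIndex.from_product(
    [["east", "west"], [2021, 2022]], names=["region", "year"]
)
df = pd.DataFrame({"sales": [10, 20, 30, 40]}, index=index)

# Select on the outer level; the region level is dropped from the result.
west = df.loc["west"]

# Cross-section on an inner level by name; the year level is dropped.
y2022 = df.xs(2022, level="year")

# IndexSlice keeps both levels while filtering on the inner one
# (this form requires a lexsorted index, which from_product gives here).
idx = pd.IndexSlice
subset = df.loc[idx[:, 2022], :]
```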

November 15, 2022 · Aaron Slowey

That which is aggregated and its metadata

It’s impossible to include an associated field value alongside an aggregate of another variable. Unlike ndarrays, DataFrames are often heterogeneous. They are a more complete map of how we think of a data set as a whole. When we alter the structure of tabular data, often through aggregation of one field, we want to include values from other fields. This is an example of an issue that arises at the interface of pandas and scikit-learn, for which the ColumnTransformer was created....

August 12, 2022 · Aaron Slowey

Aggregation: Implications of indexing

While there are multiple syntaxes and methods to produce the same aggregated data, those variations produce different indices. The format and contents of the index can impact other processes, such as serialization and deserialization. Consider the following artificial transactional data.
txns = pd.concat([pd.DataFrame({'dt': pd.date_range("2022", freq="D", periods=10), 'amount': np.random.random(10), 'segment': ['ex'] * 10})] * 10, axis=0)
dt amount segment
(Timestamp('2022-01-01 00:00:00'), 0) 2022-01-01 00:00:00 0.992821 ex
(Timestamp('2022-01-01 00:00:00'), 0) 2022-01-01 00:00:00 0....
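To make the point about indices concrete, here is a minimal sketch that aggregates data shaped like `txns` in two ways. The seeded generator and the choice of `sum` are assumptions for illustration, not necessarily the aggregations used in the post.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable
txns = pd.concat(
    [pd.DataFrame({"dt": pd.date_range("2022", freq="D", periods=10),
                   "amount": rng.random(10),
                   "segment": ["ex"] * 10})] * 10,
    axis=0,
)

# Default groupby: the grouping keys become a MultiIndex on a Series.
by_index = txns.groupby(["segment", "dt"])["amount"].sum()

# as_index=False: the keys stay as ordinary columns of a DataFrame.
flat = txns.groupby(["segment", "dt"], as_index=False)["amount"].sum()

print(type(by_index.index).__name__, list(by_index.index.names))
print(type(flat.index).__name__, list(flat.columns))
```

Both results hold the same totals, but a round trip through a format such as CSV treats them differently: the keys in `by_index` must be restored as index levels on reading, while `flat` reads back as plain columns.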

July 22, 2022 · Aaron Slowey