TL;DR: GNU Make and Apache Airflow are two DAG orchestration tools. Make gets the job done and is, without a doubt, simpler than Airflow. However, conforming to purely file-based dependency orchestration (Make) could require refactoring that Airflow would not. It’s worth learning how to use Airflow (or another abstract orchestration tool), even if you have no immediate need for most of its features.

Every quarter or so, I deliberately slow down to learn and, if I find it worthwhile, incorporate a new way of working. I keep this to perhaps two days to at most two weeks of sunk cost. In other words, I spend 10 to 20 percent of my coding time on strategic, rather than tactical, programming.

Perhaps it is no surprise that, within three months, the cost is comfortably repaid in productivity, reliability, quality, and/or sanity.

When I evaluate each new tool, I’ve been reminding myself to judge its utility by the degree to which it pushes out the efficiency frontier, rather than merely perturbing the interplay between interface and implementation, modularity and dependency, obscurity and information hiding. These and other design criteria have become quiet but persistent brain waves vibrating under the tactical notes of the dozens of small tasks executed each day.

I also weigh the cognitive load borne of all of these aspects, which inevitably hinders, or even prevents, the attainment of value.

Confronted with a sufficiently dependent set of processes involving enough data to forbid unnecessary repetition, I recently refactored a linear pipeline into a directed acyclic graph (DAG) and executed it with both an older and more recent technology: GNU Make and Apache Airflow. Make got the job done and is, without a doubt, simpler than Airflow. However, in my use case, Make required changes to existing code that Airflow did not, and so that cost ultimately led me to equip myself with a more powerful tool for posterity, even though I have no immediate need for most of its features.

Let’s describe a stylized problem, explain the techniques, and look at some very simple implementations.

Problem setting

We need to produce a half-dozen analyses. The requirements of some analyses imply a more-or-less unique set of data, which further implies a one-to-one ratio of SQL query to analytical code (Python, in my case) to end product (tables and graphs in a LaTeX report). Other analyses (let’s use the more generic term, processes) share data. Some processes consume the same (interim) output of another process. There are clean linear paths from data to interim data to result, as well as some tree-like paths. In all cases, the queries to obtain starting data are time-consuming, as are some of the Python processes.

Repetition would reduce productivity, so as we progress through a sequence of processes, we cache data upon the completion of each stage. Should a downstream error occur or be discovered, we should not have to rerun completed upstream tasks.

My first implementation involved caching files and including if-then logic in a master driver script. If a certain file exists, do not run the methods that would create it; start with the next stage of the pipeline. To limit the program to a subset of tasks, I added a Boolean to the “does the file exist” logic. Recall that some endpoints in my use case were independent of others, but to simplify the interface to the overall project, I would rather not split them off into their own pipelines.
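
To make that concrete, here is a minimal, hypothetical sketch of such a driver; the stage functions are placeholders, and the file names anticipate the Make example further below.

# driver.py: hypothetical sketch of the caching driver described above
from pathlib import Path


def load_data():
    # placeholder: run the time-consuming SQL query and cache the result
    Path('raw_data.csv').write_text('a,b\n1,2\n')


def clean():
    # placeholder: transform the raw extract into analysis-ready data
    Path('input_data.csv').write_text(Path('raw_data.csv').read_text())


def summarize():
    # placeholder: produce the end product from the cleaned data
    Path('stat_summary.csv').write_text(Path('input_data.csv').read_text())


def run(run_summary=True):
    # each stage runs only if its cached output does not already exist
    if not Path('raw_data.csv').exists():
        load_data()
    if not Path('input_data.csv').exists():
        clean()
    # a Boolean limits the run to a subset of endpoints
    if run_summary and not Path('stat_summary.csv').exists():
        summarize()


if __name__ == "__main__":
    run()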

As the project grew, repeating that logic started to feel like the wrong approach. What if, instead of looking from left to right, we start at the end and work backward? In a sense, that is what a DAG view is, and it is exactly how we compose Makefiles.

GNU Make

A Makefile defines a DAG, formalizing workflow steps in terms of input and output file dependencies, as opposed to abstract (code-configured) dependencies that are definable with other tools like Airflow. make resolves these dependencies and determines which commands need to be run and in what order. It can orchestrate the execution of code in virtually any language and so learning how to use it generalizes and lays a foundation for learning more recent technologies like Drake (Make for data), Pydoit (Python functions, but close to Make), Luigi (more explicit; object-oriented), Airflow and Prefect.

make is a command; when executed, it looks for a file called Makefile (no extension) in the current directory. You could instead have a file with a customized name followed by the extension .make, as long as you add -f to the make command; i.e., $ make -f special_file.make. It is more common to use Makefile, so you can just type make.

Here is the format of a Makefile:

target [target ...]: [prerequisite ...]
	[recipe]
	...
	...
  • Think of each rule as a cooking recipe: the target is the dish, the prerequisites are the ingredients, and the indented recipe lines are the instructions
  • target is the name of a file to create
  • recipe could be the python command to execute a script.
  • Indentation may look like four spaces, but each recipe line must begin with a tab.
  • If the file named by a target already exists, make will report that the output ‘is up to date.’ It checks only the file’s name and timestamp, not its contents against the intent expressed by the Makefile.
  • List the final task first; it becomes the default goal, and make will discover its dependencies and run prior tasks as needed.

Here is another symbolic example with a little more detail:

stat_summary.csv: input_data.csv
	python summarize.py

input_data.csv: raw_data.csv
	python clean.py

raw_data.csv:
	python load_data.py

housekeeping:
	rm -f tempfile.csv		

If necessary to fulfill the default goal of producing stat_summary.csv, executing make by itself at the command line will run the python load_data.py and python clean.py recipes before performing the python summarize.py recipe. If input_data.csv and its prerequisite already exist and are current, however, make will just invoke python summarize.py.

For this type of DAG to work, each module must consume its input file(s) and generate the file named as its target. Constrain function scope accordingly.
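
To illustrate that constraint with the symbolic example above, clean.py might look roughly like this (a sketch; the cleaning step itself is a placeholder):

# clean.py (sketch): consumes raw_data.csv, produces input_data.csv
import pandas as pd


def clean():
    df = pd.read_csv('raw_data.csv')          # the prerequisite named in the Makefile
    df = df.dropna()                           # placeholder transformation
    df.to_csv('input_data.csv', index=False)   # the target named in the Makefile


if __name__ == "__main__":
    clean()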

If we just wanted to perform the upstream tasks, running load_data.py and clean.py as needed to produce input_data.csv, without any downstream task(s), we would execute make input_data.csv.

Early in a project, it may be helpful to have the full task pipeline listed in the Makefile but to focus on testing a subset of dependencies. To accomplish that, configure the default goal at the top of the Makefile with .DEFAULT_GOAL := input_data.csv, where input_data.csv is the second target in this example. Calling make will then check for raw_data.csv, execute the python load_data.py recipe as necessary, and run python clean.py, ignoring the statistical summary recipe. This ability to easily iterate is one of the advantages of having file dependencies. It is also helpful to have intermediate data stored in files at various stages of transformation to find and resolve errors.

If your Makefile includes a recipe that produces no file any other target depends on, you have to explicitly tell make to run it: make housekeeping in the example above. In this mode, make is shorthand for Unix commands, or perhaps a substitute for $ python do_something.py, although the utility of such a recipe is questionable.

If a recipe does not produce a file, we say that recipe has a phony target (you can declare such targets explicitly with .PHONY). Since the orchestration capability of make rests on (interim) files, recipes with phony targets produce nothing for other recipes to depend on, and they run every time they are invoked.

There is more to GNU make, although the preceding likely covers what’s needed for data processing for statistical modeling. See this terse summary for more.

Cons of Make include the lack of scheduling. Also, your program may involve many tasks that, if conformed to this file-based approach, would imply a burdensome number of files (not to mention the code to write and read them).

Here is a working example you can replicate. Example 1: compute the average difference of two columns of randomly generated numbers (1a), and, as a variation, subtract an array from a second data file before averaging (1b). We’ll do 1b with Make and the simpler 1a with Airflow.

result.txt: data1.csv data2.csv
	python summarize.py

data1.csv:
	python create_data.py

data2.csv:
	python create_data2.py

# create_data.py
import numpy as np  
import pandas as pd  
  
def create_data():  
    df = pd.DataFrame({'a': np.random.random(10), 'b': np.random.random(10)})  
    df.to_csv('data1.csv')  
  
  
if __name__ == "__main__":  
    create_data()


# create_data2.py
import numpy as np  
import pandas as pd  
  
if __name__ == "__main__":  
    df = pd.DataFrame({'z': np.random.random(10)})  
    df.to_csv('data2.csv')


# summarize.py
import numpy as np
import pandas as pd
from pathlib import Path


def summarize():  
    df = pd.read_csv('data1.csv')  
    df.loc[:, 'difference'] = np.subtract(df.a, df.b)  
  
    df2 = pd.read_csv('data2.csv')  
  
    avg_diff = np.average(df.difference - df2.z)  
    Path('result.txt').write_text(str(avg_diff))
  
  
if __name__ == "__main__":  
    summarize()

In the following sequence, we run make from an empty state; it shows us which processes ran. Rerun make and nothing happens, other than a notification that the primary target is “up to date.” Since create_data2.py uses an un-seeded RNG, rerunning that module by itself changes a dependency (data2.csv). make recognizes the change in file metadata and reruns summarize.py, but not create_data.py, since the dependency data1.csv already exists unchanged. This is a minimally viable way to avoid unnecessarily repetitive processing, given that each process consumes and produces one or more files.

% make
python create_data.py
python create_data2.py
python summarize.py
% make
make: `result.txt' is up to date.
% python create_data2.py
% make
python summarize.py

Airflow

Airflow addresses the same core concern as Make, and many more secondary concerns, accumulated from a variety of use cases, of which building machine learning models is probably only a small fraction. Implications of using Airflow include:

  • Airflow needs to be installed and set up; this is not difficult for a local installation, but you need to be aware of airflow.cfg and other configuration.
  • The DAG script, written in Python, includes a lot of (default) arguments
  • Among other preliminary tasks, a DAG needs to be registered in a backend database before it can execute its data processing tasks.
  • Executing a DAG (backfill) at the command line logs messages and auto-generated metadata like run_id and run_duration that could aid in auditing, but obtaining signal from noise will require effort.
  • The web browser monitoring dashboard, including the Graph view, is certainly appealing, but it is another tool to learn

Spend time with Airflow, and it’s easy to see why GNU Make could be a better choice for localized processing that is complex enough to warrant a DAG orchestration, but not in need of the execution logging and visualization that Airflow provides.

Other things to be aware of:

  • The documentation provides illustrative examples in its own way; sifting through it takes time.
  • Airflow does not explicitly rely on file dependencies.
  • Includes two APIs:
    1. Explicitly instantiate ‘operators’, like PythonOperator, BashOperator, EmailOperator, SQLExecuteQueryOperator, OracleOperator
    2. TaskFlow – more abstract interface and, therefore, concise

Thinking in Airflow jargon

An operator defines a unit of work for Airflow to complete. Instantiating operators is the classic approach to defining work in Airflow. For some use cases, it’s better to use the TaskFlow API. The choice of API will determine how you define dependencies.
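
The classic PythonOperator version of Example 1a appears below. For contrast, here is a hedged sketch of how the same two tasks might be expressed with TaskFlow; it is not tested against a particular Airflow version, and it simply wraps the same create_data and summarize callables used in the operator example.

# rng_summarizer_taskflow.py: hypothetical TaskFlow sketch of the same DAG
from datetime import datetime

from airflow.decorators import dag, task

from create_data import create_data
from summarize import summarize


@dag(
    dag_id="rng_summarizer_taskflow",
    schedule=None,                     # run on demand rather than on a schedule
    start_date=datetime(2024, 1, 30),
    catchup=False,
    tags=["aaron"],
)
def rng_summarizer_taskflow():
    @task
    def generate_random_numbers():
        create_data()

    @task
    def take_difference_and_average():
        summarize()

    # dependencies are declared by chaining the task calls
    generate_random_numbers() >> take_difference_and_average()


rng_summarizer_taskflow()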

Let’s orchestrate Example 1a (average difference of random arrays) with Airflow’s PythonOperator.

# rng_summarizer.py
from datetime import datetime, timedelta  
from airflow.models.dag import DAG  
from airflow.operators.python import PythonOperator  
from create_data import create_data  
from summarize import summarize

with DAG(  
    dag_id="rng_summarizer",  
    default_args={
        "depends_on_past": False,  
        "email": ["aaron.slowey@gmail.com"],  
        "email_on_failure": False,  
        "email_on_retry": False,  
        "retries": 1,  
        "retry_delay": timedelta(minutes=5),  
    },  
    description="Aaron trying to learn Airflow",  
    schedule=timedelta(days=1),  
    start_date=datetime(2024, 1, 30),  
    catchup=False,  
    tags=["aaron"],  
) as dag:  
        task_id="generate_random_numbers",  
        python_callable=create_data,  
    )  
    t2 = PythonOperator(  
        task_id="take_difference_and_average",  
        python_callable=summarize,  
    )  
    t1.set_downstream(t2)

Note that the Python module name matches the dag_id. I don’t know if this is required, but there is no apparent reason for them to differ, as the module defines the DAG and nothing else. The python_callable can be imported into the DAG definition module, or the function can reside in that module; either way, it is assigned without parentheses.

Check for syntax errors; at the Terminal:

python ~/airflow/dags/rng_summarizer.py

If no exceptions occur, we can rule out some problems, but we do not yet know if Airflow can complete the tasks. Running the script does parse the DAG definition in that Python process, but Airflow’s scheduler must still parse the file itself and register the DAG before it can run. Run the following sequence at the Terminal:

# initialize or upgrade the metadata database schema
airflow db migrate

# Verify by printing the list of active DAGs
airflow dags list

# Optional: print the hierarchy of tasks in the DAG
airflow tasks list rng_summarizer --tree
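
Another way to confirm that Airflow can find and parse the DAG is to load the dags folder into a DagBag from a Python session (a sketch; adjust the path to your dags_folder):

# check_parsing.py (sketch)
import os

from airflow.models import DagBag

dag_bag = DagBag(dag_folder=os.path.expanduser('~/airflow/dags'),
                 include_examples=False)
print(dag_bag.import_errors)  # an empty dict means every DAG file parsed cleanly
print(dag_bag.dag_ids)        # should include 'rng_summarizer'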

You can test a task with airflow tasks test; this ignores dependencies and does not communicate state to the database. airflow dags test, by contrast, considers dependencies, but it likewise does not communicate state to the database.
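
For example, using the dag_id and task_id defined above (the final argument is the logical/execution date; exact options can vary slightly across Airflow versions):

airflow tasks test rng_summarizer generate_random_numbers 2024-01-30
airflow dags test rng_summarizer 2024-01-30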

You can launch the various services one by one, such as airflow webserver, which serves the browser-based UI, or all together with airflow standalone (at the Terminal). Log in to localhost:8080 with the username and password provided in stdout upon running standalone. Experience suggests that you do not need to run webserver or standalone to go to the local URL, log in, and see a recently completed DAG, provided that you ran airflow db migrate and backfill.

backfill is the Airflow command that (conditionally) executes the tasks comprising a DAG (the error below is included for learning purposes):

$ airflow dags backfill rng_summarizer --start_date 2024-01-30
Dag 'rng_summarizer' could not be found; either it does not exist or it failed to parse.

The stdout of a successful run will be verbose with logged metadata, which is potentially useful for tracking your work.