Elegant processing pipelines with no frameworks: just uv, Zarr, and Python
Large-scale analysis pipelines are often complex, brittle, and fragmented across multiple tools. They grow organically - one more dependency here, a helper script there, a bash script to submit the job, a YAML configuration to glue it all together - and suddenly you have a system that no one understands, not even future you.
In practice, if you are not making a software package for reuse by others, most of that complexity adds little scientific or engineering value. It only makes the code harder to reason about, harder to reproduce, and harder to maintain.
The cost of that complexity is enormous:
- Technical debt accumulates invisibly. Each new “framework” or “helper utility” promises order but instead creates another layer to maintain.
- Dependency drift means that six months later, half the code no longer runs because upstream libraries changed APIs or dropped support for your Python version. (In scientific programming, most data analysis code will never be run by others, so the environment abstractions that frameworks provide rarely pay off.)
- Cognitive overhead explodes: to resume work or fix a bug, you first have to remember which workflow manager or config file actually controls execution. Where is that written down? Add one more file to glue it all together. Oops.
- Human frustration follows: you open your old repository and think, "who was the idiot who wrote this mess?" - before realizing it was you.
For most exploratory or iterative scientific work, the real goal is not to build software infrastructure; it is to run analyses quickly, clearly, and reproducibly. And for that, simplicity is not a luxury — it is an essential design principle.
This post describes a pattern for scalable, concurrent-safe processing that is:
- single-file: everything lives in one portable script
- robust: no race conditions, no overwrites, no half-written files
- scalable: trivially parallelizable across HPC jobs
- lightweight: no external workflow managers or container layers
- and above all, beautifully simple
1. Inline dependencies with uv
The traditional advice is to maintain a requirements.txt, a pyproject.toml, or a Docker container with pinned dependencies.
This is reproducible, but cumbersome — and most importantly, it breaks the principle of locality: the code and its dependencies are not in the same place.
uv fixes this.
Dependencies can be defined directly at the top of the script:
#!/usr/bin/env -S uv run --script
# /// script
# dependencies = ["numpy", "pandas", "zarr", "fasteners"]
# ///
That’s it. Now the script can be executed anywhere with:
uv run --script process_images.py
or if made executable with:
chmod +x process_images.py # run once only
./process_images.py
No virtual environments to activate, no requirements.txt, no container rebuilds. The script is self-contained, portable, and reproducible by default.
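If you want tighter control for long-lived scripts, the same inline block (standard PEP 723 metadata, which uv reads) also accepts a Python requirement and version constraints. A minimal sketch; the pins below are illustrative, not the ones used in this post:

# /// script
# requires-python = ">=3.11"
# dependencies = [
#     "numpy>=2,<3",
#     "pandas>=2.2",
#     "zarr>=2.18",
#     "fasteners>=0.19",
# ]
# ///

uv resolves and caches this environment on first run, so subsequent runs start almost instantly.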
2. Zarr: concurrency-friendly storage
Many pipelines still rely on writing large .h5ad, .npy, or .csv outputs, serialized and rewritten each time. That design collapses once you distribute work across multiple jobs — it’s fragile, slow, and error-prone.
Zarr solves this elegantly.
It stores data as chunked arrays in directories, meaning:
- jobs can write different parts of the dataset concurrently;
- partial results are immediately available;
- failures don’t corrupt the entire dataset;
- and the format is portable across languages and environments.
In other words, Zarr lets you treat disk like a database for large-scale numeric data — perfect for incremental, parallel feature extraction.
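For example, a feature store can be created and extended incrementally. A minimal sketch, mirroring the appendix at the end of this post (the store path, array name, and chunk size are illustrative):

import numpy as np
import zarr

# Open (or create) a store backed by a plain directory on disk
root = zarr.open_group("processed/features.zarr", mode="a")

# Each array's chunks live in separate files, so concurrent workers
# writing different arrays touch disjoint parts of the directory tree
feats = np.random.rand(4096).astype("float32")
if "sample_0001" not in root:
    root.create_dataset("sample_0001", data=feats, chunks=(512,), overwrite=False)
    root["sample_0001"].attrs["filename"] = "sample_0001.tif"

# Reading back later only loads the chunks you actually index
subset = root["sample_0001"][:1024]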
3. Randomized order and lock files
Traditional HPC workflows try to control parallelism explicitly using SLURM array indices or workflow DAGs. These approaches create coordination overhead: each job must know which subset to process and where to write.
Instead, randomization and lightweight locks are enough:
np.random.shuffle(files)
np.random.shuffle(model_names)
Each worker processes items in random order, minimizing collisions. When a write must be exclusive (e.g., creating a new Zarr group), a simple interprocess lock ensures safety:
import fasteners

with fasteners.InterProcessLock(lock_file):
    # write metadata or create the array
    ...
No central controller, no global scheduler, no external state. Just Python files operating independently, safely, and efficiently.
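Putting the two ideas together, every worker runs the same loop with no coordination at all. A minimal sketch of that worker loop, where process() is a placeholder for the real feature extraction and the paths are illustrative:

from pathlib import Path

import fasteners
import numpy as np
import zarr


def process(path: Path) -> np.ndarray:
    # Placeholder for the real feature extraction; the heavy work happens here, outside any lock
    return np.random.rand(4096).astype("float32")


Path("locks").mkdir(exist_ok=True)
files = sorted(Path("data").glob("*.tif"))
np.random.shuffle(files)  # each worker walks the files in its own random order

root = zarr.open_group("processed/images.zarr", mode="a")

for f in files:
    name = f.stem
    if name in root:  # another worker already finished this one
        continue
    feats = process(f)
    with fasteners.InterProcessLock(f"locks/{name}.lock"):
        if name not in root:  # re-check under the lock before writing
            root.create_dataset(name, data=feats, chunks=(512,), overwrite=False)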
4. Cluster submission: one clean command
People often rely on workflow managers (Snakemake, Nextflow, Cromwell) for job orchestration. They work well for static pipelines, but for exploratory computational research, they introduce overhead — complex configuration files, extra abstraction layers, and a steep maintenance cost.
Here, we use a single command embedded inside the Python script docstring, so the whole pipeline lives in one place:
"""
Run on the cluster with:
echo 'for ID in {00..79}; do
sbatch --job-name histo.$ID \
--qos tinyq -p tinyq --time 02:00:00 -c 8 --mem 48G \
--output logs/processing/tinyq.$ID.log \
--wrap "uv run --frozen --no-sync python process_images.py"
done' | at now + 2 hours
"""
This approach avoids flooding the queue with hundreds of jobs at once. The scheduler stays healthy, you stay in control, and there is still zero Bash boilerplate beyond this one snippet.
5. Why this works better
| Aspect | Conventional approach | This pattern |
|---|---|---|
| Dependency management | requirements.txt, Conda, or containers | Inline dependencies via uv |
| Workflow control | Snakemake / Nextflow / Bash scripts | Self-contained randomized for-loops |
| Data output | Multiple serialized files (.csv, .h5) | Concurrent-safe Zarr store |
| Parallelization | Manually configured job arrays with explicit indices | Randomized distribution across workers |
| Safety | Risk of overwrite or corruption | Interprocess locks |
| Complexity | Multiple moving parts | One script |
| Reproducibility | Environment files and containers | Single versioned script |
This design does not compete with heavy frameworks - it replaces them when you need something simpler, faster, and more reliable.
6. A philosophy of simplicity
Modern research software often drifts toward overengineering. But scaling should not require frameworks within frameworks.
There is beauty in solving large-scale problems with the smallest, most transparent system possible. This pattern embodies that principle: locality, minimalism, reproducibility.
No workflow managers. No bash gymnastics. Just Python, done right.
7. Code appendix: minimal runnable example
Below is a minimal, self-contained implementation using uv, zarr, and fasteners. It demonstrates randomized processing, concurrency-safe writes, and a single embedded job submission snippet.
#!/usr/bin/env -S uv run --script
# /// script
# dependencies = ["numpy", "pandas", "zarr", "fasteners"]
# ///
"""
Minimal example for distributed image processing with uv + Zarr.
Submit a wave of jobs (one-time execution) on the cluster:
echo 'for ID in {00..79}; do
sbatch --job-name histo.$ID --qos tinyq -p tinyq --time 02:00:00 -c 8 --mem 48G \
--output logs/processing/tinyq.$ID.log \
--wrap "uv run --frozen --no-sync python process_images.py"
done' | at now + 2 hours
"""
from pathlib import Path
import numpy as np
import pandas as pd
import zarr
import fasteners
data_dir = Path("data")
data_dir.mkdir(exist_ok=True)
store_dir = Path("processed") / "images.zarr"
store_dir.parent.mkdir(parents=True, exist_ok=True)
locks_dir = Path("locks")
locks_dir.mkdir(exist_ok=True)
# pretend input files
files = sorted(data_dir.glob("*.tif"))
np.random.shuffle(files)
# open or create a Zarr store
root = zarr.open_group(store_dir, mode="a")
def main() -> None:
for f in files:
extract(f)
def extract(file: Path) -> None:
    """Process one file and append to Zarr in a concurrency-safe way."""
    arr_name = file.stem
    lock = fasteners.InterProcessLock(locks_dir / f"{arr_name}.lock")
    if arr_name in root:
        print(f"Skip {file.name}: already processed")  # this will be written to the log by SLURM
        return
    # Simulate some heavy processing (done outside the lock)
    feats = np.random.rand(4096).astype("float32")
    meta = {"filename": file.name}
    with lock:
        # Re-check under the lock: another worker may have written this array
        # between the early check above and acquiring the lock
        if arr_name in root:
            return
        root.create_dataset(arr_name, data=feats, chunks=(512,), overwrite=False)
        root[arr_name].attrs.update(meta)
    print(f"Processed {file.name}")  # this will be written to the log by SLURM
if __name__ == "__main__":
main()
Notes:
- uv run --script resolves the inline dependencies on each node, so every job runs in a consistent, self-contained environment.
- Locks are only used for short metadata operations — heavy computation happens outside.
- Zarr chunking lets you scale I/O efficiently and resume interrupted jobs.
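As a final sanity check, partial results can be inspected while jobs are still running. A minimal sketch of reading the store back into a pandas DataFrame (the column names are illustrative):

import pandas as pd
import zarr

root = zarr.open_group("processed/images.zarr", mode="r")

# One row per processed file: the stored filename plus its feature vector
rows = []
for name, arr in root.arrays():
    rows.append({"name": name, "filename": arr.attrs.get("filename"), "features": arr[:]})

df = pd.DataFrame(rows)
print(df.head())

Because every array is independent, this works even while other workers are still appending new results to the store.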
8. Final thoughts
This pattern has served me remarkably well for various pipelines that need to run at high throughput (tens of thousands of tasks, each lasting hours). It demonstrates that robust, distributed computation does not require complex infrastructure — only careful design.
In many cases, the best workflow manager is none at all.