Migrating from a Single Cloud Composer Instance to a Multi-Instance Architecture: Lessons from a Production Cutover · Ismail Ait Bahammou — Portfolio

Why a single Composer instance eventually runs out of road

Cloud Composer is Google's managed Airflow, and for a long time a single environment is genuinely the right answer: one scheduler, one set of workers, one Airflow metadata database, one place to look when something breaks. The problems don't show up because the architecture is wrong on day one — they show up gradually, as a side effect of success. More teams onboard DAGs. More pipelines get scheduled at the same time. More operators start hitting the same external APIs and databases. And at some point, a single environment stops being a convenience and starts being a shared bottleneck that everyone is quietly afraid to touch.

There are three symptoms that typically show up first, and they're worth naming because they're the actual justification for a migration, not "best practice" in the abstract:

Scheduler contention. As DAG count grows into the hundreds, the scheduler's parse loop takes longer, and DAGs that used to trigger within seconds of their scheduled time start drifting by minutes. This is invisible until someone downstream depends on a DAG finishing by a specific SLA, and it quietly stops meeting it.

Before assuming this is the cause, it's worth actually measuring DAG file parse time rather than guessing. Airflow exposes this directly:

# List DAG files that currently fail to import (these are never scheduled at all)
airflow dags list-import-errors

# Show per-file parse duration, plus DAG and task counts, from a fresh DagBag load
airflow dags report

On Composer specifically, the scheduler heartbeat and parse latency are also visible through Cloud Monitoring, which is usually faster than digging through logs:

gcloud monitoring time-series list \
  --project=YOUR_PROJECT_ID \
  --filter='metric.type="composer.googleapis.com/environment/dag_processing/total_parse_time"' \
  --interval-end-time="$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --interval-start-time="$(date -u -d '-6 hours' +%Y-%m-%dT%H:%M:%SZ)"

If total_parse_time trends upward in lockstep with DAG count, that's the confirmation to stop guessing and start planning the split — not a symptom that will resolve itself with a worker resize.
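One way to make "trends upward in lockstep" concrete is to correlate DAG count with parse time over several observation windows. The following is a minimal sketch, assuming the paired samples have already been exported from monitoring; the function names, the sample values, and the 0.8 threshold are illustrative assumptions, not Composer defaults.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length samples; 0.0 if either is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def parse_time_tracks_dag_count(dag_counts, parse_times_s, threshold=0.8):
    """Return True when parse time rises roughly in step with DAG count."""
    if len(dag_counts) != len(parse_times_s) or len(dag_counts) < 3:
        raise ValueError("need at least three paired samples")
    return pearson(dag_counts, parse_times_s) >= threshold


# Hypothetical weekly samples: DAG count and total parse time in seconds
weekly_dag_counts = [120, 180, 240, 310, 400]
weekly_parse_times = [14.0, 22.5, 31.0, 44.0, 61.0]
print(parse_time_tracks_dag_count(weekly_dag_counts, weekly_parse_times))
```

A high correlation is only a prompt to plan the split, not proof of causation; a single heavy DAG file can produce the same curve.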

Blast radius. A single misbehaving DAG — an infinite retry loop, a runaway query, a bad dependency — can consume enough worker slots or database connections to degrade every other pipeline in the environment. In a single-instance world, one team's mistake becomes everyone's incident.

Metadata database load. Every task instance, every XCom, every log pointer lives in the same Airflow metadata database. High DAG concurrency means that database becomes a shared resource under contention, and slow queries against it slow down scheduling for every pipeline, not just the one causing the load.

A quick way to confirm this before blaming it blindly is to query the metadata database directly for task instance volume and XCom bloat, both of which are common silent contributors:

-- Task instances created in the last 24 hours, grouped by DAG
-- (run against the Airflow metadata DB, not BigQuery)
SELECT dag_id, COUNT(*) AS task_instance_count
FROM task_instance
WHERE start_date >= NOW() - INTERVAL '24 hours'
GROUP BY dag_id
ORDER BY task_instance_count DESC
LIMIT 20;

-- XCom rows are a common, overlooked source of metadata DB bloat
-- when DAGs pass large payloads instead of references
SELECT dag_id, COUNT(*) AS xcom_row_count,
       pg_size_pretty(SUM(pg_column_size(value))) AS approx_size
FROM xcom
GROUP BY dag_id
ORDER BY xcom_row_count DESC
LIMIT 20;

If a handful of DAGs dominate both queries, that's usually enough evidence to justify isolating them into their own environment rather than trying to tune the shared database further.
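To turn "a handful of DAGs dominate both queries" into a repeatable check, the two result sets can be intersected. This is a small sketch, assuming each query result has been loaded into a dict of dag_id to row count; the function name and the DAG IDs are hypothetical.

```python
def dominant_dags(task_counts, xcom_counts, top_n=5):
    """Return DAG IDs that rank in the top_n of both result sets, sorted by name."""

    def top(counts):
        ranked = sorted(counts, key=counts.get, reverse=True)
        return set(ranked[:top_n])

    return sorted(top(task_counts) & top(xcom_counts))


# Hypothetical results from the two queries above
task_counts = {"ingest_events": 48000, "daily_aggregation": 900, "ml_features": 31000}
xcom_counts = {"ingest_events": 120000, "ml_features": 200, "export_report": 95000}
print(dominant_dags(task_counts, xcom_counts, top_n=2))
```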

None of these are dramatic on their own. They're the kind of thing that shows up as "pipelines have been a little flaky lately" for months before anyone connects it to a root cause. That slow accumulation is exactly why the migration decision tends to arrive later than it should.

Choosing a split strategy before touching infrastructure

The instinct once you've decided to move to multiple Composer environments is to jump straight to Terraform and provisioning. The far more important decision — and the one that determines whether the migration actually solves the problem — is how you split the workload across environments. There are a few common strategies, and they trade off differently:

By criticality. Separate environments for SLA-critical pipelines versus best-effort or exploratory ones. This directly solves the blast-radius problem: a broken ad-hoc DAG in a dev-tier environment can never take down a revenue-critical pipeline.

By team or domain. Each owning team gets its own environment. This is the easiest to reason about organizationally and gives teams full autonomy over their own dependency versions and scheduling, but it can lead to environment sprawl and duplicated infrastructure cost if not managed carefully.

By resource profile. Group DAGs by their resource characteristics — CPU/memory-heavy transformation jobs in one environment, lightweight orchestration/trigger DAGs in another. This is the least intuitive split organizationally but often gives the best actual performance improvement, since it lets you size worker pools specifically for the workload they run.

In practice, the strongest migrations combine the first two: split by criticality first, then by team within each criticality tier. Purely technical splits (by resource profile alone) tend to create ownership confusion — nobody is quite sure which environment a new DAG belongs in — which undermines the whole point of the migration.
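The combined rule (criticality first, then team) is easiest to keep honest when it is written down as code that anyone can run against a new DAG. A minimal sketch, assuming a hypothetical naming scheme of composer-TIER-TEAM and a hypothetical ownership record; neither comes from Composer itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DagOwnership:
    """Hypothetical inventory record for one DAG."""
    dag_id: str
    team: str
    sla_critical: bool


def target_environment(dag: DagOwnership) -> str:
    """Pick the environment by criticality tier first, then by owning team."""
    tier = "critical" if dag.sla_critical else "besteffort"
    return f"composer-{tier}-{dag.team}"


print(target_environment(DagOwnership("daily_aggregation", "finance", True)))
print(target_environment(DagOwnership("adhoc_backfill", "finance", False)))
```

Because the function is deterministic, "which environment does this DAG belong in" has one answer, which is the ownership clarity the resource-profile split tends to lose.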

Planning the cutover: the part that actually determines success

The infrastructure provisioning for a second (or third) Composer environment is usually the easy part — it's largely repeating what already exists via Terraform with a different set of parameters. The cutover itself is where migrations succeed or fail, and it deserves far more planning time than it usually gets.

A few things matter more than they seem to during planning:

DAG-by-DAG migration, not a big-bang switch. Moving every DAG at once maximizes risk for no real benefit. A better approach is to migrate DAGs in waves, starting with lower-risk, non-SLA pipelines to validate the new environment's behavior — scheduler timing, worker sizing, connection and variable configuration — before moving anything business-critical.
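A wave plan can be generated from the DAG inventory instead of being maintained by hand. The sketch below assumes a hypothetical inventory of dicts with dag_id and sla_critical keys, and a wave size chosen arbitrarily; it only orders the work and does not touch any environment.

```python
def plan_waves(dags, wave_size=10):
    """Group DAG IDs into waves: all non-SLA waves first, then SLA-critical ones.

    No wave mixes the two tiers, so the early waves validate the new
    environment before anything business-critical moves.
    """

    def chunk(ids):
        return [ids[i:i + wave_size] for i in range(0, len(ids), wave_size)]

    low_risk = sorted(d["dag_id"] for d in dags if not d["sla_critical"])
    critical = sorted(d["dag_id"] for d in dags if d["sla_critical"])
    return chunk(low_risk) + chunk(critical)


# Hypothetical inventory
inventory = [
    {"dag_id": "revenue_rollup", "sla_critical": True},
    {"dag_id": "adhoc_export", "sla_critical": False},
    {"dag_id": "log_cleanup", "sla_critical": False},
    {"dag_id": "sandbox_test", "sla_critical": False},
]
for number, wave in enumerate(plan_waves(inventory, wave_size=2), start=1):
    print(number, wave)
```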

Connections, variables, and secrets don't migrate automatically. Every Airflow connection, environment variable, and secret manager reference configured in the old environment needs to be explicitly recreated in the new one. This sounds obvious until a migrated DAG fails at 2 AM because a connection ID that existed in the old environment was never recreated in the new one — a completely avoidable failure that happens constantly in real migrations because it's tedious, unglamorous work that's easy to under-scope.

Doing this by hand through the Airflow UI is exactly how connections get missed. It's far more reliable to export and diff them as data:

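# Export connections and variables from the source environment.
# /home/airflow/gcs/data is synced to the data/ folder of the environment's
# bucket; a path like /tmp exists only on the pod that ran the command.
gcloud composer environments run OLD_ENV_NAME \
  --location us-central1 -- connections export /home/airflow/gcs/data/connections.json

gcloud composer environments run OLD_ENV_NAME \
  --location us-central1 -- variables export /home/airflow/gcs/data/variables.json

# Copy the exports between environment buckets (bucket names are placeholders).
# These files contain credentials in plain text, so delete them once imported.
gsutil cp gs://OLD_ENV_BUCKET/data/connections.json \
  gs://OLD_ENV_BUCKET/data/variables.json gs://NEW_ENV_BUCKET/data/

# Import into the new environment
gcloud composer environments run NEW_ENV_NAME \
  --location us-central1 -- connections import /home/airflow/gcs/data/connections.json

gcloud composer environments run NEW_ENV_NAME \
  --location us-central1 -- variables import /home/airflow/gcs/data/variables.json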

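Then verify nothing was dropped, rather than assuming the import succeeded silently. Export the connections from the new environment as well, download both files (saved here as connections_old.json and connections_new.json), and compare the connection IDs: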

import json

with open("connections_old.json") as f:
    old = set(json.load(f).keys())
with open("connections_new.json") as f:
    new = set(json.load(f).keys())

missing = old - new
if missing:
    print(f"Missing {len(missing)} connections in new environment: {missing}")
else:
    print("All connections accounted for.")

This kind of diff takes minutes to write and catches exactly the class of error that otherwise surfaces as a production failure at the worst possible time.
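A connection can also exist in both environments and still differ, for example when the host was retyped by hand. The sketch below extends the same idea to a field-level comparison, assuming both exports are JSON objects keyed by connection ID with one dict of fields per connection. It reports only field names, never values, so secrets stay out of the output; the function name and sample data are hypothetical.

```python
def changed_connections(old, new):
    """Map each shared connection ID to the sorted field names that differ."""
    changed = {}
    for conn_id in sorted(old.keys() & new.keys()):
        before, after = old[conn_id], new[conn_id]
        diffs = sorted(
            field
            for field in before.keys() | after.keys()
            if before.get(field) != after.get(field)
        )
        if diffs:
            changed[conn_id] = diffs
    return changed


# Hypothetical exports, already parsed with json.load
old_conns = {"warehouse": {"conn_type": "postgres", "host": "db-old.internal", "port": 5432}}
new_conns = {"warehouse": {"conn_type": "postgres", "host": "db-new.internal", "port": 5432}}
print(changed_connections(old_conns, new_conns))
```

Not every difference is a bug, since some hosts are meant to change with the environment, but each one should be a deliberate decision.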

A defined cutover window with an explicit rollback plan. For any DAG being migrated, there should be a clear point at which the old environment's copy is disabled and the new environment's copy takes over — and a clear, tested way to reverse that if the new environment's copy fails. Running the same DAG active in both environments simultaneously, even briefly, is a reliable way to get duplicate writes into downstream tables, which is often a worse outcome than the DAG simply not running for a few hours.

A small script that pauses the DAG in the old environment and unpauses it in the new one atomically (from the operator's perspective) removes the temptation to do this manually and get the order wrong under time pressure:

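#!/usr/bin/env bash
# cutover_dag.sh: move a single DAG from old to new environment
set -euo pipefail

DAG_ID="${1:?usage: cutover_dag.sh DAG_ID}"
OLD_ENV="composer-old"
NEW_ENV="composer-new"
LOCATION="us-central1"

rollback() {
  echo "Rolling back: unpausing $DAG_ID in $OLD_ENV."
  gcloud composer environments run "$OLD_ENV" --location "$LOCATION" \
    -- dags unpause "$DAG_ID"
  exit 1
}

echo "Pausing $DAG_ID in $OLD_ENV..."
gcloud composer environments run "$OLD_ENV" --location "$LOCATION" \
  -- dags pause "$DAG_ID"

echo "Verifying $DAG_ID exists and is healthy in $NEW_ENV..."
# Capture output first: piping gcloud straight into grep would treat a failed
# gcloud call the same as "no import errors" and carry on with the cutover.
import_errors="$(gcloud composer environments run "$NEW_ENV" --location "$LOCATION" \
  -- dags list-import-errors 2>&1)" || rollback
dag_list="$(gcloud composer environments run "$NEW_ENV" --location "$LOCATION" \
  -- dags list 2>&1)" || rollback

# Plain substring checks: a heuristic, since import errors are reported per file.
if grep -qF -- "$DAG_ID" <<<"$import_errors"; then
  echo "Import errors mention $DAG_ID in $NEW_ENV; aborting cutover."
  rollback
fi
if ! grep -qF -- "$DAG_ID" <<<"$dag_list"; then
  echo "$DAG_ID is not listed in $NEW_ENV; aborting cutover."
  rollback
fi

echo "Unpausing $DAG_ID in $NEW_ENV..."
gcloud composer environments run "$NEW_ENV" --location "$LOCATION" \
  -- dags unpause "$DAG_ID"

echo "Cutover complete for $DAG_ID."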

The rollback branch is the part worth keeping even after the migration is finished — the same script becomes the incident-response tool if the newly migrated DAG misbehaves days later and needs to move back temporarily.
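The duplicate-write risk described above can also be checked for directly, as a guard that runs on a schedule during the migration. This is a sketch only, assuming the paused state of each DAG has already been collected from both environments into dicts of dag_id to a paused flag; how that state is collected is left out, and the names are hypothetical.

```python
def active_in_both(old_paused, new_paused):
    """Return DAG IDs that are unpaused in both environments at the same time."""
    shared = old_paused.keys() & new_paused.keys()
    return sorted(d for d in shared if not old_paused[d] and not new_paused[d])


# Hypothetical state: True means the DAG is paused in that environment
old_state = {"daily_aggregation": False, "export_report": True, "log_cleanup": False}
new_state = {"daily_aggregation": False, "export_report": False}
print(active_in_both(old_state, new_state))
```

An empty result is the expected steady state; any DAG ID in the list is a candidate for duplicate writes and should be paused on one side immediately.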

Monitoring parity before cutover, not after. If your alerting, dashboards, and on-call runbooks are wired to the old environment's logs and metrics, migrating the DAG without migrating the observability around it means the team loses visibility exactly when they need it most — during the highest-risk period right after cutover.
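Monitoring parity can be approximated with a simple comparison of alert definitions. The sketch below assumes alert policies have been exported as a mapping of policy name to filter string, and that a policy counts as migrated when an otherwise identical filter exists for the new environment name. Both the data shape and that matching rule are assumptions for illustration.

```python
def policies_without_counterpart(filters, old_env, new_env):
    """Return names of policies that reference old_env and have no new_env twin."""
    existing = set(filters.values())
    return sorted(
        name
        for name, flt in filters.items()
        if old_env in flt and flt.replace(old_env, new_env) not in existing
    )


# Hypothetical export of alert policy filters
alert_filters = {
    "parse-time-old": 'resource.labels.environment_name="composer-old" AND '
                      'metric.type="composer.googleapis.com/environment/dag_processing/total_parse_time"',
    "parse-time-new": 'resource.labels.environment_name="composer-new" AND '
                      'metric.type="composer.googleapis.com/environment/dag_processing/total_parse_time"',
    "health-old": 'resource.labels.environment_name="composer-old" AND '
                  'metric.type="composer.googleapis.com/environment/healthy"',
}
print(policies_without_counterpart(alert_filters, "composer-old", "composer-new"))
```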

What actually broke in practice

A few failure modes are common enough across Composer migrations that they're worth calling out directly, because they're rarely covered in the official migration documentation:

Timezone and scheduling drift. If DAGs use datetime.now() or naive schedule intervals without an explicit timezone, moving to a new environment (potentially in a different region, with a different default configuration) can shift when a DAG actually fires, even though the schedule_interval string looks identical.

The fix is to never let a DAG's schedule depend on the environment's implicit local time:

# Fragile: a naive start_date is interpreted in the environment's default timezone
from datetime import datetime

from airflow import DAG

dag = DAG(
    "daily_aggregation",
    schedule_interval="0 6 * * *",
    start_date=datetime(2026, 1, 1),
)

# Explicit and portable across environments
import pendulum

local_tz = pendulum.timezone("UTC")

dag = DAG(
    "daily_aggregation",
    schedule_interval="0 6 * * *",
    start_date=pendulum.datetime(2026, 1, 1, tz=local_tz),
)

Before cutover, it's worth diffing the next N scheduled run times for a DAG between both environments rather than assuming they match:

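from itertools import islice

import pendulum
from airflow.models import DagBag

def next_runs(dag_folder, dag_id, n=5, horizon_days=7):
    """Return the next n run-after timestamps for dag_id, looking ahead from now."""
    dagbag = DagBag(dag_folder=dag_folder, include_examples=False)
    dag = dagbag.dags.get(dag_id)  # in-memory lookup of the parsed DAG
    if dag is None:
        raise ValueError(f"{dag_id} not found in {dag_folder}")
    now = pendulum.now("UTC")
    infos = dag.iter_dagrun_infos_between(now, now.add(days=horizon_days))
    return [info.run_after.isoformat() for info in islice(infos, n)]

# Both folders are parsed under the local Airflow config, so this compares DAG
# code; run it inside each environment to also compare default_timezone effects.
print("Old env:", next_runs("/old/dags", "daily_aggregation"))
print("New env:", next_runs("/new/dags", "daily_aggregation"))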

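Pool and concurrency settings reset to defaults. Custom Airflow pools created in the old environment's Airflow UI, along with environment-level concurrency defaults set as Airflow configuration overrides, don't travel with a DAG's code (a max_active_runs value written in the DAG file does). If those limits existed to prevent a specific pipeline from overwhelming a downstream system, the new environment silently falls back to its own defaults until someone notices the downstream system is under load again. A task that names a pool that was never recreated is a separate failure mode: the scheduler generally refuses to schedule it, so the DAG appears stuck rather than overloaded.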

Since pools live in the metadata database rather than in DAG code, they need to be exported and recreated explicitly, the same way connections do:

# List pools in the old environment
gcloud composer environments run composer-old --location us-central1 \
  -- pools list -o json > pools_old.json

# Recreate each pool in the new environment
python3 - <<'EOF'
import json, subprocess

with open("pools_old.json") as f:
    pools = json.load(f)

for pool in pools:
    if pool["pool"] == "default_pool":
        continue
    subprocess.run([
        "gcloud", "composer", "environments", "run", "composer-new",
        "--location", "us-central1", "--",
        # description can be null in the JSON; subprocess rejects None arguments
        "pools", "set", pool["pool"], str(pool["slots"]), pool.get("description") or ""
    ], check=True)
EOF
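As with connections, it is worth confirming the pools actually landed with the right sizes. A minimal sketch, assuming the pool listing from each environment has been parsed into a list of dicts with pool and slots keys (the same shape the script above reads from pools_old.json); the function name and sample data are hypothetical.

```python
def pool_mismatches(old_pools, new_pools):
    """Return (missing pool names, {name: (old_slots, new_slots)}) for the new env."""
    old = {p["pool"]: int(p["slots"]) for p in old_pools}
    new = {p["pool"]: int(p["slots"]) for p in new_pools}
    missing = sorted(old.keys() - new.keys())
    resized = {
        name: (old[name], new[name])
        for name in sorted(old.keys() & new.keys())
        if old[name] != new[name]
    }
    return missing, resized


# Hypothetical listings; slots may arrive as strings, hence the int() above
old_pools = [{"pool": "default_pool", "slots": "128"}, {"pool": "crm_api", "slots": "4"},
             {"pool": "warehouse_writes", "slots": "8"}]
new_pools = [{"pool": "default_pool", "slots": "128"}, {"pool": "warehouse_writes", "slots": "16"}]
print(pool_mismatches(old_pools, new_pools))
```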

Dependency version drift. A new Composer environment usually means a new base image with updated provider package versions. A DAG that worked fine on an older provider version can break on operator behavior changes that aren't obvious from a changelog — this is the kind of thing that's worth testing in the new environment well before the DAG is anywhere near production cutover, not discovering during the cutover window itself.

Diffing installed provider versions between environments takes one command and catches most of these before they become a cutover-day surprise:

# `environments run` only executes Airflow CLI subcommands, so use
# `providers list` instead of passing an arbitrary python command
gcloud composer environments run composer-old --location us-central1 \
  -- providers list -o plain | grep apache-airflow-providers | sort > providers_old.txt

gcloud composer environments run composer-new --location us-central1 \
  -- providers list -o plain | grep apache-airflow-providers | sort > providers_new.txt

diff providers_old.txt providers_new.txt

Anything in the diff is a candidate for regression testing before that DAG moves — especially operators for BigQuery, GCS, and Pub/Sub, where behavior changes between provider versions are common.
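A raw diff shows every change with equal weight, while major-version bumps usually deserve attention first. The sketch below is one way to rank them, assuming each listing line either has the form name==version or is a whitespace-separated row that starts with the package name and ends with the version. Treating a change in the first version component as "major" is a simplifying assumption.

```python
def provider_changes(old_lines, new_lines):
    """Return (name, old_version, new_version, major_bump) for changed providers."""

    def parse(lines):
        versions = {}
        for line in lines:
            line = line.strip()
            if "==" in line:
                name, _, version = line.partition("==")
            else:
                tokens = line.split()
                if len(tokens) < 2:
                    continue
                name, version = tokens[0], tokens[-1]
            if name.startswith("apache-airflow-providers"):
                versions[name] = version
        return versions

    old, new = parse(old_lines), parse(new_lines)
    changes = []
    for name in sorted(old.keys() | new.keys()):
        before, after = old.get(name), new.get(name)
        if before != after:
            major = bool(before and after and before.split(".")[0] != after.split(".")[0])
            changes.append((name, before, after, major))
    return changes


# Hypothetical listings
old_listing = ["apache-airflow-providers-google==10.1.0", "apache-airflow-providers-http==4.5.0"]
new_listing = ["apache-airflow-providers-google==11.0.0", "apache-airflow-providers-http==4.5.1"]
for change in provider_changes(old_listing, new_listing):
    print(change)
```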

The mental model that makes this tractable

The single biggest mindset shift in a Composer migration is treating it as a series of small, reversible changes rather than one large infrastructure project. Every DAG migrated is its own mini-project with its own validation, its own connections checklist, and its own rollback plan. That sounds slower than a coordinated big-bang cutover, and in wall-clock time it often is — but it fails safely. A single DAG migrating badly is a contained incident. A hundred DAGs migrating badly at once, on the same day, is an outage with no clear starting point to debug from.
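Treating each DAG as its own mini-project is easier to enforce when the checklist is explicit. A minimal sketch, assuming a hypothetical set of checklist items drawn from the sections above; the class and field names are illustrative, not part of Airflow or Composer.

```python
from dataclasses import dataclass, fields


@dataclass
class MigrationChecklist:
    """Per-DAG cutover checklist; every flag must be True before cutover."""
    dag_id: str
    connections_verified: bool = False
    pools_verified: bool = False
    schedule_diffed: bool = False
    monitoring_in_place: bool = False
    rollback_tested: bool = False

    def blockers(self):
        """Return the names of checklist items that are still outstanding."""
        return [
            f.name
            for f in fields(self)
            if f.name != "dag_id" and not getattr(self, f.name)
        ]

    def ready(self):
        return not self.blockers()


checklist = MigrationChecklist("daily_aggregation", connections_verified=True,
                               schedule_diffed=True)
print(checklist.ready(), checklist.blockers())
```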

Takeaway

A multi-instance Composer architecture solves real problems — scheduler contention, blast radius, and metadata database load — but only if the split strategy is chosen deliberately and the cutover is executed as a controlled, incremental process rather than a single event. The infrastructure work is the least risky part of the migration. The connections, variables, monitoring parity, and DAG-by-DAG validation are where production incidents actually come from, and they deserve at least as much planning time as the Terraform.