The problem: pipelines don't know when their tables disappear
In any BigQuery environment that's been running for a while, tables accumulate a quiet kind of technical debt: they get renamed during a refactor, dropped after a dataset migration, replaced by a new version with a different naming convention, or deprecated but never actually referenced anywhere obvious. None of that is a problem until a pipeline written six months ago — by someone who has since moved teams — tries to read from a table that no longer exists, fails at 4 AM, and the on-call engineer has no idea whether the table was ever supposed to be there.
The failure mode is almost always the same: a DAG task fails with something like
google.api_core.exceptions.NotFound: 404 Not found: Table
project:dataset.table_name was not found in location US
and the investigation that follows is manual, slow, and repeated by a different engineer every time it happens. The actual fix — auditing table existence proactively, before a pipeline depends on it — takes almost no BigQuery-specific cleverness. It just requires treating INFORMATION_SCHEMA as a monitoring source instead of a debugging tool you reach for after the fact.
Step one: know what your pipelines actually expect to exist
Before you can check whether tables exist, you need an inventory of which tables should exist — the list of every table any pipeline reads from or writes to. If that list only lives implicitly in scattered SQL files and DAG code, the audit has nothing to check against. So the first real piece of work is building that inventory, not writing the existence check itself.
A lightweight way to do this without a metadata catalog tool is to grep pipeline source code for fully-qualified table references and normalize them into a flat list:
import re
from pathlib import Path
# Matches project.dataset.table or `project.dataset.table`
TABLE_REF_PATTERN = re.compile(
r"`?([a-zA-Z0-9_-]+)\.([a-zA-Z0-9_]+)\.([a-zA-Z0-9_]+)`?"
)
def extract_table_refs(pipeline_dir: str) -> set[tuple[str, str, str]]:
refs = set()
for path in Path(pipeline_dir).rglob("*.sql"):
text = path.read_text(errors="ignore")
for match in TABLE_REF_PATTERN.finditer(text):
project, dataset, table = match.groups()
refs.add((project, dataset, table))
return refs
expected_tables = extract_table_refs("./pipelines")
print(f"Found {len(expected_tables)} distinct table references across pipeline SQL.")
This won't be perfect — dynamically constructed table names (f"{dataset}.{prefix}_{date}") won't get caught by a regex — but it gets you a real, versionable inventory instead of relying on tribal knowledge, and it's usually good enough to cover 80–90% of static references.
Step two: query INFORMATION_SCHEMA for what actually exists
Once you have the expected list, the check itself is a straightforward INFORMATION_SCHEMA.TABLES query — but the detail that matters is scoping it correctly. INFORMATION_SCHEMA.TABLES is dataset-scoped in BigQuery, so auditing "hundreds of pipelines" across many datasets means either querying each dataset individually or using the region-level INFORMATION_SCHEMA view that spans a project:
-- Region-level view: covers every dataset in the project, one region at a time
SELECT
table_catalog AS project_id,
table_schema AS dataset_id,
table_name,
table_type,
creation_time
FROM
`your-project.region-us`.INFORMATION_SCHEMA.TABLES
WHERE
table_type = 'BASE TABLE'
ORDER BY
dataset_id, table_name;
If pipelines span multiple projects (common with a raw/staging/gold medallion setup, each in its own project), the query needs to run once per project and the results unioned:
SELECT 'project_raw' AS source_project, table_schema, table_name
FROM `project-raw.region-us`.INFORMATION_SCHEMA.TABLES
UNION ALL
SELECT 'project_gold' AS source_project, table_schema, table_name
FROM `project-gold.region-us`.INFORMATION_SCHEMA.TABLES;
It's easy to lose an afternoon here to a subtle bug: forgetting that INFORMATION_SCHEMA is region-scoped, and querying region-us while half your datasets live in region-eu, silently returns an incomplete result set with no error — the query just runs and comes back "clean" because it never saw the missing region's tables at all.
Step three: diff expected against actual
With both lists as data, the actual audit is a straightforward set difference — the interesting engineering is in how you run it and what you do with the result, not the diff logic itself:
from google.cloud import bigquery
def get_existing_tables(client: bigquery.Client, project: str, region: str) -> set[tuple[str, str, str]]:
query = f"""
SELECT table_catalog, table_schema, table_name
FROM `{project}.region-{region}`.INFORMATION_SCHEMA.TABLES
WHERE table_type = 'BASE TABLE'
"""
return {
(row.table_catalog, row.table_schema, row.table_name)
for row in client.query(query).result()
}
client = bigquery.Client()
existing = get_existing_tables(client, "your-project", "us")
missing = expected_tables - existing
if missing:
print(f"{len(missing)} tables referenced by pipelines but not found:")
for project, dataset, table in sorted(missing):
print(f" - {project}.{dataset}.{table}")
else:
print("All referenced tables exist.")
Running this as a nightly scheduled check — rather than a script someone remembers to run occasionally — is what turns it from a debugging tool into an actual monitoring system. A missing table found by a scheduled audit at 6 AM, with time to investigate before any pipeline runs, is a completely different experience than the same missing table discovered by a failed production DAG at 4 AM with no warning.
Handling the noisy cases: views, wildcards, and sharded tables
A naive existence check produces false positives fast if it doesn't account for a few BigQuery-specific patterns that look like missing tables but aren't:
Views vs. base tables. The WHERE table_type = 'BASE TABLE' filter above deliberately excludes views. If a pipeline reads from a view rather than a table, checking only BASE TABLE will flag it as missing even though it resolves fine. The fix is to check both:
SELECT table_schema, table_name, table_type
FROM `your-project.region-us`.INFORMATION_SCHEMA.TABLES
WHERE table_type IN ('BASE TABLE', 'VIEW');
Wildcard table references. Pipelines that query dataset.events_* with a _TABLE_SUFFIX filter don't reference a single table — they reference a whole family of date-sharded tables. Checking for the literal string events_* against INFORMATION_SCHEMA.TABLES will never match anything, since no table is literally named that. These references need to be recognized during the extraction step and checked with a prefix match instead of an exact match:
def check_wildcard_table(client, project, dataset, prefix):
query = f"""
SELECT COUNT(*) AS shard_count
FROM `{project}.region-us`.INFORMATION_SCHEMA.TABLES
WHERE table_schema = @dataset
AND table_name LIKE @prefix
"""
job_config = bigquery.QueryJobConfig(query_parameters=[
bigquery.ScalarQueryParameter("dataset", "STRING", dataset),
bigquery.ScalarQueryParameter("prefix", "STRING", f"{prefix}%"),
])
result = list(client.query(query, job_config=job_config).result())
return result[0].shard_count > 0
Time-partitioned tables with a decorator. A reference like dataset.table$20260615 is a partition decorator, not a distinct table name. These need to be stripped back to the base table name before checking, or every partition-scoped read in the codebase will register as a false miss.
Skipping these three cases is the single biggest reason a first attempt at this kind of audit produces a huge, mostly-wrong list of "missing" tables on the first run and gets abandoned as noisy. Handling them explicitly is what makes the audit trustworthy enough that people actually act on its output.
Turning it into an actual alert, not just a report
A script that prints missing tables to stdout is only useful if someone is looking at the output. The last step is wiring the audit into something that surfaces failures without requiring a human to remember to check:
def run_audit_and_alert(expected, existing, webhook_url):
missing = expected - existing
if not missing:
return
lines = [f"⚠️ Table audit found {len(missing)} missing table(s):"]
lines += [f"• `{p}.{d}.{t}`" for p, d, t in sorted(missing)]
message = "\n".join(lines)
import requests
requests.post(webhook_url, json={"text": message}, timeout=10)
Scheduled as a small Cloud Function or a lightweight Airflow DAG that runs before the main pipeline DAGs kick off, this converts "table existence" from an assumption baked silently into every downstream pipeline into something that's actively checked and reported on — closer to a health check than a debugging exercise.
Takeaway
INFORMATION_SCHEMA.TABLES is a small, well-documented view, but the value isn't in the query — it's in building the expected-table inventory it gets checked against, handling the BigQuery-specific noise (views, wildcards, partition decorators, region scoping) that would otherwise drown the signal in false positives, and running it on a schedule instead of reactively. Done properly, it converts "why did this pipeline fail at 4 AM" into "the nightly audit already told us this table would be a problem, three hours before anything depended on it."