Building Retrieval-Augmented Generation (RAG) Systems on GCP: An Architecture and Debugging Guide · Ismail Ait Bahammou — Portfolio

What RAG is actually solving

A large language model's knowledge is frozen at training time and generic to whatever it was trained on. Ask it about internal documentation, a private codebase, or anything created after its cutoff, and it either says it doesn't know or — worse — generates something plausible-sounding and wrong. Retrieval-Augmented Generation fixes this by inserting a retrieval step before generation: given a user's question, first find the most relevant chunks of your own data, then hand those chunks to the model as context along with the question. The model still does the reasoning and language generation, but it's grounded in real, retrievable text instead of relying purely on what it memorized during training.

The mechanics are conceptually simple: chunk your documents, embed the chunks into vectors, store the vectors in a searchable index, embed the incoming query the same way, and retrieve the nearest chunks by vector similarity. But every one of those five steps has a failure mode that doesn't show up until you're past the prototype stage and dealing with a real document set at real scale. Those failure modes are what this guide documents.
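To make the five steps concrete, here is a toy, dependency-free sketch of the whole loop. It is an illustration under loud assumptions: the "embedding" is just a bag-of-words count vector standing in for a real embedding model, the chunker is a naive fixed-size word window, and the names chunk, embed, cosine, and retrieve are hypothetical helpers, not part of any GCP SDK.

```python
import math
import re
from collections import Counter


def chunk(text: str, size: int = 20) -> list[str]:
    """Step 1: split text into fixed-size word windows (deliberately naive)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def embed(text: str) -> Counter:
    """Steps 2 and 4: toy stand-in for an embedding model (word counts)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(index: list[tuple[str, Counter]], query: str, top_k: int = 2):
    """Step 5: rank stored chunks by similarity to the embedded query."""
    q = embed(query)
    scored = [(cosine(q, vec), text) for text, vec in index]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]


docs = [
    "To rotate the TLS certificate, upload the new certificate and update the load balancer target proxy.",
    "Rollback procedure: redeploy the previous release tag and verify the health checks pass.",
]

# Step 3: the "index" here is just a list of (chunk text, vector) pairs.
index = [(c, embed(c)) for d in docs for c in chunk(d)]

results = retrieve(index, "how do I rotate the certificate on the load balancer", top_k=1)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

Every failure mode discussed below has a counterpart in this sketch: the chunker ignores structure, the index is only valid for one embed function, and nothing refreshes it when docs change.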

GCP's RAG surface area: three tools, three levels of abstraction

Google Cloud doesn't offer a single "RAG service" — it offers a spectrum of tools at different abstraction levels, and picking the wrong one for your use case is a common early mistake:

Vertex AI RAG Engine — a managed orchestration layer that handles chunking, embedding, indexing, and retrieval for you, with a corpus abstraction as the core object. It's the middle ground: less setup than building everything yourself, more control than a fully turnkey search product.
Vertex AI Vector Search — the underlying managed vector database (formerly Matching Engine), if you want to own the retrieval pipeline yourself rather than use RAG Engine's managed orchestration. The service was substantially redesigned in early 2026 to consolidate vectors, metadata, and chunk text into a single resource instead of requiring you to separately manage an index, an index endpoint, and an external metadata store.
BigQuery vector search — if your data already lives in BigQuery and you want to keep embeddings alongside your existing tables rather than standing up a separate vector database, BigQuery supports storing embeddings as ARRAY<FLOAT64> columns and querying them with a native vector search function.

As of 2026, Google has also reorganized Vertex AI itself under a broader "Agent Platform" umbrella, with RAG Engine and Vector Search now living under that platform's "Build" section rather than as standalone Vertex AI products. This is worth knowing so that console navigation and documentation URLs don't look unfamiliar if you learned this from older material.

The rule of thumb: use RAG Engine when you want a managed pipeline and don't want to own chunking/retrieval logic yourself; use Vector Search directly when you need fine-grained control over indexing behavior or are integrating with a non-Google embedding/generation stack; use BigQuery vector search when your source data and your embeddings both belong in the same analytical warehouse and you want to avoid operating a separate system entirely.
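The rule of thumb can be written down as a small decision helper, which is sometimes useful in design docs or project templates. This is only a sketch: the Backend enum, the choose_backend function, and its three flags are hypothetical names, and the precedence between the conditions (control needs first, then data location) is an assumption rather than something the rule itself dictates.

```python
from enum import Enum


class Backend(Enum):
    RAG_ENGINE = "Vertex AI RAG Engine"
    VECTOR_SEARCH = "Vertex AI Vector Search"
    BIGQUERY = "BigQuery vector search"


def choose_backend(
    *,
    needs_index_control: bool,
    non_google_stack: bool,
    data_in_bigquery: bool,
) -> Backend:
    """Encode the rule of thumb; the ordering of checks is an assumption."""
    if needs_index_control or non_google_stack:
        return Backend.VECTOR_SEARCH
    if data_in_bigquery:
        return Backend.BIGQUERY
    # Default: let the managed pipeline own chunking and retrieval.
    return Backend.RAG_ENGINE


choice = choose_backend(
    needs_index_control=False,
    non_google_stack=False,
    data_in_bigquery=True,
)
print(choice.value)
```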

A minimal RAG Engine pipeline

The fastest way to get a working RAG pipeline on GCP is RAG Engine, since it collapses chunking, embedding, and indexing into a handful of API calls:

# create_corpus.py — set up a RAG corpus and ingest documents
import vertexai
from vertexai.preview import rag

vertexai.init(project="your-project-id", location="us-central1")

corpus = rag.create_corpus(
    display_name="internal-docs-corpus",
)

rag.import_files(
    corpus_name=corpus.name,
    paths=["gs://your-bucket/docs/"],
    chunk_size=512,
    chunk_overlap=100,
)

print(f"Corpus created: {corpus.name}")

# query_rag.py — retrieve relevant chunks for a question
import vertexai
from vertexai.preview import rag

vertexai.init(project="your-project-id", location="us-central1")

corpus_name = "projects/your-project-id/locations/us-central1/ragCorpora/CORPUS_ID"

response = rag.retrieval_query(
    rag_resources=[rag.RagResource(rag_corpus=corpus_name)],
    text="How do I configure SSL certificates for the load balancer?",
    similarity_top_k=5,
)

for chunk in response.contexts.contexts:
    print(f"Score: {chunk.score:.4f} | Source: {chunk.source_uri}")
    print(chunk.text[:200], "...")

# rag_generation.py — combine retrieval with generation
import vertexai
from vertexai.preview import rag
from vertexai.generative_models import GenerativeModel, Tool

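vertexai.init(project="your-project-id", location="us-central1")

# Same corpus resource name as in query_rag.py; replace CORPUS_ID with your own.
corpus_name = "projects/your-project-id/locations/us-central1/ragCorpora/CORPUS_ID"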

rag_tool = Tool.from_retrieval(
    retrieval=rag.Retrieval(
        source=rag.VertexRagStore(
            rag_resources=[rag.RagResource(rag_corpus=corpus_name)],
            similarity_top_k=5,
        )
    )
)

model = GenerativeModel("gemini-2.0-flash", tools=[rag_tool])
response = model.generate_content("How do I configure SSL certificates for the load balancer?")
print(response.text)

This is enough for a working prototype in an afternoon. The gap between this and something you'd trust in production is almost entirely in the failure modes below.

Failure mode 1: chunking that breaks semantic meaning

The chunk_size/chunk_overlap parameters above look like a minor detail, but bad chunking is the single most common reason RAG retrieval quality is poor. A fixed-size chunker that splits purely on character or token count will happily cut a table in half, separate a heading from the paragraph it introduces, or split a step-by-step procedure across two chunks that get retrieved independently and lose their ordering.

A quick diagnostic before assuming the embedding model or retrieval logic is at fault: pull the actual retrieved chunks for a known-bad query and read them.

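# Assumes vertexai.init(...) has run and that `rag` and corpus_name are defined as in query_rag.py above.
def debug_retrieval(corpus_name, query, top_k=5):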
    response = rag.retrieval_query(
        rag_resources=[rag.RagResource(rag_corpus=corpus_name)],
        text=query,
        similarity_top_k=top_k,
    )
    for i, chunk in enumerate(response.contexts.contexts):
        print(f"--- Chunk {i+1} (score={chunk.score:.4f}) ---")
        print(chunk.text)
        print()

debug_retrieval(corpus_name, "What's the rollback procedure if a deployment fails?")

If the top-scoring chunks look semantically unrelated to the query, or contain only a fragment of the relevant procedure, the fix is almost always in chunking strategy — larger chunks with more overlap for procedural content, or structure-aware chunking (splitting on headings/sections instead of raw character count) for documents with meaningful hierarchy — rather than in the embedding model.
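As a rough illustration of structure-aware chunking, the sketch below splits on headings first and then packs whole paragraphs into chunks, repeating the heading on every chunk so a retrieved fragment still says which section it came from. It assumes Markdown-style `#` headings and blank-line-separated paragraphs; split_sections and chunk_by_structure are hypothetical helpers, and a paragraph longer than max_chars is kept whole rather than split.

```python
import re

# Assumption: headings look like Markdown ("# Title", "## Subsection", ...).
HEADING = re.compile(r"^#{1,6}\s+\S")


def split_sections(text: str) -> list[tuple[str, str]]:
    """Return (heading, body) pairs; text before the first heading gets ''."""
    sections: list[tuple[str, str]] = []
    heading, body = "", []
    for line in text.splitlines():
        if HEADING.match(line):
            if heading or any(l.strip() for l in body):
                sections.append((heading, "\n".join(body).strip()))
            heading, body = line.strip(), []
        else:
            body.append(line)
    if heading or any(l.strip() for l in body):
        sections.append((heading, "\n".join(body).strip()))
    return sections


def chunk_by_structure(text: str, max_chars: int = 800) -> list[str]:
    """Pack whole paragraphs into chunks without crossing a heading boundary."""
    chunks: list[str] = []
    for heading, body in split_sections(text):
        paragraphs = [p.strip() for p in re.split(r"\n\s*\n", body) if p.strip()]
        current = heading
        for para in paragraphs:
            candidate = f"{current}\n\n{para}" if current else para
            if len(candidate) > max_chars and current != heading:
                # Chunk is full: flush it and start a new one under the same heading.
                chunks.append(current)
                current = f"{heading}\n\n{para}" if heading else para
            else:
                current = candidate
        if current != heading:  # skip sections that have a heading but no body
            chunks.append(current)
    return chunks


sample = """# Deploy

Run the deploy script.

# Rollback

1. Stop traffic.

2. Redeploy the previous tag.

3. Verify health checks.
"""

for c in chunk_by_structure(sample, max_chars=200):
    print(repr(c))
```

With a generous max_chars the whole rollback procedure stays in one chunk with its steps in order; with a small one, each piece still carries the "# Rollback" heading.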

Failure mode 2: embedding dimension and model mismatches

If you switch embedding models mid-project — say, moving from text-embedding-004 to a newer model for better retrieval quality — every previously indexed chunk was embedded with the old model's vector space. Mixing vectors from two different embedding models in the same index doesn't throw an error; it just produces silently meaningless similarity scores, since the two models don't share a coordinate system.

from vertexai.language_models import TextEmbeddingModel

def check_embedding_dimensions(model_name: str, sample_text: str = "test") -> int:
    model = TextEmbeddingModel.from_pretrained(model_name)
    embedding = model.get_embeddings([sample_text])[0]
    return len(embedding.values)

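# Model names are examples; substitute the ones your project actually uses.
old_model, new_model = "text-embedding-004", "text-embedding-005"
old_dim = check_embedding_dimensions(old_model)
new_dim = check_embedding_dimensions(new_model)
print(f"Old model dimension: {old_dim}, new model dimension: {new_dim}")

if old_dim != new_dim:
    print("Dimension mismatch: the index MUST be fully rebuilt, not incrementally updated.")
elif old_model != new_model:
    # Equal dimensions do not make two models' vector spaces compatible.
    print("Same dimension, different model: a full rebuild is still required.")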

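The rule that prevents this from becoming a production incident: any embedding model change requires a full re-embed and re-index of the entire corpus, never an incremental update. A dimension check only catches the loud case; two models can emit vectors of the same length in unrelated spaces, so the guard should compare the model identifier recorded when the index was built. It's worth building that check into your ingestion pipeline as an explicit guard rather than trusting that nobody will change the model config without remembering the consequence.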
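One way to make that guard explicit is to store a small manifest next to the index and refuse to ingest when the configured model differs from it. The sketch below is a minimal version under assumptions: IndexManifest, EmbeddingModelMismatch, and assert_compatible are hypothetical names, and where the manifest actually lives (a GCS object, a database row, corpus metadata) is left open.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexManifest:
    """What the index was built with; persisted alongside the index."""
    embedding_model: str
    dimensions: int


class EmbeddingModelMismatch(RuntimeError):
    """Raised when ingestion would mix vectors from different models."""


def assert_compatible(manifest: IndexManifest, model_name: str, dimensions: int) -> None:
    """Fail fast before any new vectors are written to the index."""
    if manifest.embedding_model != model_name:
        raise EmbeddingModelMismatch(
            f"Index was built with {manifest.embedding_model!r}, "
            f"but ingestion is configured for {model_name!r}; rebuild the full index."
        )
    if manifest.dimensions != dimensions:
        raise EmbeddingModelMismatch(
            f"Index expects {manifest.dimensions}-dimensional vectors, got {dimensions}."
        )


manifest = IndexManifest(embedding_model="text-embedding-004", dimensions=768)

# Same model and dimension: passes silently.
assert_compatible(manifest, "text-embedding-004", 768)

# Different model with the same dimension: still rejected.
try:
    assert_compatible(manifest, "text-embedding-005", 768)
except EmbeddingModelMismatch as exc:
    print(f"Blocked ingestion: {exc}")
```

Comparing the model name first is the point of the design: the dimension check alone would have let the second call through.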

Failure mode 3: stale indexes after document updates

RAG Engine and Vector Search both index documents at ingestion time. If a source document changes — a policy gets updated, a runbook gets corrected — the index doesn't know unless you explicitly re-ingest it. This produces a particularly dangerous failure: the system doesn't fail loudly, it confidently retrieves and cites the old version of the document as if it were current.

A basic staleness check compares source document modification time in Cloud Storage against last ingestion time for the corpus:

from google.cloud import storage
import datetime

def find_stale_documents(bucket_name: str, last_ingestion_time: datetime.datetime) -> list[str]:
    client = storage.Client()
    bucket = client.bucket(bucket_name)
    stale = []
    for blob in bucket.list_blobs():
        if blob.updated > last_ingestion_time:
            stale.append(blob.name)
    return stale

last_ingest = datetime.datetime(2026, 6, 15, tzinfo=datetime.timezone.utc)
stale_docs = find_stale_documents("your-bucket", last_ingest)

if stale_docs:
    print(f"{len(stale_docs)} document(s) modified since last ingestion:")
    for doc in stale_docs:
        print(f"  - {doc}")

Running this as a scheduled check — the same pattern as any pipeline health audit — and triggering a targeted re-ingestion of just the changed files (rather than a full corpus rebuild every time) keeps the index honest without making ingestion the bottleneck of your document workflow.
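A timestamp comparison finds modified files but cannot see deletions, since a deleted object simply stops appearing in the listing while its chunks stay in the index. A sketch of a fuller plan is to diff two manifests, one saved at the last ingestion and one taken now. Here diff_manifests is a hypothetical helper and the manifest values are assumed to be Cloud Storage object generation numbers (any value that changes on every overwrite, such as a content hash, would work the same way).

```python
def diff_manifests(
    previous: dict[str, int],
    current: dict[str, int],
) -> dict[str, list[str]]:
    """Compare {object name: version} snapshots and plan a targeted re-ingest."""
    added = sorted(current.keys() - previous.keys())
    removed = sorted(previous.keys() - current.keys())
    changed = sorted(
        name
        for name in current.keys() & previous.keys()
        if current[name] != previous[name]
    )
    return {"added": added, "changed": changed, "removed": removed}


# Illustrative snapshots; the numbers stand in for object generations.
previous = {
    "docs/runbook.md": 1718400000000001,
    "docs/policy.md": 1718400000000002,
    "docs/old.md": 1718400000000003,
}
current = {
    "docs/runbook.md": 1718400000000001,
    "docs/policy.md": 1750000000000009,
    "docs/new.md": 1750000000000010,
}

plan = diff_manifests(previous, current)
print(plan)
```

Files under "added" and "changed" get re-ingested; files under "removed" need their chunks deleted from the corpus, otherwise retired documents keep being retrieved.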

Failure mode 4: BigQuery vector search index staleness and query cost

If you're using BigQuery's native vector search instead of a dedicated vector database, the failure modes shift slightly. VECTOR_SEARCH can run against a table directly (brute force: exact, but slow at scale) or against an approximate index built with CREATE VECTOR INDEX (fast, but BigQuery refreshes the index asynchronously, so it can lag behind newly added rows):

-- Exact search: correct but scans every row — fine for small corpora, expensive at scale
SELECT
  base.chunk_id,
  base.chunk_text,
  distance
FROM
  VECTOR_SEARCH(
    TABLE `your_dataset.document_embeddings`,
    'embedding',
    (SELECT embedding FROM `your_dataset.query_embeddings` WHERE query_id = 'q1'),
    top_k => 5,
    distance_type => 'COSINE'
  );

-- Create an approximate index for large corpora
CREATE OR REPLACE VECTOR INDEX doc_embedding_index
ON `your_dataset.document_embeddings`(embedding)
OPTIONS (
  index_type = 'IVF',
  distance_type = 'COSINE'
);

-- Check whether the index is actually being used, or silently falling back to brute force
SELECT
  table_name,
  index_name,
  index_status,
  coverage_percentage
FROM
  `your_dataset.INFORMATION_SCHEMA.VECTOR_INDEXES`;

The detail that catches people off guard: a freshly created vector index doesn't immediately cover 100% of the table. coverage_percentage ramps up over time as BigQuery builds the index in the background, and queries against uncovered rows silently fall back to a brute-force scan for that portion of the data. If retrieval latency spikes unexpectedly after a large ingestion batch, checking coverage_percentage in INFORMATION_SCHEMA.VECTOR_INDEXES is usually the fastest way to confirm that the cause is indexing lag rather than a query plan problem.
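If the coverage query runs on a schedule, the result can be turned into an alert with a few lines of code. The sketch below assumes the rows of the INFORMATION_SCHEMA.VECTOR_INDEXES query above have already been fetched and converted to plain dicts; find_lagging_indexes and the 95% threshold are hypothetical choices, not BigQuery defaults.

```python
def find_lagging_indexes(rows: list[dict], min_coverage: int = 95) -> list[str]:
    """Flag vector indexes that are inactive or not yet covering the table."""
    problems: list[str] = []
    for row in rows:
        name = f"{row['table_name']}.{row['index_name']}"
        status = row.get("index_status")
        coverage = row.get("coverage_percentage") or 0
        if status != "ACTIVE":
            problems.append(f"{name}: status is {status}")
        elif coverage < min_coverage:
            problems.append(f"{name}: only {coverage}% of rows indexed")
    return problems


# Hand-written rows shaped like the query result, for illustration only.
rows = [
    {
        "table_name": "document_embeddings",
        "index_name": "doc_embedding_index",
        "index_status": "ACTIVE",
        "coverage_percentage": 62,
    },
    {
        "table_name": "faq_embeddings",
        "index_name": "faq_embedding_index",
        "index_status": "ACTIVE",
        "coverage_percentage": 100,
    },
]

for problem in find_lagging_indexes(rows):
    print(problem)
```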

Failure mode 5: retrieval that's technically correct but contextually wrong

The hardest failure mode to catch mechanically is retrieval that returns chunks with high similarity scores that are nonetheless the wrong answer — a support document that describes the old version of a feature, or a chunk that's topically similar but answers a subtly different question than the one asked. Vector similarity measures semantic closeness, not correctness or recency, and nothing in the pipeline enforces either of those by default.

Two mitigations are worth building in rather than treating as optional polish:

Metadata filtering alongside vector search, so retrieval can be scoped by recency, document type, or source authority rather than relying purely on embedding similarity:

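response = rag.retrieval_query(
    rag_resources=[rag.RagResource(rag_corpus=corpus_name)],
    text=query,
    # The filter is passed through the retrieval config, not as a top-level argument.
    rag_retrieval_config=rag.RagRetrievalConfig(
        top_k=10,
        filter=rag.Filter(
            # source_type and last_updated are example metadata keys; use your corpus's own.
            metadata_filter='source_type = "official_docs" AND last_updated >= "2026-01-01"'
        ),
    ),
)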

A minimum-score threshold, so the system explicitly says "I don't have a confident answer" instead of generating a response grounded in a weak or barely-related match:

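# Threshold is corpus-specific; tune it against queries with known answers.
MIN_CONFIDENCE = 0.72

# Assumes a higher score means a closer match. Some vector backends report a
# distance instead (lower is better); if yours does, flip the comparison.
relevant_chunks = [c for c in response.contexts.contexts if c.score >= MIN_CONFIDENCE]

if not relevant_chunks:
    print("No sufficiently relevant context found; returning fallback response instead of generating from weak matches.")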
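The two mitigations compose naturally: filter by metadata, drop weak matches, and only build a prompt if something survives. The sketch below does this client-side on already-retrieved candidates, which can serve as a fallback when server-side filtering is unavailable. Candidate, select_context, and build_prompt are hypothetical names, the metadata fields are illustrative, and the code assumes a higher score means a closer match.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    """Hypothetical retrieved chunk plus the metadata the filter needs."""
    text: str
    score: float  # assumed: higher means more similar
    source_type: str
    last_updated: date


def select_context(
    candidates: list[Candidate],
    *,
    allowed_sources: set[str],
    updated_since: date,
    min_score: float,
    top_k: int = 5,
) -> list[Candidate]:
    """Keep only authoritative, recent, confidently matched chunks."""
    kept = [
        c
        for c in candidates
        if c.source_type in allowed_sources
        and c.last_updated >= updated_since
        and c.score >= min_score
    ]
    return sorted(kept, key=lambda c: c.score, reverse=True)[:top_k]


def build_prompt(question: str, context: list[Candidate]) -> Optional[str]:
    """Return a grounded prompt, or None to signal the fallback response."""
    if not context:
        return None
    numbered = "\n\n".join(f"[{i}] {c.text}" for i, c in enumerate(context, start=1))
    return (
        "Answer using only the numbered context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{numbered}\n\nQuestion: {question}"
    )


candidates = [
    Candidate("New SSL steps", 0.91, "official_docs", date(2026, 3, 1)),
    Candidate("Old SSL steps", 0.95, "official_docs", date(2024, 5, 1)),
    Candidate("Forum guess", 0.88, "forum", date(2026, 4, 1)),
    Candidate("Loosely related", 0.40, "official_docs", date(2026, 2, 1)),
]

context = select_context(
    candidates,
    allowed_sources={"official_docs"},
    updated_since=date(2026, 1, 1),
    min_score=0.72,
)
prompt = build_prompt("How do I configure SSL certificates?", context)
print(prompt if prompt is not None else "I don't have a confident answer.")
```

Note that the highest-scoring candidate ("Old SSL steps") is the one the recency filter removes, which is exactly the case similarity alone gets wrong.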

Takeaway

RAG on GCP is easy to prototype and genuinely hard to trust in production, and the gap between the two is almost entirely operational: chunking strategy, embedding model consistency, index freshness, and confidence thresholds — not the retrieval or generation APIs themselves, which work as documented. Treating a RAG pipeline with the same debugging discipline you'd apply to any data pipeline — instrumenting retrieval quality, auditing staleness, and checking index coverage rather than assuming the managed service handles all of it invisibly — is what separates a demo that works on the first three questions from a system people actually rely on.
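As a starting point for instrumenting retrieval quality, a small labeled set of questions with known source documents goes a long way. The sketch below computes hit rate and mean reciprocal rank at k; evaluate_retrieval is a hypothetical helper, and it assumes the retrieve callable returns a ranked list of source identifiers (for example source URIs), which here is faked with a dictionary.

```python
from typing import Callable


def evaluate_retrieval(
    cases: list[tuple[str, str]],
    retrieve: Callable[[str], list[str]],
    k: int = 5,
) -> dict[str, float]:
    """Score a retriever against (query, expected source) pairs."""
    hits = 0
    reciprocal_rank_total = 0.0
    for query, expected_source in cases:
        sources = retrieve(query)[:k]
        if expected_source in sources:
            hits += 1
            reciprocal_rank_total += 1.0 / (sources.index(expected_source) + 1)
    n = len(cases)
    return {
        "hit_rate": hits / n if n else 0.0,
        "mrr": reciprocal_rank_total / n if n else 0.0,
    }


# Fake retriever for illustration; a real one would call the retrieval API.
fake_results = {
    "rollback procedure": ["gs://docs/deploy.md", "gs://docs/rollback.md"],
    "ssl certificates": ["gs://docs/networking.md"],
}

cases = [
    ("rollback procedure", "gs://docs/rollback.md"),
    ("ssl certificates", "gs://docs/tls.md"),
]

metrics = evaluate_retrieval(cases, lambda q: fake_results.get(q, []), k=5)
print(metrics)
```

Tracking these two numbers before and after a chunking or model change turns "retrieval feels worse" into something measurable.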