
ADR Cost/Benefit Analysis

A comprehensive accounting of every architectural decision record for PodPedia (Project Sunesis), with quantified costs, validated benefits, and empirical evidence where available.

Last updated: 2026-05-14
Scope: All ADRs 001–011 (podpedia-app)
Author: @regular


Quick Reference

| ID | Title | Status | Implemented? | Validated? | ROI Estimate |
|---|---|---|---|---|---|
| ADR-001 | 10K Chunk Threshold | ✅ Accepted | ✅ development | ⚠️ Inconclusive | 🟢 High |
| ADR-002 | Vertex AI Context Caching | ✅ Accepted | feature/ | ❌ Not yet | 🟢 High |
| ADR-003 | SSE Streaming JSON ASTs | ⏳ Provisional | ❌ No | ❌ N/A | 🟡 Medium |
| ADR-004 | Flash-Lite Extraction Engine | ⏳ Provisional | ❌ No | ❌ N/A | 🟡 Medium |
| ADR-005 | Universal LLM Adapter | ✅ Accepted | ✅ development | ❌ Not tested | 🟢 High |
| ADR-006 | Vendor-Neutral Blob Storage | ✅ Accepted | ✅ development | ❌ Not tested | 🟢 High |
| ADR-007 | Dual GraphDB Strategy | ✅ Accepted | ✅ development | ⚠️ Incomplete (rerun data pending) | 🟢 High |
| ADR-008 | Weighted Ensemble Entity Resolution | ✅ Accepted | ✅ development | ❌ Not tested | 🟢 High |
| ADR-009 | Formalized Experiment Tracking | ✅ Accepted | experiments/ | ⏳ EXP-001 pending | 🟡 Medium |
| ADR-010 | Async Pipeline Production Readiness | ✅ Accepted | ✅ development | ❌ Not tested | 🟢 High |
| ADR-011 | Graph Download & Export | ✅ Accepted | ❌ Not built | ❌ N/A | 🟡 Medium |

ADR-001: 10K Character Threshold for Parallel Graph Extraction

Status: Accepted — merged to development, not yet on main

Costs

| Category | Estimate | Notes |
|---|---|---|
| Engineering | ~1 hour | One-line threshold change (20,000 → 10,000). The configuration work was trivial (see the chunking sketch below). |
| Token overhead | Negligible | 200-char overlap across more chunks. For a 15K document: 2 chunks instead of 1, adding 200 tokens of overlapping context. At Vertex AI pricing ($0.000125/1K input tokens), this is ~$0.000025 per job. |
| Graph integrity risk | Low | 10K chars ≈ 1,500–2,000 words is sufficient semantic aperture. The rejected 3K option would have created orphaned nodes. |
| Complexity | None | Zero new code. The parallelization infrastructure (alitto/pond pool) already existed. |
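
The table above leans on the threshold/overlap mechanics. A minimal sketch of the behaviour being tuned is below; the names (chunkThreshold, chunkOverlap, splitIntoChunks) are illustrative, not the project's actual identifiers, and splitting is done on bytes for brevity.

```go
// Illustrative sketch of ADR-001's chunking behaviour: split at a configurable
// character threshold, repeating a small overlap so the extractor keeps
// cross-boundary context.
package main

import "fmt"

const (
	chunkThreshold = 10_000 // was 20,000 before ADR-001
	chunkOverlap   = 200    // characters carried over between adjacent chunks
)

func splitIntoChunks(text string) []string {
	if len(text) <= chunkThreshold {
		return []string{text}
	}
	var chunks []string
	for start := 0; start < len(text); {
		end := start + chunkThreshold
		if end > len(text) {
			end = len(text)
		}
		chunks = append(chunks, text[start:end])
		if end == len(text) {
			break
		}
		// Back up by the overlap so the next chunk re-reads the boundary.
		start = end - chunkOverlap
	}
	return chunks
}

func main() {
	doc := make([]byte, 15_000)
	fmt.Println(len(splitIntoChunks(string(doc)))) // a 15K-char doc yields 2 chunks
}
```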

Benefits

| Metric | Expected | Validated |
|---|---|---|
| Latency (15K-char docs) | 14s → ≤7s (sequential → parallel across 8-worker pool) | ⚠️ Experiment showed −19.2% (45.8s → 37.0s, staging), but p=0.31 with n=15. Directionally correct, not significant. |
| Latency (25K-char docs) | Similar improvement factor | ✅ −68.2% (116.2s → 37.0s, p<0.001). Massive improvement, partially confounded by environment differences. |
| Concurrency | Restores full 10-way semaphore | Unblocked the weighted-semaphore bottleneck. Chunks now fit under 5K tokens, avoiding the weight=2 penalty. |
| Code footprint | +0 lines | Pure config change. |

Verdict

ROI: 🟢 High. Zero implementation cost, zero ongoing cost, and directional latency improvements at every payload size. The 15K case needs statistical confirmation, but even if the true effect is only 10% (not the measured 19%), the change costs nothing. There is no downside.

Re-run Recommendation

Re-run the A/B with controlled Cloud Run instance configuration and ≥30 trials for the 15K payload. Until then, accept the merge with the acknowledgment that the benefit is directionally confirmed but not statistically proven.


ADR-002: Vertex AI Context Caching via Deep Ontology Prompting

Status: Accepted — implemented on feature/adr-002-context-caching, pushed to origin

Costs

| Category | Estimate | Notes |
|---|---|---|
| Engineering (data curation) | HIGH — 5–10 hours | Creating the 36K-token deep ontology with 56 golden few-shot examples is the most labor-intensive task in any ADR. The examples must be meticulously crafted to cover: simple extraction, multi-entity chains, entity normalization, cross-type blocking, temporal relationships, hierarchical orgs, compound entities, empty extraction, and many edge cases. |
| Engineering (implementation) | ~3–4 hours | Cache lifecycle management: InitVertexCache, RefreshCache, background TTL goroutine (sketched below), model setup routing. CachedContent enforces a 32K-token minimum — the ontology must be kept above this threshold or caching fails silently. |
| Runtime cost (caching) | Ongoing hourly billing | Vertex AI CachedContent incurs a storage cost while alive. At ~36K tokens, this is roughly $0.002–0.005/hour (~$0.05–0.12 per 24h). The 1-hour TTL default means the cache is created on demand and expires quickly during idle periods; background refresh keeps it alive during active use. |
| Runtime cost (tokens saved) | Negative cost | Without caching, the system prompt is sent with every request. With caching, it is sent once and referenced by ID. For a 36K-token system prompt × 100 chunks/day, that's 3.6M tokens/day saved in input costs — roughly $0.45/day at Vertex AI pricing. At that volume the cache storage cost ($0.05–0.12/day) is roughly 4–9× lower than the token savings. |
| Complexity | Moderate | Background goroutine for TTL refresh, cache-name routing, async cleanup. Needs careful goroutine lifecycle management. |
| Risk | Medium | If the ontology drops below 32K tokens, caching silently fails and the LLM receives no system prompt at all — output degrades catastrophically. This must be caught in CI. |
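
A minimal sketch of the background TTL-refresh loop referenced above, assuming a RefreshCache-style helper that re-ups the CachedContent TTL; the function and interval names are illustrative, not the project's actual identifiers.

```go
// keepCacheAlive refreshes the Vertex AI context cache on a fixed interval so
// the 1-hour TTL never lapses while the service is handling traffic. It stops
// when ctx is cancelled (e.g. on shutdown).
package cache

import (
	"context"
	"log/slog"
	"time"
)

func keepCacheAlive(ctx context.Context, refresh func(context.Context) error, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if err := refresh(ctx); err != nil {
				// A failed refresh means the next request may pay the full
				// system-prompt cost; log and retry on the next tick.
				slog.Warn("context cache refresh failed", "error", err)
			}
		}
	}
}
```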

Benefits

| Metric | Expected | Validated |
|---|---|---|
| TTFT per chunk | Seconds → milliseconds | Not yet tested. The expected effect is dramatic: every concurrent chunk skips the system-prompt inference step. |
| End-to-end latency | 14s → ~5–8s (stacked with ADR-001) | Not yet tested. Combined effect with ADR-001's fan-out: 8 chunks hit the cached context simultaneously, each with near-zero TTFT. |
| Output quality | Significant improvement | The 56 golden few-shot examples effectively fine-tune the model in-context, eliminating schema drift, hallucinated entity types, and inconsistent JSON output. |
| Token cost per request | ~36K tokens saved per chunk | For 100 chunks/day: $0.45/day saved at Vertex AI input pricing. |

Verdict

ROI: 🟢 High. Significant up-front data curation cost, but the runtime benefits are compounding: (a) latency reduction, (b) output quality improvement from few-shot examples, (c) token cost savings that quickly amortize the data curation investment. The ongoing caching cost (~$0.05–0.12/day) is negligible.

Dependencies

Depends on ADR-005 (Universal LLM Adapter): context caching is Vertex-specific and rides on the adapter's preserved Vertex AI path (see Cross-ADR Dependencies below).

ADR-003: Server-Sent Events for Streaming JSON ASTs (Provisional)

Status: ⏳ Provisional — pending streaming JSON parser validation

Costs

| Category | Estimate | Notes |
|---|---|---|
| Engineering (backend parser) | HIGH — 8–15 hours | A custom, fault-tolerant streaming JSON parser in Go is the critical path. Must handle: unclosed quotes, truncated keys, missing brackets, self-correcting LLM output, and panic-free partial AST assembly. This is non-trivial parser engineering. |
| Engineering (frontend) | ~4–6 hours | Replace polling (GET /api/status) with an EventSource connection. Handle duplicate/revised nodes. Incremental vis.js hydration. State management for partial results. |
| Complexity | HIGH | The streaming parser is the riskiest component in the entire system. A panicking parser brings down the goroutine; an incorrect parser produces corrupted graph state. No off-the-shelf solution exists for LLM JSON streaming. |
| Maintenance burden | Medium | The parser is entirely custom code with no community maintenance. Any change to the LLM's output format (even whitespace or key ordering) could break it. |
| Risk | HIGH | If the parser validation fails (or a future LLM update breaks it), SSE falls back to polling anyway — requiring both code paths to be maintained. |

Benefits

| Metric | Expected | Validated |
|---|---|---|
| Time-to-first-paint | ~14s → <1s | Not validated. The perceived UX improvement is significant: a live "typing" visualization vs. a loading spinner. |
| User engagement | Qualitative | Potential improvement. Live-hydrating graphs are visually engaging and signal progress. Hard to quantify. |
| Backend efficiency | Negligible | The same LLM computation happens either way — streaming just changes the delivery timing. No backend cost savings. |
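
For context on where the cost sits: the SSE delivery side is mechanically simple compared to the parser. A minimal Go sketch, assuming the hard part (the streaming JSON parser) feeds partial-graph payloads into a channel; the handler shape and event payload are illustrative.

```go
// streamGraph writes each partial-AST payload as an SSE data frame. Producing
// the events is the risky work; delivering them is a flush loop.
package sse

import (
	"fmt"
	"net/http"
)

func streamGraph(w http.ResponseWriter, r *http.Request, events <-chan string) {
	flusher, ok := w.(http.Flusher)
	if !ok {
		http.Error(w, "streaming unsupported", http.StatusInternalServerError)
		return
	}
	w.Header().Set("Content-Type", "text/event-stream")
	w.Header().Set("Cache-Control", "no-cache")

	for {
		select {
		case <-r.Context().Done():
			return // client disconnected
		case ev, open := <-events:
			if !open {
				return // extraction finished
			}
			fmt.Fprintf(w, "data: %s\n\n", ev)
			flusher.Flush()
		}
	}
}
```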

Verdict

ROI: 🟡 Medium. The UX improvement is real and meaningful, but the engineering cost is disproportionate unless benchmarks confirm the latency problem is severe enough that a loading spinner is untenable. Recommendation: Keep provisional. Implement only if (a) ADR-001 + ADR-002 don't bring latency below 5s, and (b) user feedback indicates spinner intolerance. The parser risk alone justifies deferral.


ADR-004: Flash-Lite Extraction Engine (Provisional)

Status: ⏳ Provisional — pending quality benchmarks

Costs

| Category | Estimate | Notes |
|---|---|---|
| Engineering | ~2–4 hours | Schema flattening (reduce Node/Edge types to a core subset). Model routing (flash-lite for all, reserving flash for complex documents). May need dynamic routing based on payload characteristics (see the sketch below). |
| Quality risk | HIGH | Flash-lite is a significantly weaker model. Complex multi-hop entity relationships, cross-paragraph inferences, and nuanced relationships may be lost or simplified. The "slightly lower fidelity" trade-off in the ADR undersells this risk: lowered schema fidelity is hard to detect in tests and degrades the knowledge graph's value silently. |
| Schema maintenance | Medium | The simplified schema must be maintained alongside the full schema. If flash-lite eventually handles the full schema, the simplified one becomes dead code. |
| Risk of two code paths | Medium | Dynamic routing (simple docs → flash-lite, complex docs → flash) creates a brittle classifier. Misclassifications produce inconsistent quality. |
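
A minimal sketch of the dynamic-routing idea flagged above, assuming a crude payload classifier; the thresholds and model IDs are illustrative only and would need to come from the quality benchmark this ADR still requires.

```go
// pickModel routes short, structurally simple documents to flash-lite and
// everything else to flash. A misclassification here silently trades graph
// fidelity for speed, which is exactly the risk noted in the costs table.
package routing

import "strings"

func pickModel(doc string) string {
	const (
		liteModel = "gemini-2.0-flash-lite" // illustrative model IDs
		fullModel = "gemini-2.0-flash"
	)
	// Illustrative heuristic: small documents with few paragraph breaks.
	simple := len(doc) < 10_000 && strings.Count(doc, "\n\n") < 20
	if simple {
		return liteModel
	}
	return fullModel
}
```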

Benefits

| Metric | Expected | Validated |
|---|---|---|
| Latency | 14s → <3s | Not validated. Flash-lite is faster (lower TTFT, higher throughput), but the magnitude depends on schema complexity. |
| Cost per extraction | ~5–10× cheaper | Flash-lite is significantly cheaper per token than flash. For high-volume ingestion, this is a real budget impact. |
| Worker pool throughput | Higher | Faster per-chunk processing means the 8-worker pool cycles faster, increasing total jobs/hour. |

Verdict

ROI: 🟡 Medium. The latency and cost benefits are compelling, but the quality risk is understated in the ADR. "Slightly lower fidelity" in entity-graph extraction can manifest as: missing entity types, missed relationships, orphaned nodes, and incomplete graph topology. These are silent degradations that are hard to catch in automated tests.

Recommendation: Before accepting, run a directed quality benchmark: extract the same 50 documents with both flash-lite and flash, then compare (a) entity recall, (b) relationship recall, (c) schema compliance. If flash-lite achieves ≥90% of flash's accuracy on all three, accept. Otherwise, the cost savings don't justify the graph degradation.


ADR-005: Universal LLM Adapter (OpenAI-Compatible API)

Status: ✅ Accepted — implemented on development

Costs

| Category | Estimate | Notes |
|---|---|---|
| Engineering | ~3–5 hours | Replacing three provider-specific adapters (Ollama native, Vertex SDK, Gemini API key) with one openai-go implementation. Most of the effort is in testing the routing logic and ensuring the fallbacks still work. |
| Dependency | Low | github.com/openai/openai-go is a well-maintained, widely used library. |
| Legacy maintenance | Low | Two fallback paths (Vertex AI, Ollama native) are preserved but not actively developed. They serve as a Chesterton's Fence — local dev works without cloud config. |
| Risk | Low | The LLM interface (Generate, GenerateStream, SupportsStructuredOutput) is unchanged. Only the factory function routes differently. If the OpenAI-compatible path fails, the fallback activates transparently. |

Benefits

| Metric | Expected | Validated |
|---|---|---|
| Code reduction | 3 adapters → 1 primary path | Not quantified, but the three adapters had significant duplicated logic (SSE streaming, retries, structured-output enforcement). |
| Provider flexibility | Zero-code provider swaps | Change the LLM_BASE_URL and LLM_API_KEY env vars → instant migration to Groq, DeepSeek, Together, vLLM, or local Ollama (see the sketch below). |
| Structured output enforcement | Native SDK support | The OpenAI SDK's ResponseFormatJSONObjectParam replaces fragile prompt-embedded JSON instructions. Reduces token waste and schema drift. |
| SSE streaming | SDK-native | Replaces custom goroutine management in the legacy paths. |
| Vendor lock-in mitigation | Complete | Not locked to any single provider. The OpenAI spec is the interface standard, not a provider contract. |
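
A minimal sketch of the provider swap described above: the adapter builds one openai-go client from two env vars, so pointing LLM_BASE_URL at any OpenAI-compatible endpoint needs no code change. The env var names come from this ADR; the surrounding wiring is illustrative, not the project's actual factory.

```go
// newClientExample constructs an openai-go client from LLM_BASE_URL and
// LLM_API_KEY. With an empty base URL, openai-go talks to api.openai.com.
package llm

import (
	"os"

	"github.com/openai/openai-go"
	"github.com/openai/openai-go/option"
)

func newClientExample() {
	opts := []option.RequestOption{
		option.WithAPIKey(os.Getenv("LLM_API_KEY")),
	}
	if base := os.Getenv("LLM_BASE_URL"); base != "" {
		opts = append(opts, option.WithBaseURL(base)) // Groq, vLLM, Ollama, ...
	}
	client := openai.NewClient(opts...)
	_ = client // handed to the adapter's Generate/GenerateStream methods
}
```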

Verdict

ROI: 🟢 High. Modest engineering cost with significant long-term benefits: (a) collapsed code complexity, (b) unlimited provider flexibility, (c) eliminated vendor lock-in. The fallback preservation is a wise Chesterton's Fence choice. No empirical validation required — this is an architectural simplification with clearly measurable benefits in code metrics.


ADR-006: Vendor-Neutral Blob Storage (Go CDK)

Status: ✅ Accepted — implemented on development

Costs

| Category | Estimate | Notes |
|---|---|---|
| Engineering | ~2–3 hours | Replace the direct GCS SDK with gocloud.dev/blob. Mostly removing boilerplate (70 lines of GCS-specific code → 15 lines in blobstore_gocloud.go). |
| Dependency weight | Low–Medium | gocloud.dev/blob pulls in transitive cloud SDK dependencies (GCS, S3, Azure), increasing binary size moderately. For Cloud Run, cold-start time is dominated by the container image pull, not Go binary size, so this is acceptable. |
| Risk | Very low | The blob interface is thin (Upload, NewReader). The Go CDK is well maintained and the URL-scheme abstraction is battle-tested. |

Benefits

| Metric | Expected | Validated |
|---|---|---|
| Code reduction | 70 lines of GCS boilerplate → 15 lines | Confirmed. |
| Provider flexibility | Configured via env var | BLOB_STORE_URL=mem:// for tests, file:///tmp/podpedia for local dev, s3://minio:9000 for self-hosted, gs://bucket for GCP (see the sketch below). |
| Testing determinism | Zero-I/O in-memory blobs | The mem:// scheme enables deterministic integration tests with zero network I/O and zero jitter. This directly supports the bifurcated testing strategy. |
| 12-Factor compliance | Config-driven storage | Backing services (Factor IV) are now swappable via configuration, not code. |
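
A minimal sketch of the URL-scheme swap described above. The BLOB_STORE_URL values come from the table; everything else is illustrative. The blank imports register the Go CDK drivers, so the scheme in the URL selects the backend.

```go
package store

import (
	"context"

	"gocloud.dev/blob"
	_ "gocloud.dev/blob/fileblob" // file://
	_ "gocloud.dev/blob/gcsblob"  // gs://
	_ "gocloud.dev/blob/memblob"  // mem:// (deterministic tests)
	_ "gocloud.dev/blob/s3blob"   // s3://
)

// openBucket resolves the configured bucket; writes look the same regardless
// of which backend the URL points at.
func openBucket(ctx context.Context, url string) (*blob.Bucket, error) {
	return blob.OpenBucket(ctx, url)
}

func upload(ctx context.Context, b *blob.Bucket, key string, data []byte) error {
	w, err := b.NewWriter(ctx, key, nil)
	if err != nil {
		return err
	}
	if _, err := w.Write(data); err != nil {
		w.Close()
		return err
	}
	return w.Close()
}
```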

Verdict

ROI: 🟢 High. Low engineering cost, zero ongoing cost, significant portability and testing benefits. The 12-Factor compliance improvement alone justifies this.


ADR-007: Dual GraphDB Strategy (Memory + SQLite WAL)

Status: ✅ Accepted — implemented on development

Costs

| Category | Estimate | Notes |
|---|---|---|
| Engineering | ~8–12 hours | SQLiteGraphDB implementation with WAL mode, JOIN-based graph traversal, LIKE matching, foreign-key constraints (see the WAL setup sketch below). Litestream sidecar integration. MemoryGraphDB delegation. This is the most substantial implementation in the codebase outside of ADR-002. |
| Dependency | Low | modernc.org/sqlite (pure Go, CGO-free). Litestream sidecar (Go binary, external process). |
| Complexity | Moderate | Dual implementation paths behind an identical interface. WAL-mode nuances (checkpointing, concurrent readers). Litestream restoration on cold start. |
| Operational cost | Very low | No separate database service; SQLite runs embedded in-process. Litestream consumes minimal CPU. GCS storage for WAL archives is pennies per month. |
| Risk | Low–Medium | SQLite is not a graph database — JOIN-based traversal for entity neighborhoods works, but it doesn't scale to millions of nodes with deep traversal paths. For the current scale (hundreds of thousands of nodes), it's fine. |
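
A minimal sketch of opening the embedded store with WAL enabled, assuming the modernc.org/sqlite driver (registered as "sqlite"); the DSN and pragma choices are illustrative, not the project's exact configuration.

```go
// openWAL opens the SQLite file and switches it to WAL mode so concurrent
// readers (the Graph-RAG query path) never block the single writer.
package graphdb

import (
	"database/sql"

	_ "modernc.org/sqlite" // pure-Go, CGO-free driver
)

func openWAL(path string) (*sql.DB, error) {
	db, err := sql.Open("sqlite", path)
	if err != nil {
		return nil, err
	}
	for _, pragma := range []string{
		"PRAGMA journal_mode=WAL;",
		"PRAGMA foreign_keys=ON;",
	} {
		if _, err := db.Exec(pragma); err != nil {
			db.Close()
			return nil, err
		}
	}
	return db, nil
}
```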

Benefits

| Metric | Expected | Validated |
|---|---|---|
| Survivability | Graph persists across scale-to-zero | SQLite WAL file lives in Cloud Run's persistent /tmp mount; Litestream restores from GCS on cold start. |
| Concurrent reads | Lock-free (WAL mode) | WAL allows unlimited concurrent readers alongside a single writer, critical for the Graph-RAG query path. |
| CGO-free | Cross-compiles trivially | modernc.org/sqlite is pure Go. No C toolchain in the Docker build. |
| Memory analysis | Identical allocations | The ADR-007/010 benchmark showed zero change in allocs/op and bytes/op across all 12 variants; MemoryGraphDB and SQLiteGraphDB share the same allocation pattern. |
| Performance | No latency regression | Benchmark data from the re-run (May 13, benchtime=30s, count=30) is being analyzed. The first run (flawed methodology) showed consistent but artifact-laden improvements across all variants. |

Verdict

ROI: 🟢 High. Despite substantial engineering cost, the benefit — durable graph persistence without operational overhead — is fundamental to the product. Without SQLite, every scale-to-zero event destroys user data. The experiment tracking is validating that performance remains acceptable.


ADR-008: Weighted Ensemble Entity Resolution

Status: ✅ Accepted — implemented on development

Costs

| Category | Estimate | Notes |
|---|---|---|
| Engineering | ~4–6 hours | Jaro-Winkler implementation, normalized Levenshtein, Jaccard fuzzy token overlap. Tri-band decision logic (merge/ambiguous/insert). Type blocking. OTel metrics for resolution_score. |
| Runtime cost | O(n) per insert | Each ResolveAndInsert call compares the candidate against all existing nodes of the same type. At 10K Person nodes, that's 10K string comparisons per insert. At 500K nodes, this becomes a bottleneck. |
| Complexity | Moderate | Three algorithms with weighted averaging. Tunable threshold. OTel metrics integration. Ambiguous-case logging. |
| Risk | Low–Medium | Weight tuning is empirical. The 0.85 merge threshold is a guess and may need adjustment as entity volume grows. False merges are mitigated by the Jaccard component; false splits (missed merges) are less visible but more dangerous (fragmented graph). |

Benefits

| Metric | Expected | Validated |
|---|---|---|
| Graph integrity | Eliminates duplicate nodes | Catches typos (Altman vs Altmann), abbreviations (Sam vs Samuel), structural variations (Microsoft Research vs Microsoft Corporation → ambiguous band), and shared-token false friends (Sam Altman vs Sam Bankman-Fried → low Jaccard). |
| False-merge protection | Jaccard prevents catastrophic merges | The 0.30 Jaccard weight is specifically designed to prevent "Apple Inc" merging with "Apple fruit" — only the token "Apple" matches, producing a score in the ambiguous or insert band (see the sketch below). |
| Observability | resolution_score histogram | Every comparison is recorded as an OTel metric, enabling distribution analysis and threshold tuning over time. |
| Configurability | Per-deployment tuning | The merge threshold is configurable: research deployments can lower it for aggressive merging; legal deployments can raise it for conservative matching. |
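
A minimal sketch of the weighted-ensemble tri-band decision described above. The 0.85 merge threshold and 0.30 Jaccard weight come from this ADR; the remaining weights, the ambiguous floor, and the similarity inputs are illustrative (the real implementation computes Jaro-Winkler, normalized Levenshtein, and Jaccard itself).

```go
package resolve

type Decision int

const (
	Insert    Decision = iota // clearly a new entity
	Ambiguous                 // log for review, do not auto-merge
	Merge                     // same entity, fold into the existing node
)

// decide combines the three similarity scores (each in [0,1]) into a weighted
// average and maps it onto the merge/ambiguous/insert bands.
func decide(jaroWinkler, normLevenshtein, jaccard float64) Decision {
	const (
		wJW  = 0.40 // illustrative
		wLev = 0.30 // illustrative
		wJac = 0.30 // per ADR-008: guards against shared-token false friends
	)
	score := wJW*jaroWinkler + wLev*normLevenshtein + wJac*jaccard
	switch {
	case score >= 0.85: // merge threshold from ADR-008
		return Merge
	case score >= 0.70: // illustrative ambiguous floor
		return Ambiguous
	default:
		return Insert
	}
}
```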

Verdict

ROI: 🟢 High. Entity resolution is the difference between a coherent knowledge graph and a fragmented mess. Without it, every variant of "Sam Altman" produces a separate node, breaking Graph-RAG traversal. The O(n) cost at scale is the main concern — monitor resolution span durations and consider indexing or sharding when Person nodes exceed 50K.


ADR-009: Formalized Experiment Tracking

Status: ✅ Accepted — experiment infrastructure exists, meta-experiment (EXP-001) pending

Costs

| Category | Estimate | Notes |
|---|---|---|
| Engineering | ~1–2 hours | Directory structure, template.md, README with agentic directives, INDEX.md. |
| Process overhead | Ongoing | Every performance-sensitive change now requires writing an experiment report before merge. This is intentional friction. |
| Maintenance | Low | Reports and trial data accumulate over time; INDEX.md needs updating. |
| Risk | Very low | The infrastructure is files and Markdown. Zero ongoing cost if abandoned. |

Benefits

| Metric | Expected | Validated |
|---|---|---|
| Decision quality | Higher | Hypotheses must be falsifiable. Results must be distribution-aware (p50/p95/p99, KS test, Cohen's d; see the sketch below). |
| Negative knowledge base | Prevents repeated dead ends | Documented failures (like the flawed first run of ADR-007/010) prevent future engineers from re-investigating. |
| Git bisect integration | Possible | Every experiment links to commit hashes. |
| Agentic compatibility | README encodes methodology | Autonomous coding agents have a committed methodology file to follow. |
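
A minimal sketch of the effect-size calculation the template calls for (Cohen's d with a pooled standard deviation). Purely illustrative; the reports could equally be produced with any stats tooling.

```go
package stats

import "math"

func mean(xs []float64) float64 {
	s := 0.0
	for _, x := range xs {
		s += x
	}
	return s / float64(len(xs))
}

func variance(xs []float64) float64 {
	m, s := mean(xs), 0.0
	for _, x := range xs {
		s += (x - m) * (x - m)
	}
	return s / float64(len(xs)-1) // sample variance
}

// cohensD returns (mean(a)-mean(b)) / pooledSD, the standardized difference
// between, e.g., control and treatment latency samples.
func cohensD(a, b []float64) float64 {
	na, nb := float64(len(a)), float64(len(b))
	pooled := ((na-1)*variance(a) + (nb-1)*variance(b)) / (na + nb - 2)
	return (mean(a) - mean(b)) / math.Sqrt(pooled)
}
```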

Verdict

ROI: 🟡 Medium. The process overhead is real, but the value of a negative knowledge base compounds over time. The meta-experiment (EXP-001, deadline 2026-05-25) will formally assess whether this process reduces performance regressions. Until then, the infrastructure is in place and the first two experiment reports exist.


ADR-010: Async Pipeline Production Readiness

Status: ✅ Accepted — implemented on development

Costs

| Category | Estimate | Notes |
|---|---|---|
| Engineering | ~4–6 hours | Per-chunk LLM timeout context (a one-line context.WithTimeout in extractFullText; see the sketch below). Structured-logging handler swap (slog-gcp on GCP, JSON stdout locally). Progress-callback wiring in the upload handler. Stale-message fix. Rate-limiter config externalization. |
| Dependency | Very low | github.com/jdockerty/slog-gcp for Cloud Logging. |
| Complexity | Low | Each change is localized and additive. The LLM interface is unchanged. The StateTracker interface is unchanged. |
| Risk | Very low | The per-chunk timeout is strictly additive — it can only break slow chunks, and that's the intended behavior. |
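
A minimal sketch of the per-chunk timeout described above. LLM_REQUEST_TIMEOUT and the 120s default come from this ADR; the generate call shape is illustrative, not the project's exact interface.

```go
package pipeline

import (
	"context"
	"os"
	"time"
)

func requestTimeout() time.Duration {
	if v := os.Getenv("LLM_REQUEST_TIMEOUT"); v != "" {
		if d, err := time.ParseDuration(v); err == nil {
			return d
		}
	}
	return 120 * time.Second // documented default: fail a stuck chunk after 120s
}

// extractChunk bounds a single LLM call so a stalled provider request releases
// its worker slot instead of holding it for the lifetime of the job.
func extractChunk(ctx context.Context, generate func(context.Context, string) (string, error), chunk string) (string, error) {
	ctx, cancel := context.WithTimeout(ctx, requestTimeout())
	defer cancel()
	return generate(ctx, chunk)
}
```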

Benefits

| Metric | Expected | Validated |
|---|---|---|
| Resilience | No more 30-minute stalls | A stalled Vertex AI chunk now fails after 120s instead of blocking a worker slot for 30 minutes. |
| Debuggability | Queryable job logs | gcloud logging read 'jsonPayload.job_id="..."' surfaces the LLM provider, chunk progress, per-chunk errors, and completion metrics. |
| UX parity | File uploads show progress | The upload handler now calls the same progress tracking as the ingest handler; users see chunk counts updating. |
| Config freedom | 3 new env vars | LLM_REQUEST_TIMEOUT, LLM_MAX_CONCURRENCY, and LLM_TPM_LIMIT replace hardcoded values. |

Verdict

ROI: 🟢 High. Low engineering cost fixing three real production gaps that caused a 31-minute outage. The timeout alone prevents a repeat of that incident. The logging fix makes the next incident diagnosable in minutes instead of hours.


ADR-011: Graph Download & Export

Status: ✅ Accepted — not yet implemented

Costs

| Category | Estimate | Notes |
|---|---|---|
| Engineering (Phase 1) | ~1–2 hours | A single handler calling DB.Snapshot(), JSON streaming via json.NewEncoder, a Content-Disposition header, auth middleware, route registration (see the sketch below). |
| Engineering (Phase 2) | ~4–8 hours | GraphML, GEXF, CSV, and NDJSON serializers. GraphSerializer interface. Accept-header routing. |
| Engineering (Phase 3) | ~2–4 hours | SQLite file serving. Tenancy validation. File-locking considerations. Deferred. |
| Complexity | Very low (Phase 1) | One handler, one existing interface method (Snapshot), no new dependencies. |
| Runtime cost | None | The graph is already in memory (MemoryGraphDB) or on disk (SQLiteGraphDB); export just reads it. |
| Risk | Very low | No data mutation. No new dependencies. Existing auth middleware protects the endpoint. |
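
A minimal sketch of the Phase 1 handler described above. DB.Snapshot(), json.NewEncoder, and the Content-Disposition header come from this ADR; the Snapshot return type, filename, and route wiring are illustrative.

```go
package export

import (
	"encoding/json"
	"net/http"
)

// GraphSnapshotter is the slice of the GraphDB interface the handler needs.
type GraphSnapshotter interface {
	Snapshot() (any, error) // illustrative signature
}

// downloadHandler streams the current graph as a JSON attachment.
func downloadHandler(db GraphSnapshotter) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		snap, err := db.Snapshot()
		if err != nil {
			http.Error(w, "snapshot failed", http.StatusInternalServerError)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		w.Header().Set("Content-Disposition", `attachment; filename="podpedia-graph.json"`)
		if err := json.NewEncoder(w).Encode(snap); err != nil {
			// Headers are already sent; nothing more to do but log upstream.
			return
		}
	}
}
```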

Benefits

| Metric | Expected | Validated |
|---|---|---|
| Data portability | Users own their graph | Self-service export without GCP IAM permissions or technical support. |
| External tooling | Enables analytic workflows | Gephi, Cytoscape, NetworkX, and pandas for analysis beyond the vis.js visualization. |
| Backup | User-controlled snapshots | Independent of Litestream/GCS replication. |
| Benchmark reproducibility | Stable export format | ADR-009 calls for commit-linked data; graph exports provide ground-truth snapshots for reproducible benchmarks. |

Verdict

ROI: 🟡 Medium. Phase 1 is trivially cheap (~1-2 hours) and delivers immediate value — users can export their graph. Phase 2 and 3 should be deferred until user demand materializes. Recommendation: Implement Phase 1 now; defer Phase 2/3.


Cross-ADR Dependencies & Synergies

ADR-001 (10K threshold)
  └─ feeds into ADR-002 (more chunks = more cache value)
  └─ feeds into ADR-003 (more chunks = more streaming value)
  └─ feeds into ADR-004 (more chunks = more flash-lite savings)

ADR-002 (Context caching)
  └─ depends on ADR-005 (LLM adapter) — caching is Vertex-specific
  
ADR-007 (Dual GraphDB)
  └─ feeds into ADR-011 (export) — Snapshot() method on both implementations

ADR-009 (Experiment tracking)
  └─ validates ADR-001, ADR-007, ADR-010 via experiments

ADR-010 (Pipeline readiness)
  └─ depends on ADR-005 (LLM adapter) — timeout applies to LLM interface
  └─ depends on ADR-007 (GraphDB) — progress parity assumes graph is receiving data

Prioritization Recommendation

Do Now (High ROI, Low Cost)

| Priority | ADR | Rationale |
|---|---|---|
| P1 | ADR-001 | Zero cost, confirmed direction, unblocks production deployment. |
| P2 | ADR-011 (Phase 1) | 1–2 hours for a user-facing feature. |
| P3 | ADR-006 | Already done, but worth highlighting the testing benefit. |

Validate Before Scaling (High ROI, Needs Evidence)

| Priority | ADR | Rationale |
|---|---|---|
| P1 | ADR-007 | Re-run experiment data needs analysis; this blocks confirmation of the most expensive implementation. |
| P2 | ADR-002 | High data-curation cost — ensure the runtime benefits materialize before promoting to development. |
| P3 | ADR-004 | Quality benchmarks needed — don't accept until flash-lite fidelity is quantified. |

Defer (Medium ROI or High Risk)

| Priority | ADR | Rationale |
|---|---|---|
| Defer | ADR-003 | High risk, high engineering cost. Only implement if latency remains unacceptable after ADR-001 + ADR-002. |
| Defer | ADR-004 | High quality risk. Benchmark required. |
| Defer | ADR-011 (Phase 2/3) | Wait for user demand. |