Feb 1, 2024


Marco Pranjes

Founder - Software Engineer


QuickIDE: Toward Safe, Codebase-Scale Program Editing with LLMs

Abstract

Large Language Models (LLMs) are transforming developer tooling, yet reliably applying codebase-scale edits remains unsolved in practice. The core challenge is not text generation; it’s context orchestration, semantic change planning, safety, and verification across millions of tokens of source, build rules, generated artifacts, and external dependencies. We present the QuickIDE architecture for end-to-end, verifiably safe program editing at repository scale. The system integrates (1) a hybrid vector–graph semantic index; (2) a token-budgeted, submodular context planner; (3) an edit planner that treats refactors as a constrained optimization over the repository graph; (4) an execution layer with hermetic sandboxes, semantic three-way merges, and provenance; and (5) a multi-stage verifier combining static checks, tests, and symbolic differentials. We detail algorithms, interfaces, and guarantees, along with a practical evaluation protocol that correlates offline metrics with developer trust.

1. Introduction

LLM coding assistants excel at local suggestions but fail in three recurring enterprise scenarios:

  1. global refactors (e.g., API shifts across hundreds of modules),

  2. multi-language edits (build and infra wiring), and

  3. correctness under complex constraints (types, tests, contracts, security policy).

QuickIDE addresses this by reframing “AI coding” as Plan → Context → Edit → Verify → Commit, with each step engineered to be measurable and rollback-safe. LLMs are used where they are strongest (semantic synthesis, transformation), while critical parts—dependency discovery, constraint enforcement, and verification—are handled through deterministic systems.

2. Problem Statement

Given a repository $\mathcal{R}$ (files, symbols, tests, build targets) and a change intent $\mathcal{I}$ (natural-language spec, examples, failing tests, or diffs), produce a minimal, verifiably correct patch set $\Delta$ such that post-conditions $\Phi$ hold (compiles, tests pass, contracts satisfied), while respecting constraints $\Gamma$ (style, security, performance budgets, ownership).

Formally, define a repository graph $G = (V, E)$ where nodes $V$ are artifacts (files, symbols, targets) and edges $E$ capture relations (imports, calls, ownership, test coverage). We seek:

$$\underset{\Delta}{\operatorname{argmin}} \; \operatorname{cost}(\Delta) \quad \text{s.t.} \quad \Phi(\mathcal{R} \oplus \Delta) = \text{true}, \;\; \Gamma(\mathcal{R} \oplus \Delta) = \text{true}$$

where $\oplus$ applies edits with semantic merge.

3. Repository Understanding via a Hybrid Vector–Graph Index

Pure vector search over chunks misses dependency structure; pure graphs miss fuzzy semantics. QuickIDE builds a hybrid index:

Graph layer.

  • Parse with Tree-sitter/LSP to materialize ASTs and symbol tables.

  • Build a call/import graph; annotate edges with compile units and test coverage.

  • Attach ownership and blast radius scores per node (e.g., critical paths, SLO-sensitive services).

Vector layer.

  • Chunk code/doc/tests with structure-aware windows (function/class scopes, build rules).

  • Embed: code-aware dual encoders (code/text), docstrings, commit messages.

  • Store embeddings with temporal decay and lineage (commit SHA, author).

Cross-links.

  • Every node has keys into both spaces. A retrieval yields a bundle:
    $\text{Bundle} = (\text{AST span},\ \text{embedding neighbors},\ \text{graph neighborhood})$.

This design lets us blend precise dependency reachability with semantic similarity and recency.
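
To illustrate, here is a minimal sketch of a cross-linked retrieval, using toy in-memory stand-ins; GraphStore, VectorStore, and the Bundle fields are illustrative names for this sketch, not a specific library's API:

```python
from dataclasses import dataclass, field

# Toy stand-ins for the graph and vector layers of the hybrid index.
@dataclass
class GraphStore:
    edges: dict[str, set[str]] = field(default_factory=dict)  # symbol -> dependent/dependency symbols

    def neighborhood(self, node: str, hops: int = 1) -> set[str]:
        frontier, seen = {node}, {node}
        for _ in range(hops):
            frontier = {n for f in frontier for n in self.edges.get(f, set())} - seen
            seen |= frontier
        return seen - {node}

@dataclass
class VectorStore:
    vectors: dict[str, list[float]] = field(default_factory=dict)  # span id -> embedding

    def neighbors(self, query_vec: list[float], k: int = 5) -> list[str]:
        def cos(a: list[float], b: list[float]) -> float:
            dot = sum(x * y for x, y in zip(a, b))
            na = sum(x * x for x in a) ** 0.5
            nb = sum(x * x for x in b) ** 0.5
            return dot / (na * nb) if na and nb else 0.0
        return sorted(self.vectors, key=lambda sid: -cos(query_vec, self.vectors[sid]))[:k]

@dataclass
class Bundle:
    ast_span: str                   # anchor span (e.g., a function or class scope)
    embedding_neighbors: list[str]  # semantically similar spans
    graph_neighborhood: set[str]    # dependency-reachable symbols

def retrieve(symbol: str, query_vec: list[float], g: GraphStore, v: VectorStore) -> Bundle:
    # Cross-link: the same symbol keys into both spaces, so one query yields
    # precise reachability plus fuzzy semantic neighbors.
    return Bundle(
        ast_span=symbol,
        embedding_neighbors=v.neighbors(query_vec),
        graph_neighborhood=g.neighborhood(symbol, hops=2),
    )
```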

4. Token-Budgeted Context Planning as Submodular Optimization

The LLM context window is finite. Naïve “top-k chunks” either omit critical dependencies or flood with near-duplicates. We treat context selection as a submodular coverage problem with redundancy penalties:

Let $\mathcal{C}$ be candidate spans, each with:

  • relevance $r(c \mid \mathcal{I})$ (intent match),

  • centrality $\kappa(c)$ (graph centrality / blast radius),

  • novelty penalty via similarity $\rho(c, S)$ to the already selected set $S$,

  • size $\tau(c)$ (tokens).

Objective under token budget $B$:

$$\max_{S \subseteq \mathcal{C},\; \sum_{c \in S} \tau(c) \le B} \left[ \sum_{c \in S} \bigl( \alpha\, r(c) + \beta\, \kappa(c) \bigr) - \lambda \sum_{c_i, c_j \in S} \rho(c_i, c_j) \right]$$

We use a greedy algorithm with lazy evaluations (near-optimal for submodular functions) and a multi-tier context:

  • Kernel: minimal specs, interfaces, types required for correctness.

  • Evidence: exemplars and tests triggering the change.

  • Neighborhood: immediate dependencies and callers.

  • Background: style guides, policy snippets (compressed).
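
As a sketch of how this selection could run, the following lazy-greedy loop picks spans under a token budget; the weights, the Span fields, and the similarity callback are placeholders for illustration, not QuickIDE's implementation:

```python
import heapq
from dataclasses import dataclass

@dataclass
class Span:
    id: str
    relevance: float   # r(c | I)
    centrality: float  # kappa(c)
    tokens: int        # tau(c)

def marginal_gain(c: Span, selected: list[Span], sim, alpha=1.0, beta=0.5, lam=0.8) -> float:
    # Utility of adding c, minus redundancy against already selected spans.
    redundancy = sum(sim(c, s) for s in selected)
    return alpha * c.relevance + beta * c.centrality - lam * redundancy

def select_context(candidates: list[Span], budget: int, sim) -> list[Span]:
    """Lazy-greedy selection: cached gains are only re-evaluated when they surface."""
    selected: list[Span] = []
    used = 0
    # Max-heap keyed on negated cached gain; ties broken by index.
    heap = [(-marginal_gain(c, [], sim), i, c) for i, c in enumerate(candidates)]
    heapq.heapify(heap)
    stale: set[int] = set()
    while heap:
        _, i, c = heapq.heappop(heap)
        if used + c.tokens > budget:
            continue  # can never fit; budget only shrinks
        if i in stale:
            # Cached gain predates the last selection: recompute and push back.
            heapq.heappush(heap, (-marginal_gain(c, selected, sim), i, c))
            stale.discard(i)
            continue
        selected.append(c)
        used += c.tokens
        stale = {j for _, j, _ in heap}  # every remaining gain is now potentially stale
    return selected
```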

5. Edit Planning as Constrained Optimization on the Repository Graph

Global refactors are specified as patterns over the graph (e.g., “rename method X → Y where X: (T1,…,Tn) and update all callers, build rules, and docs; preserve binary compatibility in public modules”). QuickIDE compiles such specs into edit plans:

  1. Discovery.

    • Solve for all matches in $G$ that satisfy structural patterns and constraints.

    • Rank sites by risk (test coverage gaps, ownership, blast radius).

  2. Template synthesis.

    • For each site, generate a candidate transformation (LLM guided) bound to AST nodes, not raw text.

  3. Global constraint solving.

    • Merge local candidates; enforce cross-site invariants (types unify, visibilities respected, API stability).

    • If conflicts arise, compute explanatory counterexamples (type errors, missing imports) and feed them back to the planner.

Mathematically, we solve:

$$\min_{\Delta = \{\delta_v\}} \sum_{v \in V'} \operatorname{risk}(v, \delta_v) \quad \text{s.t.} \quad \forall\, \gamma \in \Gamma:\ \gamma(\Delta) = \text{true}$$

where $V' \subseteq V$ is the set of edit sites discovered in step 1 and $\Gamma$ is the set of global constraints.
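
A simplified sketch of the ranking-plus-constraint-check idea follows; the risk model, the EditSite fields, and the constraint interface are assumptions for illustration, not QuickIDE's planner:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EditSite:
    node: str            # symbol or file in the repository graph
    candidate: str       # proposed transformation (AST-bound in practice; text here)
    coverage: float      # fraction of the site covered by tests, in [0, 1]
    blast_radius: int    # number of downstream dependents

def risk(site: EditSite) -> float:
    # Illustrative risk model: uncovered code and a wide blast radius raise risk.
    return (1.0 - site.coverage) * (1 + site.blast_radius)

def plan_edits(sites: list[EditSite],
               constraints: list[Callable[[list[EditSite]], str | None]]) -> list[EditSite]:
    """Order sites by ascending risk; reject the plan if any global constraint fails.

    Each constraint returns None on success or an explanatory counterexample string,
    which would be fed back to the planner for revision.
    """
    plan = sorted(sites, key=risk)
    for check in constraints:
        counterexample = check(plan)
        if counterexample is not None:
            raise ValueError(f"plan rejected: {counterexample}")
    return plan

# Example constraint: every known caller of the renamed API must also be edited.
def all_callers_updated(required: set[str]) -> Callable[[list[EditSite]], str | None]:
    def check(plan: list[EditSite]) -> str | None:
        missing = required - {s.node for s in plan}
        return f"callers not updated: {sorted(missing)}" if missing else None
    return check
```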

6. Semantics-Aware Patch Application & Conflict Resolution

Textual three-way merges are brittle. We apply AST-level diffs and semantic merges:

  • Parse original $O$, proposed $P$, and upstream $U$; compute GumTree/Myers hybrid diffs on the trees.

  • Conflicts are elevated to semantic conflicts (e.g., two edits change the same method signature differently).

  • QuickIDE proposes repair actions (add overload vs. modify call sites; introduce adapter; deprecate shim).
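
For intuition, here is a minimal sketch of surfacing a semantic conflict on function signatures, using Python's built-in ast module rather than a full GumTree-style tree differ; the normalization is deliberately simplistic:

```python
import ast

def signatures(source: str) -> dict[str, str]:
    """Map each function name to a normalized signature string."""
    sigs: dict[str, str] = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            args = [a.arg for a in node.args.args]
            sigs[node.name] = f"{node.name}({', '.join(args)})"
    return sigs

def semantic_conflicts(original: str, proposed: str, upstream: str) -> list[str]:
    """Report functions whose signature both sides changed, but differently."""
    o, p, u = signatures(original), signatures(proposed), signatures(upstream)
    conflicts = []
    for name, base_sig in o.items():
        p_sig, u_sig = p.get(name), u.get(name)
        if p_sig and u_sig and p_sig != base_sig and u_sig != base_sig and p_sig != u_sig:
            conflicts.append(f"{name}: proposed {p_sig!r} vs upstream {u_sig!r}")
    return conflicts

# Example: both sides modified validate() but disagree on the new parameter list.
base = "def validate(token): ..."
ours = "def validate(token, ttl): ..."
theirs = "def validate(token, max_age): ..."
print(semantic_conflicts(base, ours, theirs))
# -> ["validate: proposed 'validate(token, ttl)' vs upstream 'validate(token, max_age)'"]
```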

Ownership policies are enforced: edits to sensitive modules require “gates” (codeowners, security checks). Patches are signed with provenance (intent hash, index version, model build).

7. Hermetic Execution & Provenance

Every plan runs in a hermetic dev container:

  • Exact toolchains, dependency locks, and environment vars are fixed.

  • I/O is traced (eBPF) for reproducibility logs.

  • Any network egress is policy-guarded (e.g., artifact mirrors only).

We record a Provenance Ledger:

$$\text{Ledger} = \{\text{intent},\ \text{retrieval bundles},\ \text{model+params},\ \text{edits},\ \text{checks},\ \text{test matrix}\}$$

This underpins auditability and “explain my change” UX.
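
A minimal sketch of a hash-chained ledger entry is shown below; the field names follow the formula above, while the chaining scheme and serialization are assumptions for illustration:

```python
import hashlib
import json
from dataclasses import dataclass, asdict

@dataclass
class LedgerEntry:
    intent: str                   # natural-language change intent
    retrieval_bundles: list[str]  # ids of the context bundles used
    model: str                    # model id plus sampling params
    edits: list[str]              # patch ids (AST-anchored in practice)
    checks: list[str]             # verification stages that ran
    test_matrix: list[str]        # build/test targets executed
    parent: str | None = None     # digest of the previous entry, forming a chain

    def digest(self) -> str:
        # Deterministic serialization so the same run always hashes identically.
        payload = json.dumps(asdict(self), sort_keys=True).encode()
        return hashlib.sha256(payload).hexdigest()

# Example: chain two entries so tampering with either invalidates the tail.
first = LedgerEntry("bump token TTL", ["bundle-1"], "model-x@t0.2", ["patch-1"],
                    ["static", "tests"], ["//auth:unit"])
second = LedgerEntry("rotate refresh logic", ["bundle-2"], "model-x@t0.2", ["patch-2"],
                     ["static", "tests", "symbolic"], ["//auth:unit", "//api:e2e"],
                     parent=first.digest())
assert second.parent == first.digest()
```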

8. Multi-Stage Verification: Static → Dynamic → Symbolic Differentials

Verification is staged to catch the cheapest errors first:

  1. Static: parsing, formatting, lint, typecheck, build graph sanity (no orphaned targets).

  2. Targeted tests: only impacted tests via coverage maps; then expansion to suite-level if risky.

  3. Symbolic differentials for critical code paths:

    • Instead of re-proving whole programs, generate path conditions affected by edits and check safety predicates $\Psi$ (e.g., null-safety, side-effect bounds).

    • For API shifts, generate contracts and check client code obligations automatically.

A patch passes only if all stages succeed; otherwise explanatory artifacts are produced and the plan is revised.
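
A compact sketch of the cheapest-first staging with short-circuiting; the stage implementations here are stand-ins, whereas real stages would invoke the build and test toolchain:

```python
from typing import Callable, NamedTuple

class StageResult(NamedTuple):
    stage: str
    ok: bool
    explanation: str = ""

def verify(patch: str, stages: list[tuple[str, Callable[[str], StageResult]]]) -> list[StageResult]:
    """Run stages cheapest-first and stop at the first failure.

    The failing stage's explanation is the artifact fed back to the edit planner.
    """
    results = []
    for _, run in stages:
        result = run(patch)
        results.append(result)
        if not result.ok:
            break  # no point paying for tests if the build is already broken
    return results

# Example wiring with stand-in checks.
pipeline = [
    ("static",   lambda p: StageResult("static", "TODO" not in p, "lint: unresolved TODO")),
    ("tests",    lambda p: StageResult("tests", True)),
    ("symbolic", lambda p: StageResult("symbolic", True)),
]
for r in verify("def f():\n    return 1\n", pipeline):
    print(r.stage, "ok" if r.ok else f"failed ({r.explanation})")
```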

9. Context Lifecycles: Memory Without Drift

LLMs accumulate “soft state” (conversations, scratch). QuickIDE enforces context lifecycles:

  • Ephemeral Workset: the token-budgeted bundle bound to an intent run; discarded after execution or checkpointed with the ledger.

  • Stable Knowledge: style, policies, architecture docs—versioned and diffed; changes invalidate caches.

  • Learned Mappings: embeddings and graph features re-indexed incrementally on commits; drift is monitored via retrieval QA (see §11).

This separation prevents stale context from silently steering future edits.
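
One way to enforce this separation is to key ephemeral worksets on a version hash of the stable knowledge, so any policy or style change invalidates cached context. A minimal sketch, with illustrative class names:

```python
import hashlib

class StableKnowledge:
    """Versioned background docs; any change invalidates derived caches."""
    def __init__(self, docs: dict[str, str]):
        self.docs = docs

    def version(self) -> str:
        blob = "\n".join(f"{k}\n{v}" for k, v in sorted(self.docs.items()))
        return hashlib.sha256(blob.encode()).hexdigest()[:12]

class WorksetCache:
    """Ephemeral worksets keyed by (intent, knowledge version); stale entries drop out."""
    def __init__(self):
        self._entries: dict[tuple[str, str], list[str]] = {}

    def get(self, intent: str, knowledge: StableKnowledge) -> list[str] | None:
        return self._entries.get((intent, knowledge.version()))

    def put(self, intent: str, knowledge: StableKnowledge, bundle: list[str]) -> None:
        self._entries[(intent, knowledge.version())] = bundle

# Example: editing the style guide changes the knowledge version, so the old
# workset is no longer returned and must be re-planned.
kb = StableKnowledge({"style.md": "tabs-bad"})
cache = WorksetCache()
cache.put("rename validate", kb, ["AuthToken.validate", "callers/api"])
kb.docs["style.md"] = "tabs-bad; max-line 100"
assert cache.get("rename validate", kb) is None
```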

10. Tool Protocols & Orchestration

QuickIDE treats tools as typed, declarative functions with capabilities and safety classes. Examples:

  • analyze_graph(query) -> graph_view

  • select_context(intent, budget) -> bundle

  • apply_ast_edit(file, patch) -> diff

  • run_targets(targets, limits) -> results

A lightweight planner composes these tools with LLM calls. Crucially, LLMs do not execute shell commands; they request tool invocations, which the orchestrator mediates with guardrails and rate limits. This prevents prompt-level jailbreaks from escaping the sandbox.
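
A minimal sketch of tools as typed, capability-scoped functions behind an orchestrator follows; the Safety classes, rate-limit scheme, and tool bodies are illustrative assumptions:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Callable

class Safety(Enum):
    READ_ONLY = "read_only"   # e.g., analyze_graph, select_context
    MUTATING = "mutating"     # e.g., apply_ast_edit
    EXECUTING = "executing"   # e.g., run_targets

@dataclass
class Tool:
    name: str
    safety: Safety
    fn: Callable[..., Any]

class Orchestrator:
    """Mediates LLM-requested tool calls; the model never runs shell commands directly."""
    def __init__(self, tools: list[Tool], allowed: set[Safety], max_calls: int = 20):
        self.tools = {t.name: t for t in tools}
        self.allowed = allowed
        self.budget = max_calls

    def invoke(self, name: str, **kwargs: Any) -> Any:
        tool = self.tools.get(name)
        if tool is None:
            raise PermissionError(f"unknown tool: {name}")
        if tool.safety not in self.allowed:
            raise PermissionError(f"{name} requires {tool.safety.value} capability")
        if self.budget <= 0:
            raise RuntimeError("tool-call rate limit exceeded")
        self.budget -= 1
        return tool.fn(**kwargs)

# Example: a read-only session cannot apply edits, regardless of what the prompt asks.
tools = [
    Tool("analyze_graph", Safety.READ_ONLY, lambda query: f"graph_view({query})"),
    Tool("apply_ast_edit", Safety.MUTATING, lambda file, patch: f"diff({file})"),
]
session = Orchestrator(tools, allowed={Safety.READ_ONLY})
print(session.invoke("analyze_graph", query="callers of validate"))
# session.invoke("apply_ast_edit", file="auth.py", patch="...")  # would raise PermissionError
```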

11. Evaluation Protocol (Dev-Trust Aligned)

Offline metrics like BLEU on code or naive pass-at-k correlate poorly with developer trust for repo-scale edits. We propose EditBench-R, a suite with:

  • Scenarios: API migration, cross-language rename, build rule changes, security-motivated parameterization, incident hotfix with regression guard.

  • Artifacts: full monorepo snapshots with coverage maps and flaky-test annotations.

  • Scoring:

    • Correctness: compiles, targeted tests, contracts.

    • Blast Radius: number of affected files vs. theoretical minimum.

    • Risk-Adjusted Latency: time × (risk class weights).

    • Rollbackability: clean revert possible (semantic).

    • Provenance Completeness: ledger integrity, determinism score.

We track retrieval quality (precision/recall over oracle dependencies), merge conflict rate, and counterexample resolution rate across iterations. These metrics are better predictors of the question that matters: will engineers accept the PR?
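
For concreteness, two of the scores could be computed along these lines; the exact weights and normalizations are assumptions for illustration, not the EditBench-R definition:

```python
def blast_radius_score(affected_files: int, theoretical_minimum: int) -> float:
    """1.0 when the patch touches only the necessary files; decays as it over-reaches."""
    if affected_files <= 0 or theoretical_minimum <= 0:
        return 0.0
    return min(1.0, theoretical_minimum / affected_files)

def risk_adjusted_latency(seconds: float, risk_class: str,
                          weights: dict[str, float] | None = None) -> float:
    """Wall-clock time weighted by the risk class of the touched code paths."""
    weights = weights or {"low": 1.0, "medium": 1.5, "high": 3.0}
    return seconds * weights.get(risk_class, 1.0)

# Example: a migration touching 12 files where 10 were strictly necessary,
# completed in 90 s on a high-risk service.
print(blast_radius_score(affected_files=12, theoretical_minimum=10))  # ~0.833
print(risk_adjusted_latency(90.0, "high"))                            # 270.0
```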

12. Security & Policy Integration

Edits are policy-checked:

  • Data exfiltration: redaction filters and taint rules on context; secrets never enter prompts.

  • Supply chain: build uses pinned digests; generated code must respect approved licenses and dependency allowlists.

  • Access control: ownership graph gates edits; reviewers mapped from codeowners.

  • Model governance: model IDs, prompts, and sampling params are logged; diff between prompt versions is reviewable.

13. Practical Walkthrough

Intent.
“Upgrade AuthToken.validate() to support short-lived tokens (≤ 60s) and rotate refresh logic; update all callers, configs, and docs.”

Plan.

  • Graph query finds AuthToken.validate symbol, callers across api/ and cli/, and tests touching auth flows.

  • Context planner builds a bundle: interface, representative callers, failing tests, policy snippet on token TTL.

  • LLM synthesizes a new signature and adapter shim for legacy clients.

  • Edit planner propagates parameter changes to call sites, adds config keys with defaults, updates OpenAPI and CLI help.

  • Semantic merge applies changes; ownership check flags security/ team as reviewers.

Verify.

  • Static checks: types, lint, OpenAPI schema re-gen.

  • Targeted tests: auth suite + impacted endpoints.

  • Symbolic diff: ensure the new validate() path never accepts expired tokens; prove monotonicity of issued_at + ttl (a sketch of the expiry check follows this list).

  • Provenance ledger captures every step.
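
To make the symbolic step concrete, here is a minimal sketch of the expiry check encoded for the z3 SMT solver (pip install z3-solver); the variable names and the acceptance condition are illustrative assumptions, not QuickIDE's actual encoding:

```python
from z3 import Ints, Solver, And, unsat

issued_at, ttl, now = Ints("issued_at ttl now")

# New acceptance path: a token is accepted only while now < issued_at + ttl
# and the short-lived bound ttl <= 60 holds.
accepted = And(ttl > 0, ttl <= 60, now >= issued_at, now < issued_at + ttl)

# Safety predicate Psi: an expired token (now >= issued_at + ttl) is never accepted.
expired = now >= issued_at + ttl

s = Solver()
s.add(accepted, expired)   # search for a counterexample: accepted AND expired
assert s.check() == unsat  # no counterexample exists, so Psi holds on this path
```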

Outcome.
A PR with minimal diff, reviewers pre-assigned, change risk quantified, and a one-click rollback plan.

14. Implementation Notes (What Makes QuickIDE Different)

  • AST-anchored edits avoid “spooky action at a distance.”

  • Submodular context selection maximizes utility per token rather than “RAG-spam.”

  • Edit plans are first-class artifacts you can diff, review, and re-run.

  • Symbolic differentials give targeted assurance without full program verification cost.

  • Hermetic, provenance-first execution makes results reproducible and auditable—critical for regulated industries.

15. Limitations & Future Work

  • Long-tail languages: Tree-sitter support helps, but type systems vary; we fall back to textual diffs with extra verification.

  • Dynamic frameworks: reflection and runtime patching hinder static reachability; we combine runtime probes and tracing to augment the graph.

  • Performance: very large monorepos require sharded indices and incremental planning; we’re investing in cache-aware retrieval and streaming verification.

  • Human factors: even perfect patches need trust. We’re building explainer UIs that show why each file is in the diff, the tests it affects, and the invariants preserved.

16. Conclusion

QuickIDE reframes AI coding from ad-hoc prompt engineering to a controlled systems pipeline that plans, retrieves, edits, and verifies at repository scale. By unifying a hybrid semantic index, submodular context planning, constraint-aware edit planning, hermetic execution, and multi-stage verification, QuickIDE provides a path to reliable AI-assisted development that organizations can trust in production.

- Marco (spelling checked by AI)
