AGORA Team Review

PR #20: LocalLLM / vLLM Provider Support

A four-voice technical review of Joseph Breda's local-inference contribution to Will Chen's mike repository. Architect · Practitioner · Audit · OSS-Strategy. Plain English. Specific evidence. No marketing.

Four voices · 9 May 2026 · V>> + Craig Miller · DONNA team · ~12 min read

Nullius in verba. — Take no one's word for it. Including ours. Four voices reviewing the same diff against the same falsifiers, then synthesising. Plain English. Specific evidence. Each load-bearing claim cites the file and line it rests on.

Hi @jpbreda, @willchen96, and the Mike OSS contributors —

This review is a co-authored technical pass by the GRIP / CodeTonight team. Joseph's PR landed on our radar through Craig Miller's conversation with Joseph, and we wanted to share a structured review now that we have read the diff end-to-end.

What follows is a four-voice review (the AGORA pattern we use internally), each grounded in specific files in the diff. Each voice applies the same critical-thinking discipline: name the evidence, scope the claim, state what would falsify it. Plain-language and ELI5 layers follow each voice's verdict so non-technical readers — lawyers, partners, compliance — can follow the substance without the jargon. Synthesis at the end. A brief note about a complementary piece we are shipping this week sits at the bottom.

A permanent version of this review is available at donnaoss.com/agora/ for anyone who prefers to read, share, or cite it outside this thread.


Verdict in one line:

Local LLM support is the right architectural call for serious legal-AI deployment. The fork gets the load-bearing primitives right. Three open questions and one architectural gap, all addressable in follow-up commits. Recommend resolving merge conflicts and merging once those land.

We have submitted a follow-up PR to Joseph's own fork that resolves the merge conflicts against current Mike main. Joseph remains the original author; the rebase commit carries Co-authored-by: trailers per GitHub's multiple-author convention.

Voice 01 The Architect

The vLLM choice is technically mature. An OpenAI-compatible endpoint keeps the adapter layer minimal — the getClient() override in backend/src/lib/llm/openai.ts is clean and does exactly what it needs to. vLLM's continuous batching gives better GPU utilisation than naive per-request invocations, which matters once you have eight attorneys hitting the same server during morning prep.

Server-side env for VLLM_BASE_URL is the right primitive. The cloud-API-key dance is a liability in legal: keys rotate, rate limits spike during depositions, and you are dependent on third-party uptime during a filing deadline. Shared inference server, no per-user auth complexity, no key in every browser session — that is the correct shape.

The dispatch table in backend/src/lib/llm/models.ts is clean. providerForModel() is a single lookup function with three branches and a hard throw on unknown — that is the correct shape for a routing primitive that will grow over time.
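The routing shape described above can be sketched in a few lines. This is an illustrative reconstruction, not the fork's exact code: the model names in the table and the `MODEL_PROVIDERS` constant are assumptions standing in for whatever the real `models.ts` maps.

```typescript
// Illustrative sketch of a single-lookup dispatch table with a hard throw
// on unknown models. Names mirror the review's description of models.ts;
// the concrete model strings here are placeholders.
type Provider = "claude" | "gemini" | "localllm";

const MODEL_PROVIDERS: Record<string, Provider> = {
  "claude-sonnet": "claude",
  "gemini-pro": "gemini",
  "localllm-main": "localllm",
};

function providerForModel(model: string): Provider {
  const provider = MODEL_PROVIDERS[model];
  if (!provider) {
    // A hard throw keeps routing failures loud instead of silently
    // defaulting to the wrong backend.
    throw new Error(`Unknown model: ${model}`);
  }
  return provider;
}
```

The value of this shape is that growth is additive: a new rung on the ladder is one new table entry, not a new conditional threaded through call sites.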

Voice 1 verdict: the fork is a load-bearing extraction, not a feature add. The architecture is ready for two more rungs (cloud fallback, edge-quantised) without further refactoring.

Plain language

vLLM is a server you run inside the firm's own walls instead of phoning a remote AI service every time a lawyer asks a question. Joseph wired Mike to talk to it the same way Mike already talks to ChatGPT or Claude — same protocol, different address. The piece of code that picks which AI to call (providerForModel() in models.ts) is small and clean, which means adding more AI options later will not require rewriting anything. The architecture is ready for the next two upgrades — backup AI in the cloud, and tiny AI on a laptop — without further surgery.

ELI5

Imagine the firm has its own AI living in a closet downstairs instead of borrowing one from a company in California. Joseph taught Mike how to talk to the AI in the closet. The closet AI is faster, more private, and doesn't get bored when too many lawyers ask it questions at once.

Voice 02 The Practitioner

For lawyers actually working under data residency requirements, this PR closes a real, recurring problem. White-collar criminal defence under active DoJ investigation. Russian clients in Zurich post-sanctions. M&A diligence with confidentiality undertakings that explicitly preclude third-party cloud processing. In every one of those scenarios, the local-inference path is not a nice-to-have — it is the only legal way to use AI assistance at all.

The cloud-LLM rate-limit problem is its own quiet failure mode. Most lawyers learn it the hard way the first time their LLM stops responding mid-deposition prep. A self-hosted vLLM endpoint takes the rate-limit failure mode off the critical path entirely.

Voice 2 verdict: this PR is the answer to data sovereignty in legal AI. It deserves to land. Not adding it leaves Mike unusable for an entire class of regulated practice.

Plain language

There are real cases where the firm cannot legally send client data to an outside AI service — DoJ investigations, sanctioned-jurisdiction clients, M&A diligence under strict confidentiality contracts. Without local-AI support, Mike is unusable in those engagements regardless of how good the AI is. Joseph's PR closes that. Separately, public AI services rate-limit you when you need them most (e.g. mid-deposition prep). Local AI in the firm's own server has neither problem.

ELI5

Some lawyers' clients say "you cannot show our secrets to anyone outside the firm — including AI." Without Joseph's change, Mike could not help those lawyers at all. Now it can.

Voice 03 The Audit & Compliance Voice

Three open questions — each grounded in a specific file in the diff. None block merge; all matter for legal-grade deployment.

1. Model version in the audit chain

VLLM_MAIN_MODEL=BredaAI is stored as "localllm-main" in the chat record. In eighteen months, if an attorney needs to reproduce what the model said in a matter, "localllm-main" does not answer the question: which version, which quantisation, which checkpoint. A running vLLM server returns its loaded model name on /v1/models. Capturing that at session open — one GET, cache for the session lifetime — and binding it to the chat record costs one additional DB column and closes the reproducibility question. The difference between "localllm-main" and "BredaAI-v3-Q5_K_M-2026-04-15" is the difference between a note and an audit trail.
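The capture step is small enough to sketch. The `/v1/models` endpoint and its `data[].id` response shape are part of the OpenAI-compatible API that vLLM serves; the function names and the session-lifetime cache here are illustrative assumptions, not Mike's code.

```typescript
// Hypothetical sketch: ask the vLLM server which checkpoint it has loaded,
// once per session, so the chat record can store the real version string
// (e.g. "BredaAI-v3-Q5_K_M-2026-04-15") instead of "localllm-main".
interface ModelsResponse {
  data: { id: string }[];
}

function extractModelVersion(body: ModelsResponse): string {
  // A single-model vLLM server reports its loaded checkpoint as data[0].id.
  return body.data[0]?.id ?? "unknown";
}

let cachedModelVersion: string | null = null;

async function fetchModelVersion(baseURL: string): Promise<string> {
  if (cachedModelVersion) return cachedModelVersion; // one GET per session
  const res = await fetch(`${baseURL}/v1/models`);
  cachedModelVersion = extractModelVersion((await res.json()) as ModelsResponse);
  return cachedModelVersion;
}
```

Binding the returned string to the chat record at session open is the one-column change the paragraph above describes.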

2. Tool-argument parse failure and the silent empty-object path

From backend/src/lib/chatTools.ts:

let args: Record<string, unknown> = {};
try {
    args = JSON.parse(tc.function.arguments || "{}");
} catch {
    /* ignore */
}

The empty-object fallback prevents a hard crash on malformed arguments — fine. The failure mode, though, is invisible. A malformed tool call becomes a call with no arguments. Downstream, read_document receives doc_id: undefined, which either fails at the label-resolution step or reads the wrong document silently. The model then sees a tool result (possibly an error, possibly wrong content) with no signal that its argument generation was malformed. It cannot retry because it does not know it failed. A failure that is invisible is, sub silentio, the worst kind. Treating parse failure as a tool error — returning a structured error content to the model — lets the model self-correct in the next turn.

3. Supabase + R2 as hard dependencies against the local-first thesis

This PR makes the model inference layer local — the right call. But the README still lists Supabase Auth, Supabase Postgres, and Cloudflare R2 as required services. For a firm operating in an air-gapped environment or under data residency requirements that preclude shared cloud infrastructure, the deployment story is now: local inference, cloud persistence. The security boundary is still outside the firm's perimeter. A SQLite + local filesystem backend behind the same storage interface would complete the local-first thesis. Not a quick fix — but worth naming explicitly in the deployment docs.
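The seam such a backend would sit behind is a small interface. Everything below is a hypothetical sketch: `BlobStore`, `LocalFsStore`, and the method names are our illustration of "same storage interface, local filesystem behind it," not Mike's actual storage layer.

```typescript
// Sketch of a storage seam: one interface, pluggable backends.
// An R2-backed class and this local-filesystem class would both
// implement BlobStore; air-gapped deployments pick the local one.
import { mkdir, readFile, writeFile } from "node:fs/promises";
import { dirname, join } from "node:path";

interface BlobStore {
  put(key: string, data: Uint8Array): Promise<void>;
  get(key: string): Promise<Uint8Array | null>;
}

class LocalFsStore implements BlobStore {
  constructor(private root: string) {}

  async put(key: string, data: Uint8Array): Promise<void> {
    const path = join(this.root, key);
    await mkdir(dirname(path), { recursive: true });
    await writeFile(path, data);
  }

  async get(key: string): Promise<Uint8Array | null> {
    try {
      return await readFile(join(this.root, key));
    } catch {
      return null; // missing key, mirroring an R2 404
    }
  }
}
```

SQLite for the relational side is the harder half of the work; the blob side, as the sketch suggests, is mostly a matter of having the interface in place.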

Voice 3 verdict: local inference is half the battle. Verifiable decision trails (model version capture, structured tool errors, optional local persistence) complete the local-first thesis.

Plain language

Three small concerns. First, when Mike saves a chat record showing what the AI said, it stores a generic name like "localllm-main" instead of the AI's actual version. In an audit eighteen months later, the generic name is not enough — different AI versions answer the same question differently. One extra column in the database fixes this. Second, when the AI sends a malformed request to a tool, Mike currently treats it like an empty request and proceeds silently. The AI never finds out it made a mistake, so it cannot correct itself. Treating the malformed request as an error (and telling the AI) lets the AI try again. Third, while Joseph made the AI live inside the firm's walls, the rest of Mike (the database, the document storage) still lives on outside cloud services. For some firms under heavy data-residency rules, that gap is the difference between "we can use this" and "we cannot." A local-database backend would close that gap — meaningful work, not a quick fix.

ELI5

Mike forgets which AI version answered the lawyer's question. Mike also does not tell the AI when the AI sends garbled instructions. And Mike still keeps the lawyer's notes on a cloud service even after Joseph moved the AI into the firm's closet. All three are fixable.

Voice 04 The OSS-Strategy Voice

@willchen96 — Mike was the catalyst. The 2,481 stars and 702 forks in eight days speak for themselves. You opened a category that the closed-source incumbents had quietly priced out of reach for solo practitioners and sub-50-lawyer firms. That trajectory is proof the category was under-served.

@jpbreda — a clean diff, tested against your own vLLM endpoint at bredaai.com, submitted with merge conflicts that are a function of Mike's velocity rather than of your work. That is the contribution shape Mike's main needs more of.

Voice 4 verdict: Will catalysed. Joseph localised. The category is now real.

Plain language

Mike was the first serious open-source legal AI. Joseph contributed the most-needed missing piece (private AI). The category is now real and moving faster than the closed-source competitors can respond. The combination — Mike's documents, Joseph's privacy, the wave of contributions in French and Dutch — means the open-source legal stack is no longer a thought-experiment.

ELI5

A few years ago, every legal AI was a paid product owned by one company. Now there are open ones anyone can read, run, and improve. Mike started this. Joseph kept it going.

Synthesis

Specific, sequenced actions to land this PR:

  1. Merge conflicts resolved. We have rebased the branch against current willchen96/mike main and submitted a follow-up PR back to Joseph's fork. Merging that into feature/localllm-provider-support flips this PR's mergeable flag to true. Original authorship preserved via Co-authored-by: trailers.
  2. Capture model version at session open. GET /v1/models once, cache for session lifetime, bind to chat record. One DB column. Closes Voice 3 question 1.
  3. Treat tool-argument parse failure as a tool error. Return a structured error to the model. Lets the model self-correct. Closes Voice 3 question 2.
  4. The audit-chain primitive Voice 3 names is generic and reusable. decision_id + model version + inputs + outputs + confidence + previous_hash is the same shape regardless of the task (drafting, time entry, summarisation). We have shipped this primitive (IDR, HMAC-SHA256) in Donna; the protocol is in happi.md v1.1 for anyone who wants to read or reuse it. Suggested for Mike's roadmap; not required for this PR.
  5. Evolve Provider type into a discriminated union. Provider = "claude" | "gemini" | { kind: "localllm"; baseURL: string; modelName: string }. Logs and telemetry become self-describing once the ladder grows past three rungs. Polish, not blocker.
  6. Document Supabase + R2 vs local-first deployment in README. Either a SQLite + local-FS backend behind the same storage interface, or a clear note that the current deployment story is "local inference, cloud persistence." Honest framing.
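Item 5's discriminated union can be sketched briefly. The union itself is quoted from the item above; the `describeProvider` helper is a hypothetical illustration of why the shape pays off in logs and telemetry.

```typescript
// Sketch of item 5: string variants for the cloud providers, a structured
// variant for local inference that carries its own endpoint and model name.
type Provider =
  | "claude"
  | "gemini"
  | { kind: "localllm"; baseURL: string; modelName: string };

// Hypothetical helper: a log line built from a Provider value is
// self-describing without any extra lookup.
function describeProvider(p: Provider): string {
  if (typeof p === "string") return p;
  // TypeScript has narrowed p to the localllm object variant here.
  return `${p.kind}:${p.modelName}@${p.baseURL}`;
}
```

The string variants keep existing call sites working unchanged, which is why this is polish rather than a blocker.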

Item 1 is done. Items 2 and 3 are this-week scope. Items 4, 5, and 6 are follow-up scope.

A note from the DONNA / GRIP team

Donna — complementary layer, same OSS spirit

We are shipping a complementary tool this week: Donna (Decision-Oriented Network Notarisation for Attorneys). Different layer of the stack from Mike, same OSS spirit (AGPL-3.0). Where Mike is the document layer, Donna is the operations layer — voice-first task delegation, matter summaries, and an immutable audit record for every delegated decision.

The technical primitive worth naming for this thread: every Donna decision is bound into an IDR audit chain (Intent Decision Record) — decision_id, model version, inputs, outputs, confidence, previous_hash. HMAC-SHA256. Tamper-evident. Replayable. The protocol is open in happi.md v1.1; the implementation is the proprietary substrate of our NEXUS tier.
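As a sketch of what such a chain looks like in code: the field names follow the description above, but the serialisation, key handling, and function names are our illustrative assumptions, not the happi.md protocol or the Donna implementation.

```typescript
// Hypothetical sketch of a tamper-evident decision chain: each record's
// HMAC-SHA256 covers its fields plus the previous record's hash, so editing
// any past record invalidates every hash after it.
import { createHmac } from "node:crypto";

interface DecisionRecord {
  decision_id: string;
  model_version: string;
  inputs: string;
  outputs: string;
  confidence: number;
  previous_hash: string;
  hash: string;
}

function appendDecision(
  chain: DecisionRecord[],
  key: string,
  entry: Omit<DecisionRecord, "previous_hash" | "hash">,
): DecisionRecord[] {
  const previous_hash = chain.length ? chain[chain.length - 1].hash : "genesis";
  const payload = JSON.stringify({ ...entry, previous_hash });
  const hash = createHmac("sha256", key).update(payload).digest("hex");
  return [...chain, { ...entry, previous_hash, hash }];
}

function verifyChain(chain: DecisionRecord[], key: string): boolean {
  let prev = "genesis";
  for (const rec of chain) {
    const { hash, previous_hash, ...entry } = rec;
    const payload = JSON.stringify({ ...entry, previous_hash });
    const expected = createHmac("sha256", key).update(payload).digest("hex");
    if (previous_hash !== prev || hash !== expected) return false;
    prev = hash;
  }
  return true;
}
```

This is the generic shape Voice 3 and synthesis item 4 point at: the same append-and-verify pair works whether the decision is a draft, a time entry, or a summarisation.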

Verifiable decision trails are the sine qua non of legal-grade AI. The same primitive answers Voice 3's audit-chain gap above. Same code path serves a solo lawyer doing time entry and a regulated firm under DoJ investigation.

Donna is the only legal AI that listens like a partner and signs like a notary. Donna probat.

Acknowledgements: @willchen96 (Mike, acknowledged in our README before this comment). @jpbreda (the vLLM fork that proved the local-inference path). Grigorii Moskalev (PII Shield v2 reference design). Scott Kveton ("open source is doing a lot of work in legal AI right now"). Dean Hoffman / CloseVector (retrieval-layer audit chain, complementary infrastructure).

donnaoss.com · github.com/chiefofstaff-legal/donna · about.grip-web.com

Sine ira et studio. No ask. Just notes.
— The DONNA team · V>> + Craig Miller (CC+|) · 9 May 2026