Open‑ended prompts produce debt. Locked prompts produce verifiable output. Notes from a year of building real software with Claude Code, with the prompt patterns that worked and the ones that didn't.
I am not an engineer by training. I am a healthcare professional, epidemiologist, and MBA student. Over the last year I have shipped a React Native mobile product, a locally‑run video production pipeline, and a couple of internal data tools. All of it was built with Claude Code as the primary implementer and me as the architect and reviewer.
Along the way I learned something that contradicts most AI‑coding advice: the goal is not to write better prompts. The goal is to write better specifications.
A prompt is a request. "Build me an authentication flow." "Add a login screen." "Make this faster."
A specification is a contract. Inputs, outputs, constraints, edge cases, success criteria, explicit non‑goals. When the spec is good, the prompt writes itself: "Implement the attached spec."
Most AI‑coding failures come from the first. An open‑ended prompt gives the model a huge search space and no way to know when it is done. It will do something, because models always do something, but that something may or may not be what you actually need. Worse, the output looks plausible enough that you might not notice the mismatch until two weeks later, when an assumption it baked in turns out to be wrong.
Rule: Any prompt longer than two sentences should be a spec, not a prompt.
Here is a template I use for any non‑trivial feature. It is short. That is the point.
```
# Feature spec: <name>

## Purpose
What problem this solves, for whom. One sentence.

## Inputs
What data, files, config, user actions this consumes.

## Outputs
What it produces. Format. Where it goes.

## Constraints
- Must use X
- Must not break Y
- Performance budget: Z

## Edge cases
- What happens when input is empty
- What happens when external dependency fails
- What happens on cancel/retry

## Non-goals
Explicit list of things this does NOT do.

## Verification
How I (human) will check this works. Specific test cases.
```
You can write that in 10 minutes. The feature takes an hour. Without the spec, the feature takes four hours because half of it is wrong on the first pass, wrong in different ways on the second pass, and only converges when you finally articulate what you actually wanted.
Claude Code can run agentic sessions: long‑running, multi‑step, you‑come‑back‑to‑a‑finished‑result. These sessions are the most powerful and the most dangerous mode. Powerful because they can do real work unattended. Dangerous because if the decisions are not locked in the prompt, the agent will improvise, and improvised decisions compound across an 8‑hour session into something that takes longer to untangle than to rebuild.
My rule: every decision that could be open gets closed before the session starts. File paths. Data formats. Library choices. Error‑handling strategy. Stop conditions. Explicit non‑goals.
For the video production pipeline I built, the spec handed to the agent pinned down details as small as the output filename format: `scene-<NN>-<slug>.mp4`, with an explicit instruction not to rename. Every one of those is a decision that, if left open, the agent would make differently on a different day. Closing them means the output on Monday matches the output on Friday. Reproducibility is a side effect of locked decisions.
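Locked conventions are also mechanically checkable. Here is a minimal sketch of the kind of guard that enforces the `scene-<NN>-<slug>.mp4` convention above; the script, the two-digit assumption for `<NN>`, and the slug pattern are illustrative, not quoted from the original spec:

```python
import re
import sys
from pathlib import Path

# Locked convention from the spec: scene-<NN>-<slug>.mp4
# Assumes <NN> is two digits and <slug> is lowercase words joined by hyphens.
SCENE_PATTERN = re.compile(r"^scene-\d{2}-[a-z0-9]+(?:-[a-z0-9]+)*\.mp4$")

def naming_violations(output_dir: str) -> list[str]:
    """Return every .mp4 filename that breaks the locked naming convention."""
    return [
        p.name
        for p in Path(output_dir).glob("*.mp4")
        if not SCENE_PATTERN.match(p.name)
    ]

if __name__ == "__main__":
    bad = naming_violations(sys.argv[1] if len(sys.argv) > 1 else ".")
    if bad:
        print("Naming violations:", *bad, sep="\n  ")
        sys.exit(1)  # fail loudly now, not three steps downstream
```

A check like this, run between steps, turns "do not rename" from a hope into a stop condition.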
The other rule I learned the hard way: do not chain steps inside one agentic session. Do one step, verify it works on the actual device or environment, then start the next step with the result as locked context.
The temptation to ask the agent "now do step 2 as well" is enormous. Do not give in. Every step that runs without human verification compounds error risk. A one‑step session that ends in a correct state is worth ten multi‑step sessions that end in "almost correct, but actually there is a subtle issue in step 4 that propagated through step 5, 6, 7."
Agentic throughput is a trap. Reliability matters more than speed. A correct step is worth ten almost‑correct ones.
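The arithmetic behind that rule is worth making explicit. A back-of-the-envelope sketch, assuming (purely for illustration) that each unverified step has a 90% chance of being correct:

```python
# Unverified steps multiply: P(all n steps correct) = p ** n.
p_step = 0.9  # illustrative assumption, not a measured number

for n in (1, 3, 5, 7):
    print(f"{n} chained steps: {p_step ** n:.0%} chance the result is correct")

# 1 chained steps: 90%
# 3 chained steps: 73%
# 5 chained steps: 59%
# 7 chained steps: 48%
```

By the seventh unverified step you are below a coin flip, which is exactly the "subtle issue in step 4 that propagated" failure mode.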
For a single feature, one chat works. For a project that spans weeks, one chat always ends in context collapse. The model starts forgetting the spec, conflating earlier decisions, or helpfully re‑litigating things you locked last Tuesday.
What works is parallel chats by phase: one for research, one for architecture, one for implementation in Claude Code, one for review.
Artifacts flow between them. Research output becomes input to architecture. Architecture output becomes input to Claude Code. Code output becomes input to review. Each chat stays scoped and short, which keeps its judgment intact.
The biggest architectural trap is the urge to add complexity that sounds good on paper. A third data source. A new caching layer. A more sophisticated model for a recommendation. All of these feel like upgrades.
The rule: do not add complexity until the instrumentation on the simpler version tells you that you need it.
On one of my apps, a third data source was proposed early. I deferred it. The reasoning was simple: until I have telemetry showing that the two‑source baseline is producing wrong answers, adding a third source is a guess. I instrumented the two‑source version, ran it for weeks, and found the actual failure modes were elsewhere (a normalization bug, not missing data). The third source would have been new code and new maintenance debt for a problem I did not have.
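For concreteness, "instrumented" here can be as simple as logging every disagreement between the two baseline sources. A minimal sketch; the function name, log fields, and tolerance are all hypothetical:

```python
import json
import logging

logger = logging.getLogger("recommendations")

def log_source_disagreement(query: str, value_a: float, value_b: float,
                            tolerance: float = 0.05) -> None:
    """Log every case where the two baseline sources give different answers.

    Weeks of these logs show whether a third source would actually help,
    or whether the failures live elsewhere (here: a normalization bug).
    """
    if abs(value_a - value_b) > tolerance:
        logger.warning(json.dumps({
            "event": "source_disagreement",
            "query": query,
            "source_a": value_a,
            "source_b": value_b,
        }))
```

The telemetry either justifies the third source with data or, as happened here, points at the real bug instead.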
This is why AI tooling is so dangerous when you use it casually. The cost of generating new complexity is low. That does not mean the complexity is free. It just means you spent the low cost at generation time and will spend the high cost forever at maintenance time.
I work in health. Every claim that affects user‑facing behavior has to be grounded in a primary source. When I build features that nudge behavior (eating, medication, activity), I do not let the model paraphrase from a blog post. The reference material comes from the primary sources themselves: the guidelines, not summaries of the guidelines.
The AI helps me synthesize, not interpret. It pulls the relevant sections from long documents. It cross‑references between guidelines. It flags where two guidelines disagree. The clinical judgment about which guideline applies to which patient in which context stays mine.
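As an illustration of "synthesize, not interpret," the cross‑referencing step can be a single constrained call. A sketch using the Anthropic Python SDK; the model name is a placeholder and the prompt wording is illustrative, not a quoted production prompt:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def flag_disagreements(excerpt_a: str, excerpt_b: str) -> str:
    """Ask the model to compare two guideline excerpts without interpreting them."""
    response = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder; substitute the model you run
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": (
                "Compare these two guideline excerpts. Quote verbatim where "
                "they agree, flag every point where they disagree, and make "
                "no recommendation of your own.\n\n"
                f"Excerpt A:\n{excerpt_a}\n\nExcerpt B:\n{excerpt_b}"
            ),
        }],
    )
    return response.content[0].text
```

The constraint lives in the prompt: extraction and comparison only, judgment excluded.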
A short list of the patterns that changed my output the most:

- A spec, not a prompt, for anything non‑trivial.
- Every open decision closed before an agentic session starts.
- One step per session, verified on the actual device before the next.
- Parallel chats by phase, with artifacts flowing between them.
- No new complexity until instrumentation on the simpler version demands it.
- Primary sources for any claim that affects user‑facing behavior.
None of this is clever. All of it is boring discipline. That is exactly what makes it work.
The most surprising thing about building with AI this past year is how little the "AI part" matters once the workflow is right. Claude Code is a tool. Like any tool, it rewards preparation and punishes improvisation. The people who are getting real work out of it are not writing magic prompts. They are writing clear specs, closing decisions early, and verifying every step.
Spec first. Locked decisions. Instrumented before expanded. Verified before advanced.
Shipping with Claude Code? I'd like to hear what you're building.
Get in touch