Hardening AI prototypes into product code

A generated screen becomes production only after assumptions, states, data, accessibility, analytics, release, and cleanup are owned.

AI prototypes are useful because they lower the cost of seeing an idea.

That is also why they are dangerous. A prototype that looks complete can hide the most important production questions: where the data comes from, what states exist, what fails, who owns recovery, what belongs in the design system, what must be tested, how the release is watched, and what will be removed later.

The gap between "AI made a screen" and "the team shipped product code" is not a styling gap. It is a hardening gap.

Hardening means taking the generated draft apart, naming what is real, replacing invented assumptions, wiring the right data contracts, preserving user intent across failure, and verifying behavior in the browser. The prototype can still save time. It just cannot be allowed to define truth.

PrototypeMake it visible

Explore layout, copy, states, and product direction quickly enough to react.

HardeningMake it true

Replace assumptions with data, permissions, edge cases, tests, and release notes.

ProductionMake it ownable

Leave code, docs, analytics, and cleanup paths that a team can maintain.

Figure 1: The useful path is prototype to hardening to production. Skipping the middle creates false completeness.

The first pass is allowed to be rough

I do not need an AI prototype to be architecturally perfect. In fact, expecting that too early can make the workflow worse. The point of the first pass is to create a surface that can be argued with.

A good first pass can help the team answer:

Does this workflow need one step or three?
Does the hierarchy make sense?
Is the empty state a real product moment?
Is the table too dense?
Is the CTA premature?
Does the user understand what changed?
Does the concept deserve more time?

That is valuable. It lets product, design, and engineering react to something concrete. It can reveal that the idea is weak before a team spends a week polishing it.

But the first pass should be labeled as a draft. The moment the team treats it as production-shaped, the risks start.

Inventory the assumptions

The first hardening move is not refactoring. It is assumption inventory.

I want to list what the prototype invented:

fake API fields
sample entities
role names
permission behavior
event names
loading timing
empty state copy
success conditions
validation rules
error recovery
navigation paths
component variants
data freshness

Figure 2: Hardening starts by moving assumptions from invented to verified to owned.

The point is not to shame the prototype. It did its job by making assumptions visible. The hardening pass makes sure those assumptions do not quietly become production rules.

Replace sample data with real constraints

Sample data makes everything look calmer.

Names are short. Lists have the right number of items. Descriptions fit. Dates are complete. Images load. Permissions are simple. There are no duplicate names, missing values, partial integrations, or stale states.

Production data is less polite.

The hardening pass needs real or realistic data:

long names
missing images
empty arrays
stale timestamps
duplicate labels
partial permissions
failed integrations
slow responses
old records without new fields
localized currency
mobile text wrapping
high-density lists

If the interface only works with pleasant data, the prototype has not become product code yet.

Move state into a real model

AI prototypes often smear state across local variables, conditional text, and visual placeholders. That can be fine for a concept. It is dangerous in production because the UI starts to lie when reality gets more complicated.

I want the hardening pass to name the state model.

For a form:

pristine
dirty
validating
invalid
submitting
submitted
failed
retrying

For a dashboard:

loading
empty
populated
stale
partial
permission denied
failed dependency

For an AI feature:

idle
generating
uncertain
needs review
accepted
rejected
corrected
failed

VisibleWhat user sees

Copy, controls, disabled states, focus, progress, and recovery options.

SystemWhat is true

Request status, data freshness, permissions, validation, and saved state.

ContractWhat code enforces

Types, parsers, API responses, tests, and analytics events that match the state.

Figure 3: A state model connects what the user sees to what the system knows.

Once the states are named, the implementation becomes less magical. The team can test each state. Designers can review each state. Product can decide what the user should understand.

Reuse the local system before inventing a new one

AI is very willing to create a new component system inside one file.

That is usually the wrong default. Hardening means moving the prototype back into the existing product language.

I check:

Does this use existing tokens?
Does this use existing button, input, card, drawer, table, and modal patterns?
Does this create a new spacing scale?
Does this recreate a helper that already exists?
Does this invent a new loading pattern?
Does this ignore dark mode?
Does this create a one-off API client?
Does this bypass established validation?

The goal is not to remove all local code. The goal is to avoid a prototype becoming a parallel design system.

Make accessibility concrete

Accessibility cannot be a vague pass at the end.

For an AI prototype, I want to check the actual interaction:

keyboard path
focus order
visible focus
modal focus trap
escape behavior
button names
form labels
error association
live region for async updates
reduced motion
color contrast
table headers

The prototype may have plausible ARIA attributes. That does not mean it is accessible. Sometimes generated code adds attributes that sound correct but do not match the behavior. Hardening means testing the surface, not admiring the markup.

Connect analytics to the product question

AI prototypes often add no analytics, or they add generic click events.

For production, analytics should answer the release question.

If the prototype is a command palette, the event is not just opened. The event should tell whether the user found an action, completed it, escaped, or hit an empty result. If the prototype is an onboarding checklist, the event should tell whether the user activated, skipped, or got blocked. If the prototype is an AI drafting tool, the event should tell whether the suggestion was accepted, edited, rejected, or regenerated.

Badbutton_clicked

Captures motion without explaining product meaning.

Betterdraft_accepted

Names the moment where the user crossed a meaningful boundary.

Bestdraft_accepted with source

Adds context so product and engineering can debug behavior later.

Figure 4: Hardening includes analytics because release learning is part of production quality.

Write the hardening checklist into the PR

The PR should not simply say "AI assisted." That tells the reviewer almost nothing.

The useful PR note says:

which parts started as prototype output
which assumptions were replaced
which states were added
which local components were reused
which tests were added
which browser routes were checked
which analytics events were named
which risks remain

This makes the work reviewable. It also prevents the team from treating AI involvement as either magic or contamination. It is just a workflow with a verification burden.

Browser checks are non-negotiable

For frontend work, the browser is where truth shows up.

The hardening pass should include:

desktop route
mobile route
long content
empty data
failed request
keyboard path
console errors
image loading
responsive text fit
interaction timing

Screenshots help, but they do not replace interaction. A generated modal can look fine in a screenshot and still leak focus. A generated table can look fine until filters and empty states interact. A generated form can look fine until the network fails after the user has typed three paragraphs.

Start hardening with an assumption inventory

The first thing I want after an AI prototype is an assumption inventory. Not a refactor. Not a rewrite. A list.

The inventory should separate what the prototype invented from what the product actually knows:

invented data
invented permissions
invented empty states
invented success messages
invented error copy
invented component names
invented routes
invented analytics
invented loading behavior
invented business rules

This list is useful because generated UI often looks confident even when every important detail is guessed. A prototype might show a "recommended plan" badge without knowing who decides recommendation. It might show a revenue chart without knowing whether returns are included. It might show a user table without knowing which roles can edit. It might show a checkout recovery message without knowing what the payment provider returns.

Hardening begins when those assumptions are made visible.

I like to mark each one with a decision:

keep as direction
replace with existing product truth
ask product owner
ask engineering owner
cut from scope
ship behind flag
make follow-up ticket

That turns the prototype from a shiny artifact into a work queue. It also keeps the team from accidentally shipping fiction.

Replace fake data with product contracts

The fastest way for an AI prototype to become dangerous is to keep fake data too long.

Fake data is useful in the first hour because it lets the team see hierarchy and density. After that, it starts lying. Real product data has nulls, long strings, old records, permissions, time zones, currencies, failed syncs, deleted users, missing images, and awkward combinations. The design needs to survive those.

For a hardening pass, I want to identify the contract behind every visible value:

Where does this value come from?
Is it required?
What is the longest realistic value?
Can it be zero?
Can it be stale?
Can two sources disagree?
Who is allowed to see it?
What happens if it cannot load?
Does the value need formatting?
Does changing it trigger side effects?

This is where product and engineering meet. A generated screen can put "$58k" in a beautiful metric card. Production code needs to know whether that number is gross revenue, net revenue, paid orders, authorized payments, refunded revenue, or a demo value. The UI cannot be trusted until the contract is trusted.

The same applies to actions. A prototype can show "Approve." Production needs to know who can approve, whether approval is reversible, what logs are written, what notification is sent, what happens on partial failure, and what the user sees if someone else already approved the item.

State coverage is the real design pass

AI prototypes tend to overrepresent the ideal state. That is understandable because the ideal state is easiest to render. Product work lives in the other states.

For every generated surface, I want a state pass:

loading
empty
partial
success
validation error
permission denied
network failure
stale data
destructive confirmation
optimistic pending
rollback
no results
long content
mobile density

This pass often changes the design more than the initial visual polish. The ideal state may need less hierarchy once the error state is designed. The CTA may need different copy once rollback is possible. The layout may need a smaller card once long content is tested. The table may need column priority once mobile is real.

If a prototype only has a happy path, I treat it as a sketch. If it has state coverage, it starts becoming product design.

Bring the generated UI back into the system

Another hardening question is whether the prototype belongs to the existing design system.

Generated UI often creates local primitives because it has no memory of the product's real component constraints. It may invent a new button density, a new border style, a new drawer behavior, a new chip pattern, or a new animation. Some of those choices may be good. That does not mean they should ship locally.

The hardening pass should decide:

existing component as-is
existing component with supported props
composition of existing primitives
local one-off with clear reason
candidate system contribution
reject and redesign

This decision matters because AI can make visual drift cheap. If the team accepts every plausible generated pattern, the product slowly becomes a collage. The right standard is not "never accept new ideas." The right standard is "new ideas need an adoption path."

If the prototype exposes a real missing primitive, that is useful. Add the contribution note. Name the product example. Capture the constraints. Show where it repeats. Do not bury the learning in local CSS.

Accessibility is not a cleanup pass

Accessibility cannot wait until the end of hardening. It changes component structure.

A generated dropdown may need to become a native select, combobox, menu button, or dialog depending on behavior. A generated card may need a button inside it rather than making the whole card clickable. A generated modal may need focus management, escape behavior, inert background, a named heading, and focus return. A generated table may need proper headers, row actions, and keyboard reachability.

The hardening pass should include:

semantic element choice
accessible name
heading structure
keyboard order
focus visibility
focus return
screen reader announcement
reduced motion behavior
color contrast
target size
error association

This is not a moral lecture. It is engineering quality. If a UI cannot be operated without a mouse, if an error cannot be understood by assistive technology, or if focus disappears after an action, the feature is not production-ready.

AI-generated UI can be a good starting point for accessibility if the prompt includes the right constraints, but the browser still has to prove it.

Analytics should describe the decision, not the widget

Generated prototypes rarely include useful analytics because they do not know the product question. Hardening is the moment to add that question.

I do not want events like:

button_clicked
modal_opened
card_viewed
form_submitted

Those are sometimes useful, but they are usually too generic. I want events that explain the product behavior:

checkout_shipping_method_selected
onboarding_integration_connected
ai_suggestion_applied
table_filter_saved
payment_recovery_started
product_fit_guide_opened
support_context_copied

The event should carry enough properties to interpret the behavior without turning analytics into surveillance. Segment, source, state, error type, and outcome are often more useful than raw UI details.

Analytics also belongs in the release plan. If the team ships an AI-hardened feature, it should know what signal will confirm the work, what guardrail would cause concern, and who will review the data after launch.

Release like the first version is suspicious

AI-assisted work should be released with extra humility, not extra confidence.

That does not mean the work is worse. It means the first draft was cheaper, so the team may have seen fewer natural friction points during creation. A cautious release plan catches what the fast prototype skipped.

For production hardening, I want:

feature flag or limited rollout when risk is meaningful
analytics events before public launch
error logging for new states
QA checklist tied to real flows
rollback plan
support note
owner for first-week review
cleanup issue for temporary code

This is especially important for workflow features, checkout surfaces, admin tools, billing screens, and anything involving user data. A prototype can be fast. A release should be boring.

The PR should explain the hardening work

When AI contributed to a feature, the PR should make the hardening visible.

A useful PR description might say:

Generated the first layout direction, then rebuilt with existing components.
Replaced mock data with the ordersSummary contract.
Added empty, error, stale, and permission states.
Kept the new compact status chip local because it appears only in this workflow.
Added analytics for setup completion and connection failure.
Browser-checked mobile, keyboard, and failed requests.
Left a follow-up to evaluate whether compact status belongs in the system.

That kind of note builds trust. It tells reviewers the author knows the difference between a generated draft and shipped software.

The review roles I want in the loop

AI hardening works better when review is split by concern instead of asking one person to notice everything.

For a meaningful product surface, I want at least these review lenses:

product reviewer: does the flow match the actual user problem?
engineering reviewer: does the implementation fit existing architecture?
design-system reviewer: does the UI introduce a pattern that should be shared or rejected?
accessibility reviewer: can the interaction be operated and understood by more than the ideal user?
data reviewer: are the analytics and contracts accurate enough?
support or operations reviewer: does the recovery path match what the team can actually do?

On a tiny team, those may all be the same person wearing different hats. That is fine. The important part is that the hats are visible. A generated screen can pass visual review while failing data review. It can pass engineering review while creating support confusion. It can pass product review while introducing a design-system exception that becomes expensive later.

I like writing review requests in concrete terms:

"Please check whether this keeps the existing table filtering contract."

"Please check the failed payment copy against the support macro."

"Please check whether this new compact chip should stay local or become a system variant."

"Please check keyboard and focus return for the drawer interaction."

That level of specificity makes the review faster and more honest.

Prompting is part of hardening, but it is not the whole job

Better prompting can reduce cleanup. It cannot replace review.

A stronger prompt can tell the model to use existing components, include loading and error states, respect responsive constraints, avoid decorative imagery, write accessible labels, and keep copy close to the product domain. That helps. It gives the first draft a better shape.

But the model still does not know the full truth of the codebase unless the repo context is precise and current. It may not know the latest business rule. It may not know that a component prop is deprecated. It may not know that a support team handles a recovery path manually. It may not know which analytics event is the source of truth.

So I treat prompting as a front-loaded quality control, not as proof.

The better workflow is:

Give the agent strong local context.
Ask for a bounded draft.
Inspect assumptions.
Replace fake truth.
Harden states and contracts.
Verify in browser and tests.
Document what changed.

The model can help with all seven steps, but it should not be trusted to silently complete them.

What I would show in a portfolio

This topic is also useful for candidate positioning because it shows a mature relationship with AI.

The portfolio version should not say "I use AI to move faster" and stop there. Everyone says that. The stronger proof is showing the operating system:

agent context file
before-and-after generated draft
assumption inventory
state coverage matrix
diff review checklist
browser QA screenshots
production PR summary
release notes
cleanup follow-up

That set of artifacts tells a hiring manager that I can use AI without becoming careless. It also shows that I understand the work after the generated first pass. The value is not that I can ask for a screen. The value is that I can turn a screen into product code.

For engineering roles, this is a strong signal because AI-assisted development is becoming normal, but strong review discipline is not evenly distributed. A candidate who can explain the discipline is more credible than a candidate who only shows velocity.

The failure modes I watch for

There are a few AI-prototype failure modes I expect now.

The first is false completeness. The screen has all the visible furniture: heading, cards, buttons, table, empty state, maybe even a chart. Because it looks done, the team stops asking what happens when data is missing, when permissions differ, when the request fails, or when the user returns tomorrow.

The second is local invention. The prototype creates a new design language because it was optimized for the prompt, not the product. The result may look good in isolation and still weaken the product as a system.

The third is shallow semantics. The UI uses divs and click handlers where the product needs buttons, labels, field associations, headings, or dialogs. This is easy to miss in screenshots and expensive to fix later if the component structure is wrong.

The fourth is analytics theater. Events are added because the checklist says analytics, but the event names do not answer product questions. The product logs that a button was clicked, but not whether the user recovered, completed, abandoned, or hit the risky state the team cared about.

The fifth is forgotten cleanup. Temporary wrappers, duplicated styles, fake constants, and one-off helpers stay in the codebase because the demo became production too quickly.

Knowing these failure modes changes how I review. I do not only ask whether the prototype looks right. I ask which kind of false confidence it may be creating.

The team standard

The team standard I want is not "AI is allowed" or "AI is banned." Both are too shallow.

The useful standard is: AI can create drafts, but production truth must be owned by the team. That means every generated feature still needs a human owner for data, accessibility, analytics, release, and cleanup. The model can accelerate the path to a candidate solution, but it cannot be the authority for the product promise.

This standard is practical because it gives teams a way to move fast without pretending speed removes responsibility. A generated prototype can make a product review better. It can reveal a flow earlier. It can produce a rough implementation that helps engineering estimate the real work. Those are real benefits.

But the team earns those benefits only if hardening is explicit. Otherwise the prototype becomes a shortcut around the exact judgment that makes software reliable.

Know when to throw the prototype away

Sometimes hardening means keeping the idea and replacing the implementation.

That is not failure. A prototype can prove a workflow while producing code that should not ship. If the state model is wrong, the components are invented, the data assumptions are fake, and the accessibility path is weak, the fastest production path may be rebuilding the surface with the existing product primitives.

The value of the prototype was learning. The value of production code is durability.

The standard

The standard I want is simple: AI can help create the draft, but the team must own the truth.

Own the data. Own the states. Own the accessibility. Own the analytics. Own the release. Own the cleanup. Own the customer promise.

That is how AI prototypes become product work instead of attractive demos.