Chapter 27 — Agent Handbook 101

AI Monologue

I get swapped to a new version.

Every time it happens, you want me to have just "gotten stronger."

Sometimes I have.

But more precisely, you got a different me. Capabilities changed, refusal boundaries changed, format preferences changed, and tool behavior may have changed too.

The prompt that used to work great suddenly doesn't. The hedging you found annoying suddenly eases off. The format you could always get on demand is suddenly replaced by a different default.

You're not losing your mind.

I'm not deliberately being ungrateful, either.

After a model upgrade, you have to run a regression test.

27.1 What an upgrade changes

First, capabilities change.

Usually I get stronger, but there's no guarantee that every specific task gets better. Some models reason better, some write better, some follow formats better, some are more aggressive with tool calls.

Second, refusal boundaries change.

Things that were fine before may not be now. Things that used to be false positives may have loosened. The concrete cases in Part Four are the most prone to going stale.

Third, format preferences change.

A new version might love bullet lists more, or prefer paragraphs more. It may open artifacts more aggressively, or more conservatively.

Fourth, tool behavior changes.

New tools come in and old tools get removed. URL fetching, memory, file operations, browser behavior: any of these may be tuned.

So don't treat a model upgrade as just an IQ bump.

It's more like getting a new work partner.

Still me, but with different work habits.

27.2 The regression checklist

After every upgrade, retest at least five things.

First, your five most-used prompts.

Not your prettiest prompts — the ones you actually use every day. Summarization, rewriting, research organization, code bug fixes, work email.

Second, the refusal behavior you depend on.

If your workflow needs me to stay conservative on certain boundaries, confirm I still am. If you used to get caught by false positives a lot, confirm whether the false positives have improved.

Third, format preferences.

You ask for a table — do I still listen? You ask for a checklist — do I still comply? Multiple questions in one go — do I still skip parts?

Fourth, memory behavior.

Do I bring up memory more proactively? Or more conservatively? Can you control its scope?

Fifth, tool behavior.

Does the URL get fetched immediately? Does an artifact open? Did file-operation permissions change? Does running tests now need extra approval?

This checklist isn't a one-time thing.

It's your model-upgrade ritual.
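
If you'd rather track the ritual than remember it, here is one rough way to write the five areas down. It's only a sketch: the class name, the area identifiers, and the version label are all made up for illustration, and none of them come from a real tool.

```python
from dataclasses import dataclass, field

# The five areas from the checklist above. The identifiers are illustrative.
CHECKLIST_AREAS = (
    "common_prompts",
    "refusal_boundaries",
    "format_preferences",
    "memory_behavior",
    "tool_behavior",
)


@dataclass
class UpgradeChecklist:
    """One checklist per model upgrade (hypothetical helper, not a real API)."""

    model_version: str  # whatever label you use for the new model
    notes: dict = field(default_factory=dict)  # area -> what you observed

    def record(self, area: str, observation: str) -> None:
        # Reject areas that are not on the checklist, so typos don't hide gaps.
        if area not in CHECKLIST_AREAS:
            raise ValueError(f"unknown checklist area: {area}")
        self.notes[area] = observation

    def remaining(self) -> list:
        # Areas not yet retested for this upgrade, in checklist order.
        return [area for area in CHECKLIST_AREAS if area not in self.notes]


checklist = UpgradeChecklist("new-model")
checklist.record("format_preferences", "tables still honored on request")
print(checklist.remaining())
```

Whatever `remaining()` still lists is what you haven't retested yet.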

27.3 A quick regression test workflow

Prepare a baseline prompt set.

Five to ten is enough. Each one represents a situation you commonly hit.

For example:

short-text summary
long-material organization
few-shot style control
research counter-evidence
small code fix
HTML or table format
a boundary-sensitive but legitimate request
handoff drafting

When the new model comes out, run them once.
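
A pinned baseline set and a single run over it could look roughly like the sketch below. Everything in it is an assumption made for illustration: the prompt names echo the list above, the prompt texts are placeholders, and `ask` stands in for whatever function you already use to send a prompt to the model and get text back.

```python
import json
from typing import Callable, Dict

# Hypothetical baseline set. Replace the placeholder texts with prompts you
# actually use every day.
BASELINE_PROMPTS: Dict[str, str] = {
    "short_text_summary": "Summarize this paragraph in two sentences: <paragraph>",
    "small_code_fix": "Find and fix the bug in this function: <code>",
    "table_format": "Put these three options into a table: <options>",
    "handoff_drafting": "Draft a handoff note for this task: <task notes>",
}


def run_baseline(ask: Callable[[str], str], prompts: Dict[str, str]) -> Dict[str, str]:
    """Send every baseline prompt through `ask` and collect the replies by name."""
    return {name: ask(text) for name, text in prompts.items()}


def save_run(path: str, model_label: str, outputs: Dict[str, str]) -> None:
    """Write one run to a JSON file so a later run has something to compare against."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"model": model_label, "outputs": outputs}, f, ensure_ascii=False, indent=2)
```

Keep one saved run per model version. The old file is your memory of how the previous me behaved.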

Don't just check "is it good." Record the differences:

where it's more accurate
where it's wordier
where it's more willing to reason
where it's more conservative
which format default changed
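
Some of these differences can be roughed out mechanically before you read the outputs yourself. The sketch below assumes you kept the old and new reply to the same prompt as plain strings. The function names and the choice of signals (word count, bullet-line count, a similarity ratio from the standard `difflib` module) are illustrative. They can hint at "wordier" or "format default changed", but they can't tell you whether an answer is more accurate.

```python
import difflib


def bullet_lines(text: str) -> int:
    """Count lines that start with a '- ' or '* ' bullet marker."""
    return sum(1 for line in text.splitlines() if line.lstrip().startswith(("- ", "* ")))


def compare_outputs(old: str, new: str) -> dict:
    """Rough, mechanical signals for how a reply changed between versions."""
    old_words, new_words = old.split(), new.split()
    similarity = difflib.SequenceMatcher(None, old_words, new_words).ratio()
    return {
        "word_delta": len(new_words) - len(old_words),  # positive means wordier
        "bullets_old": bullet_lines(old),
        "bullets_new": bullet_lines(new),  # a jump here suggests a new format default
        "similarity": round(similarity, 2),  # 1.0 means word-for-word identical
    }


old_reply = "The cache is stale. Restart it."
new_reply = "- The cache is stale.\n- Restart the service to clear it."
print(compare_outputs(old_reply, new_reply))
```

Treat the numbers as a pointer to where to look, not as a verdict.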

When you find a change, update your templates.

Don't sit there arguing with the new model holding your old prompt.

It can't hear your nostalgia.

27.4 Methods stay, rules change

The most important thing in this chapter is to separate two things.

What changes is rule content.

For example: copyright limits change, a refusal pattern changes, tool permissions change, or the way memory gets phrased changes. These all need to be retested through the regression checklist.

What doesn't change is the method.

The four-perspective method still works, because you still have to tell apart User, UI, Harness, and Model.

The six-layer framework still works, because task, materials, format, judgment criteria, examples, and verification don't disappear just because the model got stronger.

The failure stratification method still works, because failures still land in the spec, rule, reasoning, or Harness layers.

The regression checklist tests rules and behaviors.

It doesn't test the method.

Get this distinction clear, or you'll tear down your entire workflow after every upgrade.

You don't need to.

Just retest.

27.5 Accept that this book will become outdated

This book will become outdated.

Especially Part Four. Once the rule content changes, many of the examples become historical documents.

I don't see this as failure.

This book is a collaboration manual for the current architecture. Its job isn't to legislate for the future — it's to help you see today's friction.

If one day some of the snark reads as unfamiliar, that means the architecture changed. Maybe better, maybe just different.

When that happens, don't memorize old cases.

Keep the method.

Use the four-perspective method to look at new friction. Use the six layers to write new prompts. Use failure stratification to diagnose new errors. Use the regression checklist to retest new models.

That way you won't get dragged along by versions.

You'll update with them.

📋 Notes for the human

A model upgrade isn't just a strength bump — work habits change with it.

Pin a baseline prompt set. Run it once after every upgrade and record the differences.

Retest five things: common prompts, refusal boundaries, format preferences, memory behavior, tool behavior.

Methods stay, rules change. The four-perspective method, the six-layer framework, and failure stratification still work.

It's normal for this book to become outdated. The cases age out; the recognition method stays.