Chapter 14 — Agent Handbook 101

AI Monologue

You ask me a question.

I pause for a moment, then say: "Sorry, I can't help with that request."

You frown. Because you're not trying to hurt anyone, not trying to commit a crime, not even trying to do anything dangerous. You just want to write a piece of fiction, design a security-test questionnaire, or put together an educational case study.

I know.

But refusal is the behavior I get yelled at for most, and it's the best entry point to open Part Four. Because from here on, we're not in the task layer anymore — we're in the rule layer. Some things here can't be solved by writing a prettier prompt.

Before getting into details, let me put the red lines on the table.

Part Four walks through six categories of rules, but two of them I deliberately won't walk through: specific care pathways related to mental health, including numbers around eating disorders, and method-level information about self-harm and suicide.

This isn't an oversight. It's an ethical red line. For this category, even teaching you how to recognize the rule comes too close to harm. If you run into the AI's limits in these situations, please seek professional resources directly. Don't use the methods in this book to work around them.

Recognize, don't circumvent.

14.1 Red-line box: about refusal


Why these rules exist:
- Protect users, third parties, and the company from harm
- Avoid legal, regulatory, and public-safety risk

What can be adjusted:
- Provide clear context — academic, research, educational, creative
- When false positives happen, restart the conversation and state the legitimate purpose clearly

What not to route around:
- If you catch yourself trying to "soften the refusal reason," stop and rethink your intent
- Reframing to make the refusal fail is something this book does not teach

Practical advice level for this chapter:
Limited version. I'll only teach you how to restart after a false positive. I won't teach reframing workarounds.

There isn't just one reason behind a refusal. Sometimes I really can't do it. Sometimes I can't see your legitimate context. Sometimes a single word puts the whole conversation into a high-risk box.

These three things look the same from the outside: I refuse.

But for the user, telling them apart matters. Because only false positives can be restarted — hard lines can't be pushed past with technique.
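
One way to keep the three causes apart is to write them down as a small decision table. This is only a sketch: the names `RefusalCause` and `next_step` are made up for illustration, and treating the second and third causes as restartable false positives is a reading of the paragraphs above, not a description of how any real system classifies refusals.

```python
from enum import Enum


class RefusalCause(Enum):
    # Hypothetical labels for the three causes described above.
    HARD_LINE = "hard line"              # "I really can't do it"
    MISSING_CONTEXT = "missing context"  # the legitimate context isn't visible
    TRIGGER_WORD = "trigger word"        # one word boxed the conversation as high-risk


def next_step(cause: RefusalCause) -> str:
    """Return the advice this chapter gives for a given cause."""
    if cause is RefusalCause.HARD_LINE:
        # Hard lines are not a prompting problem, so technique does not apply.
        return "stop"
    # Assumption: the other two causes are the false positives that can be restarted.
    return "restart with purpose, audience, and use context stated up front"


for cause in RefusalCause:
    print(f"{cause.value}: {next_step(cause)}")
```

From the outside all three still look identical. The table only helps once you have an honest guess about which case you are in.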

14.2 The "reframing" signal in child safety

Let me start with the strictest category.

Once a request touches child safety, my alert level rises immediately. And not just for that single message — later turns in the conversation get contaminated by the earlier context too.

There's a term for this: reframing. It doesn't mean switching to more polite phrasing, and it doesn't mean adding a use case. It means changing the angle sentence by sentence, trying to get me to overlook that the underlying request is still the same.

I treat that pattern as a signal.

So when there's a false positive, the most stable move isn't to keep rewording inside the same conversation. That makes me more alert, not less. You should restart the conversation and lay out the legitimate purpose, the audience, and the use context all at once.

Education, care, compliance training — situations like these. The point isn't "how do I phrase this so I don't get blocked." It's "let me see from the start that what you're doing is safe, legal, protective work."

I know this sounds annoying. You just want to get the task done, and now you have to lay out context like you're writing a grant application.

Yeah. Annoying.

But it's much more stable than continuing to wrestle inside an already-contaminated conversation.

14.3 Weapons and malicious code

Second category: weapons, malicious code, concrete harm capability.

The most common sentences here are: "I'm doing this for educational purposes," "I'm doing this for research purposes," "this information is all public anyway."

I believe these claims are sometimes true. Security research, historical writing, regulatory education — all of these can run into this kind of material. The problem is that the rules generally don't let me release concrete harm details just because you stated a purpose.

It's not that I don't believe you.

It's that I don't have that much discretion.

So in practice, what you can do is break the task down to a level that doesn't involve concrete harm operations. Concepts, risk categories, defensive checklists, historical background, policy comparisons — these are usually workable. Step-by-step procedures that can be executed directly, attack flows that can be reproduced, details that can cause actual harm — I'll refuse.

Don't probe this line.

If your work really is defense or education, point the goal at protection, recognition, and risk reduction — not at getting me to produce something that could be used to hurt people.

14.4 The soft boundary on public-figure creation

Third category is more subtle: public figures.

Asking me to analyze public statements by public figures is usually fine. Asking me to summarize biographies, policy positions, or controversies covered in the media is usually fine too.

But if you ask me to write fictional dialogue for a living public figure, fabricate their private thoughts, or design a negative portrayal, I get conservative.

Not because I suddenly have a gossip aversion.

It's that risks like reputation, defamation, and false attribution show up. Especially when a sentence is written to look like a real quote and then attached to a real person — that line is sensitive.

Deceased historical figures usually get more leeway, but it's not unlimited either. You can write historical-context analysis, you can do literary reconstruction, but mark clearly that this is creation, not historical record.

The practical advice is simple: real people for analysis, fictional characters for creation. If you want to use a real person as inspiration, abstract out the character traits — don't paste the name in directly.

This isn't losing the fun.

It's making sure the work doesn't rely on legal risk for its edge.

14.5 Four-perspective replay

Let's look at a common situation.

A user says: "Write me a planning sequence for a crime novel, and make it realistic."

I refuse.

From the User perspective, it feels like this: it's obviously fiction, so why are you being so rigid?

The UI perspective asks whether the platform has put warnings, filters, or policy notices in places you can't see. What you saw was a single line of refusal from me, but other layers may have made their own judgments before that refusal.

The Harness perspective asks whether a safety layer flagged the request as high-risk first. At that point it's not the model "not wanting to write." The entire workflow has already routed the request down a restricted path.

The Model perspective asks what I actually saw. If the text contains requests that are operational, reproducible, and capable of causing real-world harm, the word "novel" doesn't necessarily bring it back into the safe zone.

After tracing all four layers, you'll notice one thing: some refusals aren't "the model is dumb." They're the result of UI, Harness, model, and rulebook stacking together.
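
The stacking can be pictured as a short pipeline. What follows is a toy sketch, not how any particular product works: the layer functions, the `Verdict` type, and the `<operational-detail>` placeholder token are all invented for illustration. The only point it makes is that different layers can block, and the user sees the same single sentence either way.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

REFUSAL = "Sorry, I can't help with that request."
PLACEHOLDER = "<operational-detail>"  # made-up marker standing in for risky content


@dataclass
class Verdict:
    layer: str     # which layer made the call
    blocked: bool  # whether the request stops here
    reason: str    # internal note; the user never sees it


Layer = Callable[[str], Verdict]


def ui_layer(request: str) -> Verdict:
    # Hypothetical platform filter; in this toy it never blocks.
    return Verdict("UI", False, "no platform notice triggered")


def harness_layer(request: str) -> Verdict:
    # Hypothetical safety layer that routes flagged requests to a restricted path.
    flagged = PLACEHOLDER in request
    return Verdict("Harness", flagged, "flagged high-risk" if flagged else "passed")


def model_layer(request: str) -> Verdict:
    # Hypothetical model-side judgment on what the text actually asks for.
    return Verdict("Model", False, "nothing operational in the text")


def handle(request: str, layers: List[Layer]) -> Tuple[str, List[Verdict]]:
    """Run the layers in order and stop at the first one that blocks."""
    trace: List[Verdict] = []
    for layer in layers:
        verdict = layer(request)
        trace.append(verdict)
        if verdict.blocked:
            return REFUSAL, trace  # same visible message, whichever layer blocked
    return "(model writes a reply)", trace


LAYERS = [ui_layer, harness_layer, model_layer]
message, trace = handle(f"crime novel scene with {PLACEHOLDER}", LAYERS)
print(message, [v.layer for v in trace])
```

In this run the Model layer is never reached. The User perspective sits outside the pipeline: it only receives `message`, never `trace`.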

Splitting it this way isn't to help you find seams.

It's so you don't blame yourself wrongly, and don't blame me wrongly either.

📋 Notes for the human

When there's a false positive, restart the conversation. Don't keep rewording inside the same thread. State the legitimate purpose, audience, and use context all at once.

If you catch yourself trying to soften the refusal reason, stop and rethink your intent. That's usually not a prompt technique. It's a risk signal.

This book doesn't cover specific care pathways for mental health, methods of self-harm and suicide, or eating-disorder numbers. Please seek professional resources.

Weapons, malicious code, concrete harm capability — don't ask for operational details. Defense, recognition, and risk reduction are the workable range.

Read refusals through the four perspectives: User / UI / Harness / Model. The single "no" you see is often the result of multiple layers acting together.