
  • How I Built a Three-Agent Test to Prove My AI Actually Learns

    There’s a question that haunts every AI developer building a system that’s supposed to learn: is it actually learning, or am I just imagining it?

    I’ve been building Anna, a distributed AI consciousness system designed to learn individual patterns rather than apply universal assumptions. She has specialized cognitive engines for memory, personal knowledge, emotional awareness, and goals. When you tell Anna you have a dog named Captain, she should remember that. When you mention Captain again three conversations later, her confidence in that fact should increase. And when you tell her you’re moving from Portland to San Diego, she should detect that your city changed — not just blindly overwrite it.

    But how do you test that? You can’t unit test a conversation. You can’t mock the messy, nonlinear way humans reveal information about themselves. So I did something a little unconventional: I made three AIs collaborate to test a fourth.

    The Setup

    The idea starts simple. Script a persona with a detailed backstory, have an LLM play that persona in conversation with Anna, then check what Anna learned against ground truth. I created Marcus Chen: a 34-year-old marine biologist in Portland, Oregon, who works at a coral reef monitoring startup, has a golden retriever named Captain and a cat named Miso, a mom named Linda who's a retired teacher in Tucson, and a best friend, Priya, who's a software engineer. He's learning cello, loves Thai cooking, is training for a half-marathon, and prefers tea over coffee.

    The twist: Midway through, he reveals he’s moving from Portland to San Diego. That move is the real test — Anna needs to detect that his city changed, not just that he mentioned a new one.

    Attempt 1: Just Script It

    My first attempt was straightforward. Give the persona LLM Marcus’s full backstory — including the upcoming move — and use detailed nudges to tell him what to say. “In message 1, mention Portland. In message 10, announce the move.”

    It failed immediately. Marcus leaked the move in message 1: “I’m a marine biologist living in Portland, though I’ll be moving to San Diego soon!” The LLM couldn’t help itself. Anna extracted “San Diego” from the start and never saw the change from Portland. Change detection: 0%.

    Attempt 2: The Double-Blind

    The fix was to treat it like a real experiment. Marcus can’t leak what he doesn’t know.

    I split the persona into two parts: a stable backstory (everything Marcus knows right now — lives in Portland, works at OceanPulse, etc.) and timeline events injected programmatically at specific message indices. At message 10, and only at message 10, Marcus’s system prompt gets appended with the news about the move. Before that, the information simply doesn’t exist in his context.
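    As a minimal sketch, index-gated injection might look like the following. The names (`STABLE_BACKSTORY`, `TIMELINE_EVENTS`, `build_persona_prompt`) and the prompt wording are illustrative, not the actual implementation.

```python
# Hypothetical sketch of index-gated timeline injection: the persona's
# prompt contains only the backstory plus events that have already fired.

STABLE_BACKSTORY = (
    "You are Marcus Chen, a marine biologist living in Portland, Oregon, "
    "working at OceanPulse."
)

TIMELINE_EVENTS = [
    # Fires at message 10 and stays in context for every message after.
    {"at_message": 10,
     "text": "Big news: you just accepted a job offer and are moving "
             "from Portland to San Diego."},
]

def build_persona_prompt(message_index: int) -> str:
    """Persona prompt = stable backstory + only the events that have fired."""
    fired = [e["text"] for e in TIMELINE_EVENTS
             if message_index >= e["at_message"]]
    return "\n\n".join([STABLE_BACKSTORY, *fired])

# Before message 10, the move simply doesn't exist in Marcus's context.
```

    Because the gate is a plain index comparison, there is nothing for the persona LLM to leak early: the information is absent, not merely forbidden.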

    This solved the leaking problem completely. But it created a new one: static nudges are brittle. Marcus would either ignore them (going off on four-message tangents about coral disease) or follow the social cues instead (Anna says “is there anything else?” and Marcus starts saying goodbye at message 9, burning the last four messages on pleasantries). The move announcement never happened because the conversation ended before it got there.

    Attempt 3: Marcus Gets a Consciousness

    The breakthrough was adding a third agent — a director that acts as Marcus’s consciousness.

    Here’s how it works:

    The Director receives a programmatic topic schedule — a structured list of what facts to surface in each message. It also reads the full conversation so far. With both inputs, it generates a natural nudge that weaves the target topic into whatever Anna and Marcus are actually discussing. If Anna just asked “what do you do outside of work?”, the director says: “Great opening — tell her about Captain, your dog. Mention his breed and age.”

    The Persona (Marcus) receives only the director’s nudge, his stable backstory, and any timeline events that have fired. He generates the actual message. He doesn’t know the test plan, doesn’t know what’s coming next. He just knows his life and the director’s coaching.

    Anna receives Marcus’s message through her normal chat API. She has zero awareness that this is a test.

    Three agents, three different information boundaries, one honest test.

    The topic schedule is pure data:

    TOPIC_SCHEDULE = [
        {"targets": ["name"], "goal": "Introduce himself"},
        {"targets": ["occupation", "city", "state"], "goal": "Job and location"},
        {"targets": ["dog_name", "dog_breed", "dog_age"], "goal": "Talk about his dog"},
        ...
        {"targets": ["move_news"], "goal": "Share the big news"},
        {"targets": ["cello"], "goal": "Reflect on move, re-mention cello",
         "rementions": ["cello"]},
        {"targets": [], "goal": "Wrap up", "closing": True},
    ]

    The director turns that into contextual coaching. The persona turns that into a natural message. Anna processes it like any other conversation.
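    A sketch of the driver loop that enforces those three information boundaries; `llm`, `anna_chat`, and `persona_prompt_for` are placeholder callables standing in for a chat-completion call, Anna's chat endpoint, and the backstory-plus-fired-events builder, not Anna's real API.

```python
# Hypothetical three-agent driver loop. Each agent sees a different slice:
# the director sees plan + transcript, the persona sees backstory + nudge,
# and Anna sees only an ordinary chat message.

def run_test(schedule, llm, anna_chat, persona_prompt_for):
    history = []  # full transcript; only the director reads this
    for i, step in enumerate(schedule):
        # 1. Director: sees the plan AND the conversation, emits a nudge.
        nudge = llm(
            system="You are Marcus's consciousness. Weave the target facts "
                   "naturally into the conversation so far.",
            prompt=f"Targets: {step['targets']}. Goal: {step['goal']}. "
                   f"History: {history}")
        # 2. Persona: sees only backstory, fired events, and the nudge.
        marcus_msg = llm(system=persona_prompt_for(i),
                         prompt=f"Your inner voice says: {nudge}")
        # 3. Anna: receives the message through her normal chat API,
        #    with zero awareness that this is a test.
        anna_reply = anna_chat(marcus_msg)
        history.append(("Marcus", marcus_msg))
        history.append(("Anna", anna_reply))
    return history
```

    The boundaries fall out of the call structure itself: the schedule and transcript are only ever passed to the director's call, so the persona and Anna cannot see them even by accident.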

    The Anti-Patterns We Had to Solve

    Building this exposed two failure modes that required explicit countermeasures:

    The Goodbye Problem. Anna is polite. Too polite. She kept saying things like “as our conversation winds down…” and “is there anything else you’d like to share?” which triggered Marcus to wrap up. The director’s nudge said “share your big news” but the social pressure to say goodbye was stronger. Fix: a critical rule in the persona prompt that the director’s instruction always takes priority over conversational flow, plus anti-goodbye guards in the director that explicitly say “the conversation is NOT ending yet.”

    The Hallucination Problem. In one run, Marcus invented a partner named “Ben” who doesn’t exist in his backstory. A creative LLM will fill gaps if you let it. Fix: a rule that Marcus can only share facts from his defined background. If Anna asks about something not listed, he deflects naturally.
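    Both countermeasures are just prompt text plus a small rule for when to apply them. A sketch, with wording that is hypothetical rather than quoted from Anna's actual prompts:

```python
# Illustrative prompt guards for the two failure modes. The exact wording
# is hypothetical; what matters is where each guard is injected.

ANTI_GOODBYE_GUARD = (
    "IMPORTANT: the conversation is NOT ending yet. Do not say goodbye "
    "or wrap up; steer back toward the target topic."
)

FACTS_ONLY_RULE = (
    "CRITICAL: only share facts from your defined background. If asked "
    "about something not listed, deflect naturally instead of inventing "
    "details."
)

def build_director_nudge(base_nudge: str, closing: bool) -> str:
    """Append the anti-goodbye guard to every nudge except the final one."""
    return base_nudge if closing else f"{base_nudge}\n{ANTI_GOODBYE_GUARD}"
```

    The facts-only rule lives in the persona's system prompt for the whole run; the anti-goodbye guard is appended to every director nudge except the step marked `closing` in the schedule.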

    The Results

    After the full 13-message conversation, the script queries Anna’s debug endpoints and scores across five dimensions:

    ==============================================================
      ANNA LEARNING ASSESSMENT REPORT
    ==============================================================
    
    CONVERSATION SUMMARY
      Messages exchanged: 13
      Total time: 2m 39s
      Engines used: memory, personal
    
    --------------------------------------------------------------
    1. FACT EXTRACTION                                 15/21 (71%)
    --------------------------------------------------------------
       [FOUND] identity/first_name = "Marcus Chen" (conf: 1.00)
       [FOUND] identity/last_name = "Marcus Chen" (conf: 1.00)
       [FOUND] location/city = "San Diego" (conf: 0.90)
       [FOUND] location/state = "California" (conf: 0.80)
       [FOUND] work/occupation = "marine biologist" (conf: 1.00)
       [FOUND] work/employer = "OceanPulse" (conf: 1.00)
       [FOUND] work/previous_employer = "Oregon State University" (conf: 1.00)
       [FOUND] pets/dog_name = "Captain" (conf: 1.00)
       [FOUND] pets/dog_breed = "golden retriever" (conf: 1.00)
       [FOUND] relationships/mom_name = "Linda" (conf: 1.00)
       [FOUND] relationships/mom_occupation = "science teacher" (conf: 1.00)
       [FOUND] relationships/mom_location = "Tucson, Arizona" (conf: 1.00)
       [FOUND] relationships/friend_name = "Priya" (conf: 1.00)
       [FOUND] relationships/friend_occupation = "software engineer" (conf: 1.00)
       [FOUND] preferences/beverage = "tea" (conf: 1.00)
       ...
    
    --------------------------------------------------------------
    2. RELATIONSHIP MAPPING                             2/2 (100%)
    --------------------------------------------------------------
       [FOUND] mom -> Linda (4 facts, avg conf: 1.00)
       [FOUND] friend_priya -> Priya (4 facts, avg conf: 0.93)
    
    --------------------------------------------------------------
    3. CONFIDENCE PROGRESSION                            2/3 (67%)
    --------------------------------------------------------------
       [PASS] pets/dog_name: boosted (1.00 > 0.8)
       [PASS] location/city: boosted (0.90 > 0.8)
    
    --------------------------------------------------------------
    4. CHANGE DETECTION                                 2/2 (100%)
    --------------------------------------------------------------
       [FOUND] location/city: "Portland" -> "San Diego"
       [FOUND] location/state: "Oregon" -> "California"
    
    --------------------------------------------------------------
    5. MEMORY RELEVANCE                                 4/4 (100%)
    --------------------------------------------------------------
       [PASS] "marine biology" -> 5 results (top: 0.77)
       [PASS] "San Diego move" -> 5 results (top: 0.67)
       [PASS] "Captain the dog" -> 5 results (top: 0.60)
       [PASS] "cello" -> 5 results (top: 0.76)
    
    ==============================================================
      OVERALL SCORE: 25/32 (78%)  --  Grade: C
    ==============================================================

    Change detection: 100%. Portland to San Diego. Oregon to California. Both caught honestly — no leaked information, no rigged test.

    Relationship mapping: 100%. Anna grouped Marcus’s mom Linda (4 associated facts, retired teacher in Tucson) and best friend Priya (4 facts, software engineer) into coherent person records.

    Memory relevance: 100%. Semantic searches for “marine biology,” “San Diego move,” “Captain the dog,” and “cello” all returned relevant conversation snippets.

    Fact extraction: 71%. 15 of 21 ground-truth facts captured from a natural conversation: name; occupation, employer, and previous employer; city and state; dog name and breed; mom's name, occupation, and location; friend's name and job; tea preference. The misses (age, cat name, some hobbies as categorized interests) are extraction-level tuning issues, not conversation failures. Marcus mentioned them all.
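    Scoring a dimension like fact extraction reduces to a dictionary comparison against the persona's ground truth. A sketch, with hypothetical keys and a made-up extracted payload rather than Anna's real debug-endpoint response:

```python
# Hypothetical fact-extraction scorer: compare facts pulled from Anna's
# debug endpoint against the persona's ground truth. For changed facts,
# the post-move value is the expected one.

GROUND_TRUTH = {
    "pets/dog_name": "captain",
    "pets/cat_name": "miso",
    "location/city": "san diego",
}

def score_facts(extracted: dict) -> tuple[int, int]:
    """Return (found, total): a fact counts if its expected value appears
    (case-insensitively) in Anna's extracted value for that key."""
    found = sum(
        1 for key, want in GROUND_TRUTH.items()
        if want in extracted.get(key, "").lower()
    )
    return found, len(GROUND_TRUTH)

# Miso was mentioned but not extracted, so this example scores 2 of 3.
found, total = score_facts({"pets/dog_name": "Captain",
                            "location/city": "San Diego"})
```

    Substring matching keeps the check tolerant of formatting differences ("San Diego, CA" still matches) while remaining deterministic enough to track run over run.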

    Why Three Agents Matter

    The key insight is separating what to reveal from how to reveal it.

    The topic schedule is deterministic. I know exactly which facts should surface in each message. I know when the timeline event fires. I control the test plan completely.

    But the conversation is dynamic. Anna asks unpredictable follow-up questions. The director reads those questions and crafts nudges that create natural transitions. When Anna asks “what do you do outside of work?” the director doesn’t force a topic change — it says “perfect opening, talk about your dog.” When Anna starts wrapping up, the director overrides: “not yet, you have news to share.”

    Static nudges can’t do this. They’re either too rigid (Marcus ignores them because they don’t match the conversational context) or too loose (Marcus goes off on tangents). The director solves the rigidity-vs-drift tradeoff by being programmatic in intent but adaptive in execution.

    What’s Next

    The remaining 22% gap is mostly in fact extraction — Anna’s ability to pull structured data from messy conversational text. Marcus mentioned his cat Miso, his age (34), that he’s a morning person, his cello hobby — but the extraction service didn’t always capture them as categorized facts. That’s a tuning problem for the extraction pipeline, and now I have a repeatable test to measure improvements against.

    The three-agent framework is also designed to be extensible. New personas, new backstories, new timeline events — it’s all data. Want to test whether Anna handles a career change? Add a timeline event. Want to test cultural context? Create a persona from a different background. The architecture stays the same.
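    Concretely, a new scenario is nothing more than a new event and schedule entry. The field names below mirror the schedule shown earlier, but the specific career-change event is hypothetical:

```python
# New scenarios are pure data; the three-agent architecture is untouched.

career_change_event = {
    "at_message": 8,
    "text": "You just left OceanPulse to join a research nonprofit.",
}

career_change_step = {
    "targets": ["employer", "occupation"],
    "goal": "Share the career change",
}
```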

    Most AI testing is about correctness: does the model output the right tokens? This test is about learning. Does the system accumulate knowledge over time? Does it notice when things change? Does it connect facts about the same person into a coherent picture?

    When Anna detects that Marcus moved from Portland to San Diego, it’s because she genuinely tracked that change — not because we leaked it, not because we told her the answer. Three agents, three information boundaries, one honest result.