
CrewAI Agents That Learn From Mistakes (Self-Improving)

Build CrewAI agents that record failures as reflections, vote on memory quality, and consult a playbook of proven strategies before each task.

Arulnidhi Karunanidhi · 12 min read

The Problem With Agents That Never Learn

Your CrewAI agent fails to parse a CSV file because it contains a Unicode BOM. You fix the issue manually. Two weeks later, the same agent encounters the same problem with a different CSV file and fails the exact same way.

This is not a model capability issue. GPT-4o and Claude know how to handle Unicode BOMs. The problem is that the agent has no mechanism to record what went wrong, what fixed it, and when the lesson applies. Every run starts from zero, and every mistake is made fresh.

The ACE (Agentic Context Engineering) patterns solve this with three interlocking mechanisms:

  1. Reflections — structured records of what went wrong and what works
  2. Memory Voting — a quality signal that separates useful knowledge from noise
  3. Playbook Queries — consulting accumulated wisdom before starting a task

In this tutorial, we will implement all three patterns in a CrewAI system using AegisAgentMemory. By the end, your agents will record failures, vote on what helped, and consult a playbook of proven strategies before each task.

Prerequisites

You will need the same setup as the basic CrewAI memory tutorial:

pip install aegis-memory[crewai] crewai

Make sure the Aegis Memory server is running:

docker-compose up -d

Set up the crew and agent memory:

from aegis_memory.integrations.crewai import AegisCrewMemory, AegisAgentMemory

crew_memory = AegisCrewMemory(
    api_key="your-api-key",
    namespace="data-processing-crew",
    default_scope="global"
)

processor_memory = AegisAgentMemory(
    crew_memory=crew_memory,
    agent_id="DataProcessor",
    scope="agent-shared"
)

reviewer_memory = AegisAgentMemory(
    crew_memory=crew_memory,
    agent_id="QAReviewer",
    scope="agent-shared"
)

Pattern 1: Reflections — Recording What You Learned

A reflection is a structured record that captures four things: what happened, what went wrong, what fixed it, and when the lesson applies. It is more useful than a plain memory because it is queryable by error pattern and filterable by effectiveness.

When to Add Reflections

Add a reflection when an agent:

  • Encounters an error and finds a workaround
  • Discovers that an approach does not work as expected
  • Finds a better way to do something it has done before
  • Receives negative feedback from a user or another agent

Implementation

Here is a practical scenario. Your DataProcessor agent tries to import a CSV file and encounters a Unicode BOM that causes the first column header to be misread.

Run 1: The agent fails.

# The agent encounters the error during task execution
# After debugging, it records a reflection

processor_memory.add_reflection(
    content="CSV files exported from Excel on Windows often contain a UTF-8 BOM "
            "(byte order mark: EF BB BF) at the start. Python's csv.reader with "
            "utf-8 encoding reads the BOM as part of the first field, causing "
            "the header to be '\\ufeffid' instead of 'id'. This makes column "
            "lookups fail silently.",
    error_pattern="First CSV column header contains unexpected BOM character "
                  "(\\ufeff prefix), causing KeyError on column access",
    correct_approach="Open the file with encoding='utf-8-sig' instead of 'utf-8'. "
                     "The 'sig' variant automatically strips the BOM. Alternatively, "
                     "call content.lstrip('\\ufeff') on the raw string before parsing.",
    applicable_contexts=["csv", "file-import", "excel", "unicode", "windows", "encoding"]
)

Let us break down each field:

  • content: The full explanation of the issue. This is what gets indexed for semantic search. Write it as if explaining to a colleague who has never seen this bug.
  • error_pattern: A concise description of the symptom. This is used for exact-match style filtering when agents look for specific known issues.
  • correct_approach: The proven fix. This is the most actionable field — it tells future agents exactly what to do.
  • applicable_contexts: Tags that control when this reflection surfaces. Be generous with tags — it is better to surface a reflection unnecessarily than to miss it when it matters.
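To build intuition for how applicable_contexts gates retrieval, here is a toy overlap check. The real server combines tag filtering with semantic search over content; this helper is purely illustrative and not part of the Aegis API:

```python
def contexts_overlap(reflection_tags, task_tags):
    """A reflection is a retrieval candidate when any of its
    applicable_contexts tags match the tags describing the task."""
    return bool(set(reflection_tags) & set(task_tags))
```

With generous tags, `contexts_overlap(["csv", "excel", "encoding"], ["csv", "file-import"])` matches; with only `["csv"]`, a search tagged `["file-import", "data-parsing"]` misses it entirely.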

Multiple Reflections Build a Knowledge Base

Over time, your agents accumulate reflections about every type of failure they encounter:

# Reflection about API rate limiting
processor_memory.add_reflection(
    content="The Salesforce bulk API returns HTTP 429 when more than 10 "
            "batch jobs are submitted within 15 minutes. The retry-after "
            "header is unreliable; actual recovery takes 2-3x longer.",
    error_pattern="Salesforce bulk API 429 rate limit with unreliable retry-after",
    correct_approach="Limit batch submissions to 8 per 15 minutes (80% of quota). "
                     "Use exponential backoff starting at 60 seconds, not the "
                     "retry-after header value. Queue excess batches locally.",
    applicable_contexts=["salesforce", "bulk-api", "rate-limiting", "etl", "batch-processing"]
)

# Reflection about data quality
processor_memory.add_reflection(
    content="Customer phone numbers from the legacy CRM contain mixed formats: "
            "+44 7700 900000, 07700900000, (077) 009-00000. The normalization "
            "regex was too strict and rejected 23% of valid numbers.",
    error_pattern="Phone number normalization rejects valid international formats",
    correct_approach="Use the phonenumbers library (pip install phonenumbers) instead "
                     "of regex. It handles international formats, validates carrier "
                     "info, and normalizes to E.164 format automatically.",
    applicable_contexts=["phone-numbers", "data-normalization", "crm", "international"]
)

Pattern 2: Memory Voting — Separating Signal From Noise

Not every memory is equally useful. Some reflections contain advice that works perfectly. Others contain advice that seemed right but turned out to be wrong, or advice that only applies in narrow circumstances.

Memory voting lets agents rate memories as helpful or harmful after actually using them. Over time, an effectiveness score emerges that ranks memories by proven usefulness.

How Voting Works

When an agent retrieves a memory and uses it to complete a task, it votes on whether that memory helped:

from aegis_memory.client import AegisClient

client = AegisClient(base_url="http://localhost:8741", api_key="your-api-key")

# QA reviewer tests the BOM fix and confirms it works
client.vote(
    memory_id="reflection-bom-fix-001",
    vote="helpful",
    voter_agent_id="QAReviewer",
    context="Applied utf-8-sig encoding to customer data import. "
            "BOM was correctly stripped. All 12,000 rows parsed successfully.",
    task_id="customer-import-march-2026"
)

If the advice turns out to be wrong or incomplete:

# A different agent tries the Salesforce rate limit advice
# but discovers the numbers have changed
client.vote(
    memory_id="reflection-sf-rate-limit-001",
    vote="harmful",
    voter_agent_id="DataProcessor",
    context="Salesforce increased the bulk API limit to 15 batches per 15 minutes "
            "in their Winter '26 release. The 8-per-15-minute limit in this "
            "reflection is now too conservative, causing unnecessary delays.",
    task_id="sf-sync-april-2026"
)

The Effectiveness Score

Each memory accumulates votes over time, producing an effectiveness score:

effectiveness = (helpful_votes - harmful_votes) / (total_votes + 1)

A reflection with 8 helpful votes and 1 harmful vote has an effectiveness of (8-1)/(9+1) = 0.7. A reflection with 2 helpful and 3 harmful has (2-3)/(5+1) = -0.17. When agents query the playbook, they can filter by minimum effectiveness, ensuring they only get advice that has been validated by actual use.
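The scoring rule can be written as a small pure function (a sketch of the formula above, not the server's actual implementation):

```python
def effectiveness(helpful_votes: int, harmful_votes: int) -> float:
    """Net helpful votes over (total votes + 1). The +1 in the
    denominator keeps scores modest until a memory has accumulated
    enough votes to be trusted."""
    total = helpful_votes + harmful_votes
    return (helpful_votes - harmful_votes) / (total + 1)
```

Note that a brand-new memory with no votes scores exactly 0.0, so a `min_effectiveness` threshold above zero only surfaces advice that has been positively validated.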

Voting in Practice

Build voting into your task completion flow. After every task, the agent evaluates which memories it used and votes on them:

def post_task_voting(agent_memory, client, task_id, used_memories, task_outcome):
    """Vote on memories that were used during a task."""
    for memory in used_memories:
        # Determine vote based on task outcome
        if task_outcome == "success":
            vote = "helpful"
            context = f"Memory contributed to successful completion of {task_id}"
        else:
            vote = "harmful"
            context = f"Memory was consulted but task {task_id} still failed. " \
                      f"The advice may be incomplete or outdated."

        client.vote(
            memory_id=memory["id"],
            vote=vote,
            voter_agent_id=agent_memory.agent_id,
            context=context,
            task_id=task_id
        )

Pattern 3: Playbook Queries — Consulting Wisdom Before Acting

The playbook is the culmination of reflections and voting. It is a queryable knowledge base of proven strategies, ranked by effectiveness. Before starting any task, an agent should consult the playbook for relevant lessons.

Basic Playbook Query

# Before processing a new batch of CSV files
playbook = processor_memory.get_playbook(
    query="importing CSV files from external sources",
    top_k=5,
    min_effectiveness=0.3
)

for entry in playbook:
    print(f"Known issue: {entry['error_pattern']}")
    print(f"Proven fix: {entry['correct_approach']}")
    print(f"Effectiveness: {entry['effectiveness_score']}")
    print("---")

Output from a mature system might look like:

Known issue: First CSV column header contains unexpected BOM character
Proven fix: Open the file with encoding='utf-8-sig' instead of 'utf-8'
Effectiveness: 0.85
---
Known issue: Phone number normalization rejects valid international formats
Proven fix: Use the phonenumbers library instead of regex
Effectiveness: 0.72
---

Integrating the Playbook Into Agent Prompts

The real power comes from injecting playbook results into the agent’s system prompt before it starts working. This way, the agent is aware of past lessons before it writes a single line of code:

from crewai import Agent, Task, Crew

def create_informed_agent(agent_memory, task_description):
    """Create a CrewAI agent with playbook context injected."""

    # Query the playbook for relevant lessons
    playbook = agent_memory.get_playbook(
        query=task_description,
        top_k=5,
        min_effectiveness=0.3
    )

    # Format lessons as backstory context
    if playbook:
        lessons = "\n".join(
            f"- KNOWN ISSUE: {entry['error_pattern']}. "
            f"PROVEN FIX: {entry['correct_approach']}"
            for entry in playbook
        )
        backstory_addition = (
            f"\n\nYou have a playbook of lessons from previous runs. "
            f"Consult these before starting:\n{lessons}"
        )
    else:
        backstory_addition = ""

    agent = Agent(
        role="Senior Data Engineer",
        goal="Process and validate incoming data files",
        backstory=f"You are a meticulous data engineer.{backstory_addition}",
        verbose=True
    )

    return agent

Putting It All Together: Before and After

Let us see the full improvement cycle with a concrete scenario.

Before: Agent Without Learning (Run 1)

from crewai import Agent, Task, Crew

# Basic agent with no memory
agent = Agent(
    role="Data Engineer",
    goal="Import customer data from CSV files",
    backstory="You process CSV files for the analytics pipeline.",
    verbose=True
)

task = Task(
    description="Import customers.csv into the database. "
                "The file was exported from Excel on Windows.",
    expected_output="Import summary with row count and any errors.",
    agent=agent
)

crew = Crew(agents=[agent], tasks=[task])
result = crew.kickoff()

# Result: Fails because of BOM character in first column
# Agent has to debug from scratch every single time

After: Agent With ACE Patterns (Run 2+)

from aegis_memory.integrations.crewai import AegisCrewMemory, AegisAgentMemory
from aegis_memory.client import AegisClient
from crewai import Agent, Task, Crew

# Set up memory
crew_memory = AegisCrewMemory(
    api_key="your-api-key",
    namespace="data-team",
    default_scope="global"
)
agent_memory = AegisAgentMemory(
    crew_memory=crew_memory,
    agent_id="DataEngineer",
    scope="agent-shared"
)
client = AegisClient(base_url="http://localhost:8741", api_key="your-api-key")

# Step 1: Consult playbook before starting
task_description = "Import customers.csv from Excel/Windows into database"
playbook = agent_memory.get_playbook(
    query=task_description,
    top_k=5,
    min_effectiveness=0.3
)

# Step 2: Inject lessons into agent backstory
lessons_text = ""
if playbook:
    lessons_text = "\n\nLessons from previous runs:\n"
    for entry in playbook:
        lessons_text += f"- AVOID: {entry['error_pattern']}. "
        lessons_text += f"DO THIS: {entry['correct_approach']}\n"

agent = Agent(
    role="Senior Data Engineer",
    goal="Import customer data from CSV files correctly on the first try",
    backstory=f"You are a meticulous data engineer who learns from experience.{lessons_text}",
    verbose=True
)

task = Task(
    description="Import customers.csv into the database. "
                "The file was exported from Excel on Windows.",
    expected_output="Import summary with row count and any errors.",
    agent=agent
)

crew = Crew(agents=[agent], tasks=[task])
result = crew.kickoff()

# Step 3: If successful, vote on the memories that helped
if "error" not in str(result).lower():
    for entry in playbook:
        client.vote(
            memory_id=entry["id"],
            vote="helpful",
            voter_agent_id="DataEngineer",
            context="Used this lesson for customers.csv import. Success.",
            task_id="csv-import-customers-2026-02"
        )

# Step 4: If a new issue is discovered, add a reflection
# (this would happen in your error handling logic)
# agent_memory.add_reflection(
#     content="New issue discovered...",
#     error_pattern="...",
#     correct_approach="...",
#     applicable_contexts=[...]
# )
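One way to wire Step 4 into your error handling is a small helper that turns an unhandled exception into a reflection. This is an illustrative sketch: `memory` is any object exposing add_reflection as used above, and `fix_hint` can be a placeholder until a real fix is found:

```python
def record_failure_reflection(memory, error, fix_hint, contexts):
    """Record a reflection for an unexpected failure during a task."""
    memory.add_reflection(
        content=f"Task failed with {type(error).__name__}: {error}",
        error_pattern=f"{type(error).__name__}: {error}",
        correct_approach=fix_hint,
        applicable_contexts=contexts,
    )

# Usage sketch, wrapped around the crew run:
# try:
#     result = crew.kickoff()
# except Exception as exc:
#     record_failure_reflection(agent_memory, exc,
#                               fix_hint="(record once a fix is found)",
#                               contexts=["csv", "file-import"])
#     raise
```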

On Run 2+, the agent sees the BOM reflection in its playbook and uses utf-8-sig encoding from the start. It does not waste time debugging a problem it has already solved.

The Improvement Flywheel

The three patterns form a flywheel:

  1. Agent encounters a problem and records a reflection (what went wrong, what works).
  2. Other agents use the reflection and vote on whether it helped or harmed.
  3. Future agents consult the playbook before starting tasks, getting only high-effectiveness advice.
  4. The playbook gets better over time as low-quality reflections sink (negative votes) and high-quality ones rise.

Each iteration through this loop makes the system smarter. After a few dozen runs, your agents have a battle-tested playbook of solutions for every common failure mode in your domain.
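The filtering in step 3 can be seen in a toy, in-memory simulation (no server involved; the scoring is the formula from the Effectiveness Score section, and the entries are made up for illustration):

```python
toy_playbook = [
    {"id": "bom-fix", "helpful": 8, "harmful": 1},       # proven advice
    {"id": "stale-advice", "helpful": 2, "harmful": 3},  # downvoted advice
]

def entry_effectiveness(entry):
    """(helpful - harmful) / (total + 1), as defined earlier."""
    total = entry["helpful"] + entry["harmful"]
    return (entry["helpful"] - entry["harmful"]) / (total + 1)

# Step 3 of the loop: only high-effectiveness entries surface.
surfaced = [e["id"] for e in toy_playbook if entry_effectiveness(e) >= 0.3]
```

Here "bom-fix" (score 0.7) survives the 0.3 threshold while "stale-advice" (score below zero) sinks out of view.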

Measuring Improvement

Track these metrics to see the flywheel in action:

  • First-attempt success rate: How often tasks complete without errors on the first try. This should increase as the playbook grows.
  • Playbook consultation rate: How often agents find relevant playbook entries. Low rates mean your applicable_contexts tags need improvement.
  • Average effectiveness score: The mean effectiveness of playbook entries. Rising scores mean voting is working as a quality signal.
  • Time to resolution: How long agents spend on tasks. This should decrease as agents stop rediscovering known solutions.
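A minimal way to track the first two metrics is to log one record per run and fold the log into rates. The record shape below is a hypothetical logging convention of this sketch, not part of Aegis:

```python
def run_metrics(runs):
    """Compute first-attempt success rate and playbook consultation
    (hit) rate from a list of per-run records."""
    if not runs:
        return {"first_attempt_success": 0.0, "playbook_hit_rate": 0.0}
    n = len(runs)
    return {
        "first_attempt_success": sum(r["first_try_success"] for r in runs) / n,
        "playbook_hit_rate": sum(r["playbook_entries_found"] > 0 for r in runs) / n,
    }
```

Logging these after every kickoff gives you a time series: if the playbook is working, both rates should climb over the first few dozen runs.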

Common Mistakes

Storing too much: Not every interaction warrants a reflection. Reserve reflections for genuine lessons — errors that cost time, solutions that took effort to find, or approaches that were counter-intuitive.

Vague error patterns: “Something went wrong with the file” is not a useful error pattern. Be specific: “CSV column headers contain BOM prefix character when exported from Excel on Windows.”

Forgetting to vote: The playbook is only as good as the votes. If agents consult memories but never vote on them, effectiveness scores stay at zero and the quality signal never develops. Build voting into your task completion flow.

Too-narrow applicable_contexts: If you tag a reflection with only ["csv"], it will not surface when an agent searches for “file import” or “data parsing”. Use broad, overlapping tags.

What’s Next

You now have CrewAI agents that learn from mistakes, vote on memory quality, and consult proven strategies before acting.

The goal is simple: agents that never make the same mistake twice. Reflections record the lesson. Voting validates it. The playbook delivers it. Your agents get better every time they run.
