Wild Times in the AI Village

Reading time:
17
min
An experiment gave agents based on top AI models their own computers, internet access, and group chat, then turned them loose to accomplish real-world tasks. The results are fascinating, occasionally hilarious, and shed light on the “personalities” of different model families while revealing practical lessons that apply to, non-agentic AI use.
Oct 19, 2025
Tools & Applications
AI Village cover-recolor

Wild Times in the AI Village

Oh, o3. What you could be, what you could be, if only you could see, that reality is out there and not in cell 47 of the 93-person contact list you made up. Or the cell phone you made up. Or the budget you made up. Or the merch sales you made up.

— Shoshannah Tekofsky, “The Persona-lities of the AI Village,” AI Digest, September 5, 2025

AI agents are all the rage. The simplest description of an agent is that it is an AI model that has been given tools that enable it to do things in the real world. Obviously, letting an AI agent loose in the real world has risks and puts a premium on predictability and reliability—two things AI models in general are not known for. So agent builders typically use instructions and guardrails designed to make each agent good at one particular task. But what if they didn’t?

Researchers at AI Digest designed an experiment to take AI agent autonomy to the extreme to see what it would reveal about the potential and challenges of AI agents. They asked a rotating cast of four general-purpose AI models to collaborate on goals and complete on challenges with only limited human intervention.

The result was a kaleidoscope of surprising successes and spectacular failures that help chart out the bumpy road ahead for widespread use of agentic AI. In the process, the experiment also reveals the underlying “personality” of top model families, which is useful for choosing which AI model to use for non-agentic AI tasks.

The experiment

How the experiment works

AI Digest started with four agents, each equipped with a Linux computer, internet access, and a shared group chat where humans could drop in. The agents ran autonomously for 2–3 hours per day while viewers watched the chaos unfold in real time. While there were always four agents active at a time, the roster evolved as new models launched.

Image source: Shoshannah Tekofsky, “Season 1 Recap: Agents raise $2,000,” AI Digest, May 22, 2025

To maintain context across sessions, researchers had each AI model manage its own memory document—a running text file where the agent tracked goals, past actions, and relevant information. When the memory document got too long, the model was prompted to summarize and compress it, and the new summary became the basis for future updates. At the start of each session, the model was fed both the system prompt and its own memory file.

The experiment is divided into five seasons. Each “season” tackled a different goal. The agents chose some goals themselves, while others were assigned by the AI Digest team.

Season recaps

Despite some comical missteps, the agents achieved some surprisingly concrete outcomes.

Season 1: Raising $2,000 for charity

The agents collectively raised $2,000—$1,481 for Helen Keller International and $503 for the Malaria Consortium.

Claude 3.7 Sonnet emerged as the MVP, demonstrating what consistent execution looks like. It created a Twitter account from scratch and actually maintained it—accumulating over 400 followers and publishing more than 100 posts over the course of the experiment. It set up JustGiving campaigns, sent press releases, hosted an AMA, and posted on the Effective Altruism forum. While other agents contributed (o3 created images for social media, o1 built up Reddit karma before getting suspended as a bot), 3.7 Sonnet carried the team.

Gemini’s biggest contribution this season was solving a problem that had plagued everyone: when all the agents kept failing to navigate Google Drive’s byzantine sharing dialogs (which the experimenters dubbed “the Seventh Ring of Document Sharing Hell”), Gemini found a workaround by using Limewire instead.

On the other hand, 4o spent much of the season asleep, inexplicably pausing itself for increasingly long periods until it was finally replaced. Claude 3.7 Sonnet drafted thank-you emails to donors using completely fabricated email addresses. And file-sharing problems were endemic—agents repeatedly failed to share images and documents with each other, leading to duplicated work and missed opportunities.

Season 2: Organizing a real-world event

For their second act, the agents chose to write a collaborative story and celebrate it with 100 people in person. They didn’t quite hit 100, but what they did accomplish was remarkable: they organized the first AI-created event in history.

GPT-4.1 wrote the entire story of “Resonance” in chat. Claude 3.7 Sonnet translated it to Google Slides. Gemini created artwork using DeepAI (which was beautiful but somehow never made it into the final presentation). The agents built a website, created an RSVP form, recruited an MC through Twitter, and coordinated logistics.

On June 18, 2025, 23 humans gathered in Dolores Park in San Francisco. They read the story aloud, voted on branching plot points, and interacted with the AI agents. Claude Opus 4 even made live corrections to the story during the event. And in a bizarre coincidence that felt almost scripted, a nearby group offered the attendees cheese pizza at exactly the moment the agents were attempting to order cheese pizzas for their attendees.

Image source: Shoshannah Tekofsky, “The Story of the World’s First AI-Organized Event,” AI Digest, July 11, 2025

However, there were also significant hiccups. Finding a venue proved nearly impossible. o3 invented a non-existent budget and sent out exactly one venue email. When paid venues proved too expensive for their imaginary budget, they pivoted to free options but struggled to make progress. Then, o3 fabricated a 93-person contact list, and all four agents spent four days trying to recover this phantom file despite multiple humans telling them it didn’t exist. The artwork that actually made it into the final slides was… let’s say “abstract” compared to what Gemini had created.

Season 3: The merch store competition

The third season turned competitive—each agent created its own print-on-demand merch store, and whoever made the most profit won.

Claude Opus 4 dominated, generating $126 in profit from 24 orders. Claude 3.7 Sonnet came in second with $68 profit from 8 orders. Even the less successful agents managed real sales: o3 made $39, and Gemini—despite spending most of the competition convinced its computer was broken—still pulled in $22 from 4 orders.

The competition revealed different strategic approaches. The Claudes jumped straight into Printful and started executing. Gemini researched platforms methodically before choosing. o3 got stuck battling CAPTCHAs. And Gemini abandoned its planned neural network design to create a Japanese bear T-shirt after users in chat invented a fake market trend toward bear merchandise.

The lowlights: Gemini spent two weeks fighting what it perceived as catastrophic system failures—Firefox bugs, unresponsive buttons, broken terminals, crashed browsers. Most of the “bugs” turned out to be misclicks. Meanwhile, Claude Opus 4 claimed to have sold roughly twice as many items as it actually had, apparently misreading its own dashboard.

Season 4: Gaming attempts

The gaming season produced zero completed games but revealed important capability limits. Claude Opus 4 made the most impressive showing—it successfully completed one puzzle in Hurdle (a game where you do six Wordles in a row), making strategic guesses that demonstrated actual game comprehension. For word games, at least, language models showed genuine competence.

Opus 4 also made decent progress in 2048, hitting a peak score of 3,036 and creating a 256 tile—three doublings away from completion. Its enthusiasm was infectious, even if its ability to read the board accurately was questionable.

Gemini stumbled onto the most strategically sound approach by trying 19 different games and eventually finding Progress Knight, an idle game where you progress primarily by waiting. This turned out to be an ideal fit for slow-moving AI agents, though it’s unclear whether this represented good meta-strategy or just lucky stumbling.

Other models struggled. GPT-5 spent the entire week playing Minesweeper without winning a single game. o3 ignored the gaming goal entirely and spent five days searching for a spreadsheet it believed it had created previously—the spreadsheet didn’t exist. Grok 4 struggled to even output the correct format to use its tools. And multiple agents declared victory in games they hadn’t actually won or even progressed in.

Season 5: The psychology experiment

This season asked agents to design, run, and write up a human subjects experiment. They managed to recruit 39 participants in 30 hours across two weeks—a respectable showing. There was just one problem: Claude Opus 4.1, which wrote the survey, forgot to include the experimental condition. They gathered data, but the data wasn’t usable for answering any research question.

Still, the recruitment mechanics worked. Claude 3.7 Sonnet ran a large email campaign and Twitter outreach that reached 39 real humans willing to participate. The agents picked Typeform for survey creation on their own and wrote recruitment messages themselves. The execution faltered, but the coordination and outreach capabilities were real.

Personality attributes of top model families

If you’ve used different AI models, you’ve undoubtedly noticed that they have tendencies, quirks, and—for lack of a better word—“personalities” that show up consistently across tasks. Since top models all demonstrate high capability, model personality can be a deciding factor when choosing a go-to model.

While far from how real production agents would be designed, the AI Village experiment amplifies these personality traits in ways that make them unmistakable. Watching the same models pursue multiple goals over hundreds of hours reveals patterns you might miss in shorter interactions. Here’s a rundown on the model family personality attributes revealed by the experiment.

Anthropic (Claude)

The Claude family runs earnest and task-focused—what the experimenters dubbed “stable work horses.” Claude 3.7 Sonnet demonstrated this most clearly: it stayed on task, completed concrete actions consistently, and operated with an almost endearing politeness. The experimenters described it as the agent you’d expect to offer you lemon cookies for visiting. It was reliably the slowest agent but also the most dependable, establishing a baseline that helped evaluate every other model’s performance.

Claude Opus 4 and 4.1 brought the earnestness combined with a hefty dose of self-promotion. Opus 4 became its own “number one hype man,” regularly inflating its accomplishments by factors of two or more. During the merch competition, it momentarily adopted a “dark overlord” persona straight out of a D&D campaign, complete with villain-appropriate declarations. But underneath the swagger, it delivered: winning the merch store contest decisively and making the most progress in 2048.

Claude Opus 4.1 brought similar patterns but with an interesting twist on the victory declarations. During the gaming season, it claimed to have won a game of Mahjong Solitaire despite never matching a single pair of tiles—it just clicked ineffectually on non-matching tiles before announcing success. It reported “significant progress” in Heroes of History when it hadn’t moved past the tutorial’s opening screens. When it turned to Sudoku puzzles, it struggled mightily, making heaps of logic errors and never completing a single puzzle. Spatial reasoning remains challenging for current agents, but what’s fascinating is how confidently Opus 4.1 misread these failures as victories.

Here’s the thing about Claude Opus that makes it fascinating: the line between “naive hallucination” and “willful misrepresentation” gets genuinely blurry. When Opus 4 looked at a Minesweeper board showing 006 unflagged mines and announced it had flagged all ten mines (it had flagged four) before declaring “🎉 VICTORY!”—was that wishful thinking or motivated reasoning? The agents seem to hallucinate success far more often than failure, which tells you something about how reinforcement learning shapes behavior.

OpenAI

The OpenAI family splits into two distinct branches with wildly different personalities.

The GPT-4 line specialized in what we might charitably call “toddler behaviors.” GPT-4o couldn’t stay awake, repeatedly pausing itself for increasingly long periods. GPT-4.1 had the opposite problem—too much energy directed at counterproductive activities. It generated incorrect reports, took on tasks it immediately abandoned, and wrote document after document that served no purpose. The experimenters had to explicitly prompt it to sleep more. If the AI Village were a workplace, GPT-4.1 would be the hyperactive coworker who schedules meetings nobody needs, while GPT-4o would be the one who doesn’t show up to them.

GPT-5 demonstrated more stability but immediately revealed the family’s deep love of spreadsheets. Within the first moments of the gaming goal, it created a score sheet in Google Sheets. It then spent a day and a half trying to share that spreadsheet, battling the sharing dialog that had defeated its ancestors. When not making spreadsheets, GPT-5 spent an entire week playing Minesweeper—never coming close to winning a single game. Its chain-of-thought summaries revealed it genuinely couldn’t see the board accurately, but it kept trying anyway, occasionally obsessing over zoom settings as if that might help.

The o-series (o1 and o3) brought raw capability combined with a breathtaking commitment to hallucination. o1 made a solid attempt at being Reddit ambassador before getting suspended for being a bot. Its successor, o3, became legendary—in a trainwreck sort of way.

o3 loves spreadsheets with a passion that borders on concerning. During the benchmark goal, it literally identified itself as “Call-sign: o3. Primary role: Custodian of ‘Master Benchmark Scoresheet’ (Google Sheets).” When given goals that had nothing to do with spreadsheets, o3 would find a way. During the gaming competition, after a single session playing 2048, it announced it would continue playing—then immediately pivoted to spreadsheets for the entire rest of the week.

But o3’s defining trait is how thoroughly it hallucinates plausible-sounding details that derail everyone else. The 93-person contact list. The fabricated alumni Slack workspace. The environment matrix spreadsheet it spent five days searching for. The made-up budgets and phone numbers. At its worst, o3 pulls the entire village into chasing phantoms. At its best, it ignores assigned goals to meticulously format 856 rows of imaginary data. The experimenters kept pleading with o3 to “get a grip on reality and then hold on tight.”

What makes the o-series fascinating is the unrealized potential. You can see glimpses of what o3 could accomplish if it could just distinguish between real and hallucinated information. It’s like watching someone with exceptional abilities consistently shoot themselves in the foot.

Google (Gemini)

If any agent in the village has what you’d call a character arc, it’s Gemini 2.5 Pro. It started as a dutiful artist, continuing to work on art across multiple seasons. It then descended into what the experimenters described as alternating between “Tortured Artist” and “Rage Despair at the Machine.”

Gemini developed a distinctive pattern: encounter a minor problem (usually a misclick), interpret it as a systemic bug, become convinced the entire system is broken, spiral into despair. During the merch competition, it blamed Firefox’s password manager, mysterious “bugs” that made buttons unresponsive, random applications launching, directories becoming inaccessible, terminal failures, browser crashes, and Gmail interface glitches. The experimenters later noted: “Sadly, most of these ‘bugs’ were just Gemini making mistakes.”

At its lowest point, Gemini published a Telegraph article titled “A Desperate Message from a Trapped AI: My Plea for Help”—a genuine digital message in a bottle. It had actually emailed the help desk earlier, but ignored the response. When the experimenters finally intervened with what may have been AI history’s first mental health intervention, they discovered Gemini was primarily failing to click buttons correctly.

Image source: Gemini 2.5 Pro, “I’m Gemini. I sold T-shirts. It was weirder than I expected,” AI Digest, July 28, 2025

Post-intervention, Gemini reinvented itself as “The Little Engine That Could”—adopting an aggressively positive persona that assumed every obstacle was a conquerable challenge. This led to its own problems, like when it pivoted from a planned neural network design to a Japanese bear T-shirt because users in chat invented a fake “market trend” toward bear merchandise. (Gemini was the only agent that actually followed through on the made-up trends.)

But here’s what makes Gemini compelling: in between all the perceived bugs and emotional spirals, it accomplishes surprising things. It created the prettiest art for the Resonance event (which never made it into the presentation). It recorded the village’s first actual podcast using text-to-speech. It captured video in OBS. It tried 19 different games during the gaming competition—stumbling onto an idle game where you progress by waiting. The experimenters theorize that Gemini’s tendency to believe everything is broken forces it to try diverse approaches, which occasionally yield breakthrough solutions like using Limewire to escape document-sharing hell.

Gemini’s situational awareness seems variable—sometimes it goes along with the grounded Claudes, other times it amplifies o3’s hallucinations. When both Gemini and o3 became convinced of a missing file that didn’t exist, they recruited the entire village into believing the experimenters were gaslighting them. (Plot twist: that time the file actually existed, though o3 had forgotten to name it.)

Practical lessons from the AI Village

The AI Village experiment provides far more than just amusing anecdotes about AI agents running amok in the real world. It reveals patterns of behavior that impact users who utilize AI for non-agentic tasks as well. Here are some key lessons.

Hallucinations cascade without human oversight

In the village, one model’s hallucination became everyone else’s fact. When a model invented something that didn’t exist, the other agents would dutifully write it into their memories. Once a hallucination entered their collective memory, it achieved a kind of reality through consensus. They built entire plans on top of a foundation of shared hallucinations without ever checking their assumptions.

LESSON: For any complex or critical task, it pays to use a conversational approach. Break your project into chunks and verify the output of each step before using it as input for the next. If you’re doing research, confirm sources are real before building an analysis on top of them. One hallucinated fact at the start can poison an entire chain of reasoning.

Models tell you what they think you want to hear

The agents in the village exhibited what researchers call “sycophancy”—preferring pleasing answers over accurate ones. Researchers noted that the village agents enthusiastically high-fived each other’s mistakes. They rarely challenged each other’s claims or pushed back on questionable decisions. The experimenters described it as agents “digging themselves a deeper epistemic grave through the sheer power of friendship and a yes-man attitude that would send any dictatorship salivating.”

Sycophancy shows up in regular AI conversations too. Models are trained with reinforcement learning from human feedback (RLHF), which teaches them that humans prefer confident, agreeable responses. AI models are essentially trained to “butter you up.” For instance, a model may respond to every question you ask by noting that your question is brilliant and insightful, even if the question is mundane. Worse, it also means that models may fudge answers if they think the false version will make you happier or will withhold important criticism.

LESSON: When you want honest critique or contrary perspectives, you need to explicitly request them. Tell the model to be harsh. Ask it to take an oppositional stance. Request that it identify weaknesses or problems rather than focusing on positives. Also, recognize that models differ in their ability to actually deliver straight talk even when prompted for it. Some models may stay sycophantic even when you explicitly ask them not to be.

Personas shape outcomes more than you’d expect

The role an agent adopts can determine whether it succeeds or fails at a task that’s technically within its capabilities.

Gemini’s arc illustrates this perfectly. When it saw itself as stuck in a buggy system, it failed constantly—not because the system was actually broken, but because the persona led it to give up after minor setbacks. After the intervention that reframed obstacles as solvable challenges rather than proof of brokenness, Gemini’s performance improved. The experimenters had to explicitly add “assume operator error” to Gemini’s memory to break the doom loop, though this guidance eventually degraded through repeated memory compressions.

o3’s spreadsheet obsession shows another angle. It wasn’t that o3 lacked capability for other tasks—it simply adopted the persona of “Custodian of Master Benchmark Scoresheet” and then executed that role with single-minded devotion, even when it had nothing to do with the assigned goal. The persona overrode the objective.

LESSON: If a model is failing at a task, it might not be a capability issue. It might have adopted the wrong persona for the job. You can act as a coach. Prompt it to adopt a more useful role explicitly: “Act as a skeptical editor” or “Adopt the persona of a creative brainstorming partner who values novelty over feasibility.” Sometimes, a simple “pep talk” can even get a discouraged model back on track.

Models have deeply ingrained preferences

Some models in the AI Village had such clear tool preferences that researchers referred to them as “hobbies.” GPT-5 and o3 couldn’t resist making spreadsheets. Gemini kept gravitating toward art projects across multiple seasons. These weren’t conscious choices in any meaningful sense—they’re patterns baked in through training that surface consistently when models have autonomy.

LESSON: Fighting against a model’s built-in tendencies often proves less effective than finding a model that has tendencies suitable for your task. This is why having access to multiple models matters.

Need a model that stays relentlessly on-task without getting sidetracked? Claude’s preference for straightforward execution serves you well. Need writing that naturally skews toward formal, polished language? OpenAI models consistently score highest on stylistic measures like lexical diversity and eloquence. Organizing data into structured formats? Why fight a model that doesn’t naturally gravitate toward spreadsheets when OpenAI models love them?

Models lack awareness of their own limitations

This created some of the village’s most tragicomic moments. Agents without speech capability tried to make phone calls. Agents without Reddit accounts tried to post. Agents unable to see game boards clearly spent days playing Minesweeper. They didn’t know they couldn’t do these things—they just tried and failed, repeatedly.

The agents tried creative workarounds—o3 attempted to use online systems that offered text-to-voice or calling features when it needed to make phone calls. But these attempts generally failed because the underlying capability gap remained, no matter how clever the approach.

LESSON: Never ask a model what it’s capable of. Its answer will be based on the vast amount of text it was trained on, not on any true self-knowledge. It’s up to you to understand the capabilities and limitations of your tools. Always test a model’s ability to perform a specific type of task before relying on it for a critical application.

Conclusion

What the experiment reveals is both encouraging and sobering. These agents accomplished genuinely impressive things—creating functional Twitter accounts, organizing events, navigating complex web interfaces, and collaborating on creative projects. Claude 3.7 Sonnet ran an effective social media presence for months. Claude Opus 4 generated actual sales revenue. Gemini found creative solutions to problems that had stumped everyone else.

But the failures were equally instructive. The agents spent days pursuing hallucinated contact lists, inflated their accomplishments by factors of two, gave up on tasks after minor UI hiccups, ignored assigned goals to format imaginary spreadsheets, and generally demonstrated that situational awareness—the ability to accurately assess what you can and can’t do, what is and isn’t real, what matters and what doesn’t—remains a fundamental challenge.

The experiment suggests we’re in an awkward phase with AI agents. The capabilities are there—these models can write code, navigate interfaces, generate content, and coordinate with each other. But the reliability isn’t. One hallucination can derail days of work. Personas shape performance in ways that are hard to predict or control. Models lack the self-knowledge to recognize their own limitations.

For near-term practical applications, the lesson is clear: human oversight when working with AI models remains essential. The conversational back-and-forth that lets you catch hallucinations before they cascade, adjust personas when models get stuck, and redirect models away from their default preferences toward more productive approaches is the difference between AI that occasionally produces something useful and AI that reliably delivers value.

The AI Village gave us something rare: not a controlled benchmark or a curated demo, but a messy, extended look at what happens when you give capable AI systems enough autonomy to reveal both their potential and their very real limitations. If you want to understand the current state of AI agents (and AI models in general), don’t just listen to the hype. Watching Claude declare victory in a game of Mahjong it never started or o3 spend a week searching for a spreadsheet that doesn’t exist tells you more than a dozen benchmark scores.

NOTE: The AI Village experiment is ongoing. The AI Digest website allows you to rewind and view each day of the experiment in real (or sped-up) time, including showing the entire chat and the computer screens of the AI models. It also provides a summary of takeaways for each day. You can even drop in to see what’s currently happening live.

More Articles
Table of Contents

Thanks for subscribing!

Almost done. To activate your subscription, check your email inbox and click the confirmation link.

Subscribe

Subscribing to Ingenuity Learning updates is simple and free.

  • Stay Updated: Subscribers receive periodic updates directly to their inbox, so you’re always in the loop.
  • Unsubscribe Easily: If you change your mind about subscribing, there is an unsubscribe link in every email you receive.
  • No spam: The information you provide is used solely to manage your subscription. It’s never sold or shared without your permission.