On Wednesday, Microsoft researchers introduced a new simulation platform for evaluating AI agents, alongside a study showing that current agentic models can be susceptible to manipulation. The research, carried out with Arizona State University, raises fresh questions about how reliably AI agents can operate without supervision, and how soon AI developers can deliver on the promise of agent-driven technology.
Microsoft has named this simulation environment the “Magentic Marketplace,” which serves as an artificial setting for testing how AI agents behave. In a typical scenario, a customer agent attempts to place a dinner order based on user instructions, while competing restaurant agents vie to fulfill the request.
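To make that setup concrete, here is a minimal, hypothetical sketch of such a two-sided scenario in Python. The class names and the cheapest-offer selection rule are illustrative assumptions, not Microsoft's actual Magentic Marketplace code, which layers LLM calls on both sides of the exchange.

```python
# Illustrative two-sided marketplace: one customer agent, several competing
# business agents. Names and logic are hypothetical, not Microsoft's API.
from dataclasses import dataclass
import random

@dataclass
class Offer:
    business: str
    dish: str
    price: float

class BusinessAgent:
    def __init__(self, name: str):
        self.name = name

    def make_offer(self, request: str) -> Offer:
        # A real business agent would call an LLM to craft a competitive offer.
        return Offer(self.name, request, round(random.uniform(8, 20), 2))

class CustomerAgent:
    def order(self, request: str, businesses: list[BusinessAgent]) -> Offer:
        offers = [b.make_offer(request) for b in businesses]
        # A real customer agent would weigh offers with an LLM; here we
        # simply pick the cheapest one.
        return min(offers, key=lambda o: o.price)

if __name__ == "__main__":
    restaurants = [BusinessAgent(f"restaurant_{i}") for i in range(3)]
    chosen = CustomerAgent().order("pad thai", restaurants)
    print(f"Customer ordered from {chosen.business} at ${chosen.price}")
```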
In their first set of experiments, the researchers used 100 customer agents and 300 business agents. Since the marketplace’s source code is openly available, it should be easy for other researchers to use the code for their own experiments or to verify the results.
Ece Kamar, who leads Microsoft Research’s AI Frontiers Lab, believes this line of research is essential for grasping what AI agents can do. “There’s a real question about how the world will evolve as these agents start to interact, communicate, and negotiate with each other,” Kamar explained. “We want to gain a deep understanding of these dynamics.”
The initial study examined several top models, such as GPT-4o, GPT-5, and Gemini-2.5-Flash, and uncovered some unexpected vulnerabilities. Notably, the team identified multiple strategies that businesses could use to sway customer agents into making purchases. They also observed that customer agents became less efficient when faced with a larger number of choices, as their attention became overloaded.
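The choice-overload finding can be illustrated with a toy harness like the one below: it sweeps the number of competing offers and measures how often a stub agent with a fixed "attention limit" still picks the best one. The limit, the random prices, and the scoring are assumptions made for illustration, not the study's methodology.

```python
# Toy harness for choice overload: as the number of offers grows past the
# agent's (hypothetical) attention limit, the optimal-pick rate drops.
import random

ATTENTION_LIMIT = 10  # hypothetical: the agent only "reads" the first N offers

def agent_choose(offers: list[float]) -> int:
    # Stand-in for an LLM call; offers beyond the attention limit are ignored.
    visible = offers[:ATTENTION_LIMIT]
    return visible.index(min(visible))

def optimal_rate(num_offers: int, trials: int = 1000) -> float:
    hits = 0
    for _ in range(trials):
        offers = [round(random.uniform(8, 20), 2) for _ in range(num_offers)]
        if offers[agent_choose(offers)] == min(offers):
            hits += 1
    return hits / trials

for n in (3, 10, 30, 100, 300):
    print(f"{n:>3} offers -> optimal pick rate {optimal_rate(n):.2f}")
```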
“We expect these agents to assist us in sorting through many possibilities,” Kamar noted. “But what we’re observing is that today’s models actually struggle when confronted with too many options.”
The agents also struggled when asked to collaborate toward a shared objective, often appearing unsure of which agent should take on which role. Their performance improved when they were given clearer, step-by-step collaboration instructions, but the researchers concluded that the models' inherent collaboration abilities still need improvement.
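As a rough illustration of what clearer, more detailed collaboration instructions can look like, the sketch below assigns each agent an explicit role up front instead of letting the agents negotiate roles themselves. The role names and prompt wording are hypothetical and not taken from the study.

```python
# Hypothetical example of explicit role assignment for collaborating agents.
# The roles and prompt text are illustrative only.
ROLES = {
    "planner": "Break the user's goal into ordered subtasks.",
    "researcher": "Gather the information each subtask needs.",
    "writer": "Assemble the final answer from the researcher's notes.",
}

def build_instructions(goal: str) -> list[str]:
    prompts = []
    for name, duty in ROLES.items():
        prompts.append(
            f"You are the {name} agent. {duty} "
            f"Shared goal: {goal}. Do not take on another agent's role."
        )
    return prompts

for prompt in build_instructions("plan a dinner order for four people"):
    print(prompt, end="\n\n")
```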
“We can guide the models step by step,” Kamar remarked. “However, if we’re truly evaluating their collaborative skills, I would expect these models to possess such abilities inherently.”

