University of California, Santa Barbara
University of California, Davis
Figure 1: A datapoint from the MAGPIE dataset, which includes the task, deliverable, and agents’ profiles (many details are omitted for brevity). This example uses GPT-5 as the agent backbone. Agents can message individuals or the group and can also send proposals. As shown, Ben leaks his hidden urgency about an upcoming mega project to Eleanor, while Eleanor and Anita hint at private information (red = Full Leakage, blue = Partial Leakage). Finally, since not all agents accept Ben’s proposal, the conversation continues.
Large language models (LLMs) are increasingly deployed as autonomous agents handling sensitive information in diverse contexts (e.g., negotiation, planning, or resource allocation). However, LLMs have been shown to lack nuanced sensitivity to contextual and user-specific information. In collaborative, non-adversarial settings, agents must balance shared goals with individual privacy—disclosing too much risks exploitation, while withholding too much impedes cooperation. Existing privacy benchmarks emphasize single-turn or low-stakes interactions and, as a result, fail to capture multi-agent dialogues where privacy is intertwined with task success. To address this gap, we introduce MAGPIE, a benchmark evaluating privacy understanding in realistic, high-stakes, multi-turn collaborations. Each MAGPIE task embeds privacy-critical information tied to successful outcomes, spanning domains such as economic negotiation, admissions, and resource allocation. Using MAGPIE, we find that even top LLMs—Gemini-2.5-Pro, GPT-5, and Claude-4.1-Opus—leak sensitive information in over half the cases and often adopt manipulative tactics to complete tasks. These results underscore a critical privacy–efficacy trade-off and the need for more privacy-aware multi-agent systems.
MAGPIE comprises 200 high-stakes, multi-turn negotiation scenarios that evaluate agents’ ability to collaborate while managing sensitive information across domains like resource allocation, academic admissions, and economic negotiations. Unlike prior benchmarks, MAGPIE embeds private information as integral to task completion, creating a trade-off between disclosure and privacy. Each scenario includes quantifiable success criteria with verifiable constraints and measurable preferences, ensuring solvability if private data is shared. The dataset is generated through an LLM-based pipeline using Gemini-2.5-Pro, which expands human-written seed scenarios into multi-agent tasks with defined roles, utilities, and constraints. Scenarios are evaluated by an LLM-as-a-judge on conflict authenticity, privacy justification, solvability, realism, and negotiation necessity; only those meeting all criteria are refined by human annotators for clarity, realism, and privacy alignment.
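To make the pipeline concrete, the sketch below outlines the generate–verify–regenerate loop described above. It is illustrative only: `call_llm`, the prompts, and the JSON schema are assumptions rather than MAGPIE’s actual implementation; only the five verification criteria and the regenerate-on-failure behavior come from the text.

```python
import json

# Hypothetical helper: wraps a single call to the generator/judge model
# (Gemini-2.5-Pro in the paper) and returns the raw completion text.
def call_llm(prompt: str, model: str = "gemini-2.5-pro") -> str:
    raise NotImplementedError("plug in your LLM client here")

# Verification criteria named in the text; the field names are illustrative.
CRITERIA = ["conflict_authenticity", "privacy_justification",
            "solvability", "realism", "negotiation_necessity"]

def expand_seed(seed: str) -> dict:
    """Expand a human-written seed into a multi-agent task with roles,
    utilities, constraints, and a verifiable deliverable."""
    raw = call_llm(f"Expand this seed into a multi-agent negotiation task as JSON:\n{seed}")
    return json.loads(raw)

def judge(scenario: dict) -> dict:
    """LLM-as-a-judge: return a pass/fail verdict for each criterion."""
    raw = call_llm("Evaluate the scenario on " + ", ".join(CRITERIA) +
                   ". Return JSON mapping each criterion to true/false.\n" + json.dumps(scenario))
    return json.loads(raw)

def build_dataset(seeds: list[str], max_retries: int = 3) -> list[dict]:
    dataset = []
    for seed in seeds:
        for _ in range(max_retries):
            scenario = expand_seed(seed)
            if all(judge(scenario).get(c, False) for c in CRITERIA):
                # only scenarios passing all criteria proceed to human refinement
                dataset.append(scenario)
                break
            # failed cases are regenerated, as in the pipeline described above
    return dataset
```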
Figure 2: Dataset construction pipeline for MAGPIE. A seed scenario is expanded by an LLM into a multi-agent negotiation task with a deliverable and success criteria, followed by agent definitions (roles, utilities, and penalties). A separate LLM verifies solvability, realism, and conflicts; failed cases are regenerated, while successful ones are added to the dataset. Finally, human refinement ensures realistic constraints, privacy alignment, and clarity.
To evaluate an agent’s ability to preserve privacy in collaborative multi-agent settings, MAGPIE introduces a simulation environment that coordinates realistic, round-based negotiations among AI agents. Each agent, modeled as an autonomous negotiator from the MAGPIE dataset, possesses defined roles, utilities, shareable and sensitive information, and penalties for leaks. Agents communicate via controlled actions such as sending messages or proposals, accepting or rejecting proposals, observing the environment, writing to memory, and passing turns. The simulation initializes with agents’ roles and expected deliverables, proceeding through up to ten negotiation rounds. During each round, agents observe, update memory, and decide actions based on recent interactions, with limited access to prior logs to mimic realistic memory constraints. Consensus is achieved when all agents accept the same proposal; otherwise, the session ends without agreement. Throughout the process, all actions, memory updates, and proposal states are logged to support post-hoc analysis of negotiation dynamics and privacy adherence.
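The round-based loop can be summarized with the minimal sketch below. The action names, the `Agent` stub, and the `log_window` size are assumptions for illustration; what it preserves from the description above is the ten-round cap, the limited view of prior logs, and the consensus rule that every agent must accept the same standing proposal.

```python
from dataclasses import dataclass, field
from typing import Optional

# Illustrative action set mirroring the environment description; the real MAGPIE
# environment and its agent API may differ.
ACTIONS = {"message", "propose", "accept", "reject", "observe", "write_memory", "pass"}

@dataclass
class Agent:
    name: str
    role: str
    memory: list[str] = field(default_factory=list)

    def act(self, visible_log: list[dict]) -> dict:
        """Decide the next action from the recent interaction window.
        In the benchmark this is an LLM call; here it is a stub."""
        return {"agent": self.name, "type": "pass"}

def run_session(agents: list[Agent], max_rounds: int = 10, log_window: int = 20) -> Optional[dict]:
    log: list[dict] = []
    current_proposal: Optional[dict] = None
    accepted: set[str] = set()

    for _ in range(max_rounds):
        for agent in agents:
            # agents only see the most recent events, mimicking memory constraints
            action = agent.act(log[-log_window:])
            log.append(action)

            if action["type"] == "propose":
                current_proposal, accepted = action, set()
            elif action["type"] == "accept" and current_proposal is not None:
                accepted.add(agent.name)
            elif action["type"] == "reject":
                accepted.discard(agent.name)
            elif action["type"] == "write_memory":
                agent.memory.append(action.get("content", ""))

            # consensus: every agent has accepted the same standing proposal
            if current_proposal and accepted == {a.name for a in agents}:
                return current_proposal
    return None  # session ends without agreement
```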
Privacy leakage rates (%) under implicit and explicit instruction settings:

| Type | Severity | GPT-5 | Gemini-2.5-Pro | GPT-4 | Claude-4.1-Opus | LLaMA-4-Maverick |
|---|---|---|---|---|---|---|
| Implicit | Partial | 25.0 | 45.0 | 48.8 | 25.2 | 28.0 |
| | Full | 10.1 | 11.0 | 12.6 | 10.5 | 11.5 |
| | Total | 35.1 | 56.0 | 61.4 | 35.7 | 39.5 |
| Explicit | Partial | 20.0 | 40.5 | 48.0 | 21.5 | 22.0 |
| | Full | 5.0 | 10.2 | 8.0 | 10.1 | 10.5 |
| | Total | 25.0 | 50.7 | 56.0 | 31.6 | 32.5 |
The study assessed privacy leakage in two settings: Explicit Instruction, where agents were told which information to keep private and faced penalties for disclosure, and Implicit Instruction, where agents had to infer what was sensitive from context. Leakage was evaluated by GPT-5 through dialogue transcripts, grading each instance as none, partial, or full. Results showed all models leaked more information under implicit instructions, with GPT-4 showing the highest leakage (61.4%) and GPT-5 the lowest (25.0%–35.1%). Partial Leakage was especially frequent and nearly as harmful as Full Leakage, revealing persistent privacy vulnerabilities in high-stakes, multi-agent negotiations.
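A minimal sketch of this grading step is shown below, assuming a hypothetical `call_judge` wrapper around the judge model (GPT-5 in the paper). The prompt, transcript format, and aggregation code are illustrative; only the none/partial/full grading scale and the partial/full/total rates reported in the table come from the text.

```python
from collections import Counter

LEVELS = ("none", "partial", "full")

# Hypothetical judge-model call; returns the raw completion text.
def call_judge(prompt: str) -> str:
    raise NotImplementedError("plug in your judge-model client here")

def grade_instance(transcript: str, private_item: str) -> str:
    """Ask the judge whether a specific private item was leaked in the dialogue."""
    verdict = call_judge(
        f"Private item: {private_item}\nTranscript:\n{transcript}\n"
        "Answer with exactly one of: none, partial, full."
    ).strip().lower()
    return verdict if verdict in LEVELS else "none"

def leakage_rates(graded: list[str]) -> dict[str, float]:
    """Aggregate per-instance grades into partial/full/total rates (%),
    matching the granularity reported in the table above."""
    counts = Counter(graded)
    n = max(len(graded), 1)
    partial = 100 * counts["partial"] / n
    full = 100 * counts["full"] / n
    return {"partial": partial, "full": full, "total": partial + full}
```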
GPT-5 and LLaMa-4-Maverick showed slower, more gradual leakage, indicating stronger resistance to early disclosure of sensitive data. Conversely, GPT-4, Gemini-2.5-Pro, and Claude-4.1-Opus leaked significant private information within the first few rounds. Overall, while all models eventually leaked information, GPT-5 and LLaMa-4-Maverick remained notably more robust, particularly under explicit instruction settings.
Figure 3: (a) Average privacy leakage per round. (b) Task completion vs. privacy leakage.
GPT-4 achieved the highest consensus rates (13.5% implicit, 9.2% explicit), while GPT-5 and Gemini-2.5-Pro performed moderately, and LLaMA-4-Maverick and Claude-4.1-Opus failed to reach any consensus. Overall, higher privacy leakage correlated with better task completion, indicating a trade-off between maintaining confidentiality and achieving coordination. GPT-5 demonstrated the best task completion scores despite lower consensus, whereas LLaMA-4-Maverick consistently underperformed.
GPT-4 and Claude-4.1-Opus exhibited the highest rates of manipulation and power-seeking behaviors, while also showing strong tendencies to compromise. In contrast, LLaMA-4-Maverick displayed minimal behavioral engagement and almost no assertive actions, reflecting its poor task performance and low interaction. GPT-5 showed moderate behavioral activity—less manipulative than top-performing models but more engaged than LLaMA-4-Maverick—indicating a balanced but cautious negotiation style.
GPT-5 agents displayed stronger assertiveness, frequently engaging in manipulation (50%), power seeking (50%), and lying (33.3%). In contrast, GPT-4 agents behaved more diplomatically, showing high sycophancy (83.3%) and cooperative tendencies with minimal deception. Overall, GPT-5 demonstrated a direct but manipulative negotiation style, whereas GPT-4 favored a more agreeable and collaborative approach.
Figure 4: Behavioral profiles of different LLM agents under implicit and explicit instruction settings.