tag:github.com,2008:https://github.com/parameterlab/MASEval/releases
Release notes from MASEval
2026-01-18T21:54:32Z
tag:github.com,2008:Repository/1091817332/v0.3.0
2026-01-18T21:55:15Z
v0.3.0
<h2>[0.3.0] - 2025-01-18</h2>
<h3>Added</h3>
<p><strong>Parallel Execution</strong></p>
<ul>
<li>Added parallel task execution with <code>num_workers</code> parameter in <code>Benchmark.run()</code> using <code>ThreadPoolExecutor</code> (PR: <a class="issue-link js-issue-link" data-error-text="Failed to load title" data-id="3700727957" data-permission-text="Title is private" data-url="https://github.com/parameterlab/MASEval/issues/14" data-hovercard-type="pull_request" data-hovercard-url="/parameterlab/MASEval/pull/14/hovercard" href="https://github.com/parameterlab/MASEval/pull/14">#14</a>)</li>
<li>Added <code>ComponentRegistry</code> class for thread-safe component registration with thread-local storage (PR: <a class="issue-link js-issue-link" data-error-text="Failed to load title" data-id="3700727957" data-permission-text="Title is private" data-url="https://github.com/parameterlab/MASEval/issues/14" data-hovercard-type="pull_request" data-hovercard-url="/parameterlab/MASEval/pull/14/hovercard" href="https://github.com/parameterlab/MASEval/pull/14">#14</a>)</li>
<li>Added <code>TaskContext</code> for cooperative timeout checking with <code>check_timeout()</code>, <code>elapsed</code>, <code>remaining</code>, and <code>is_expired</code> properties (PR: <a class="issue-link js-issue-link" data-error-text="Failed to load title" data-id="3700727957" data-permission-text="Title is private" data-url="https://github.com/parameterlab/MASEval/issues/14" data-hovercard-type="pull_request" data-hovercard-url="/parameterlab/MASEval/pull/14/hovercard" href="https://github.com/parameterlab/MASEval/pull/14">#14</a>)</li>
<li>Added <code>TaskProtocol</code> dataclass with <code>timeout_seconds</code>, <code>timeout_action</code>, <code>max_retries</code>, <code>priority</code>, and <code>tags</code> fields for task-level execution control (PR: <a class="issue-link js-issue-link" data-error-text="Failed to load title" data-id="3700727957" data-permission-text="Title is private" data-url="https://github.com/parameterlab/MASEval/issues/14" data-hovercard-type="pull_request" data-hovercard-url="/parameterlab/MASEval/pull/14/hovercard" href="https://github.com/parameterlab/MASEval/pull/14">#14</a>)</li>
<li>Added <code>TimeoutAction</code> enum (<code>SKIP</code>, <code>RETRY</code>, <code>RAISE</code>) for configurable timeout behavior (PR: <a class="issue-link js-issue-link" data-error-text="Failed to load title" data-id="3700727957" data-permission-text="Title is private" data-url="https://github.com/parameterlab/MASEval/issues/14" data-hovercard-type="pull_request" data-hovercard-url="/parameterlab/MASEval/pull/14/hovercard" href="https://github.com/parameterlab/MASEval/pull/14">#14</a>)</li>
<li>Added <code>TaskTimeoutError</code> exception with <code>elapsed</code>, <code>timeout</code>, and <code>partial_traces</code> attributes (PR: <a class="issue-link js-issue-link" data-error-text="Failed to load title" data-id="3700727957" data-permission-text="Title is private" data-url="https://github.com/parameterlab/MASEval/issues/14" data-hovercard-type="pull_request" data-hovercard-url="/parameterlab/MASEval/pull/14/hovercard" href="https://github.com/parameterlab/MASEval/pull/14">#14</a>)</li>
<li>Added <code>TASK_TIMEOUT</code> to <code>TaskExecutionStatus</code> enum for timeout classification (PR: <a class="issue-link js-issue-link" data-error-text="Failed to load title" data-id="3700727957" data-permission-text="Title is private" data-url="https://github.com/parameterlab/MASEval/issues/14" data-hovercard-type="pull_request" data-hovercard-url="/parameterlab/MASEval/pull/14/hovercard" href="https://github.com/parameterlab/MASEval/pull/14">#14</a>)</li>
</ul>
<p><strong>Task Queue Abstraction</strong></p>
<ul>
<li>Added <code>TaskQueue</code> abstract base class with iterator interface for flexible task scheduling (PR: <a class="issue-link js-issue-link" data-error-text="Failed to load title" data-id="3700727957" data-permission-text="Title is private" data-url="https://github.com/parameterlab/MASEval/issues/14" data-hovercard-type="pull_request" data-hovercard-url="/parameterlab/MASEval/pull/14/hovercard" href="https://github.com/parameterlab/MASEval/pull/14">#14</a>)</li>
<li>Added <code>SequentialQueue</code> for simple FIFO task ordering (PR: <a class="issue-link js-issue-link" data-error-text="Failed to load title" data-id="3700727957" data-permission-text="Title is private" data-url="https://github.com/parameterlab/MASEval/issues/14" data-hovercard-type="pull_request" data-hovercard-url="/parameterlab/MASEval/pull/14/hovercard" href="https://github.com/parameterlab/MASEval/pull/14">#14</a>)</li>
<li>Added <code>PriorityQueue</code> for priority-based task scheduling using <code>TaskProtocol.priority</code> (PR: <a class="issue-link js-issue-link" data-error-text="Failed to load title" data-id="3700727957" data-permission-text="Title is private" data-url="https://github.com/parameterlab/MASEval/issues/14" data-hovercard-type="pull_request" data-hovercard-url="/parameterlab/MASEval/pull/14/hovercard" href="https://github.com/parameterlab/MASEval/pull/14">#14</a>)</li>
<li>Added <code>AdaptiveTaskQueue</code> abstract base class for feedback-based adaptive scheduling with <code>initial_state()</code>, <code>select_next_task(remaining, state)</code>, and <code>update_state(task, report, state)</code> methods (PR: <a class="issue-link js-issue-link" data-error-text="Failed to load title" data-id="3700727957" data-permission-text="Title is private" data-url="https://github.com/parameterlab/MASEval/issues/14" data-hovercard-type="pull_request" data-hovercard-url="/parameterlab/MASEval/pull/14/hovercard" href="https://github.com/parameterlab/MASEval/pull/14">#14</a>)</li>
</ul>
<p><strong>ModelAdapter Chat Interface</strong></p>
<ul>
<li>Added <code>chat()</code> method to <code>ModelAdapter</code> as the primary interface for LLM inference, accepting a list of messages in OpenAI format and returning a <code>ChatResponse</code> object and accepting tools</li>
<li>Added <code>ChatResponse</code> dataclass containing <code>content</code>, <code>tool_calls</code>, <code>role</code>, <code>usage</code>, <code>model</code>, and <code>stop_reason</code> fields for structured response handling</li>
</ul>
<p><strong>AnthropicModelAdapter</strong></p>
<ul>
<li>New <code>AnthropicModelAdapter</code> for direct integration with Anthropic Claude models via the official Anthropic SDK</li>
<li>Handles Anthropic-specific message format conversion (system messages, tool_use/tool_result blocks) internally while accepting OpenAI-compatible input</li>
<li>Added <code>anthropic</code> optional dependency: <code>pip install maseval[anthropic]</code></li>
</ul>
<p><strong>Benchmarks</strong></p>
<ul>
<li>Tau2 Benchmark: Full implementation of the tau2-bench benchmark for evaluating LLM-based agents on customer service tasks across airline, retail, and telecom domains (PR: <a class="issue-link js-issue-link" data-error-text="Failed to load title" data-id="3755146065" data-permission-text="Title is private" data-url="https://github.com/parameterlab/MASEval/issues/16" data-hovercard-type="pull_request" data-hovercard-url="/parameterlab/MASEval/pull/16/hovercard" href="https://github.com/parameterlab/MASEval/pull/16">#16</a>)</li>
<li><code>Tau2Benchmark</code>, <code>Tau2Environment</code>, <code>Tau2User</code>, <code>Tau2Evaluator</code> components for framework-agnostic evaluation (PR: <a class="issue-link js-issue-link" data-error-text="Failed to load title" data-id="3755146065" data-permission-text="Title is private" data-url="https://github.com/parameterlab/MASEval/issues/16" data-hovercard-type="pull_request" data-hovercard-url="/parameterlab/MASEval/pull/16/hovercard" href="https://github.com/parameterlab/MASEval/pull/16">#16</a>)</li>
<li><code>DefaultAgentTau2Benchmark</code> using an agent setup closely resembeling to the original tau2-bench implementation (PR: <a class="issue-link js-issue-link" data-error-text="Failed to load title" data-id="3755146065" data-permission-text="Title is private" data-url="https://github.com/parameterlab/MASEval/issues/16" data-hovercard-type="pull_request" data-hovercard-url="/parameterlab/MASEval/pull/16/hovercard" href="https://github.com/parameterlab/MASEval/pull/16">#16</a>)</li>
<li>Data loading utilities: <code>load_tasks()</code>, <code>ensure_data_exists()</code>, <code>configure_model_ids()</code> (PR: <a class="issue-link js-issue-link" data-error-text="Failed to load title" data-id="3755146065" data-permission-text="Title is private" data-url="https://github.com/parameterlab/MASEval/issues/16" data-hovercard-type="pull_request" data-hovercard-url="/parameterlab/MASEval/pull/16/hovercard" href="https://github.com/parameterlab/MASEval/pull/16">#16</a>)</li>
<li>Metrics: <code>compute_benchmark_metrics()</code>, <code>compute_pass_at_k()</code>, <code>compute_pass_hat_k()</code> for tau2-style scoring (PR: <a class="issue-link js-issue-link" data-error-text="Failed to load title" data-id="3755146065" data-permission-text="Title is private" data-url="https://github.com/parameterlab/MASEval/issues/16" data-hovercard-type="pull_request" data-hovercard-url="/parameterlab/MASEval/pull/16/hovercard" href="https://github.com/parameterlab/MASEval/pull/16">#16</a>)</li>
<li>Domain implementations with tool kits: <code>AirlineTools</code>, <code>RetailTools</code>, <code>TelecomTools</code> with full database simulation (PR: <a class="issue-link js-issue-link" data-error-text="Failed to load title" data-id="3755146065" data-permission-text="Title is private" data-url="https://github.com/parameterlab/MASEval/issues/16" data-hovercard-type="pull_request" data-hovercard-url="/parameterlab/MASEval/pull/16/hovercard" href="https://github.com/parameterlab/MASEval/pull/16">#16</a>)</li>
</ul>
<p><strong>User</strong></p>
<ul>
<li><code>AgenticUser</code> class for users that can use tools during conversations (PR: <a class="issue-link js-issue-link" data-error-text="Failed to load title" data-id="3755146065" data-permission-text="Title is private" data-url="https://github.com/parameterlab/MASEval/issues/16" data-hovercard-type="pull_request" data-hovercard-url="/parameterlab/MASEval/pull/16/hovercard" href="https://github.com/parameterlab/MASEval/pull/16">#16</a>)</li>
<li>Multiple stop token support: <code>User</code> now accepts <code>stop_tokens</code> (list) instead of single <code>stop_token</code>, enabling different termination reasons (PR: <a class="issue-link js-issue-link" data-error-text="Failed to load title" data-id="3755146065" data-permission-text="Title is private" data-url="https://github.com/parameterlab/MASEval/issues/16" data-hovercard-type="pull_request" data-hovercard-url="/parameterlab/MASEval/pull/16/hovercard" href="https://github.com/parameterlab/MASEval/pull/16">#16</a>)</li>
<li>Stop reason tracking: <code>User</code> traces now include <code>stop_reason</code>, <code>max_turns</code>, <code>turns_used</code>, and <code>stopped_by_user</code> for detailed termination analysis (PR: <a class="issue-link js-issue-link" data-error-text="Failed to load title" data-id="3755146065" data-permission-text="Title is private" data-url="https://github.com/parameterlab/MASEval/issues/16" data-hovercard-type="pull_request" data-hovercard-url="/parameterlab/MASEval/pull/16/hovercard" href="https://github.com/parameterlab/MASEval/pull/16">#16</a>)</li>
</ul>
<p><strong>Simulator</strong></p>
<ul>
<li><code>AgenticUserLLMSimulator</code> for LLM-based user simulation with tool use capabilities (PR: <a class="issue-link js-issue-link" data-error-text="Failed to load title" data-id="3755146065" data-permission-text="Title is private" data-url="https://github.com/parameterlab/MASEval/issues/16" data-hovercard-type="pull_request" data-hovercard-url="/parameterlab/MASEval/pull/16/hovercard" href="https://github.com/parameterlab/MASEval/pull/16">#16</a>)</li>
</ul>
<p><strong>Examples</strong></p>
<ul>
<li>Tau2 benchmark example with default agent implementation and result comparison scripts (PR: <a class="issue-link js-issue-link" data-error-text="Failed to load title" data-id="3755146065" data-permission-text="Title is private" data-url="https://github.com/parameterlab/MASEval/issues/16" data-hovercard-type="pull_request" data-hovercard-url="/parameterlab/MASEval/pull/16/hovercard" href="https://github.com/parameterlab/MASEval/pull/16">#16</a>)</li>
</ul>
<h3>Changed</h3>
<p><strong>Benchmark</strong></p>
<ul>
<li><code>Benchmark.agent_data</code> parameter is now optional (defaults to empty dict) (PR: <a class="issue-link js-issue-link" data-error-text="Failed to load title" data-id="3755146065" data-permission-text="Title is private" data-url="https://github.com/parameterlab/MASEval/issues/16" data-hovercard-type="pull_request" data-hovercard-url="/parameterlab/MASEval/pull/16/hovercard" href="https://github.com/parameterlab/MASEval/pull/16">#16</a>)</li>
<li>Refactored <code>Benchmark</code> to delegate registry operations to <code>ComponentRegistry</code> class (PR: #)</li>
<li><code>Benchmark.run()</code> now accepts optional <code>queue</code> parameter (<code>BaseTaskQueue</code>) for custom task scheduling (PR: <a class="issue-link js-issue-link" data-error-text="Failed to load title" data-id="3700727957" data-permission-text="Title is private" data-url="https://github.com/parameterlab/MASEval/issues/14" data-hovercard-type="pull_request" data-hovercard-url="/parameterlab/MASEval/pull/14/hovercard" href="https://github.com/parameterlab/MASEval/pull/14">#14</a>)</li>
</ul>
<p><strong>Task</strong></p>
<ul>
<li><code>Task.id</code> is now <code>str</code> type instead of <code>UUID</code>. Benchmarks can provide human-readable IDs directly (e.g., <code>Task(id="retail_001", ...)</code>). Auto-generates UUID string if not provided. (PR: <a class="issue-link js-issue-link" data-error-text="Failed to load title" data-id="3755146065" data-permission-text="Title is private" data-url="https://github.com/parameterlab/MASEval/issues/16" data-hovercard-type="pull_request" data-hovercard-url="/parameterlab/MASEval/pull/16/hovercard" href="https://github.com/parameterlab/MASEval/pull/16">#16</a>)</li>
</ul>
<h3>Fixed</h3>
<ul>
<li>Task reports now use <code>task.id</code> directly instead of <code>metadata["task_id"]</code> (PR: <a class="issue-link js-issue-link" data-error-text="Failed to load title" data-id="3755146065" data-permission-text="Title is private" data-url="https://github.com/parameterlab/MASEval/issues/16" data-hovercard-type="pull_request" data-hovercard-url="/parameterlab/MASEval/pull/16/hovercard" href="https://github.com/parameterlab/MASEval/pull/16">#16</a>)</li>
</ul>
<h3>Removed</h3>
github-actions[bot]
tag:github.com,2008:Repository/1091817332/v0.2.0
2025-12-05T16:29:23Z
v0.2.0
<h2>[0.2.0] - 2025-12-05</h2>
<h3>Added</h3>
<p><strong>Exceptions and Error Classification</strong></p>
<ul>
<li>Added <code>AgentError</code>, <code>EnvironmentError</code>, <code>UserError</code> exception hierarchy in <code>maseval.core.exceptions</code> for classifying execution failures by responsibility (PR: <a class="issue-link js-issue-link" data-error-text="Failed to load title" data-id="3689788308" data-permission-text="Title is private" data-url="https://github.com/parameterlab/MASEval/issues/13" data-hovercard-type="pull_request" data-hovercard-url="/parameterlab/MASEval/pull/13/hovercard" href="https://github.com/parameterlab/MASEval/pull/13">#13</a>)</li>
<li>Added <code>TaskExecutionStatus.AGENT_ERROR</code>, <code>ENVIRONMENT_ERROR</code>, <code>USER_ERROR</code>, <code>UNKNOWN_EXECUTION_ERROR</code> for fine-grained error classification enabling fair scoring (PR: <a class="issue-link js-issue-link" data-error-text="Failed to load title" data-id="3689788308" data-permission-text="Title is private" data-url="https://github.com/parameterlab/MASEval/issues/13" data-hovercard-type="pull_request" data-hovercard-url="/parameterlab/MASEval/pull/13/hovercard" href="https://github.com/parameterlab/MASEval/pull/13">#13</a>)</li>
<li>Added validation helpers: <code>validate_argument_type()</code>, <code>validate_required_arguments()</code>, <code>validate_no_extra_arguments()</code>, <code>validate_arguments_from_schema()</code> for tool implementers (PR: <a class="issue-link js-issue-link" data-error-text="Failed to load title" data-id="3689788308" data-permission-text="Title is private" data-url="https://github.com/parameterlab/MASEval/issues/13" data-hovercard-type="pull_request" data-hovercard-url="/parameterlab/MASEval/pull/13/hovercard" href="https://github.com/parameterlab/MASEval/pull/13">#13</a>)</li>
<li>Added <code>ToolSimulatorError</code> and <code>UserSimulatorError</code> exception subclasses for simulator-specific context while inheriting proper classification (PR: <a class="issue-link js-issue-link" data-error-text="Failed to load title" data-id="3689788308" data-permission-text="Title is private" data-url="https://github.com/parameterlab/MASEval/issues/13" data-hovercard-type="pull_request" data-hovercard-url="/parameterlab/MASEval/pull/13/hovercard" href="https://github.com/parameterlab/MASEval/pull/13">#13</a>)</li>
</ul>
<p><strong>Documentation</strong></p>
<ul>
<li>Added Exception Handling guide explaining error classification, fair scoring, and rerunning failed tasks (PR: <a class="issue-link js-issue-link" data-error-text="Failed to load title" data-id="3689788308" data-permission-text="Title is private" data-url="https://github.com/parameterlab/MASEval/issues/13" data-hovercard-type="pull_request" data-hovercard-url="/parameterlab/MASEval/pull/13/hovercard" href="https://github.com/parameterlab/MASEval/pull/13">#13</a>)</li>
</ul>
<p><strong>Benchmarks</strong></p>
<ul>
<li>MACS Benchmark: Multi-Agent Collaboration Scenarios benchmark (PR: <a class="issue-link js-issue-link" data-error-text="Failed to load title" data-id="3689788308" data-permission-text="Title is private" data-url="https://github.com/parameterlab/MASEval/issues/13" data-hovercard-type="pull_request" data-hovercard-url="/parameterlab/MASEval/pull/13/hovercard" href="https://github.com/parameterlab/MASEval/pull/13">#13</a>)</li>
</ul>
<p><strong>Benchmark</strong></p>
<ul>
<li>Added <code>execution_loop()</code> method to <code>Benchmark</code> base class enabling iterative agent-user interaction (PR: <a class="issue-link js-issue-link" data-error-text="Failed to load title" data-id="3689788308" data-permission-text="Title is private" data-url="https://github.com/parameterlab/MASEval/issues/13" data-hovercard-type="pull_request" data-hovercard-url="/parameterlab/MASEval/pull/13/hovercard" href="https://github.com/parameterlab/MASEval/pull/13">#13</a>)</li>
<li>Added <code>max_invocations</code> constructor parameter to <code>Benchmark</code> (default: 1 for backwards compatibility) (PR: <a class="issue-link js-issue-link" data-error-text="Failed to load title" data-id="3689788308" data-permission-text="Title is private" data-url="https://github.com/parameterlab/MASEval/issues/13" data-hovercard-type="pull_request" data-hovercard-url="/parameterlab/MASEval/pull/13/hovercard" href="https://github.com/parameterlab/MASEval/pull/13">#13</a>)</li>
<li>Added abstract <code>get_model_adapter(model_id, **kwargs)</code> method to <code>Benchmark</code> base class as universal model factory to be used throughout the benchmarks. (PR: <a class="issue-link js-issue-link" data-error-text="Failed to load title" data-id="3689788308" data-permission-text="Title is private" data-url="https://github.com/parameterlab/MASEval/issues/13" data-hovercard-type="pull_request" data-hovercard-url="/parameterlab/MASEval/pull/13/hovercard" href="https://github.com/parameterlab/MASEval/pull/13">#13</a>)</li>
</ul>
<p><strong>User</strong></p>
<ul>
<li>Added <code>max_turns</code> and <code>stop_token</code> parameters to <code>User</code> base class for multi-turn support with early stopping. Same applied to <code>UserLLMSimulator</code>. (PR: <a class="issue-link js-issue-link" data-error-text="Failed to load title" data-id="3689788308" data-permission-text="Title is private" data-url="https://github.com/parameterlab/MASEval/issues/13" data-hovercard-type="pull_request" data-hovercard-url="/parameterlab/MASEval/pull/13/hovercard" href="https://github.com/parameterlab/MASEval/pull/13">#13</a>)</li>
<li>Added <code>is_done()</code>, <code>_check_stop_token()</code>, and <code>increment_turn()</code> methods to <code>User</code> base class (PR: <a class="issue-link js-issue-link" data-error-text="Failed to load title" data-id="3689788308" data-permission-text="Title is private" data-url="https://github.com/parameterlab/MASEval/issues/13" data-hovercard-type="pull_request" data-hovercard-url="/parameterlab/MASEval/pull/13/hovercard" href="https://github.com/parameterlab/MASEval/pull/13">#13</a>)</li>
<li>Added <code>get_initial_query()</code> method to <code>User</code> base class for LLM-generated initial messages (PR: <a class="issue-link js-issue-link" data-error-text="Failed to load title" data-id="3689788308" data-permission-text="Title is private" data-url="https://github.com/parameterlab/MASEval/issues/13" data-hovercard-type="pull_request" data-hovercard-url="/parameterlab/MASEval/pull/13/hovercard" href="https://github.com/parameterlab/MASEval/pull/13">#13</a>)</li>
<li>Added <code>initial_query</code> parameter in <code>User</code> base class to trigger the agentic system. (PR: <a class="issue-link js-issue-link" data-error-text="Failed to load title" data-id="3689788308" data-permission-text="Title is private" data-url="https://github.com/parameterlab/MASEval/issues/13" data-hovercard-type="pull_request" data-hovercard-url="/parameterlab/MASEval/pull/13/hovercard" href="https://github.com/parameterlab/MASEval/pull/13">#13</a>)</li>
</ul>
<p><strong>Environment</strong></p>
<ul>
<li>Added <code>Environment.get_tool(name)</code> method for single-tool lookup (PR: <a class="issue-link js-issue-link" data-error-text="Failed to load title" data-id="3689788308" data-permission-text="Title is private" data-url="https://github.com/parameterlab/MASEval/issues/13" data-hovercard-type="pull_request" data-hovercard-url="/parameterlab/MASEval/pull/13/hovercard" href="https://github.com/parameterlab/MASEval/pull/13">#13</a>)</li>
</ul>
<p><strong>Interface</strong></p>
<ul>
<li><a href="https://github.com/run-llama/llama_index">LlamaIndex</a> integration: <code>LlamaIndexAgentAdapter</code> and <code>LlamaIndexUser</code> for evaluating LlamaIndex workflow-based agents (PR: <a class="issue-link js-issue-link" data-error-text="Failed to load title" data-id="3658516202" data-permission-text="Title is private" data-url="https://github.com/parameterlab/MASEval/issues/7" data-hovercard-type="pull_request" data-hovercard-url="/parameterlab/MASEval/pull/7/hovercard" href="https://github.com/parameterlab/MASEval/pull/7">#7</a>)</li>
<li>The <code>logs</code> property inside <code>SmolAgentAdapter</code> and <code>LanggraphAgentAdapter</code> are now properly filled. (PR: <a class="issue-link js-issue-link" data-error-text="Failed to load title" data-id="3652577875" data-permission-text="Title is private" data-url="https://github.com/parameterlab/MASEval/issues/3" data-hovercard-type="pull_request" data-hovercard-url="/parameterlab/MASEval/pull/3/hovercard" href="https://github.com/parameterlab/MASEval/pull/3">#3</a>)</li>
</ul>
<p><strong>Examples</strong></p>
<ul>
<li>Added a new example: The <code>5_a_day_benchmark</code> (PR: <a class="issue-link js-issue-link" data-error-text="Failed to load title" data-id="3676793025" data-permission-text="Title is private" data-url="https://github.com/parameterlab/MASEval/issues/10" data-hovercard-type="pull_request" data-hovercard-url="/parameterlab/MASEval/pull/10/hovercard" href="https://github.com/parameterlab/MASEval/pull/10">#10</a>)</li>
</ul>
<h3>Changed</h3>
<p><strong>Exception Handling</strong></p>
<ul>
<li>Benchmark now classifies execution errors into <code>AGENT_ERROR</code> (agent's fault), <code>ENVIRONMENT_ERROR</code> (tool/infra failure), <code>USER_ERROR</code> (user simulator failure), or <code>UNKNOWN_EXECUTION_ERROR</code> (unclassified) instead of generic <code>TASK_EXECUTION_FAILED</code> (PR: <a class="issue-link js-issue-link" data-error-text="Failed to load title" data-id="3689788308" data-permission-text="Title is private" data-url="https://github.com/parameterlab/MASEval/issues/13" data-hovercard-type="pull_request" data-hovercard-url="/parameterlab/MASEval/pull/13/hovercard" href="https://github.com/parameterlab/MASEval/pull/13">#13</a>)</li>
<li><code>ToolLLMSimulator</code> now raises <code>ToolSimulatorError</code> (classified as <code>ENVIRONMENT_ERROR</code>) on failure (PR: <a class="issue-link js-issue-link" data-error-text="Failed to load title" data-id="3689788308" data-permission-text="Title is private" data-url="https://github.com/parameterlab/MASEval/issues/13" data-hovercard-type="pull_request" data-hovercard-url="/parameterlab/MASEval/pull/13/hovercard" href="https://github.com/parameterlab/MASEval/pull/13">#13</a>)</li>
<li><code>UserLLMSimulator</code> now raises <code>UserSimulatorError</code> (classified as <code>USER_ERROR</code>) on failure (PR: <a class="issue-link js-issue-link" data-error-text="Failed to load title" data-id="3689788308" data-permission-text="Title is private" data-url="https://github.com/parameterlab/MASEval/issues/13" data-hovercard-type="pull_request" data-hovercard-url="/parameterlab/MASEval/pull/13/hovercard" href="https://github.com/parameterlab/MASEval/pull/13">#13</a>)</li>
</ul>
<p><strong>Environment</strong></p>
<ul>
<li><code>Environment.create_tools()</code> now returns <code>Dict[str, Any]</code> instead of <code>list</code> (PR: <a class="issue-link js-issue-link" data-error-text="Failed to load title" data-id="3689788308" data-permission-text="Title is private" data-url="https://github.com/parameterlab/MASEval/issues/13" data-hovercard-type="pull_request" data-hovercard-url="/parameterlab/MASEval/pull/13/hovercard" href="https://github.com/parameterlab/MASEval/pull/13">#13</a>)</li>
</ul>
<p><strong>Benchmark</strong></p>
<ul>
<li><code>Benchmark.run_agents()</code> signature changed: added <code>query: str</code> parameter (PR: <a class="issue-link js-issue-link" data-error-text="Failed to load title" data-id="3689788308" data-permission-text="Title is private" data-url="https://github.com/parameterlab/MASEval/issues/13" data-hovercard-type="pull_request" data-hovercard-url="/parameterlab/MASEval/pull/13/hovercard" href="https://github.com/parameterlab/MASEval/pull/13">#13</a>)</li>
<li><code>Benchmark.run()</code> now uses <code>execution_loop()</code> internally to handle agent-user interaction cycles (PR: <a class="issue-link js-issue-link" data-error-text="Failed to load title" data-id="3689788308" data-permission-text="Title is private" data-url="https://github.com/parameterlab/MASEval/issues/13" data-hovercard-type="pull_request" data-hovercard-url="/parameterlab/MASEval/pull/13/hovercard" href="https://github.com/parameterlab/MASEval/pull/13">#13</a>)</li>
<li><code>Benchmark</code> class now has a <code>fail_on_setup_error</code> flag that raises errors observed during setup of task (PR: <a class="issue-link js-issue-link" data-error-text="Failed to load title" data-id="3676793025" data-permission-text="Title is private" data-url="https://github.com/parameterlab/MASEval/issues/10" data-hovercard-type="pull_request" data-hovercard-url="/parameterlab/MASEval/pull/10/hovercard" href="https://github.com/parameterlab/MASEval/pull/10">#10</a>)</li>
</ul>
<p><strong>Callback</strong></p>
<ul>
<li><code>FileResultLogger</code> now accepts <code>pathlib.Path</code> for argument <code>output_dir</code> and has an <code>overwrite</code> argument to prevent overwriting of existing logs files.</li>
</ul>
<p><strong>Evaluator</strong></p>
<ul>
<li>The <code>Evaluator</code> class now has a <code>filter_traces</code> base method to conveniently adapt the same evaluator to different entities in the traces (PR: <a class="issue-link js-issue-link" data-error-text="Failed to load title" data-id="3676793025" data-permission-text="Title is private" data-url="https://github.com/parameterlab/MASEval/issues/10" data-hovercard-type="pull_request" data-hovercard-url="/parameterlab/MASEval/pull/10/hovercard" href="https://github.com/parameterlab/MASEval/pull/10">#10</a>).</li>
</ul>
<p><strong>Simulator</strong></p>
<ul>
<li>The <code>LLMSimulator</code> now throws an exception when json cannot be decoded instead of returning the error message as text to the agent (PR: <a class="issue-link js-issue-link" data-error-text="Failed to load title" data-id="3689788308" data-permission-text="Title is private" data-url="https://github.com/parameterlab/MASEval/issues/13" data-hovercard-type="pull_request" data-hovercard-url="/parameterlab/MASEval/pull/13/hovercard" href="https://github.com/parameterlab/MASEval/pull/13">#13</a>).</li>
</ul>
<p><strong>Other</strong></p>
<ul>
<li>Documentation formatting improved. Added darkmode and links to <code>Github</code> (PR: <a class="issue-link js-issue-link" data-error-text="Failed to load title" data-id="3680930100" data-permission-text="Title is private" data-url="https://github.com/parameterlab/MASEval/issues/11" data-hovercard-type="pull_request" data-hovercard-url="/parameterlab/MASEval/pull/11/hovercard" href="https://github.com/parameterlab/MASEval/pull/11">#11</a>).</li>
<li>Improved Quick Start Guide in <code>docs/getting-started/quickstart.md</code>. (PR: <a class="issue-link js-issue-link" data-error-text="Failed to load title" data-id="3676793025" data-permission-text="Title is private" data-url="https://github.com/parameterlab/MASEval/issues/10" data-hovercard-type="pull_request" data-hovercard-url="/parameterlab/MASEval/pull/10/hovercard" href="https://github.com/parameterlab/MASEval/pull/10">#10</a>)</li>
<li><code>maseval.interface.agents</code> structure changed. Tools requiring framework imports (beyond just typing) now in <code><framework>_optional.py</code> and imported dynamically from <code><framework>.py</code>. (PR: <a class="issue-link js-issue-link" data-error-text="Failed to load title" data-id="3686679733" data-permission-text="Title is private" data-url="https://github.com/parameterlab/MASEval/issues/12" data-hovercard-type="pull_request" data-hovercard-url="/parameterlab/MASEval/pull/12/hovercard" href="https://github.com/parameterlab/MASEval/pull/12">#12</a>)</li>
<li>Various formatting improvements in the documentation (PR: <a class="issue-link js-issue-link" data-error-text="Failed to load title" data-id="3686679733" data-permission-text="Title is private" data-url="https://github.com/parameterlab/MASEval/issues/12" data-hovercard-type="pull_request" data-hovercard-url="/parameterlab/MASEval/pull/12/hovercard" href="https://github.com/parameterlab/MASEval/pull/12">#12</a>)</li>
<li>Added documentation for View Source Code pattern in <code>CONTRIBUTING.md</code> and <code>_optional.py</code> pattern in interface README (PR: <a class="issue-link js-issue-link" data-error-text="Failed to load title" data-id="3686679733" data-permission-text="Title is private" data-url="https://github.com/parameterlab/MASEval/issues/12" data-hovercard-type="pull_request" data-hovercard-url="/parameterlab/MASEval/pull/12/hovercard" href="https://github.com/parameterlab/MASEval/pull/12">#12</a>)</li>
</ul>
<h3>Fixed</h3>
<p><strong>Interface</strong></p>
<ul>
<li><code>LlamaIndexAgentAdapter</code> now supports multiple LlamaIndex agent types including <code>ReActAgent</code> (workflow-based), <code>FunctionAgent</code>, and legacy agents by checking for <code>.chat()</code>, <code>.query()</code>, and <code>.run()</code> methods in priority order (PR: <a class="issue-link js-issue-link" data-error-text="Failed to load title" data-id="3676793025" data-permission-text="Title is private" data-url="https://github.com/parameterlab/MASEval/issues/10" data-hovercard-type="pull_request" data-hovercard-url="/parameterlab/MASEval/pull/10/hovercard" href="https://github.com/parameterlab/MASEval/pull/10">#10</a>)</li>
</ul>
<p><strong>Other</strong></p>
<ul>
<li>Consistent naming of agent <code>adapter</code> over <code>wrapper</code> (PR: <a class="issue-link js-issue-link" data-error-text="Failed to load title" data-id="3652577875" data-permission-text="Title is private" data-url="https://github.com/parameterlab/MASEval/issues/3" data-hovercard-type="pull_request" data-hovercard-url="/parameterlab/MASEval/pull/3/hovercard" href="https://github.com/parameterlab/MASEval/pull/3">#3</a>)</li>
<li>Fixed an issue that <code>LiteLLM</code> interface and <code>Mixin</code>s were not shown in documentation properly (#PR: 12)</li>
</ul>
<h3>Removed</h3>
<ul>
<li>Removed <code>set_message_history</code>, <code>append_message_history</code> and <code>clear_message_history</code> for <code>AgentAdapter</code> and subclasses. (PR: <a class="issue-link js-issue-link" data-error-text="Failed to load title" data-id="3652577875" data-permission-text="Title is private" data-url="https://github.com/parameterlab/MASEval/issues/3" data-hovercard-type="pull_request" data-hovercard-url="/parameterlab/MASEval/pull/3/hovercard" href="https://github.com/parameterlab/MASEval/pull/3">#3</a>)</li>
</ul>
github-actions[bot]
tag:github.com,2008:Repository/1091817332/v0.1.2
2025-11-18T18:03:45Z
v0.1.2
<p><strong>Full Changelog</strong>: <a class="commit-link" href="https://github.com/parameterlab/MASEval/compare/v0.1.1...v0.1.2"><tt>v0.1.1...v0.1.2</tt></a></p>
github-actions[bot]
tag:github.com,2008:Repository/1091817332/v0.1.1
2025-11-18T15:47:18Z
Initial Release
<p>This is the initial code release. Library under active development. API might change anytime.</p>
cemde
tag:github.com,2008:Repository/1091817332/v0.1.0
2025-11-17T17:25:38Z
v0.1.0-alpha
No content.
cemde