Benchmarking AI tools for MPS

As I mentioned earlier, I am developing a CLI tool for MPS. JetBrains has also been at work developing their own tool set, Projectional Agent Toolkit, available now in MPS 2026.1 RC1. At the same time, there have been reports that the no-tooling baseline is already quite powerful with the state-of-the-art models (GPT 5.5 and Opus 4.8). In fact, giving the agent a bad tool may hurt performance: the tool may confuse the agent and cause it to run in circles without making progress. Whether a tool hinders or helps thus depends on the agent harness, the model being used, the tool itself, and the task given to the agent.

This has left me wondering how I can properly evaluate whether a tool is beneficial or harmful for certain tasks. So far, I have been evaluating the performance of my tooling manually. I asked the developers of other MPS AI tooling how they measure their tools, and the answer has been the same: we run it and observe what it does. While this is a workable and in some cases good-enough approach, I felt the need for something more reproducible, automated, and rigorous overall.

This is why I have spent the past several weeks developing a harness for benchmarking MPS tooling. The harness lets you start from a known good state (reproducibility), set up MPS in a way that helps the tools run unattended (automation), and record the results: the final state on disk, the session transcript, and metadata such as the versions of MPS and the coding agent used, the token count, and the estimated cost (rigor).
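To make the "record the results" part concrete, here is a minimal sketch of what a per-run record could look like. It is not the harness's actual schema: the `RunRecord` class, its field names, the `save_record` helper, and the file naming are all assumptions made for illustration, and the example values are placeholders.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class RunRecord:
    """Hypothetical per-run metadata; field names are illustrative only."""
    condition: str           # e.g. "baseline" or "mps-mcp"
    rep: int                 # repetition number within the condition
    mps_version: str         # version of MPS used for the run
    agent_version: str       # version of the coding agent
    total_tokens: int        # token count reported for the session
    estimated_cost_usd: float
    transcript_path: str     # where the session transcript was stored
    final_state_path: str    # snapshot of the project on disk after the run


def save_record(record: RunRecord, out_dir: Path) -> Path:
    """Write the record as JSON next to the other run artifacts."""
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{record.condition}-rep{record.rep}.json"
    path.write_text(json.dumps(asdict(record), indent=2))
    return path


# Placeholder values, not real measurements.
record = RunRecord(
    condition="baseline",
    rep=1,
    mps_version="<mps-version>",
    agent_version="<agent-version>",
    total_tokens=0,
    estimated_cost_usd=0.0,
    transcript_path="transcript.jsonl",
    final_state_path="final-state/",
)

with tempfile.TemporaryDirectory() as tmp:
    saved = save_record(record, Path(tmp))
    saved_name = saved.name
    loaded = json.loads(saved.read_text())
```

Keeping the metadata in one flat, serializable record per run makes it easy to compare runs later without re-reading the transcripts.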

Running a first benchmark with the harness showed that, on a simple prompt to write a Java method manipulating some nodes in the project, the agent scored higher with tooling than without, and more consistently so, but at almost double the average cost.

Condition	Rep	Score	Duration	Cost
baseline	rep1	14/20	28m42s	$9.14
baseline	rep2	5/20	27m44s	$8.92
baseline	rep3	11/20	31m17s	$5.89
mps-mcp	rep1	14/20	21m50s	$9.15
mps-mcp	rep2	15/20	39m24s	$13.37
mps-mcp	rep3	15/20	37m44s	$20.09
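The per-condition averages behind the summary can be recomputed from the table. The sketch below copies the six rows by hand and uses only the Python standard library; the `summarize` function is a hypothetical helper written for this illustration, not part of the harness.

```python
from statistics import mean, stdev

# Rows copied from the table above: (condition, score out of 20, cost in USD).
runs = [
    ("baseline", 14, 9.14),
    ("baseline", 5, 8.92),
    ("baseline", 11, 5.89),
    ("mps-mcp", 14, 9.15),
    ("mps-mcp", 15, 13.37),
    ("mps-mcp", 15, 20.09),
]


def summarize(rows):
    """Group runs by condition; return mean score, score spread, and mean cost."""
    by_condition = {}
    for condition, score, cost in rows:
        by_condition.setdefault(condition, []).append((score, cost))
    summary = {}
    for condition, pairs in by_condition.items():
        scores = [score for score, _ in pairs]
        costs = [cost for _, cost in pairs]
        summary[condition] = {
            "mean_score": mean(scores),
            "score_stdev": stdev(scores),  # sample standard deviation
            "mean_cost": mean(costs),
        }
    return summary


summary = summarize(runs)
for condition, stats in summary.items():
    print(
        f"{condition}: score {stats['mean_score']:.1f} "
        f"(stdev {stats['score_stdev']:.1f}), "
        f"mean cost ${stats['mean_cost']:.2f}"
    )

cost_ratio = summary["mps-mcp"]["mean_cost"] / summary["baseline"]["mean_cost"]
print(f"cost ratio: {cost_ratio:.2f}x")
```

On these six runs, the mean score is 10.0 for the baseline (standard deviation about 4.6) and about 14.7 for mps-mcp (standard deviation about 0.6), with a mean cost ratio of about 1.78. With only three repetitions per condition, these numbers are indicative at best.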

However, the results can definitely provoke many objections:

The task should have been more complex.
The prompt should have been less ambiguous.
The tooling has evolved; a newer version should have been used.
The setup was wrong and was missing parameter foo.
A different model should have been used.
A different agent should have been used.
The results should be scored differently – in my view, for example, the baseline was almost as good as mps-mcp on this particular task, even though the scores suggest otherwise.

Rather than debate each of these objections here, I invite you to try the harness yourself. It is available on GitHub under specificlanguages/mps-ai-benchmarks: check it out, set it up, and let it run.

I have developed it with the help of Fable 5 (during the few days it was available) and Opus 4.8, in Claude Code. It is a set of Python and Bash scripts, currently only supporting macOS and Claude Code. However, support for other operating systems, agents, and tools is probably a single prompt away.

So far this is 99% vibe-coded: I have let agents write all of the code and documentation and decide upon the architecture. If it catches on, I will invest some time in refactoring it.

If you have questions or want help with running the benchmarks, I’m happy to help by email, on Slack, or in office hours.