Most MCPs are Bad. This is How We Make AI Tool Calling Actually Work.

Tori Seidenstein

Dec 8, 2025

From an engineering team that has spent far too many hours testing tools.

Introduction

MCP servers today aren’t good at accomplishing real-world work: they’re inefficient and flaky at completing the task at hand. But MCP servers are not hopeless. They just aren’t functional without engineering workarounds that most teams never discover.

This article is not novel research. It’s simply notes from building, optimizing, breaking, and fixing AI toolsets. It describes how we approach evaluation and how we improve MCP tools on those metrics.

How We Evaluate Tool Calling

Typically, tool calling evals assess how different models perform at using the same set of tools.

We flipped this around and asked: for a single LLM (Sonnet 4.5), which toolset design works best?

To start, we compared an LLM calling an API directly (Clerk, Render, or Attio, for example) against the same LLM using those tools routed through toolsets generated and optimized via Tadata.

For each scenario we measured five metrics (a sketch of the per-scenario record follows the list):

  1. Goal attainment

  2. Runtime

  3. Token usage

  4. Error count

  5. Output quality, using an LLM as a judge on accuracy, completeness, and clarity
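
To make the setup concrete, here is a minimal TypeScript sketch of the per-scenario record such an eval produces. The field names and the 1–5 judge scale are illustrative, not our exact logging schema.

```typescript
// Illustrative shape of one eval scenario result; field names and the 1-5
// judge scale are assumptions for the example, not an exact schema.
interface ScenarioResult {
  scenario: string;                          // e.g. "create a user and invite them"
  toolset: "raw-api" | "tadata-optimized";   // which toolset the agent was given
  goalAttained: boolean;                     // did the agent complete the task?
  runtimeMs: number;                         // wall-clock time for the full run
  tokensUsed: number;                        // prompt + completion tokens, all turns
  errorCount: number;                        // failed tool calls, schema violations, etc.
  judge: {                                   // LLM-as-judge scores, 1-5
    accuracy: number;
    completeness: number;
    clarity: number;
  };
}
```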

With Tadata optimizations, overall we saw:

Goal attainment increased 30% while runtime decreased 50% and token usage decreased 80%.

We are still collecting more data, especially on the success rate of Tadata-optimized toolsets versus existing MCPs, so these numbers are directional, not marketing claims.

Here’s what we built in Tadata to deliver this delta.

Tool Names and Descriptions

This is level 0 – everyone should already be doing it.

In Tadata, we start with an OpenAPI spec or an existing MCP server as the basis for the toolset. From there, we generate precise tool names and structured descriptions by analyzing the input and output schemas and the relationship between endpoints. The result gives AI the context it needs to effectively string tool calls together at runtime.
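
As a rough illustration, here is the kind of tool definition this process aims to produce for a hypothetical billing endpoint. The name, fields, and related-tool hint are made up for the example; the point is that the description encodes inputs, outputs, and how the tool relates to its neighbors.

```typescript
// Hypothetical generated tool metadata for a "list invoices" endpoint.
// The structure and names are illustrative, not Tadata's exact output format.
const listInvoicesTool = {
  name: "billing_list_invoices", // resource + action, unambiguous at a glance
  description: [
    "List invoices for a customer, newest first.",
    "Input: customer_id (string, required); status ('open' | 'paid' | 'void', optional).",
    "Output: array of { id, amount_cents, currency, status, created_at }.",
    "Related: billing_get_invoice returns line-item detail for a single invoice id.",
  ].join("\n"),
  inputSchema: {
    type: "object",
    properties: {
      customer_id: { type: "string" },
      status: { type: "string", enum: ["open", "paid", "void"] },
    },
    required: ["customer_id"],
  },
};
```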

Tool Selection

Building MCPs is fundamentally context engineering. We aim to expose only the minimal set of tools required for the job. Limiting which tools are included in a toolset strengthens security and prevents tool overload, which helps AI use the toolset more reliably.
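
A minimal sketch of what that scoping looks like in practice, with a hypothetical catalog and allowlist:

```typescript
// Minimal sketch of toolset scoping: expose only an allowlisted subset of the
// full catalog for a given job. Tool names here are hypothetical.
type Tool = { name: string; description: string };

const fullCatalog: Tool[] = [
  { name: "users_list", description: "List users in the workspace" },
  { name: "users_delete", description: "Delete a user" },
  { name: "contacts_create", description: "Create a CRM contact" },
  { name: "contacts_search", description: "Search CRM contacts" },
];

function selectTools(catalog: Tool[], allowlist: string[]): Tool[] {
  const allowed = new Set(allowlist);
  return catalog.filter((tool) => allowed.has(tool.name));
}

// An agent that only syncs new signups to the CRM never sees users_delete.
const toolset = selectTools(fullCatalog, ["users_list", "contacts_create", "contacts_search"]);
```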

Tool Batching

Agents normally call tools one at a time. We added tool batching, which allows the agent to parallelize work.

Instead of:

Call tool A on ID 1 → Reason → Call tool A on ID 2 → Reason → Repeat

The agent can perform one tool call with all IDs at once.

This turned out to be one of our biggest practical wins. Without batching, the model burns tokens figuring out what to do next, which IDs remain, and which tool to use. It can also get lazy and stop early before processing everything it should. Every remote call adds latency too, which makes MCP servers painfully slow.
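
A simplified sketch of a batched tool handler, assuming an underlying single-record API call; the per-item result shape is our illustration:

```typescript
// Sketch of a batched tool: one call takes many IDs and returns per-item
// results, so the agent does not loop call -> reason -> call. Illustrative only.
type BatchResult<T> =
  | { id: string; ok: true; value: T }
  | { id: string; ok: false; error: string };

async function getRecordsBatch<T>(
  ids: string[],
  fetchOne: (id: string) => Promise<T>,
): Promise<BatchResult<T>[]> {
  // The underlying single-record calls run in parallel.
  return Promise.all(
    ids.map(async (id) => {
      try {
        return { id, ok: true as const, value: await fetchOne(id) };
      } catch (err) {
        // A failed item is reported, not thrown, so the rest of the batch survives.
        return { id, ok: false as const, error: String(err) };
      }
    }),
  );
}
```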

In our evals, batching plus workflows (next section) made the biggest improvements on the metric of “goal attainment.”

Workflows

MCP servers let AI interact with software in a non-deterministic way, which is powerful but sometimes unpredictable. Workflows give us a way to embed deterministic logic inside that flexible environment so certain processes run the same way every time.

You can think of workflows as predictable/manageable Code Mode (which you can read more about from Cloudflare and Anthropic).

A workflow is essentially a multi-step API sequence with parameter mapping. Creating them is the challenging part. When the desired sequence is obvious, we define it manually. When it isn’t, we let the AI operate with a standard MCP and then run an LLM analysis over the chat history to identify recurring tool-call patterns that should be turned into workflows. Either way, the LLM then calls the workflow as a single compound tool.
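
Here is a hedged sketch of the idea: a workflow as an ordered list of steps, each with a parameter mapping from the workflow input and earlier step outputs. The step names and API wrappers are hypothetical.

```typescript
// Hedged sketch of a workflow: a fixed sequence of API calls with parameter
// mapping between steps, exposed to the LLM as one compound tool.
type StepContext = Record<string, Record<string, unknown>>;

type Step = {
  call: (input: Record<string, unknown>) => Promise<Record<string, unknown>>;
  // Build this step's input from the workflow input and earlier step outputs.
  mapInput: (ctx: StepContext) => Record<string, unknown>;
  saveAs: string;
};

async function runWorkflow(steps: Step[], input: Record<string, unknown>): Promise<StepContext> {
  const ctx: StepContext = { input };
  for (const step of steps) {
    // Deterministic: same input, same sequence, same mapping on every run.
    ctx[step.saveAs] = await step.call(step.mapInput(ctx));
  }
  return ctx;
}

// Stub API wrappers, standing in for real endpoint calls.
const createContact = async (input: Record<string, unknown>) => ({ id: "contact_123", ...input });
const createTicket = async (input: Record<string, unknown>) => ({ id: "ticket_456", ...input });

// "Create a contact, then open a ticket linked to it" as a single compound tool.
const createContactAndTicket: Step[] = [
  { call: createContact, mapInput: (ctx) => ({ email: ctx.input.email }), saveAs: "contact" },
  {
    call: createTicket,
    mapInput: (ctx) => ({ contactId: ctx.contact.id, subject: ctx.input.subject }),
    saveAs: "ticket",
  },
];
```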

Response Filtering

We added response filtering to handle endpoints that return large, uncurated result sets. It allows the LLM to request subsets such as “records where X” after receiving a response.

Response filtering operates on the response values.

In practice, many MCP tools wrap APIs that return paginated data, so the LLM sees only one page at a time. The filter is applied after that page arrives, which means it operates on an incomplete slice of the dataset, and it is easy to filter your way into incorrect conclusions.

Example of a good pattern

The service exposes a query like “people who visited our GitHub repo in the last 7 days” on the server. The API returns exactly that.

Example of a bad pattern

Expose a paginated “get all contacts” endpoint and let the LLM filter locally. That design risks:

  • Missing relevant matches on other pages

  • Returning empty results even though valid records exist

We still support response filtering because sometimes you have no choice. However, our recommendation is clear:

If you are designing an API or MCP tool, put the filters in the API itself. Do not rely on the LLM’s local slice of paginated data.
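
To make that concrete, here is a minimal sketch of the good pattern, where the filter is a tool parameter the server evaluates over the full dataset. The endpoint URL and parameter names are assumptions for the example.

```typescript
// Recommended pattern: the filter is a tool parameter that the server applies
// over all records, not a predicate the LLM runs on one locally held page.
// The endpoint and parameter names below are assumed for illustration.
async function listRepoVisitors(params: {
  repo: string;
  visitedWithinDays: number; // server-side filter over the full dataset
}): Promise<{ id: string; email: string }[]> {
  const url = new URL("https://api.example.com/visitors");
  url.searchParams.set("repo", params.repo);
  url.searchParams.set("visited_within_days", String(params.visitedWithinDays));
  const res = await fetch(url);
  if (!res.ok) throw new Error(`visitor query failed: ${res.status}`);
  return res.json();
}
```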

Response Projection

Projection can be turned on per tool. It lets the LLM specify which fields in the output schema it cares about, and the tool then returns only those fields.

Response projection operates on the response fields.

When we detect that a response would be “too large,” the system automatically triggers response projection and filtering.
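
A minimal sketch of what projection does to a response before it goes back to the model (top-level fields only; a real implementation may support nested paths):

```typescript
// Sketch of response projection: return only the fields the LLM asked for.
function project(
  records: Record<string, unknown>[],
  fields: string[],
): Record<string, unknown>[] {
  return records.map((record) => {
    const slim: Record<string, unknown> = {};
    for (const field of fields) {
      // Copy only requested fields; everything else is dropped from the response.
      if (field in record) slim[field] = record[field];
    }
    return slim;
  });
}

// project(contacts, ["id", "email"]) drops every other field before the
// response goes back to the model.
```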

Response Compression

We implemented lossless JSON compression that preserves all information while removing blank fields and collapsing repeated content. For example, a response like:

[{"id": "a", "label": "green"}, {"id": "b", "label": "green"}, {"id": "c", "label": "green"}, ...]

Becomes

[{"id": "a"}, {"id": "b"}, {"id": "c"}] plus the note “All object labels are green.”

This reduces token usage by 30–40%.
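
A simplified sketch of the lossless step: drop blank fields and factor out any field whose value is identical across every object, reporting it once. This is an illustration of the idea, not our exact algorithm.

```typescript
// Lossless compression sketch: remove blank fields and pull out fields whose
// value repeats across every record, so the value is stated only once.
function compress(records: Record<string, unknown>[]): {
  records: Record<string, unknown>[];
  constants: Record<string, unknown>;
} {
  const constants: Record<string, unknown> = {};
  if (records.length === 0) return { records, constants };

  // A field is constant if every record carries the same value.
  for (const key of Object.keys(records[0])) {
    const value = records[0][key];
    if (records.every((r) => r[key] === value)) constants[key] = value;
  }

  const slimmed = records.map((r) => {
    const out: Record<string, unknown> = {};
    for (const [key, value] of Object.entries(r)) {
      const blank = value === null || value === undefined || value === "";
      if (!blank && !(key in constants)) out[key] = value;
    }
    return out;
  });

  return { records: slimmed, constants };
}

// compress([{ id: "a", label: "green" }, { id: "b", label: "green" }])
// -> { records: [{ id: "a" }, { id: "b" }], constants: { label: "green" } }
```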

We originally experimented with lossy compression that tried to remove or compress token-inefficient fields (e.g., IDs). This produced worse results, so we moved to lossless compression only.

When a JSON response is not too large or deeply nested, we apply another layer of optimization by converting the structure into a markdown table. This reduces token usage by a further 20–30%.
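
A rough sketch of that conversion for small, flat responses (illustrative, not our exact formatter):

```typescript
// Flatten a small, non-nested JSON array into a markdown table, which
// tokenizes more compactly than repeating the key names in every object.
function toMarkdownTable(records: Record<string, string | number | boolean>[]): string {
  if (records.length === 0) return "(no rows)";
  const headers = Object.keys(records[0]);
  const lines = [
    `| ${headers.join(" | ")} |`,
    `| ${headers.map(() => "---").join(" | ")} |`,
    ...records.map((r) => `| ${headers.map((h) => String(r[h] ?? "")).join(" | ")} |`),
  ];
  return lines.join("\n");
}

// toMarkdownTable([{ id: "a", status: "paid" }, { id: "b", status: "open" }])
// | id | status |
// | --- | --- |
// | a | paid |
// | b | open |
```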

Combined with projection and batching, we see an 80%+ reduction in token usage.

We use other optimizations too, such as parameter type casting, along with others not covered here.
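
As one example, a hedged sketch of parameter type casting, coercing the string arguments a model sometimes emits into the types the schema expects:

```typescript
// Sketch of parameter type casting: coerce string arguments the model emits
// ("42", "true") into the declared schema types before calling the API.
function castParam(value: unknown, type: "string" | "number" | "boolean"): unknown {
  if (typeof value !== "string") return value;
  if (type === "number" && value.trim() !== "" && !Number.isNaN(Number(value))) {
    return Number(value);
  }
  if (type === "boolean" && (value === "true" || value === "false")) {
    return value === "true";
  }
  return value; // leave anything ambiguous untouched
}
```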

Future Exploration

We have several next steps planned:

  1. We plan to introduce a “consistency” metric and run each evaluation set multiple times to see how toolset optimizations affect repeatability.

  2. We plan to run head-to-head comparisons of optimized MCP servers versus existing MCP servers. Our experience so far is that many MCPs from well-known companies struggle in practice, and we want to quantify that.

  3. Finally, we want to expand testing across more models. We used Sonnet 4.5 for these evals, and we want to broaden the LLM test set to see how these optimizations generalize.

Conclusion

To experiment with these ideas, you can try any of the optimized MCP toolsets provided in Tadata. We maintain a large set of connectors that already use batching, projection, compression, and workflow structure. You can also generate your own connector if you want to see how the optimizations apply to your own API.