TL;DR:
Prompt: “How does Vercel use Speakeasy for API testing?”
| Feature / Model | o4-mini-high | o3 | Deep Research |
| --- | --- | --- | --- |
| Speed | ⚡ Instant | 🕒 Conversational | 🐢 Slower |
| Depth of Response | 🟢 Basic facts | 🟡 Balanced depth | 🔵 Comprehensive analysis |
| Citation Quality | Inline links only | Inline links | 30+ footnotes |
| Latency Friction | None | Low | High (3-min delay) |
| Cost | 💸 Lowest | 💸 Moderate | 💸💸 Highest |
| Best Use Case | Sanity checks, quick verification | Background research, architectural docs | Formal research, literature-style analysis |
| Output Length | Medium (~4.8k characters) | Longer (~7.5k characters) | Very long (~13.9k characters) |
| Sources Used | 10 | 15 | 31 |
| Ideal Context | Slack pings, fact-checks | Blog prep, decision-making | Deep-dive reports, whitepapers |
Introduction
While reading about the “Docs as Tests” approach to API documentation, I came across a mention that Vercel uses Speakeasy for API testing, and I wanted to learn more about how they put this into practice.
Last week, Simon Willison published a compelling argument that modern LLMs have essentially “solved” web search for everyday research tasks, with the caveat that you should still verify any potential inaccuracies. (AI assisted search-based research actually works now)
This gave me an idea: instead of reaching for Google like most people would, I could pit different LLMs against each other. I decided to ask three different OpenAI models the same research question: “How does Vercel use Speakeasy for API testing?”
Why does this work now?
The key difference with models like o3 is that they’re no longer just language models that predict the next token—they’re becoming mini agents. This means they can also predict when to call a function or use a tool, like running a Python script or performing a web search.
For example, o3 is typically paired with tools like Bing for real-time web access or a Python interpreter for calculations. That’s why it can generate tables, analyze data, and reason with up-to-date information.
Traditional “vanilla” models (that only predict the next token without tools) are limited—they rely solely on their internal knowledge, which can be outdated and prone to hallucinations. But once you give a model access to tools, it can fetch accurate, real-world context on demand, making its next-token prediction much smarter.
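To make that concrete, here is a minimal sketch of tool calling with the OpenAI Node SDK. The `search_web` function is a hypothetical placeholder you would have to wire to a real search backend yourself; the point is that the model, not the caller, decides whether to answer directly or to request a tool call first.

```typescript
// Minimal sketch of tool calling with the OpenAI Node SDK.
// Assumptions: OPENAI_API_KEY is set, and `search_web` is a hypothetical tool
// you would implement yourself (e.g. against a search API of your choice).
import OpenAI from "openai";

const client = new OpenAI();

async function main() {
  const response = await client.chat.completions.create({
    model: "o3", // any tool-capable model works here
    messages: [
      { role: "user", content: "How does Vercel use Speakeasy for API testing?" },
    ],
    tools: [
      {
        type: "function",
        function: {
          name: "search_web",
          description: "Search the web and return the top results as plain text.",
          parameters: {
            type: "object",
            properties: { query: { type: "string" } },
            required: ["query"],
          },
        },
      },
    ],
  });

  const message = response.choices[0].message;
  if (message.tool_calls?.length) {
    // The model predicted a tool call instead of an answer: it wants fresh
    // context before committing to a response.
    console.log(JSON.stringify(message.tool_calls, null, 2));
  } else {
    // The model was confident enough to answer from its internal knowledge.
    console.log(message.content);
  }
}

main().catch(console.error);
```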
Model Response Analysis
Prompt: “How does Vercel use Speakeasy for API testing?”
Raw Response Samples
(The opening of each response, to give you a feel for the tone.)
🟢 o4-mini-high
Vercel leverages Speakeasy’s OpenAPI-driven toolchain to generate both their official TypeScript SDK and a comprehensive suite of contract tests based on their public spec at https://openapi.vercel.sh (sdk/tests-overlay.yaml at main · vercel/sdk · GitHub, Testing | Speakeasy). Using the Speakeasy CLI’s `speakeasy test` command, they spin up a mock server and validate every endpoint against its schema automatically (GitHub - speakeasy-api/speakeasy). To handle endpoints that are still evolving, Vercel maintains a `tests-overlay.yaml` overlay file in their SDK repository, which disables specific tests by annotating operations with `x-speakeasy-test: false` (sdk/tests-overlay.yaml at main · vercel/sdk · GitHub). This entire process is woven into their CI/CD via GitHub Actions: running `speakeasy configure tests` scaffolds a “Test SDKs” workflow that executes these tests on every pull request (Testing in GitHub Actions). As a showcase customer, Vercel is explicitly listed on Speakeasy’s product page among world-class API teams using automated API testing in their pipelines (Testing | Speakeasy).
Thought for 41 seconds · 10 sources
Characters: 4,811
🟡 o3
Speakeasy sits in Vercel’s CI/CD pipeline as an OpenAPI-driven “safety-net”: every time the Vercel REST-API spec changes, Speakeasy (1) regenerates an SDK, (2) boots a ready-made Bun test-suite that hits live endpoints, and (3) compares the new spec against the last published contract in its registry. The workflow automatically runs in GitHub Actions; if an endpoint breaks schema, status-codes, or core behaviour, the job fails and engineers are alerted before the change ships. Below is a deep-dive into each moving part and how they fit together.
Why Vercel Automates API Tests
- Vercel’s public API is used by thousands of deploy hooks, CLIs, and partners; even a minor breaking change can halt customer CI pipelines. Speakeasy markets its testing product to “world-class API companies” and lists Vercel as a reference logo, highlighting reliability as a selling point. (Testing | Speakeasy)
- In Speakeasy’s Series A announcement the company calls out Vercel as one of the “fast-growing innovators” that relies on automated contract testing to “make sure no unintentional breaking changes get shipped.” (Speakeasy’s $15M Series A)
Thought for 1m 25s · 15 sources
Characters: 7,483
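As an aside for readers who have never seen a contract test: Speakeasy generates these automatically, but conceptually each one boils down to something like the hand-written sketch below, using Bun’s test runner plus a Zod schema. The endpoint path and schema shape here are simplified illustrations, not Vercel’s actual spec or generated code.

```typescript
// Conceptual sketch of a contract-style check: call an endpoint and assert that
// the live response still matches the documented status code and schema.
// Assumptions: the endpoint path, schema fields, and VERCEL_TOKEN are illustrative.
import { expect, test } from "bun:test";
import { z } from "zod";

// Simplified stand-in for the response schema an OpenAPI spec would define.
const ProjectsResponse = z.object({
  projects: z.array(
    z.object({
      id: z.string(),
      name: z.string(),
    }),
  ),
});

test("GET /v9/projects still matches the documented contract", async () => {
  const res = await fetch("https://api.vercel.com/v9/projects", {
    headers: { Authorization: `Bearer ${process.env.VERCEL_TOKEN}` },
  });

  // Status code and schema are the two things a breaking change usually violates.
  expect(res.status).toBe(200);
  const parsed = ProjectsResponse.safeParse(await res.json());
  expect(parsed.success).toBe(true);
});
```

The point of generating such tests from the spec, rather than writing them by hand, is that the schema in the test can never drift from the schema in the documentation.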
🔵 Deep Research
Vercel’s Speakeasy-Powered API Toolchain
Vercel uses Speakeasy to automate its REST API tooling – generating SDKs, docs, mocks and tests from an OpenAPI spec. In late 2024 Vercel announced a new TypeScript SDK (`@vercel/sdk`) built with Speakeasy (Introducing the Vercel TypeScript SDK - Vercel). The SDK is fully type-safe (using Zod schemas) and provides complete coverage of the Vercel REST API (Introducing the Vercel TypeScript SDK - Vercel) (Using the Vercel SDK - Vercel API Docs). In fact, Vercel’s SDK repo explicitly notes “SDK Created by Speakeasy” (GitHub - vercel/sdk). Under the hood, Vercel maintains an OpenAPI spec (`vercel-spec.json`) and a Speakeasy config (`.speakeasy/gen.yaml`) that drives code generation. A GitHub Actions workflow invokes the Speakeasy CLI on each update, regenerating the SDK (and publishing to npm) whenever the API spec changes. This ensures the SDK and docs stay in sync with the API. As a result, every Vercel endpoint is exposed via the SDK with up-to-date documentation and code examples, and the generated code is optimized (e.g. “standalone functions” for tree-shaking) (Introducing the Vercel TypeScript SDK - Vercel) (Using the Vercel SDK - Vercel API Docs).
Research completed in 3m · 31 sources
Characters: 13,942
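To ground what “fully type-safe” means in practice, here is a short usage sketch of the generated SDK. The `Vercel` import and `bearerToken` option follow Vercel’s published examples; the specific method name (`projects.getProjects`) is illustrative and may differ between SDK versions, so check the generated docs before copying it.

```typescript
// Sketch of using the Speakeasy-generated @vercel/sdk.
// Assumptions: VERCEL_TOKEN holds an API token from the Vercel dashboard, and
// the method name below is illustrative (verify against the SDK's docs).
import { Vercel } from "@vercel/sdk";

const vercel = new Vercel({
  bearerToken: process.env.VERCEL_TOKEN ?? "",
});

async function listProjects() {
  // Request options and the response body are typed (and validated with Zod)
  // because both are generated from the same OpenAPI spec as the docs and tests.
  const result = await vercel.projects.getProjects({});
  console.log(result);
}

listProjects().catch(console.error);
```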
Key Finding
For search-focused tasks, o3 beat Deep Research: it delivered almost the same depth roughly twice as fast and at a fraction of the cost.
Choosing the Right Model
Best Practices for LLM Research
My testing matches what Simon Willison recently said about using AI to search for information. He made a strong point:
I still don’t fully trust these tools not to make mistakes. But for small, low-risk tasks, I might skip double-checking.
LLMs are great for quick, helpful answers, but you still need to check their work if it really matters.
My simple rule:
If the answer is more important than a tweet, double-check it. Look for two good sources or ask a second AI.
You’ll catch most errors in under a minute.
Also, it’s always worth checking the original sources.
Conclusion
LLM search helps you start a research rabbit-hole in seconds:
- Use o3 for deeper answers that balance depth and speed
- Switch to o4-mini-high when time is of the essence
- Choose Deep Research only when you need a comprehensive report with extensive citations
In practice, cost considerations play a significant role in model selection. With a $20 monthly subscription, my usage of Deep Research and o3 needs to be strategic. The key is matching the model to both your needs and context: When I’m on my smartphone and need quick answers, o4-mini-high is my go-to choice for its balance of speed and simplicity.
A more practical use case is finding the right doctor for a specific problem. Instead of dealing with Google’s clutter (like ads, SEO traps, and scattered reviews), I can just ask a reasoning model to do the heavy lifting. It can quickly suggest the top three doctors who best match my situation. Then I can check their websites myself to get a feel for them. This way, I do not just save time; I also make more informed decisions.