
Stop eyeballing your prompt changes: how Promptfoo brings actual metrics to AI development

Learn how the open-source tool Promptfoo helps developers move past trial-and-error prompt engineering by introducing automated, metric-driven evaluations.

I have a confession to make. For the longest time, my prompt engineering process was basically just changing a word, running the app, and deciding if the output felt better. It was pure vibes-based development.

This approach is fine when you are hacking on a weekend project. It becomes a massive liability when you are deploying AI to production. You change a prompt to fix a specific bug, and suddenly you break three other edge cases you forgot to test. You need a way to measure whether a change actually improved your system across all scenarios.

Promptfoo solves this exact problem. It is an open-source framework that brings test-driven development to the chaotic world of large language models.

Moving past trial and error

When you build traditional software, you write unit tests. You expect deterministic functions to return the same output every time. LLMs are messy and unpredictable. They give you a slightly different answer each time you ask.

Promptfoo lets you define a suite of test cases in a simple configuration file. You provide a list of inputs and the expected criteria for the outputs. The tool then runs your prompt against all these inputs and scores the results based on metrics you define.
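A minimal configuration might look like the sketch below. The prompt, variable names, and assertion values are illustrative, not taken from any real project; the file name and assertion types (`icontains`, `llm-rubric`) follow Promptfoo's documented conventions.

```yaml
# promptfooconfig.yaml -- a minimal, illustrative example
prompts:
  - "Summarize the following support ticket in one sentence: {{ticket}}"

providers:
  - openai:gpt-4o-mini

tests:
  - vars:
      ticket: "My invoice from March was charged twice."
    assert:
      # Deterministic check: the summary must mention the invoice
      - type: icontains
        value: "invoice"
      # Model-graded check: an LLM scores the output against a rubric
      - type: llm-rubric
        value: "Is a single, accurate sentence summarizing the ticket"
```

Running `npx promptfoo@latest eval` then executes every test case and scores the results.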

Instead of guessing if adding "think step by step" to your system prompt helped, you get a concrete metric. You can look at a dashboard and see that your success rate went from 82% to 94%. This changes everything about how you iterate on AI features.

Comparing models side by side

One of the hardest parts of building AI apps is deciding which model to use. GPT-4 might be the smartest, but maybe Gemini Flash is fast enough for your specific use case.

I use Promptfoo to run side-by-side comparisons. You can configure it to take the same prompt and run it through OpenAI, Anthropic, Azure, Bedrock, and local models via Ollama.
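Switching to a multi-model comparison is mostly a matter of listing more providers in the same config. The model IDs below are examples; substitute whichever models you have access to.

```yaml
# Run the same prompts and tests against several providers at once.
providers:
  - openai:gpt-4o
  - anthropic:messages:claude-3-5-sonnet-20241022
  - ollama:chat:llama3   # local model via Ollama, no API cost
```

After the eval finishes, `npx promptfoo view` opens the web viewer with the side-by-side matrix.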

The web viewer generates a visual matrix showing how each model responded to your test cases. This makes it incredibly obvious when a cheaper model fails at complex reasoning tasks. It also gives you the hard data you need to justify switching providers if a new model actually performs better for your specific application. It even supports live reload and caching, which makes the developer experience incredibly fast.

Evaluating RAG pipelines

Retrieval-Augmented Generation adds another layer of complexity. You are not just testing the model; you are also testing your search pipeline.

Promptfoo has built-in tools for evaluating RAG systems. It measures factuality and verifies the model is actually using the context you provided. I find this extremely helpful for tracking down whether a hallucination was caused by a bad prompt or simply bad search results from the database.

You can automatically check if the model ignores irrelevant context or if it correctly says "I don't know" when the answer is missing. This prevents your bot from confidently making up facts when it cannot find the relevant documents.
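Promptfoo's built-in RAG assertions cover checks along these lines. The assertion names below follow its documented model-graded metrics; the thresholds and test data are illustrative assumptions.

```yaml
# Illustrative RAG test case: did the answer stay grounded in the context?
tests:
  - vars:
      query: "What is our refund window?"
      context: "Refunds are accepted within 30 days of purchase."
    assert:
      # Is the answer supported by the retrieved context (no hallucination)?
      - type: context-faithfulness
        threshold: 0.8
      # Was the retrieved context actually relevant to the query?
      - type: context-relevance
        threshold: 0.8
```

When a faithfulness check fails but relevance passes, the prompt is likely at fault; when relevance fails, look at your retrieval step instead.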

Running in the background

Just like Promptfoo's red teaming tools, the evaluation suite plugs directly into your continuous integration pipeline.

When a developer opens a pull request that modifies a core prompt, the CI system runs the full Promptfoo evaluation suite. If the overall quality score drops below a certain threshold, the PR gets blocked. This gives your team the confidence to iterate quickly without fear of degrading the user experience.
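A CI setup for this can be sketched as a small GitHub Actions workflow. The file paths and trigger filters below are assumptions for illustration; `promptfoo eval` exits with a non-zero code when assertions fail, which is what blocks the PR.

```yaml
# .github/workflows/prompt-eval.yml -- an illustrative sketch
name: Prompt evaluation
on:
  pull_request:
    paths:
      - "prompts/**"
      - "promptfooconfig.yaml"
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # A failing assertion fails this step, which blocks the PR
      - run: npx promptfoo@latest eval
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
```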

  • GitHub Repository: https://github.com/promptfoo/promptfoo
  • Project Page / Demo: https://www.promptfoo.dev/

Conclusion

Vibes are not a valid engineering metric. If you are serious about building reliable AI applications, you need automated evaluations. Try setting up Promptfoo on your current project and running a baseline evaluation. You will finally know exactly how well your prompts are performing, and you will never go back to manual testing again.

SmallAI Team

From Gems of AI

Frequently Asked Questions

How does Promptfoo help with prompt engineering?

Promptfoo allows developers to run automated evaluations on their prompts to compare performance quantitatively instead of relying on subjective manual testing.

Can Promptfoo compare different LLM models?

Yes, you can configure Promptfoo to run the same test suite across multiple models like GPT-4, Claude, and Gemini to see which performs best for your specific use case.

How do you view Promptfoo evaluation results?

Promptfoo provides a web viewer that displays a matrix comparing outputs from different prompts and models side-by-side, making it easy to identify regressions.

Is Promptfoo an open-source tool?

Yes, Promptfoo is completely open-source and MIT-licensed. It is driven by a large community of developers building AI applications.

Does Promptfoo support local LLMs?

Promptfoo supports local models through Ollama. This lets you run comprehensive evaluations without incurring API costs.

Can I use Promptfoo for RAG pipelines?

Yes, Promptfoo can evaluate Retrieval-Augmented Generation systems to measure factuality and verify the model is retrieving the correct context.

Can Promptfoo be used in CI/CD?

Yes, developers often integrate Promptfoo into their continuous integration pipelines to automatically test prompts against regressions on every commit.

Does Promptfoo charge for running evaluations?

No, Promptfoo is an open-source tool, meaning you can run unlimited evaluations locally for free, though you still pay for your own API usage to the model providers.

Can I use Promptfoo to test system prompts?

Yes, you can easily compare different versions of your system prompt across a standardized dataset to measure which one yields better results.

How do I integrate Promptfoo into GitHub Actions?

Promptfoo offers a dedicated CLI command and documentation specifically for CI/CD integration, allowing you to run your evaluation suite on every pull request.
