I have a confession to make. For the longest time, my prompt engineering process was basically just changing a word, running the app, and deciding if the output felt better. It was pure vibes-based development.
This approach is fine when you are hacking on a weekend project. It becomes a massive liability when you are deploying AI to production. You change a prompt to fix a specific bug, and suddenly you break three other edge cases you forgot to test. You need a way to measure whether a change actually improved your system across all scenarios.
Promptfoo solves this exact problem. It is an open-source framework that brings test-driven development to the chaotic world of large language models.
Moving past trial and error
When you build traditional software, you write unit tests and expect deterministic functions to return the same output every time. LLMs, by contrast, are messy and unpredictable: ask the same question twice and you may get two slightly different answers.
Promptfoo lets you define a suite of test cases in a simple configuration file. You provide a list of inputs and the expected criteria for the outputs. The tool then runs your prompt against all these inputs and scores the results based on metrics you define.
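A minimal configuration looks something like the sketch below. The prompt text, the ticket variable, and the rubric wording are my own illustrative examples, but the overall shape (`prompts`, `providers`, `tests` with `assert` blocks) follows Promptfoo's standard `promptfooconfig.yaml` format:

```yaml
# promptfooconfig.yaml -- a minimal sketch with made-up example values
prompts:
  - "Summarize this support ticket in one sentence: {{ticket}}"

providers:
  - openai:gpt-4o-mini

tests:
  - vars:
      ticket: "My invoice from March was charged twice."
    assert:
      # Deterministic string check
      - type: contains
        value: invoice
      # Model-graded check against a plain-language rubric
      - type: llm-rubric
        value: Mentions a duplicate or double charge
```

Running `npx promptfoo eval` executes every prompt/test combination and scores each output against its assertions.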
Instead of guessing if adding "think step by step" to your system prompt helped, you get a concrete metric. You can look at a dashboard and see that your success rate went from 82% to 94%. This changes everything about how you iterate on AI features.
Comparing models side by side
One of the hardest parts of building AI apps is deciding which model to use. GPT-4 might be the smartest, but maybe Gemini Flash is fast enough and cheap enough for your specific use case.
I use Promptfoo to run side-by-side comparisons. You can configure it to take the same prompt and run it through OpenAI, Anthropic, Azure, Bedrock, and local models via Ollama.
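Setting this up is mostly a matter of listing multiple providers in the config; the exact model identifiers below are examples and may need updating to whatever you have access to:

```yaml
# List several providers and Promptfoo runs every test against each one.
# Model names here are illustrative -- substitute your own.
providers:
  - openai:gpt-4o
  - anthropic:messages:claude-3-5-sonnet-20241022
  - ollama:chat:llama3.1   # a local model served by Ollama
```

After `npx promptfoo eval` finishes, `npx promptfoo view` opens the matrix view in your browser so you can compare outputs cell by cell.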
The web viewer generates a visual matrix showing how each model responded to your test cases. This makes it incredibly obvious when a cheaper model fails at complex reasoning tasks. It also gives you the hard data you need to justify switching providers if a new model actually performs better for your specific application. It even supports live reload and caching, which makes the developer experience incredibly fast.
Evaluating RAG pipelines
Retrieval-Augmented Generation adds another layer of complexity. You are not just testing the model; you are also testing your retrieval pipeline.
Promptfoo has built-in tools for evaluating RAG systems. It measures factuality and verifies the model is actually using the context you provided. I find this extremely helpful for tracking down whether a hallucination was caused by a bad prompt or simply bad search results from the database.
You can automatically check if the model ignores irrelevant context or if it correctly says "I don't know" when the answer is missing. This prevents your bot from confidently making up facts when it cannot find the relevant documents.
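Promptfoo ships model-graded assertion types aimed at exactly these RAG failure modes. The query and context values below are invented for illustration, and the thresholds are arbitrary starting points you would tune for your own data:

```yaml
# Sketch of RAG-oriented assertions; vars and thresholds are examples only.
tests:
  - vars:
      query: "What is our refund window?"
      context: "Refunds are accepted within 30 days of purchase."
    assert:
      # Did the answer stick to the provided context, or did it hallucinate?
      - type: context-faithfulness
        threshold: 0.8
      # Does the answer actually address the user's question?
      - type: answer-relevance
        threshold: 0.7
```

Because these checks are model-graded, they cost an extra LLM call per assertion, but they give you a repeatable signal for whether a regression came from retrieval or from the prompt itself.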
Running in the background
Like Promptfoo's red teaming tools, the evaluation suite plugs directly into your continuous integration pipeline.
When a developer opens a pull request that modifies a core prompt, the CI system runs the full Promptfoo evaluation suite. If the overall quality score drops below a certain threshold, the PR gets blocked. This gives your team the confidence to iterate quickly without fear of degrading the user experience.
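A GitHub Actions workflow for this gate might look like the sketch below. The file path, trigger paths, and secret name are assumptions you would adapt to your repository; the key point is that `promptfoo eval` exits non-zero when assertions fail, which is what blocks the PR:

```yaml
# .github/workflows/prompt-eval.yml -- a sketch; paths and secrets are examples
name: Prompt evaluation
on:
  pull_request:
    paths:
      - "prompts/**"
      - "promptfooconfig.yaml"
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # A failing assertion makes this step (and the check) fail
      - run: npx promptfoo eval --config promptfooconfig.yaml
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
```

Caching the eval results between runs keeps this step cheap when only unrelated files change.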
Official Links
- GitHub Repository: https://github.com/promptfoo/promptfoo
- Project Page / Demo: https://www.promptfoo.dev/
Conclusion
Vibes are not a valid engineering metric. If you are serious about building reliable AI applications, you need automated evaluations. Try setting up Promptfoo on your current project and running a baseline evaluation. You will finally know exactly how well your prompts are performing, and you will never go back to manual testing again.