Using Claude, GPT, and local LLMs for QA: real examples

LLMs are good at a few specific QA tasks and mediocre at others. This article skips the general "AI will transform testing" pitch and goes straight to the prompts, the outputs, and the situations where each model works best.

I've been using Claude, GPT, and local models (via Ollama) for QA work for the past year. Here's what I've learned: AI saves the most time on tasks that are tedious but well-defined. Generate test cases from a spec. Create 50 test data records instead of typing them by hand. Feed it your failed test results and ask for patterns. Paste a messy bug thread and get clean reproduction steps back.

Where it falls short: anything requiring product intuition, visual judgment, or knowledge of how real users behave. AI can generate 50 test cases for a login form in 30 seconds. It won't notice that the error message font is too small to read on mobile. Those are fundamentally different kinds of work.

The PractiTest State of Testing 2025 report found that AI adoption in testing more than doubled in one year, from 7% to 16%, with test case generation as the primary use case. That 16% is growing fast, but most teams are still figuring out what works. These examples should save you some of that figuring-out time.

Test case generation from requirements

This is the highest-value use case. You have a feature spec. You need test cases. Instead of staring at a blank screen for 30 minutes, you give the spec to an LLM and get a structured first draft in seconds.

The prompt that works

Here's a feature specification for [feature name]:

[Paste the spec, user story, or PRD section]

Generate test cases for this feature. Structure them as:
- Section headers for major test areas (happy path, validation,
  edge cases, error handling, permissions)
- Individual test steps under each header
- Each test step should have: a clear action and an expected result
- Include boundary values for any input fields
- Include at least 3 negative test cases in each section other than
  the happy path

Format: plain text, one test step per line, with expected result
on the same line after " -> ".

What the output looks like

For a password reset feature, Claude might produce something like this (each step is wrapped onto two lines here for readability):

Happy Path
  Enter registered email, click "Reset Password" ->
    Success message shown, reset email sent within 60 seconds
  Click reset link from email within 24 hours ->
    Redirect to "Set New Password" page
  Enter new password meeting all requirements, confirm, submit ->
    Password changed, redirected to login, old password no longer works

Validation
  Enter email with no @ symbol ->
    Error: "Please enter a valid email address"
  Submit form with empty email field ->
    Error: "Email is required"
  New password shorter than 8 characters ->
    Error: "Password must be at least 8 characters"
  New password without a number ->
    Error shown with specific requirement that failed
  Confirmation password doesn't match ->
    Error: "Passwords do not match", form not submitted

Edge Cases
  Click reset link after 24 hours ->
    Error: "This link has expired. Please request a new reset."
  Click same reset link twice ->
    Second click shows "This link has already been used"
  Request reset for email not in system ->
    Same success message as valid email (no user enumeration)
  Request multiple resets, use the oldest link ->
    Oldest link invalid, only most recent link works

That's 80% of what you need. The missing 20% is domain-specific stuff: what happens with SSO accounts? What about users who signed up with Google OAuth and don't have a password? What if the user's account is suspended? You add those based on your knowledge of the product.
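If you want to load this kind of output into a spreadsheet or a test tool yourself, a small parser is usually enough. The sketch below is a minimal example, not part of any tool's API: `Step` and `parse_test_cases` are hypothetical names, and it assumes the " -> " format from the prompt above, accepting the expected result either on the same line or on the following line.

```python
from dataclasses import dataclass


@dataclass
class Step:
    section: str
    action: str
    expected: str


def parse_test_cases(text: str) -> list[Step]:
    """Parse 'action -> expected' lines grouped under section headers."""
    steps: list[Step] = []
    section = ""
    pending_action = None  # action still waiting for its expected result
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if pending_action is not None:
            # The previous line ended with "->", so this line is the result.
            steps.append(Step(section, pending_action, line))
            pending_action = None
        elif "->" in line:
            action, expected = (part.strip() for part in line.split("->", 1))
            if expected:
                steps.append(Step(section, action, expected))
            else:
                pending_action = action
        else:
            # Any other non-empty line is treated as a section header.
            # Model preamble such as "Here are your test cases:" would land
            # here too, so strip that before parsing.
            section = line
    return steps
```

Once the steps are structured, counting cases per section or spotting a section with no negative cases becomes a one-liner instead of a manual read-through.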

With MCP integration, the AI creates these test cases directly in your test management tool instead of outputting plain text. In TestRush, the AI reads your existing scripts, avoids duplicates, and structures items with headers and child items automatically.

Model comparison for test generation

Claude follows structural instructions well. If you ask for headers and child items, you get headers and child items. It tends toward thoroughness, so you sometimes get more cases than you need. Good at validation rules and boundary values. Weak at guessing creative abuse scenarios.

GPT-4 is more creative with edge cases. It'll suggest things like "what if the user changes their email between requesting and completing the reset?" that Claude might miss. But it sometimes ignores formatting instructions and gives you prose instead of structured test cases. You may need to be more explicit about format.

Local models (Llama, Mistral via Ollama) handle simple features fine. For a login form or a CRUD screen, the output is usable. For complex multi-step workflows with state management, they produce test cases that are vague or miss important paths. Use them for straightforward features. Use a cloud model for anything complex.

Bug report analysis

When a bug report is messy (long Slack thread, unclear reproduction steps, conflicting information from different testers), an LLM can extract the signal from the noise.

The prompt

Here's a bug report thread with multiple comments. Extract:
1. Clear reproduction steps (numbered)
2. Expected behavior
3. Actual behavior
4. Environment details mentioned
5. Any conflicting information between commenters

Thread:
[Paste the bug report or Slack thread]

This turns a 20-message Slack thread into a structured bug report that a developer can act on. The time saving is small per bug (maybe 5 minutes), but it adds up if you process 10-15 bugs a week.

Where this breaks down: when the bug is visual ("the button looks weird") or requires product context ("it should redirect to the dashboard, not the settings page"). The LLM can organize the information, but it can't evaluate whether the reported behavior is actually a bug.

Test data generation

Generating realistic test data is one of the least exciting QA tasks, and LLMs handle it well. Instead of manually creating 50 user records with different combinations of names, emails, and roles, you describe what you need.

The prompt

Generate 20 test user records as a JSON array. Each record needs:
- name (mix of ethnicities, include names with accents and
  special characters)
- email (unique, use example.com domain)
- role (distribute: 60% "member", 25% "admin", 15% "viewer")
- created_at (dates spread across the last 12 months, ISO 8601)
- subscription (mix of "free", "pro", "enterprise")

Include edge cases:
- One name with only a single character
- One email at the 254-character limit (use long subdomains of
  example.com, since the part before the @ is capped at 64 characters)
- One user created in the last hour
- One user created 364 days ago

The output is ready to paste into your seed script or test setup. Every model handles this well, including local ones. Test data generation is structured enough that even smaller models produce usable output.
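"Ready to paste" still deserves a quick automated check, because models occasionally repeat an email or invent a role. The sketch below is one way to do that under the field names used in the prompt above; `validate_users` is a hypothetical helper, and the date check relies on Python's `datetime.fromisoformat`, which accepts common ISO 8601 forms but not every variant.

```python
import json
from collections import Counter
from datetime import datetime

ALLOWED_ROLES = {"member", "admin", "viewer"}
ALLOWED_PLANS = {"free", "pro", "enterprise"}


def validate_users(raw_json: str) -> list[str]:
    """Return a list of problems found in generated records (empty means OK)."""
    records = json.loads(raw_json)
    email_counts = Counter(r.get("email", "") for r in records)
    problems = []
    for i, r in enumerate(records):
        email = r.get("email", "")
        local, at, domain = email.rpartition("@")
        # Subdomains of example.com are accepted as well as the bare domain.
        on_example = domain == "example.com" or domain.endswith(".example.com")
        if not (local and at and on_example):
            problems.append(f"record {i}: email is not on example.com")
        if email_counts[email] > 1:
            problems.append(f"record {i}: duplicate email")
        if len(email) > 254:
            problems.append(f"record {i}: email longer than 254 characters")
        if r.get("role") not in ALLOWED_ROLES:
            problems.append(f"record {i}: unexpected role {r.get('role')!r}")
        if r.get("subscription") not in ALLOWED_PLANS:
            problems.append(
                f"record {i}: unexpected subscription {r.get('subscription')!r}"
            )
        try:
            # Older Python versions reject a trailing "Z", so normalize it.
            stamp = str(r.get("created_at", "")).replace("Z", "+00:00")
            datetime.fromisoformat(stamp)
        except ValueError:
            problems.append(f"record {i}: created_at is not ISO 8601")
    return problems
```

An empty list means the records pass these structural checks. It says nothing about whether the edge cases you asked for are actually present, so those are still worth a glance.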

Never paste real user data into a cloud LLM prompt. Use anonymized or synthetic data. If you must work with production-like data, run a local model via Ollama so nothing leaves your machine.
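If a bug thread or log excerpt has to go to a cloud model, scrub it first. The sketch below is a deliberately narrow example: `scrub_emails` is a hypothetical helper that replaces only email addresses, using a simple regular expression rather than a full address parser. Names, API keys, and other identifiers need their own handling.

```python
import re

# Simple pattern for typical addresses; not a complete RFC-compliant matcher.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def scrub_emails(text: str) -> str:
    """Replace each distinct email address with a stable placeholder."""
    seen: dict[str, str] = {}

    def replace(match: re.Match) -> str:
        address = match.group(0).lower()
        if address not in seen:
            seen[address] = f"user{len(seen) + 1}@example.com"
        return seen[address]

    return EMAIL_RE.sub(replace, text)
```

Mapping the same address to the same placeholder keeps the thread readable: the model can still tell that two comments came from the same person.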

SQL test data

LLMs are also good at generating SQL INSERT statements for test data. Give it your table schema and describe the scenarios you need:

Here's my table schema:

CREATE TABLE orders (
  id UUID PRIMARY KEY,
  user_id UUID REFERENCES users(id),
  status TEXT CHECK (status IN ('pending','paid','shipped','delivered','refunded')),
  total_cents INTEGER,
  created_at TIMESTAMPTZ DEFAULT now()
);

Generate INSERT statements for 15 orders that cover:
- At least 2 orders in each status
- One order with total_cents = 0 (free order)
- One order with total_cents = 99999999 (high value)
- Orders spread across the last 6 months
- At least 3 orders from the same user_id

This saves 20-30 minutes per test data setup, especially for schemas with multiple related tables.
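Before running generated INSERTs against a real database, you can check them against the scenarios you asked for. The sketch below loads them into an in-memory SQLite database. It rests on two assumptions: the schema is a simplified SQLite stand-in for the Postgres table above (UUID and TIMESTAMPTZ become TEXT, the foreign key and `now()` default are dropped), and the generated statements use literal values rather than Postgres-only functions or casts. `check_orders` is a hypothetical name.

```python
import sqlite3

# SQLite-compatible stand-in for the Postgres schema (an assumption for
# local checks only; it is not the production schema).
SQLITE_SCHEMA = """
CREATE TABLE orders (
  id TEXT PRIMARY KEY,
  user_id TEXT,
  status TEXT CHECK (status IN ('pending','paid','shipped','delivered','refunded')),
  total_cents INTEGER,
  created_at TEXT
);
"""
STATUSES = ["pending", "paid", "shipped", "delivered", "refunded"]


def check_orders(insert_sql: str) -> list[str]:
    """Load generated INSERTs and report which requested scenarios are missing."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(SQLITE_SCHEMA)
        # Raises sqlite3.Error on invalid SQL, duplicate ids, or a bad status.
        conn.executescript(insert_sql)
        gaps = []
        for status in STATUSES:
            (count,) = conn.execute(
                "SELECT COUNT(*) FROM orders WHERE status = ?", (status,)
            ).fetchone()
            if count < 2:
                gaps.append(f"fewer than 2 orders with status {status}")
        (free,) = conn.execute(
            "SELECT COUNT(*) FROM orders WHERE total_cents = 0"
        ).fetchone()
        if free == 0:
            gaps.append("no order with total_cents = 0")
        (most,) = conn.execute(
            "SELECT MAX(n) FROM (SELECT COUNT(*) AS n FROM orders GROUP BY user_id)"
        ).fetchone()
        if (most or 0) < 3:
            gaps.append("no user_id with at least 3 orders")
        return gaps
    finally:
        conn.close()
```

An empty result means the statements loaded cleanly and cover the status, free-order, and repeat-customer scenarios. The date spread and high-value order could be checked the same way.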

Test result pattern analysis

After a large test run, you might have 300 items with 20 failures scattered across different feature areas. An LLM can spot patterns faster than scrolling through results manually.

The prompt

Here are the failed test items from our latest regression run:

[Paste the list of failures with their section/area]

Analyze the failures and tell me:
1. Are there clusters? (multiple failures in the same area)
2. Are there patterns? (similar error types, similar user flows)
3. Which failures are likely related to the same root cause?
4. What area should I investigate first based on severity and clustering?

For a run where 15 items failed, the LLM might respond: "8 of the 15 failures are in the checkout flow, and 6 of those involve payment processing with non-USD currencies. This looks like a single regression in the currency conversion logic. The remaining 7 failures are scattered across login (2), search (3), and profile (2), with no obvious pattern."

That kind of triage in 10 seconds is worth the prompt. Without it, a QA engineer eyeballs the failures, mentally groups them, and reaches the same conclusion in 10-15 minutes.
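It is worth sanity-checking the counts in a response like that, since models can miscount long lists. A plain tally is enough for the clustering part. This is a minimal sketch that assumes you can export failures as (area, test title) pairs; `cluster_failures` is a hypothetical name.

```python
from collections import Counter


def cluster_failures(failures: list[tuple[str, str]]) -> list[tuple[str, int]]:
    """Count failures per feature area, largest cluster first."""
    counts = Counter(area for area, _title in failures)
    return counts.most_common()


def share_of_total(failures: list[tuple[str, str]], area: str) -> float:
    """Fraction of all failures that fall in the given area."""
    if not failures:
        return 0.0
    return sum(1 for a, _title in failures if a == area) / len(failures)
```

The tally confirms where the failures sit. Linking them to a probable root cause, such as the currency logic in the example above, is the part the LLM adds.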

Coverage gap analysis

If your test management tool supports MCP, the AI can read your entire test repository and identify what's missing.

The prompt (with MCP connected)

Look at all the test scripts in my project. For each feature area:
1. How many test cases exist?
2. Are there obvious gaps? (no error handling tests, no edge cases,
   no permission tests)
3. Which areas have the thinnest coverage?
4. Suggest 5 test cases I should add to the weakest area.

Without MCP, you'd have to export your test cases, paste them into the prompt, and lose the structure. With MCP, the AI reads the scripts directly from your test management tool, preserving the hierarchy and tags. See our MCP setup guide for the configuration steps.
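If you do end up working from an export, the first two questions in that prompt can be answered with a short script before any model is involved. The sketch below assumes each exported test case carries a feature area and a category label; both the export shape and the name `summarize_coverage` are hypothetical, and the category list simply mirrors the test areas used earlier in this article.

```python
from collections import Counter, defaultdict

EXPECTED_CATEGORIES = {
    "happy path", "validation", "edge cases", "error handling", "permissions",
}


def summarize_coverage(cases: list[tuple[str, str]]) -> dict[str, dict[str, object]]:
    """For each feature area, report the case count and the categories with no cases."""
    counts = Counter(area for area, _category in cases)
    seen = defaultdict(set)
    for area, category in cases:
        seen[area].add(category.strip().lower())
    return {
        area: {
            "count": counts[area],
            "missing": sorted(EXPECTED_CATEGORIES - seen[area]),
        }
        for area in counts
    }
```

The script tells you which categories are empty. Suggesting what to put in them is where the model earns its keep.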

Writing prompts that get usable output

After writing hundreds of QA prompts, here's what I've learned about getting output you can actually use.

Be specific about format. "Generate test cases" gives you prose. "Generate test cases as bullet points with action and expected result separated by ->" gives you structured output. Models follow formatting instructions when you give them.

Include your constraints. If your product only supports English, say so. If there's a 255-character limit on usernames, say so. The more constraints you specify, the more relevant the test cases.

Give examples of what you want. One or two examples of your preferred test case format are worth more than a paragraph of description. The model pattern-matches from examples faster than it interprets abstract instructions.

Ask for sections separately. Instead of "generate all test cases for the user management module" (which gives you a messy blob), ask for happy path first, then validation, then edge cases, then permissions. Smaller, focused prompts produce better output.

Specify what to skip. "Don't generate test cases for the UI layout or visual appearance. Focus on functional behavior only." This prevents the model from padding the output with generic "verify the button is visible" cases that waste your review time.
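These five habits are easy to bake into a small template so that every prompt carries them by default. The function below is an illustrative sketch, not a recommended standard: `build_prompt` and its parameters are hypothetical, and the wording is adapted from the prompts in this article.

```python
def build_prompt(
    spec: str,
    section: str,
    constraints: list[str],
    examples: list[str],
    skip: list[str],
) -> str:
    """Assemble a focused test-generation prompt for one section of a feature."""
    parts = [
        f"Here's a feature specification:\n\n{spec}",
        # One section per prompt, instead of asking for everything at once.
        f"Generate {section} test cases only.",
        # Explicit output format.
        'Format: plain text, one test step per line, with expected result '
        'on the same line after " -> ".',
    ]
    if constraints:
        parts.append("Constraints:\n" + "\n".join(f"- {c}" for c in constraints))
    if examples:
        parts.append("Examples of the format I want:\n" + "\n".join(examples))
    if skip:
        parts.append(
            "Do not generate test cases for:\n" + "\n".join(f"- {s}" for s in skip)
        )
    return "\n\n".join(parts)
```

Calling it once per section (happy path, validation, edge cases, permissions) gives you four small prompts that share the same constraints and examples.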

Which model for which task

Here's my honest ranking based on daily use:

Complex test case generation (multi-step workflows): Claude > GPT-4 > local models. Claude's structured output is easier to work with for hierarchical test scripts.

Creative edge case brainstorming: GPT-4 > Claude > local models. GPT is better at "what if?" scenarios that you wouldn't think of yourself.

Test data generation: Any model works. Even a 7B parameter local model produces good JSON test data. Don't pay for a cloud API for this.

Bug report cleanup: Claude = GPT-4 >> local models. Both cloud models handle messy text well. Local models lose context in long threads.

Pattern analysis of test results: Claude = GPT-4. Both do well with structured data analysis. Local models struggle with large datasets.

Compliance and regulatory test cases: GPT-4 > Claude > local models. GPT has slightly better coverage of compliance frameworks (GDPR, HIPAA, PCI DSS) in its training data.

For teams that want a single model for all QA tasks, Claude with MCP connected to TestRush is the smoothest workflow because MCP was designed by Anthropic and Claude's integration is the most mature. But GPT via the ChatGPT desktop app also supports MCP and works well.

TestRush pricing includes MCP access on every plan. Connect Claude, GPT, or a local LLM and start generating test scripts directly in your project. See the demo.

Running LLMs locally for QA

If your organization has data sensitivity requirements, or you just want to avoid cloud API costs, local models are an option. Ollama makes this straightforward.

Setup: Install Ollama, pull a model (ollama pull llama3.1 or ollama pull mistral), and you're running. No API key, no account, no data leaves your machine.

What works locally: Test data generation, simple test case generation, reformatting bug reports, generating SQL fixtures. These tasks don't need the reasoning depth of a large cloud model.

What doesn't work locally (yet): Complex multi-step test scenario generation, analysis of large test suites, anything requiring deep understanding of business logic. The 7B-13B models that run well on consumer hardware don't have enough capacity for these tasks. The 70B models are better but need serious hardware.

MCP with local models: If your MCP client supports Ollama (several do), you can connect a local model to your test management tool the same way you'd connect Claude. The setup is identical except the model runs on your machine. Your test data never leaves your network.
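Outside an MCP client, a pulled model can also be called from a script through Ollama's local HTTP API. The sketch below assumes Ollama is running on its default address (`http://localhost:11434`) and that the model has already been pulled; `build_request` and `generate` are hypothetical helper names, and only the standard library is used.

```python
import json
import urllib.request

# Default local endpoint; adjust if your Ollama instance runs elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt: str, model: str = "llama3.1") -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = "llama3.1") -> str:
    """Send the prompt to the local model and return the generated text."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=120) as resp:
        return json.loads(resp.read())["response"]
```

With `stream` set to false, the server returns a single JSON object, which keeps the script simple. Larger prompts on consumer hardware can take a while, hence the generous timeout.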

Common mistakes

Trusting AI output without review. AI generates plausible test cases that miss your specific business rules. A test case for "verify user can't access admin panel" is only useful if your product actually has an admin panel with specific access rules. Review every generated test case against your product knowledge. Expect to edit or discard 20-30% of AI output. Our guide on AI test case generation covers the review process in detail.

Pasting sensitive data into cloud models. Customer names, emails, internal API keys, database credentials. None of these should go into a cloud LLM prompt. Use anonymized data or run a local model. This sounds obvious, but I've seen teams paste entire database dumps into ChatGPT to "generate test scenarios from real data."

Over-engineering prompts. A simple prompt with one example usually beats a 500-word prompt with detailed instructions. If your first prompt doesn't produce good output, iterate by adding one constraint at a time rather than rewriting the entire prompt.

Using AI for tasks that take 2 minutes manually. If you need 3 test cases for a small bug fix, just write them. The time you spend crafting a prompt, reviewing AI output, and editing is longer than writing 3 test cases by hand. AI shines on volume: 30+ test cases for a new feature, or analysis of 200 test results.

FAQ

Which LLM should I start with?

Start with whatever model you already have access to. If you're using Claude Desktop or ChatGPT, try the test case generation prompt from this article on your next feature. The model matters less than the prompt. When you're ready to integrate AI into your workflow, set up MCP with your preferred model.

How do I know if AI-generated test cases are good enough?

Apply the same review you'd apply to test cases written by a new team member. Can someone execute these steps without asking clarifying questions? Do the expected results match your product's actual behavior? Are the edge cases relevant to your users? If you'd accept the test cases from a junior tester after those checks, they're good enough.

Are local LLMs good enough for QA work?

For test data generation, simple test case creation, and formatting tasks, yes. For complex test scenario design and large-scale analysis, cloud models are still better. A practical approach: use local models for data and formatting, cloud models for test case generation and analysis.

Can I use AI to execute tests, not just generate them?

Not with LLMs directly. LLMs generate text, not browser interactions. Test execution requires either a human clicking through the UI or an automation framework (Selenium, Playwright) running coded tests. AI helps with everything around execution: preparation, generation, analysis. The execution itself is still your job. Keyboard-driven test execution tools make the human part faster.


Frequently asked questions

Which LLM is best for QA work?

Claude and GPT-4 class models produce the most reliable test cases. For structured output like test scripts, Claude tends to follow formatting instructions more closely. GPT is better at creative edge case generation. Local models via Ollama work for simpler tasks but struggle with complex test scenarios.

Can AI replace manual testers?

No. AI generates test cases, analyzes results, and produces test data. It does not click through your application, observe visual behavior, or make judgment calls about user experience. The human tests. The AI handles the preparation and analysis work around testing.

Is it safe to paste production data into an LLM?

Do not paste real user data, credentials, or PII into any cloud LLM. Use anonymized or synthetic data in your prompts. If you need to work with sensitive data, use a local model via Ollama that runs entirely on your hardware.

How accurate are AI-generated test cases?

AI produces good first drafts covering happy paths and common edge cases. It typically misses domain-specific business rules, obscure environmental issues, and bugs that require deep product knowledge. The exact share varies, but expect to edit or remove 20-30% of generated cases.
