20–50 prompts to test per service
4 AI models in a thorough test
Monthly: recommended testing cadence

Why prompt testing is non-negotiable

You cannot optimize what you do not measure. The biggest difference between GEO and traditional SEO is that there is no "Search Console" telling you which queries surface your brand in AI answers. The only reliable way to know is to run the tests yourself, systematically and repeatedly.

AI answers are also probabilistic. The same prompt may produce a slightly different answer the next time. A single test is not the truth — you need repetitions, multiple models, and logged data to see the trend.

Regular testing reveals three things: (1) which questions you appear in, (2) where competitors are ahead, and (3) which piece of your content AI cites. Only then do you know where to focus optimization.

  • No ready-made Search Console data for AI answers
  • AI answers are probabilistic — repetition is essential
  • Testing reveals what AI actually cites from you
  • Measurement is the only way to spot competitor moves

Building the prompt matrix — four prompt classes

A good prompt matrix has 20–50 prompts per service or topic split across four classes. Together they give a complete picture of where in the buying journey AI mentions you — and where it does not.

The first class is brand-led ("What does AlgoTerra do?"). The second is comparative ("What is the best GEO agency in Finland?"). The third is solution-led ("How do I get my business visible in ChatGPT?"). The fourth is problem-led ("Why is my site traffic dropping in 2026?").

Classes 3 and 4 matter most — they are research-stage customer questions where the fight for the slot is hardest and the win is biggest. Brand-led prompts mostly tell you whether AI knows you as an entity.

  • Class 1: Brand-led (5–10 prompts)
  • Class 2: Comparative (5–15 prompts)
  • Class 3: Solution-led (10–15 prompts) — most important
  • Class 4: Problem-led (5–10 prompts)
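
The class ranges above can be checked mechanically before a testing round. The following is a minimal sketch, not a prescribed tool: the names `PromptClass`, `CLASSES`, and `check_matrix` are hypothetical, and the only figures taken from this guide are the per-class prompt ranges.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptClass:
    name: str
    min_prompts: int
    max_prompts: int


# Ranges copied from the class list above; the dictionary keys are hypothetical.
CLASSES = {
    "brand": PromptClass("Brand-led", 5, 10),
    "comparative": PromptClass("Comparative", 5, 15),
    "solution": PromptClass("Solution-led", 10, 15),
    "problem": PromptClass("Problem-led", 5, 10),
}


def check_matrix(matrix: dict[str, list[str]]) -> list[str]:
    """Return one warning per class whose prompt count is outside its range."""
    warnings = []
    for key, cls in CLASSES.items():
        count = len(matrix.get(key, []))
        if not cls.min_prompts <= count <= cls.max_prompts:
            warnings.append(
                f"{cls.name}: {count} prompts "
                f"(expected {cls.min_prompts}-{cls.max_prompts})"
            )
    return warnings


# A draft matrix holding only the four example prompts from this guide.
draft = {
    "brand": ["What does AlgoTerra do?"],
    "comparative": ["What is the best GEO agency in Finland?"],
    "solution": ["How do I get my business visible in ChatGPT?"],
    "problem": ["Why is my site traffic dropping in 2026?"],
}
for warning in check_matrix(draft):
    print(warning)
```

With one prompt per class, every class is reported as too small, which is the expected signal that the draft still needs to be filled out.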

Prompt structure — ask as a customer would

Do not write prompts in marketing language. A customer does not type "B2B SaaS customer acquisition solutions" into ChatGPT — they type "how do I get more B2B sales". Use natural language and open questions.

A good prompt is specific but open: tight enough that AI gives a concrete answer, but open enough that several brands could be relevant in the answer. "Best GEO agency in Helsinki" beats both "GEO agency" and "AlgoTerra GEO".

Test each prompt in at least three AI models: ChatGPT (GPT-4/5), Perplexity, and Google AI Overviews. Add Claude and Microsoft Copilot (formerly Bing Chat) if your industry has international demand. The same prompt can return completely different sources across models.

  • Use natural language, not marketing jargon
  • Specific but open — several brands could be relevant
  • Include location, industry, or context where relevant
  • Test at minimum ChatGPT, Perplexity, and Google AI Overviews

[Figure: prompt matrix table with the 4 prompt classes as rows and 4 AI models as columns; cells show whether the brand is mentioned.]
A practical prompt matrix: rows are prompts, columns are AI models. Green = brand mentioned, yellow = mentioned in passing, red = not present. This is your Share of Voice baseline.
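
The matrix described in the caption can be kept as a simple grid keyed by prompt and model. This is a rough sketch under assumptions: the status labels `mentioned`, `passing`, and `absent` and the function names are hypothetical, and only the colour coding comes from the caption.

```python
from itertools import product

# Colour coding from the caption; the status names are hypothetical labels.
COLOURS = {"mentioned": "green", "passing": "yellow", "absent": "red"}


def build_grid(prompts, models):
    """Create one empty cell per (prompt, model) pair."""
    return {(prompt, model): None for prompt, model in product(prompts, models)}


def render(grid, prompts, models):
    """Return plain-text rows: the prompt, then one colour per model."""
    rows = []
    for prompt in prompts:
        cells = [COLOURS.get(grid[(prompt, model)], "untested") for model in models]
        rows.append(f"{prompt} | " + " | ".join(cells))
    return rows


prompts = ["Best GEO agency in Helsinki"]
models = ["ChatGPT", "Perplexity", "Google AI Overviews"]
grid = build_grid(prompts, models)
grid[(prompts[0], "ChatGPT")] = "mentioned"
grid[(prompts[0], "Perplexity")] = "passing"
print("\n".join(render(grid, prompts, models)))
# Best GEO agency in Helsinki | green | yellow | untested
```

Cells that have not been tested yet stay visibly "untested" rather than defaulting to red, so a missing run is never mistaken for a missing mention.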

Share of Voice — what the metric actually means

Share of Voice (SoV) is the single most important AI visibility metric. It tells you in what share of your audience’s prompts your brand is mentioned, compared with competitors. A 30 % SoV means your brand is mentioned in 30 of every 100 prompt runs, where each prompt tested in each model counts as one run.

SoV is calculated as: (brand mentions) / (total prompts × models) × 100 %. A realistic 6-month target is 15–30 % on a narrow topic. Above 50 % is market-leader territory.

Also measure competitor SoV. If your visibility is 5 % and a competitor’s is 40 %, you know there is a large gap to close. A deeper process is in our GEO audit and Share of Voice article.

  • SoV = brand mentions / (prompts × models) × 100 %
  • Realistic 6-month target: 15–30 % on a narrow topic
  • Always measure vs. 3–5 core competitors
  • Track the monthly trend, not a single measurement
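
The SoV formula can be written out as a short function. This is an illustrative sketch: the function name `share_of_voice` is hypothetical, and it assumes that every mention counts equally, including mentions in passing, which this guide does not specify.

```python
def share_of_voice(mentions: int, prompts: int, models: int) -> float:
    """SoV in percent: brand mentions / (prompts x models) x 100."""
    runs = prompts * models  # each prompt tested in each model is one run
    if runs <= 0:
        raise ValueError("need at least one prompt and one model")
    if not 0 <= mentions <= runs:
        raise ValueError("mentions must be between 0 and the number of runs")
    return 100.0 * mentions / runs


# 30 prompts in 3 models = 90 runs; 27 mentions is a 30 % SoV.
print(share_of_voice(27, 30, 3))  # 30.0

# Gap to a competitor, in percentage points (40 prompts x 3 models = 120 runs).
own = share_of_voice(6, 40, 3)          # 5.0
competitor = share_of_voice(48, 40, 3)  # 40.0
print(competitor - own)                 # 35.0
```

Running the same function over each competitor's mention count on the same prompt set keeps the comparison like for like.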

Logging results and scaling with automation

Manual testing is essential in the early phase because you see for yourself what AI says about your brand and competitors. Log results in a simple table: prompt, model, brand mentioned y/n, tone of context, cited sources, and competitors in the same answer.

Once the baseline is clear, automation speeds things up. Tools like Profound, AthenaHQ, Goodie, and Otterly automate prompt testing and surface trends. They do not replace manual interpretation but they free time for optimization.

Always run tests on the same weekday and time, from the same location, on the same model version. Even small variation can change the answer — comparability comes only from standardized conditions.

  • Manual testing first — you learn what AI actually says
  • Table: prompt, model, mentioned, tone, sources, competitors
  • Automate after that: Profound, AthenaHQ, Goodie, Otterly
  • Standardize weekday, time, and location
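
The results table can live in a plain CSV file. The sketch below uses Python's standard `csv` module; the `PromptResult` class and `write_log` function are hypothetical names, and the `run_date` column is an added assumption so that monthly rounds can be told apart. The other columns follow the table described above.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class PromptResult:
    run_date: str     # assumption: ISO date of the testing round
    prompt: str
    model: str
    mentioned: bool   # brand mentioned y/n
    tone: str         # tone of the context, e.g. "positive"
    sources: str      # cited sources, semicolon-separated
    competitors: str  # competitors in the same answer, semicolon-separated


def write_log(results, handle):
    """Write results as CSV rows with a header line."""
    writer = csv.DictWriter(handle, fieldnames=[f.name for f in fields(PromptResult)])
    writer.writeheader()
    for result in results:
        row = asdict(result)
        row["mentioned"] = "y" if result.mentioned else "n"
        writer.writerow(row)


# Demo with made-up values, written to memory instead of a file.
buffer = io.StringIO()
write_log(
    [PromptResult("2026-01-05", "Best GEO agency in Helsinki", "Perplexity",
                  True, "neutral", "example.com", "CompetitorA")],
    buffer,
)
print(buffer.getvalue())
```

To write to disk instead, open the file with `newline=""` as the `csv` documentation recommends, and pass the file handle in place of the buffer.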

Measurement results in numbers

30+ prompts in a thorough test
4 AI models to cover
120+ data points per testing round
15–30 %: realistic Share of Voice in 6 months

Most common prompt testing mistakes

The biggest mistake is testing only with your own brand name. "What does my-company do?" prompts only tell you whether AI knows you as an entity, not whether you appear in actual research. At least 80 % of prompts should contain no brand name.

A second mistake is testing while logged in. ChatGPT remembers conversation history and personalizes answers. The result may look like you are everywhere when nobody else sees you.

A third is one-off testing. AI models update, indexes change, competitors optimize. One month’s result is a baseline — the truth shows in the 3–6 month trend.

  • Brand-only prompts → little learning value
  • Logged-in testing → personalization skews results
  • One-off test → no trend, no stability signal
  • Single model only → missing 60–80 % of the picture
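
The 80 % rule for brand-free prompts is easy to verify on a prompt list. This is a simple sketch that assumes a case-insensitive substring match is enough to detect a brand name; `brand_free_share` is a hypothetical helper name.

```python
def brand_free_share(prompts: list[str], brand_terms: list[str]) -> float:
    """Percentage of prompts that contain none of the brand terms."""
    if not prompts:
        raise ValueError("prompt list is empty")
    terms = [term.lower() for term in brand_terms]
    brand_free = sum(
        1 for prompt in prompts
        if not any(term in prompt.lower() for term in terms)
    )
    return 100.0 * brand_free / len(prompts)


# The four example prompts from this guide: one of them names the brand.
prompts = [
    "What does AlgoTerra do?",
    "What is the best GEO agency in Finland?",
    "How do I get my business visible in ChatGPT?",
    "Why is my site traffic dropping in 2026?",
]
share = brand_free_share(prompts, ["AlgoTerra"])
print(share)          # 75.0
print(share >= 80.0)  # False: this set needs more brand-free prompts
```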

Prompt testing checklist

Following this list gives you a repeatable, comparable set of test results that can drive optimization work.

Once the matrix is running, move to AI-citable content for concrete improvements. Need help with measurement? See our GEO service.

  • Build 20–50 prompts across 4 classes
  • Test at minimum ChatGPT, Perplexity, and Google AI Overviews
  • Use incognito and a standardized location
  • Log results in a table: prompt, model, mentioned, tone, sources
  • Compute Share of Voice for yourself and 3–5 competitors
  • Run tests monthly on the same cadence
  • Identify 5 prompts where a competitor is ahead — optimize those first

Frequently asked questions

How do I test whether my company appears in ChatGPT?

Build a 20–50 prompt matrix across brand, comparative, solution, and problem classes. Test each prompt in an incognito window in at least ChatGPT, Perplexity, and Google AI Overviews. Log results in a table and repeat monthly.

What is Share of Voice in GEO?

Share of Voice is the percentage of tested prompt runs in which your brand is mentioned in the AI answer. Formula: brand mentions divided by (total prompts × number of models tested), times 100 %. A realistic 6-month target is 15–30 % on a narrow topic.

Which tools should I use for GEO prompt testing?

Start manually in ChatGPT, Perplexity, and Google AI Overviews — that is how you learn what AI actually says. Then automate with tools like Profound, AthenaHQ, Goodie, or Otterly. Do not replace manual interpretation entirely with automation.

How often should I run prompt tests?

Monthly works for most B2B and B2C companies. AI models update often and competitors optimize continuously, so longer intervals leave you blind to changes. During larger campaigns, biweekly testing is justified.

Why does the same prompt give different answers on different runs?

AI models are probabilistic: they pick each word based on probabilities. That is why one test is not the truth — you need 3–5 repetitions per prompt and standardized conditions to separate trend from noise.
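
Repetitions can be folded into a mention rate per prompt and model, which is steadier than any single run. The following is a minimal sketch with a hypothetical `mention_rates` helper; it assumes each run is logged as a (prompt, model, mentioned) tuple.

```python
from collections import defaultdict


def mention_rates(runs):
    """Map (prompt, model) to the fraction of repeated runs with a mention."""
    counts = defaultdict(lambda: [0, 0])  # [mentions, total runs]
    for prompt, model, mentioned in runs:
        cell = counts[(prompt, model)]
        cell[0] += int(mentioned)
        cell[1] += 1
    return {key: hits / total for key, (hits, total) in counts.items()}


# Five repetitions of one prompt in one model, with made-up outcomes.
runs = [
    ("Best GEO agency in Helsinki", "ChatGPT", True),
    ("Best GEO agency in Helsinki", "ChatGPT", False),
    ("Best GEO agency in Helsinki", "ChatGPT", True),
    ("Best GEO agency in Helsinki", "ChatGPT", True),
    ("Best GEO agency in Helsinki", "ChatGPT", False),
]
print(mention_rates(runs))
# {('Best GEO agency in Helsinki', 'ChatGPT'): 0.6}
```

A rate of 0.6 says the brand appears in most runs but not reliably, which a single test would have reported as either a clean yes or a clean no.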