© 2026 IQRush. All Rights Reserved.
Site by ONBOX
AI search visibility you can defend.
Whether you're building, buying, or briefing on AI search, get decision-grade data that holds.
Evertune’s Luxury Handbag Case Study: What’s Behind the Headline Number?

Kevin McCabe
CRO
5 min read

Evertune’s luxury handbag case study reports an AI Brand Score moving from 14 to 45 on ChatGPT in six weeks, a 221% increase. It’s a big number built on a proprietary metric whose formula isn’t published. Evertune defines AI Brand Score as measuring “the probability of AI driving attention to a brand unaided,” which means the 221% increase is a change in a modeled probability, not a verified count of how often the brand actually appeared in AI answers.
The case study tells a clear story about why the number moved: there’s a named attribute gap, a targeted content strategy, and a before-and-after result. What’s missing is the measurement work that would tell a buyer whether the 221% headline is a durable result or a favorable reading of a system that moves on its own.
Every AI visibility vendor in 2026 faces the same problem: the engines they measure don’t return the same answer twice. That makes before-and-after claims harder to support than they look in a case study.
The claim
A luxury handbag brand’s AI Brand Score went from 14 to 45 on ChatGPT (221% increase) and from 20 to 44 on Gemini (120% increase). Overall ChatGPT visibility jumped from 29% to 89%. Sentiment moved from 30 to 63. The results landed inside a six-week window, but there’s no disclosure of how many prompts were tracked, how many times each one was run, or what confidence bands surround any of the reported numbers. The methodology behind AI Brand Score itself isn’t published anywhere a buyer can find it.
Here’s what has to be true for the headline number to hold up.
What you’d have to believe
1. The score is built on repeated measurements, not two standalone runs.
The case study reports four metrics across two engines. Every one of them rests on the same question: how many times was each prompt run?
Ask ChatGPT to recommend luxury handbags five times in a row and you’ll get five different sets of brands, in different orders, with different framing. Published research from other vendors in this category has found that only about 30% of brands persist from one AI answer to the next. That means a single-pass reading is a snapshot in time. Averaging across multiple runs turns a snapshot into a measurement. The case study doesn’t say which one Evertune did.
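A minimal sketch of the difference between a single pass and a repeated measurement, assuming each run of a prompt is recorded as the list of brands the answer named. The brand names, the `mention_rate` and `persistence` helpers, and the five sample runs are all hypothetical; nothing here reflects how Evertune or any other vendor computes its numbers.

```python
from collections import Counter


def mention_rate(runs):
    """Fraction of runs in which each brand appears at least once."""
    counts = Counter(brand for run in runs for brand in set(run))
    return {brand: n / len(runs) for brand, n in counts.items()}


def persistence(runs):
    """Average share of brands in one answer that reappear in the next."""
    shares = []
    for prev, nxt in zip(runs, runs[1:]):
        prev_set = set(prev)
        if prev_set:
            shares.append(len(prev_set & set(nxt)) / len(prev_set))
    return sum(shares) / len(shares) if shares else 0.0


# Hypothetical answers to the same prompt, asked five times in a row.
runs = [
    ["Brand A", "Brand B", "Brand C"],
    ["Brand A", "Brand D", "Brand E"],
    ["Brand B", "Brand A", "Brand F"],
    ["Brand D", "Brand G", "Brand A"],
    ["Brand C", "Brand H", "Brand B"],
]

rates = mention_rate(runs)
print(rates["Brand A"])   # share of the five runs that named Brand A
print(persistence(runs))  # run-to-run carryover of brands
```

In this invented sample, a reader who saw only the last run would conclude Brand A is invisible, while the five-run average shows it in most answers. That is the gap between a snapshot and a measurement.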
2. A score of 14 and a score of 45 each sit inside a range the reader can’t see.
AI Brand Score is reported on a 0-to-100 scale. The move from 14 to 45 looks like a 31-point gap. But both numbers arrived without any uncertainty band. The size of that band depends on how many prompts fed the score and how many times those prompts were run, neither of which is disclosed.
Consider the implication. If the real range around 14 stretches into the low 20s, and the real range around 45 dips into the low 30s, the gap between them shrinks from a clean 31 points to something much less conclusive. Or the bands might be tight and the gap might hold comfortably. The reader has no way to tell. A 221% headline without a confidence range is asking the buyer to trust precision the case study hasn’t earned.
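To make the point about hidden ranges concrete, here is a rough sketch that treats a score as a simple share of runs in which the brand appeared. That is an assumption for illustration only, since the AI Brand Score formula isn’t published. It uses the standard Wilson score interval; the run counts (20 and 200) and the helper names are hypothetical.

```python
import math


def wilson_interval(hits, runs, z=1.96):
    """Wilson score interval for a proportion (z=1.96 is roughly 95%)."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    p = hits / runs
    denom = 1 + z**2 / runs
    center = (p + z**2 / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z**2 / (4 * runs**2)) / denom
    return center - half, center + half


def intervals_overlap(a, b):
    """True when two (low, high) intervals share any values."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical: 15% vs 45% mention rate, measured on 20 runs each.
small_before = wilson_interval(3, 20)
small_after = wilson_interval(9, 20)

# Hypothetical: 14% vs 45% mention rate, measured on 200 runs each.
large_before = wilson_interval(28, 200)
large_after = wilson_interval(90, 200)

print(intervals_overlap(small_before, small_after))  # bands at 20 runs
print(intervals_overlap(large_before, large_after))  # bands at 200 runs
```

Under these assumptions the two bands overlap at 20 runs and separate cleanly at 200. The same pair of headline numbers can be inconclusive or solid depending on a sample size the case study doesn’t report.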
3. The baseline of 14 needs to have been stable, not a low point in normal fluctuation.
Luxury handbags is a competitive category in AI recommendations. Brand mentions rotate in and out of answers as engines resample and re-weight sources. A score of 14 captured during a period when the brand happened to be cycling low would mechanically inflate the lift — the “before” number would be artificially depressed before anyone touches a content calendar.
For 14 to be a real starting point, the citation landscape in luxury handbags would need to have been stable during that measurement window. The brands showing up in AI answers would need to have settled into a consistent order. And the gap between this brand’s score and its neighbors would need to have been large enough that the ranking meant something rather than being noise. The case study presents 14 as a fixed floor. Whether it was one is a question the case study leaves unanswered.
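A small sketch of how a low-point baseline inflates a lift. Only the 14 and the 45 come from the case study; the seven baseline readings are invented to show a score that fluctuates and happens to dip to 14 once, and `percent_lift` is a hypothetical helper.

```python
from statistics import mean


def percent_lift(before, after):
    """Percentage change from a 'before' reading to an 'after' reading."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (after - before) / before * 100


# Hypothetical baseline readings taken before any content work began.
baseline_readings = [22, 19, 25, 14, 21, 24, 20]
after_reading = 45

# Lift if the "before" number is locked at the lowest reading.
lift_from_low = percent_lift(min(baseline_readings), after_reading)

# Lift if the "before" number is the average of all baseline readings.
lift_from_mean = percent_lift(mean(baseline_readings), after_reading)

print(round(lift_from_low), round(lift_from_mean))
```

With these made-up readings, locking the baseline at the dip yields the 221% headline, while averaging the baseline yields a lift a little over half that size. Nothing about the brand differs between the two calculations; only the choice of starting point does.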
4. The prompts measured what consumers ask, not just what the brand optimized for.
Evertune’s case study tells a specific story: their Consumer Preferences report found the brand was underperforming on “style,” so the brand created content targeting that attribute. The score then moved from 14 to 45.
That narrative creates a natural question about the prompt set. If the prompts used to compute AI Brand Score leaned toward style-related queries (“most stylish luxury handbags,” “best designer bags for fashion”), then the measurement was pointed directly at the gap the brand just filled. The lift would be real on those prompts but wouldn’t tell you whether the brand’s visibility improved on the broader set of questions consumers actually ask when shopping for handbags, such as durability, resale value, gifting, travel, and everyday use.
The case study doesn’t disclose how many prompts were tracked, what they covered, or whether the same set was used for both the before and after readings. If the set changed between measurements, the comparison isn’t paired. You’re looking at two readings of two different things.
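One way to picture the pairing problem, sketched under the assumption that each reading is stored as a mapping from prompt text to the brand’s mention rate on that prompt. The prompts, the rates, and the `paired_comparison` helper are hypothetical.

```python
def paired_comparison(before, after):
    """Compare two readings on shared prompts only, and on everything."""
    shared = sorted(set(before) & set(after))
    if not shared:
        raise ValueError("no prompts in common; the readings are not comparable")
    paired_delta = sum(after[p] - before[p] for p in shared) / len(shared)
    unpaired_delta = (
        sum(after.values()) / len(after) - sum(before.values()) / len(before)
    )
    return {
        "shared": shared,
        "only_before": sorted(set(before) - set(after)),
        "only_after": sorted(set(after) - set(before)),
        "paired_delta": paired_delta,
        "unpaired_delta": unpaired_delta,
    }


# Hypothetical mention rates per prompt for the two readings.
before = {
    "most stylish luxury handbags": 0.20,
    "best handbags for resale value": 0.10,
    "durable everyday handbags": 0.15,
}
after = {
    "most stylish luxury handbags": 0.70,
    "best designer bags for fashion": 0.80,
    "durable everyday handbags": 0.15,
}

result = paired_comparison(before, after)
print(result["only_before"], result["only_after"])
print(result["paired_delta"], result["unpaired_delta"])
```

In this invented example a resale prompt was swapped for a second style prompt between readings. The unpaired averages suggest a larger gain than the shared prompts support, and the durability prompt didn’t move at all.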
The proprietary score question
The four points above apply to any case study in the category. Evertune’s has an additional layer: AI Brand Score is a proprietary composite whose formula isn’t published.
Composite metrics are a reasonable product choice. Rolling citation share, sentiment, and visibility into a single number gives a marketer one thing to track. But without a disclosed methodology, a buyer cannot decompose the lift. Did 14 to 45 come from citation share improving? Sentiment swinging? The formula re-weighting between the two readings? The case study offers a plausible narrative explanation, but plausible isn’t the same as auditable.
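To illustrate what “decompose the lift” would mean if a formula were published, here is a sketch built on a purely hypothetical weighted-sum composite. The weights, the component values, and the linear form are all assumptions made for illustration. They are not Evertune’s formula, which is not public.

```python
def composite(components, weights):
    """Hypothetical composite: a weighted sum of component scores."""
    return sum(weights[name] * components[name] for name in weights)


def contributions(before, after, weights):
    """Points of lift attributable to each component under fixed weights."""
    return {name: weights[name] * (after[name] - before[name]) for name in weights}


# Hypothetical component readings on a 0-to-100 scale.
before = {"citation_share": 10, "sentiment": 30, "visibility": 20}
after = {"citation_share": 30, "sentiment": 60, "visibility": 50}

# Hypothetical published weights.
weights = {"citation_share": 0.5, "sentiment": 0.2, "visibility": 0.3}

lift_by_component = contributions(before, after, weights)
total_lift = composite(after, weights) - composite(before, weights)
print(lift_by_component, total_lift)

# Same "before" inputs, different weights: the score moves on its own.
reweighted = {"citation_share": 0.2, "sentiment": 0.2, "visibility": 0.6}
print(composite(before, weights), composite(before, reweighted))
```

With the weights visible, a buyer could attribute each point of lift to a component and check whether the weights stayed fixed between readings. Without them, a change in weighting and a change in the underlying inputs look identical from the outside.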
The practical consequence: a marketer can’t compare “14 to 45” against their own brand’s citation share or mention coverage, because the units don’t translate outside Evertune’s ecosystem. A published methodology would close that gap.
The ask
Four questions any AI visibility case study should answer:
1. How many times was each prompt run per measurement window?
2. What confidence range surrounds each reported number?
3. Was the baseline stable and decision-ready before the “before” reading was locked?
4. How many prompts were used, how were they selected, and was the same set used for both readings?
The Evertune case study doesn’t answer any of them. The next time a vendor walks you through a headline lift, run it through the list above. Not because the number is wrong. Because you don’t have enough information to know whether it’s right.
Frequently asked questions
How do you evaluate an AI visibility case study that uses a proprietary score?
Start with the same four questions: multi-run measurement, confidence ranges, baseline readiness, and prompt design. Then add a fifth: is the score's formula published? If you can't see what goes into the composite, you can't decompose the lift or benchmark it against your own data.
What is AI Brand Score?
Evertune's proprietary composite metric. The case study describes it as measuring "the probability of AI driving attention to a brand unaided." That definition is worth pausing on. A probability of visibility is not the same thing as measured visibility. The case study headline says the brand increased its AI Brand Score by 221% — but what increased was a modeled probability, not a verified count of how often the brand actually appeared in AI answers. The distinction matters: a probability score can move based on changes to the model's inputs or weights without any change in how often the brand is actually cited by the engines. Without a published methodology, a buyer can't tell whether the 221% increase reflects a real shift in how AI engines treat the brand or a shift in how the score estimates that treatment.
Why does measurement methodology matter for a proprietary score?
Because the measurement sits underneath the score. Whether the metric is called AI Brand Score, Citation Share, or anything else, the prompts were either run once or multiple times, the baseline either settled or didn't, and the prompt set was either balanced or skewed. A proprietary formula doesn't change what it takes to measure a probabilistic system.
Kevin McCabe is CRO at IQRush. If you want to see how your brand’s AI visibility holds up under the same measurement framework described here, book a 30-minute walkthrough.