Profound’s Ramp Case Study: Does the Math Support the Headline?

Kevin McCabe

CRO

5 min read

Profound's Ramp case study has a headline every marketer wants to put in a deck: AI visibility from 3.2% to 22.2%. The case study's own date range runs from December 1, 2024 to February 15, 2025 — about eleven weeks — though the headline and meta description both say "in just 1 month." Either way, it's a 7x improvement. It's clean, it's dramatic.

I'm not here to say it didn't happen. I'm here to walk through what would have to be true — underneath that headline — for the number to mean what it implies. Because case studies in AI visibility aren't like case studies in SEO, where you can type the query yourself and verify where you rank. AI visibility is measured on probabilistic systems. The engines reshuffle their answers constantly. And that means a before-and-after number carries a burden of proof that Profound's case study doesn't attempt to meet.

This isn't about Profound specifically. It's about what buyers should expect from any vendor in this space, including us.


The claim

Profound's Ramp case study reports that Ramp's AI visibility in Accounts Payable went from 3.2% to 22.2%, with 300+ citations and a rank improvement from 19th to 8th. The headline says "in just 1 month," though the date range on the case study runs December 1, 2024 to February 15, 2025, which is closer to eleven weeks. The case study doesn't disclose how many prompts were tracked, how many times each was run, what confidence range surrounds either number, or which engines were included in the measurement.
Here's what has to be true for that number to hold up.


What you'd have to believe


Each prompt was measured more than once

AI engines don't produce the same answer every time. Ask ChatGPT about accounts payable software five minutes apart and you'll get different citation lists. Published research from other vendors in this space has found that only about 30% of brands persist in AI answers from one run to the next. If that's even directionally true, a single-pass measurement is capturing one frame of a movie that's constantly cutting between scenes.
For either number on the Ramp case study to mean anything, each prompt in the measurement set would need to have been run multiple times, with the reported figure being an average across those runs rather than a single draw. The case study doesn't say whether this happened. If Profound fired each prompt once for the "before" reading and once for the "after" reading, both numbers are single realizations of noisy distributions, not measurements.
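To make the single-pass problem concrete, here is a minimal simulation sketch. Everything in it is an assumption for illustration: the 50-prompt set, the fixed 10% true visibility, and above all the idea that each answer is an independent draw with the same probability of mentioning the brand. Real engines are messier than that, and nothing here reflects how Profound actually measures.

```python
import random
import statistics


def measure_visibility(true_rate, n_prompts, runs_per_prompt, rng):
    """Return the share of answers mentioning the brand in one measurement.

    Simplifying assumption: every answer is an independent draw with the
    same chance (true_rate) of mentioning the brand.
    """
    total = n_prompts * runs_per_prompt
    hits = sum(rng.random() < true_rate for _ in range(total))
    return hits / total


def spread_of_readings(true_rate, n_prompts, runs_per_prompt, trials=2000, seed=0):
    """Repeat the whole measurement many times; return the std dev of the readings."""
    rng = random.Random(seed)
    readings = [
        measure_visibility(true_rate, n_prompts, runs_per_prompt, rng)
        for _ in range(trials)
    ]
    return statistics.pstdev(readings)


# Hypothetical setup: 50 prompts, true visibility held fixed at 10%.
single_pass = spread_of_readings(0.10, n_prompts=50, runs_per_prompt=1)
ten_runs = spread_of_readings(0.10, n_prompts=50, runs_per_prompt=10)

print(f"std dev of a single-pass reading: {single_pass:.3f}")
print(f"std dev of a 10-run average:      {ten_runs:.3f}")
```

Under these assumptions a single-pass reading swings by roughly four percentage points around a true value that never moved, and averaging ten runs per prompt cuts that to a bit over one point. Repeated runs only reduce run-to-run noise; they do nothing about a prompt set that is too small or unrepresentative.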


The reported numbers would hold up inside a confidence range

Neither 3.2% nor 22.2% is reported with any uncertainty band. The width of that band depends on how many prompts were tracked and how many times each was run. For example, at 50 prompts per topic, the textbook 95% band around 3.2% runs roughly from 1% to 12%, and the band around 22.2% runs roughly from 13% to 36%. The bands barely clear each other. With more prompts the bands narrow, the gap between them widens, and the precision argument gets stronger. With fewer prompts the bands overlap, and sampling noise alone could explain the full lift.
Profound's own research makes this worse, not better. A separate Profound blog post on AI Search Volatility reports 59.3% citation drift on Google AI Overviews. That's Profound telling its buyers that the engines it measures are noisy. The Ramp case study doesn't reconcile that admission with its headline number.
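The bands quoted above can be reproduced approximately with a short sketch. It assumes the "textbook" band is a Wilson score interval, which is a guess about the method rather than something the source states, and it treats each prompt as one independent yes/no observation. The prompt counts of 25, 50, and 200 are hypothetical, since the case study doesn't disclose the real one.

```python
import math


def wilson_interval(p_hat, n, z=1.96):
    """Wilson score interval for an observed proportion p_hat over n trials.

    z=1.96 corresponds to a 95% band. Using Wilson here is an assumption;
    other interval methods give slightly different endpoints.
    """
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# Hypothetical prompt counts; the case study does not disclose the real one.
for n in (25, 50, 200):
    before = wilson_interval(0.032, n)
    after = wilson_interval(0.222, n)
    overlap = before[1] >= after[0]
    print(
        f"n={n:>3}: before {before[0]:.1%} to {before[1]:.1%}, "
        f"after {after[0]:.1%} to {after[1]:.1%}, overlap={overlap}"
    )
```

At 50 prompts this gives bands close to the figures in the text, with only a sliver of daylight between them. At 25 prompts the two bands overlap, and at 200 they are clearly separated.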


The 3.2% baseline had settled before being treated as a starting point

For 3.2% to be a real baseline, two things would need to be true about the measurement period it came from. The rankings of top-cited domains in Accounts Payable would need to have stopped moving materially and converged within that window. And the gap between established domains' citation shares would need to have been large enough, relative to the uncertainty bands, that the ranking order was meaningful rather than noise.
If either condition wasn't met, 3.2% isn't a baseline. It's a reading taken from a distribution that was still settling. The lift from 3.2% to 22.2% would then be anchored to a moving point. The entire comparison inherits that instability.
The case study doesn't say whether the baseline was stable. It presents 3.2% as a fixed point.
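One way to turn those two conditions into a check is sketched below. The function name `baseline_ready`, the domain names, the counts, and the specific thresholds (top three domains, 95% Wilson bands) are all hypothetical choices for illustration, not a description of any vendor's readiness test.

```python
import math


def wilson_interval(hits, n, z=1.96):
    """95% Wilson score interval for hits out of n answers."""
    p_hat = hits / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


def ranking(window):
    """Domains ordered by citation count, highest first."""
    return sorted(window, key=lambda domain: window[domain], reverse=True)


def baseline_ready(windows, n_answers, top_k=3):
    """windows: one {domain: citation_count} dict per consecutive pre-period window."""
    orders = [ranking(w)[:top_k] for w in windows]
    # Condition 1: the top-k order has stopped moving across windows.
    order_stable = all(order == orders[0] for order in orders)
    # Condition 2: in the latest window, adjacent ranks have non-overlapping bands.
    latest, top = windows[-1], orders[-1]
    separated = all(
        wilson_interval(latest[higher], n_answers)[0]
        > wilson_interval(latest[lower], n_answers)[1]
        for higher, lower in zip(top, top[1:])
    )
    return order_stable and separated


# Hypothetical citation counts out of 500 answers per window.
settled = [
    {"a.example": 200, "b.example": 120, "c.example": 50},
    {"a.example": 210, "b.example": 115, "c.example": 55},
    {"a.example": 205, "b.example": 118, "c.example": 52},
]
drifting = [
    {"a.example": 30, "b.example": 28, "c.example": 25},
    {"a.example": 27, "b.example": 31, "c.example": 26},
]
print(baseline_ready(settled, n_answers=500))
print(baseline_ready(drifting, n_answers=500))
```

The first pre-period passes both conditions. The second fails because the order flips between windows, and its shares are too close together to separate from noise in any case.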


The prompt set was representative, balanced, and used consistently across both measurements

The case study says "Accounts Payable" but doesn't disclose how many prompts were used, how they were selected, or whether the same set was used for both the before and after readings. All three matter.
The composition of the prompt set determines the result. A set skewed toward long-tail queries where Ramp's new comparison pages happened to rank would produce a different lift than a set reflecting how actual buyers ask about AP software. A balanced set — structured across query types, buyer intents, and contexts — produces a measurement that's harder to game accidentally. An unbalanced one can produce a lift that's real on the measured prompts but doesn't generalize.
And if different prompts were used for the before and after readings, the comparison isn't paired. You're measuring two different things and calling it a change. The case study doesn't disclose the prompt count, the selection criteria, or whether the set was held constant.
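The pairing point can be illustrated with a small sketch. The prompts and visibility rates below are invented, and `paired_lift` is a hypothetical helper. It compares the lift computed only on prompts present in both readings against the lift computed on whatever each reading happened to contain.

```python
def paired_lift(before, after):
    """before/after: {prompt: visibility rate averaged across runs}."""
    shared = sorted(before.keys() & after.keys())
    if not shared:
        raise ValueError("no prompts in common; the readings cannot be paired")

    def mean(values):
        values = list(values)
        return sum(values) / len(values)

    return {
        "shared": len(shared),
        "dropped": sorted(before.keys() - after.keys()),
        "added": sorted(after.keys() - before.keys()),
        # Lift on the prompts measured both times.
        "paired_lift": mean(after[p] for p in shared) - mean(before[p] for p in shared),
        # Lift if each reading simply averages its own prompt set.
        "unpaired_lift": mean(after.values()) - mean(before.values()),
    }


# Invented example: one prompt was swapped out between readings.
before = {
    "best ap software": 0.0,
    "ap automation for startups": 0.1,
    "how to automate invoices": 0.0,
}
after = {
    "best ap software": 0.1,
    "ap automation for startups": 0.2,
    "ramp vs competitor comparison": 0.9,
}
result = paired_lift(before, after)
print(result)
```

In this invented case the paired lift is 10 points, while the unpaired figure is nearly 37 points, almost all of it from the one prompt that was swapped in.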


Why this matters beyond Ramp

Every vendor in AI visibility — IQRush included — publishes case studies. The question is what standard those case studies should meet.
Right now, the standard is basically: show the before number, show the after number, describe what the customer did, put a quote on the case study. That's what Profound did with Ramp, and it's what most of the category does.
But AI visibility isn't like other SaaS metrics. The underlying systems are probabilistic. The measurements are noisy by nature. A before-and-after number on a system with 59% citation drift carries a fundamentally different burden of proof than a before-and-after number on a system that logs deterministic events.
The category needs a disclosure floor. Not a peer-reviewed paper on every customer case study, but enough information for a buyer to evaluate whether the headline number is a measurement or a snapshot. At minimum:

  1. How many times was each prompt run per measurement window?

  2. What confidence range surrounds each reported number?

  3. Was the baseline stable and decision-ready before the "before" reading was locked?

  4. How many prompts were used, how were they selected, and was the same set used for both readings?

None of that is on the Ramp case study. And without it, 3.2% to 22.2% is a number you can put in a deck but not a number you can defend in a budget review.
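For buyers who want to apply the disclosure floor systematically, the four questions can be written down as a simple record. This is a sketch only: the class and field names are hypothetical, and the empty record below reflects this article's reading of the Ramp case study rather than any structured data Profound publishes.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class CaseStudyDisclosure:
    """One field per item a case study would need to disclose; None means undisclosed."""
    runs_per_prompt: Optional[int] = None                           # question 1
    confidence_range_before: Optional[Tuple[float, float]] = None   # question 2
    confidence_range_after: Optional[Tuple[float, float]] = None    # question 2
    baseline_stable: Optional[bool] = None                          # question 3
    prompt_count: Optional[int] = None                              # question 4
    selection_method: Optional[str] = None                          # question 4
    same_prompt_set: Optional[bool] = None                          # question 4


def missing_disclosures(disclosure):
    """Return the names of every field the case study left undisclosed."""
    return [name for name, value in vars(disclosure).items() if value is None]


# As read in this article, the Ramp case study discloses none of these.
ramp = CaseStudyDisclosure()
print(missing_disclosures(ramp))
```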


What we'd publish differently

If IQRush ran a Ramp-style measurement, the customer case study would show the averages across multiple runs, not a single pass. It would show confidence ranges alongside the point estimates. It would include a readiness signal confirming the baseline had settled and the signal had separated from the noise floor before the lift was computed. And if the pre-period was still drifting on its own, we'd say so before any headline was published.
That's not because we're more virtuous. It's because the math requires it. A lift that can't survive a second measurement isn't a lift. It's a favorable snapshot.


The ask

I don't expect Profound to publish a white paper for every case study. But buyers evaluating any case study in this space (theirs, ours, anyone's) should be able to answer the four questions above from what the study itself discloses. Right now, the Ramp case study doesn't answer any of them.
The next time a vendor walks you through a headline lift, run it through the list above. Not because the number is wrong. Because you don't have enough information to know whether it's right.
