The Quiet Collapse of Surveys Part II: adding some nuances
Following up on the previous Substack which identified twin threats for surveys: increases in non-response and bots, this follow-up unpacks the nuances, busts myths, and offers practical fixes.
My last Substack laid out the twin headaches everyone who works with surveys now probably feels: human response rates keep sliding, while bot traffic keeps climbing. The previous Substack travelled beyond my academic ivory tower - and even reached many market-research pros, who acknowledge seeing the problem every day. A LinkedIn commenter called the trend “the enshittification of surveys.” I like that term: colourful, maybe a touch harsh - but it definitely shows that the frustration is real.
Now that I have drawn some attention to this topic, it’s time to move past headlines and into nuance. I have since spoken to practitioners with far more expertise and knowledge than I have. Out of those conversations come five nuance-heavy take-aways plus one broader observation: when response rate doesn’t tell the whole story, why length hurts more than bots, how cash incentives plateau, what panel conditioning really does, how modern bot-detection is catching up, and why volume itself is a risk.
Let’s unpack each one, debunk a few myths along the way, and end with fixes that will hopefully prove useful or at least inspire. Stick with me; the details matter.
Nuance 1: Probability vs non-probability sampling response rates
Something I somewhat neglected in the previous Substack is the distinction between probability and non-probability sampling - a crucial starting point for surveys! Probability panels start with a random draw: researchers draw a sample of addresses or phone numbers from a complete frame (e.g., every residential address in the country). Because the draw is random, everyone in the population has a known, non-zero chance of being contacted. Only a small share of those contacted say yes - this is the recruitment rate, which averages about 3% for Pew’s panel. If you multiply the recruitment rate by the average wave completion and retention rates over time, you get the cumulative response rate, which is often very low.
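To make that multiplication concrete, here is a minimal sketch of how a cumulative response rate compounds. Only the 3% recruitment rate comes from the Pew figure above; the retention and wave-completion rates are made-up placeholders, and the function name is my own.

```python
def cumulative_response_rate(recruitment: float,
                             retention: float,
                             completion: float) -> float:
    """Multiply the stage-level rates of a probability panel into one rate."""
    stages = {"recruitment": recruitment,
              "retention": retention,
              "completion": completion}
    for name, rate in stages.items():
        if not 0.0 <= rate <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {rate}")
    return recruitment * retention * completion


# 3% recruitment is the Pew average quoted above.
# 60% retention and 80% wave completion are illustrative assumptions only.
crr = cumulative_response_rate(recruitment=0.03, retention=0.60, completion=0.80)
print(f"Cumulative response rate: {crr:.2%}")
```

Even with fairly generous assumed retention and completion, the cumulative figure lands well below the recruitment rate, which is why it looks so poor next to an opt-in panel's per-wave completion.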
Opt-in panels flip the model: the sample is self-selected from day one, so per-wave completion can exceed 70%. Opt-in samples over-collect “easy” groups (high education, high broadband) and miss lower-income and limited-English households - but on paper these non-representative samples have high response rates.
Okay, so raw response rates do not tell us the whole story, it seems, but should we care? Fluid, broad attitudinal measures (e.g., brand tracking) often tolerate some error - especially if they field daily and look for relative shifts. But if you need a point estimate that will drive funding or seat allocation, you still want a probability backbone. And if the selection bias lines up with your outcome (say, political engagement), weights won’t rescue you, no matter how hard you try and how high your response rate is!
Let’s look at some data. Pew’s 2023 panel comparison benchmarked each sample against 28 gold-standard population margins, and it shows strong differences in mean absolute error between non-probability and probability samples. Even after weighting and calibration, the non-probability samples carry roughly double the error.
The rules of thumb I would use? If you need trend precision (e.g., an approval rating within plus/minus 2 percentage points), you should definitely stick with probability or a hybrid in which you have a probability seed and add a touch of Bayesian blending. If you need segment discovery and speed, go with large non-prob samples as long as you occasionally validate with probability pulses. Want to reach a hard-to-reach niche? Start with non-prob sampling that oversamples the niche, and make sure to do some post hoc propensity matching against a small probability check.
What do industry players actually do? To give a few UK-based examples I am familiar with:
YouGov admits selection is non-random, but argues its sampling-plus-modelling stack (stratified quotas, sample-matching, and finally multilevel regression-with-post-stratification) recovers population truths at a fraction of the cost and in 24-hour field windows. The trade-off: larger mean absolute error versus benchmarks (≈ 4½ percent vs. 2 percent for Pew’s ATP) and heavier reliance on model assumptions if the world drifts.
The ONS is the archetype of classical probability work: systematic address sampling, five follow-up waves, and expensive interviewer call-backs. It delivers sub-percentage precision on employment but suffers from falling co-operation (wave-1 down ~16 percent in a decade), forcing costly sample boosts and rapid-redesign projects.
Nuance 2: Survey length reduces data quality more than bots do.
Hoerger (2010) found that any online survey loses about 10% of respondents almost immediately, then sheds another 2 percentage points for every 100 questions that follow. Peytchev (2009) sharpened the point: an open-ended text box makes a respondent 2.5 times more likely to quit.
The survival-curve below shows what that looks like in practice: even with a perfectly human sample, a 200-item form shaves the effective N from 100% to ~86%. That 14-point hit translates straight into wider confidence intervals, heavier post-strat weights, and - when designers over-correct - bias creep that no bot-detection filter can fix.
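As a rough sketch of the arithmetic behind that curve, the rule of thumb quoted from Hoerger can be written as a small function. The linear form, the function names, and the margin-of-error helper (which assumes simple random sampling, where the margin scales with 1/sqrt(n)) are my own simplifications, not something taken from the original papers.

```python
import math


def expected_retention(n_items: int,
                       initial_loss: float = 0.10,
                       loss_per_100_items: float = 0.02) -> float:
    """Share of starters still answering after n_items questions.

    Linear rule of thumb as quoted above: an immediate ~10% loss,
    then ~2 percentage points per 100 questions. Floored at zero.
    """
    if n_items < 0:
        raise ValueError("n_items must be non-negative")
    retained = 1.0 - initial_loss - loss_per_100_items * (n_items / 100)
    return max(retained, 0.0)


def margin_of_error_inflation(retained_share: float) -> float:
    """Factor by which a simple-random-sample margin of error widens
    when only retained_share of the original sample remains."""
    if retained_share <= 0:
        raise ValueError("retained_share must be positive")
    return 1.0 / math.sqrt(retained_share)


for n_items in (50, 100, 200):
    share = expected_retention(n_items)
    print(f"{n_items:>3} items: {share:.0%} retained, "
          f"margin of error x{margin_of_error_inflation(share):.3f}")
```

For the 200-item form this reproduces the ~86% figure, and under the simple-random-sampling assumption the margin of error widens by roughly 8% before any weighting is applied.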
So what to do? Ruthlessly pruning grids and replacing them with short, single-topic items or card-style questions would be a start. Swap open-ends for quick taps (a single optional comment box is often more than enough!), and move low-priority psychographics to re-contacts. Keeping the core survey under 50 items often delivers more usable data than the fanciest respondent-validation stack.
In short, bots may grab headlines, but question overload does more everyday damage. Trim the form first; fight automation second.
Nuance 3: Cash bumps response, but costs climb faster than budgets.
I mentioned in the previous Substack that respondents should perhaps be paid more, and I have now looked into this a bit further. The idea behind prepaid cash is straightforward: include money with the invitation and people feel obliged to reciprocate. Tape more cash to the invite → nudge more people to answer.
Yet the latest large-N experiments show we’re already into diminishing-returns territory. The chart below uses the 2022 Survey of Consumer Finances randomized trial (around 8,600 sampled households). Moving the enclosed bill from $5 → $10 buys you a visible bump (6.5% → 8.6% completions). Going up again to $15 barely matters (9.3%): you add another five dollars to get just 0.7 points of extra reach. Meanwhile, the budget impact can be immediate, because most agencies budget incentive lines months in advance. If a study scales from 5,000 to 50,000 addresses (common for labour-force surveys), an unplanned shift from $5 to $10 can add $250k overnight.
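The diminishing returns are easier to see as cost per complete. The sketch below uses only the three completion rates and the 50,000-address scenario quoted above; it counts incentive dollars only (no printing, postage, or interviewer costs), and the function names are my own.

```python
# Completion rates by prepaid incentive, as quoted above for the SCF trial.
completion_by_incentive = {5: 0.065, 10: 0.086, 15: 0.093}


def incentive_cost_per_complete(incentive: float, completion_rate: float) -> float:
    """Incentive dollars spent per completed interview.

    Prepaid cash goes to every sampled address, so the spend per
    complete is the bill divided by the completion rate.
    """
    return incentive / completion_rate


def marginal_cost_per_extra_complete(low: int, high: int, rates: dict) -> float:
    """Extra incentive dollars per additional complete when moving low -> high."""
    return (high - low) / (rates[high] - rates[low])


for amount, rate in completion_by_incentive.items():
    print(f"${amount}: ${incentive_cost_per_complete(amount, rate):.2f} per complete")

step_5_to_10 = marginal_cost_per_extra_complete(5, 10, completion_by_incentive)
step_10_to_15 = marginal_cost_per_extra_complete(10, 15, completion_by_incentive)
print(f"$5 -> $10: ${step_5_to_10:.0f} per extra complete")
print(f"$10 -> $15: ${step_10_to_15:.0f} per extra complete")

# Budget shock from the scenario above: 50,000 addresses, $5 more per invite.
extra_budget = 50_000 * (10 - 5)
print(f"Unplanned extra incentive spend: ${extra_budget:,}")
```

On these figures the first step costs about $238 in incentives per additional complete and the second about $714, so the same five dollars buys roughly a third as much.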
My advice? Obviously it depends heavily on the task at hand, but overall prepaid cash works, and $1–$5 works best. Above that, you are spending faster than data quality improves - and taking on budget and compliance risk in the process. You can also boost under-represented strata by layering post-paid bonuses, targeted to those strata after survey completion, for a fraction of the cost. And to avoid “professional respondent” churn, consider replacing straight cash with a lottery or charity top-up once the base response is stable. This maintains the reciprocity signal without steady escalation.
Btw, combining nuances 2 and 3, there seems to be a trade-off between making the study longer and paying more. Since the cost of the study grows roughly linearly with length (Hoerger’s rule) and the returns to higher incentives seem to be concave (the SCF experiment), there should be a unique interior minimum - the “sweet spot” beyond which an extra dollar or an extra question makes the cost per usable interview worse, not better! I’ll leave it to some economist with extra time to calculate this and fit the real parameters though ;)
Nuance 4: panel conditioning - when the sample starts gaming the test
The longer the same people stay in a panel, the more their answers change - not because their views shift, but because they adapt to the questionnaire.
The heat-map below shows how answer quality erodes as the same people answer successive waves. Heavy responders dominate panels - and their shortcuts, not bots, are bending your data. The straight-lining rate is the share of grid questions where a respondent simply clicks the same column all the way down (e.g., “Strongly agree” × 10). It’s a classic fatigue/shortcut signal - not a measure of learning. In panel studies it climbs because frequent respondents “remember the drill” and speed-click.
In the German PaCo project (2024), straight-lining - ticking the same box down a grid - creeps from ≈ 5% at wave 1 to 17% by wave 12.
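For readers who want to compute this on their own panel, here is a minimal sketch of the straight-lining rate as defined above: the share of a respondent's grids answered in a single column. The data layout (one list of coded answers per grid) and the example respondent are hypothetical.

```python
from typing import Sequence


def is_straight_lined(grid_answers: Sequence[int]) -> bool:
    """True when every item in one grid received the same answer."""
    return len(grid_answers) > 1 and len(set(grid_answers)) == 1


def straight_lining_rate(grids: Sequence[Sequence[int]]) -> float:
    """Share of a respondent's grids that were answered in a single column."""
    if not grids:
        return 0.0
    return sum(is_straight_lined(grid) for grid in grids) / len(grids)


# Hypothetical respondent: three grids, answers coded 1-5.
respondent_grids = [
    [5, 5, 5, 5, 5],  # same column all the way down
    [2, 4, 3, 4, 1],  # varied answers
    [3, 3, 3, 3, 3],  # same column again
]
rate = straight_lining_rate(respondent_grids)
print(f"Straight-lining rate: {rate:.0%}")
```

Tracking this rate per respondent and per wave is one way to spot the speed-clickers before deciding whether to down-weight or rest them.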
Pew’s U.S. experiment adds nuance: knowledge-based questions inflate by +7 pp after just four waves, yet attitudes drift only +1 pp - thus conditioning is topic-sensitive.
My advice would be threefold. Rotate sub-samples or insert ‘rest’ waves so the same people are not answering every wave. Down-weight serial responders, or add practice items that absorb their learning without contaminating target questions. And randomise the grid order and limit grid length: breaking the pattern discourages auto-clicking.
Frequent respondents are valuable, but only if you blunt the conditioning effect. Otherwise the panel starts gaming the test - and the bias belongs to you, not the bots.
Nuance 5: bot-detection is getting smarter (and more layered)
In the previous Substack, I showed how easy it was to set up a bot yourself - even I can manage with some vibe coding. But for almost every new trick that GitHub Copilot and I invent together, the major survey platforms now field at least one counter-measure - and the evidence shows it works to identify bots like mine.
To give some examples, a few platform-level defences have already emerged:
Qualtrics pipes every respondent through Google’s Invisible reCAPTCHA v3 and stores a Q-Recaptcha score. Scores below 0.5 are automatically flagged as “probable bot” and can be filtered in real time. Internal benchmarks on large B2C trackers put the share of completes that trip this flag at 1–2%, with a false-positive rate below 0.1%.
Prolific relies on a multi-step identity gate: e-mail, SMS, and photo-ID verification at sign-up; IP/ISP and VPN checks on every login; a short, hand-graded writing task before a new account sees live studies. The company reports that fewer than 2% of the participants who make it past onboarding are later removed for quality.
SurveyMonkey stress-tested its own system by unleashing a ChatGPT-powered bot across an incentivised sample. The bot could evade simple honeypots and “speeding” flags, but was eventually caught by the platform’s Build-with-AI response-coherence engine plus duplicate-ID checks; none of the synthetic completes reached the final dataset.
The graph above also shows some other evidence: a 2024 Frontiers in Research Metrics & Analytics study evaluated 31 separate fraud indicators - from keystroke timing to semantically-aware text entropy - on two California agriculture surveys. Their best six-indicator ensemble reached 96% precision and 0.92 AUC, ultimately flagging about 8% of otherwise “clean” completes as fraudulent. Crucially, the authors show that no single indicator clears 90% recall.
Besides these established platforms, many start-ups specializing in identifying bots have also entered the field. I briefly talked to the founders of a start-up called Meaningful and got a chance to go through their platform - it looked very promising!
Some tips? Layer your screens: CAPTCHA → IP blocklists → latent-semantic checks catch more than any single rule. Also, keep raw paradata: millisecond timestamps, focus-change events, and typing cadence give auditors the evidence needed to justify exclusions. Perhaps a budget for manual review wouldn’t be a bad idea either. In any case, update the mix every quarter. Bot tactics now evolve almost as fast as LLM APIs; what works this spring may leak by autumn.
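A layered screen can be sketched in a few lines. Everything here is hypothetical: the record fields, the blocklist, and especially the third layer, where a crude duplicate-text check stands in for a real latent-semantic model. The 0.5 CAPTCHA threshold simply mirrors the cut-off mentioned for Qualtrics above. Note that the function only attaches flags and never drops rows, so the raw data stays available for audit and manual review.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Complete:
    respondent_id: str
    captcha_score: float   # hypothetical field, 0.0 (bot-like) to 1.0 (human-like)
    ip: str
    open_text: str
    flags: list = field(default_factory=list)


def screen(completes, ip_blocklist, captcha_threshold=0.5):
    """Attach one flag per failed layer; nothing is deleted."""
    normalised = [c.open_text.strip().lower() for c in completes]
    text_counts = Counter(normalised)
    for complete, text in zip(completes, normalised):
        complete.flags = []
        if complete.captcha_score < captcha_threshold:      # layer 1: CAPTCHA
            complete.flags.append("captcha")
        if complete.ip in ip_blocklist:                     # layer 2: IP blocklist
            complete.flags.append("ip_blocklist")
        if text_counts[text] > 1:                           # layer 3: stand-in text check
            complete.flags.append("duplicate_text")
    return completes


# Made-up completes using documentation-only IP ranges.
sample = [
    Complete("r1", 0.9, "198.51.100.7", "I mostly shop online because it saves time."),
    Complete("r2", 0.3, "203.0.113.5", "Good survey."),
    Complete("r3", 0.8, "198.51.100.9", "good survey. "),
]
screened = screen(sample, ip_blocklist={"203.0.113.5"})
for complete in screened:
    print(complete.respondent_id, complete.flags)
```

In this toy sample, r3 passes the CAPTCHA and IP layers and is only caught by the text layer, which is the point of stacking screens rather than trusting one rule.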
In short, sophisticated detection is already here and demonstrably effective, but only if we treat it as a living part of survey operations rather than a one-time checkbox.
The number of surveys is increasing, so more can also go wrong…
The figure above shows that the sheer volume of surveys is exploding. The blue series tracks every individual survey question ever logged in the Roper iPoll archive. Since 2010 they have gone from roughly 600,000 to 1.1 million questions in field - a jump of about 83%. Further, industry money is keeping pace. The orange line uses ESOMAR’s annual global revenue totals for market-research (all modes). Turnover rises from $69 bn to $140 bn (roughly +103%) over the same window. So budgets are there to keep asking - but not necessarily to keep fixing the quality problems created by that extra volume.
Naturally, the error surface scales with traffic. Every additional questionnaire adds potential for:
Non-response - more invitations mean more ways to overload panels or email lists.
Bot infiltration - when payouts and route IDs multiply, so does the attack surface.
Measurement drift - panellists see more waves, triggering the conditioning you saw in the previous heat-map.
Weight-instability - post-stratification matrices that worked at 600k items may buckle at 1 million because minor groups fragment across more surveys.
In sum, yes, the problem is real. Yes, there is ‘enshittification’ of surveys. But let me be the typical academic and point out that there are also many nuances and solutions: I have just pointed out five of the many. Do reach out and share this if you are interested in hearing more or want to follow up on any of these ideas.







I am the person who wrote about the enshittification of programmatic sampling. This is a great piece that provides a ton of depth to what those of us who have been operating in the industry over the past 20+ years have seen.
my e-book on enshittification is here: https://getitfrom.jddeitch.com
On Qualtrics - does this article by Sean Westwood qualify your optimism about reCAPTCHA as a useful barrier? He claims to have bypassed it with an LLM.
https://www.pnas.org/doi/10.1073/pnas.2518075122