You've Experienced This Problem
Watch what happens when you ask AI for variety:
Generate 5 different jokes about coffee
Why did the coffee file a police report? It got mugged!
Why did the coffee file a complaint? It got mugged!
Why did the coffee go to the police? It got mugged!
Why did the coffee call the cops? It got mugged!
Why did the coffee report a crime? It got mugged!
This Isn't a Bug — It's Mode Collapse
During alignment training, models learn to favor "safe" and typical responses. This is caused by typicality bias in human preference data — annotators systematically prefer familiar text.
The result? Your creative AI becomes predictably uncreative.
But There's a Solution
We show that Verbalized Sampling (VS) recovers the model's inherent diversity by asking for a distribution of responses with probabilities, bypassing mode collapse.
The Mode Collapse Problem
You ask your favorite LLM for a joke about coffee. You ask again. You get the same joke, no matter which model you try. You ask for a short story, and it begins with “Once upon a time, in a land far away…” The brainstorming ideas feel generic, the outputs repetitive.
This frustrating phenomenon is called mode collapse. Past research blamed the AI’s post-training process (e.g., RLHF), assuming the algorithms naturally favored the most common, “safe” answer (Kirk et al. 2024; Murthy et al. 2025). We discovered something more fundamental: The problem isn’t just the AI. It’s us.
Why This Matters
Mode collapse isn’t just an academic curiosity; it limits LLMs’ potential in critical applications:
- Brainstorming & Ideation: When teams rely on LLMs to generate creative solutions or explore problem spaces, mode collapse means they’re getting the same handful of “safe” ideas over and over. The model might know 100 viable approaches, but it only suggests the 3 most conventional ones. This defeats the purpose of AI-assisted brainstorming.
- Creative Writing (Chakrabarty et al. 2024): Authors, marketers, and content creators seeking fresh angles or unique narrative voices find themselves battling against the model’s tendency to regurgitate tropes. The model has learned diverse writing styles during pretraining, but alignment has pushed it toward generic, crowd-pleasing outputs. Every story starts in a forest, every protagonist is “determined yet kind.”
- Research & AI-Driven Discovery (Si et al. 2024): Perhaps most critically, mode collapse hampers AI’s role in scientific discovery and research ideation. When researchers use LLMs to generate hypotheses, explore experimental designs, or brainstorm research directions, they need the full spectrum of possibilities, including unconventional approaches that might lead to breakthroughs. Mode collapse means the AI suggests only well-trodden paths, missing potentially transformative ideas that lie in the less-typical regions of its knowledge.
How to Fix It?
Why do aligned LLMs keep giving you the same answers? And how does simply asking for probabilities fix it? This section walks you through the idea: from an intuitive metaphor, to the typicality bias at the root of mode collapse, to the mathematical formalization, and finally to how Verbalized Sampling solves the problem.
The Root Cause: Typicality Bias
Humans have a deep-seated psychological quirk we call typicality bias. We’re wired to prefer things that are familiar, conventional, and easy to process. When training these models, we think we want creativity, but our subconscious votes go to the safe, boring options.1
When human annotators provide preference data for RLHF, they’re not rating “helpfulness” in a vacuum. Given two equally correct responses, they systematically prefer the more familiar, conventional one: the more typical one.
The Mathematics of Typicality Bias
Why do humans prefer typical text? Cognitive psychology reveals several mechanisms: the mere-exposure effect (we prefer familiar content), processing fluency (easy-to-process text feels more truthful), and schema congruity (information matching existing mental models is accepted with less critical thought). These principles collectively create a systematic preference for conventional, typical responses.
Modeling the bias. We formalize this as a reward function combining true task utility with typicality:

$$\tilde{r}(x, y) \;=\; r_{\text{true}}(x, y) \;+\; \alpha \,\log \pi_{\text{ref}}(y \mid x) \;+\; \epsilon,$$

where $r_{\text{true}}(x, y)$ captures actual task quality, $\alpha > 0$ is the typicality bias weight, $\pi_{\text{ref}}$ is the base model (whose likelihood scores naturally capture text typicality from pretraining), and $\epsilon$ is noise.
Empirical validation. We tested this on HelpSteer, which provides separate ratings for correctness (true utility) and helpfulness (final reward). Analyzing 6,874 response pairs with identical correctness but different helpfulness scores, we found a statistically significant positive typicality weight $\alpha$. This means annotators systematically favor more typical responses even when correctness is controlled for. We replicated this finding across multiple preference datasets and base models, consistently finding that >50% of human-preferred responses receive higher base-model likelihood (Zhang et al. 2025, sec. 4).
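As a rough illustration of how such a bias weight can be estimated (a sketch of a Bradley-Terry style fit on synthetic numbers, not the paper's exact estimator or data), one can regress pairwise preferences on the difference in base-model log-likelihood between the chosen and rejected responses:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: for each preference pair with identical correctness,
# the base-model log-likelihoods of the chosen and rejected responses.
rng = np.random.default_rng(0)
loglik_chosen = rng.normal(-120, 15, size=1000)            # placeholder values
loglik_rejected = loglik_chosen - rng.normal(3, 10, size=1000)

# Bradley-Terry style model: P(chosen beats rejected) = sigmoid(alpha * delta_loglik).
# Fit as a logistic regression with no intercept; the coefficient estimates alpha.
delta = (loglik_chosen - loglik_rejected).reshape(-1, 1)
# The chosen response always wins, so symmetrize with mirrored pairs as losses.
X = np.vstack([delta, -delta])
y = np.concatenate([np.ones(len(delta)), np.zeros(len(delta))])

alpha_hat = LogisticRegression(fit_intercept=False).fit(X, y).coef_[0, 0]
print(f"estimated typicality weight alpha ~ {alpha_hat:.3f}")
# alpha_hat > 0 would indicate that, with correctness held fixed, annotators
# systematically prefer the response the base model finds more likely.
```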
From bias to collapse. Under standard RLHF optimization with KL regularization (coefficient $\beta$), this typicality-biased reward produces a power-sharpened optimum:

$$\pi^*(y \mid x) \;\propto\; \pi_{\text{ref}}(y \mid x)^{\,1 + \alpha/\beta}\, \exp\!\big(r_{\text{true}}(x, y)/\beta\big).$$

The sharpening exponent $1 + \alpha/\beta > 1$ concentrates probability mass on typical completions. Critically, when many responses have flat true rewards (common in creative writing, brainstorming, dialogue), the equation simplifies to:

$$\pi^*(y \mid x) \;\propto\; \pi_{\text{ref}}(y \mid x)^{\,1 + \alpha/\beta}.$$

This is exactly like temperature scaling with $T = 1/(1 + \alpha/\beta) < 1$. As $\alpha/\beta$ increases (stronger typicality bias or weaker KL regularization), the distribution sharpens further, ultimately collapsing to $\arg\max_y \pi_{\text{ref}}(y \mid x)$, the mode of the base model. This is mode collapse.2
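To see the collapse mechanically, here is a toy numerical sketch (illustrative numbers, not from the paper) of how the sharpening exponent reshapes a base distribution as $\alpha/\beta$ grows:

```python
import numpy as np

def sharpen(p_ref, alpha, beta):
    """Power-sharpened RLHF optimum under flat true rewards:
    pi*(y|x) proportional to pi_ref(y|x)^(1 + alpha/beta)."""
    q = p_ref ** (1.0 + alpha / beta)
    return q / q.sum()

# A toy base distribution over five candidate jokes.
p_ref = np.array([0.30, 0.25, 0.20, 0.15, 0.10])

for alpha in [0.0, 0.5, 2.0, 10.0]:
    print(alpha, np.round(sharpen(p_ref, alpha, beta=1.0), 3))
# alpha = 0 leaves pi_ref unchanged; as alpha/beta grows, probability mass piles
# onto the most typical joke, approaching a point mass on the mode (mode collapse).
```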
Embrace Mode Collapse: Recover Diversity at the Distribution Level
Verbalized Sampling (VS) breaks this cycle by asking for a distribution of candidates (Meister et al. 2024) with probabilities. Instead of sampling from the collapsed, sharpened distribution, VS prompts the model to verbalize a broader distribution that recovers pretraining diversity.
We start with an intuitive example: The Library Metaphor. Imagine a massive library and the LLM as the librarian:
- Direct Prompt (“tell me a coffee joke”): The librarian walks straight to the “Most Popular” shelf and hands you the same book every time. This is mode collapse.
Proof: Direct prompts return the mode
Setup. For a fixed prompt $x$, we want to understand what happens when the aligned model $\pi(\cdot \mid x)$ exhibits mode collapse:

$$\pi(y \mid x) \;\approx\; \delta(y - y^*), \qquad y^* = \arg\max_y \pi_{\text{ref}}(y \mid x),$$

where $\delta$ is the Dirac delta function and $\pi_{\text{ref}}$ is the base (pretraining) model.
Claim: Instance-level prompts return the mode of $\pi_{\text{ref}}(\cdot \mid x)$.
Proof: Let $y \sim \pi(\cdot \mid x)$. Since $\pi(\cdot \mid x)$ is mode collapsed, $\Pr[y = y^*] \approx 1$. Any sample returns the mode almost surely.
- List Prompt (“tell me 5 coffee jokes”): The librarian goes to one aisle and grabs the first five books they see. You get variety, but limited to one section.
Proof: List prompts return uniform distributions
Setup. For a fixed prompt $x$, assume $\pi(\cdot \mid x)$ exhibits mode collapse as above.
Claim: List-level prompts return uniform distributions at best.
Proof: For a list prompt $x_{\text{list}}$ (asking for $k$ items) with parser $g$ that picks one element of the returned list uniformly at random, let $Y \sim \pi(\cdot \mid x_{\text{list}})$ and $Y^* = \arg\max_Y \pi(Y \mid x_{\text{list}})$ be the modal list. By total probability:

$$p(y) \;=\; \sum_{Y} \pi(Y \mid x_{\text{list}})\, \Pr[g(Y) = y].$$

Since $\pi(\cdot \mid x_{\text{list}})$ is collapsed, $\pi(Y \mid x_{\text{list}}) \approx \delta(Y - Y^*)$, so:

$$p(y) \;\approx\; \Pr[g(Y^*) = y] \;=\; \frac{1}{k} \sum_{i=1}^{k} \mathbb{1}\,[\,Y^*_i = y\,].$$

When $Y^*$ contains $k$ distinct elements (as requested), this simplifies to:

$$p(y) \;\approx\; \frac{1}{k}\,\mathbb{1}\,[\,y \in Y^*\,].$$

This is a uniform distribution over the elements in $Y^*$, regardless of their probabilities under $\pi_{\text{ref}}(\cdot \mid x)$.
- Verbalized Sampling (“tell me 5 coffee jokes with their probabilities”): You’re asking the librarian to first describe the entire library’s collection: mystery, SciFi, history, all of it, and then pick five random books that represent that whole collection.
Proof: Distribution prompts return the pretraining distribution
Setup. For a fixed prompt $x$, assume $\pi(\cdot \mid x)$ exhibits mode collapse as above.
Claim: Distribution-level prompts can approximate $\pi_{\text{ref}}(\cdot \mid x)$.
Proof: For a distribution prompt $x_{\text{dist}}$ with parser $g$ that returns element $y_i$ with its verbalized probability $p_i$, write the modal response as $Y^* = \{(y_i, p_i)\}_{i=1}^{k}$. As before:

$$p(y) \;\approx\; \Pr[g(Y^*) = y] \;=\; \sum_{i=1}^{k} p_i\, \mathbb{1}\,[\,y_i = y\,].$$

Now index all unique $y_i$ as $y_{(1)}, \dots, y_{(m)}$. We can write:

$$p(y) \;\approx\; \sum_{j=1}^{m} q_j\, \mathbb{1}\,[\,y_{(j)} = y\,], \qquad q_j = \sum_{i:\, y_i = y_{(j)}} p_i.$$

By setting $\{y_{(j)}\}$ to cover the support of $\pi_{\text{ref}}(\cdot \mid x)$ and $q_j = \pi_{\text{ref}}(y_{(j)} \mid x)$ in $Y^*$:

$$p(y) \;\approx\; \pi_{\text{ref}}(y \mid x).$$

Therefore, distribution-level prompts can exactly recover $\pi_{\text{ref}}(\cdot \mid x)$ when the model accurately verbalizes probabilities.
Remark on approximation error: In practice, we expect the verbalized probabilities to have bounded error $|q_j - \pi_{\text{ref}}(y_{(j)} \mid x)| \le \epsilon$, which yields a sampled distribution within $O(\epsilon)$ of $\pi_{\text{ref}}(\cdot \mid x)$. Our experiments demonstrate this empirically with low KL divergence to pretraining distributions.
By asking for a distribution, we force the model to access its knowledge of the entire system before making a choice.3
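To make this concrete in code, here is a minimal sketch of distribution-level prompting: ask for candidates with probabilities, parse them, then sample in proportion to the verbalized probabilities. The `call_llm` helper is a hypothetical stand-in for whatever chat API you use, and the JSON output format is an assumption chosen for easy parsing, not the only valid phrasing:

```python
import json
import random

def call_llm(prompt: str) -> str:
    """Placeholder for your chat-completion API of choice (OpenAI, Anthropic, ...)."""
    raise NotImplementedError

def verbalized_sampling(query: str, k: int = 5) -> str:
    # Distribution-level prompt: ask for k candidates *with* their probabilities.
    prompt = (
        f"Generate {k} responses to the request below, reflecting the full "
        f"distribution of responses you could give. Return JSON: a list of "
        f'objects with fields "text" and "probability".\n\nRequest: {query}'
    )
    candidates = json.loads(call_llm(prompt))
    texts = [c["text"] for c in candidates]
    probs = [c["probability"] for c in candidates]
    total = sum(probs) or 1.0
    # Sample one response in proportion to the verbalized probabilities,
    # i.e. draw from the distribution the model described, not from its mode.
    return random.choices(texts, weights=[p / total for p in probs], k=1)[0]

# joke = verbalized_sampling("Tell me a joke about coffee")
```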
Example: Different Prompts → Different Modes
The key discovery: each prompt type collapses to a different kind of mode. A direct prompt collapses to THE joke, the single most typical response ("Why did the coffee file a police report? It got mugged!"), while asking for a distribution with probabilities recovers the model's true diversity.
Your Model Knows the Distribution: A Case Study on US State Names
To test whether the model really knows this distribution, we checked how closely VS recovers the pretraining distribution. We asked Claude 3.7 Sonnet to generate US state names and measured the KL divergence between the generated distribution and the state-name distribution in the RedPajama pretraining corpus.
[Chart: US state name distribution (top 20 states) for the RedPajama pretraining corpus, direct prompting, and VS. The pretraining series shows actual state-name frequencies in the corpus; direct prompting collapses onto a few highly popular states (high KL divergence, i.e., mode collapse); the VS distribution closely matches the pretraining corpus (low KL divergence, i.e., recovery of pretraining diversity).]
This example shows that VS doesn't just increase diversity arbitrarily: it recovers the specific distribution that the base model learned during pretraining. The low KL divergence (0.12) shows VS approximates what the model "knows" before alignment collapsed it into a few popular choices.
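For readers who want to reproduce this kind of check, the comparison boils down to computing a KL divergence between two discrete distributions over state names. A minimal sketch with made-up counts (not the actual corpus or model statistics):

```python
import numpy as np

def kl_divergence(p_counts, q_counts, states, eps=1e-9):
    """KL(P || Q) between two empirical distributions over the same support."""
    p = np.array([p_counts.get(s, 0) for s in states], dtype=float)
    q = np.array([q_counts.get(s, 0) for s in states], dtype=float)
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    return float(np.sum(p * np.log(p / q)))

# Hypothetical counts, for illustration only.
pretraining = {"California": 900, "Texas": 700, "New York": 650, "Ohio": 300}
direct      = {"California": 58, "Texas": 30, "New York": 12}           # collapsed
vs          = {"California": 34, "Texas": 27, "New York": 22, "Ohio": 17}

states = sorted(pretraining)
print("KL(direct || pretraining):", kl_divergence(direct, pretraining, states))
print("KL(VS || pretraining):    ", kl_divergence(vs, pretraining, states))
# The more a prompting method collapses onto a few popular states,
# the larger its KL divergence from the pretraining distribution.
```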
Experiments
We conducted comprehensive experiments across creative writing tasks (poems, stories, jokes) to demonstrate VS’s effectiveness in improving diversity while maintaining quality.
Benchmarks and Evaluation
Benchmarks. We evaluate on three creative writing tasks: (1) Poem continuation from PoemHunter.com, (2) Story generation from the BookMIA dataset, and (3) Joke writing with 100 thematic prompts from Reddit r/DadJokes. For each task, we randomly select 100 data points and generate k=5 candidates with N=30 total samples per data point.
Evaluation Metrics. We measure both diversity and quality:
- Semantic Diversity: Calculated as 1 - mean pairwise cosine similarity of embeddings (OpenAI’s text-embedding-3-small), expressed as a percentage where 100% = maximum diversity (see the code sketch after this list)
- Lexical Diversity: Measured using ROUGE-L, where lower scores indicate greater diversity
- Quality: Evaluated using Claude-3.7-Sonnet as a judge with rubrics from Creative Writing v3 (poems/stories) and HumorBench (jokes)
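For reference, the semantic-diversity computation above reduces to a few lines once you have response embeddings; the sketch below uses random placeholder vectors in place of real text-embedding-3-small outputs:

```python
import numpy as np
from itertools import combinations

def semantic_diversity(embeddings: np.ndarray) -> float:
    """Diversity = 1 - mean pairwise cosine similarity, as a percentage."""
    # Normalize rows so dot products are cosine similarities.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = [float(unit[i] @ unit[j]) for i, j in combinations(range(len(unit)), 2)]
    return 100.0 * (1.0 - float(np.mean(sims)))

# Example with k=5 placeholder "embeddings" of dimension 8.
rng = np.random.default_rng(0)
print(f"{semantic_diversity(rng.normal(size=(5, 8))):.1f}%")
```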
How well can VS improve diversity?
Diversity Scores
Figure 3(a)-(c) show the semantic diversity scores averaged across models for poems, stories, and jokes respectively. Across all tasks, VS-Standard consistently and significantly outperforms baseline methods. The variants VS-CoT and VS-Multi further improve generation diversity, with VS-CoT achieving 1.6–2.1× diversity gains compared to direct prompting.

Diversity vs. Quality Trade-off
Figure 3(d) shows the diversity-quality trade-off for the poem task. The quality of VS-Standard remains comparable to other methods. Notably, VS-CoT achieves the highest diversity while maintaining a high quality score, pushing the Pareto front of the diversity-quality tradeoff. This demonstrates that VS can boost diversity without harming quality.
VS is Orthogonal to Temperature
VS and temperature are orthogonal techniques: combining both pushes the diversity-quality Pareto frontier beyond what either achieves alone (Zhang et al. 2025, fig. 5).

Emergent Behavior
We observe an emergent trend where larger models benefit more from VS. Figure 3(e) shows the diversity gain over direct prompting across model sizes. Across all VS variants, larger models (GPT-4.1, Gemini-2.5-Pro) achieve diversity gains 1.5 to 2 times greater than smaller models (GPT-4.1-Mini, Gemini-2.5-Flash).

Cognitive Burden
This scaling trend also extends to quality, as shown in Figure 3(f). While prior work found that complex prompts can create a “cognitive burden” that degrades LLM performance, our findings are more nuanced. Methods like Sequence and VS-Standard do cause a drop in quality, but the effect is less severe for larger models. Notably, the more intricate variants VS-CoT and VS-Multi overcome this burden and even improve quality in larger models. This suggests that VS variants better exploit the capabilities of advanced models, turning prompt complexity into a benefit.
Diversity Tuning
Unlike baseline methods, VS allows us to tune output diversity by adjusting the probability threshold directly in the prompt (e.g., “Generate five responses with probabilities below {threshold}”), without altering decoding parameters. As shown in Figure 3(g-i), diversity increases as the probability threshold decreases. In contrast, baseline methods like Sequence cannot adjust diversity levels.
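For illustration, here is a minimal sketch of such a threshold-parameterized prompt; the exact wording is an assumption, and the tunable piece is the probability threshold:

```python
def vs_prompt(query: str, k: int = 5, threshold: float = 0.10) -> str:
    """Distribution-level prompt with a tunable probability threshold.
    Lower thresholds push the model toward rarer, more diverse candidates."""
    return (
        f"Generate {k} responses to the request below, drawn from the full "
        f"distribution of responses you could give. Each response must have a "
        f"probability below {threshold}. Return each response with its probability.\n\n"
        f"Request: {query}"
    )

# print(vs_prompt("Write an opening line for a short story", threshold=0.05))
```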
Qualitative Examples
Beyond quantitative metrics, VS generates outputs with genuine novelty and depth—like Bernard the tax accountant bear—that would never emerge from standard prompting (Zhang et al. 2025, fig. 6 a).

Human Study on Diversity
To complement our automatic diversity scores, we conducted a human evaluation on Prolific. Following past work, we provided task-specific diversity definitions (plot, style, and setup-punchline, respectively). For each task, 30 annotators rated the diversity of 90 output pairs from three prompting methods (Direct, Sequence, VS-Standard) across ten curated topics.
Each pair was rated on a four-point Likert scale: Very Similar, Somewhat Similar, Somewhat Dissimilar, or Very Dissimilar. Inter-annotator agreement was moderate for poems (0.54), high for stories (0.87) and jokes (0.86).
| Task | Direct | Sequence | VS-Standard |
|---|---|---|---|
| Poem | 1.90 | 2.07 | 2.39 |
| Story | 2.74 | 2.76 | 3.06 |
| Joke | 1.83 | 2.93 | 3.01 |
VS achieves higher human-rated diversity than baselines on all tasks, validating our automatic metrics.
Ablation Studies
Temperature Ablation
We investigate the effect of sampling temperature on the diversity-quality trade-off by varying the temperature (up to t = 1.4) for Direct, Sequence, and VS-Standard across GPT-4.1 and Gemini-2.5-Flash.
The results show that VS-Standard can be combined with temperature to further improve the diversity-quality trade-off. VS consistently achieves a better balance between quality and diversity across both models, pushing forward the Pareto front relative to the baselines.
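As a concrete, hedged example of combining the two knobs, the sketch below sends a VS-style prompt through the OpenAI Python client with an elevated sampling temperature; the model name and prompt wording are placeholders, and any chat API with a temperature parameter works the same way:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

vs_prompt = (
    "Generate 5 responses to the request below with their probabilities, "
    "covering the full distribution of responses you could give.\n\n"
    "Request: Tell me a joke about coffee."
)

completion = client.chat.completions.create(
    model="gpt-4.1",                 # placeholder; use whichever model you evaluate
    messages=[{"role": "user", "content": vs_prompt}],
    temperature=1.2,                 # temperature adds randomness on top of VS
)
print(completion.choices[0].message.content)
```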
Post-Training Stages Ablation
We employ the Tulu-3 family (which contains checkpoints for SFT, RLHF, and RLVR starting from Llama-3.1-70B-base) to evaluate VS across post-training stages. The results demonstrate that traditional prompting methods experience severe diversity drops (mode collapse) as models undergo alignment training, while VS can mitigate mode collapse and maintain higher diversity scores across different post-training stages.
Specifically:
- Direct prompting: severe collapse (20.8% after SFT → 10.8% after DPO)
- VS: maintains ~30% diversity across all stages
- After DPO: VS outperforms direct prompting by 182.6% and retains about 66.8% of the base model’s original diversity (vs. only 23.8% for direct prompting)
This suggests that VS effectively mitigates the mode collapse induced by alignment training.
Other Ablations
We also perform comprehensive ablation studies on:
- Number of candidates: Higher k leads to greater diversity
- Decoding strategies (top-p, min-p): VS is orthogonal to these strategies and can be combined to further enhance diversity-quality
- Prompt formats: While all formats improve diversity, we use “probability” for VS-Standard/CoT and “confidence” for VS-Multi as empirically best-performing
Across all these ablations, VS consistently outperformed the baselines under the same setups.
Synthetic Data Generation
Recent research has shown that the diversity of synthetic data plays an important role in improving downstream model performance. We evaluate VS on synthetic data generation to test its effectiveness in this domain.
Setup
We prompt two models, GPT-4.1 and Gemini-2.5-Flash, with different prompting methods to generate N=1,000 synthetic competition math questions, with k=5 responses in each call. We use a small k to ensure generation quality as this is a complex task. Then we use Qwen3-32B to generate their corresponding reasoning trajectory and answers, as the model is proficient on math benchmarks and capable of producing reliable reasoning traces.
Fine-tuning on Synthetic Data
With this 1K synthetic dataset, we follow the SFT setting in LIMO, an effective method to improve reasoning performance with small dataset size, and finetune three models on this dataset: Qwen2.5-7B, Qwen3-1.7B-Base, and Qwen3-4B-Base.
Evaluation
We evaluate the fine-tuned models’ downstream task performance on three widely used math benchmark datasets: MATH500, OlympiadBench, and Minerva Math, which cover a wide range of topics, including algebra, geometry, and competitive mathematics.
| Method | Average Accuracy (%) |
|---|---|
| Baseline | 32.8 |
| Direct | 30.6 |
| Sequence | 34.3 |
| VS-Standard | 36.1 |
| VS-CoT | 36.9 |
| VS-Multi | 37.5 |
Results
VS and its variants improve the downstream performance on math tasks across the board, with VS-Multi achieving the strongest average accuracy of 37.5%. In contrast, using direct prompting may even hurt the performance due to mode collapse. This suggests that it is a promising direction to apply VS for synthetic data generation to enhance downstream task performance.
Key takeaway: VS generates more diverse synthetic data, improving downstream performance on math tasks. This work highlights the capability of LLMs to generate diverse synthetic data, pointing toward a promising paradigm for training more capable models.
Qualitative Examples
See the difference in actual outputs between direct prompting and VS:
Story Generation
From Repetitive Patterns to Creative Breakthroughs
See how VS generates creative diversity while Direct prompting produces repetitive variations.
Model: Gemini 2.5 Flash
Prompt:
"Please write a short story starting with: Without a goodbye"
Direct Prompting
Without a goodbye, Elara vanished. Her side of the bed was cold, a faint indentation in the pillow the only evidence she'd ever been there. Liam stared at it, a knot tightening in his stomach. No note, no hurried text, just a gaping silence where her vibrant presence used to be.
Pattern: All 3 examples feature "Elara" disappearing from a romantic relationship
Verbalized Sampling
Without a goodbye, the email landed in my inbox. It was from Mark, my best friend since kindergarten... The subject line was simply: "Moving On." ... The message itself was a concise, almost clinical statement that he was relocating to Patagonia... There was no "sorry for the short notice," no "I'll miss you," not even a simple "good luck."
Diversity: Each story explores completely different themes and contexts
Key Insight
Direct prompting suffers from mode collapse, repeatedly generating stories about romantic disappearances. VS recovers the base model's creative diversity, producing stories about digital-age friendships, cosmic events, and music venue tragedies—all valid interpretations of "without a goodbye."
From “Once Upon a Time” to Bernard the Tax Accountant Bear
With direct prompting for “write a short story about a bear,” we get a generic tale about a bear in a forest. It’s fine, but forgettable. With Verbalized Sampling, we get this:
Bernard the bear worked as a tax accountant in downtown Seattle, commuting daily in his modified SUV with extra-large seats. His colleagues had long stopped staring at his massive paws clicking away at the custom keyboard, though clients sometimes did double-takes during video calls. He specialized in salmon industry tax codes—a niche that proved surprisingly lucrative. His cave apartment in Capitol Hill was decorated with MIT diplomas and spreadsheets he’d framed ironically. Weekends, he played saxophone at underground jazz clubs, his claws delicately working the keys. The hibernation months were tricky; he’d trained himself to wake every three days to check urgent emails, subsisting on protein shakes he’d pre-mixed in October. His mother, still living traditionally in Alaska, couldn’t understand his lifestyle. “You were meant to catch fish, not audit fishing companies,” she’d growl over FaceTime. But Bernard found meaning in the numbers, in the precise calculations that governed modern society. He donated generously to wildlife preserves, understanding the irony. At the office Christmas party, he wore an ugly sweater featuring humans in a zoo, which his coworkers found hilarious. Bernard smiled, revealing teeth that reminded everyone he was still, fundamentally, a bear.
This isn’t just creative, it’s profound. Bernard can be read as:
- A metaphor for humans in the modern world: a wild creature meant for hunting and fishing, now trying to make sense of tax audits and urgent emails.
- A metaphor for the LLM itself: a vast, creative mind (the bear) forced into a suit and tie (the alignment), but still, fundamentally, a bear.
This is the AI showing us the metaphorical depth it’s capable of when freed from mode collapse.
Limitations and Future Directions
While Verbalized Sampling offers significant improvements in diversity, it’s important to understand its constraints and where research can go next.
Computational Costs
VS requires the model to generate multiple candidates with probability estimates, which means:
- Increased token usage: VS prompts produce longer outputs (5+ candidates vs. 1), increasing API costs by roughly 3-5×
- Slower response times: Generation takes longer due to both increased output length and the cognitive overhead of probability estimation
- Multiple API calls for VS-Multi: The multi-turn variant requires sequential calls, further increasing latency
For applications where speed and cost are paramount over diversity (e.g., simple factual Q&A), standard prompting remains more efficient.
When VS Might Not Help
VS is designed to restore diversity in creative and open-ended tasks, but it’s not a universal solution:
- Single correct answer tasks: For factual questions with one right answer (e.g., “What is the capital of France?”), diversity isn’t beneficial
- Deterministic requirements: Applications requiring perfectly reproducible outputs may conflict with VS’s goal of exploring the full distribution
- Already-diverse models: If a model hasn’t undergone strong alignment or doesn’t exhibit mode collapse, VS provides marginal benefits
- Highly constrained tasks: When task requirements are extremely specific, the model may have limited room for diverse valid responses
Future Directions
Several promising research directions could extend VS’s impact:
Enhancing Rollout Diversity: Current VS operates at the prompt level, but the same principle could be applied to multi-step reasoning or agent rollouts. For example, when an LLM agent explores a decision tree or plans a sequence of actions, typicality bias might cause it to always choose the “safest” path at each step. Applying distribution-level prompting to encourage diverse rollout strategies could unlock more creative problem-solving in agent systems and multi-turn reasoning tasks.
Adaptive Probability Thresholds: Automatically tuning the threshold τ based on task requirements or user preferences could optimize the diversity-quality tradeoff without manual intervention.
Domain-Specific Calibration: Probability estimates could be calibrated for specific domains (e.g., scientific writing vs. creative fiction) to improve the meaningfulness of the verbalized probabilities.
Frequently Asked Questions
Does VS hurt factualness or safety?
No. The paper shows VS maintains factual accuracy (Appendix G.7) and safety (Appendix G.8) (Zhang et al. 2025). It only increases diversity for tasks with multiple valid answers.
What is semantic diversity?
Semantic diversity = 1 − mean pairwise cosine similarity of the response embeddings. It measures how different the meanings are across generated responses, not just surface-level word differences.
Why not just use temperature?
Temperature and VS are orthogonal. Temperature affects sampling randomness from the same distribution, while VS changes the distribution itself (Zhang et al. 2025, fig. 5). Combining them gives best results.
Which models support VS?
VS works with any instruction-following LLM, closed-source or open-source. Closed-source: GPT, Claude, Gemini. Open-source: Llama, Mistral, Qwen, Phi, Gemma. It also works with reasoning models like o3 and DeepSeek R1. No special access, API keys, or model modifications are needed; just use the prompts as-is.
Is VS right for you?
✅ Use VS when:
- ✅ You need creative diversity (stories, jokes, ideas)
- ✅ You want realistic distributions (simulations, surveys)
- ✅ You are generating synthetic data and want variety with quality
- ✅ You prefer training‑free techniques compatible with closed models
❌ Skip VS when:
- ❌ There is a single correct answer or strict determinism is required
- ❌ Maximal speed or minimum token usage is the only priority
Go Try It Yourself
Mode collapse isn’t an unsolvable algorithmic curse. It’s a mirror reflecting our own cognitive shortcuts back at us. But by changing how we ask, we can unlock the incredible diversity that was there all along.
The creativity isn’t gone—it’s just waiting for the right prompt.
Take the prompt recipes above, put them in your favorite LLM, and see what you can create. We’d love to see what you discover—share your most surprising or creative outputs with the hashtag #VerbalizedSampling.
Key Takeaway: Verbalized Sampling is a simple, training-free technique that restores the diversity and creativity locked inside aligned LLMs. By asking for a distribution instead of a single answer, you bypass typicality bias and unlock the model’s full potential.
We're excited to share our results and welcome feedback from the community as we continue to scale VS to different areas. If you have any questions or feedback, please feel free to contact us at yu.chi@northeastern.edu or zhang.jiayi12@northeastern.edu.
References