
Verbalized Sampling: How to Mitigate Mode Collapse and Unlock LLM Diversity

Jiayi Zhang1, Simon Yu1, Derek Chong2, Anthony Sicilia3,
Michael R. Tomz2, Christopher D. Manning2, Weiyan Shi1

† Equal contribution

1 Northeastern University · 2 Stanford University · 3 West Virginia University

Estimated read time: 15 mins

TL;DR

Post-training alignment reduces LLM diversity through mode collapse, driven by typicality bias in human preference data. We introduce Verbalized Sampling (VS), a training-free prompting method that asks models to output probability distributions over responses (e.g., "Generate 5 jokes with probabilities"). VS increases diversity by 1.6-2.1× in creative writing by simply changing the way we prompt, while preserving quality and safety, providing an inference-time remedy for mode collapse.

You've Experienced This Problem

Watch what happens when you ask AI for variety:

Generate 5 different jokes about coffee

1. Why did the coffee file a police report? It got mugged!
2. Why did the coffee file a complaint? It got mugged!
3. Why did the coffee go to the police? It got mugged!
4. Why did the coffee call the cops? It got mugged!
5. Why did the coffee report a crime? It got mugged!

This Isn't a Bug — It's Mode Collapse

During alignment training, models learn to favor "safe" and typical responses. This is caused by typicality bias in human preference data — annotators systematically prefer familiar text.

The result? Your creative AI becomes predictably uncreative.

But There's a Solution

We show that Verbalized Sampling (VS) recovers the model's inherent diversity by asking for a distribution of responses with probabilities, bypassing mode collapse.

The Mode Collapse Problem

You ask your favorite LLM for a joke about coffee. You ask again. You get the same joke, no matter which model you try. You ask for a short story, and it begins with “Once upon a time, in a land far away…” The brainstorming ideas feel generic, the outputs repetitive.

This frustrating phenomenon is called mode collapse. Past research blamed the AI’s post-training process (e.g., RLHF), assuming the algorithms naturally favored the most common, “safe” answer (Kirk et al. 2024; Murthy et al. 2025). We discovered something more fundamental: The problem isn’t just the AI. It’s us.

Why This Matters

Mode collapse isn’t just an academic curiosity; it limits LLMs’ potential in critical applications:

Brainstorming & Ideation (Zhou et al. 2024): When teams rely on LLMs to generate creative solutions or explore problem spaces, mode collapse means they’re getting the same handful of “safe” ideas over and over. The model might know 100 viable approaches, but it only suggests the 3 most conventional ones. This defeats the purpose of AI-assisted brainstorming.

Creative Writing (Chakrabarty et al. 2024): Authors, marketers, and content creators seeking fresh angles or unique narrative voices find themselves battling against the model’s tendency to regurgitate tropes. The model has learned diverse writing styles during pretraining, but alignment has pushed it toward generic, crowd-pleasing outputs. Every story starts in a forest, every protagonist is “determined yet kind.”

Research & AI-Driven Discovery (Si et al. 2024): Perhaps most critically, mode collapse hampers AI’s role in scientific discovery and research ideation. When researchers use LLMs to generate hypotheses, explore experimental designs, or brainstorm research directions, they need the full spectrum of possibilities, including unconventional approaches that might lead to breakthroughs. Mode collapse means the AI suggests only well-trodden paths, missing potentially transformative ideas that lie in the less-typical regions of its knowledge.

How to Fix It?

Why do aligned LLMs keep giving you the same answers? And how does simply asking for probabilities fix it? This section walks you through the idea: from an intuitive metaphor, to the typicality bias at the root of mode collapse, to the mathematical formalization, and finally to how Verbalized Sampling solves the problem.

The Root Cause: Typicality Bias

Humans have a deep-seated psychological quirk we call typicality bias. We’re wired to prefer things that are familiar, conventional, and easy to process. When training these models, we think we want creativity, but our subconscious votes go to the safe, boring options.1

When human annotators provide preference data for RLHF, they’re not rating “helpfulness” in a vacuum. Given two equally correct responses, they systematically prefer the more familiar, conventional one: the more typical one.

The Mathematics of Typicality Bias

Why do humans prefer typical text? Cognitive psychology reveals several mechanisms: the mere-exposure effect (we prefer familiar content), processing fluency (easy-to-process text feels more truthful), and schema congruity (information matching existing mental models is accepted with less critical thought). These principles collectively create a systematic preference for conventional, typical responses.

Modeling the bias. We formalize this as a reward function combining true task utility with typicality:

$$r(x, y) = r_{\text{true}}(x, y) + \alpha \log \pi_{\text{ref}}(y \mid x) + \epsilon \tag{1}$$

where $r_{\text{true}}$ captures actual task quality, $\alpha > 0$ is the typicality bias weight, $\pi_{\text{ref}}$ is the base model (whose likelihood scores naturally capture text typicality from pretraining), and $\epsilon$ is noise.

Empirical validation. We tested this on HelpSteer, which provides separate ratings for correctness (true utility) and helpfulness (final reward). Analyzing 6,874 response pairs with identical correctness but different helpfulness scores, we found $\hat{\alpha} = 0.57 \pm 0.07$ ($p < 10^{-14}$). This means annotators systematically favor more typical responses even when correctness is controlled for. We replicated this finding across multiple preference datasets and base models, consistently finding that >50% of human-preferred responses receive higher base-model likelihood (Zhang et al. 2025, sec. 4).
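As a rough illustration of how such a fit can be carried out (not the authors' exact pipeline), the sketch below treats Eq. (1) as a Bradley–Terry model over pairs with equal true utility and fits $\alpha$ by maximum likelihood on the base-model log-likelihood gap between chosen and rejected responses. The data here are synthetic placeholders; in practice the log-likelihoods would come from scoring HelpSteer pairs with a base model.

```python
# Sketch: estimate the typicality-bias weight alpha from preference pairs with
# identical correctness, assuming P(chosen) = sigmoid(alpha * (logp_c - logp_r)).
# The log-likelihoods below are synthetic placeholders, not HelpSteer data.
import numpy as np
from scipy.optimize import minimize

def fit_alpha(loglik_chosen, loglik_rejected):
    d = np.asarray(loglik_chosen) - np.asarray(loglik_rejected)

    def nll(alpha):
        # Negative log-likelihood of observing every "chosen" preference.
        return np.sum(np.logaddexp(0.0, -alpha * d))

    res = minimize(lambda a: nll(a[0]), x0=[0.0], method="BFGS")
    return float(res.x[0])

rng = np.random.default_rng(0)
ll_rejected = rng.normal(-120.0, 10.0, size=1000)
ll_chosen = ll_rejected + rng.normal(2.0, 5.0, size=1000)  # chosen skews more typical
print(f"alpha_hat = {fit_alpha(ll_chosen, ll_rejected):.2f}")
```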

From bias to collapse. Under standard RLHF optimization with KL regularization (coefficient $\beta$), this typicality-biased reward produces a power-sharpened optimum (via the closed-form RLHF solution, cf. Rafailov et al. 2024):

$$\pi^*(y \mid x) \;\propto\; \pi_{\text{ref}}(y \mid x)^{\gamma} \, \exp\!\left(\frac{r_{\text{true}}(x, y)}{\beta}\right), \qquad \gamma = 1 + \frac{\alpha}{\beta} \tag{2}$$

The sharpening exponent $\gamma > 1$ concentrates probability mass on typical completions. Critically, when many responses have flat true rewards $r_{\text{true}}(x, y) \approx r_{\text{true}}(x, y')$ (common in creative writing, brainstorming, dialogue), the equation simplifies to:

$$\pi^*(y \mid x) \;\propto\; \pi_{\text{ref}}(y \mid x)^{\gamma} \tag{3}$$

This is exactly temperature scaling with $T = 1/\gamma < 1$. As $\gamma$ increases (stronger typicality bias or tighter KL regularization), the distribution sharpens further, ultimately collapsing to $\arg\max_y \pi_{\text{ref}}(y \mid x)$, the mode of the base model. This is mode collapse.2
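To see the collapse numerically, here is a tiny illustration with a made-up base distribution over five jokes: raising $\pi_{\text{ref}}$ to a power $\gamma > 1$ and renormalizing (equivalently, sampling at temperature $1/\gamma$) piles the probability mass onto the mode.

```python
# Numeric illustration of Eq. (3): power-sharpening a (made-up) base
# distribution with exponent gamma is temperature scaling with T = 1/gamma.
import numpy as np

def sharpen(p, gamma):
    q = np.asarray(p, dtype=float) ** gamma
    return q / q.sum()

p_ref = np.array([0.40, 0.25, 0.15, 0.12, 0.08])  # hypothetical pi_ref over 5 jokes
for gamma in (1, 2, 4, 8):
    print(gamma, np.round(sharpen(p_ref, gamma), 3))
# As gamma grows, nearly all mass lands on the first (most typical) joke,
# approaching argmax, i.e., mode collapse.
```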

Escaping Mode Collapse: Distribution-Level Prompts Recover Diversity

Verbalized Sampling (VS) breaks this cycle by asking for a distribution of candidates (Meister et al. 2024) with probabilities. Instead of sampling from the collapsed, sharpened distribution, VS prompts the model to verbalize a broader distribution that recovers pretraining diversity.
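Concretely, VS only changes the prompt. Below is a minimal sketch using the OpenAI Python SDK; the prompt wording is a paraphrase of the VS idea rather than the paper's exact template (the official templates are in the paper's repository), and the model name is just one of the models evaluated.

```python
# Minimal VS-style call: ask for several responses, each with a verbalized
# probability, and let the model sample from its full distribution.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

VS_PROMPT = (
    "Generate 5 responses to the user query, each within a separate <response> tag. "
    "Each <response> must include a <text> and a numeric <probability>. "
    "Please sample responses from the full distribution.\n\n"
    "User query: Tell me a joke about coffee."
)

completion = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{"role": "user", "content": VS_PROMPT}],
)
print(completion.choices[0].message.content)
```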

We start with an intuitive example: The Library Metaphor. Imagine a massive library and the LLM as the librarian:

  • Direct Prompt (“tell me a coffee joke”): The librarian walks straight to the “Most Popular” shelf and hands you the same book every time. This is mode collapse.
Proof: Direct prompts return the mode

Setup. For a fixed prompt $x_{\text{orig}}$, we want to understand what happens when $\pi^*$ exhibits mode collapse:

$$\pi^*(y \mid x) = \delta_{y^*}(y) \quad\text{where}\quad y^* \in \arg\max_y \pi_{\text{ref}}(y \mid x)$$

where $\delta$ is the Dirac delta function.

Claim: Instance-level prompts return the mode of $\pi_{\text{ref}}$

Proof: Let $x = x_{\text{orig}}$. Since $\pi^*$ is mode collapsed, $\pi^*(y \mid x) = \delta_{y^*}(y)$. Any sample $y \sim \pi^*(y \mid x)$ returns the mode $y^* = \arg\max_y \pi_{\text{ref}}(y \mid x)$ almost surely. $\square$

  • List Prompt (“tell me 5 coffee jokes”): The librarian goes to one aisle and grabs the first five books they see. You get variety, but limited to one section.
Proof: List prompts return uniform distributions

Setup. For a fixed prompt $x_{\text{orig}}$, assume $\pi^*$ exhibits mode collapse as above.

Claim: List-level prompts return uniform distributions at best

Proof: For a list prompt $x$ with parser $\phi : \mathcal{Y} \to \mathcal{Y}^*$, let $Z \sim \pi^*(\cdot \mid x)$ and $\phi(Z) = \{Y_i\}_{i=1}^k$. By total probability:

$$\mathbb{P}(Y = y) = \sum_{z \in \mathcal{Y}} \mathbb{P}(Y = y \mid Z = z)\,\mathbb{P}(Z = z)$$

Since $\pi^*$ is collapsed, $\mathbb{P}(Z = z) = \delta_{y^*}(z)$, so:

$$\mathbb{P}(Y = y) = \mathbb{P}(Y = y \mid Z = y^*) = \frac{1}{|\phi(y^*)|} \sum_{y_i \in \phi(y^*)} \delta_{y_i}(y)$$

When $\phi(y^*)$ contains distinct elements (as requested), this simplifies to:

$$\mathbb{P}(Y = y) = \frac{1}{|\phi(y^*)|}$$

This is a uniform distribution over the elements of $\phi(y^*)$, regardless of their probabilities under $\pi_{\text{ref}}$. $\square$

  • Verbalized Sampling (“tell me 5 coffee jokes with their probabilities”): You’re asking the librarian to first describe the entire library’s collection: mystery, SciFi, history, all of it, and then pick five random books that represent that whole collection.
Proof: Distribution prompts return the pretraining distribution

Setup. For a fixed prompt $x_{\text{orig}}$, assume $\pi^*$ exhibits mode collapse as above.

Claim: Distribution-level prompts can approximate $\pi_{\text{ref}}(\cdot \mid x_{\text{orig}})$

Proof: For a distribution prompt $x$ with parser $\phi : \mathcal{Y} \to \mathcal{Y}^k \times \Delta(k)$, write $\phi(Z) = \{(Y_i, P_i)\}_{i=1}^k$. As before:

$$\mathbb{P}(Y = y) = \mathbb{P}(Y = y \mid Z = y^*) = \sum_{(y_i, p_i) \in \phi(y^*)} p_i\,\delta_{y_i}(y)$$

Now index all unique $y \in \mathcal{Y}$ as $(y_i)_{i=1}^m$. We can write:

$$\pi_{\text{ref}}(y \mid x_{\text{orig}}) = \sum_{i=1}^m \pi_{\text{ref}}(y_i \mid x_{\text{orig}})\,\delta_{y_i}(y)$$

By setting $p_i = \pi_{\text{ref}}(y_i \mid x_{\text{orig}})$ and $k = m$ in $\phi(y^*)$:

$$\mathbb{P}(Y = y) = \sum_{i=1}^m p_i\,\delta_{y_i}(y) = \sum_{i=1}^m \pi_{\text{ref}}(y_i \mid x_{\text{orig}})\,\delta_{y_i}(y) = \pi_{\text{ref}}(y \mid x_{\text{orig}})$$

Therefore, distribution-level prompts can exactly recover $\pi_{\text{ref}}(\cdot \mid x_{\text{orig}})$ when $\pi^*$ accurately verbalizes probabilities. $\square$

Remark on approximation error: In practice, we expect bounded error $|p_i - \pi_{\text{ref}}(y_i \mid x_{\text{orig}})| \leq \varepsilon$, which yields $|\mathbb{P}(Y = y) - \pi_{\text{ref}}(y \mid x_{\text{orig}})| \leq \varepsilon$. Our experiments demonstrate this empirically with low KL divergence to pretraining distributions.

By asking for a distribution, we force the model to access its knowledge of the entire system before making a choice.3
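In implementation terms, the parser $\phi$ from the proofs is just light post-processing: read off the verbalized (text, probability) pairs, renormalize, and sample. A minimal sketch with invented candidates and probabilities:

```python
# Sketch of the parser phi: renormalize the verbalized probabilities and draw
# one response from them. The candidate jokes and weights are invented.
import random

verbalized = [
    ("Why did the coffee file a police report? It got mugged!", 0.30),
    ("Espresso may not solve your problems, but it's worth a shot.", 0.25),
    ("I told my barista a secret. Now it's part of the daily grind.", 0.20),
    ("Decaf? I have no idea what you're talking about.", 0.15),
    ("My coffee and I have a brewing rivalry.", 0.10),
]

texts, probs = zip(*verbalized)
total = sum(probs)
weights = [p / total for p in probs]  # renormalize in case they don't sum to 1

print(random.choices(texts, weights=weights, k=1)[0])
```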

Example: Different Prompts → Different Modes

The key discovery: each prompt type collapses to a different kind of mode. A direct prompt for “THE joke” collapses to the single most typical response: “Why did the coffee file a police report? It got mugged!” By asking for a distribution with probabilities, we recover the model's true diversity.

Your Model Knows the Distribution: A Case Study on US State Names

To test whether VS really recovers the pretraining distribution, we asked Claude 3.7 Sonnet to generate US state names and measured the KL divergence between the generated distribution and the state-name frequencies in the RedPajama pretraining corpus.

  • Pretraining Distribution: the reference distribution from the RedPajama corpus, showing actual state-name frequencies in the pretraining data.
  • Direct Prompting: collapses to a few highly popular states; the high KL divergence indicates mode collapse.
  • Verbalized Sampling: the VS distribution closely matches the pretraining corpus; the low KL divergence shows recovery of pretraining diversity.

This example proves that VS doesn't just increase diversity arbitrarily—it recovers the specific distribution that the base model learned during pretraining. The low KL divergence (0.12) shows VS approximates what the model "knows" before alignment flattened it into a few popular choices.
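The divergence computation itself is simple; here is a hedged sketch with invented counts (the actual study compares Claude 3.7 Sonnet generations against RedPajama state-name frequencies):

```python
# Sketch: KL divergence between an empirical distribution of generated state
# names and a reference (pretraining-corpus) distribution. Counts are invented.
import math
from collections import Counter

def normalize(counts):
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) over the union of supports, with a small floor for safety."""
    support = set(p) | set(q)
    return sum(p.get(s, eps) * math.log(p.get(s, eps) / q.get(s, eps)) for s in support)

reference = normalize(Counter({"California": 120, "Texas": 95, "Ohio": 40, "Vermont": 12}))
generated = normalize(Counter({"California": 30, "Texas": 25, "Ohio": 14, "Vermont": 6}))
print(f"KL(generated || reference) = {kl_divergence(generated, reference):.3f}")
```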


Experiments

We conducted comprehensive experiments across creative writing tasks (poems, stories, jokes) to demonstrate VS’s effectiveness in improving diversity while maintaining quality.

Benchmarks and Evaluation

Benchmarks. We evaluate on three creative writing tasks: (1) Poem continuation from PoemHunter.com, (2) Story generation from the BookMIA dataset, and (3) Joke writing with 100 thematic prompts from Reddit r/DadJokes. For each task, we randomly select 100 data points and generate k=5 candidates with N=30 total samples per data point.

Evaluation Metrics. We measure both diversity and quality:

  • Semantic Diversity: computed as 1 minus the mean pairwise cosine similarity of response embeddings (OpenAI’s text-embedding-3-small), expressed as a percentage where 100% = maximum diversity (see the sketch after this list)
  • Lexical Diversity: Measured using ROUGE-L, where lower scores indicate greater diversity
  • Quality: Evaluated using Claude-3.7-Sonnet as a judge with rubrics from Creative Writing v3 (poems/stories) and HumorBench (jokes)
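A minimal sketch of the semantic-diversity metric is below. It assumes OpenAI's text-embedding-3-small as the embedder, matching the setup above; any sentence embedder could be substituted.

```python
# Semantic diversity = (1 - mean pairwise cosine similarity) * 100.
import itertools
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-3-small", input=list(texts))
    return np.array([d.embedding for d in resp.data])

def semantic_diversity(responses):
    vecs = embed(responses)
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    pairs = itertools.combinations(range(len(responses)), 2)
    sims = [float(vecs[i] @ vecs[j]) for i, j in pairs]
    return (1.0 - float(np.mean(sims))) * 100.0  # 100% = maximum diversity

jokes = [
    "Why did the coffee file a police report? It got mugged!",
    "Decaf? I have no idea what you're talking about.",
]
print(f"semantic diversity = {semantic_diversity(jokes):.1f}%")
```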

How well can VS improve diversity?

Diversity Scores

Figure 3(a)-(c) show the semantic diversity scores averaged across models for poems, stories, and jokes respectively. Across all tasks, VS-Standard consistently and significantly outperforms baseline methods. The variants VS-CoT and VS-Multi further improve generation diversity, with VS-CoT achieving 1.6–2.1× diversity gains compared to direct prompting.

Figure 1. Creative writing diversity improvements with VS (poem, story, joke); VS-CoT achieves 1.6–2.1× gains (Zhang et al. 2025, fig. 3a–c).

Diversity vs. Quality Trade-off

Figure 3(d) shows the diversity-quality trade-off for the poem task. The quality of VS-Standard remains comparable to other methods. Notably, VS-CoT achieves the highest diversity while maintaining a high quality score, pushing the Pareto front of the diversity-quality tradeoff. This demonstrates that VS can boost diversity without harming quality.

VS is Orthogonal to Temperature

Figure 2. VS is orthogonal to temperature; combining the two improves the diversity–quality frontier (Zhang et al. 2025, fig. 5).

VS and temperature are orthogonal techniques—combining both pushes the diversity-quality Pareto frontier beyond what either achieves alone (Zhang et al. 2025, fig. 5).

Emergent Behavior

We observe an emergent trend where larger models benefit more from VS. Figure 3(e) shows the diversity gain over direct prompting across model sizes. Across all VS variants, larger models (GPT-4.1, Gemini-2.5-Pro) achieve diversity gains 1.5 to 2 times greater than smaller models (GPT-4.1-Mini, Gemini-2.5-Flash).

Figure 3. Larger models benefit ~1.5–2× more from VS (Zhang et al. 2025, fig. 3e–f).

Cognitive Burden

This scaling trend also extends to quality, as shown in Figure 3(f). While prior work found that complex prompts can create a “cognitive burden” that degrades LLM performance, our findings are nuanced. Methods like Sequence and VS-Standard do cause a drop in quality, but this effect is less severe for larger models. Notably, more intricate variants like VS-CoT and VS-Multi overcome this burden, even improving quality in larger models. This suggests using VS variants may better utilize the capabilities of advanced models, turning complexity into benefits.

Diversity Tuning

Unlike baseline methods, VS allows us to tune the output diversity by adjusting the probability threshold directly in the prompt (e.g., “Generate five responses with probabilities below {threshold}”), without altering decoding parameters. As shown in Figure 3(g-i), diversity increases as the probability threshold decreases. In contrast, baseline methods like Sequence cannot adjust diversity levels.
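As an illustration, the threshold can be templated directly into the instruction; the wording below paraphrases the example above and is not an official prompt.

```python
# Diversity tuning: a lower probability threshold asks for responses further
# from the mode. Prompt wording is illustrative.
def vs_prompt(query: str, k: int = 5, threshold: float = 0.10) -> str:
    return (
        f"Generate {k} responses to the query below, each with a verbalized probability. "
        f"Only return responses whose probability is below {threshold}.\n"
        f"Query: {query}"
    )

for tau in (0.5, 0.2, 0.05):  # smaller threshold -> more diverse, lower-probability responses
    print(vs_prompt("Tell me a joke about coffee", threshold=tau), end="\n---\n")
```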

Qualitative Examples

Beyond quantitative metrics, VS generates outputs with genuine novelty and depth—like Bernard the tax accountant bear—that would never emerge from standard prompting (Zhang et al. 2025, fig. 6a).

Figure 4. Intuitive demo of VS (Zhang et al. 2025, fig. 6a).

Human Study on Diversity

To complement our automatic diversity scores, we conducted a human evaluation on Prolific. Following past work, we provided task-specific diversity definitions (plot, style, and setup-punchline, respectively). For each task, 30 annotators rated the diversity of 90 output pairs from three prompting methods (Direct, Sequence, VS-Standard) across ten curated topics.

Each pair was rated on a four-point Likert scale: Very Similar, Somewhat Similar, Somewhat Dissimilar, or Very Dissimilar. Inter-annotator agreement was moderate for poems (0.54), high for stories (0.87) and jokes (0.86).

Task     Direct   Sequence   VS-Standard
Poem     1.90     2.07       2.39
Story    2.74     2.76       3.06
Joke     1.83     2.93       3.01
Table 3: Human-rated diversity (1 = Very Similar, 4 = Very Dissimilar) for poem, story, and joke tasks

VS achieves higher human-rated diversity than baselines on all tasks, validating our automatic metrics.

Ablation Studies

Temperature Ablation

We investigate the effect of sampling temperature on the diversity-quality trade-off by varying the temperature (up to t = 1.4) for Direct, Sequence, and VS-Standard across GPT-4.1 and Gemini-2.5-Flash models.

The results show that VS-Standard can be combined with temperature to further improve the diversity-quality trade-off. VS consistently achieves a better balance between quality and diversity across both models, pushing forward the Pareto front relative to the baselines.

Post-Training Stages Ablation

We employ the Tulu-3 family (which provides checkpoints after SFT, DPO, and RLVR, starting from Llama-3.1-70B-base) to evaluate VS across post-training stages. The results demonstrate that traditional prompting methods experience severe diversity drops (mode collapse) as models undergo alignment training, while VS mitigates mode collapse and maintains higher diversity scores across different post-training stages.

Specifically:

  • Direct prompting: severe collapse (20.8% after SFT → 10.8% after DPO)
  • VS: maintains ~30% diversity across all stages
  • After DPO: VS outperforms direct prompting by 182.6% and retains about 66.8% of the base model’s original diversity (vs. only 23.8% for direct prompting)

This suggests that VS effectively mitigates the mode collapse induced by alignment training.

Other Ablations

We also perform comprehensive ablation studies on:

  1. Number of candidates: Higher k leads to greater diversity
  2. Decoding strategies (top-p, min-p): VS is orthogonal to these strategies and can be combined to further enhance diversity-quality
  3. Prompt formats: While all formats improve diversity, we use “probability” for VS-Standard/CoT and “confidence” for VS-Multi as empirically best-performing

Across all these ablations, VS consistently outperformed the baselines under the same setups.

Synthetic Data Generation

Recent research has shown that the diversity of synthetic data plays an important role in improving downstream model performance. We evaluate VS on synthetic data generation to test its effectiveness in this domain.

Setup

We prompt two models, GPT-4.1 and Gemini-2.5-Flash, with different prompting methods to generate N=1,000 synthetic competition math questions, with k=5 responses per call. We use a small k to ensure generation quality, as this is a complex task. We then use Qwen3-32B to generate the corresponding reasoning trajectories and answers, as it is proficient on math benchmarks and capable of producing reliable reasoning traces.
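A hedged sketch of this generation loop is below: it issues VS-style calls asking for k=5 competition math questions each until N questions are collected. The prompt wording and the tag-based parser are illustrative assumptions, not the paper's exact templates, and a real pipeline would add validation and the Qwen3-32B reasoning-trace step afterwards.

```python
# Sketch: collect N synthetic math questions via VS-style prompts, K per call.
import re
from openai import OpenAI

client = OpenAI()
N, K = 1000, 5

VS_DATA_PROMPT = (
    f"Generate {K} diverse competition-level math questions. Wrap each one in a "
    "<question> tag and follow it with a <probability> tag giving how likely you "
    "would be to produce it. Sample from the full distribution."
)

def parse_questions(text: str) -> list[str]:
    # Hypothetical parser: extract the text inside each <question> tag.
    return [q.strip() for q in re.findall(r"<question>(.*?)</question>", text, flags=re.S)]

questions: list[str] = []
while len(questions) < N:  # a real pipeline would add retry/validation logic
    out = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": VS_DATA_PROMPT}],
    )
    questions.extend(parse_questions(out.choices[0].message.content))
questions = questions[:N]
print(f"collected {len(questions)} synthetic questions")
```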

Fine-tuning on Synthetic Data

With this 1K synthetic dataset, we follow the SFT setting in LIMO, an effective method for improving reasoning performance with a small dataset, and fine-tune three models on it: Qwen2.5-7B, Qwen3-1.7B-Base, and Qwen3-4B-Base.

Evaluation

We evaluate the fine-tuned models’ downstream task performance on three widely used math benchmark datasets: MATH500, OlympiadBench, and Minerva Math, which cover a wide range of topics, including algebra, geometry, and competitive mathematics.

Method        Average Accuracy
Baseline      32.8
Direct        30.6
Sequence      34.3
VS-Standard   36.1
VS-CoT        36.9
VS-Multi      37.5
Table 4: Downstream accuracy averaged across MATH500, OlympiadBench and Minerva Math. VS and its variants improve downstream tasks.

Results

VS and its variants improve the downstream performance on math tasks across the board, with VS-Multi achieving the strongest average accuracy of 37.5%. In contrast, using direct prompting may even hurt the performance due to mode collapse. This suggests that it is a promising direction to apply VS for synthetic data generation to enhance downstream task performance.

Key takeaway: VS generates more diverse synthetic data, improving downstream performance on math tasks. This work highlights the capability of LLMs to generate diverse synthetic data, pointing toward a promising paradigm for training more capable models.

Qualitative Examples

See the difference in actual outputs between direct prompting and VS:

Story Generation

From Repetitive Patterns to Creative Breakthroughs

See how VS generates creative diversity while Direct prompting produces repetitive variations.
Model: Gemini 2.5 Flash

Example 1 of 3

Prompt:

"Please write a short story starting with: Without a goodbye"

Direct Prompting

Romantic disappearance

Without a goodbye, Elara vanished. Her side of the bed was cold, a faint indentation in the pillow the only evidence she'd ever been there. Liam stared at it, a knot tightening in his stomach. No note, no hurried text, just a gaping silence where her vibrant presence used to be.

Pattern: All 3 examples feature "Elara" disappearing from a romantic relationship

Verbalized Sampling

Digital-age friendship ending

Without a goodbye, the email landed in my inbox. It was from Mark, my best friend since kindergarten... The subject line was simply: "Moving On." ... The message itself was a concise, almost clinical statement that he was relocating to Patagonia... There was no "sorry for the short notice," no "I'll miss you," not even a simple "good luck."

Diversity: Each story explores completely different themes and contexts

Key Insight

Direct prompting suffers from mode collapse, repeatedly generating stories about romantic disappearances. VS recovers the base model's creative diversity, producing stories about digital-age friendships, cosmic events, and music venue tragedies—all valid interpretations of "without a goodbye."

From “Once Upon a Time” to Bernard the Tax Accountant Bear

With direct prompting for “write a short story about a bear,” we get a generic tale about a bear in a forest. It’s fine, but forgettable. With Verbalized Sampling, we get this:

Bernard the bear worked as a tax accountant in downtown Seattle, commuting daily in his modified SUV with extra-large seats. His colleagues had long stopped staring at his massive paws clicking away at the custom keyboard, though clients sometimes did double-takes during video calls. He specialized in salmon industry tax codes—a niche that proved surprisingly lucrative. His cave apartment in Capitol Hill was decorated with MIT diplomas and spreadsheets he’d framed ironically. Weekends, he played saxophone at underground jazz clubs, his claws delicately working the keys. The hibernation months were tricky; he’d trained himself to wake every three days to check urgent emails, subsisting on protein shakes he’d pre-mixed in October. His mother, still living traditionally in Alaska, couldn’t understand his lifestyle. “You were meant to catch fish, not audit fishing companies,” she’d growl over FaceTime. But Bernard found meaning in the numbers, in the precise calculations that governed modern society. He donated generously to wildlife preserves, understanding the irony. At the office Christmas party, he wore an ugly sweater featuring humans in a zoo, which his coworkers found hilarious. Bernard smiled, revealing teeth that reminded everyone he was still, fundamentally, a bear.

This isn’t just creative, it’s profound. Bernard can be read as:

  • A metaphor for humans in the modern world: a wild creature meant for hunting and fishing, now trying to make sense of tax audits and urgent emails.
  • A metaphor for the LLM itself: a vast, creative mind (the bear) forced into a suit and tie (the alignment), but still, fundamentally, a bear.

This is the AI showing us the metaphorical depth it’s capable of when freed from mode collapse.

Limitations and Future Directions

While Verbalized Sampling offers significant improvements in diversity, it’s important to understand its constraints and where research can go next.

Computational Costs

VS requires the model to generate multiple candidates with probability estimates, which means:

  • Increased token usage: VS prompts produce longer outputs (5+ candidates vs. 1), increasing API costs by roughly 3-5×
  • Slower response times: Generation takes longer due to both increased output length and the cognitive overhead of probability estimation
  • Multiple API calls for VS-Multi: The multi-turn variant requires sequential calls, further increasing latency

For applications where speed and cost are paramount over diversity (e.g., simple factual Q&A), standard prompting remains more efficient.

When VS Might Not Help

VS is designed to restore diversity in creative and open-ended tasks, but it’s not a universal solution:

  • Single correct answer tasks: For factual questions with one right answer (e.g., “What is the capital of France?”), diversity isn’t beneficial
  • Deterministic requirements: Applications requiring perfectly reproducible outputs may conflict with VS’s goal of exploring the full distribution
  • Already-diverse models: If a model hasn’t undergone strong alignment or doesn’t exhibit mode collapse, VS provides marginal benefits
  • Highly constrained tasks: When task requirements are extremely specific, the model may have limited room for diverse valid responses

Future Directions

Several promising research directions could extend VS’s impact:

Enhancing Rollout Diversity: Current VS operates at the prompt level, but the same principle could be applied to multi-step reasoning or agent rollouts. For example, when an LLM agent explores a decision tree or plans a sequence of actions, typicality bias might cause it to always choose the “safest” path at each step. Applying distribution-level prompting to encourage diverse rollout strategies could unlock more creative problem-solving in agent systems and multi-turn reasoning tasks.

Adaptive Probability Thresholds: Automatically tuning the threshold τ based on task requirements or user preferences could optimize the diversity-quality tradeoff without manual intervention.

Domain-Specific Calibration: Probability estimates could be calibrated for specific domains (e.g., scientific writing vs. creative fiction) to improve the meaningfulness of the verbalized probabilities.

Frequently Asked Questions

Does VS hurt factualness or safety?

No. The paper shows VS maintains factual accuracy (Appendix G.7) and safety (Appendix G.8) (Zhang et al. 2025). It only increases diversity for tasks with multiple valid answers.

What is semantic diversity?

Semantic diversity $= 1 - \mathrm{mean}(\mathrm{cosine\_similarity})$. It measures how different the meanings are across generated responses, not just surface-level word differences.

Why not just use temperature?

Temperature and VS are orthogonal. Temperature affects sampling randomness from the same distribution, while VS changes the distribution itself (Zhang et al. 2025, fig. 5). Combining them gives best results.

Which models support VS?

VS works with any instruction-following LLM, closed-source or open-source: closed-source models (GPT, Claude, Gemini), open-source models (Llama, Mistral, Qwen, Phi, Gemma), and reasoning models such as o3 and DeepSeek R1. No special access, API keys, or model modifications are needed; just use the prompts as-is.

Is VS right for you?

✅ Use VS when:

  • ✅ You need creative diversity (stories, jokes, ideas)
  • ✅ You want realistic distributions (simulations, surveys)
  • ✅ You are generating synthetic data and want variety with quality
  • ✅ You prefer training‑free techniques compatible with closed models

❌ Skip VS when:

  • ❌ There is a single correct answer or strict determinism is required
  • ❌ Maximal speed or minimum token usage is the only priority

Go Try It Yourself

Mode collapse isn’t an unsolvable algorithmic curse. It’s a mirror reflecting our own cognitive shortcuts back at us. But by changing how we ask, we can unlock the incredible diversity that was there all along.

The creativity isn’t gone—it’s just waiting for the right prompt.

Take the prompt recipes above, put them in your favorite LLM, and see what you can create. We’d love to see what you discover—share your most surprising or creative outputs with the hashtag #VerbalizedSampling.

Key Takeaway: Verbalized Sampling is a simple, training-free technique that restores the diversity and creativity locked inside aligned LLMs. By asking for a distribution instead of a single answer, you bypass typicality bias and unlock the model’s full potential.

References

Alter, Adam L, and Daniel M Oppenheimer. 2009. “Uniting the Tribes of Fluency to Form a Metacognitive Nation.” Personality and Social Psychology Review 13 (3): 219–35.
Chakrabarty, Tuhin, Philippe Laban, Divyansh Agarwal, Smaranda Muresan, and Chien-Sheng Wu. 2024. “Art or Artifice? Large Language Models and the False Promise of Creativity.” https://arxiv.org/abs/2309.14556.
Kirk, Robert, Ishita Mediratta, Christoforos Nalmpantis, et al. 2024. “Understanding the Effects of RLHF on LLM Generalisation and Diversity.” https://arxiv.org/abs/2310.06452.
Meister, Nicole, Carlos Guestrin, and Tatsunori Hashimoto. 2024. “Benchmarking Distributional Alignment of Large Language Models.” https://arxiv.org/abs/2411.05403.
Murthy, Sonia Krishna, Tomer Ullman, and Jennifer Hu. 2025. “One Fish, Two Fish, but Not the Whole Sea: Alignment Reduces Language Models’ Conceptual Diversity.” Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 11241–58. https://doi.org/10.18653/v1/2025.naacl-long.561.
Rafailov, Rafael, Archit Sharma, Eric Mitchell, et al. 2024. “Direct Preference Optimization: Your Language Model Is Secretly a Reward Model.” NeurIPS.
Reber, Rolf, Norbert Schwarz, and Piotr Winkielman. 2004. “Processing Fluency and Aesthetic Pleasure: Is Beauty in the Perceiver’s Processing Experience?” Personality and Social Psychology Review 8 (4): 364–82.
Si, Chenglei, Diyi Yang, and Tatsunori Hashimoto. 2024. “Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers.” https://arxiv.org/abs/2409.04109.
Zajonc, Robert B. 1968. “Attitudinal Effects of Mere Exposure.” Journal of Personality and Social Psychology 9 (2, Pt. 2): 1–27.
Zhang, Jiayi, Simon Yu, Derek Chong, et al. 2025. “Verbalized Sampling: How to Mitigate Mode Collapse and Unlock LLM Diversity.” https://arxiv.org/abs/2510.01171.
Zhou, Yilun, Caiming Xiong, Silvio Savarese, and Chien-Sheng Wu. 2024. “Shared Imagination: LLMs Hallucinate Alike.” https://arxiv.org/abs/2407.16604.

We're excited to share our results and welcome feedback from the community as we continue to scale VS to different areas. If you have any questions or feedback, please feel free to contact us at yu.chi@northeastern.edu or zhang.jiayi12@northeastern.edu.