🎯 What is Sampling?


πŸ“„ TL;DR

Sampling is the final step in the inference process for Large Language Models (LLMs), where the model selects the next token based on probability distributions. LM-Kit.NET offers a variety of sampling strategies, such as RandomSampling, Top-K, Top-P, and Mirostat, that can be dynamically arranged through the SamplersSequence feature. Additionally, LM-Kit includes unique internal sampling strategies and sampler refiners that enhance accuracy and speed for popular use cases like classification and information extraction.


πŸ“š Sampling

Definition:
In Large Language Models (LLMs), sampling refers to the process of selecting the next word or token during text generation based on probability distributions. Sampling is the final step in the inference process, where the model decides on the next token by weighing the likelihood of each candidate. Most sampling strategies introduce an element of randomness, helping the model generate more varied and creative outputs.
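
A minimal, SDK-agnostic illustration of this distinction in Python: the same probability distribution yields a fixed choice under deterministic selection but varied choices under sampling. The tokens and probabilities below are made up for the example.

```python
import numpy as np

# Toy probability distribution over four candidate tokens.
tokens = ["cat", "dog", "bird", "fish"]
probs = np.array([0.50, 0.30, 0.15, 0.05])

# Deterministic selection always picks the most probable token.
print(tokens[np.argmax(probs)])                  # "cat", every time

# Sampling draws from the distribution, so lower-probability tokens
# can still appear; this is what produces varied, creative outputs.
rng = np.random.default_rng()
print(tokens[rng.choice(len(tokens), p=probs)])  # usually "cat", sometimes not
```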


πŸ” The Role of Sampling in LLMs:

  1. Final Step in the Inference Process:
    Sampling occurs at the end of the inference process, after the model has processed the input and computed a probability distribution over the possible next tokens. At this stage, the model selects a token according to the chosen sampling strategy, which can range from fully deterministic (choosing the most probable token) to more creative and varied (introducing randomness).

  2. Generating Varied and Creative Outputs:
    Sampling allows LLMs to generate diverse outputs by selecting tokens based on their probability distributions. This is especially important in scenarios requiring creativity, such as writing, storytelling, or engaging conversational dialogue, where the goal is to avoid overly deterministic or repetitive responses.

  3. Controlling Randomness and Predictability:
    By adjusting parameters like temperature or choosing different sampling strategies, developers can control the level of randomness in the model’s outputs. Lower temperatures lead to more predictable, deterministic outputs, while higher temperatures introduce greater variability, encouraging the model to generate more creative responses.

  4. Balancing Accuracy and Diversity:
    Sampling strategies like Top-K and Top-P help balance accuracy and diversity in the generated text. Top-K limits the model to selecting from the K most probable tokens, which increases focus, while Top-P restricts selection to the smallest set of tokens whose cumulative probability reaches the threshold P, fostering diversity without compromising coherence (see the sketch after this list).

  5. Avoiding Repetition and Incoherence:
    Advanced sampling strategies, such as Mirostat, are designed to prevent common pitfalls like repetitive ("boredom traps") or incoherent ("confusion traps") outputs. These methods maintain the quality of generated text by controlling perplexity and dynamically adjusting the sampling process to ensure both creativity and coherence.
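
To make items 3 and 4 concrete, the sketch below shows, in Python/NumPy, how temperature, Top-K, and Top-P compose into a single token-selection step. It illustrates the general technique only; the exact order of operations, the defaults, and the function name here are assumptions for the example, not LM-Kit.NET internals.

```python
import numpy as np

def sample_next_token(logits, temperature=0.8, top_k=40, top_p=0.95, rng=None):
    """Temperature scaling, then Top-K and Top-P filtering, then sampling."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64)

    # Temperature: divide logits before softmax. Values below 1 sharpen the
    # distribution (more deterministic); values above 1 flatten it (more random).
    logits = logits / max(temperature, 1e-8)

    # Softmax turns logits into probabilities.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    # Top-K: keep only the K most probable tokens.
    keep = np.argsort(probs)[::-1][:top_k]

    # Top-P (nucleus): of those, retain the smallest prefix whose
    # cumulative probability reaches top_p.
    cutoff = np.searchsorted(np.cumsum(probs[keep]), top_p) + 1
    keep = keep[:cutoff]

    # Renormalize over the survivors and sample.
    return int(rng.choice(keep, p=probs[keep] / probs[keep].sum()))

# Example: pick a token from a made-up 10-token vocabulary.
print(sample_next_token(np.random.default_rng(0).normal(size=10)))
```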


βš™οΈ Practical Application in LM-Kit.NET SDK:

In LM-Kit.NET, developers can leverage multiple sampling strategies to control the final step of the inference process and tailor text generation to their needs. Additionally, LM-Kit.NET's internal sampling strategies and sampler refiners are designed to boost both speed and accuracy for tasks like classification and extraction, where precision and performance are key.

  1. RandomSampling:
    The RandomSampling class allows developers to apply temperature-based sampling, adjusting the randomness of token selection. Key properties like Temperature, Top-K, and Top-P offer control over how diverse or deterministic the model’s output is.

    • Temperature: Controls the level of randomness. Lower temperatures make the model more predictable, while higher values introduce greater creativity.
    • Top-K: Limits token selection to the top K most probable tokens.
    • Top-P: Selects tokens based on cumulative probability, balancing diversity and coherence.

  2. SamplersSequence and RandomSamplers:
    SamplersSequence allows developers to dynamically arrange different sampling strategies in sequence. The RandomSamplers enumeration includes:

    • Top-K: Selects from the top K most probable tokens, ensuring focus on likely candidates.
    • TailFree: Reduces the influence of low-probability tokens, improving coherence.
    • LocallyTypical: Focuses on typical selections within a local context.
    • Top-P: Chooses tokens based on cumulative probability, enhancing output diversity.
    • MinP: Filters out tokens whose probability falls below a threshold relative to the most probable token.
    • Temperature: Adjusts randomness by scaling the probability distribution.

  3. GreedyDecoding:
    GreedyDecoding is a fully deterministic strategy where the model selects the token with the highest probability. While this approach guarantees predictable outputs, it can result in repetitive or uncreative responses in some cases.

  4. MirostatSampling:
    Mirostat is an advanced neural text decoding algorithm designed to maintain a balance between coherence and creativity by directly controlling perplexity. It dynamically adjusts sampling throughout the generation process to avoid repetition ("boredom traps") or incoherence ("confusion traps").

  5. RepetitionPenalty:
    RepetitionPenalty helps prevent the model from generating repetitive text by applying penalties to tokens that have already been used, encouraging more varied and engaging outputs.

  6. LogitBias:
    The LogitBias class allows developers to modify the likelihood of specific tokens being selected, enabling them to guide the model toward desired outcomes or away from undesirable ones.
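
The sketch below ties items 3, 5, and 6 together by showing, in Python, the arithmetic that a repetition penalty and a logit bias apply to raw logits before greedy or random selection takes over. It is a conceptual illustration using one common convention (dividing positive logits and multiplying negative ones by the penalty), not LM-Kit.NET's actual implementation, and all names and values are invented for the example.

```python
import numpy as np

def adjust_logits(logits, generated_ids, repetition_penalty=1.1, logit_bias=None):
    """Apply a repetition penalty and per-token biases to raw logits."""
    logits = np.asarray(logits, dtype=np.float64).copy()

    # Repetition penalty: make already-generated tokens less attractive.
    for token_id in set(generated_ids):
        if logits[token_id] > 0:
            logits[token_id] /= repetition_penalty
        else:
            logits[token_id] *= repetition_penalty

    # Logit bias: additively promote or suppress specific tokens.
    for token_id, bias in (logit_bias or {}).items():
        logits[token_id] += bias

    return logits

rng = np.random.default_rng(0)
raw = rng.normal(size=8)
adjusted = adjust_logits(raw, generated_ids=[2, 5],
                         logit_bias={3: +4.0, 7: -100.0})

# Greedy decoding: fully deterministic, always take the argmax.
print("greedy pick:", int(np.argmax(adjusted)))

# Random sampling: softmax then draw, so the adjustments shift the odds.
probs = np.exp(adjusted - adjusted.max())
probs /= probs.sum()
print("sampled pick:", int(rng.choice(len(probs), p=probs)))
```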


πŸ”‘ Key Classes in LM-Kit.NET Sampling:

  • RandomSampling: Handles temperature-based sampling, offering control over randomness with parameters like Top-K and Top-P to balance diversity against predictability.

  • GreedyDecoding: Implements a fully deterministic strategy where the model selects the most probable token at each step, resulting in consistent but potentially repetitive outputs.

  • MirostatSampling: A dynamic algorithm that controls perplexity during text generation, maintaining a balance between coherence and creativity throughout the output (a conceptual sketch of the feedback loop follows this list).

  • RepetitionPenalty: Prevents repetitive text by penalizing tokens that appear frequently or have already been used, ensuring more varied and engaging outputs.

  • LogitBias: Allows developers to modify token probabilities to favor or avoid specific words during the inference process.

  • SamplersSequence: Dynamically arranges various sampling strategies (e.g., Top-K, TailFree, Top-P) to provide more control over the text generation process.
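
As referenced above, here is a sketch of a Mirostat-v2-style feedback loop in Python, following the published algorithm in spirit; LM-Kit.NET's MirostatSampling class may differ in parameters and implementation details, and the distributions below are made up for the demonstration.

```python
import numpy as np

def mirostat_v2_step(probs, mu, tau=5.0, eta=0.1, rng=None):
    """One decoding step: tau is the target surprise in bits, mu is the
    running cutoff (conventionally initialized to 2 * tau), and eta is
    the learning rate that drives the feedback loop."""
    rng = rng or np.random.default_rng()
    probs = np.asarray(probs, dtype=np.float64)

    # Surprise (negative log-probability, in bits) of each candidate.
    surprise = -np.log2(np.maximum(probs, 1e-12))

    # Drop tokens more surprising than the cutoff, keeping at least one.
    keep = np.flatnonzero(surprise <= mu)
    if keep.size == 0:
        keep = np.array([int(np.argmax(probs))])

    # Renormalize over the survivors and sample.
    token = int(rng.choice(keep, p=probs[keep] / probs[keep].sum()))

    # Feedback: nudge mu so observed surprise tracks the target tau.
    # Too-predictable picks raise mu (escaping "boredom traps");
    # too-surprising picks lower it (escaping "confusion traps").
    mu -= eta * (surprise[token] - tau)
    return token, mu

# Usage with made-up distributions over a 16-token vocabulary.
rng = np.random.default_rng(0)
mu = 2 * 5.0
for _ in range(3):
    probs = rng.dirichlet(np.ones(16))
    token, mu = mirostat_v2_step(probs, mu, rng=rng)
    print(token, round(mu, 3))
```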


πŸ“– Common Terms:

  • Temperature: A parameter that controls the level of randomness in token selection. Lower temperatures result in more predictable outputs, while higher temperatures increase creativity by broadening token selection.

  • Top-K Sampling: Limits token selection to the top K most probable tokens, ensuring focused and coherent outputs.

  • Top-P Sampling (Nucleus Sampling): A method that selects tokens based on cumulative probability, balancing diversity and coherence by including a dynamic set of token choices.

  • Greedy Decoding: A deterministic strategy that always selects the highest-probability token, resulting in predictable but potentially repetitive responses.

  • SamplersSequence: A feature that allows developers to arrange multiple sampling strategies in sequence, providing flexible control over the generation process.

  • Repetition Penalty: A mechanism used to penalize tokens that have already been generated, reducing repetition and ensuring more diverse outputs.

  • Inference: The process by which an LLM generates text or other outputs based on input and learned patterns. Sampling is the last step in inference, determining how the model selects the next token.

  • Token: A unit of text, such as a word or part of a word, that the model processes during text generation. Tokens are selected based on probability distributions during sampling.

  • Perplexity: A measure of how well the model predicts a sequence of tokens. Lower perplexity indicates more confident predictions, while higher perplexity suggests more variability or uncertainty. Strategies such as Mirostat keep generation coherent by controlling perplexity directly (see the short example below).
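
For reference, perplexity is the exponential of the average negative log-likelihood of the tokens in a sequence; the probabilities below are made up for the arithmetic.

```python
import math

# PPL = exp(-(1/N) * sum(log p(token_i))): lower values mean the model
# was more confident about the tokens it produced.
token_probs = [0.9, 0.6, 0.8, 0.7]  # model probability of each generated token
avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
print(round(math.exp(avg_nll), 3))  # 1.349
```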


πŸ“ Summary:

Sampling is the final step in the inference process for Large Language Models (LLMs), where the model selects the next token based on probability distributions. In LM-Kit.NET, developers can choose from various sampling strategies, such as RandomSampling, Top-K, Top-P, and Mirostat, to control the balance between deterministic and creative outputs. LM-Kit also features unique internal sampling strategies and sampler refiners that optimize both accuracy and speed for popular tasks like classification and extraction. The SamplersSequence feature further enhances flexibility by allowing dynamic arrangement of multiple sampling strategies, making LM-Kit.NET highly adaptable to diverse use cases.