What is a Token?
TL;DR
A token is the smallest unit of text that a Large Language Model (LLM) reads and produces. Tokens are not always whole words. Depending on the tokenizer, a token can be a complete word, a subword fragment, a single character, or even an individual byte. In the LM-Kit.NET SDK, tokens are central to every stage of text processing, from input parsing to output generation. LM-Kit includes internal token healing to reconstruct split tokens and improve accuracy. Because LM-Kit performs edge inference, there is no token-based cost. However, understanding token limits and the Vocabulary class is still essential for optimizing performance.
Token
Definition
In the context of Large Language Models (LLMs) and the LM-Kit.NET SDK, a token is a unit of text that the model processes. A token can represent a full word, a subword piece, a single character, or even a raw byte, depending on the tokenizer the model uses.
It is a common misconception that tokens map one-to-one with words. In practice, frequent words like "the" are often a single token, while less common words are split into multiple subword pieces. Punctuation, whitespace, and special characters each consume their own tokens as well.
Visual Example
Here is how a typical tokenizer might split text into tokens:
"The quick brown fox" → ["The", " quick", " brown", " fox"] (4 tokens)
"unhappiness" → ["un", "happiness"] (2 tokens)
"LM-Kit.NET" → ["LM", "-", "Kit", ".", "NET"] (5 tokens)
Notice that spaces are often attached to the beginning of a token rather than standing alone. This is a characteristic of subword tokenizers such as Byte Pair Encoding (BPE) and SentencePiece.
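To make the splitting intuition concrete, here is a toy greedy longest-match segmenter over a hand-picked vocabulary. This is an illustration only: real tokenizers such as BPE or SentencePiece use learned merge rules and vocabularies of tens of thousands of entries, and the vocabulary below is invented for the example.

```csharp
using System;
using System.Collections.Generic;

class SubwordDemo
{
    // Toy vocabulary; real tokenizers learn tens of thousands of entries.
    static readonly HashSet<string> Vocab = new HashSet<string>
    {
        "un", "happiness", "the", "The", " quick", " brown", " fox"
    };

    // Greedy longest-match segmentation: repeatedly take the longest
    // vocabulary entry that prefixes the remaining text.
    static List<string> Tokenize(string text)
    {
        var tokens = new List<string>();
        int pos = 0;
        while (pos < text.Length)
        {
            int best = 0;
            for (int len = text.Length - pos; len >= 1; len--)
            {
                if (Vocab.Contains(text.Substring(pos, len))) { best = len; break; }
            }
            if (best == 0) best = 1; // fall back to a single character
            tokens.Add(text.Substring(pos, best));
            pos += best;
        }
        return tokens;
    }

    static void Main()
    {
        Console.WriteLine(string.Join(" | ", Tokenize("unhappiness")));        // un | happiness
        Console.WriteLine(string.Join(" | ", Tokenize("The quick brown fox"))); // The |  quick |  brown |  fox
    }
}
```

Note how the leading space travels with " quick", " brown", and " fox", matching the behavior described above.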
Vocabulary Size
Modern LLMs typically have vocabularies ranging from 32,000 to 256,000 tokens. A larger vocabulary means more words can be represented as single tokens, which reduces the total token count for a given text. A smaller vocabulary saves memory but may require more tokens to encode the same input.
Tokens are the fundamental units that the model uses to process input and generate text. Text is first broken down into tokens through tokenization before it is fed to the model. Similarly, the model generates tokens one by one during text generation.
The Role of Tokens in LLMs
Model Input and Processing
Tokens are the main units the LLM understands. Instead of processing raw text directly, the model works with tokens, making language processing more efficient and structured.
Text Generation
When an LLM generates text, it predicts the next token based on the previously generated tokens, allowing for fluid language generation.
Token Limits and Context Windows
Each model has a token limit, also called a context window, that specifies the maximum number of tokens it can process at once. Exceeding this limit can result in truncation, which may affect the model's ability to generate or understand long texts.
Handling Complex Language
Some words or languages may result in multiple tokens for a single word. Subword tokenization allows models to process complex or fragmented inputs effectively, even when a word has never appeared in the training data.
Practical Application in LM-Kit.NET SDK
The LM-Kit.NET SDK uses tokens as the primary means of breaking down text for inference. Since it operates on edge inference, there is no token-based cost involved, unlike many cloud-based services that charge based on token usage. However, developers still need to manage tokens effectively for optimal performance.
Token Limits
Each model has a limit on the number of tokens it can process (e.g., 4,096 or 131,072 tokens). Input text exceeding this limit will be truncated. Managing context length is important for avoiding performance issues.
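One common way to stay within the limit is to count the prompt's tokens and drop the oldest ones before inference so that recent context survives. A minimal sketch, assuming the prompt has already been converted to token IDs by a tokenizer; the `FitToContext` helper is hypothetical, not an LM-Kit API:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class ContextWindowDemo
{
    // Keep a prompt within the model's context window by dropping the
    // oldest tokens; `tokenIds` stands in for real tokenizer output.
    static IReadOnlyList<int> FitToContext(IReadOnlyList<int> tokenIds,
                                           int contextWindow,
                                           int reservedForOutput)
    {
        int budget = contextWindow - reservedForOutput;
        if (tokenIds.Count <= budget) return tokenIds;
        // Drop from the front so the most recent context survives.
        return tokenIds.Skip(tokenIds.Count - budget).ToList();
    }

    static void Main()
    {
        var prompt = Enumerable.Range(0, 5000).ToList(); // pretend 5,000-token prompt
        var fitted = FitToContext(prompt, contextWindow: 4096, reservedForOutput: 512);
        Console.WriteLine(fitted.Count); // 3584
    }
}
```

Reserving part of the window for the model's output is a deliberate choice here: generation consumes tokens from the same budget as the prompt.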
Internal Token Healing
LM-Kit.NET SDK includes an internal token healing feature, which addresses the issue of token splits. When a word is divided into multiple tokens (for example, due to rare or complex words), the healing mechanism intelligently reconstructs these tokens to improve the model's comprehension and generate more accurate text.
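The general idea behind token healing can be sketched as follows: back up over the last prompt token, then constrain the next prediction to vocabulary entries whose text extends the removed fragment. This is a conceptual illustration of the technique, not LM-Kit's internal implementation, and the toy vocabulary is invented:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class TokenHealingDemo
{
    // Toy vocabulary: token id -> surface text.
    static readonly Dictionary<int, string> Vocab = new Dictionary<int, string>
    {
        { 0, "http" }, { 1, "://" }, { 2, ":" }, { 3, "//" }, { 4, "example" }
    };

    // Token healing, sketched: remove the last prompt token, then restrict
    // the next prediction to tokens whose text starts with the removed
    // fragment, so a split like "http" + ":" can heal into "http" + "://".
    static IEnumerable<int> CandidatesAfterHealing(List<int> promptTokens)
    {
        string lastText = Vocab[promptTokens[promptTokens.Count - 1]];
        promptTokens.RemoveAt(promptTokens.Count - 1); // back up one token
        return Vocab.Where(kv => kv.Value.StartsWith(lastText))
                    .Select(kv => kv.Key);
    }

    static void Main()
    {
        // Prompt tokenized as ["http", ":"] — the ":" is a split-off fragment.
        var prompt = new List<int> { 0, 2 };
        var candidates = CandidatesAfterHealing(prompt);
        Console.WriteLine(string.Join(", ", candidates)); // 1, 2 (tokens "://" and ":")
    }
}
```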
Vocabulary Management with the Vocabulary Class
LM-Kit.NET SDK offers a fully featured Vocabulary class (LMKit.Tokenization.Vocabulary), designed to manage the model's vocabulary efficiently. The Vocabulary class provides developers with tools to interact with and manage the vocabulary used by the LLM, allowing for:
- Token Lookups: Convert words or subwords into their corresponding tokens and vice versa.
- Custom Vocabulary Handling: Add or modify the existing vocabulary to suit domain-specific language or jargon.
- Token Count: Retrieve the number of tokens available in the model's vocabulary.
The Vocabulary class is essential for optimizing how the LLM interacts with its vocabulary and ensuring proper handling of tokenization processes. Developers can use this class to streamline vocabulary management and improve tokenization performance, especially when working with specialized language.
Code Example
using System;
using LMKit.Model;

// Load a model by its identifier
var model = LM.LoadFromModelID("gemma3:12b");

// Access the model's vocabulary and report its size
var vocabulary = model.Vocabulary;
Console.WriteLine($"Vocabulary size: {vocabulary.Count} tokens");
Key Terms
Tokenizer: A process or tool that converts text into tokens. Different models use different tokenizers (BPE, SentencePiece, WordPiece) to divide text into tokens.
Context Window: The maximum number of tokens a model can handle in a single inference pass. Models have a fixed token limit; if exceeded, the input is truncated and output may be incomplete.
Subword Tokenization: A technique where words are split into smaller sub-units called subwords. For example, "unhappiness" might be tokenized into "un" and "happiness." This allows the model to handle words it has never seen before by composing them from known pieces.
Vocabulary: The complete set of all tokens that a model recognizes and uses during tokenization and inference. Vocabulary sizes for modern models typically range from 32K to 256K tokens.
Byte Pair Encoding (BPE): A widely used tokenization algorithm that starts with individual characters and iteratively merges the most frequent adjacent pairs into new tokens. BPE strikes a balance between representing common words as single tokens and breaking rare words into multiple pieces.
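To illustrate, here is a single BPE training step in miniature: count adjacent symbol pairs across a toy corpus and merge the most frequent pair into a new symbol. This sketch is for intuition only; production BPE implementations add word frequencies, end-of-word markers, and thousands of merge rounds.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class BpeDemo
{
    // One BPE training step: find the most frequent adjacent symbol pair
    // in the corpus and merge it everywhere into a new symbol.
    static List<List<string>> MergeStep(List<List<string>> corpus, out string merged)
    {
        var pairCounts = new Dictionary<(string, string), int>();
        foreach (var word in corpus)
            for (int i = 0; i < word.Count - 1; i++)
            {
                var pair = (word[i], word[i + 1]);
                pairCounts[pair] = pairCounts.GetValueOrDefault(pair) + 1;
            }

        var best = pairCounts.OrderByDescending(kv => kv.Value).First().Key;
        merged = best.Item1 + best.Item2;

        var result = new List<List<string>>();
        foreach (var word in corpus)
        {
            var newWord = new List<string>();
            for (int i = 0; i < word.Count; i++)
            {
                if (i < word.Count - 1 && word[i] == best.Item1 && word[i + 1] == best.Item2)
                {
                    newWord.Add(merged);
                    i++; // skip the merged partner
                }
                else newWord.Add(word[i]);
            }
            result.Add(newWord);
        }
        return result;
    }

    static void Main()
    {
        // Start from individual characters, exactly as BPE does.
        var corpus = new[] { "low", "lower", "lowest" }
            .Select(w => w.Select(c => c.ToString()).ToList()).ToList();

        for (int step = 0; step < 2; step++)
        {
            corpus = MergeStep(corpus, out var merged);
            Console.WriteLine($"merged: {merged}"); // frequent pairs like "l"+"o" fuse first
        }
    }
}
```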
SentencePiece: A language-independent tokenization library that treats the input as a raw text stream, with whitespace handled as an ordinary symbol, making it effective for multilingual models. Many modern LLMs, including Gemma and LLaMA, use SentencePiece-based tokenizers.
Summary
Tokens are the fundamental units of text that the LM-Kit.NET SDK processes to understand and generate language. A token is not always a word; it can be a subword, a character, or a byte, depending on the tokenizer. The SDK's internal token healing feature improves model accuracy by reconstructing split tokens. The Vocabulary class offers powerful tools for managing a model's vocabulary, allowing developers to customize token handling and optimize performance. Because LM-Kit runs on edge inference, there is no token-based cost. Understanding tokens and their management is still crucial for maximizing model efficiency, especially in scenarios with strict context windows or specialized vocabularies.
External Resources
- Neural Machine Translation of Rare Words with Subword Units (BPE paper) by Sennrich, Haddow, and Birch
- SentencePiece: A simple and language independent subword tokenizer on GitHub
- Hugging Face Tokenizers library for fast, production-ready tokenization
- The Illustrated Word2Vec by Jay Alammar, which covers how tokens become vector representations