What is a Token?
TL;DR
A token is the smallest unit of text that a Large Language Model (LLM) reads and produces. Tokens are not always whole words. Depending on the tokenizer, a token can be a complete word, a subword fragment, a single character, or even an individual byte. In the LM-Kit.NET SDK, tokens are central to every stage of text processing, from input parsing to output generation. LM-Kit includes internal token healing to reconstruct split tokens and improve accuracy. Because LM-Kit performs edge inference, there is no token-based cost. However, understanding token limits and the Vocabulary class is still essential for optimizing performance.
Token
Definition
In the context of Large Language Models (LLMs) and the LM-Kit.NET SDK, a token is a unit of text that the model processes. A token can represent a full word, a subword piece, a single character, or even a raw byte, depending on the tokenizer the model uses.
It is a common misconception that tokens map one-to-one with words. In practice, frequent words like "the" are often a single token, while less common words are split into multiple subword pieces. Punctuation, whitespace, and special characters each consume their own tokens as well.
Visual Example
Here is how a typical tokenizer might split text into tokens:
"The quick brown fox" → ["The", " quick", " brown", " fox"] (4 tokens)
"unhappiness" → ["un", "happiness"] (2 tokens)
"LM-Kit.NET" → ["LM", "-", "Kit", ".", "NET"] (5 tokens)
Notice that spaces are often attached to the beginning of a token rather than standing alone. This is a characteristic of subword tokenizers such as Byte Pair Encoding (BPE) and SentencePiece.
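To make the splitting intuition concrete, here is a toy greedy longest-match segmenter over a hand-picked vocabulary. This is an illustration only: real tokenizers such as BPE or SentencePiece use learned merge rules and vocabularies of tens of thousands of entries, and the vocabulary below is invented for the example.

```csharp
using System;
using System.Collections.Generic;

class SubwordDemo
{
    // Toy vocabulary; real tokenizers learn tens of thousands of entries.
    static readonly HashSet<string> Vocab = new HashSet<string>
    {
        "un", "happiness", "the", "The", " quick", " brown", " fox"
    };

    // Greedy longest-match segmentation: repeatedly take the longest
    // vocabulary entry that prefixes the remaining text.
    static List<string> Tokenize(string text)
    {
        var tokens = new List<string>();
        int pos = 0;
        while (pos < text.Length)
        {
            int best = 0;
            for (int len = text.Length - pos; len >= 1; len--)
            {
                if (Vocab.Contains(text.Substring(pos, len))) { best = len; break; }
            }
            if (best == 0) best = 1; // fall back to a single character
            tokens.Add(text.Substring(pos, best));
            pos += best;
        }
        return tokens;
    }

    static void Main()
    {
        Console.WriteLine(string.Join(" | ", Tokenize("unhappiness")));        // un | happiness
        Console.WriteLine(string.Join(" | ", Tokenize("The quick brown fox"))); // The |  quick |  brown |  fox
    }
}
```

Note how the leading space travels with " quick", " brown", and " fox", matching the behavior described above.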
Vocabulary Size
Modern LLMs typically have vocabularies ranging from 32,000 to 256,000 tokens. A larger vocabulary means more words can be represented as single tokens, which reduces the total token count for a given text. A smaller vocabulary saves memory but may require more tokens to encode the same input.
Tokens are the fundamental units that the model uses to process input and generate text. Text is first broken down into tokens through tokenization before it is fed to the model. Similarly, the model generates tokens one by one during text generation.
The Role of Tokens in LLMs
Model Input and Processing
Tokens are the main units the LLM understands. Instead of processing raw text directly, the model works with tokens, making language processing more efficient and structured.
Text Generation
When an LLM generates text, it predicts the next token based on the previously generated tokens, allowing for fluid language generation.
Token Limits and Context Windows
Each model has a token limit, also called a context window, that specifies the maximum number of tokens it can process at once. Exceeding this limit can result in truncation, which may affect the model's ability to generate or understand long texts.
Handling Complex Language
Some words or languages may result in multiple tokens for a single word. Subword tokenization allows models to process complex or fragmented inputs effectively, even when a word has never appeared in the training data.
Practical Application in LM-Kit.NET SDK
The LM-Kit.NET SDK uses tokens as the primary means of breaking down text for inference. Since it operates on edge inference, there is no token-based cost involved, unlike many cloud-based services that charge based on token usage. However, developers still need to manage tokens effectively for optimal performance.
Token Limits
Each model has a limit on the number of tokens it can process (e.g., 4,096 or 131,072 tokens). Input text exceeding this limit will be truncated. Managing context length is important for avoiding performance issues.
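One common way to stay within the limit is to count the prompt's tokens and drop the oldest ones before inference so that recent context survives. A minimal sketch, assuming the prompt has already been converted to token IDs by a tokenizer; the `FitToContext` helper is hypothetical, not an LM-Kit API:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class ContextWindowDemo
{
    // Keep a prompt within the model's context window by dropping the
    // oldest tokens; `tokenIds` stands in for real tokenizer output.
    static IReadOnlyList<int> FitToContext(IReadOnlyList<int> tokenIds,
                                           int contextWindow,
                                           int reservedForOutput)
    {
        int budget = contextWindow - reservedForOutput;
        if (tokenIds.Count <= budget) return tokenIds;
        // Drop from the front so the most recent context survives.
        return tokenIds.Skip(tokenIds.Count - budget).ToList();
    }

    static void Main()
    {
        var prompt = Enumerable.Range(0, 5000).ToList(); // pretend 5,000-token prompt
        var fitted = FitToContext(prompt, contextWindow: 4096, reservedForOutput: 512);
        Console.WriteLine(fitted.Count); // 3584
    }
}
```

Reserving part of the window for the model's output is a deliberate choice here: generation consumes tokens from the same budget as the prompt.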
Internal Token Healing
LM-Kit.NET SDK includes an internal token healing feature, which addresses the issue of token splits. When a word is divided into multiple tokens (for example, due to rare or complex words), the healing mechanism intelligently reconstructs these tokens to improve the model's comprehension and generate more accurate text.
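The general idea behind token healing can be sketched as follows: back up over the last prompt token, then constrain the next prediction to vocabulary entries whose text extends the removed fragment. This is a conceptual illustration of the technique, not LM-Kit's internal implementation, and the toy vocabulary is invented:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class TokenHealingDemo
{
    // Toy vocabulary: token id -> surface text.
    static readonly Dictionary<int, string> Vocab = new Dictionary<int, string>
    {
        { 0, "http" }, { 1, "://" }, { 2, ":" }, { 3, "//" }, { 4, "example" }
    };

    // Token healing, sketched: remove the last prompt token, then restrict
    // the next prediction to tokens whose text starts with the removed
    // fragment, so a split like "http" + ":" can heal into "http" + "://".
    static IEnumerable<int> CandidatesAfterHealing(List<int> promptTokens)
    {
        string lastText = Vocab[promptTokens[promptTokens.Count - 1]];
        promptTokens.RemoveAt(promptTokens.Count - 1); // back up one token
        return Vocab.Where(kv => kv.Value.StartsWith(lastText))
                    .Select(kv => kv.Key);
    }

    static void Main()
    {
        // Prompt tokenized as ["http", ":"] — the ":" is a split-off fragment.
        var prompt = new List<int> { 0, 2 };
        var candidates = CandidatesAfterHealing(prompt);
        Console.WriteLine(string.Join(", ", candidates)); // 1, 2 (tokens "://" and ":")
    }
}
```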
Vocabulary Management with the Vocabulary Class
LM-Kit.NET SDK offers a fully featured Vocabulary class (LMKit.Tokenization.Vocabulary), designed to manage the model's vocabulary efficiently. The Vocabulary class provides developers with tools to interact with and manage the vocabulary used by the LLM, allowing for:
- Token Lookups: Convert words or subwords into their corresponding tokens and vice versa.
- Custom Vocabulary Handling: Add or modify the existing vocabulary to suit domain-specific language or jargon.
- Token Count: Retrieve the number of tokens available in the model's vocabulary.
The Vocabulary class is essential for optimizing how the LLM interacts with its vocabulary and ensuring proper handling of tokenization processes. Developers can use this class to streamline vocabulary management and improve tokenization performance, especially when working with specialized language.
Code Example
using System;
using LMKit.Model;

// Load a model by its identifier
var model = LM.LoadFromModelID("gemma3:12b");

// Access the model's vocabulary and report its size
var vocabulary = model.Vocabulary;
Console.WriteLine($"Vocabulary size: {vocabulary.Count} tokens");
Key Terms
Tokenizer: A process or tool that converts text into tokens. Different models use different tokenizers (BPE, SentencePiece, WordPiece) to divide text into tokens.
Context Window: The maximum number of tokens a model can handle in a single inference pass. Models have a fixed token limit; if exceeded, the input is truncated and output may be incomplete.
Subword Tokenization: A technique where words are split into smaller sub-units called subwords. For example, "unhappiness" might be tokenized into "un" and "happiness." This allows the model to handle words it has never seen before by composing them from known pieces.
Vocabulary: The complete set of all tokens that a model recognizes and uses during tokenization and inference. Vocabulary sizes for modern models typically range from 32K to 256K tokens.
Byte Pair Encoding (BPE): A widely used tokenization algorithm that starts with individual characters and iteratively merges the most frequent adjacent pairs into new tokens. BPE strikes a balance between representing common words as single tokens and breaking rare words into multiple pieces.
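To illustrate, here is a single BPE training step in miniature: count adjacent symbol pairs across a toy corpus and merge the most frequent pair into a new symbol. This sketch is for intuition only; production BPE implementations add word frequencies, end-of-word markers, and thousands of merge rounds.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class BpeDemo
{
    // One BPE training step: find the most frequent adjacent symbol pair
    // in the corpus and merge it everywhere into a new symbol.
    static List<List<string>> MergeStep(List<List<string>> corpus, out string merged)
    {
        var pairCounts = new Dictionary<(string, string), int>();
        foreach (var word in corpus)
            for (int i = 0; i < word.Count - 1; i++)
            {
                var pair = (word[i], word[i + 1]);
                pairCounts[pair] = pairCounts.GetValueOrDefault(pair) + 1;
            }

        var best = pairCounts.OrderByDescending(kv => kv.Value).First().Key;
        merged = best.Item1 + best.Item2;

        var result = new List<List<string>>();
        foreach (var word in corpus)
        {
            var newWord = new List<string>();
            for (int i = 0; i < word.Count; i++)
            {
                if (i < word.Count - 1 && word[i] == best.Item1 && word[i + 1] == best.Item2)
                {
                    newWord.Add(merged);
                    i++; // skip the merged partner
                }
                else newWord.Add(word[i]);
            }
            result.Add(newWord);
        }
        return result;
    }

    static void Main()
    {
        // Start from individual characters, exactly as BPE does.
        var corpus = new[] { "low", "lower", "lowest" }
            .Select(w => w.Select(c => c.ToString()).ToList()).ToList();

        for (int step = 0; step < 2; step++)
        {
            corpus = MergeStep(corpus, out var merged);
            Console.WriteLine($"merged: {merged}"); // frequent pairs like "l"+"o" fuse first
        }
    }
}
```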
SentencePiece: A language-independent tokenization library that treats the input as a raw text stream, with whitespace handled as an ordinary symbol, making it effective for multilingual models. Many modern LLMs, including Gemma and LLaMA, use SentencePiece-based tokenizers.
Summary
Tokens are the fundamental units of text that the LM-Kit.NET SDK processes to understand and generate language. A token is not always a word; it can be a subword, a character, or a byte, depending on the tokenizer. The SDK's internal token healing feature improves model accuracy by reconstructing split tokens. The Vocabulary class offers powerful tools for managing a model's vocabulary, allowing developers to customize token handling and optimize performance. Because LM-Kit runs on edge inference, there is no token-based cost. Understanding tokens and their management is still crucial for maximizing model efficiency, especially in scenarios with strict context windows or specialized vocabularies.
External Resources
- Neural Machine Translation of Rare Words with Subword Units (BPE paper) by Sennrich, Haddow, and Birch
- SentencePiece: A simple and language independent subword tokenizer on GitHub
- Hugging Face Tokenizers library for fast, production-ready tokenization
- The Illustrated Word2Vec by Jay Alammar, which covers how tokens become vector representations