Class Vocabulary
- Namespace
- LMKit.Tokenization
- Assembly
- LM-Kit.NET.dll
Provides advanced tokenization capabilities and manages the model's vocabulary. The vocabulary of a large language model is composed of tokens that may represent whole words or word fragments. This class lazily loads the vocabulary and offers methods for tokenization, decoding, and token healing.
public sealed class Vocabulary
- Inheritance
- object
Vocabulary
Properties
- Size
Gets the total number of tokens in the model's vocabulary.
- Vocabs
Gets the complete list of tokens (vocabulary) used by the model. Each entry corresponds to a token or text chunk.
- VocabularyMode
Gets the vocabulary mode, which indicates the type of tokenizer and vocabulary the model uses.
Methods
- Decode(IEnumerable<int>)
Decodes a sequence of token identifiers into a human-readable string. Special tokens (beginning and end of sequence) are skipped.
- DecodeEnd(IEnumerable<int>, int)
Decodes a sequence of token identifiers, returning the ending segment of the resulting string up to a specified maximum length.
- DecodeStart(IEnumerable<int>, int)
Decodes a sequence of token identifiers, returning the beginning segment of the resulting string up to a specified maximum length.
- GetToken(string)
Retrieves the token identifier corresponding to a given string value.
- Tokenize(string)
Tokenizes the given text into an array of token identifiers. Special tokens are added and parsed based on the configuration.
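The relationship between the methods above can be illustrated with a small self-contained sketch. It does not call LM-Kit.NET: `ToyVocabulary`, its five-entry token table, the greedy longest-match strategy, and the reading of the `int` parameter as a character count are all assumptions made for illustration, not the library's implementation. The sketch only mirrors the documented behavior: tokenization adds special tokens, decoding skips them, and `DecodeStart`/`DecodeEnd` return the leading or trailing part of the decoded text.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Toy stand-in for illustration only; this is NOT the LM-Kit Vocabulary class.
public static class ToyVocabulary
{
    // Hypothetical vocabulary: the array index is the token identifier.
    private static readonly string[] Tokens = { "<s>", "</s>", "token", "ization", " works" };
    private const int Bos = 0; // beginning-of-sequence token
    private const int Eos = 1; // end-of-sequence token

    // Returns the identifier for an exact token string, or -1 if it is absent.
    public static int GetToken(string value) => Array.IndexOf(Tokens, value);

    // Greedy longest-match tokenization, wrapped in BOS and EOS.
    public static int[] Tokenize(string text)
    {
        var ids = new List<int> { Bos };
        int pos = 0;
        while (pos < text.Length)
        {
            int best = -1;
            string rest = text.Substring(pos);
            for (int id = 2; id < Tokens.Length; id++) // skip the special tokens
            {
                bool matches = rest.StartsWith(Tokens[id], StringComparison.Ordinal);
                if (matches && (best < 0 || Tokens[id].Length > Tokens[best].Length))
                    best = id;
            }
            if (best < 0)
                throw new ArgumentException($"No token matches the text at position {pos}.");
            ids.Add(best);
            pos += Tokens[best].Length;
        }
        ids.Add(Eos);
        return ids.ToArray();
    }

    // Concatenates token texts, skipping BOS and EOS.
    public static string Decode(IEnumerable<int> ids)
    {
        var sb = new StringBuilder();
        foreach (int id in ids)
        {
            if (id != Bos && id != Eos)
                sb.Append(Tokens[id]);
        }
        return sb.ToString();
    }

    // First maxLength characters of the decoded text (assumed unit: characters).
    public static string DecodeStart(IEnumerable<int> ids, int maxLength)
    {
        string text = Decode(ids);
        return text.Length <= maxLength ? text : text.Substring(0, maxLength);
    }

    // Last maxLength characters of the decoded text (assumed unit: characters).
    public static string DecodeEnd(IEnumerable<int> ids, int maxLength)
    {
        string text = Decode(ids);
        return text.Length <= maxLength ? text : text.Substring(text.Length - maxLength);
    }
}

public static class Program
{
    public static void Main()
    {
        int[] ids = ToyVocabulary.Tokenize("tokenization works");
        Console.WriteLine(string.Join(", ", ids));              // 0, 2, 3, 4, 1
        Console.WriteLine(ToyVocabulary.Decode(ids));           // tokenization works
        Console.WriteLine(ToyVocabulary.DecodeStart(ids, 5));   // token
        Console.WriteLine(ToyVocabulary.DecodeEnd(ids, 5));     // works
        Console.WriteLine(ToyVocabulary.GetToken("ization"));   // 3
    }
}
```

In this toy, the word "tokenization" is split into the fragments "token" and "ization", which matches the description of tokens as whole words or word fragments. A real model vocabulary is far larger and its tokenizer type depends on the value reported by VocabularyMode.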