Class Vocabulary
- Namespace
- LMKit.Tokenization
- Assembly
- LM-Kit.NET.dll
Handles the model's vocabulary and provides advanced tokenization capabilities.
The vocabulary of a large language model differs from that of a human.
While a human's vocabulary is made up of complete words, the vocabulary of a language model is formed by "tokens."
These tokens can represent entire words, but at times, they may also be fragments of words.
public sealed class Vocabulary
- Inheritance
- object
- Vocabulary
Properties
- Size
Gets the number of tokens in the model's vocabulary.
- Vocabs
Gets the list of vocabulary entries for the model, where each entry holds the text of one token (a whole word or a text chunk).
The list is indexed by token value: the text of Token 0 is at entry 0, Token 1 at entry 1, and so forth.
- VocabularyMode
Gets the vocabulary mode used by the model.
This property indicates which type of tokenizer and vocabulary the model uses.
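The relationship between Size, Vocabs, and token values can be sketched with a plain list. The snippet below is a toy stand-in, not a call into LM-Kit: the list contents are invented for illustration, and the variable names are hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Toy stand-in for the Vocabs list. These entries are invented for
// illustration and do not come from any real model.
IReadOnlyList<string> vocabs = new List<string>
{
    "<s>", "Hello", " world", " token", "ization", "!"
};

// Size corresponds to the number of entries in the list.
int size = vocabs.Count;

// A token value is simply an index into the list.
int tokenId = 4;
string text = vocabs[tokenId];

Console.WriteLine($"size = {size}");                 // size = 6
Console.WriteLine($"token {tokenId} = \"{text}\"");  // token 4 = "ization"
```

Note that entry 4 is a word fragment rather than a complete word, which is the typical case described in the class summary.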
Methods
- Decode(IEnumerable<int>)
Converts tokens to a human-readable string.
- DecodeEnd(IEnumerable<int>, int)
Converts tokens to a human-readable string and keeps only the end of that string, up to the specified length.
- DecodeStart(IEnumerable<int>, int)
Converts tokens to a human-readable string and keeps only the start of that string, up to the specified length.
- GetToken(string)
Retrieves the token value for a specific chunk of text.
- Tokenize(string)
Splits a text into smaller units (or tokens) that can be processed by the model.
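To make the intent of these methods concrete, here is a minimal self-contained sketch that mimics them over a toy vocabulary. It does not use LM-Kit: the local functions only borrow the method names, the greedy longest-match tokenizer is a simplification (real models use schemes such as BPE or SentencePiece), and treating the length argument as a character count is an assumption, since the reference above does not state the unit.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical toy vocabulary; the index of an entry is its token value.
var vocabs = new List<string> { "Hello", " world", " token", "ization", "!", " " };

// Toy counterpart of GetToken: token value for an exact chunk, or -1 if absent.
int GetToken(string chunk) => vocabs.IndexOf(chunk);

// Toy counterpart of Tokenize: greedy longest-match against the vocabulary.
List<int> Tokenize(string input)
{
    var result = new List<int>();
    int pos = 0;
    while (pos < input.Length)
    {
        int best = -1;
        for (int id = 0; id < vocabs.Count; id++)
        {
            string entry = vocabs[id];
            if (input.Substring(pos).StartsWith(entry, StringComparison.Ordinal)
                && (best < 0 || entry.Length > vocabs[best].Length))
            {
                best = id; // keep the longest entry that matches at this position
            }
        }
        if (best < 0)
            throw new ArgumentException($"No vocabulary entry matches at position {pos}.");
        result.Add(best);
        pos += vocabs[best].Length;
    }
    return result;
}

// Toy counterpart of Decode: concatenate the text of each token.
string Decode(IEnumerable<int> tokens) =>
    string.Concat(tokens.Select(t => vocabs[t]));

// Toy counterparts of DecodeStart / DecodeEnd. Assumption: maxLength counts characters.
string DecodeStart(IEnumerable<int> tokens, int maxLength)
{
    string s = Decode(tokens);
    return s.Length <= maxLength ? s : s.Substring(0, maxLength);
}

string DecodeEnd(IEnumerable<int> tokens, int maxLength)
{
    string s = Decode(tokens);
    return s.Length <= maxLength ? s : s.Substring(s.Length - maxLength);
}

List<int> ids = Tokenize("Hello world tokenization!");

Console.WriteLine(string.Join(", ", ids));   // 0, 1, 2, 3, 4
Console.WriteLine(Decode(ids));              // Hello world tokenization!
Console.WriteLine(DecodeStart(ids, 5));      // Hello
Console.WriteLine(DecodeEnd(ids, 13));       // tokenization!
Console.WriteLine(GetToken("ization"));      // 3
```

In this sketch "tokenization" is split into the fragments " token" and "ization", and decoding the token values reproduces the original text exactly.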