Class Vocabulary

Namespace
LMKit.Tokenization
Assembly
LM-Kit.NET.dll

Handles a model's vocabulary and provides advanced tokenization capabilities.
The vocabulary of a large language model differs from that of a human.
While a human's vocabulary is made up of complete words, the vocabulary of a language model is formed by "tokens."
A token can represent an entire word, but it can also be a fragment of a word.

public sealed class Vocabulary
Inheritance
Vocabulary

Properties

Size

Gets the number of tokens in the model's vocabulary.

Vocabs

A list specifying the vocabulary of the model, where each entry corresponds to a specific token or text chunk.
The value of Token 0 is located at entry 0, Token 1 at entry 1, and so forth.

VocabularyMode

Gets the vocabulary mode used by the model.
This property indicates the type of tokenizer and vocabulary utilized by the model weights.
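
The index layout described for Vocabs can be pictured with a small, self-contained sketch. ToyVocabulary below is a hypothetical stand-in written for illustration, not the LM-Kit.NET implementation, and its five entries are invented. It only shows the idea that a token's integer value is its position in the list, so turning tokens back into text amounts to a lookup followed by concatenation.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for illustration only; not the LM-Kit.NET Vocabulary class.
public static class ToyVocabulary
{
    // Entry i holds the text of token i. Note that " wor" and "ld" are word fragments.
    public static readonly IReadOnlyList<string> Vocabs =
        new[] { "Hello", ",", " wor", "ld", "!" };

    // Number of tokens in the toy vocabulary.
    public static int Size => Vocabs.Count;

    // Looks up each token value by index and concatenates the entries in order.
    public static string Decode(IEnumerable<int> tokens) =>
        string.Concat(tokens.Select(id => Vocabs[id]));
}
```

With these invented entries, ToyVocabulary.Size is 5, and decoding the tokens 0, 1, 2, 3, 4 yields "Hello, world!". The word "world" is split across two entries, which is the fragment behavior described at the top of this page.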

Methods

Decode(IEnumerable<int>)

Converts tokens to a human-readable string.

DecodeEnd(IEnumerable<int>, int)

Transforms tokens into a human-readable format, retaining the end portion of the string up to a specified length.

DecodeStart(IEnumerable<int>, int)

Transforms tokens into a human-readable format, retaining the start portion of the string up to a specified length.

GetToken(string)

Retrieves the token value for a specific chunk of text.

Tokenize(string)

Splits a text into smaller units (or tokens) that can be processed by the model.
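
To make the relationship between these methods concrete, here is a minimal, self-contained sketch. ToyTokenizer is a hypothetical class with an invented six-entry vocabulary, not the LM-Kit.NET implementation. It rests on three assumptions that this page does not confirm: the split uses greedy longest-match (a deliberately simplified stand-in for whatever algorithm a given model's tokenizer actually uses), the int parameter of DecodeStart and DecodeEnd counts characters of the decoded text, and GetToken signals an unknown chunk with -1.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for illustration only; not the LM-Kit.NET Vocabulary class.
public static class ToyTokenizer
{
    // Invented vocabulary: the token value is the array index.
    private static readonly string[] Vocabs = { "un", "believ", "able", "!", "u", "n" };

    // Returns the token value of an exact vocabulary entry, or -1 if there is none (assumption).
    public static int GetToken(string text) => Array.IndexOf(Vocabs, text);

    // Greedy longest-match split: at each position, take the longest entry that matches.
    public static List<int> Tokenize(string text)
    {
        var tokens = new List<int>();
        int pos = 0;
        while (pos < text.Length)
        {
            int best = -1;
            for (int id = 0; id < Vocabs.Length; id++)
            {
                string entry = Vocabs[id];
                bool matches = string.CompareOrdinal(text, pos, entry, 0, entry.Length) == 0;
                if (matches && (best < 0 || entry.Length > Vocabs[best].Length))
                    best = id;
            }
            if (best < 0)
                throw new ArgumentException($"No vocabulary entry matches at position {pos}.");
            tokens.Add(best);
            pos += Vocabs[best].Length;
        }
        return tokens;
    }

    // Concatenates the text of each token, in order.
    public static string Decode(IEnumerable<int> tokens) =>
        string.Concat(tokens.Select(id => Vocabs[id]));

    // Keeps at most maxLength characters from the start of the decoded text (assumption).
    public static string DecodeStart(IEnumerable<int> tokens, int maxLength)
    {
        string text = Decode(tokens);
        return text.Length <= maxLength ? text : text.Substring(0, maxLength);
    }

    // Keeps at most maxLength characters from the end of the decoded text (assumption).
    public static string DecodeEnd(IEnumerable<int> tokens, int maxLength)
    {
        string text = Decode(tokens);
        return text.Length <= maxLength ? text : text.Substring(text.Length - maxLength);
    }
}
```

In this toy setup, ToyTokenizer.Tokenize("unbelievable!") produces the tokens 0, 1, 2, 3, and decoding them restores the original string. Limiting the result to 5 characters gives "unbel" with DecodeStart and "able!" with DecodeEnd. ToyTokenizer.GetToken("able") returns 2.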