Class Vocabulary

Namespace: LMKit.Tokenization

Assembly: LM-Kit.NET.dll

Provides advanced tokenization capabilities and manages the model's vocabulary. The vocabulary of a large language model is composed of tokens that may represent whole words or word fragments. This class lazily loads the vocabulary and offers methods for tokenization, decoding, and token healing.

public sealed class Vocabulary

Inheritance: object → Vocabulary

Inherited Members: object.Equals(object), object.Equals(object, object), object.GetHashCode(), object.GetType(), object.ReferenceEquals(object, object), object.ToString()

Properties

Size: Gets the total number of tokens in the model's vocabulary.

Vocabs: Gets the complete list of tokens (vocabulary) used by the model. Each entry corresponds to a token or text chunk.

VocabularyMode: Gets the vocabulary mode, which indicates the type of tokenizer and vocabulary the model uses.

Methods

Decode(IEnumerable<int>): Decodes a sequence of token identifiers into a human-readable string. Special tokens (beginning and end of sequence) are skipped.

DecodeEnd(IEnumerable<int>, int): Decodes a sequence of token identifiers, returning the ending segment of the resulting string up to a specified maximum length.

DecodeStart(IEnumerable<int>, int): Decodes a sequence of token identifiers, returning the beginning segment of the resulting string up to a specified maximum length.

GetToken(string): Retrieves the token identifier corresponding to a given string value.

Tokenize(string): Tokenizes the given text into an array of token identifiers. Special tokens are added and parsed based on the configuration.
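The methods above can be combined into a simple tokenize/decode round trip. The following is a minimal sketch, not a definitive usage pattern. It assumes that Tokenize returns a sequence of int identifiers and that Decode, DecodeStart, and DecodeEnd return string, as the descriptions suggest. The helper name RoundTrip and the class VocabularyExample are hypothetical, and the sketch does not show how a Vocabulary instance is obtained from a loaded model: the caller is expected to supply one.

```csharp
using System;
using System.Collections.Generic;
using LMKit.Tokenization;

public static class VocabularyExample
{
    // Hypothetical helper: the caller passes in an existing Vocabulary instance.
    public static void RoundTrip(Vocabulary vocabulary, string text)
    {
        // Assumption: the result of Tokenize is assignable to IEnumerable<int>
        // (the reference describes it as an array of token identifiers).
        IEnumerable<int> tokenIds = vocabulary.Tokenize(text);

        // Decode the full sequence; per the reference, beginning-of-sequence
        // and end-of-sequence tokens are skipped.
        string decoded = vocabulary.Decode(tokenIds);

        // Decode only the beginning and the end of the sequence.
        // The unit of the second argument (the maximum length) is not stated
        // in the reference; 20 is an arbitrary illustrative value.
        string head = vocabulary.DecodeStart(tokenIds, 20);
        string tail = vocabulary.DecodeEnd(tokenIds, 20);

        Console.WriteLine($"Vocabulary size: {vocabulary.Size}");
        Console.WriteLine($"Decoded: {decoded}");
        Console.WriteLine($"Start:   {head}");
        Console.WriteLine($"End:     {tail}");
    }
}
```

Because Tokenize may add special tokens depending on configuration, the decoded string is not guaranteed to match the input exactly. Compare the two before relying on a lossless round trip.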
