Enum VocabularyMode

Namespace
LMKit.Tokenization
Assembly
LM-Kit.NET.dll

Specifies the vocabulary modes used by different tokenizer models.

public enum VocabularyMode

Fields

NONE = 0

No vocabulary mode is specified. Used for models without a vocabulary.

SPM = 1

Uses a SentencePiece Model (SPM) vocabulary. This mode is based on byte-level Byte Pair Encoding (BPE) with byte fallback and is typically used by the LLaMA tokenizer.

BPE = 2

Uses a Byte Pair Encoding (BPE) vocabulary. This mode is based on byte-level BPE and is commonly used by the GPT-2 tokenizer.

WPM = 3

Uses a WordPiece Model (WPM) vocabulary. This mode is based on WordPiece and is typically used by the BERT tokenizer.
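
Examples

The following is a minimal sketch of branching on a VocabularyMode value. It makes two assumptions: the enum is redeclared locally, with the values copied from the field list above, so that the snippet compiles without referencing LM-Kit.NET.dll; and Describe is a hypothetical helper that is not part of the library. This page does not show how a loaded model exposes its vocabulary mode, so the sketch simply iterates over the enum values. In a real project, remove the local declaration and reference LMKit.Tokenization.VocabularyMode instead.

```csharp
using System;

// Local mirror of LMKit.Tokenization.VocabularyMode (an assumption made so the
// sketch is self-contained). Values are copied from the field list above.
public enum VocabularyMode
{
    NONE = 0,
    SPM = 1,
    BPE = 2,
    WPM = 3
}

public static class VocabularyModeDemo
{
    // Hypothetical helper, not part of LM-Kit.NET: maps each mode to a short
    // description of the tokenization scheme documented for it.
    public static string Describe(VocabularyMode mode) => mode switch
    {
        VocabularyMode.NONE => "no vocabulary",
        VocabularyMode.SPM => "SentencePiece: byte-level BPE with byte fallback (LLaMA)",
        VocabularyMode.BPE => "byte-level BPE (GPT-2)",
        VocabularyMode.WPM => "WordPiece (BERT)",
        // Guard against values outside the four documented fields.
        _ => throw new ArgumentOutOfRangeException(nameof(mode), mode, "Unknown vocabulary mode")
    };

    public static void Main()
    {
        // Print the numeric value, name, and description of every mode.
        foreach (VocabularyMode mode in Enum.GetValues(typeof(VocabularyMode)))
        {
            Console.WriteLine($"{(int)mode} {mode}: {Describe(mode)}");
        }
    }
}
```

The default arm throws instead of returning a fallback string, so a value added to the enum in a later release surfaces as an error in calling code and is not silently treated as one of the existing modes.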