Enum VocabularyMode

Specifies the vocabulary modes used by different tokenizer models.

public enum VocabularyMode

Fields

NONE = 0: No vocabulary mode is specified. Used for models without a vocabulary.
SPM = 1: Uses a SentencePiece Model (SPM) vocabulary. This mode is based on byte-level Byte Pair Encoding (BPE) with byte fallback, as typically used by the LLaMA tokenizer.
BPE = 2: Uses a Byte Pair Encoding (BPE) vocabulary. This mode is based on byte-level BPE, as commonly used by the GPT-2 tokenizer.
WPM = 3: Uses a WordPiece Model (WPM) vocabulary. This mode is based on WordPiece, as typically used by the BERT tokenizer.
UGM = 4: Uses a Unigram Model (UGM) vocabulary. This mode is based on Unigram, as typically used by the T5 tokenizer.
RWKV = 5: Uses the RWKV vocabulary. This mode is based on greedy tokenization, as used by the RWKV tokenizer.
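
The fields above can be exercised with a short sketch. It is illustrative only: the enum is redeclared locally so the example compiles on its own (real code would reference the library's VocabularyMode, whose namespace is not stated here), and the VocabularyModeExample class and its Describe helper are hypothetical names, not part of the documented API.

```csharp
using System;

// Local mirror of the documented enum so this sketch is self-contained.
// Assumption: real code would use the library's own VocabularyMode type.
public enum VocabularyMode
{
    NONE = 0,
    SPM = 1,
    BPE = 2,
    WPM = 3,
    UGM = 4,
    RWKV = 5,
}

public static class VocabularyModeExample
{
    // Hypothetical helper: maps each mode to a short description
    // taken from the field list above.
    public static string Describe(VocabularyMode mode) => mode switch
    {
        VocabularyMode.NONE => "no vocabulary",
        VocabularyMode.SPM => "SentencePiece (byte-level BPE with byte fallback)",
        VocabularyMode.BPE => "byte-level BPE",
        VocabularyMode.WPM => "WordPiece",
        VocabularyMode.UGM => "Unigram",
        VocabularyMode.RWKV => "RWKV greedy tokenization",
        _ => throw new ArgumentOutOfRangeException(nameof(mode), mode, "Unknown vocabulary mode"),
    };

    public static void Main()
    {
        // A raw integer (for example, one read from model metadata) may not
        // correspond to a defined field, so check it before casting.
        int raw = 2;
        if (Enum.IsDefined(typeof(VocabularyMode), raw))
        {
            var mode = (VocabularyMode)raw;
            Console.WriteLine($"{mode} ({(int)mode}): {Describe(mode)}");
        }
    }
}
```

Checking with Enum.IsDefined before the cast matters because C# allows casting any integer to an enum type, so an out-of-range value would otherwise pass through silently until it reaches the switch.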