Method GetTrainingData

Namespace
LMKit.Translation
Assembly
LM-Kit.NET.dll

GetTrainingData(TrainingDataset, int, bool, int?)

Retrieves training data for fine-tuning language detection models from the specified dataset.

public static List<(string, Language)> GetTrainingData(TextTranslation.TrainingDataset dataset, int maxSamples = 1000, bool shuffle = true, int? seed = null)

Parameters

dataset TextTranslation.TrainingDataset

The dataset identifier from the TextTranslation.TrainingDataset enumeration.

maxSamples int

The maximum number of samples to retrieve. Default is 1000.

shuffle bool

If set to true, the dataset is shuffled before samples are selected. Default is true.

seed int?

An optional seed for the random number generator used during shuffling. If null, the shuffle is not seeded. Default is null.
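How maxSamples, shuffle, and seed interact can be illustrated with a small stand-alone sketch. It does not show LM-Kit's internal implementation, which is not documented here. SelectSamples is a hypothetical helper, and it assumes a Fisher-Yates shuffle driven by System.Random followed by truncation to maxSamples.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper (not part of LM-Kit): optionally shuffle a copy of the
// source, then keep at most maxSamples items.
static List<T> SelectSamples<T>(IReadOnlyList<T> source, int maxSamples, bool shuffle, int? seed)
{
    var items = new List<T>(source);
    if (shuffle)
    {
        // A seeded Random yields the same sequence on every run;
        // the parameterless constructor does not.
        var rng = seed.HasValue ? new Random(seed.Value) : new Random();

        // Fisher-Yates shuffle, walking from the end of the list.
        for (int i = items.Count - 1; i > 0; i--)
        {
            int j = rng.Next(i + 1);
            (items[i], items[j]) = (items[j], items[i]);
        }
    }
    return items.Take(maxSamples).ToList();
}

// Stand-in pool of 20 items in place of a real dataset.
var pool = Enumerable.Range(0, 20).ToList();
var first = SelectSamples(pool, 5, shuffle: true, seed: 42);
var second = SelectSamples(pool, 5, shuffle: true, seed: 42);

// The same seed produces the same selection.
Console.WriteLine(first.SequenceEqual(second));
```

Under these assumptions, passing a fixed seed is what makes a shuffled selection repeatable between runs, which is useful when comparing fine-tuning experiments.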

Returns

List<(string, Language)>

A list of tuples where each tuple consists of:

  • A string representing a text sample.
  • A Language enumeration value representing the corresponding language label.
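Because the result is a plain list of tuples, it can be inspected with LINQ before fine-tuning, for example to check how evenly the labels are distributed. The sketch below is an illustration only. CountByLabel is a hypothetical helper, and string labels stand in for the Language enumeration so that the code compiles without LM-Kit.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: counts how many samples carry each label.
// Generic over the label type; with the real API, TLabel would be Language.
static Dictionary<TLabel, int> CountByLabel<TLabel>(IEnumerable<(string Text, TLabel Label)> samples)
    where TLabel : notnull
{
    return samples
        .GroupBy(s => s.Label)
        .ToDictionary(g => g.Key, g => g.Count());
}

// Stand-in data: string labels replace the Language enum (assumption for illustration).
var samples = new List<(string, string)>
{
    ("Hello, world", "en"),
    ("Bonjour le monde", "fr"),
    ("Good morning", "en"),
};

var counts = CountByLabel(samples);
foreach (var pair in counts)
{
    Console.WriteLine($"{pair.Key}: {pair.Value}");
}
```

With the actual return value, the same helper could be called as CountByLabel(trainingData), since a List<(string, Language)> converts to the named-tuple parameter type.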

Examples

// Request up to 500 samples, shuffled with a fixed seed of 42.
List<(string, Language)> trainingData = TextTranslation.GetTrainingData(
    TextTranslation.TrainingDataset.LanguageDetection_LMKit2024_09_INT,
    maxSamples: 500,
    shuffle: true,
    seed: 42);

// Each tuple holds a text sample and its language label.
foreach (var (text, language) in trainingData)
{
    Console.WriteLine($"Text: {text}, Language: {language}");
}

Exceptions

ArgumentException

Thrown if the specified dataset is not recognized.