Build a Unified Multimodal RAG System for Audio, Text, and Images

Enterprise knowledge lives in many formats: meeting recordings, scanned invoices, technical manuals, photos of whiteboards, and plain-text reports. Traditional RAG pipelines handle only text documents, leaving audio and image content unsearchable. LM-Kit.NET lets you build a single knowledge base that ingests all three modalities: audio is converted to text via speech-to-text, images are converted to Markdown via VLM OCR, and everything is embedded into one vector store, so users can query across all content types with a single question. This tutorial walks through building that system, combining audio transcriptions, scanned-document text, and standard documents into one searchable knowledge base.


Why Multimodal RAG Matters

Two enterprise problems that a unified multimodal knowledge base solves:

  1. Cross-format institutional knowledge. An engineering firm accumulates project knowledge across meeting recordings, hand-drawn design sketches, scanned site inspection reports, and digital specifications. Engineers searching for information about a past project must check multiple systems. A unified RAG pipeline indexes all content types into one searchable store, so "What was the load-bearing capacity decision for Building C?" finds the answer whether it was spoken in a meeting, written in a spec, or captured in a scanned report.
  2. Compliance and audit readiness. Regulated organizations must demonstrate that they can locate any piece of evidence across all record types: recorded calls, faxed contracts, digital correspondence, and photographed receipts. A multimodal knowledge base provides a single search endpoint for compliance teams, replacing manual searches across disconnected archives.

Prerequisites

Requirement      Minimum
.NET SDK         8.0+
VRAM             ~5 GB (embedding model + Whisper + VLM OCR model)
Disk             ~5 GB free for model downloads
Input formats    .wav audio; .pdf/.docx/.txt documents; .png/.jpg/.tiff images

Step 1: Create the Project

dotnet new console -n MultimodalRag
cd MultimodalRag
dotnet add package LM-Kit.NET

Step 2: Understand the Architecture

  ┌─────────────────┐   ┌─────────────────┐    ┌─────────────────┐
  │  Audio files    │   │  Scanned images │    │  Text documents │
  │  (.wav)         │   │  (.png, .jpg)   │    │  (.pdf, .docx)  │
  └────────┬────────┘   └────────┬────────┘    └────────┬────────┘
           │                     │                      │
           ▼                     ▼                      ▼
  ┌─────────────────┐   ┌─────────────────┐    ┌─────────────────┐
  │  SpeechToText   │   │  VlmOcr         │    │  Direct text    │
  │  audio → text   │   │  image → MD     │    │  extraction     │
  └────────┬────────┘   └────────┬────────┘    └────────┬────────┘
           │                     │                      │
           └─────────────────────┼──────────────────────┘
                                 │
                                 ▼
                    ┌──────────────────────┐
                    │  RagEngine           │
                    │  ImportText()        │
                    │  (unified embedding) │
                    └──────────┬───────────┘
                               │
                               ▼
                    ┌──────────────────────┐
                    │  Vector Store        │
                    │  (single index)      │
                    │                      │
                    │  Audio sections      │
                    │  Image sections      │
                    │  Document sections   │
                    └──────────┬───────────┘
                               │
                               ▼
                    Query across all content

The key insight: once audio and images are converted to text, they become regular text content that the embedding model can index and search. Metadata tags track the original source type so you can filter or attribute results.
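Because this attribution convention is just a string prefix on the section identifier, it can be captured in one small helper. A minimal sketch; the GetSourceType name and the prefix-to-label mapping are this tutorial's conventions, not part of the LM-Kit.NET API:

```csharp
// Maps a section identifier such as "audio:standup-2024-01-15" to a
// human-readable source type for display or filtering.
static string GetSourceType(string sectionIdentifier)
{
    int colon = sectionIdentifier.IndexOf(':');
    string prefix = colon >= 0 ? sectionIdentifier[..colon] : "";

    return prefix switch
    {
        "audio"       => "audio recording",
        "image"       => "scanned image",
        "scanned_pdf" => "scanned PDF",
        "doc"         => "text document",
        _             => "unknown"
    };
}

// GetSourceType("audio:kickoff-meeting")  -> "audio recording"
// GetSourceType("doc:spec-v2")            -> "text document"
```

The query loop in Step 8 performs exactly this mapping inline; a helper like this keeps the convention in one place.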


Step 3: Set Up the Multimodal Pipeline

using System.Text;
using LMKit.Data;
using LMKit.Extraction.Ocr;
using LMKit.Graphics;
using LMKit.Media.Audio;
using LMKit.Media.Image;
using LMKit.Model;
using LMKit.Retrieval;
using LMKit.Speech;
using LMKit.TextGeneration;

LMKit.Licensing.LicenseManager.SetLicenseKey("");

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load all models
// ──────────────────────────────────────
Console.WriteLine("Loading embedding model...");
using LM embeddingModel = LM.LoadFromModelID("embeddinggemma-300m",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r  Downloading: {(double)read / len.Value * 100:F1}%   ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\r  Loading: {p * 100:F0}%   "); return true; });
Console.WriteLine("\n");

Console.WriteLine("Loading Whisper model (for audio)...");
using LM whisperModel = LM.LoadFromModelID("whisper-large-turbo3",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r  Downloading: {(double)read / len.Value * 100:F1}%   ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\r  Loading: {p * 100:F0}%   "); return true; });
Console.WriteLine("\n");

Console.WriteLine("Loading VLM OCR model (for images)...");
using LM ocrModel = LM.LoadFromModelID("lightonocr-2:1b",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r  Downloading: {(double)read / len.Value * 100:F1}%   ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\r  Loading: {p * 100:F0}%   "); return true; });
Console.WriteLine("\n");

Console.WriteLine("Loading chat model (for Q&A)...");
using LM chatModel = LM.LoadFromModelID("gemma3:4b",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r  Downloading: {(double)read / len.Value * 100:F1}%   ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\r  Loading: {p * 100:F0}%   "); return true; });
Console.WriteLine("\n");

// ──────────────────────────────────────
// 2. Create processing engines
// ──────────────────────────────────────
var stt = new SpeechToText(whisperModel)
{
    EnableVoiceActivityDetection = true,
    SuppressNonSpeechTokens = true,
    SuppressHallucinations = true
};

var vlmOcr = new VlmOcr(ocrModel)
{
    MaximumCompletionTokens = 4096
};

// ──────────────────────────────────────
// 3. Create a unified RAG engine
// ──────────────────────────────────────
var rag = new RagEngine(embeddingModel);
string dataSourceId = "multimodal-knowledge-base";

Console.WriteLine("=== Multimodal Knowledge Base ===\n");

Step 4: Ingest Audio Files (Speech-to-Text)

Convert audio recordings to text and import into the knowledge base:

// ──────────────────────────────────────
// Setup (identical to Step 3): usings, license key, console encoding,
// model loading, the SpeechToText and VlmOcr engines, the RagEngine,
// and dataSourceId.
// ──────────────────────────────────────

// ──────────────────────────────────────
// 4. Ingest audio files
// ──────────────────────────────────────
string audioDir = "content/audio";
if (Directory.Exists(audioDir))
{
    string[] audioFiles = Directory.GetFiles(audioDir, "*.wav");
    Console.WriteLine($"Audio files: {audioFiles.Length}\n");

    foreach (string audioPath in audioFiles)
    {
        string fileName = Path.GetFileNameWithoutExtension(audioPath);
        Console.Write($"  {Path.GetFileName(audioPath)}: transcribing... ");

        try
        {
            using var audio = new WaveFile(audioPath);
            var transcription = stt.Transcribe(audio);

            // Tag with source metadata
            var metadata = new MetadataCollection();
            metadata.Add("source_type", "audio");
            metadata.Add("source_file", Path.GetFileName(audioPath));
            metadata.Add("duration_seconds", audio.Duration.TotalSeconds.ToString("F0"));
            metadata.Add("segment_count", transcription.Segments.Count.ToString());

            // Import transcribed text into the RAG engine
            await rag.ImportTextAsync(
                transcription.Text,
                dataSourceId,
                sectionIdentifier: $"audio:{fileName}",
                additionalMetadata: metadata);

            Console.ForegroundColor = ConsoleColor.Green;
            Console.WriteLine($"done ({transcription.Text.Length} chars)");
            Console.ResetColor();
        }
        catch (Exception ex)
        {
            Console.ForegroundColor = ConsoleColor.Red;
            Console.WriteLine($"failed: {ex.Message}");
            Console.ResetColor();
        }
    }

    Console.WriteLine();
}

Step 5: Ingest Scanned Images (VLM OCR)

Convert images to Markdown and import into the knowledge base:

// ──────────────────────────────────────
// Setup (identical to Step 3): usings, license key, console encoding,
// model loading, the SpeechToText and VlmOcr engines, the RagEngine,
// and dataSourceId.
// ──────────────────────────────────────

// ──────────────────────────────────────
// 5. Ingest scanned images via VLM OCR
// ──────────────────────────────────────
string imageDir = "content/images";
string[] imageExtensions = { ".png", ".jpg", ".jpeg", ".tiff", ".bmp", ".webp" };

if (Directory.Exists(imageDir))
{
    string[] imageFiles = Directory.GetFiles(imageDir)
        .Where(f => imageExtensions.Contains(Path.GetExtension(f).ToLowerInvariant()))
        .ToArray();

    Console.WriteLine($"Image files: {imageFiles.Length}\n");

    foreach (string imagePath in imageFiles)
    {
        string fileName = Path.GetFileNameWithoutExtension(imagePath);
        Console.Write($"  {Path.GetFileName(imagePath)}: OCR... ");

        try
        {
            var image = ImageBuffer.LoadAsRGB(imagePath);

            // Convert image to Markdown using VLM OCR
            VlmOcr.VlmOcrResult ocrResult = vlmOcr.Run(image);
            string markdownText = ocrResult.TextGeneration.Completion;

            // Tag with source metadata
            var metadata = new MetadataCollection();
            metadata.Add("source_type", "image");
            metadata.Add("source_file", Path.GetFileName(imagePath));
            metadata.Add("ocr_tokens", ocrResult.TextGeneration.GeneratedTokenCount.ToString());

            // Import OCR text into the RAG engine
            await rag.ImportTextAsync(
                markdownText,
                dataSourceId,
                sectionIdentifier: $"image:{fileName}",
                additionalMetadata: metadata);

            Console.ForegroundColor = ConsoleColor.Green;
            Console.WriteLine($"done ({markdownText.Length} chars)");
            Console.ResetColor();
        }
        catch (Exception ex)
        {
            Console.ForegroundColor = ConsoleColor.Red;
            Console.WriteLine($"failed: {ex.Message}");
            Console.ResetColor();
        }
    }

    Console.WriteLine();
}

Step 6: Ingest Scanned PDFs (Multi-Page VLM OCR)

For scanned PDFs that contain no text layer, OCR each page and import the combined Markdown:

// ──────────────────────────────────────
// Setup (identical to Step 3): usings, license key, console encoding,
// model loading, the SpeechToText and VlmOcr engines, the RagEngine,
// and dataSourceId.
// ──────────────────────────────────────

// ──────────────────────────────────────
// 6. Ingest scanned PDFs via VLM OCR
// ──────────────────────────────────────
string scannedPdfDir = "content/scanned_pdfs";

if (Directory.Exists(scannedPdfDir))
{
    string[] scannedPdfs = Directory.GetFiles(scannedPdfDir, "*.pdf");
    Console.WriteLine($"Scanned PDFs: {scannedPdfs.Length}\n");

    foreach (string pdfPath in scannedPdfs)
    {
        string fileName = Path.GetFileNameWithoutExtension(pdfPath);
        Console.Write($"  {Path.GetFileName(pdfPath)}: ");

        try
        {
            var attachment = new Attachment(pdfPath);
            var fullMarkdown = new StringBuilder();

            for (int page = 0; page < attachment.PageCount; page++)
            {
                VlmOcr.VlmOcrResult pageResult = vlmOcr.Run(attachment, pageIndex: page);
                fullMarkdown.AppendLine(pageResult.TextGeneration.Completion);
                fullMarkdown.AppendLine();
            }

            var metadata = new MetadataCollection();
            metadata.Add("source_type", "scanned_pdf");
            metadata.Add("source_file", Path.GetFileName(pdfPath));
            metadata.Add("page_count", attachment.PageCount.ToString());

            await rag.ImportTextAsync(
                fullMarkdown.ToString(),
                dataSourceId,
                sectionIdentifier: $"scanned_pdf:{fileName}",
                additionalMetadata: metadata);

            Console.ForegroundColor = ConsoleColor.Green;
            Console.WriteLine($"done ({attachment.PageCount} pages)");
            Console.ResetColor();
        }
        catch (Exception ex)
        {
            Console.ForegroundColor = ConsoleColor.Red;
            Console.WriteLine($"failed: {ex.Message}");
            Console.ResetColor();
        }
    }

    Console.WriteLine();
}

Step 7: Ingest Text Documents (Direct)

Text documents with native text layers (PDFs, Word, TXT, HTML) are imported directly:

// ──────────────────────────────────────
// Setup (identical to Step 3): usings, license key, console encoding,
// model loading, the SpeechToText and VlmOcr engines, the RagEngine,
// and dataSourceId.
// ──────────────────────────────────────

// ──────────────────────────────────────
// 7. Ingest text documents directly
// ──────────────────────────────────────
string docsDir = "content/documents";
string[] docExtensions = { ".pdf", ".docx", ".txt", ".md", ".html" };

if (Directory.Exists(docsDir))
{
    string[] docFiles = Directory.GetFiles(docsDir)
        .Where(f => docExtensions.Contains(Path.GetExtension(f).ToLowerInvariant()))
        .ToArray();

    Console.WriteLine($"Text documents: {docFiles.Length}\n");

    foreach (string docPath in docFiles)
    {
        string fileName = Path.GetFileNameWithoutExtension(docPath);
        Console.Write($"  {Path.GetFileName(docPath)}: indexing... ");

        try
        {
            // Extract text from the document
            var attachment = new Attachment(docPath);
            string text = attachment.GetText();

            var metadata = new MetadataCollection();
            metadata.Add("source_type", "document");
            metadata.Add("source_file", Path.GetFileName(docPath));
            metadata.Add("file_type", Path.GetExtension(docPath).TrimStart('.'));

            await rag.ImportTextAsync(
                text,
                dataSourceId,
                sectionIdentifier: $"doc:{fileName}",
                additionalMetadata: metadata);

            Console.ForegroundColor = ConsoleColor.Green;
            Console.WriteLine($"done ({text.Length} chars)");
            Console.ResetColor();
        }
        catch (Exception ex)
        {
            Console.ForegroundColor = ConsoleColor.Red;
            Console.WriteLine($"failed: {ex.Message}");
            Console.ResetColor();
        }
    }

    Console.WriteLine();
}

Step 8: Query Across All Content Types

Search the unified knowledge base with natural language queries:

// ──────────────────────────────────────
// Setup (identical to Step 3), plus one extra using directive
// needed here: using LMKit.TextGeneration.Chat;
// ──────────────────────────────────────

// ──────────────────────────────────────
// 8. Query the unified knowledge base
// ──────────────────────────────────────
var chat = new SingleTurnConversation(chatModel)
{
    SystemPrompt = "Answer the question using only the provided context. " +
                   "Mention the source type (audio recording, scanned document, or text document) " +
                   "when citing information. If the context does not contain the answer, say so.",
    MaximumCompletionTokens = 512
};

// Stream visible tokens to the console. Subscribe once, outside the loop:
// subscribing inside the loop would add a new handler on every question and
// duplicate the streamed output.
chat.AfterTextCompletion += (_, e) =>
{
    if (e.SegmentType == TextSegmentType.UserVisible)
        Console.Write(e.Text);
};

Console.WriteLine("Ask questions across all your content (or 'quit' to exit):\n");

while (true)
{
    Console.ForegroundColor = ConsoleColor.Green;
    Console.Write("Question: ");
    Console.ResetColor();

    string? question = Console.ReadLine();
    if (string.IsNullOrWhiteSpace(question) || question.Equals("quit", StringComparison.OrdinalIgnoreCase))
        break;

    // Find matching partitions across all content types
    var matches = rag.FindMatchingPartitions(question, topK: 5, minScore: 0.25f);

    if (matches.Count == 0)
    {
        Console.WriteLine("No relevant content found.\n");
        continue;
    }

    // Show which sources matched
    Console.ForegroundColor = ConsoleColor.DarkGray;
    foreach (var m in matches)
    {
        string sourceType = "unknown";
        if (m.SectionIdentifier.StartsWith("audio:")) sourceType = "audio";
        else if (m.SectionIdentifier.StartsWith("image:")) sourceType = "image";
        else if (m.SectionIdentifier.StartsWith("scanned_pdf:")) sourceType = "scanned PDF";
        else if (m.SectionIdentifier.StartsWith("doc:")) sourceType = "document";

        Console.WriteLine($"  [{sourceType}] {m.SectionIdentifier} (score={m.Similarity:F3})");
    }
    Console.ResetColor();

    // Build context from matched partitions
    var context = new StringBuilder();
    foreach (var m in matches)
    {
        context.AppendLine($"[Source: {m.SectionIdentifier}]");
        context.AppendLine(m.Payload);
        context.AppendLine();
    }

    // Generate answer
    Console.ForegroundColor = ConsoleColor.Cyan;
    Console.Write("\nAnswer: ");
    Console.ResetColor();

    string prompt = $"Context:\n{context}\n\nQuestion: {question}";
    chat.Submit(prompt);
    Console.WriteLine("\n");
}
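
The loop above searches every modality at once. When a question should be answered from a single modality, the matched partitions can be filtered by their identifier prefix before the context is built. A sketch reusing the rag engine and the prefix conventions from this tutorial (the query string is illustrative):

```csharp
// Retrieve as usual, then keep only partitions that came from audio transcriptions.
var allMatches = rag.FindMatchingPartitions(
    "What did we decide about the launch date?", topK: 10, minScore: 0.25f);

var audioOnly = allMatches
    .Where(m => m.SectionIdentifier.StartsWith("audio:"))
    .ToList();

foreach (var m in audioOnly)
    Console.WriteLine($"  [audio] {m.SectionIdentifier} (score={m.Similarity:F3})");
```

Raising topK before filtering compensates for the matches that the prefix filter discards.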

Step 9: Persist the Knowledge Base

Use FileSystemVectorStore to save the multimodal index to disk so it survives application restarts:

using LMKit.Data;
using LMKit.Data.Storage;
using LMKit.Retrieval;

// ──────────────────────────────────────
// Persistent multimodal knowledge base
// ──────────────────────────────────────
string storageDir = "multimodal_store";
Directory.CreateDirectory(storageDir);

var vectorStore = new FileSystemVectorStore(storageDir);

var persistentRag = new RagEngine(embeddingModel, vectorStore);
string persistentDataSourceId = "multimodal-kb";

// Check what is already indexed
Console.WriteLine($"Vector store entries: {vectorStore.Count}");

// Import new content (already-indexed sections are skipped by checking HasSection)
DataSource ds = rag.DataSources.FirstOrDefault(d => d.Identifier == dataSourceId);
if (ds != null)
{
    foreach (var section in ds.Sections)
    {
        if (!persistentRag.DataSources.Any(d => d.HasSection(section.Identifier)))
        {
            Console.WriteLine($"  Indexing: {section.Identifier}");
            // Re-import text for this section
            // (In a production system, you would store the extracted text alongside the embeddings)
        }
    }
}

Console.WriteLine($"\nPersistent store: {Path.GetFullPath(storageDir)}");
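
Step 9's inner comment notes that a production system should store the extracted text alongside the embeddings. One simple approach is a sidecar JSON manifest written at ingestion time; a sketch using only System.Text.Json, where the manifest.json name and layout are assumptions of this tutorial, not an LM-Kit.NET feature:

```csharp
using System.Text.Json;

// Keep each section's extracted text in a sidecar file so the index can be
// rebuilt or re-embedded later without re-running speech-to-text or OCR.
Directory.CreateDirectory("multimodal_store");
string manifestPath = Path.Combine("multimodal_store", "manifest.json");

Dictionary<string, string> manifest = File.Exists(manifestPath)
    ? JsonSerializer.Deserialize<Dictionary<string, string>>(File.ReadAllText(manifestPath))
      ?? new Dictionary<string, string>()
    : new Dictionary<string, string>();

// Record the extracted text right after each successful ImportTextAsync call,
// keyed by the same section identifier used for the import (hypothetical section):
manifest["audio:kickoff-meeting"] = "…transcribed text…";

File.WriteAllText(manifestPath,
    JsonSerializer.Serialize(manifest, new JsonSerializerOptions { WriteIndented = true }));
```

On the next run, any section present in the manifest but missing from the vector store can be re-imported directly from the stored text instead of re-processing the original media.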

Step 10: Incremental Ingestion with Source Tracking

Build an ingestion loop that only processes new files:

// ──────────────────────────────────────
// Setup (identical to Step 3): usings, license key,
// console encoding, and model loading.
// ──────────────────────────────────────

// ──────────────────────────────────────
// 2. Create processing engines
// ──────────────────────────────────────
var stt = new SpeechToText(whisperModel)
{
    EnableVoiceActivityDetection = true,
    SuppressNonSpeechTokens = true,
    SuppressHallucinations = true
};

var vlmOcr = new VlmOcr(ocrModel)
{
    MaximumCompletionTokens = 4096
};

// ──────────────────────────────────────
// 3. Create a unified RAG engine
// ──────────────────────────────────────
var rag = new RagEngine(embeddingModel);
string dataSourceId = "multimodal-knowledge-base";

// ──────────────────────────────────────
// Incremental ingestion: skip already-indexed files
// ──────────────────────────────────────
Console.WriteLine("\n=== Incremental Ingestion ===\n");

void IngestIfNew(string sectionId, Func<string> textExtractor, MetadataCollection metadata)
{
    // Look up the data source on every call so that sections indexed earlier
    // in this same run are also detected as duplicates
    DataSource existingDs = rag.DataSources.FirstOrDefault(d => d.Identifier == dataSourceId);
    if (existingDs != null && existingDs.HasSection(sectionId))
    {
        Console.ForegroundColor = ConsoleColor.DarkGray;
        Console.WriteLine($"  {sectionId}: already indexed (skipped)");
        Console.ResetColor();
        return;
    }

    Console.Write($"  {sectionId}: indexing... ");
    try
    {
        string text = textExtractor();
        rag.ImportText(text, dataSourceId, sectionId, metadata);

        Console.ForegroundColor = ConsoleColor.Green;
        Console.WriteLine("done");
        Console.ResetColor();
    }
    catch (Exception ex)
    {
        Console.ForegroundColor = ConsoleColor.Red;
        Console.WriteLine($"failed: {ex.Message}");
        Console.ResetColor();
    }
}

// Example: ingest a new audio file
string newAudioPath = "content/audio/new_meeting.wav";
if (File.Exists(newAudioPath))
{
    var meta = new MetadataCollection();
    meta.Add("source_type", "audio");
    meta.Add("source_file", Path.GetFileName(newAudioPath));

    IngestIfNew(
        $"audio:{Path.GetFileNameWithoutExtension(newAudioPath)}",
        () =>
        {
            using var wav = new WaveFile(newAudioPath);
            return stt.Transcribe(wav).Text;
        },
        meta);
}

// Example: ingest a new scanned image
string newImagePath = "content/images/new_receipt.png";
if (File.Exists(newImagePath))
{
    var meta = new MetadataCollection();
    meta.Add("source_type", "image");
    meta.Add("source_file", Path.GetFileName(newImagePath));

    IngestIfNew(
        $"image:{Path.GetFileNameWithoutExtension(newImagePath)}",
        () => vlmOcr.Run(ImageBuffer.LoadAsRGB(newImagePath)).TextGeneration.Completion,
        meta);
}
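The two per-file examples generalize to a full directory sweep. A sketch that reuses the IngestIfNew helper above; the folder paths follow the layout in the Folder Structure section, and the file-matching patterns are illustrative:

```csharp
// Sweep the audio folder: transcribe and index any .wav not seen before.
foreach (string path in Directory.EnumerateFiles("content/audio", "*.wav"))
{
    var meta = new MetadataCollection();
    meta.Add("source_type", "audio");
    meta.Add("source_file", Path.GetFileName(path));

    IngestIfNew(
        $"audio:{Path.GetFileNameWithoutExtension(path)}",
        () =>
        {
            using var wav = new WaveFile(path);
            return stt.Transcribe(wav).Text;
        },
        meta);
}

// Sweep the images folder: OCR and index any .png/.jpg not seen before.
foreach (string path in Directory.EnumerateFiles("content/images")
             .Where(p => p.EndsWith(".png") || p.EndsWith(".jpg")))
{
    var meta = new MetadataCollection();
    meta.Add("source_type", "image");
    meta.Add("source_file", Path.GetFileName(path));

    IngestIfNew(
        $"image:{Path.GetFileNameWithoutExtension(path)}",
        () => vlmOcr.Run(ImageBuffer.LoadAsRGB(path)).TextGeneration.Completion,
        meta);
}
```

Running the sweep on a schedule (or from a file-system watcher) turns the one-shot ingestion into a continuously updated knowledge base, with HasSection guarding against duplicate work.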

Step 11: Filter Queries by Content Type

Use section identifier prefixes to search within a specific modality:

using System.Text;
using LMKit.Data;
using LMKit.Extraction.Ocr;
using LMKit.Graphics;
using LMKit.Media.Audio;
using LMKit.Media.Image;
using LMKit.Model;
using LMKit.Retrieval;
using LMKit.Speech;
using LMKit.TextGeneration;

LMKit.Licensing.LicenseManager.SetLicenseKey("");

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load all models
// ──────────────────────────────────────
Console.WriteLine("Loading embedding model...");
using LM embeddingModel = LM.LoadFromModelID("embeddinggemma-300m",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r  Downloading: {(double)read / len.Value * 100:F1}%   ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\r  Loading: {p * 100:F0}%   "); return true; });
Console.WriteLine("\n");

Console.WriteLine("Loading Whisper model (for audio)...");
using LM whisperModel = LM.LoadFromModelID("whisper-large-turbo3",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r  Downloading: {(double)read / len.Value * 100:F1}%   ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\r  Loading: {p * 100:F0}%   "); return true; });
Console.WriteLine("\n");

Console.WriteLine("Loading VLM OCR model (for images)...");
using LM ocrModel = LM.LoadFromModelID("lightonocr-2:1b",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r  Downloading: {(double)read / len.Value * 100:F1}%   ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\r  Loading: {p * 100:F0}%   "); return true; });
Console.WriteLine("\n");

Console.WriteLine("Loading chat model (for Q&A)...");
using LM chatModel = LM.LoadFromModelID("gemma3:4b",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r  Downloading: {(double)read / len.Value * 100:F1}%   ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\r  Loading: {p * 100:F0}%   "); return true; });
Console.WriteLine("\n");

// ──────────────────────────────────────
// 2. Create processing engines
// ──────────────────────────────────────
var stt = new SpeechToText(whisperModel)
{
    EnableVoiceActivityDetection = true,
    SuppressNonSpeechTokens = true,
    SuppressHallucinations = true
};

var vlmOcr = new VlmOcr(ocrModel)
{
    MaximumCompletionTokens = 4096
};

// ──────────────────────────────────────
// 3. Create a unified RAG engine
// ──────────────────────────────────────
var rag = new RagEngine(embeddingModel);

// Assumes content has already been ingested as in the earlier steps, using the
// "audio:", "image:", "scanned_pdf:", and "doc:" section-identifier prefixes.
string question = "What payment terms were agreed with the vendor?";

// Search only audio content
var audioMatches = rag.FindMatchingPartitions(question, topK: 10, minScore: 0.2f)
    .Where(m => m.SectionIdentifier.StartsWith("audio:"))
    .Take(5)
    .ToList();

// Search only image/scanned content
var imageMatches = rag.FindMatchingPartitions(question, topK: 10, minScore: 0.2f)
    .Where(m => m.SectionIdentifier.StartsWith("image:") || m.SectionIdentifier.StartsWith("scanned_pdf:"))
    .Take(5)
    .ToList();

// Search only text documents
var docMatches = rag.FindMatchingPartitions(question, topK: 10, minScore: 0.2f)
    .Where(m => m.SectionIdentifier.StartsWith("doc:"))
    .Take(5)
    .ToList();

Console.WriteLine($"Audio matches: {audioMatches.Count}");
Console.WriteLine($"Image matches: {imageMatches.Count}");
Console.WriteLine($"Document matches: {docMatches.Count}");
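The three near-identical queries above can be collapsed into one loop driven by a prefix map. A sketch against the same rag engine and question; the dictionary and its modality names are hypothetical conveniences, not part of the LM-Kit API:

```csharp
// Map each modality name to the section-identifier prefixes it covers.
var prefixesByModality = new Dictionary<string, string[]>
{
    ["audio"]    = new[] { "audio:" },
    ["image"]    = new[] { "image:", "scanned_pdf:" },
    ["document"] = new[] { "doc:" }
};

foreach (var (modality, prefixes) in prefixesByModality)
{
    // Over-fetch (topK: 10), then keep the best 5 matches in this modality.
    var matches = rag.FindMatchingPartitions(question, topK: 10, minScore: 0.2f)
        .Where(m => prefixes.Any(p => m.SectionIdentifier.StartsWith(p)))
        .Take(5)
        .ToList();

    Console.WriteLine($"{modality}: {matches.Count} match(es)");
}
```

Adding a new modality later (say, video keyframes under a "video:" prefix) then requires only a new dictionary entry rather than another copy of the query.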

Model Selection

Embedding Models

| Model ID | Size | Dimensions | Best For |
| --- | --- | --- | --- |
| embeddinggemma-300m | ~300 MB | 256 | General-purpose, fast, low memory (recommended) |
| qwen3-embedding:0.6b | ~600 MB | 1024 | Higher dimension, better recall for large collections |

Whisper Models (Audio Ingestion)

| Model ID | VRAM | Accuracy | Best For |
| --- | --- | --- | --- |
| whisper-large-turbo3 | ~870 MB | Best | Important recordings (recommended) |
| whisper-small | ~260 MB | Very good | High-volume audio archives |

VLM OCR Models (Image Ingestion)

| Model ID | VRAM | Speed | Best For |
| --- | --- | --- | --- |
| lightonocr-2:1b | ~2 GB | Fastest | Purpose-built OCR (recommended) |
| qwen3-vl:4b | ~4 GB | Fast | Multilingual scanned documents |

Chat Models (Q&A)

| Model ID | VRAM | Quality | Best For |
| --- | --- | --- | --- |
| gemma3:4b | ~3.5 GB | Good | Fast answers, batch queries |
| qwen3:8b | ~6 GB | Very good | Complex cross-modal questions |

Folder Structure

Organize your multimodal content library:

content/
├── audio/                  # Meeting recordings, interviews (.wav)
│   ├── standup-2025-02-03.wav
│   ├── client-call-acme.wav
│   └── training-session-1.wav
├── images/                 # Scanned receipts, whiteboard photos (.png, .jpg)
│   ├── receipt-2025-001.png
│   ├── whiteboard-architecture.jpg
│   └── handwritten-notes.png
├── scanned_pdfs/           # Scanned contracts, legacy archives (.pdf)
│   ├── contract-vendor-a.pdf
│   └── inspection-report-2024.pdf
└── documents/              # Digital documents with text layers (.pdf, .docx, .txt)
    ├── employee-handbook.pdf
    ├── project-spec-v2.docx
    └── meeting-minutes.md

Common Issues

| Problem | Cause | Fix |
| --- | --- | --- |
| Poor audio transcription quality | Noisy recording or wrong model | Use whisper-large-turbo3; set stt.Prompt with domain vocabulary |
| OCR text missing layout | lightonocr-2:1b output is flat for some documents | Use qwen3-vl:4b with a custom Instruction for complex layouts |
| Query returns wrong modality | All content is in one pool | Filter results by SectionIdentifier prefix (Step 11) |
| Duplicate content indexed | Same file ingested twice | Check HasSection before importing (Step 10) |
| High VRAM usage | All models loaded simultaneously | Load and dispose models sequentially; use a smaller Whisper model |
| Slow ingestion on large archives | VLM OCR is slow per page | Use lightonocr-2:1b for speed; process images in batches |
| Low retrieval quality | Chunk size too large or too small | Tune MaxChunkSize on the RagEngine; 256-512 is typical |
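The last row's chunk-size advice can be applied at engine construction time. A sketch, assuming MaxChunkSize is settable on RagEngine as the table suggests (the value 384 is an illustrative mid-range choice, not a benchmark result):

```csharp
using LMKit.Retrieval;

// Mid-range chunk size: large enough for context, small enough for precise
// retrieval. Tune within 256-512 against your own queries.
var rag = new RagEngine(embeddingModel)
{
    MaxChunkSize = 384
};
```

Smaller chunks make matches more precise but may split an answer across partitions; larger chunks keep context together but dilute the embedding. Re-index after changing the value, since existing partitions keep their original size.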

Next Steps
