Build a Unified Multimodal RAG System for Audio, Text, and Images
Enterprise knowledge lives in many formats: meeting recordings, scanned invoices, technical manuals, photos of whiteboards, and plain-text reports. Traditional RAG pipelines handle only text documents, leaving audio and image content unsearchable. LM-Kit.NET lets you build a single knowledge base that ingests all three modalities: audio is converted to text via speech-to-text, images are converted to Markdown via VLM OCR, and everything is embedded into one vector store that users query with a single question. This tutorial builds that system step by step, combining audio transcriptions, scanned-document text, and standard documents into one searchable knowledge base.
Why Multimodal RAG Matters
Two enterprise problems that a unified multimodal knowledge base solves:
- Cross-format institutional knowledge. An engineering firm accumulates project knowledge across meeting recordings, hand-drawn design sketches, scanned site inspection reports, and digital specifications. Engineers searching for information about a past project must check multiple systems. A unified RAG pipeline indexes all content types into one searchable store, so "What was the load-bearing capacity decision for Building C?" finds the answer whether it was spoken in a meeting, written in a spec, or captured in a scanned report.
- Compliance and audit readiness. Regulated organizations must demonstrate that they can locate any piece of evidence across all record types: recorded calls, faxed contracts, digital correspondence, and photographed receipts. A multimodal knowledge base provides a single search endpoint for compliance teams, replacing manual searches across disconnected archives.
Prerequisites
| Requirement | Minimum |
|---|---|
| .NET SDK | 8.0+ |
| VRAM | ~5 GB (embedding + Whisper + VLM OCR + chat models) |
| Disk | ~5 GB free for model downloads |
| Input formats | .wav audio, .pdf/.docx/.txt documents, .png/.jpg/.tiff images |
Step 1: Create the Project
dotnet new console -n MultimodalRag
cd MultimodalRag
dotnet add package LM-Kit.NET
Step 2: Understand the Architecture
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Audio files │ │ Scanned images │ │ Text documents │
│ (.wav) │ │ (.png, .jpg) │ │ (.pdf, .docx) │
└────────┬────────┘ └────────┬────────┘ └────────┬────────┘
│ │ │
▼ ▼ ▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ SpeechToText │ │ VlmOcr │ │ Direct text │
│ audio → text │ │ image → MD │ │ extraction │
└────────┬────────┘ └────────┬────────┘ └────────┬────────┘
│ │ │
└─────────────────────┼──────────────────────┘
│
▼
┌──────────────────────┐
│ RagEngine │
│ ImportText() │
│ (unified embedding) │
└──────────┬───────────┘
│
▼
┌──────────────────────┐
│ Vector Store │
│ (single index) │
│ │
│ Audio sections │
│ Image sections │
│ Document sections │
└──────────┬───────────┘
│
▼
Query across all content
The key insight: once audio and images are converted to text, they become regular text content that the embedding model can index and search. Metadata tags track the original source type so you can filter or attribute results.
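Every ingestion step below follows the same shape: convert to text, tag with metadata, import. As an illustrative sketch (NormalizeToText is a hypothetical helper for this tutorial, not part of the LM-Kit API; stt, vlmOcr, and the types come from the setup in Step 3):

```csharp
// Illustrative pattern only: every modality reduces to plain text
// before it reaches the embedding model.
string NormalizeToText(string path) => Path.GetExtension(path).ToLowerInvariant() switch
{
    ".wav"           => stt.Transcribe(new WaveFile(path)).Text,      // speech-to-text
    ".png" or ".jpg" => vlmOcr.Run(ImageBuffer.LoadAsRGB(path))
                              .TextGeneration.Completion,             // VLM OCR → Markdown
    _                => new Attachment(path).GetText()                // native text layer
};
// (In real code, dispose the WaveFile — see the `using` pattern in Step 4.)
```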
Step 3: Set Up the Multimodal Pipeline
using System.Text;
using LMKit.Data;
using LMKit.Extraction.Ocr;
using LMKit.Graphics;
using LMKit.Media.Audio;
using LMKit.Media.Image;
using LMKit.Model;
using LMKit.Retrieval;
using LMKit.Speech;
using LMKit.TextGeneration;
LMKit.Licensing.LicenseManager.SetLicenseKey("");
Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;
// ──────────────────────────────────────
// 1. Load all models
// ──────────────────────────────────────
// One helper loads each model with download/load progress reporting.
LM LoadModel(string label, string modelId)
{
    Console.WriteLine($"Loading {label}...");
    LM model = LM.LoadFromModelID(modelId,
        downloadingProgress: (_, len, read) =>
        {
            if (len.HasValue) Console.Write($"\r  Downloading: {(double)read / len.Value * 100:F1}%   ");
            return true;
        },
        loadingProgress: p => { Console.Write($"\r  Loading: {p * 100:F0}%   "); return true; });
    Console.WriteLine("\n");
    return model;
}
using LM embeddingModel = LoadModel("embedding model", "embeddinggemma-300m");
using LM whisperModel = LoadModel("Whisper model (for audio)", "whisper-large-turbo3");
using LM ocrModel = LoadModel("VLM OCR model (for images)", "lightonocr-2:1b");
using LM chatModel = LoadModel("chat model (for Q&A)", "gemma3:4b");
// ──────────────────────────────────────
// 2. Create processing engines
// ──────────────────────────────────────
var stt = new SpeechToText(whisperModel)
{
EnableVoiceActivityDetection = true,
SuppressNonSpeechTokens = true,
SuppressHallucinations = true
};
var vlmOcr = new VlmOcr(ocrModel)
{
MaximumCompletionTokens = 4096
};
// ──────────────────────────────────────
// 3. Create a unified RAG engine
// ──────────────────────────────────────
var rag = new RagEngine(embeddingModel);
string dataSourceId = "multimodal-knowledge-base";
Console.WriteLine("=== Multimodal Knowledge Base ===\n");
Step 4: Ingest Audio Files (Speech-to-Text)
Convert audio recordings to text and import into the knowledge base:
// Setup — usings, license key, console encoding, model loading, the
// SpeechToText / VlmOcr engines, and the RagEngine — is identical to Step 3.
// ──────────────────────────────────────
// 4. Ingest audio files
// ──────────────────────────────────────
string audioDir = "content/audio";
if (Directory.Exists(audioDir))
{
string[] audioFiles = Directory.GetFiles(audioDir, "*.wav");
Console.WriteLine($"Audio files: {audioFiles.Length}\n");
foreach (string audioPath in audioFiles)
{
string fileName = Path.GetFileNameWithoutExtension(audioPath);
Console.Write($" {Path.GetFileName(audioPath)}: transcribing... ");
try
{
using var audio = new WaveFile(audioPath);
var transcription = stt.Transcribe(audio);
// Tag with source metadata
var metadata = new MetadataCollection();
metadata.Add("source_type", "audio");
metadata.Add("source_file", Path.GetFileName(audioPath));
metadata.Add("duration_seconds", audio.Duration.TotalSeconds.ToString("F0"));
metadata.Add("segment_count", transcription.Segments.Count.ToString());
// Import transcribed text into the RAG engine
await rag.ImportTextAsync(
transcription.Text,
dataSourceId,
sectionIdentifier: $"audio:{fileName}",
additionalMetadata: metadata);
Console.ForegroundColor = ConsoleColor.Green;
Console.WriteLine($"done ({transcription.Text.Length} chars)");
Console.ResetColor();
}
catch (Exception ex)
{
Console.ForegroundColor = ConsoleColor.Red;
Console.WriteLine($"failed: {ex.Message}");
Console.ResetColor();
}
}
Console.WriteLine();
}
Step 5: Ingest Scanned Images (VLM OCR)
Convert images to Markdown and import into the knowledge base:
// Setup — usings, license key, console encoding, model loading, the
// SpeechToText / VlmOcr engines, and the RagEngine — is identical to Step 3.
// ──────────────────────────────────────
// 5. Ingest scanned images via VLM OCR
// ──────────────────────────────────────
string imageDir = "content/images";
string[] imageExtensions = { ".png", ".jpg", ".jpeg", ".tiff", ".bmp", ".webp" };
if (Directory.Exists(imageDir))
{
string[] imageFiles = Directory.GetFiles(imageDir)
.Where(f => imageExtensions.Contains(Path.GetExtension(f).ToLowerInvariant()))
.ToArray();
Console.WriteLine($"Image files: {imageFiles.Length}\n");
foreach (string imagePath in imageFiles)
{
string fileName = Path.GetFileNameWithoutExtension(imagePath);
Console.Write($" {Path.GetFileName(imagePath)}: OCR... ");
try
{
var image = ImageBuffer.LoadAsRGB(imagePath);
// Convert image to Markdown using VLM OCR
VlmOcr.VlmOcrResult ocrResult = vlmOcr.Run(image);
string markdownText = ocrResult.TextGeneration.Completion;
// Tag with source metadata
var metadata = new MetadataCollection();
metadata.Add("source_type", "image");
metadata.Add("source_file", Path.GetFileName(imagePath));
metadata.Add("ocr_tokens", ocrResult.TextGeneration.GeneratedTokenCount.ToString());
// Import OCR text into the RAG engine
await rag.ImportTextAsync(
markdownText,
dataSourceId,
sectionIdentifier: $"image:{fileName}",
additionalMetadata: metadata);
Console.ForegroundColor = ConsoleColor.Green;
Console.WriteLine($"done ({markdownText.Length} chars)");
Console.ResetColor();
}
catch (Exception ex)
{
Console.ForegroundColor = ConsoleColor.Red;
Console.WriteLine($"failed: {ex.Message}");
Console.ResetColor();
}
}
Console.WriteLine();
}
Step 6: Ingest Scanned PDFs (Multi-Page VLM OCR)
For scanned PDFs that contain no text layer, OCR each page and import the combined Markdown:
// Setup — usings, license key, console encoding, model loading, the
// SpeechToText / VlmOcr engines, and the RagEngine — is identical to Step 3.
// ──────────────────────────────────────
// 6. Ingest scanned PDFs via VLM OCR
// ──────────────────────────────────────
string scannedPdfDir = "content/scanned_pdfs";
if (Directory.Exists(scannedPdfDir))
{
string[] scannedPdfs = Directory.GetFiles(scannedPdfDir, "*.pdf");
Console.WriteLine($"Scanned PDFs: {scannedPdfs.Length}\n");
foreach (string pdfPath in scannedPdfs)
{
string fileName = Path.GetFileNameWithoutExtension(pdfPath);
Console.Write($" {Path.GetFileName(pdfPath)}: ");
try
{
var attachment = new Attachment(pdfPath);
var fullMarkdown = new StringBuilder();
for (int page = 0; page < attachment.PageCount; page++)
{
VlmOcr.VlmOcrResult pageResult = vlmOcr.Run(attachment, pageIndex: page);
fullMarkdown.AppendLine(pageResult.TextGeneration.Completion);
fullMarkdown.AppendLine();
}
var metadata = new MetadataCollection();
metadata.Add("source_type", "scanned_pdf");
metadata.Add("source_file", Path.GetFileName(pdfPath));
metadata.Add("page_count", attachment.PageCount.ToString());
await rag.ImportTextAsync(
fullMarkdown.ToString(),
dataSourceId,
sectionIdentifier: $"scanned_pdf:{fileName}",
additionalMetadata: metadata);
Console.ForegroundColor = ConsoleColor.Green;
Console.WriteLine($"done ({attachment.PageCount} pages)");
Console.ResetColor();
}
catch (Exception ex)
{
Console.ForegroundColor = ConsoleColor.Red;
Console.WriteLine($"failed: {ex.Message}");
Console.ResetColor();
}
}
Console.WriteLine();
}
Step 7: Ingest Text Documents (Direct)
Text documents with native text layers (PDFs, Word, TXT, HTML) are imported directly:
// Setup — usings, license key, console encoding, model loading, the
// SpeechToText / VlmOcr engines, and the RagEngine — is identical to Step 3.
// ──────────────────────────────────────
// 7. Ingest text documents directly
// ──────────────────────────────────────
string docsDir = "content/documents";
string[] docExtensions = { ".pdf", ".docx", ".txt", ".md", ".html" };
if (Directory.Exists(docsDir))
{
string[] docFiles = Directory.GetFiles(docsDir)
.Where(f => docExtensions.Contains(Path.GetExtension(f).ToLowerInvariant()))
.ToArray();
Console.WriteLine($"Text documents: {docFiles.Length}\n");
foreach (string docPath in docFiles)
{
string fileName = Path.GetFileNameWithoutExtension(docPath);
Console.Write($" {Path.GetFileName(docPath)}: indexing... ");
try
{
// Extract text from the document
var attachment = new Attachment(docPath);
string text = attachment.GetText();
var metadata = new MetadataCollection();
metadata.Add("source_type", "document");
metadata.Add("source_file", Path.GetFileName(docPath));
metadata.Add("file_type", Path.GetExtension(docPath).TrimStart('.'));
await rag.ImportTextAsync(
text,
dataSourceId,
sectionIdentifier: $"doc:{fileName}",
additionalMetadata: metadata);
Console.ForegroundColor = ConsoleColor.Green;
Console.WriteLine($"done ({text.Length} chars)");
Console.ResetColor();
}
catch (Exception ex)
{
Console.ForegroundColor = ConsoleColor.Red;
Console.WriteLine($"failed: {ex.Message}");
Console.ResetColor();
}
}
Console.WriteLine();
}
Step 8: Query Across All Content Types
Search the unified knowledge base with natural language queries:
// Setup — usings, license key, console encoding, model loading, the
// SpeechToText / VlmOcr engines, and the RagEngine — is identical to Step 3,
// plus one additional using for the chat types:
using LMKit.TextGeneration.Chat;
// ──────────────────────────────────────
// 8. Query the unified knowledge base
// ──────────────────────────────────────
var chat = new SingleTurnConversation(chatModel)
{
    SystemPrompt = "Answer the question using only the provided context. " +
                   "Mention the source type (audio recording, scanned document, or text document) " +
                   "when citing information. If the context does not contain the answer, say so.",
    MaximumCompletionTokens = 512
};
// Stream tokens as they are generated. Subscribe once, before the loop,
// so a new handler is not added (and output duplicated) on every question.
chat.AfterTextCompletion += (_, e) =>
{
    if (e.SegmentType == TextSegmentType.UserVisible)
        Console.Write(e.Text);
};
Console.WriteLine("Ask questions across all your content (or 'quit' to exit):\n");
while (true)
{
    Console.ForegroundColor = ConsoleColor.Green;
    Console.Write("Question: ");
    Console.ResetColor();
    string? question = Console.ReadLine();
    if (string.IsNullOrWhiteSpace(question) || question.Equals("quit", StringComparison.OrdinalIgnoreCase))
        break;
    // Find matching partitions across all content types
    var matches = rag.FindMatchingPartitions(question, topK: 5, minScore: 0.25f);
    if (matches.Count == 0)
    {
        Console.WriteLine("No relevant content found.\n");
        continue;
    }
    // Show which sources matched
    Console.ForegroundColor = ConsoleColor.DarkGray;
    foreach (var m in matches)
    {
        string sourceType = "unknown";
        if (m.SectionIdentifier.StartsWith("audio:")) sourceType = "audio";
        else if (m.SectionIdentifier.StartsWith("image:")) sourceType = "image";
        else if (m.SectionIdentifier.StartsWith("scanned_pdf:")) sourceType = "scanned PDF";
        else if (m.SectionIdentifier.StartsWith("doc:")) sourceType = "document";
        Console.WriteLine($"  [{sourceType}] {m.SectionIdentifier} (score={m.Similarity:F3})");
    }
    Console.ResetColor();
    // Build context from matched partitions
    var context = new StringBuilder();
    foreach (var m in matches)
    {
        context.AppendLine($"[Source: {m.SectionIdentifier}]");
        context.AppendLine(m.Payload);
        context.AppendLine();
    }
    // Generate answer (streamed by the handler subscribed above)
    Console.ForegroundColor = ConsoleColor.Cyan;
    Console.Write("\nAnswer: ");
    Console.ResetColor();
    string prompt = $"Context:\n{context}\n\nQuestion: {question}";
    chat.Submit(prompt);
    Console.WriteLine("\n");
}
Step 9: Persist the Knowledge Base
Use FileSystemVectorStore to save the multimodal index to disk so it survives application restarts:
using LMKit.Data;
using LMKit.Data.Storage;
using LMKit.Retrieval;
// Assumes embeddingModel, rag, and dataSourceId from the earlier steps are still in scope.
// ──────────────────────────────────────
// Persistent multimodal knowledge base
// ──────────────────────────────────────
string storageDir = "multimodal_store";
Directory.CreateDirectory(storageDir);
var vectorStore = new FileSystemVectorStore(storageDir);
var persistentRag = new RagEngine(embeddingModel, vectorStore);
string persistentDataSourceId = "multimodal-kb";
// Check what is already indexed
Console.WriteLine($"Vector store entries: {vectorStore.Count}");
// Import new content (already-indexed sections are skipped by checking HasSection)
DataSource? ds = rag.DataSources.FirstOrDefault(d => d.Identifier == dataSourceId);
if (ds != null)
{
foreach (var section in ds.Sections)
{
if (!persistentRag.DataSources.Any(d => d.HasSection(section.Identifier)))
{
Console.WriteLine($" Indexing: {section.Identifier}");
// Re-import text for this section
// (In a production system, you would store the extracted text alongside the embeddings)
}
}
}
Console.WriteLine($"\nPersistent store: {Path.GetFullPath(storageDir)}");
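The placeholder comment in the loop above notes that a production system would store the extracted text alongside the embeddings. One minimal way to do that, using only standard .NET IO (the sidecar-file layout is this tutorial's assumption, not an LM-Kit convention), is a per-section text cache:

```csharp
// Sidecar text cache: one .txt file per section, keyed by a sanitized section id.
string cacheDir = Path.Combine(storageDir, "extracted_text");
Directory.CreateDirectory(cacheDir);

string CachePath(string sectionId) =>
    Path.Combine(cacheDir, string.Concat(sectionId.Select(c =>
        char.IsLetterOrDigit(c) ? c : '_')) + ".txt");

// Call after each transcription/OCR pass so re-indexing never repeats the
// expensive extraction step.
void CacheText(string sectionId, string text) =>
    File.WriteAllText(CachePath(sectionId), text);

bool TryGetCachedText(string sectionId, out string text)
{
    string path = CachePath(sectionId);
    if (File.Exists(path)) { text = File.ReadAllText(path); return true; }
    text = string.Empty;
    return false;
}
```

With the cache in place, re-populating a fresh vector store only re-embeds cached text instead of re-running Whisper or the VLM OCR model.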
Step 10: Incremental Ingestion with Source Tracking
Build an ingestion loop that only processes new files:
// Setup — usings, license key, console encoding, model loading, the
// SpeechToText / VlmOcr engines, and the RagEngine — is identical to Step 3.
// ──────────────────────────────────────
// Incremental ingestion: skip already-indexed files
// ──────────────────────────────────────
Console.WriteLine("\n=== Incremental Ingestion ===\n");
DataSource? existingDs = rag.DataSources.FirstOrDefault(d => d.Identifier == dataSourceId);
void IngestIfNew(string sectionId, Func<string> textExtractor, MetadataCollection metadata)
{
if (existingDs != null && existingDs.HasSection(sectionId))
{
Console.ForegroundColor = ConsoleColor.DarkGray;
Console.WriteLine($" {sectionId}: already indexed (skipped)");
Console.ResetColor();
return;
}
Console.Write($" {sectionId}: indexing... ");
try
{
string text = textExtractor();
rag.ImportText(text, dataSourceId, sectionId, metadata);
Console.ForegroundColor = ConsoleColor.Green;
Console.WriteLine("done");
Console.ResetColor();
}
catch (Exception ex)
{
Console.ForegroundColor = ConsoleColor.Red;
Console.WriteLine($"failed: {ex.Message}");
Console.ResetColor();
}
}
// Example: ingest a new audio file
string newAudioPath = "content/audio/new_meeting.wav";
if (File.Exists(newAudioPath))
{
var meta = new MetadataCollection();
meta.Add("source_type", "audio");
meta.Add("source_file", Path.GetFileName(newAudioPath));
IngestIfNew(
$"audio:{Path.GetFileNameWithoutExtension(newAudioPath)}",
() =>
{
using var wav = new WaveFile(newAudioPath);
return stt.Transcribe(wav).Text;
},
meta);
}
// Example: ingest a new scanned image
string newImagePath = "content/images/new_receipt.png";
if (File.Exists(newImagePath))
{
var meta = new MetadataCollection();
meta.Add("source_type", "image");
meta.Add("source_file", Path.GetFileName(newImagePath));
IngestIfNew(
$"image:{Path.GetFileNameWithoutExtension(newImagePath)}",
() => vlmOcr.Run(ImageBuffer.LoadAsRGB(newImagePath)).TextGeneration.Completion,
meta);
}
Step 11: Filter Queries by Content Type
Use section identifier prefixes to search within a specific modality:
using System.Text;
using LMKit.Data;
using LMKit.Extraction.Ocr;
using LMKit.Graphics;
using LMKit.Media.Audio;
using LMKit.Media.Image;
using LMKit.Model;
using LMKit.Retrieval;
using LMKit.Speech;
using LMKit.TextGeneration;
LMKit.Licensing.LicenseManager.SetLicenseKey("");
Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;
// ──────────────────────────────────────
// 1. Load all models
// ──────────────────────────────────────
Console.WriteLine("Loading embedding model...");
using LM embeddingModel = LM.LoadFromModelID("embeddinggemma-300m",
downloadingProgress: (_, len, read) =>
{
if (len.HasValue) Console.Write($"\r Downloading: {(double)read / len.Value * 100:F1}% ");
return true;
},
loadingProgress: p => { Console.Write($"\r Loading: {p * 100:F0}% "); return true; });
Console.WriteLine("\n");
Console.WriteLine("Loading Whisper model (for audio)...");
using LM whisperModel = LM.LoadFromModelID("whisper-large-turbo3",
downloadingProgress: (_, len, read) =>
{
if (len.HasValue) Console.Write($"\r Downloading: {(double)read / len.Value * 100:F1}% ");
return true;
},
loadingProgress: p => { Console.Write($"\r Loading: {p * 100:F0}% "); return true; });
Console.WriteLine("\n");
Console.WriteLine("Loading VLM OCR model (for images)...");
using LM ocrModel = LM.LoadFromModelID("lightonocr-2:1b",
downloadingProgress: (_, len, read) =>
{
if (len.HasValue) Console.Write($"\r Downloading: {(double)read / len.Value * 100:F1}% ");
return true;
},
loadingProgress: p => { Console.Write($"\r Loading: {p * 100:F0}% "); return true; });
Console.WriteLine("\n");
Console.WriteLine("Loading chat model (for Q&A)...");
using LM chatModel = LM.LoadFromModelID("gemma3:4b",
downloadingProgress: (_, len, read) =>
{
if (len.HasValue) Console.Write($"\r Downloading: {(double)read / len.Value * 100:F1}% ");
return true;
},
loadingProgress: p => { Console.Write($"\r Loading: {p * 100:F0}% "); return true; });
Console.WriteLine("\n");
// ──────────────────────────────────────
// 2. Create processing engines
// ──────────────────────────────────────
var stt = new SpeechToText(whisperModel)
{
EnableVoiceActivityDetection = true,
SuppressNonSpeechTokens = true,
SuppressHallucinations = true
};
var vlmOcr = new VlmOcr(ocrModel)
{
MaximumCompletionTokens = 4096
};
// ──────────────────────────────────────
// 3. Create a unified RAG engine
// ──────────────────────────────────────
var rag = new RagEngine(embeddingModel);
// The question to filter by modality (example)
string question = "What decisions were made about the vendor contract?";
// Search only audio content
var audioMatches = rag.FindMatchingPartitions(question, topK: 10, minScore: 0.2f)
.Where(m => m.SectionIdentifier.StartsWith("audio:"))
.Take(5)
.ToList();
// Search only image/scanned content
var imageMatches = rag.FindMatchingPartitions(question, topK: 10, minScore: 0.2f)
.Where(m => m.SectionIdentifier.StartsWith("image:") || m.SectionIdentifier.StartsWith("scanned_pdf:"))
.Take(5)
.ToList();
// Search only text documents
var docMatches = rag.FindMatchingPartitions(question, topK: 10, minScore: 0.2f)
.Where(m => m.SectionIdentifier.StartsWith("doc:"))
.Take(5)
.ToList();
Console.WriteLine($"Audio matches: {audioMatches.Count}");
Console.WriteLine($"Image matches: {imageMatches.Count}");
Console.WriteLine($"Document matches: {docMatches.Count}");
Model Selection
Embedding Models
| Model ID | Size | Dimensions | Best For |
|---|---|---|---|
| `embeddinggemma-300m` | ~300 MB | 256 | General-purpose, fast, low memory (recommended) |
| `qwen3-embedding:0.6b` | ~600 MB | 1024 | Higher dimension, better recall for large collections |
Whisper Models (Audio Ingestion)
| Model ID | VRAM | Accuracy | Best For |
|---|---|---|---|
| `whisper-large-turbo3` | ~870 MB | Best | Important recordings (recommended) |
| `whisper-small` | ~260 MB | Very good | High-volume audio archives |
VLM OCR Models (Image Ingestion)
| Model ID | VRAM | Speed | Best For |
|---|---|---|---|
| `lightonocr-2:1b` | ~2 GB | Fastest | Purpose-built OCR (recommended) |
| `qwen3-vl:4b` | ~4 GB | Fast | Multilingual scanned documents |
Chat Models (Q&A)
| Model ID | VRAM | Quality | Best For |
|---|---|---|---|
| `gemma3:4b` | ~3.5 GB | Good | Fast answers, batch queries |
| `qwen3:8b` | ~6 GB | Very good | Complex cross-modal questions |
Folder Structure
Organize your multimodal content library:
content/
├── audio/ # Meeting recordings, interviews (.wav)
│ ├── standup-2025-02-03.wav
│ ├── client-call-acme.wav
│ └── training-session-1.wav
├── images/ # Scanned receipts, whiteboard photos (.png, .jpg)
│ ├── receipt-2025-001.png
│ ├── whiteboard-architecture.jpg
│ └── handwritten-notes.png
├── scanned_pdfs/ # Scanned contracts, legacy archives (.pdf)
│ ├── contract-vendor-a.pdf
│ └── inspection-report-2024.pdf
└── documents/ # Digital documents with text layers (.pdf, .docx, .txt)
├── employee-handbook.pdf
├── project-spec-v2.docx
└── meeting-minutes.md
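The tree above maps cleanly onto a dispatch loop. As a sketch (assuming the `stt`, `vlmOcr`, `rag`, and `dataSourceId` values from steps 2-3 are in scope, and reusing the `ImportText` overload shown earlier), each file can be routed to the right ingestion path by extension:

```csharp
// Sketch: walk content/ and dispatch each file by extension. The prefix
// convention ("audio:", "image:", "doc:") matches the filters in Step 11.
foreach (string path in Directory.EnumerateFiles("content", "*", SearchOption.AllDirectories))
{
    string name = Path.GetFileNameWithoutExtension(path);
    var meta = new MetadataCollection();
    meta.Add("source_file", Path.GetFileName(path));

    switch (Path.GetExtension(path).ToLowerInvariant())
    {
        case ".wav":
        {
            using var wav = new WaveFile(path);
            rag.ImportText(stt.Transcribe(wav).Text, dataSourceId, $"audio:{name}", meta);
            break;
        }
        case ".png":
        case ".jpg":
        case ".tiff":
            rag.ImportText(vlmOcr.Run(ImageBuffer.LoadAsRGB(path)).TextGeneration.Completion,
                dataSourceId, $"image:{name}", meta);
            break;
        case ".txt":
        case ".md":
            rag.ImportText(File.ReadAllText(path), dataSourceId, $"doc:{name}", meta);
            break;
        // .pdf/.docx go through the document-ingestion step of this tutorial.
    }
}
```

Combine this loop with the `IngestIfNew` check from Step 10 to make re-runs incremental.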
Common Issues
| Problem | Cause | Fix |
|---|---|---|
| Audio transcription quality poor | Noisy recording or wrong model | Use `whisper-large-turbo3`; set `stt.Prompt` with domain vocabulary |
| OCR text missing layout | `lightonocr-2:1b` output is flat for some documents | Use `qwen3-vl:4b` with a custom `Instruction` for complex layouts |
| Query returns wrong modality | All content is in one pool | Filter results by `SectionIdentifier` prefix (Step 11) |
| Duplicate content indexed | Same file ingested twice | Check `HasSection` before importing (Step 10) |
| Large VRAM usage | All models loaded simultaneously | Load and dispose models sequentially; use a smaller Whisper model |
| Slow ingestion on large archives | VLM OCR is slow per page | Use `lightonocr-2:1b` for speed; process images in batches |
| Low retrieval quality | Chunk size too large or too small | Tune `MaxChunkSize` on the `RagEngine`; 256-512 is typical |
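For the VRAM row above, one mitigation is to scope each ingestion model to its own `using` block so its memory is released before the next model loads. A minimal sketch, assuming the same model IDs used in this tutorial (`IngestAudio` and `IngestImages` are hypothetical helpers wrapping the transcription and OCR loops shown earlier):

```csharp
// Sketch: load ingestion models sequentially instead of all at once.
using LM embeddingModel = LM.LoadFromModelID("embeddinggemma-300m");
var rag = new RagEngine(embeddingModel);

using (LM whisperModel = LM.LoadFromModelID("whisper-large-turbo3"))
{
    var stt = new SpeechToText(whisperModel);
    IngestAudio(rag, stt);       // hypothetical helper: .wav files -> ImportText
}                                // Whisper VRAM released here

using (LM ocrModel = LM.LoadFromModelID("lightonocr-2:1b"))
{
    var vlmOcr = new VlmOcr(ocrModel);
    IngestImages(rag, vlmOcr);   // hypothetical helper: images -> ImportText
}                                // VLM OCR VRAM released here

// Only now load the chat model for Q&A; the embedding model stays
// resident because queries must be embedded at search time.
using LM chatModel = LM.LoadFromModelID("gemma3:4b");
```

This keeps peak VRAM near the largest single model plus the embedding model, at the cost of not being able to re-ingest audio or images after their model is disposed.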
Next Steps
- Build a RAG Pipeline Over Your Own Documents: foundational RAG with text documents.
- Boost Retrieval with Hybrid Search: combine vector and BM25 search for broader recall across multimodal content.
- Build Conversational RAG with RagChat: add multi-turn conversation on top of your multimodal knowledge base.
- Improve Recall with Multi-Query and HyDE Retrieval: expand queries to find relevant passages across audio, image, and text.
- Improve RAG Results with Reranking: add a cross-encoder reranker to boost retrieval precision.
- Optimize RAG with Custom Chunking Strategies: tailor `TextChunking`, `MarkdownChunking`, or `HtmlChunking` to each content type.
- Build a Persistent Document Knowledge Base with Vector Storage: disk-backed storage for large collections.
- Transcribe Audio with Local Speech-to-Text: foundational audio transcription.
- Process Scanned Documents with OCR and Vision Models: OCR engine selection and custom providers.
- Convert Documents to Markdown with VLM OCR: VLM OCR for document conversion.
- Samples: Conversational RAG: multi-turn RAG with `RagChat`.