Build a Unified Multimodal RAG System for Audio, Text, and Images
Enterprise knowledge lives in many formats: meeting recordings, scanned invoices, technical manuals, photos of whiteboards, and plain-text reports. Traditional RAG pipelines handle only text documents, leaving audio and image content unsearchable. LM-Kit.NET lets you build a single knowledge base that ingests all three modalities: audio is converted to text via speech-to-text, images are converted to Markdown via VLM OCR, and everything is embedded into one vector store that users query with a single question. This tutorial builds that system step by step, combining audio transcriptions, scanned-document text, and standard documents into one searchable knowledge base.
Why Multimodal RAG Matters
Two enterprise problems that a unified multimodal knowledge base solves:
- Cross-format institutional knowledge. An engineering firm accumulates project knowledge across meeting recordings, hand-drawn design sketches, scanned site inspection reports, and digital specifications. Engineers searching for information about a past project must check multiple systems. A unified RAG pipeline indexes all content types into one searchable store, so "What was the load-bearing capacity decision for Building C?" finds the answer whether it was spoken in a meeting, written in a spec, or captured in a scanned report.
- Compliance and audit readiness. Regulated organizations must demonstrate that they can locate any piece of evidence across all record types: recorded calls, faxed contracts, digital correspondence, and photographed receipts. A multimodal knowledge base provides a single search endpoint for compliance teams, replacing manual searches across disconnected archives.
Prerequisites
| Requirement | Minimum |
|---|---|
| .NET SDK | 8.0+ |
| VRAM | ~5 GB (embedding + Whisper + VLM OCR + chat models) |
| Disk | ~5 GB free for model downloads |
| Input formats | .wav audio, .pdf/.docx/.txt documents, .png/.jpg/.tiff images |
Step 1: Create the Project
dotnet new console -n MultimodalRag
cd MultimodalRag
dotnet add package LM-Kit.NET
Step 2: Understand the Architecture
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Audio files │ │ Scanned images │ │ Text documents │
│ (.wav) │ │ (.png, .jpg) │ │ (.pdf, .docx) │
└────────┬────────┘ └────────┬────────┘ └────────┬────────┘
│ │ │
▼ ▼ ▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ SpeechToText │ │ VlmOcr │ │ Direct text │
│ audio → text │ │ image → MD │ │ extraction │
└────────┬────────┘ └────────┬────────┘ └────────┬────────┘
│ │ │
└─────────────────────┼──────────────────────┘
│
▼
┌──────────────────────┐
│ RagEngine │
│ ImportText() │
│ (unified embedding) │
└──────────┬───────────┘
│
▼
┌──────────────────────┐
│ Vector Store │
│ (single index) │
│ │
│ Audio sections │
│ Image sections │
│ Document sections │
└──────────┬───────────┘
│
▼
Query across all content
The key insight: once audio and images are converted to text, they become regular text content that the embedding model can index and search. Metadata tags track the original source type so you can filter or attribute results.
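Every ingestion step below follows the same shape: convert to text, tag with metadata, import. As an illustrative sketch (NormalizeToText is a hypothetical helper for this tutorial, not part of the LM-Kit API; stt, vlmOcr, and the types come from the setup in Step 3):

```csharp
// Illustrative pattern only: every modality reduces to plain text
// before it reaches the embedding model.
string NormalizeToText(string path) => Path.GetExtension(path).ToLowerInvariant() switch
{
    ".wav"           => stt.Transcribe(new WaveFile(path)).Text,      // speech-to-text
    ".png" or ".jpg" => vlmOcr.Run(ImageBuffer.LoadAsRGB(path))
                              .TextGeneration.Completion,             // VLM OCR → Markdown
    _                => new Attachment(path).GetText()                // native text layer
};
// (In real code, dispose the WaveFile — see the `using` pattern in Step 4.)
```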
Step 3: Set Up the Multimodal Pipeline
using System.Text;
using LMKit.Data;
using LMKit.Extraction.Ocr;
using LMKit.Graphics;
using LMKit.Media.Audio;
using LMKit.Media.Image;
using LMKit.Model;
using LMKit.Retrieval;
using LMKit.Speech;
using LMKit.TextGeneration;
LMKit.Licensing.LicenseManager.SetLicenseKey("");
Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;
// ──────────────────────────────────────
// 1. Load all models
// ──────────────────────────────────────
// One helper loads each model with download/load progress reporting.
LM LoadModel(string label, string modelId)
{
    Console.WriteLine($"Loading {label}...");
    LM model = LM.LoadFromModelID(modelId,
        downloadingProgress: (_, len, read) =>
        {
            if (len.HasValue) Console.Write($"\r  Downloading: {(double)read / len.Value * 100:F1}%   ");
            return true;
        },
        loadingProgress: p => { Console.Write($"\r  Loading: {p * 100:F0}%   "); return true; });
    Console.WriteLine("\n");
    return model;
}
using LM embeddingModel = LoadModel("embedding model", "embeddinggemma-300m");
using LM whisperModel = LoadModel("Whisper model (for audio)", "whisper-large-turbo3");
using LM ocrModel = LoadModel("VLM OCR model (for images)", "lightonocr-2:1b");
using LM chatModel = LoadModel("chat model (for Q&A)", "gemma3:4b");
// ──────────────────────────────────────
// 2. Create processing engines
// ──────────────────────────────────────
var stt = new SpeechToText(whisperModel)
{
EnableVoiceActivityDetection = true,
SuppressNonSpeechTokens = true,
SuppressHallucinations = true
};
var vlmOcr = new VlmOcr(ocrModel)
{
MaximumCompletionTokens = 4096
};
// ──────────────────────────────────────
// 3. Create a unified RAG engine
// ──────────────────────────────────────
var rag = new RagEngine(embeddingModel);
string dataSourceId = "multimodal-knowledge-base";
Console.WriteLine("=== Multimodal Knowledge Base ===\n");
Step 4: Ingest Audio Files (Speech-to-Text)
Convert audio recordings to text and import into the knowledge base:
// Setup — usings, license key, console encoding, model loading, the
// SpeechToText / VlmOcr engines, and the RagEngine — is identical to Step 3.
// ──────────────────────────────────────
// 4. Ingest audio files
// ──────────────────────────────────────
string audioDir = "content/audio";
if (Directory.Exists(audioDir))
{
string[] audioFiles = Directory.GetFiles(audioDir, "*.wav");
Console.WriteLine($"Audio files: {audioFiles.Length}\n");
foreach (string audioPath in audioFiles)
{
string fileName = Path.GetFileNameWithoutExtension(audioPath);
Console.Write($" {Path.GetFileName(audioPath)}: transcribing... ");
try
{
using var audio = new WaveFile(audioPath);
var transcription = stt.Transcribe(audio);
// Tag with source metadata
var metadata = new MetadataCollection();
metadata.Add("source_type", "audio");
metadata.Add("source_file", Path.GetFileName(audioPath));
metadata.Add("duration_seconds", audio.Duration.TotalSeconds.ToString("F0"));
metadata.Add("segment_count", transcription.Segments.Count.ToString());
// Import transcribed text into the RAG engine
await rag.ImportTextAsync(
transcription.Text,
dataSourceId,
sectionIdentifier: $"audio:{fileName}",
additionalMetadata: metadata);
Console.ForegroundColor = ConsoleColor.Green;
Console.WriteLine($"done ({transcription.Text.Length} chars)");
Console.ResetColor();
}
catch (Exception ex)
{
Console.ForegroundColor = ConsoleColor.Red;
Console.WriteLine($"failed: {ex.Message}");
Console.ResetColor();
}
}
Console.WriteLine();
}
Step 5: Ingest Scanned Images (VLM OCR)
Convert images to Markdown and import into the knowledge base:
// Setup — usings, license key, console encoding, model loading, the
// SpeechToText / VlmOcr engines, and the RagEngine — is identical to Step 3.
// ──────────────────────────────────────
// 5. Ingest scanned images via VLM OCR
// ──────────────────────────────────────
string imageDir = "content/images";
string[] imageExtensions = { ".png", ".jpg", ".jpeg", ".tiff", ".bmp", ".webp" };
if (Directory.Exists(imageDir))
{
string[] imageFiles = Directory.GetFiles(imageDir)
.Where(f => imageExtensions.Contains(Path.GetExtension(f).ToLowerInvariant()))
.ToArray();
Console.WriteLine($"Image files: {imageFiles.Length}\n");
foreach (string imagePath in imageFiles)
{
string fileName = Path.GetFileNameWithoutExtension(imagePath);
Console.Write($" {Path.GetFileName(imagePath)}: OCR... ");
try
{
var image = ImageBuffer.LoadAsRGB(imagePath);
// Convert image to Markdown using VLM OCR
VlmOcr.VlmOcrResult ocrResult = vlmOcr.Run(image);
string markdownText = ocrResult.TextGeneration.Completion;
// Tag with source metadata
var metadata = new MetadataCollection();
metadata.Add("source_type", "image");
metadata.Add("source_file", Path.GetFileName(imagePath));
metadata.Add("ocr_tokens", ocrResult.TextGeneration.GeneratedTokenCount.ToString());
// Import OCR text into the RAG engine
await rag.ImportTextAsync(
markdownText,
dataSourceId,
sectionIdentifier: $"image:{fileName}",
additionalMetadata: metadata);
Console.ForegroundColor = ConsoleColor.Green;
Console.WriteLine($"done ({markdownText.Length} chars)");
Console.ResetColor();
}
catch (Exception ex)
{
Console.ForegroundColor = ConsoleColor.Red;
Console.WriteLine($"failed: {ex.Message}");
Console.ResetColor();
}
}
Console.WriteLine();
}
Step 6: Ingest Scanned PDFs (Multi-Page VLM OCR)
For scanned PDFs that contain no text layer, OCR each page and import the combined Markdown:
// Setup — usings, license key, console encoding, model loading, the
// SpeechToText / VlmOcr engines, and the RagEngine — is identical to Step 3.
// ──────────────────────────────────────
// 6. Ingest scanned PDFs via VLM OCR
// ──────────────────────────────────────
string scannedPdfDir = "content/scanned_pdfs";
if (Directory.Exists(scannedPdfDir))
{
string[] scannedPdfs = Directory.GetFiles(scannedPdfDir, "*.pdf");
Console.WriteLine($"Scanned PDFs: {scannedPdfs.Length}\n");
foreach (string pdfPath in scannedPdfs)
{
string fileName = Path.GetFileNameWithoutExtension(pdfPath);
Console.Write($" {Path.GetFileName(pdfPath)}: ");
try
{
var attachment = new Attachment(pdfPath);
var fullMarkdown = new StringBuilder();
for (int page = 0; page < attachment.PageCount; page++)
{
VlmOcr.VlmOcrResult pageResult = vlmOcr.Run(attachment, pageIndex: page);
fullMarkdown.AppendLine(pageResult.TextGeneration.Completion);
fullMarkdown.AppendLine();
}
var metadata = new MetadataCollection();
metadata.Add("source_type", "scanned_pdf");
metadata.Add("source_file", Path.GetFileName(pdfPath));
metadata.Add("page_count", attachment.PageCount.ToString());
await rag.ImportTextAsync(
fullMarkdown.ToString(),
dataSourceId,
sectionIdentifier: $"scanned_pdf:{fileName}",
additionalMetadata: metadata);
Console.ForegroundColor = ConsoleColor.Green;
Console.WriteLine($"done ({attachment.PageCount} pages)");
Console.ResetColor();
}
catch (Exception ex)
{
Console.ForegroundColor = ConsoleColor.Red;
Console.WriteLine($"failed: {ex.Message}");
Console.ResetColor();
}
}
Console.WriteLine();
}
Step 7: Ingest Text Documents (Direct)
Text documents with native text layers (PDFs, Word, TXT, HTML) are imported directly:
// Setup — usings, license key, console encoding, model loading, the
// SpeechToText / VlmOcr engines, and the RagEngine — is identical to Step 3.
// ──────────────────────────────────────
// 7. Ingest text documents directly
// ──────────────────────────────────────
string docsDir = "content/documents";
string[] docExtensions = { ".pdf", ".docx", ".txt", ".md", ".html" };
if (Directory.Exists(docsDir))
{
string[] docFiles = Directory.GetFiles(docsDir)
.Where(f => docExtensions.Contains(Path.GetExtension(f).ToLowerInvariant()))
.ToArray();
Console.WriteLine($"Text documents: {docFiles.Length}\n");
foreach (string docPath in docFiles)
{
string fileName = Path.GetFileNameWithoutExtension(docPath);
Console.Write($" {Path.GetFileName(docPath)}: indexing... ");
try
{
// Extract text from the document
var attachment = new Attachment(docPath);
string text = attachment.GetText();
var metadata = new MetadataCollection();
metadata.Add("source_type", "document");
metadata.Add("source_file", Path.GetFileName(docPath));
metadata.Add("file_type", Path.GetExtension(docPath).TrimStart('.'));
await rag.ImportTextAsync(
text,
dataSourceId,
sectionIdentifier: $"doc:{fileName}",
additionalMetadata: metadata);
Console.ForegroundColor = ConsoleColor.Green;
Console.WriteLine($"done ({text.Length} chars)");
Console.ResetColor();
}
catch (Exception ex)
{
Console.ForegroundColor = ConsoleColor.Red;
Console.WriteLine($"failed: {ex.Message}");
Console.ResetColor();
}
}
Console.WriteLine();
}
Step 8: Query Across All Content Types
Search the unified knowledge base with natural language queries:
// Setup — usings, license key, console encoding, model loading, the
// SpeechToText / VlmOcr engines, and the RagEngine — is identical to Step 3,
// plus one additional using for the chat types:
using LMKit.TextGeneration.Chat;
// ──────────────────────────────────────
// 8. Query the unified knowledge base
// ──────────────────────────────────────
var chat = new SingleTurnConversation(chatModel)
{
    SystemPrompt = "Answer the question using only the provided context. " +
                   "Mention the source type (audio recording, scanned document, or text document) " +
                   "when citing information. If the context does not contain the answer, say so.",
    MaximumCompletionTokens = 512
};
// Stream tokens as they are generated. Subscribe once, before the loop,
// so a new handler is not added (and output duplicated) on every question.
chat.AfterTextCompletion += (_, e) =>
{
    if (e.SegmentType == TextSegmentType.UserVisible)
        Console.Write(e.Text);
};
Console.WriteLine("Ask questions across all your content (or 'quit' to exit):\n");
while (true)
{
    Console.ForegroundColor = ConsoleColor.Green;
    Console.Write("Question: ");
    Console.ResetColor();
    string? question = Console.ReadLine();
    if (string.IsNullOrWhiteSpace(question) || question.Equals("quit", StringComparison.OrdinalIgnoreCase))
        break;
    // Find matching partitions across all content types
    var matches = rag.FindMatchingPartitions(question, topK: 5, minScore: 0.25f);
    if (matches.Count == 0)
    {
        Console.WriteLine("No relevant content found.\n");
        continue;
    }
    // Show which sources matched
    Console.ForegroundColor = ConsoleColor.DarkGray;
    foreach (var m in matches)
    {
        string sourceType = "unknown";
        if (m.SectionIdentifier.StartsWith("audio:")) sourceType = "audio";
        else if (m.SectionIdentifier.StartsWith("image:")) sourceType = "image";
        else if (m.SectionIdentifier.StartsWith("scanned_pdf:")) sourceType = "scanned PDF";
        else if (m.SectionIdentifier.StartsWith("doc:")) sourceType = "document";
        Console.WriteLine($"  [{sourceType}] {m.SectionIdentifier} (score={m.Similarity:F3})");
    }
    Console.ResetColor();
    // Build context from matched partitions
    var context = new StringBuilder();
    foreach (var m in matches)
    {
        context.AppendLine($"[Source: {m.SectionIdentifier}]");
        context.AppendLine(m.Payload);
        context.AppendLine();
    }
    // Generate answer (streamed by the handler subscribed above)
    Console.ForegroundColor = ConsoleColor.Cyan;
    Console.Write("\nAnswer: ");
    Console.ResetColor();
    string prompt = $"Context:\n{context}\n\nQuestion: {question}";
    chat.Submit(prompt);
    Console.WriteLine("\n");
}
Step 9: Persist the Knowledge Base
Use FileSystemVectorStore to save the multimodal index to disk so it survives application restarts:
using LMKit.Data;
using LMKit.Data.Storage;
using LMKit.Retrieval;
// Assumes embeddingModel, rag, and dataSourceId from the earlier steps are still in scope.
// ──────────────────────────────────────
// Persistent multimodal knowledge base
// ──────────────────────────────────────
string storageDir = "multimodal_store";
Directory.CreateDirectory(storageDir);
var vectorStore = new FileSystemVectorStore(storageDir);
var persistentRag = new RagEngine(embeddingModel, vectorStore);
string persistentDataSourceId = "multimodal-kb";
// Check what is already indexed
Console.WriteLine($"Vector store entries: {vectorStore.Count}");
// Import new content (already-indexed sections are skipped by checking HasSection)
DataSource? ds = rag.DataSources.FirstOrDefault(d => d.Identifier == dataSourceId);
if (ds != null)
{
foreach (var section in ds.Sections)
{
if (!persistentRag.DataSources.Any(d => d.HasSection(section.Identifier)))
{
Console.WriteLine($" Indexing: {section.Identifier}");
// Re-import text for this section
// (In a production system, you would store the extracted text alongside the embeddings)
}
}
}
Console.WriteLine($"\nPersistent store: {Path.GetFullPath(storageDir)}");
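The placeholder comment in the loop above notes that a production system would store the extracted text alongside the embeddings. One minimal way to do that, using only standard .NET IO (the sidecar-file layout is this tutorial's assumption, not an LM-Kit convention), is a per-section text cache:

```csharp
// Sidecar text cache: one .txt file per section, keyed by a sanitized section id.
string cacheDir = Path.Combine(storageDir, "extracted_text");
Directory.CreateDirectory(cacheDir);

string CachePath(string sectionId) =>
    Path.Combine(cacheDir, string.Concat(sectionId.Select(c =>
        char.IsLetterOrDigit(c) ? c : '_')) + ".txt");

// Call after each transcription/OCR pass so re-indexing never repeats the
// expensive extraction step.
void CacheText(string sectionId, string text) =>
    File.WriteAllText(CachePath(sectionId), text);

bool TryGetCachedText(string sectionId, out string text)
{
    string path = CachePath(sectionId);
    if (File.Exists(path)) { text = File.ReadAllText(path); return true; }
    text = string.Empty;
    return false;
}
```

With the cache in place, re-populating a fresh vector store only re-embeds cached text instead of re-running Whisper or the VLM OCR model.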
Step 10: Incremental Ingestion with Source Tracking
Build an ingestion loop that only processes new files:
// Setup — usings, license key, console encoding, model loading, the
// SpeechToText / VlmOcr engines, and the RagEngine — is identical to Step 3.
// ──────────────────────────────────────
// Incremental ingestion: skip already-indexed files
// ──────────────────────────────────────
Console.WriteLine("\n=== Incremental Ingestion ===\n");
DataSource? existingDs = rag.DataSources.FirstOrDefault(d => d.Identifier == dataSourceId);
void IngestIfNew(string sectionId, Func<string> textExtractor, MetadataCollection metadata)
{
if (existingDs != null && existingDs.HasSection(sectionId))
{
Console.ForegroundColor = ConsoleColor.DarkGray;
Console.WriteLine($" {sectionId}: already indexed (skipped)");
Console.ResetColor();
return;
}
Console.Write($" {sectionId}: indexing... ");
try
{
string text = textExtractor();
rag.ImportText(text, dataSourceId, sectionId, metadata);
Console.ForegroundColor = ConsoleColor.Green;
Console.WriteLine("done");
Console.ResetColor();
}
catch (Exception ex)
{
Console.ForegroundColor = ConsoleColor.Red;
Console.WriteLine($"failed: {ex.Message}");
Console.ResetColor();
}
}
// Example: ingest a new audio file
string newAudioPath = "content/audio/new_meeting.wav";
if (File.Exists(newAudioPath))
{
var meta = new MetadataCollection();
meta.Add("source_type", "audio");
meta.Add("source_file", Path.GetFileName(newAudioPath));
IngestIfNew(
$"audio:{Path.GetFileNameWithoutExtension(newAudioPath)}",
() =>
{
using var wav = new WaveFile(newAudioPath);
return stt.Transcribe(wav).Text;
},
meta);
}
// Example: ingest a new scanned image
string newImagePath = "content/images/new_receipt.png";
if (File.Exists(newImagePath))
{
var meta = new MetadataCollection();
meta.Add("source_type", "image");
meta.Add("source_file", Path.GetFileName(newImagePath));
IngestIfNew(
$"image:{Path.GetFileNameWithoutExtension(newImagePath)}",
() => vlmOcr.Run(ImageBuffer.LoadAsRGB(newImagePath)).TextGeneration.Completion,
meta);
}
Step 11: Filter Queries by Content Type
Use section identifier prefixes to search within a specific modality:
using System.Text;
using LMKit.Data;
using LMKit.Extraction.Ocr;
using LMKit.Graphics;
using LMKit.Media.Audio;
using LMKit.Media.Image;
using LMKit.Model;
using LMKit.Retrieval;
using LMKit.Speech;
using LMKit.TextGeneration;
LMKit.Licensing.LicenseManager.SetLicenseKey("");
Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;
// ──────────────────────────────────────
// 1. Load all models
// ──────────────────────────────────────
Console.WriteLine("Loading embedding model...");
using LM embeddingModel = LM.LoadFromModelID("embeddinggemma-300m",
downloadingProgress: (_, len, read) =>
{
if (len.HasValue) Console.Write($"\r Downloading: {(double)read / len.Value * 100:F1}% ");
return true;
},
loadingProgress: p => { Console.Write($"\r Loading: {p * 100:F0}% "); return true; });
Console.WriteLine("\n");
Console.WriteLine("Loading Whisper model (for audio)...");
using LM whisperModel = LM.LoadFromModelID("whisper-large-turbo3",
downloadingProgress: (_, len, read) =>
{
if (len.HasValue) Console.Write($"\r Downloading: {(double)read / len.Value * 100:F1}% ");
return true;
},
loadingProgress: p => { Console.Write($"\r Loading: {p * 100:F0}% "); return true; });
Console.WriteLine("\n");
Console.WriteLine("Loading VLM OCR model (for images)...");
using LM ocrModel = LM.LoadFromModelID("lightonocr-2:1b",
downloadingProgress: (_, len, read) =>
{
if (len.HasValue) Console.Write($"\r Downloading: {(double)read / len.Value * 100:F1}% ");
return true;
},
loadingProgress: p => { Console.Write($"\r Loading: {p * 100:F0}% "); return true; });
Console.WriteLine("\n");
Console.WriteLine("Loading chat model (for Q&A)...");
using LM chatModel = LM.LoadFromModelID("gemma3:4b",
downloadingProgress: (_, len, read) =>
{
if (len.HasValue) Console.Write($"\r Downloading: {(double)read / len.Value * 100:F1}% ");
return true;
},
loadingProgress: p => { Console.Write($"\r Loading: {p * 100:F0}% "); return true; });
Console.WriteLine("\n");
// ──────────────────────────────────────
// 2. Create processing engines
// ──────────────────────────────────────
var stt = new SpeechToText(whisperModel)
{
EnableVoiceActivityDetection = true,
SuppressNonSpeechTokens = true,
SuppressHallucinations = true
};
var vlmOcr = new VlmOcr(ocrModel)
{
MaximumCompletionTokens = 4096
};
// ──────────────────────────────────────
// 3. Create a unified RAG engine
// ──────────────────────────────────────
var rag = new RagEngine(embeddingModel);
// The question to filter by modality (example)
string question = "What decisions were made about the vendor contract?";
// Search only audio content
var audioMatches = rag.FindMatchingPartitions(question, topK: 10, minScore: 0.2f)
.Where(m => m.SectionIdentifier.StartsWith("audio:"))
.Take(5)
.ToList();
// Search only image/scanned content
var imageMatches = rag.FindMatchingPartitions(question, topK: 10, minScore: 0.2f)
.Where(m => m.SectionIdentifier.StartsWith("image:") || m.SectionIdentifier.StartsWith("scanned_pdf:"))
.Take(5)
.ToList();
// Search only text documents
var docMatches = rag.FindMatchingPartitions(question, topK: 10, minScore: 0.2f)
.Where(m => m.SectionIdentifier.StartsWith("doc:"))
.Take(5)
.ToList();
Console.WriteLine($"Audio matches: {audioMatches.Count}");
Console.WriteLine($"Image matches: {imageMatches.Count}");
Console.WriteLine($"Document matches: {docMatches.Count}");
Model Selection
Embedding Models
| Model ID | Size | Dimensions | Best For |
|---|---|---|---|
| `embeddinggemma-300m` | ~300 MB | 256 | General-purpose, fast, low memory (recommended) |
| `qwen3-embedding:0.6b` | ~600 MB | 1024 | Higher dimension, better recall for large collections |
Whisper Models (Audio Ingestion)
| Model ID | VRAM | Accuracy | Best For |
|---|---|---|---|
| `whisper-large-turbo3` | ~870 MB | Best | Important recordings (recommended) |
| `whisper-small` | ~260 MB | Very good | High-volume audio archives |
VLM OCR Models (Image Ingestion)
| Model ID | VRAM | Speed | Best For |
|---|---|---|---|
| `lightonocr-2:1b` | ~2 GB | Fastest | Purpose-built OCR (recommended) |
| `qwen3-vl:4b` | ~4 GB | Fast | Multilingual scanned documents |
Chat Models (Q&A)
| Model ID | VRAM | Quality | Best For |
|---|---|---|---|
| `gemma3:4b` | ~3.5 GB | Good | Fast answers, batch queries |
| `qwen3:8b` | ~6 GB | Very good | Complex cross-modal questions |
Folder Structure
Organize your multimodal content library:
content/
├── audio/ # Meeting recordings, interviews (.wav)
│ ├── standup-2025-02-03.wav
│ ├── client-call-acme.wav
│ └── training-session-1.wav
├── images/ # Scanned receipts, whiteboard photos (.png, .jpg)
│ ├── receipt-2025-001.png
│ ├── whiteboard-architecture.jpg
│ └── handwritten-notes.png
├── scanned_pdfs/ # Scanned contracts, legacy archives (.pdf)
│ ├── contract-vendor-a.pdf
│ └── inspection-report-2024.pdf
└── documents/ # Digital documents with text layers (.pdf, .docx, .txt)
├── employee-handbook.pdf
├── project-spec-v2.docx
└── meeting-minutes.md
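The tree above maps cleanly onto a dispatch loop. As a sketch (assuming the `stt`, `vlmOcr`, `rag`, and `dataSourceId` values from steps 2-3 are in scope, and reusing the `ImportText` overload shown earlier), each file can be routed to the right ingestion path by extension:

```csharp
// Sketch: walk content/ and dispatch each file by extension. The prefix
// convention ("audio:", "image:", "doc:") matches the filters in Step 11.
foreach (string path in Directory.EnumerateFiles("content", "*", SearchOption.AllDirectories))
{
    string name = Path.GetFileNameWithoutExtension(path);
    var meta = new MetadataCollection();
    meta.Add("source_file", Path.GetFileName(path));

    switch (Path.GetExtension(path).ToLowerInvariant())
    {
        case ".wav":
        {
            using var wav = new WaveFile(path);
            rag.ImportText(stt.Transcribe(wav).Text, dataSourceId, $"audio:{name}", meta);
            break;
        }
        case ".png":
        case ".jpg":
        case ".tiff":
            rag.ImportText(vlmOcr.Run(ImageBuffer.LoadAsRGB(path)).TextGeneration.Completion,
                dataSourceId, $"image:{name}", meta);
            break;
        case ".txt":
        case ".md":
            rag.ImportText(File.ReadAllText(path), dataSourceId, $"doc:{name}", meta);
            break;
        // .pdf/.docx go through the document-ingestion step of this tutorial.
    }
}
```

Combine this loop with the `IngestIfNew` check from Step 10 to make re-runs incremental.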
Common Issues
| Problem | Cause | Fix |
|---|---|---|
| Audio transcription quality poor | Noisy recording or wrong model | Use `whisper-large-turbo3`; set `stt.Prompt` with domain vocabulary |
| OCR text missing layout | `lightonocr-2:1b` output is flat for some documents | Use `qwen3-vl:4b` with a custom `Instruction` for complex layouts |
| Query returns wrong modality | All content is in one pool | Filter results by `SectionIdentifier` prefix (Step 11) |
| Duplicate content indexed | Same file ingested twice | Check `HasSection` before importing (Step 10) |
| Large VRAM usage | All models loaded simultaneously | Load and dispose models sequentially; use a smaller Whisper model |
| Slow ingestion on large archives | VLM OCR is slow per page | Use `lightonocr-2:1b` for speed; process images in batches |
| Low retrieval quality | Chunk size too large or too small | Tune `MaxChunkSize` on the `RagEngine`; 256-512 is typical |
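For the VRAM row above, one mitigation is to scope each ingestion model to its own `using` block so its memory is released before the next model loads. A minimal sketch, assuming the same model IDs used in this tutorial (`IngestAudio` and `IngestImages` are hypothetical helpers wrapping the transcription and OCR loops shown earlier):

```csharp
// Sketch: load ingestion models sequentially instead of all at once.
using LM embeddingModel = LM.LoadFromModelID("embeddinggemma-300m");
var rag = new RagEngine(embeddingModel);

using (LM whisperModel = LM.LoadFromModelID("whisper-large-turbo3"))
{
    var stt = new SpeechToText(whisperModel);
    IngestAudio(rag, stt);       // hypothetical helper: .wav files -> ImportText
}                                // Whisper VRAM released here

using (LM ocrModel = LM.LoadFromModelID("lightonocr-2:1b"))
{
    var vlmOcr = new VlmOcr(ocrModel);
    IngestImages(rag, vlmOcr);   // hypothetical helper: images -> ImportText
}                                // VLM OCR VRAM released here

// Only now load the chat model for Q&A; the embedding model stays
// resident because queries must be embedded at search time.
using LM chatModel = LM.LoadFromModelID("gemma3:4b");
```

This keeps peak VRAM near the largest single model plus the embedding model, at the cost of not being able to re-ingest audio or images after their model is disposed.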
Next Steps
- Build a RAG Pipeline Over Your Own Documents: foundational RAG with text documents.
- Boost Retrieval with Hybrid Search: combine vector and BM25 search for broader recall across multimodal content.
- Build Conversational RAG with RagChat: add multi-turn conversation on top of your multimodal knowledge base.
- Improve Recall with Multi-Query and HyDE Retrieval: expand queries to find relevant passages across audio, image, and text.
- Improve RAG Results with Reranking: add a cross-encoder reranker to boost retrieval precision.
- Optimize RAG with Custom Chunking Strategies: tailor `TextChunking`, `MarkdownChunking`, or `HtmlChunking` to each content type.
- Build a Persistent Document Knowledge Base with Vector Storage: disk-backed storage for large collections.
- Transcribe Audio with Local Speech-to-Text: foundational audio transcription.
- Process Scanned Documents with OCR and Vision Models: OCR engine selection and custom providers.
- Convert Documents to Markdown with VLM OCR: VLM OCR for document conversion.
- Samples: Conversational RAG: multi-turn RAG with `RagChat`.