# Can LM-Kit.NET Process Images, PDFs, and Audio in One Application?
## TL;DR
Yes. LM-Kit.NET supports text, vision (images), documents (PDF, DOCX, HTML, EML, XLSX, PPTX), speech (audio transcription), and embeddings (text and image vectors) all within a single SDK. You load different models for different modalities and combine them in the same application. No external services or separate libraries are needed.
## Supported Modalities
| Modality | What It Does | Key Classes | Example Models |
|---|---|---|---|
| Text | Chat, generation, classification, extraction, translation | MultiTurnConversation, Agent, TextTranslation | qwen3.5:9b, gemma4:e4b |
| Vision | Image understanding, visual Q&A, image-based extraction | VisionImage, VLM-capable models | qwen3.5:9b, gemma4:e4b (built-in vision) |
| Documents | PDF text extraction, layout analysis, format conversion | PdfDocument, DocxDocument, HtmlDocument | N/A (native libraries, no model needed) |
| OCR | Text recognition from images and scanned documents; LM-Kit OCR provides high throughput and very high accuracy on business documents | VlmOcr, LMKitOcr | paddleocr-vl:0.9b, glm-ocr, glm-4.6v-flash |
| Speech | Audio transcription with language detection | SpeechToText | whisper-small, whisper-large-turbo3 |
| Embeddings | Text and image vector representations for search and RAG | Embedder, RagEngine | embeddinggemma-300m, nomic-embed-vision |
| Image segmentation | Foreground/background separation | Segmentation pipeline | u2net |
## Example: Multi-Modal Application
Here is how a single application might combine text, vision, speech, and embeddings:
```csharp
using LMKit.Model;
using LMKit.Retrieval;
using LMKit.Speech;
using LMKit.TextGeneration;

// Load models for different modalities
using LM chatModel = LM.LoadFromModelID("qwen3.5:9b");               // Text + reasoning
using LM embeddingModel = LM.LoadFromModelID("embeddinggemma-300m"); // Vector search
using LM whisperModel = LM.LoadFromModelID("whisper-large-turbo3");  // Speech

// RAG: index documents and answer questions
var ragEngine = new RagEngine(embeddingModel);
ragEngine.ImportDocument("manual.pdf");

// Speech: transcribe audio
var stt = new SpeechToText(whisperModel);
var transcription = stt.Transcribe("meeting-recording.wav");

// Text: summarize the transcription using the chat model
var chat = new MultiTurnConversation(chatModel);
string summary = chat.Submit($"Summarize this meeting transcript:\n{transcription.Text}");
```
## Vision and Image Analysis
Vision language models (VLMs) can analyze images, extract text from photos, describe visual content, and answer questions about what they see:
```csharp
using LMKit.Model;
using LMKit.TextGeneration;

// Gemma 4 has built-in vision capabilities
using LM model = LM.LoadFromModelID("gemma4:e4b");

var chat = new MultiTurnConversation(model);
chat.AddImage("photo-of-receipt.jpg");
string result = chat.Submit("Extract the total amount and date from this receipt.");
```
Models with vision capabilities include the Qwen 2 VL, Qwen 3.5, Gemma 4, GLM-V 4.6 Flash, MiniCPM-V, and Pixtral families.
## Document Processing
LM-Kit.NET includes native libraries for processing documents without requiring a language model:
| Format | Capabilities |
|---|---|
| PDF | Text extraction with layout preservation, page splitting, merging, attachment extraction, metadata |
| DOCX | Text and structure extraction |
| HTML | Parsing and text extraction |
| XLSX | Spreadsheet data extraction |
| PPTX | Presentation text extraction |
| EML / MBOX | Email archive processing |
For AI-powered document understanding (Q&A, summarization, classification), combine document extraction with a chat model or RAG pipeline.
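As a sketch of that combination, the snippet below indexes a PDF with RagEngine (text extraction happens through the native document libraries) and then answers a question with a chat model. Note: the retrieval call `FindMatchingPartitions` and its parameters are assumptions here, since only `ImportDocument` appears in the example above; check the RagEngine API reference for the exact query method.

```csharp
using LMKit.Model;
using LMKit.Retrieval;
using LMKit.TextGeneration;

// Index the PDF: native extraction + embedding-model vectors
using LM embeddingModel = LM.LoadFromModelID("embeddinggemma-300m");
var ragEngine = new RagEngine(embeddingModel);
ragEngine.ImportDocument("manual.pdf");

// Retrieve passages relevant to the question
// (FindMatchingPartitions is an assumed method name -- verify against the API docs)
var matches = ragEngine.FindMatchingPartitions("How do I reset the device?", 3);

// Answer grounded in the retrieved text
using LM chatModel = LM.LoadFromModelID("qwen3.5:9b");
var chat = new MultiTurnConversation(chatModel);
string answer = chat.Submit(
    $"Answer from this context:\n{string.Join("\n---\n", matches)}\n\nQuestion: How do I reset the device?");
```

The same pattern works for summarization or classification: extract first, then prompt the chat model with the extracted text.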
## OCR: Text from Images and Scans
Two OCR approaches are available:
- VLM OCR (VlmOcr): uses vision language models for high-accuracy recognition; handles complex layouts, tables, and mathematical formulas.
- LM-Kit OCR (LMKitOcr): a high-throughput OCR engine with very high accuracy on business documents and advanced page-layout handling.
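A minimal sketch of the VLM OCR path follows. The VlmOcr constructor and the `Recognize` call are assumptions (only the class name appears in this article), so consult the OCR API reference for the actual signatures:

```csharp
using LMKit.Model;

// Load an OCR-capable vision model from the catalog
using LM ocrModel = LM.LoadFromModelID("paddleocr-vl:0.9b");

// Wrap the model for text recognition
// (constructor and Recognize signature are assumed, not verified API)
var ocr = new VlmOcr(ocrModel);
string text = ocr.Recognize("scanned-invoice.png");
Console.WriteLine(text);
```

Swap in LMKitOcr when throughput on business documents matters more than layout complexity.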
## Memory Planning for Multi-Modal Apps
Running multiple models requires planning your memory budget:
| Combination | Approximate Memory |
|---|---|
| Chat (8B) + Embeddings (300M) | ~5.5 GB |
| Chat (8B) + Embeddings (300M) + Whisper (turbo) | ~6.4 GB |
| Chat (8B) + Vision (separate VLM) + Embeddings | ~7 to 12 GB |
| Chat with built-in vision (Gemma 4 E4B) + Embeddings | ~6 GB |
Using a model with built-in vision (like Gemma 4) saves memory compared to loading a separate VLM.
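The table's figures can be approximated with a simple rule of thumb. The constants below are illustrative assumptions, not official numbers: roughly 0.6 GB per billion parameters for a 4-bit-quantized model, plus a fixed overhead for context buffers.

```csharp
using System;

// Rough memory estimate for quantized models: ~0.6 GB per billion
// parameters at 4-bit quantization, plus ~0.3 GB runtime overhead.
// These constants are assumptions for illustration; measure on real hardware.
static double EstimateGb(double billionParams) => billionParams * 0.6 + 0.3;

double chatGb = EstimateGb(8.0);    // 8B chat model       -> ~5.1 GB
double embedGb = EstimateGb(0.3);   // 300M embedder       -> ~0.5 GB
double speechGb = EstimateGb(0.8);  // whisper-turbo-class -> ~0.8 GB

Console.WriteLine($"Chat + embeddings: ~{chatGb + embedGb:F1} GB");
Console.WriteLine($"Chat + embeddings + speech: ~{chatGb + embedGb + speechGb:F1} GB");
```

Actual usage varies with quantization level and context length, which is why the table's numbers include some per-model headroom.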
## 📚 Related Content
- Can I run multiple AI models at the same time?: Memory management and parallel model loading.
- What languages can LM-Kit.NET models understand?: Multilingual support across all modalities.
- How do I choose the right model size for my hardware?: Select models by capability and memory requirements.
- Model Catalog: Browse models filtered by capability (vision, speech, embeddings, OCR).