Can LM-Kit.NET Process Images, PDFs, and Audio in One Application?


TL;DR

Yes. LM-Kit.NET supports text, vision (images), documents (PDF, DOCX, HTML, EML, XLSX, PPTX), speech (audio transcription), and embeddings (text and image vectors) all within a single SDK. You load different models for different modalities and combine them in the same application. No external services or separate libraries are needed.


Supported Modalities

| Modality | What It Does | Key Classes | Example Models |
|---|---|---|---|
| Text | Chat, generation, classification, extraction, translation | MultiTurnConversation, Agent, TextTranslation | qwen3.5:9b, gemma4:e4b |
| Vision | Image understanding, visual Q&A, image-based extraction | VisionImage, VLM-capable models | qwen3.5:9b, gemma4:e4b (built-in vision) |
| Documents | PDF text extraction, layout analysis, format conversion | PdfDocument, DocxDocument, HtmlDocument | N/A (native libraries, no model needed) |
| OCR | Text recognition from images and scanned documents | VlmOcr, LMKitOcr | paddleocr-vl:0.9b, glm-ocr, glm-4.6v-flash |
| Speech | Audio transcription with language detection | SpeechToText | whisper-small, whisper-large-turbo3 |
| Embeddings | Text and image vector representations for search and RAG | Embedder, RagEngine | embeddinggemma-300m, nomic-embed-vision |
| Image segmentation | Foreground/background separation | Segmentation pipeline | u2net |

Example: Multi-Modal Application

Here is how a single application might combine text, vision, speech, and embeddings:

using LMKit.Model;
using LMKit.Retrieval;
using LMKit.Speech;
using LMKit.TextGeneration;

// Load models for different modalities
using LM chatModel = LM.LoadFromModelID("qwen3.5:9b");        // Text + reasoning
using LM embeddingModel = LM.LoadFromModelID("embeddinggemma-300m"); // Vector search
using LM whisperModel = LM.LoadFromModelID("whisper-large-turbo3");  // Speech

// RAG: index documents and answer questions
var ragEngine = new RagEngine(embeddingModel);
ragEngine.ImportDocument("manual.pdf");

// Speech: transcribe audio
var stt = new SpeechToText(whisperModel);
var transcription = stt.Transcribe("meeting-recording.wav");

// Text: summarize the transcription using the chat model
var chat = new MultiTurnConversation(chatModel);
string summary = chat.Submit($"Summarize this meeting transcript:\n{transcription.Text}");

Vision and Image Analysis

Vision language models (VLMs) can analyze images, extract text from photos, describe visual content, and answer questions about what they see:

using LMKit.Model;
using LMKit.TextGeneration;

// Gemma 4 has built-in vision capabilities
using LM model = LM.LoadFromModelID("gemma4:e4b");

var chat = new MultiTurnConversation(model);
chat.AddImage("photo-of-receipt.jpg");

string result = chat.Submit("Extract the total amount and date from this receipt.");

Models with vision capabilities include the Qwen 2 VL, Qwen 3.5, Gemma 4, GLM-V 4.6 Flash, MiniCPM-V, and Pixtral families.


Document Processing

LM-Kit.NET includes native libraries for processing documents without requiring a language model:

| Format | Capabilities |
|---|---|
| PDF | Text extraction with layout preservation, page splitting, merging, attachment extraction, metadata |
| DOCX | Text and structure extraction |
| HTML | Parsing and text extraction |
| XLSX | Spreadsheet data extraction |
| PPTX | Presentation text extraction |
| EML / MBOX | Email archive processing |

For AI-powered document understanding (Q&A, summarization, classification), combine document extraction with a chat model or RAG pipeline.


OCR: Text from Images and Scans

Two OCR approaches are available:

  • VLM OCR (VlmOcr): Uses vision language models for high-accuracy recognition. Handles complex layouts, tables, and mathematical formulas.
  • LM-Kit OCR (LMKitOcr): High-throughput OCR engine with very high accuracy on business documents and advanced page layout handling.

Memory Planning for Multi-Modal Apps

Running multiple models requires planning your memory budget:

| Combination | Approximate Memory |
|---|---|
| Chat (8B) + Embeddings (300M) | ~5.5 GB |
| Chat (8B) + Embeddings (300M) + Whisper (turbo) | ~6.4 GB |
| Chat (8B) + Vision (separate VLM) + Embeddings | ~7 to 12 GB |
| Chat with built-in vision (Gemma 4 E4B) + Embeddings | ~6 GB |

Using a model with built-in vision (like Gemma 4) saves memory compared to loading a separate VLM.
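The figures above follow from a simple rule: resident memory is roughly parameter count times quantized weight width, plus a margin for the KV cache and runtime buffers. The sketch below shows the arithmetic; the 4.5 bits-per-weight figure and the overhead values are illustrative assumptions for typical 4-bit-quantized models, not LM-Kit measurements.

```python
def model_memory_gb(params_billions: float,
                    bits_per_weight: float = 4.5,
                    overhead_gb: float = 0.5) -> float:
    """Estimate resident memory for one model: quantized weights plus
    a flat margin for KV cache and runtime buffers."""
    weights_gb = params_billions * bits_per_weight / 8
    return weights_gb + overhead_gb

# Chat (8B) + embeddings (300M): lands near the ~5.5 GB row above
budget = model_memory_gb(8) + model_memory_gb(0.3, overhead_gb=0.1)
print(f"{budget:.1f} GB")
```

Swap `bits_per_weight` for the actual quantization in use (e.g. closer to 8.5 for 8-bit formats), and grow the overhead term with context length, since the KV cache scales with it.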

