Can LM-Kit.NET Process Images, PDFs, and Audio in One Application?


TL;DR

Yes. LM-Kit.NET supports text, vision (images), documents (PDF, DOCX, HTML, EML, XLSX, PPTX), speech (audio transcription), and embeddings (text and image vectors) all within a single SDK. You load different models for different modalities and combine them in the same application. No external services or separate libraries are needed.


Supported Modalities

| Modality | What It Does | Key Classes | Example Models |
|---|---|---|---|
| Text | Chat, generation, classification, extraction, translation | MultiTurnConversation, Agent, TextTranslation | qwen3.5:9b, gemma4:e4b |
| Vision | Image understanding, visual Q&A, image-based extraction | VisionImage, VLM-capable models | qwen3.5:9b, gemma4:e4b (built-in vision) |
| Documents | PDF text extraction, layout analysis, format conversion | PdfDocument, DocxDocument, HtmlDocument | N/A (native libraries, no model needed) |
| OCR | Text recognition from images and scanned documents | VlmOcr, LMKitOcr | paddleocr-vl:0.9b, glm-ocr, glm-4.6v-flash |
| Speech | Audio transcription with language detection | SpeechToText | whisper-small, whisper-large-turbo3 |
| Embeddings | Text and image vector representations for search and RAG | Embedder, RagEngine | embeddinggemma-300m, nomic-embed-vision |
| Image segmentation | Foreground/background separation | Segmentation pipeline | u2net |

Example: Multi-Modal Application

Here is how a single application might combine text, vision, speech, and embeddings:

using LMKit.Model;
using LMKit.Retrieval;
using LMKit.Speech;
using LMKit.TextGeneration;

// Load models for different modalities
using LM chatModel = LM.LoadFromModelID("qwen3.5:9b");        // Text + reasoning
using LM embeddingModel = LM.LoadFromModelID("embeddinggemma-300m"); // Vector search
using LM whisperModel = LM.LoadFromModelID("whisper-large-turbo3");  // Speech

// RAG: index documents and answer questions
var ragEngine = new RagEngine(embeddingModel);
ragEngine.ImportDocument("manual.pdf");

// Speech: transcribe audio
var stt = new SpeechToText(whisperModel);
var transcription = stt.Transcribe("meeting-recording.wav");

// Text: summarize the transcription using the chat model
var chat = new MultiTurnConversation(chatModel);
string summary = chat.Submit($"Summarize this meeting transcript:\n{transcription.Text}");

Vision and Image Analysis

Vision language models (VLMs) can analyze images, extract text from photos, describe visual content, and answer questions about what they see:

using LMKit.Model;
using LMKit.TextGeneration;

// Gemma 4 has built-in vision capabilities
using LM model = LM.LoadFromModelID("gemma4:e4b");

var chat = new MultiTurnConversation(model);
chat.AddImage("photo-of-receipt.jpg");

string result = chat.Submit("Extract the total amount and date from this receipt.");

Models with vision capabilities include the Qwen 2 VL, Qwen 3.5, Gemma 4, GLM-V 4.6 Flash, MiniCPM-V, and Pixtral families.


Document Processing

LM-Kit.NET includes native libraries for processing documents without requiring a language model:

| Format | Capabilities |
|---|---|
| PDF | Text extraction with layout preservation, page splitting, merging, attachment extraction, metadata |
| DOCX | Text and structure extraction |
| HTML | Parsing and text extraction |
| XLSX | Spreadsheet data extraction |
| PPTX | Presentation text extraction |
| EML / MBOX | Email archive processing |

For AI-powered document understanding (Q&A, summarization, classification), combine document extraction with a chat model or RAG pipeline.


OCR: Text from Images and Scans

Two OCR approaches are available:

  • VLM OCR (VlmOcr): Uses vision language models for high-accuracy recognition. Handles complex layouts, tables, and mathematical formulas.
  • LM-Kit OCR (LMKitOcr): High-throughput OCR engine with very high accuracy on business documents and advanced page layout handling.

Memory Planning for Multi-Modal Apps

Running multiple models requires planning your memory budget:

| Combination | Approximate Memory |
|---|---|
| Chat (8B) + Embeddings (300M) | ~5.5 GB |
| Chat (8B) + Embeddings (300M) + Whisper (turbo) | ~6.4 GB |
| Chat (8B) + Vision (separate VLM) + Embeddings | ~7 to 12 GB |
| Chat with built-in vision (Gemma 4 E4B) + Embeddings | ~6 GB |

Using a model with built-in vision (like Gemma 4) saves memory compared to loading a separate VLM.
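The figures above follow from a simple rule: resident memory is roughly parameter count times quantized weight width, plus a margin for the KV cache and runtime buffers. The sketch below shows the arithmetic; the 4.5 bits-per-weight figure and the overhead values are illustrative assumptions for typical 4-bit-quantized models, not LM-Kit measurements.

```python
def model_memory_gb(params_billions: float,
                    bits_per_weight: float = 4.5,
                    overhead_gb: float = 0.5) -> float:
    """Estimate resident memory for one model: quantized weights plus
    a flat margin for KV cache and runtime buffers."""
    weights_gb = params_billions * bits_per_weight / 8
    return weights_gb + overhead_gb

# Chat (8B) + embeddings (300M): lands near the ~5.5 GB row above
budget = model_memory_gb(8) + model_memory_gb(0.3, overhead_gb=0.1)
print(f"{budget:.1f} GB")
```

Swap `bits_per_weight` for the actual quantization in use (e.g. closer to 8.5 for 8-bit formats), and grow the overhead term with context length, since the KV cache scales with it.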

