Create a Multi-Turn Chatbot with Vision in .NET Applications


🎯 Purpose of the Sample

The Multi-Turn Chat with Vision demo showcases the LM-Kit.NET SDK's support for visual attachments in a multi-turn conversational flow. This sample shows how to integrate both Large Language Models (LLMs) and Small Language Models (SLMs) into a .NET application. By supporting models of various sizes, you can run the chatbot on devices ranging from powerful servers to smaller edge devices. The demo produces image-driven insights while maintaining text-based conversational context across multiple exchanges.


👥 Industry Target Audience

This sample is particularly relevant for developers and businesses in the following domains:

  • 🖼️ Visual Content Analysis: Leverage AI to describe and interpret images, useful for social media, content moderation, or creative applications.
  • 🏥 Healthcare: Aid diagnostic or telemedicine scenarios by explaining medical images (under appropriate privacy and regulatory conditions).
  • 🏪 Retail/E-commerce: Automate product tagging and generate product descriptions based on images.
  • 🛡️ Security: Implement intelligent threat detection by interpreting surveillance images or identifying objects in real time.

🚀 Problem Solved

Conventional text-based chatbots cannot process or respond to image inputs, and running large models demands significant computational resources. By incorporating both LLM and SLM support with vision capabilities, this demo enables:

  1. Context-Aware Dialogue: Continues multi-turn conversations even after processing an image.
  2. Image Interpretation: Utilizes specialized AI models that can analyze and describe images.
  3. Flexible Model Deployment: Run the application on varied hardware—from powerful servers to resource-constrained edge devices—by choosing the appropriate model size (LLM or SLM).
  4. Seamless Text & Image Handling: Creates a unified chatbot experience where users can submit text or attach images for more nuanced responses.

💻 Sample Application Description

Multi-Turn Chat with Vision is a console application that prompts users to select a vision-capable model (LLM or SLM), upload an image, and engage in multi-turn dialogue with the AI assistant. The chatbot retains conversation context while adding insights from the attached image.

✨ Key Features

  • 🖼️ Image Attachment: Users can attach an image, which the chatbot then describes or analyzes.
  • 📈 Model Selection: Choose from predefined vision-enabled models or supply a custom model URI—whether it’s an LLM or an SLM.
  • ⏳ Progress Tracking: Displays model download and loading progress, ensuring transparency in the setup phase.
  • 🔄 Multi-Turn Interaction: Maintains conversation context across multiple turns for coherent and relevant responses.
  • 📝 Special Commands: Provides commands to reset the conversation, continue, or regenerate the last response.
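Put together, these features amount to a short console loop: load a model, optionally attach an image, then keep submitting turns against the retained history. The sketch below is a minimal, hypothetical outline of such a loop; the class and member names (`LM`, `MultiTurnConversation`, `Attachment`, `Submit`, `Completion`) are assumptions modeled on LM-Kit.NET conventions, so check the SDK reference for the exact signatures.

```csharp
using System;
using LMKit.Model;          // assumed namespace for LM (model loading)
using LMKit.TextGeneration; // assumed namespace for MultiTurnConversation

class Program
{
    static void Main()
    {
        // Load a vision-capable model from a URI or a local path (assumed API).
        var model = new LM("file:///path/to/vision-model.gguf");

        // MultiTurnConversation keeps the full chat history between turns,
        // which is what makes the dialogue context-aware.
        var chat = new MultiTurnConversation(model);

        Console.Write("Image path (optional): ");
        string imagePath = Console.ReadLine();

        while (true)
        {
            Console.Write("> ");
            string input = Console.ReadLine();
            if (string.IsNullOrWhiteSpace(input)) break;

            // Attach the image on the first turn only; later turns rely on the
            // retained context (Attachment and this overload are assumptions).
            var result = string.IsNullOrEmpty(imagePath)
                ? chat.Submit(input)
                : chat.Submit(input, new LMKit.Data.Attachment(imagePath));
            imagePath = null;

            Console.WriteLine(result.Completion);
        }
    }
}
```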

🧠 Supported Models

This sample supports both LLMs and SLMs with vision capability:

  • MiniCPM 2.6 Vision 8.1B (requires ~6.8 GB VRAM)
  • Alibaba Qwen 2 Vision 2.2B (requires ~3 GB VRAM)
  • Alibaba Qwen 2 Vision 8.3B (requires ~7.3 GB VRAM)

You can also specify a custom model URI for other compatible LLMs or SLMs with vision support. Depending on the hardware resources available, you can choose the model best suited for your environment.

🛠️ Commands

  • /reset: Clears the conversation history and starts a new session.
  • /continue: Continues the last assistant response (useful if the assistant’s output is cut off or incomplete).
  • /regenerate: Generates a new assistant response for the last user input.
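Inside the console loop, these commands can be intercepted before the input reaches the model. A minimal sketch, assuming the conversation object exposes history-management methods; `ClearHistory`, `ContinueLastAssistantResponse`, and `RegenerateResponse` are illustrative names rather than the SDK's confirmed API:

```csharp
using System;
using LMKit.TextGeneration; // assumed namespace for MultiTurnConversation

static class Commands
{
    // Returns true when the input was a special command and was handled,
    // so the caller knows not to submit it as a normal chat turn.
    public static bool TryHandle(string input, MultiTurnConversation chat)
    {
        switch (input.Trim().ToLowerInvariant())
        {
            case "/reset":
                chat.ClearHistory();   // drop all context, start a new session
                return true;
            case "/continue":
                // Resume a truncated assistant response (assumed method name).
                Console.WriteLine(chat.ContinueLastAssistantResponse().Completion);
                return true;
            case "/regenerate":
                // Produce a fresh answer to the last user input (assumed name).
                Console.WriteLine(chat.RegenerateResponse().Completion);
                return true;
            default:
                return false;          // plain text: submit as a normal turn
        }
    }
}
```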

🛠️ Getting Started

📋 Prerequisites

  • .NET Framework 4.6.2 or .NET 6.0

📥 Download the Project

  1. Clone the LM-Kit.NET samples repository or download the multi_turn_chat_with_vision source code:
    git clone https://github.com/LM-Kit/lm-kit-net-samples.git
    
  2. Navigate to the multi_turn_chat_with_vision directory:
    cd lm-kit-net-samples/console_net6/multi_turn_chat_with_vision
    

Project Link: multi_turn_chat_with_vision

▶️ Running the Application

  1. 🔨 Build the application:

    dotnet build
    
  2. ▶️ Run the application:

    dotnet run
    
  3. 🔍 Follow the on-screen prompts:

    • Select one of the pre-configured vision models or provide a custom model URI (LLM or SLM).
    • Enter the path to the image file you want to analyze.
    • Engage in multi-turn conversation, leveraging the visual context from your attached image.

By following these steps, you can explore how LM-Kit.NET handles both text and image inputs in a context-aware conversation, enabling advanced use cases that combine visual understanding with multi-turn text-based dialogue management—whether on a powerful server or a resource-constrained device.