Perform Model Quantization in .NET Applications


📌 Purpose of the Sample

The Model Quantization Demo shows how to use the LM-Kit.NET SDK to perform model quantization, a process that reduces the precision of a model's weights. Quantization optimizes a model for faster inference and a smaller memory footprint while balancing the trade-off between model size and output quality.
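
To make the idea concrete, the sketch below applies a naive symmetric 4-bit quantization to a small weight array. It is only a conceptual illustration of the precision-for-size trade-off; the GGUF K-quant formats listed later on this page use more elaborate block-wise schemes.

    using System;
    using System.Linq;

    class QuantizationConcept
    {
        static void Main()
        {
            // Conceptual illustration only: map float weights to 4-bit signed
            // integers (-8..7) with a single per-tensor scale. The GGUF K-quant
            // formats quantize in blocks with per-block scales, so this is just
            // a simplified mental model of the precision/size trade-off.
            float[] weights = { 0.12f, -0.98f, 0.34f, 0.77f, -0.05f };

            float scale = weights.Max(w => Math.Abs(w)) / 7f;

            sbyte[] quantized = weights
                .Select(w => (sbyte)Math.Max(-8, Math.Min(7, (int)Math.Round(w / scale))))
                .ToArray();

            // Dequantizing shows the rounding error introduced by quantization.
            for (int i = 0; i < weights.Length; i++)
                Console.WriteLine($"{weights[i],8:F4} -> {quantized[i],3} -> {quantized[i] * scale,8:F4}");
        }
    }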


👥 Industry Target Audience

This sample is particularly useful for developers and organizations in the following sectors:

  • 📊 Machine Learning: Engineers and researchers optimizing models for deployment on resource-constrained devices.
  • 📱 Mobile Applications: Developers looking to deploy efficient AI models on mobile devices.
  • 🌐 IoT Devices: Companies integrating AI models into IoT devices that have limited computational resources.
  • 🏠 Edge Computing: Developers needing to deploy AI models at the edge with constrained hardware capabilities.

🛠️ Problem Solved

Large-scale language models often require significant computational resources for inference. Quantization reduces model size and improves inference speed without a substantial loss in accuracy. This demo addresses the challenge of deploying large models efficiently by providing a straightforward way to quantize a model into various precision formats. The reduced memory footprint also allows a single server to host several models at once, making the system more scalable.


💻 Sample Application Description

The Model Quantization Demo is a console application that allows users to input the path to a model file and specify the desired quantization format. The application supports multiple quantization formats, providing flexibility in choosing the right balance between model size and performance.

✨ Key Features

  • 🔍 Model Validation: Ensures the provided model is in the correct format before proceeding with quantization.
  • ⚙️ Flexible Quantization: Supports multiple quantization formats, allowing users to select the appropriate precision based on their needs.
  • 🔄 Batch Quantization: Provides an option to generate quantized versions of the model for all supported formats (a sketch of this loop follows the list below).
  • ⚡ Efficiency: Fast loading and saving operations ensure minimal delay in the quantization process.
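
As a rough sketch of how such a batch option could be structured, the snippet below iterates over the format names from the table in the next section and derives an output file name for each. The quantize delegate passed to QuantizeAll is a hypothetical placeholder for the actual LM-Kit.NET quantization call, which is not reproduced here.

    using System;
    using System.IO;

    static class BatchQuantizationSketch
    {
        // Format names as listed in the table below.
        static readonly string[] Formats =
        {
            "Q2_K", "Q3_K_S", "Q3_K_M", "Q3_K_L", "Q4_K_S",
            "Q4_K_M", "Q5_K_S", "Q5_K_M", "Q6_K", "Q8_0"
        };

        // 'quantize' is a hypothetical placeholder: plug in the SDK call that
        // takes (inputPath, outputPath, formatName) in your own code.
        public static void QuantizeAll(string modelPath, Action<string, string, string> quantize)
        {
            string directory = Path.GetDirectoryName(modelPath) ?? ".";
            string baseName = Path.GetFileNameWithoutExtension(modelPath);

            foreach (string format in Formats)
            {
                string outputPath = Path.Combine(directory, $"{baseName}-quantized-{format}.gguf");
                Console.WriteLine($"Generating {Path.GetFileName(outputPath)}...");
                quantize(modelPath, outputPath, format);
            }
        }
    }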

📐 Supported Quantization Formats

Format    Description
Q2_K      Smallest size, significant quality loss
Q3_K_S    Very small size, high quality loss
Q3_K_M    Very small size, high quality loss
Q3_K_L    Small size, substantial quality loss
Q4_K_S    Small size, greater quality loss
Q4_K_M    Medium size, balanced quality (recommended)
Q5_K_S    Large size, low quality loss (recommended)
Q5_K_M    Large size, very low quality loss (recommended)
Q6_K      Very large size, extremely low quality loss
Q8_0      Very large size, extremely low quality loss (not recommended)
ALL       Generates versions for all formats

🚀 Getting Started

📝 Prerequisites

  • .NET Framework 4.6.2 or .NET 6.0

▶️ Running the Application

  1. 📂 Clone the repository:

    git clone https://github.com/LM-Kit/lm-kit-net-samples.git
    
  2. 📁 Navigate to the project directory:

    cd lm-kit-net-samples/console_framework_4.62/quantizer
    

    or

    cd lm-kit-net-samples/console_net6/quantizer
    
  3. 🔨 Build and run the application:

    dotnet build
    dotnet run
    

💡 Example Usage

  1. 🔑 Set the License Key (if available):

    LMKit.Licensing.LicenseManager.SetLicenseKey(""); // Set an optional license key here if available.
    
  2. 📁 Enter the Path to the Model: The application will prompt you to enter the path to the model file you wish to quantize.

    Please enter the path to the model for quantization (recommended format: fp16):
    
  3. 🔧 Select the Quantization Format: The application will ask you to specify the desired quantization format from the list of supported formats.

    Please enter the output model format (e.g., 'Q4_K_M'):
    
  4. ⚙️ Quantization Process: The application will then perform the quantization and save the quantized model to the specified path. The process is optimized for speed, ensuring minimal delay.
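
Putting the steps above together, a minimal version of this console flow might look like the following sketch. The license call is the one from step 1; the quantization call itself is left as a commented placeholder because the exact LM-Kit.NET quantization API is not reproduced on this page.

    using System;
    using System.IO;

    class QuantizerFlowSketch
    {
        static void Main()
        {
            // Optional license key, as shown in step 1.
            LMKit.Licensing.LicenseManager.SetLicenseKey("");

            Console.Write("Please enter the path to the model for quantization (recommended format: fp16): ");
            string modelPath = (Console.ReadLine() ?? "").Trim();

            if (!File.Exists(modelPath))
            {
                Console.WriteLine("Model file not found.");
                return;
            }

            Console.Write("Please enter the output model format (e.g., 'Q4_K_M'): ");
            string format = (Console.ReadLine() ?? "").Trim().ToUpperInvariant();

            string outputPath = Path.Combine(
                Path.GetDirectoryName(modelPath) ?? ".",
                $"{Path.GetFileNameWithoutExtension(modelPath)}-quantized-{format}.gguf");

            Console.WriteLine($"Generating {Path.GetFileName(outputPath)} with precision MOSTLY_{format}...");

            // Hypothetical placeholder: invoke the LM-Kit.NET quantization API here
            // to read modelPath and write the quantized model to outputPath.

            Console.WriteLine("Quantization completed successfully.");
        }
    }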

📊 Example Output

The application will display the status of the quantization process, including the generation of the quantized model file and its format. For example:

    Generating model-quantized-Q4_K_M.gguf with precision MOSTLY_Q4_K_M...
    Quantization completed successfully.