💬 What is Chat Completion?


📄 TL;DR

Chat completion is a generation mode where the model produces the next assistant message from a structured conversation history instead of a single standalone prompt. You send a sequence of messages (system, user, assistant, tool output), and the model answers while taking the full transcript into account.

In LM-Kit.NET, chat completion is primarily implemented through MultiTurnConversation, which maintains the evolving ChatHistory and tracks practical constraints like context size, remaining token budget, and per-turn limits. (LM-Kit Docs)


🧠 What Is Chat Completion?

Think of chat completion as “predict the next reply in a dialogue”.

A conversation is represented as:

  • A history of messages (ChatHistory) (LM-Kit Docs)
  • Each message has a role (AuthorRole) such as System, User, Assistant, and Tool (LM-Kit Docs)
  • The model generates the next Assistant message based on the whole transcript

This is what makes assistants feel coherent. A user can say “do the same but shorter”, and the model knows what “the same” refers to because it sees the previous turns.
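
A minimal sketch of that representation (ChatHistory and AuthorRole are documented types; the parameterless constructor and the AddMessage(role, text) call shape are assumptions here):

// A minimal sketch of a role-aware transcript.
// ChatHistory and AuthorRole live in LMKit.TextGeneration.Chat;
// the exact AddMessage signature shown is an assumption.
using LMKit.TextGeneration.Chat;

var history = new ChatHistory();
history.AddMessage(AuthorRole.System, "You are a concise technical assistant.");
history.AddMessage(AuthorRole.User, "Summarize the project in 3 bullets.");
history.AddMessage(AuthorRole.Assistant, "• Core API • Samples • Docs");
history.AddMessage(AuthorRole.User, "Do the same but shorter.");
// Chat completion = generate the next Assistant message from this transcript.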


⚙️ How Chat Completion Works in LM-Kit.NET

1) The runtime stores the conversation as ChatHistory

ChatHistory contains the message list and provides helpers to build the model prompt, including:

  • Messages and MessageCount (LM-Kit Docs)
  • Role-aware formatting via prefixes and suffixes (SystemPrefix, UserPrefix, AssistantPrefix, and their corresponding suffixes) (LM-Kit Docs)
  • ToText() to render the full formatted prompt, and ToTokens() to render the prompt as model tokens (LM-Kit Docs)

That last part is important because it connects the “chat transcript” you see as a developer to what the model actually receives.
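A quick sketch of that bridge, given an active MultiTurnConversation named chat (ToText() and ToTokens() are documented; counting the returned tokens with LINQ is an assumption about the return type):

using System.Linq;

// Given an active MultiTurnConversation `chat`:
var history = chat.ChatHistory;

string formattedPrompt = history.ToText();  // transcript with role prefixes/suffixes applied
var promptTokens = history.ToTokens();      // the same prompt rendered as model tokens

Console.WriteLine($"Messages: {history.MessageCount}");
Console.WriteLine($"Prompt tokens: {promptTokens.Count()}");  // assumes an enumerable token collection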

2) MultiTurnConversation appends turns as you chat

MultiTurnConversation exposes a ChatHistory property and automatically appends messages as you call Submit(...). (LM-Kit Docs)

It also supports real-world workflows (sketched in code after this list):

  • Start fresh with a chosen context size, or let LM-Kit pick an optimal one (contextSize = -1) based on hardware and model settings (LM-Kit Docs)
  • Resume from an existing ChatHistory (it clones the history) (LM-Kit Docs)
  • Restore from a serialized session (bytes or file) (LM-Kit Docs)
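
Roughly, the three entry points look like this (the scenarios are documented; the exact overload shapes are assumptions):

// 1) Fresh conversation; -1 lets LM-Kit pick an optimal context size.
var fresh = new MultiTurnConversation(model, contextSize: -1);

// 2) Resume from an existing ChatHistory (LM-Kit clones the history).
var resumed = new MultiTurnConversation(model, existingHistory);

// 3) Restore a previously serialized session from a file (or raw bytes).
var restored = new MultiTurnConversation(model, "session.bin");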

3) Context is a budget, not a vibe

Chat completion is always constrained by the model’s context window. LM-Kit.NET surfaces that clearly:

  • ContextSize: total window for prompt plus response (LM-Kit Docs)
  • ContextRemainingSpace: what is still available right now (LM-Kit Docs)
  • MaximumCompletionTokens: per-turn cap (default 2048; -1 disables the limit, subject to remaining context capacity) (LM-Kit Docs)

That makes chat completion predictable: you can measure when you are close to overflow and decide how to trim or summarize history.
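
For example, a pre-flight check might look like this (the three properties are documented; the 10% threshold and trimming strategy are just one possible policy):

// Given an active MultiTurnConversation `chat`:
int total     = chat.ContextSize;            // full window: prompt + response
int remaining = chat.ContextRemainingSpace;  // currently available tokens

if (remaining < total / 10)
{
    // Near overflow: summarize or drop older turns before the next Submit.
}

chat.MaximumCompletionTokens = 1024;  // per-turn cap (default 2048; -1 = no cap)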


🧩 Roles: the hidden superpower of chat

Roles are how you get reliable behavior without turning every prompt into a fragile wall of text.

LM-Kit.NET’s AuthorRole explicitly distinguishes:

  • System: sets behavior and high-level context (LM-Kit Docs)
  • User: what the user said (LM-Kit Docs)
  • Assistant: what the model generated (LM-Kit Docs)
  • Tool: structured tool results returned to the model after a tool call (LM-Kit Docs)
  • Developer: application-level instructions and policies (LM-Kit Docs)

If you want genuinely reliable behavior, this is where it starts: instead of stuffing everything into one prompt, you put each piece of information in the right role so the model treats it the way you intend.
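
As a sketch, assuming SystemPrompt is settable as a plain string before the first turn (the property is documented; the timing constraint is my assumption):

// Policy lives in the system role, not inside user text.
chat.SystemPrompt = "You are a support agent. Answer in at most two sentences.";

// User turns then carry only what the user actually said.
var reply = chat.Submit("My order hasn't arrived. What now?");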


🔧 Tools + Chat Completion: where assistants become useful

Chat completion becomes an “agent loop” once tools enter the picture.

LM-Kit.NET’s tools demo describes a pattern where the model can decide to call one or more tools, pass JSON arguments matching each tool schema, then use the tool’s JSON results to craft a grounded reply. Tools implement ITool, and behavior can be shaped via ToolChoice. (LM-Kit Docs)

Two key design details matter a lot in production:

✅ Tool results belong in the transcript

Tool output is stored as AuthorRole.Tool, meaning it becomes part of the conversation state that future turns can reference. (LM-Kit Docs)

⚠️ Grammar constraints do not mix with tools

MultiTurnConversation documents a hard incompatibility: if Grammar is set (used to constrain output like JSON) and any tool is registered, Submit(...) throws an InvalidOperationException. (LM-Kit Docs)

Practical takeaway: pick one per flow (both paths are sketched after this list).

  • Need strict JSON? Use grammar constraints.
  • Need tool calling? Let tools enforce structure and validation.
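
A sketch of the mutual exclusion (Grammar is a documented property; how tools are registered on the conversation is an assumption here):

// Flow A: strict structured output via grammar constraints, no tools.
structuredChat.Grammar = jsonGrammar;   // constrain the reply shape

// Flow B: tool calling, no Grammar. Tools implement ITool.
toolChat.Tools.Add(weatherTool);        // registration shape is an assumption

// Doing both on the same conversation makes Submit(...) throw
// InvalidOperationException.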

🧠 Memory in chat completion

MultiTurnConversation includes a Memory store for recalling relevant context across turns, plus controls like MaximumRecallTokens to cap how much recalled content is injected per turn. (LM-Kit Docs)

This is the difference between:

  • short-term memory: the live ChatHistory
  • long-term memory: recalled context injected when useful

It is how chats stay helpful even when the user returns after many turns or switches topics.
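
The knob you will usually touch is the recall budget (MaximumRecallTokens is documented; 512 is just an illustrative value):

// Long-term recall is budgeted separately from the live history.
chat.MaximumRecallTokens = 512;  // cap recalled context injected per turn
// chat.Memory exposes the long-term store; its exact API is not shown here.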


🔁 Chat completion vs single-turn completion

If your app is “ask once, answer once”, single-turn is simpler and faster.

LM-Kit.NET makes the distinction explicit:

  • SingleTurnConversation is designed for single-turn Q&A and does not preserve context between questions and answers (LM-Kit Docs)
  • MultiTurnConversation preserves the full dialogue history, tool results, and memory recall across turns (LM-Kit Docs)

A nice mental model:

  • Single-turn is like a search box.
  • Multi-turn chat is like a relationship: it remembers.
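
For contrast with the multi-turn code in the next section, a single-turn sketch might look like this (SingleTurnConversation and its stateless behavior are documented; the constructor shape is an assumption):

// Single-turn: stateless Q&A; no transcript is carried between calls.
var qa = new SingleTurnConversation(model);
var a1 = qa.Submit("What is a context window?");
var a2 = qa.Submit("Make it shorter.");  // "it" has nothing to refer to here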

🧪 A tiny C# mental model (history-first)

// Close-to-real C#; the model path is a placeholder, and treating
// Submit's result as exposing a Completion string is an assumption.
using LMKit.Model;
using LMKit.TextGeneration;

// 0) Load a model (local GGUF path or download URI).
var lm = new LM("path/to/model.gguf");

// 1) Create a multi-turn conversation; -1 lets LM-Kit pick a context size.
var chat = new MultiTurnConversation(lm, contextSize: -1);

// 2) Chat completion is "append message, generate next assistant turn".
var reply1 = chat.Submit("Hello! Summarize this project in 3 bullets.");

// 3) The history now contains System/User/Assistant turns.
var promptText = chat.ChatHistory.ToText();

// 4) The next turn builds on the full transcript.
var reply2 = chat.Submit("Make it shorter and more technical.");
Console.WriteLine(reply2.Completion);   // assumed result property

The “magic” is not the second prompt. The magic is that it is evaluated inside a growing transcript. (LM-Kit Docs)


🌟 How to make chat completion feel amazing

Three high-leverage tips that map directly to LM-Kit.NET concepts:

  1. Treat context like money. Watch ContextRemainingSpace and avoid “history bloat” by summarizing older turns when needed. (LM-Kit Docs)

  2. Use roles deliberately. Put policies in System or Developer roles, not inside user text. Roles exist to prevent your prompt from becoming spaghetti. (LM-Kit Docs)

  3. Ground facts with tools. If the answer must be correct, call tools and store results as tool messages, then let the assistant explain. (LM-Kit Docs)

Bonus: if you want the assistant to “feel” different, sampling changes help. LM-Kit has a dedicated multi-turn chat sample for custom sampling strategies (top-k, top-p, temperature, logit biases). (LM-Kit Docs)


📝 Summary

Chat completion is next-message generation over a role-aware conversation history.

In LM-Kit.NET, the core building blocks are:

  • MultiTurnConversation for multi-turn state, context budgeting, memory recall, and tool-aware chat (LM-Kit Docs)
  • ChatHistory for storing, formatting, tokenizing, and serializing the conversation (LM-Kit Docs)
  • AuthorRole for separating system instructions, user input, assistant output, and tool results (LM-Kit Docs)