💬 What is Chat Completion?


📄 TL;DR

Chat completion is a generation mode where the model produces the next assistant message from a structured conversation history instead of a single standalone prompt. You send a sequence of messages (system, user, assistant, tool output), and the model answers while taking the full transcript into account.

In LM-Kit.NET, chat completion is primarily implemented through MultiTurnConversation, which maintains the evolving ChatHistory and tracks practical constraints like context size, remaining token budget, and per-turn limits. (LM-Kit Docs)


🧠 What Is Chat Completion?

Think of chat completion as “predict the next reply in a dialogue”.

A conversation is represented as:

  • A history of messages (ChatHistory) (LM-Kit Docs)
  • Each message has a role (AuthorRole) such as System, User, Assistant, and Tool (LM-Kit Docs)
  • The model generates the next Assistant message based on the whole transcript

This is what makes assistants feel coherent. A user can say “do the same but shorter”, and the model knows what “the same” refers to because it sees the previous turns.
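
A minimal sketch of that representation (ChatHistory and AuthorRole are documented types; the parameterless constructor and the AddMessage(role, text) call shape are assumptions here):

// A minimal sketch of a role-aware transcript.
// ChatHistory and AuthorRole live in LMKit.TextGeneration.Chat;
// the exact AddMessage signature shown is an assumption.
using LMKit.TextGeneration.Chat;

var history = new ChatHistory();
history.AddMessage(AuthorRole.System, "You are a concise technical assistant.");
history.AddMessage(AuthorRole.User, "Summarize the project in 3 bullets.");
history.AddMessage(AuthorRole.Assistant, "• Core API • Samples • Docs");
history.AddMessage(AuthorRole.User, "Do the same but shorter.");
// Chat completion = generate the next Assistant message from this transcript.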


⚙️ How Chat Completion Works in LM-Kit.NET

1) The runtime stores the conversation as ChatHistory

ChatHistory contains the message list and provides helpers to build the model prompt, including:

  • Messages and MessageCount (LM-Kit Docs)
  • Role-aware formatting via prefixes and suffixes (SystemPrefix, UserPrefix, AssistantPrefix, and their corresponding suffixes) (LM-Kit Docs)
  • ToText() to render the full formatted prompt, and ToTokens() to render the prompt as model tokens (LM-Kit Docs)

That last part is important because it connects the “chat transcript” you see as a developer to what the model actually receives.
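A quick sketch of that bridge, given an active MultiTurnConversation named chat (ToText() and ToTokens() are documented; counting the returned tokens with LINQ is an assumption about the return type):

using System.Linq;

// Given an active MultiTurnConversation `chat`:
var history = chat.ChatHistory;

string formattedPrompt = history.ToText();  // transcript with role prefixes/suffixes applied
var promptTokens = history.ToTokens();      // the same prompt rendered as model tokens

Console.WriteLine($"Messages: {history.MessageCount}");
Console.WriteLine($"Prompt tokens: {promptTokens.Count()}");  // assumes an enumerable token collection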

2) MultiTurnConversation appends turns as you chat

MultiTurnConversation exposes a ChatHistory property and automatically appends messages as you call Submit(...). (LM-Kit Docs)

It also supports real-world workflows (sketched in code after this list):

  • Start fresh with a chosen context size, or let LM-Kit pick an optimal one (contextSize = -1) based on hardware and model settings (LM-Kit Docs)
  • Resume from an existing ChatHistory (it clones the history) (LM-Kit Docs)
  • Restore from a serialized session (bytes or file) (LM-Kit Docs)
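
Roughly, the three entry points look like this (the scenarios are documented; the exact overload shapes are assumptions):

// 1) Fresh conversation; -1 lets LM-Kit pick an optimal context size.
var fresh = new MultiTurnConversation(model, contextSize: -1);

// 2) Resume from an existing ChatHistory (LM-Kit clones the history).
var resumed = new MultiTurnConversation(model, existingHistory);

// 3) Restore a previously serialized session from a file (or raw bytes).
var restored = new MultiTurnConversation(model, "session.bin");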

3) Context is a budget, not a vibe

Chat completion is always constrained by the model’s context window. LM-Kit.NET surfaces that clearly:

  • ContextSize: total window for prompt plus response (LM-Kit Docs)
  • ContextRemainingSpace: what is still available right now (LM-Kit Docs)
  • MaximumCompletionTokens: per-turn cap (default 2048; -1 disables the limit, subject to remaining context capacity) (LM-Kit Docs)

That makes chat completion predictable: you can measure when you are close to overflow and decide how to trim or summarize history.
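
For example, a pre-flight check might look like this (the three properties are documented; the 10% threshold and trimming strategy are just one possible policy):

// Given an active MultiTurnConversation `chat`:
int total     = chat.ContextSize;            // full window: prompt + response
int remaining = chat.ContextRemainingSpace;  // currently available tokens

if (remaining < total / 10)
{
    // Near overflow: summarize or drop older turns before the next Submit.
}

chat.MaximumCompletionTokens = 1024;  // per-turn cap (default 2048; -1 = no cap)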


🧩 Roles: the hidden superpower of chat

Roles are how you get reliable behavior without turning every prompt into a fragile wall of text.

LM-Kit.NET’s AuthorRole explicitly distinguishes:

  • System: sets behavior and high-level context (LM-Kit Docs)
  • User: what the user said (LM-Kit Docs)
  • Assistant: what the model generated (LM-Kit Docs)
  • Tool: structured tool results returned to the model after a tool call (LM-Kit Docs)
  • Developer: application-level instructions and policies (LM-Kit Docs)

If you want genuinely reliable behavior, this is where it starts: instead of stuffing everything into one prompt, you put each piece of information in the right role so the model treats it the way you intend.
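
As a sketch, assuming SystemPrompt is settable as a plain string before the first turn (the property is documented; the timing constraint is my assumption):

// Policy lives in the system role, not inside user text.
chat.SystemPrompt = "You are a support agent. Answer in at most two sentences.";

// User turns then carry only what the user actually said.
var reply = chat.Submit("My order hasn't arrived. What now?");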


🔧 Tools + Chat Completion: where assistants become useful

Chat completion becomes an “agent loop” once tools enter the picture.

LM-Kit.NET’s tools demo describes a pattern where the model can decide to call one or more tools, pass JSON arguments matching each tool schema, then use the tool’s JSON results to craft a grounded reply. Tools implement ITool, and behavior can be shaped via ToolChoice. (LM-Kit Docs)

Two key design details matter a lot in production:

✅ Tool results belong in the transcript

Tool output is stored as AuthorRole.Tool, meaning it becomes part of the conversation state that future turns can reference. (LM-Kit Docs)

⚠️ Grammar constraints do not mix with tools

MultiTurnConversation documents a hard incompatibility: if Grammar is set (used to constrain output like JSON) and any tool is registered, Submit(...) throws an InvalidOperationException. (LM-Kit Docs)

Practical takeaway: pick one per flow (both paths are sketched after this list).

  • Need strict JSON? Use grammar constraints.
  • Need tool calling? Let tools enforce structure and validation.
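
A sketch of the mutual exclusion (Grammar is a documented property; how tools are registered on the conversation is an assumption here):

// Flow A: strict structured output via grammar constraints, no tools.
structuredChat.Grammar = jsonGrammar;   // constrain the reply shape

// Flow B: tool calling, no Grammar. Tools implement ITool.
toolChat.Tools.Add(weatherTool);        // registration shape is an assumption

// Doing both on the same conversation makes Submit(...) throw
// InvalidOperationException.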

🧠 Memory in chat completion

MultiTurnConversation includes a Memory store for recalling relevant context across turns, plus controls like MaximumRecallTokens to cap how much recalled content is injected per turn. (LM-Kit Docs)

This is the difference between:

  • short-term memory: the live ChatHistory
  • long-term memory: recalled context injected when useful

It is how chats stay helpful even when the user returns after many turns or switches topics.
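
The knob you will usually touch is the recall budget (MaximumRecallTokens is documented; 512 is just an illustrative value):

// Long-term recall is budgeted separately from the live history.
chat.MaximumRecallTokens = 512;  // cap recalled context injected per turn
// chat.Memory exposes the long-term store; its exact API is not shown here.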


🔁 Chat completion vs single-turn completion

If your app is “ask once, answer once”, single-turn is simpler and faster.

LM-Kit.NET makes the distinction explicit:

  • SingleTurnConversation is designed for single-turn Q&A and does not preserve context between questions and answers (LM-Kit Docs)
  • MultiTurnConversation preserves the full dialogue history, tool results, and memory recall across turns (LM-Kit Docs)

A nice mental model:

  • Single-turn is like a search box.
  • Multi-turn chat is like a relationship: it remembers.
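
For contrast with the multi-turn code in the next section, a single-turn sketch might look like this (SingleTurnConversation and its stateless behavior are documented; the constructor shape is an assumption):

// Single-turn: stateless Q&A; no transcript is carried between calls.
var qa = new SingleTurnConversation(model);
var a1 = qa.Submit("What is a context window?");
var a2 = qa.Submit("Make it shorter.");  // "it" has nothing to refer to here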

🧪 A tiny C# mental model (history-first)

// Close-to-real C#; the model path is a placeholder, and treating
// Submit's result as exposing a Completion string is an assumption.
using LMKit.Model;
using LMKit.TextGeneration;

// 0) Load a model (local GGUF path or download URI).
var lm = new LM("path/to/model.gguf");

// 1) Create a multi-turn conversation; -1 lets LM-Kit pick a context size.
var chat = new MultiTurnConversation(lm, contextSize: -1);

// 2) Chat completion is "append message, generate next assistant turn".
var reply1 = chat.Submit("Hello! Summarize this project in 3 bullets.");

// 3) The history now contains System/User/Assistant turns.
var promptText = chat.ChatHistory.ToText();

// 4) The next turn builds on the full transcript.
var reply2 = chat.Submit("Make it shorter and more technical.");
Console.WriteLine(reply2.Completion);   // assumed result property

The “magic” is not the second prompt. The magic is that it is evaluated inside a growing transcript. (LM-Kit Docs)


🌟 How to make chat completion feel amazing

Three high-leverage tips that map directly to LM-Kit.NET concepts:

  1. Treat context like money. Watch ContextRemainingSpace and avoid “history bloat” by summarizing older turns when needed. (LM-Kit Docs)

  2. Use roles deliberately. Put policies in System or Developer roles, not inside user text. Roles exist to prevent your prompt from becoming spaghetti. (LM-Kit Docs)

  3. Ground facts with tools. If the answer must be correct, call tools and store results as tool messages, then let the assistant explain. (LM-Kit Docs)

Bonus: if you want the assistant to “feel” different, sampling changes help. LM-Kit has a dedicated multi-turn chat sample for custom sampling strategies (top-k, top-p, temperature, logit biases). (LM-Kit Docs)


📝 Summary

Chat completion is next-message generation over a role-aware conversation history.

In LM-Kit.NET, the core building blocks are:

  • MultiTurnConversation for multi-turn state, context budgeting, memory recall, and tool-aware chat (LM-Kit Docs)
  • ChatHistory for storing, formatting, tokenizing, and serializing the conversation (LM-Kit Docs)
  • AuthorRole for separating system instructions, user input, assistant output, and tool results (LM-Kit Docs)