In this blog post, Build a Chat Bot with Streamlit: An End-to-End Guide for Teams, we will walk through how to design, build, and deploy a production-ready chat bot using Streamlit and modern large language models (LLMs).
This guide is about blending a fast UI framework with powerful AI. Streamlit turns Python scripts into web apps in minutes. LLMs add natural conversation, reasoning, and task execution. Together, they let technical teams prototype in hours and iterate toward production with confidence.
Why Streamlit for chat bots
Streamlit is Python-first, reactive, and batteries-included. You write simple code, and it handles layout, state, forms, caching, and secrets management. For chat experiences, Streamlit provides native chat components, quick data viz, and frictionless deployment options. For managers, this means short time-to-value and low operational overhead.
How chat bots work at a high level
Modern chat bots have three core layers:
- Interface: a responsive chat UI that captures messages and renders responses.
- Reasoning: an LLM that interprets intent, maintains context, and drafts replies.
- Knowledge and tools: optional retrieval (RAG) or function calls to fetch data or act.
Streamlit handles the interface and app state. An LLM provider (OpenAI, Azure OpenAI, Anthropic, etc.) powers reasoning. A vector store or API integrations provide grounded knowledge. The result is a conversational UI that is both helpful and reliable.
The key technologies behind this stack
Here are the main technologies you will use and why they matter:
- Streamlit: reactive Python web app framework with chat widgets (st.chat_input, st.chat_message), caching (st.cache_*), and session state.
- LLM API: a hosted model endpoint (e.g., OpenAI) for chat completions and function/tool calling.
- Embeddings and vector search (optional): FAISS or a managed vector DB to retrieve relevant documents for RAG.
- Secrets management: Streamlit secrets or environment variables to store API keys safely.
- Containerization/deployment: Streamlit Community Cloud, Docker on AWS/GCP/Azure, or an internal platform.
Architecture overview
A minimal app flows like this:
- Initialize session state for the chat transcript.
- Render prior messages; capture new user input.
- Send the conversation context and user message to an LLM API.
- Optionally, augment with retrieved context (RAG) before calling the LLM.
- Stream or display the model response; persist to session state.
Prerequisites
- Python 3.9+
- pip install streamlit openai (or your preferred LLM client)
- Set OPENAI_API_KEY (or relevant provider key) in your environment or .streamlit/secrets.toml
Step 1 Build a minimal chat UI
This is the smallest useful Streamlit chat bot. It remembers history and calls an LLM.
import os

import streamlit as st
from openai import OpenAI

st.set_page_config(page_title="Streamlit Chat Bot", page_icon="🤖", layout="centered")

# Initialize LLM client
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# Initialize chat history in session state
if "messages" not in st.session_state:
    st.session_state.messages = [
        {"role": "system", "content": "You are a helpful assistant."}
    ]

st.title("Streamlit Chat Bot")

# Render history (skip system in UI)
for msg in st.session_state.messages:
    if msg["role"] == "system":
        continue
    with st.chat_message(msg["role"]):
        st.markdown(msg["content"])

# Chat input
if prompt := st.chat_input("Ask me anything"):
    # Add user message
    st.session_state.messages.append({"role": "user", "content": prompt})

    # Display user message immediately
    with st.chat_message("user"):
        st.markdown(prompt)

    # Get LLM response
    with st.chat_message("assistant"):
        with st.spinner("Thinking..."):
            completion = client.chat.completions.create(
                model="gpt-4o-mini",  # or your preferred model
                messages=st.session_state.messages,
                temperature=0.2,
            )
        reply = completion.choices[0].message.content
        st.markdown(reply)

    # Save assistant reply
    st.session_state.messages.append({"role": "assistant", "content": reply})
Run with streamlit run app.py and you’ll have a functional chat bot.
Make it feel fast with streaming
Streaming small chunks improves perceived performance.
with st.chat_message("assistant"):
    placeholder = st.empty()
    collected = []
    for chunk in client.chat.completions.create(
        model="gpt-4o-mini",
        messages=st.session_state.messages,
        stream=True,
        temperature=0.2,
    ):
        delta = chunk.choices[0].delta.content or ""
        collected.append(delta)
        placeholder.markdown("".join(collected))
    reply = "".join(collected)
Step 2 Manage state and prompts responsibly
- Keep a short, clear system prompt that sets persona and constraints.
- Truncate long histories to control latency and cost.
- Store only what you need; avoid logging secrets or PII.
def trimmed_history(messages, max_messages=10):
    # Simple heuristic: keep the last N messages; for production, measure tokens
    keep = []
    for m in reversed(messages):
        keep.append(m)
        if len(keep) >= max_messages:  # tune as needed
            break
    return list(reversed(keep))
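If you need tighter control than a message count, you can trim by measured tokens instead. Here is a minimal sketch, assuming the tiktoken package is installed and that the cl100k_base encoding is an acceptable approximation for your model's tokenizer:
import tiktoken

ENC = tiktoken.get_encoding("cl100k_base")

def trimmed_by_tokens(messages, max_tokens=2000):
    # Walk backwards and keep as many recent messages as fit in the token budget
    keep, used = [], 0
    for m in reversed(messages):
        cost = len(ENC.encode(m["content"]))
        if used + cost > max_tokens and keep:
            break
        keep.append(m)
        used += cost
    return list(reversed(keep))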
Step 3 Add retrieval for accurate answers (RAG)
Retrieval Augmented Generation lets the bot cite your documents rather than guessing. Below is a lightweight local approach using sentence embeddings and FAISS.
- Index: compute embeddings for your documents and build a FAISS index.
- Retrieve: on each question, get top-k chunks and pass them to the model.
# pip install faiss-cpu sentence-transformers
import faiss
from sentence_transformers import SentenceTransformer

EMB_MODEL = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

# Build the index once (cache it for speed)
@st.cache_resource
def build_index(docs: list[str]):
    vectors = EMB_MODEL.encode(docs, normalize_embeddings=True)
    index = faiss.IndexFlatIP(vectors.shape[1])
    index.add(vectors)
    return index, vectors

# Example documents
DOCS = [
    "Company policy: Support hours are 9am-5pm AEST, Mon-Fri.",
    "Refunds are processed within 7 business days.",
    "To reset your password, use the account settings page.",
]
INDEX, DOC_VECS = build_index(DOCS)

def retrieve(query, k=3):
    qv = EMB_MODEL.encode([query], normalize_embeddings=True)
    D, I = INDEX.search(qv, k)
    return [DOCS[i] for i in I[0]]

# In your chat handler
context = "\n".join(retrieve(prompt))
rag_system_prompt = (
    "You are a helpful assistant. Use the provided context to answer. "
    "If the answer is not in the context, say you do not know.\n\n"
    f"Context:\n{context}\n\n"
)
# Replace the default system prompt with the RAG prompt, keeping only recent non-system turns
messages = [{"role": "system", "content": rag_system_prompt}] + [
    m for m in trimmed_history(st.session_state.messages) if m["role"] != "system"
]
completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=messages,
    temperature=0.1,
)
For larger corpora, consider a managed vector DB (Pinecone, Weaviate, Qdrant Cloud) and chunking PDFs/HTML via loaders.
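Chunking can be as simple as splitting extracted text into overlapping windows before embedding. A minimal sketch follows; the chunk size and overlap values are illustrative, not tuned:
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    # Split a long document into overlapping character windows for embedding
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size].strip()
        if piece:
            chunks.append(piece)
    return chunks

# Usage: index the chunks instead of whole documents
# INDEX, DOC_VECS = build_index(chunk_text(long_document_text))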
Step 4 Add tool use and function calling
Use LLM tool calling to let the bot fetch live data or perform actions. Define tool schemas and route model requests to Python functions.
import json

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_support_hours",
            "description": "Return current support hours by region",
            "parameters": {
                "type": "object",
                "properties": {"region": {"type": "string"}},
                "required": ["region"],
            },
        },
    }
]

def get_support_hours(region: str):
    return {"AEST": "9am-5pm Mon-Fri", "PST": "9am-5pm Mon-Fri"}.get(region, "Unknown")

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=st.session_state.messages,
    tools=tools,
)
choice = resp.choices[0]
if choice.finish_reason == "tool_calls":
    # The assistant turn that requested the tools must precede the tool results
    st.session_state.messages.append({
        "role": "assistant",
        "content": choice.message.content,
        "tool_calls": [c.model_dump() for c in choice.message.tool_calls],
    })
    for call in choice.message.tool_calls:
        if call.function.name == "get_support_hours":
            args = json.loads(call.function.arguments)
            result = get_support_hours(args["region"])
            st.session_state.messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "name": "get_support_hours",
                "content": str(result),
            })
    # Send tool results back to the model
    final = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=st.session_state.messages,
    )
    reply = final.choices[0].message.content
Step 5 Evaluate and observe
- Golden sets: keep a small suite of Q&A pairs that the bot should answer.
- Telemetry: log prompts, response times, token usage (exclude PII).
- User feedback: add a thumbs up/down and capture rationale.
# index=None avoids a pre-selected default, so only real clicks are recorded
fb = st.radio("Was this helpful?", ["👍", "👎"], index=None, horizontal=True, key=f"fb_{len(st.session_state.messages)}")
if fb:
    st.session_state.setdefault("feedback", []).append({"msg": prompt, "fb": fb})
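To make the golden-set idea concrete, here is a minimal sketch of an offline check you might run in CI; the ask_bot helper and the expected keywords are placeholders for your own pipeline and test data:
GOLDEN_SET = [
    {"question": "What are your support hours?", "must_include": "9am-5pm"},
    {"question": "How long do refunds take?", "must_include": "7 business days"},
]

def evaluate(ask_bot) -> float:
    # ask_bot(question) -> answer string; returns the fraction of golden answers that pass
    passed = 0
    for case in GOLDEN_SET:
        answer = ask_bot(case["question"])
        if case["must_include"].lower() in answer.lower():
            passed += 1
    return passed / len(GOLDEN_SET)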
Security, privacy, and governance
- Secrets: store API keys in .streamlit/secrets.toml, not in code or git.
- PII: mask or avoid sending PII to third-party providers.
- Rate limits: add retry/backoff; degrade gracefully on provider outages (see the sketch after this list).
- Allow-list tools and sanitize tool inputs/outputs.
- Model choice: prefer enterprise offerings with data controls (e.g., Azure OpenAI with no training on inputs).
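As a starting point for the retry/backoff bullet above, here is a minimal sketch that wraps the chat call from Step 1; the exception handling and delays are illustrative and should be narrowed to your provider's SDK error types:
import time

def chat_with_retry(client, messages, model="gpt-4o-mini", retries=3):
    # Retry with exponential backoff on transient provider errors
    for attempt in range(retries):
        try:
            return client.chat.completions.create(model=model, messages=messages)
        except Exception:  # narrow this to your provider's rate-limit/timeout exceptions
            if attempt == retries - 1:
                raise
            time.sleep(2 ** attempt)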
Cost and latency control
- Use small models for routine turns; escalate to larger models only when needed.
- Trim history and context; use retrieval to provide only relevant chunks.
- Cache expensive retrieval with st.cache_data (see the sketch below).
- Batch background jobs; set max_tokens thoughtfully.
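A minimal sketch of caching retrieval results, assuming the retrieve function from Step 3; the TTL value is illustrative:
@st.cache_data(ttl=3600)  # identical queries within an hour reuse the cached chunks
def cached_retrieve(query: str, k: int = 3) -> list[str]:
    return retrieve(query, k)

# In the chat handler, use the cached version
context = "\n".join(cached_retrieve(prompt))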
Packaging and configuration
Add a simple requirements.txt:
streamlit==1.37.0
openai>=1.30.0
faiss-cpu
sentence-transformers
And a minimal .streamlit/secrets.toml:
OPENAI_API_KEY = "<your-key>"
Deployment options
Streamlit Community Cloud
- Push to GitHub.
- Connect the repo in Streamlit Cloud.
- Add secrets in the dashboard; deploy.
Docker on your cloud
# Dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt ./
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 8501
ENV PYTHONUNBUFFERED=1
CMD ["streamlit", "run", "app.py", "--server.port=8501", "--server.address=0.0.0.0"]
Run locally with:
docker build -t streamlit-chat .
docker run -p 8501:8501 -e OPENAI_API_KEY=$OPENAI_API_KEY streamlit-chat
Then deploy the image to ECS, Cloud Run, AKS, or your platform of choice. Add autoscaling and a load balancer for production traffic.
Common pitfalls and how to avoid them
- Endless context growth: trim or summarize older turns (see the sketch after this list).
- Hallucinations: use RAG and instruct the model to admit uncertainty.
- Slow responses: stream tokens and prefetch retrieval.
- Inconsistent answers: standardize system prompts and temperature.
- Key leakage: keep credentials in secrets; never print them in logs.
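One way to act on the "summarize older turns" advice is to fold everything but the most recent exchanges into a single short note. A minimal sketch, assuming the client and message format from Step 1; the model and word limit are illustrative:
def summarize_older_turns(client, messages, keep_last=6):
    # Collapse everything except the last few turns into one short summary message
    old, recent = messages[1:-keep_last], messages[-keep_last:]
    if not old:
        return messages
    summary = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "system", "content": "Summarize this conversation in under 100 words."}] + old,
    ).choices[0].message.content
    return [messages[0], {"role": "system", "content": f"Conversation so far: {summary}"}] + recent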
What good looks like
- Clear, concise system prompt that aligns with your domain.
- Fast first token via streaming and lightweight models.
- Grounded answers with citations from your knowledge base.
- Audit trail of prompts, context, and decisions.
- Automated deployments and rollbacks with container images.
Next steps
- Add user auth and role-based access to tailor answers by department.
- Support file uploads and on-the-fly indexing for ad-hoc documents.
- Introduce analytics on conversation quality and deflection rates.
- Experiment with tool calling to integrate internal APIs.
Wrap up
Streamlit plus a modern LLM is a powerful, pragmatic foundation for chat experiences. Start small with the minimal app, add retrieval for trustworthy answers, and layer in tools and deployment. With careful attention to state, cost, and governance, you can ship a helpful bot quickly—and improve it continuously as your users engage.