In this blog post, Supercharge LangChain apps with an LLM cache for speed and cost, we show how to make LangChain applications faster, cheaper, and more reliable by caching LLM outputs.
The post is built around one idea: do not recompute answers you have already paid for. Caching turns repeat prompts into near-instant responses, smoothing traffic spikes and protecting your budget. You will learn what happens under the hood, when to use (and avoid) caching, and how to move from a laptop setup to production.
Why LLM caching matters
LLM calls are slow compared to a memory or network cache lookup, and they cost money. Many workloads repeat the same prompts: unit tests, evaluations, deterministic pipelines, or user flows with minor variations. A cache cuts latency from seconds to milliseconds and eliminates duplicate spend. As a bonus, it reduces provider rate-limit pressure and improves perceived reliability when an upstream API blips.
How LangChain caching works under the hood
LangChain ships a pluggable LLM cache that sits behind its model interfaces. When you call an LLM or ChatModel, LangChain computes a cache key that includes:
- The serialized model and parameters (e.g., model name, temperature, tools)
- The full prompt (or message list) after formatting
If the key exists, LangChain returns the stored generations. If not, it calls the provider and stores the result for future hits. Backends range from in-memory (fast, ephemeral) to SQLite (local persistence) to Redis (shared, production-grade). There are also semantic caches that use embeddings to match “similar” prompts, not just exact strings.
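To make the key structure concrete, here is a minimal sketch of a custom exact-match cache built on LangChain's BaseCache interface. The lookup and update hooks receive the rendered prompt plus an llm_string that serializes the model and its parameters; the DictCache class itself is illustrative, not a replacement for the built-in backends.
from typing import Any, Optional, Sequence
from langchain_core.caches import BaseCache

class DictCache(BaseCache):
    """Illustrative exact-match cache keyed on (prompt, llm_string)."""

    def __init__(self) -> None:
        self._store: dict[tuple[str, str], Sequence[Any]] = {}

    def lookup(self, prompt: str, llm_string: str) -> Optional[Sequence[Any]]:
        # llm_string encodes the model name and parameters, so the same prompt
        # sent to a different model or temperature is a cache miss.
        return self._store.get((prompt, llm_string))

    def update(self, prompt: str, llm_string: str, return_val: Sequence[Any]) -> None:
        self._store[(prompt, llm_string)] = return_val

    def clear(self, **kwargs: Any) -> None:
        self._store.clear()
Because set_llm_cache accepts any BaseCache, you could enable this with set_llm_cache(DictCache()) exactly like the built-in backends.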
When to cache and when not to
- Cache: evaluation runs, prompt-engineering loops, deterministic chains, knowledge-base queries that change slowly, and expensive multi-step workflows.
- Be careful: prompts with real-time data (dates, stock prices), user-personalized content, or prompts where the latest context changes the answer.
- Mitigate staleness: set TTLs, include cache-busting context (e.g., a content version, as sketched below), or use a semantic cache with conservative thresholds.
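A lightweight way to apply the cache-busting idea is to fold a version marker into the prompt itself, so bumping the version naturally produces a new cache key. The kb_version tag below is an assumption about how you track your own content revisions.
from langchain_core.prompts import ChatPromptTemplate

# Any change to kb_version changes the rendered prompt, and therefore the cache key.
kb_version = "2024-06-01"  # hypothetical knowledge-base revision tag

prompt = ChatPromptTemplate.from_messages([
    ("system", f"You answer from knowledge base revision {kb_version}."),
    ("human", "{question}"),
])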
Quick start with an in-memory cache
Great for local development and tests.
pip install langchain langchain-openai
import os
from langchain.globals import set_llm_cache
from langchain.cache import InMemoryCache
from langchain_openai import ChatOpenAI
os.environ["OPENAI_API_KEY"] = "<your-key>"
# Enable global LLM cache
set_llm_cache(InMemoryCache())
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
resp1 = llm.invoke("Explain vector databases in one sentence.")
resp2 = llm.invoke("Explain vector databases in one sentence.") # served from cache
print(resp1.content)
Note: the cache key includes the model and parameters. If you change temperature or system prompts, you will get a new entry.
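If you want to see the speedup for yourself, a quick timing check like the sketch below (plain time.perf_counter, nothing LangChain-specific) will typically show the second call returning in milliseconds.
import time

start = time.perf_counter()
llm.invoke("Explain vector databases in one sentence.")
first = time.perf_counter() - start

start = time.perf_counter()
llm.invoke("Explain vector databases in one sentence.")  # served from cache
second = time.perf_counter() - start

print(f"first call: {first:.2f}s, cached call: {second:.4f}s")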
Persist results with SQLite
Use this when you want responses to survive restarts or to be shared across a small team via a single file.
from langchain.globals import set_llm_cache
from langchain.cache import SQLiteCache
set_llm_cache(SQLiteCache(database_path=".langchain_cache.db"))
SQLite is simple and reliable. Store the database with your experiment artifacts for reproducibility.
Scale out with Redis in production
Redis gives you a shared cache across app servers, eviction policies, metrics, and high availability.
pip install redis
from redis import Redis
from langchain.globals import set_llm_cache
from langchain.cache import RedisCache
redis_client = Redis(host="localhost", port=6379, db=0)
set_llm_cache(RedisCache(redis_client))
Tip: use a dedicated database or key prefix per environment (dev/stage/prod) to avoid accidental cross-talk.
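One simple way to follow that tip is to choose the Redis database number from an environment variable, so dev, stage, and prod never share keys. The APP_ENV variable and the database mapping below are assumptions about your deployment setup, not LangChain conventions.
import os
from redis import Redis
from langchain.globals import set_llm_cache
from langchain.cache import RedisCache

# Hypothetical mapping from deployment environment to a Redis logical database.
ENV_TO_DB = {"dev": 0, "stage": 1, "prod": 2}
env = os.environ.get("APP_ENV", "dev")

redis_client = Redis(host="localhost", port=6379, db=ENV_TO_DB[env])
set_llm_cache(RedisCache(redis_client))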
Go beyond exact matches with semantic caching
Exact caching misses hits when prompts differ by small wording changes. A semantic cache stores an embedding of each prompt and returns a prior answer when a new prompt embeds “close enough” to one it has already seen. This is powerful for chat UX and search-style prompts.
pip install redis
pip install langchain-openai
from langchain_openai import OpenAIEmbeddings
from langchain.cache import RedisSemanticCache
from langchain.globals import set_llm_cache
emb = OpenAIEmbeddings(model="text-embedding-3-small")
set_llm_cache(
    RedisSemanticCache(
        redis_url="redis://localhost:6379/0",
        embedding=emb,
        score_threshold=0.2,  # vector distance: lower = stricter matching
    )
)
Choose the threshold carefully and check how your backend interprets it: RedisSemanticCache treats score_threshold as a vector distance, so lower values are stricter, whereas other semantic caches use a similarity score where higher is stricter. Start strict and loosen only if evaluation shows you are missing good matches.
Bypass or invalidate the cache
Sometimes you want a fresh answer even if a cache entry exists.
from langchain.globals import get_llm_cache, set_llm_cache
# Temporarily disable cache for a single call
prev = get_llm_cache()
set_llm_cache(None)
try:
    fresh = llm.invoke("Explain vector databases in one sentence.")
finally:
    set_llm_cache(prev)
# Clear the cache (supported by most backends)
cache = get_llm_cache()
if hasattr(cache, "clear"):
    cache.clear()
For SQLite, you can also delete the database file. For Redis, consider expiring keys or selective deletion by prefix.
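For the Redis case, selective deletion with plain redis-py might look like the sketch below; the key pattern is an assumption, so inspect how your cache backend actually names its keys before deleting anything.
from redis import Redis

redis_client = Redis(host="localhost", port=6379, db=0)

# Delete cache entries matching an assumed key pattern, scanning in batches.
pattern = "llm-cache:*"  # hypothetical prefix; verify against your backend's keys
for key in redis_client.scan_iter(match=pattern, count=500):
    redis_client.delete(key)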
Use caching in LCEL chains
Caching works seamlessly inside LangChain Expression Language pipelines.
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI
from langchain.globals import set_llm_cache
from langchain.cache import SQLiteCache
set_llm_cache(SQLiteCache(".langchain_cache.db"))
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a concise assistant."),
    ("human", "Summarize: {text}"),
])
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
chain = prompt | llm | StrOutputParser()
out1 = chain.invoke({"text": "LLM caching reduces repeated computation and cost."})
out2 = chain.invoke({"text": "LLM caching reduces repeated computation and cost."}) # cached
print(out1)
Operational tips
- Keys and versions: include content version, tenant, and prompt template version in your prompts to control cache scope.
- TTLs: for Redis, set an expiration where appropriate to avoid staleness and unbounded growth (see the sketch after this list).
- Observability: track cache hit rate. For Redis, expose INFO stats and keyspace metrics; for SQLite, log cache hits in your application.
- Warmups: prefill the cache after deploy by replaying common prompts to avoid cold-start latency for users.
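For the TTL point, recent LangChain releases let RedisCache take a ttl argument in seconds, which bounds both staleness and memory without extra key management; if your version lacks it, fall back to Redis-side expiration. The one-hour value below is only an example.
from redis import Redis
from langchain.globals import set_llm_cache
from langchain.cache import RedisCache

redis_client = Redis(host="localhost", port=6379, db=0)

# Cached generations expire after one hour.
set_llm_cache(RedisCache(redis_client, ttl=3600))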
Security and compliance
- Do not cache secrets, PII, or raw user inputs if policy prohibits persistence. Use in-memory cache or encryption-at-rest where required.
- Segment caches by customer or environment to prevent data leakage.
Common pitfalls
- Stale data: set TTLs and embed context versioning.
- Hidden misses: whitespace or minor prompt differences cause misses; normalize prompts and prefer templates (a small normalizer is sketched after this list).
- Over-caching: avoid caching queries that depend on time or mutable state.
- High variance prompts: with high temperature or randomness, cached outputs may not represent expected variability; consider caching only deterministic steps.
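For the hidden-misses point, a minimal normalizer applied to user input before it is formatted into a template is often enough to stop trivial whitespace differences from fragmenting the cache. This helper is an illustration, not a LangChain API.
def normalize(text: str) -> str:
    """Collapse internal whitespace and trim the ends so equivalent inputs match."""
    return " ".join(text.split())

# "  Explain   vector databases " and "Explain vector databases" now produce the same
# prompt once formatted into the same template, so they share a cache entry.
user_input = normalize("  Explain   vector databases ")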
Backends at a glance
- InMemoryCache: fastest, process-local, great for tests.
- SQLiteCache: single-file persistence, simple ops, good for laptops and CI.
- RedisCache: shared, scalable, supports TTLs and ops tooling.
- RedisSemanticCache: fuzzy matching with embeddings for prompt variants.
- Other integrations: GPTCache and vendor caches can be swapped in if they implement LangChain’s cache interface.
Conclusion
LLM caching is a low-effort, high-impact optimization for LangChain apps. Start with in-memory during development, move to SQLite for reproducibility, and adopt Redis (exact or semantic) in production. With careful scoping, TTLs, and observability, you will cut latency, trim spend, and improve reliability without changing your application logic.