Continuous Evaluation of AI Applications

February 19, 2026 - 2 minutes read - 335 words

Sensible Defaults - AI…

Stage: Draft

Building trust…

Accuracy - Helpfulness - Safety - Latency

Challenges

- Subjectivity - different people judge the same answer differently … as many mouths, as many opinions…
- Open-endedness - natural language…
- Context sensitivity - depends on time, use, …
- Scalability - human eval is the gold standard, but impossible at scale…

So even though automated evaluation is not perfect, it is still needed.

Observability in AI

Log LLM inputs (prompts, settings, parameters), outputs (result, confidence score, ..) and metadata (timestamp, latency, versions, session ID, app context, etc.)
Use the logs to detect errors, performance bottlenecks and token usage spikes, and for debugging and optimization…
OpenTelemetry - collects traces; universal data collection - capture
Prometheus - stores and queries metrics such as error rate and speed - store
Grafana - dashboards, notifications, alerts - view
Latency - time from request to response; very critical for search and voice assistants
Token usage - input and output; log them .. optimise prompts
User satisfaction - are the answers helpful?
- feedback
- implicit signals - the user asks the same question again in different ways..
- timeouts or safety violations
- harmful or violent content
Detecting hallucination
- often looks correct but is incorrect
- RAG
AI bias - human language .. biased against.. - reduces fairness and trust - training on more data, tuning
Toxicity - serious harm to brand… - guardrails, filters, etc.
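
The logging fields noted earlier in this section (inputs, outputs, metadata) could be captured in one structured record per LLM call. This is only a minimal sketch using the standard library: `LLMCallRecord`, `call_and_record`, `rough_token_count` and the `llm` callable are hypothetical names for illustration, and the whitespace token count is a crude stand-in for a real tokenizer.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field


@dataclass
class LLMCallRecord:
    # Inputs
    prompt: str
    settings: dict
    # Outputs
    output: str
    input_tokens: int
    output_tokens: int
    # Metadata
    latency_ms: float
    model_version: str
    session_id: str
    timestamp: float = field(default_factory=time.time)
    request_id: str = field(default_factory=lambda: uuid.uuid4().hex)


def rough_token_count(text):
    # Assumption: whitespace split as a rough proxy; a real tokenizer will differ.
    return len(text.split())


def call_and_record(llm, prompt, settings, model_version, session_id):
    # `llm` is any callable taking (prompt, **settings) and returning a string.
    start = time.perf_counter()
    output = llm(prompt, **settings)
    latency_ms = (time.perf_counter() - start) * 1000.0
    return LLMCallRecord(
        prompt=prompt,
        settings=settings,
        output=output,
        input_tokens=rough_token_count(prompt),
        output_tokens=rough_token_count(output),
        latency_ms=latency_ms,
        model_version=model_version,
        session_id=session_id,
    )


def to_json_line(record):
    # One JSON object per line is easy to ship to a log or trace collector.
    return json.dumps(asdict(record), sort_keys=True)
```

Each JSON line can then be written to whatever log sink is in use. Privacy matters here: prompts may contain user data, so redaction before logging is worth considering.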

RCA (Root Cause Analysis)

Hallucination - RAG failed, could not access
- 5 Whys -
  - why was
- Fishbone
  - where they
- Fault tree - connection between problems

Feedback loops

Tag errors
Feed data back to ..

Human eval

slow, costly, ..

Auto eval

misses tone, quality…

Combine both.. they complement each other…

Did we evaluate everything needed?
What is the recall rate?
Did it retrieve relevant documents?
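
The recall question above can be made concrete for retrieval: of the documents known to be relevant, what fraction shows up in the top k results? A small sketch, assuming a labelled set of relevant document IDs exists for each query (the function name and IDs are illustrative only):

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of relevant documents that appear in the top-k retrieved list."""
    relevant = set(relevant_ids)
    if not relevant:
        # Recall is undefined when nothing is labelled relevant.
        return None
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant) / len(relevant)


def mean_recall_at_k(queries, k):
    """Average recall@k over (retrieved_ids, relevant_ids) pairs, skipping undefined ones."""
    scores = [recall_at_k(ret, rel, k) for ret, rel in queries]
    scores = [s for s in scores if s is not None]
    return sum(scores) / len(scores) if scores else None
```

For example, if the relevant set is `{"d1", "d3", "d4"}` and the top 3 retrieved are `["d1", "d7", "d3"]`, two of three relevant documents were found.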

Design evaluation

logging,
monitoring,
and feedback
performance, cost and privacy.

Integrating with the CI/CD pipeline

Pre-deployment checks
- accuracy, safety, performance
- at the developer's desk
Eval in the CI/CD pipeline
Canary releases, A/B testing..
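
One way to wire the pre-deployment checks into a pipeline is a small gate that compares eval metrics against thresholds and fails the build when any are missed. A sketch only: the metric names and threshold values below are made-up placeholders, not recommendations.

```python
# Hypothetical thresholds: ("min", x) means value must be >= x, ("max", x) means <= x.
THRESHOLDS = {
    "accuracy": ("min", 0.85),
    "safety_pass_rate": ("min", 0.99),
    "p95_latency_ms": ("max", 2000.0),
}


def check_gate(metrics, thresholds=THRESHOLDS):
    """Return a list of human-readable failures; an empty list means the gate passes."""
    failures = []
    for name, (kind, limit) in thresholds.items():
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: missing from eval results")
        elif kind == "min" and value < limit:
            failures.append(f"{name}: {value} is below the minimum {limit}")
        elif kind == "max" and value > limit:
            failures.append(f"{name}: {value} is above the maximum {limit}")
    return failures


def gate_exit_code(metrics):
    """Print failures and return a process exit code (0 = pass, 1 = fail)."""
    failures = check_gate(metrics)
    for failure in failures:
        print("FAIL", failure)
    return 1 if failures else 0
```

In a CI job the exit code would be passed to `sys.exit` so that a failed gate blocks the deployment. The same check can run at the developer's desk before a push.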

Memory

IoT… - lab is not enough, … need

battery is not known.

Alerts playbook

what to do..
who to

Incident Response Plan..

Scalable human evaluation…

sampling
triage - spend human time where it matters most.
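
Sampling and triage can be combined: send every suspicious response to a human, plus a small random sample of the rest as a sanity check on the automated scores. A rough sketch, assuming each record carries an `auto_score` from automated evaluation and an optional `user_flagged` field (both field names are hypothetical):

```python
import math
import random


def select_for_human_review(records, score_threshold=0.5, sample_rate=0.05, seed=0):
    """Split records into a triaged set (always reviewed) and a random sample of the rest."""
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    triaged, rest = [], []
    for record in records:
        suspicious = record["auto_score"] < score_threshold or record.get("user_flagged", False)
        (triaged if suspicious else rest).append(record)
    # Review a fraction of the unflagged records, rounded up.
    sample_size = min(len(rest), math.ceil(len(rest) * sample_rate))
    sampled = rng.sample(rest, sample_size)
    return triaged, sampled
```

The threshold and sample rate are the knobs that trade review cost against coverage, which ties back to the budget question.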

Cost vs Budget

hybrid model.

＃evolution ＃ai ＃genai ＃technology ＃prompt ＃tutorial ＃learnings ＃RAG ＃fine-tuning ＃development ＃embeddings ＃agenticai ＃future ＃practice