Continuous Evaluation of AI Applications
Stage: Draft
Building trust…
Accuracy - Helpfulness - Safety - Latency
Challenges
- Subjectivity - different people … "as many mouths, as many opinions" …
- Open-endedness - natural language…
- Context sensitivity - depending on time, use
- Scalability - human eval is the gold standard, but impossible…
So even if automated evaluation is not perfect, it is still needed.
Observability - in AI
- Logging LLM inputs (prompts, settings, parameters), outputs (result, confidence score, ..) and metadata (timestamp, latency, versions, session id, app context, etc.)
- Detect errors, performance bottlenecks, token usage spikes; debugging and optimization…
- OpenTelemetry - collect traces - universal data collection - capture
- Prometheus - query and store metrics - error rate, speed - store
- Grafana - dashboards, notifications, alerts - view
- Latency - request to response - very critical for search/voice assistants
- Token usage - input and output - log them .. optimise prompts
- User satisfaction - answers are helpful
  - feedback
  - implicit signal - asking the same question again in different ways..
  - timeout or safety violation
  - harmful or violent content
- Detecting hallucination
  - often looks correct but is incorrect
  - RAG
- AI bias - human language .. biased against.. - reduces fairness and trust - training on more data, tuning
- Toxicity - serious harm to brand… - guardrails, filters, etc.
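
The logging, latency and token-usage bullets above could be sketched roughly as follows. This is a minimal illustration and not a definitive implementation: `LLMCallRecord`, `logged_call`, `count_tokens` and `fake_model` are hypothetical names, the whitespace token count is only a crude stand-in for a real tokenizer, and a real setup would ship the record to a tracing or metrics backend instead of printing it.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field


@dataclass
class LLMCallRecord:
    """One log record per LLM call: inputs, outputs and metadata."""
    prompt: str
    settings: dict
    output: str
    input_tokens: int
    output_tokens: int
    latency_ms: float
    model_version: str
    session_id: str
    timestamp: float = field(default_factory=time.time)
    call_id: str = field(default_factory=lambda: uuid.uuid4().hex)


def count_tokens(text: str) -> int:
    # Assumption: whitespace split as a rough proxy; real tokenizers differ.
    return len(text.split())


def logged_call(model, prompt, settings, session_id, model_version="v0"):
    """Call `model` (any callable: prompt -> str) and return (output, record)."""
    start = time.perf_counter()
    output = model(prompt)
    latency_ms = (time.perf_counter() - start) * 1000.0
    record = LLMCallRecord(
        prompt=prompt,
        settings=settings,
        output=output,
        input_tokens=count_tokens(prompt),
        output_tokens=count_tokens(output),
        latency_ms=latency_ms,
        model_version=model_version,
        session_id=session_id,
    )
    return output, record


def fake_model(prompt: str) -> str:
    # Hypothetical stand-in for a real LLM client.
    return "echo: " + prompt


output, record = logged_call(
    fake_model, "What is RAG?", {"temperature": 0.2}, "session-1"
)
print(json.dumps(asdict(record)))  # one JSON line per call
```

Keeping each call as one flat JSON record makes it easy to spot token usage spikes or slow requests later, whichever storage or dashboard is used.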
RCA
- hallucinations - RAG failed, could not access
- 5 Whys -
  - why was
- Fishbone
  - where they
- Fault tree - connection between problems
Feedback loops
- tag errors
- feed data back to ..
Human eval
- slow, costly
Auto eval
- misses tone, quality…
Combine both.. they complement each other…
- did we evaluate everything needed?
- what is the recall rate?
- did it retrieve relevant documents?
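
The recall question above can be made concrete with recall@k: of the documents known to be relevant, what fraction shows up in the top k retrieved results. A minimal sketch, assuming document ids are plain strings and that a labelled set of relevant ids exists for each query (`recall_at_k` and the sample ids are illustrative, not from any particular library):

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant document ids found in the top-k retrieved ids."""
    relevant = set(relevant)
    if not relevant:
        raise ValueError("relevant set must not be empty")
    top_k = set(retrieved[:k])
    return len(top_k & relevant) / len(relevant)


# Hypothetical retrieval result for one query, best match first.
retrieved = ["d3", "d7", "d1", "d9"]
relevant = {"d1", "d2", "d3"}

print(recall_at_k(retrieved, relevant, k=3))
```

Here two of the three relevant documents appear in the top 3, so recall@3 is 2/3. A low value points at the retrieval step rather than the generation step.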
Design evaluation
- logging,
- monitoring,
- and feedback
- performance, cost and privacy.
Integrating with the CI/CD pipeline
- pre-deployment checks
- accuracy, safety, perf
- at dev desk
- eval in CI/CD pipeline
- canary - A/B testing..
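
A pre-deployment check on accuracy, safety and performance could look something like the sketch below. The thresholds, field names (`correct`, `unsafe`, `latency_ms`) and the `evaluate_gate` function are assumptions made for illustration; real values depend on the application.

```python
import math

# Hypothetical thresholds; tune them per application.
THRESHOLDS = {
    "min_accuracy": 0.90,
    "max_unsafe_rate": 0.01,
    "max_p95_latency_ms": 2000.0,
}


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(0.95 * len(ordered)))
    return ordered[rank - 1]


def evaluate_gate(results, thresholds=THRESHOLDS):
    """Return a list of failure messages; an empty list means the gate passes.

    `results` is a list of dicts with keys 'correct' (bool),
    'unsafe' (bool) and 'latency_ms' (float), one per eval case.
    """
    if not results:
        return ["no evaluation results"]
    n = len(results)
    accuracy = sum(r["correct"] for r in results) / n
    unsafe_rate = sum(r["unsafe"] for r in results) / n
    latency = p95([r["latency_ms"] for r in results])

    failures = []
    if accuracy < thresholds["min_accuracy"]:
        failures.append(f"accuracy {accuracy:.2f} below minimum")
    if unsafe_rate > thresholds["max_unsafe_rate"]:
        failures.append(f"unsafe rate {unsafe_rate:.2f} above maximum")
    if latency > thresholds["max_p95_latency_ms"]:
        failures.append(f"p95 latency {latency:.0f} ms above maximum")
    return failures


# Hypothetical eval run: 10 cases, 9 correct, none unsafe.
results = [
    {"correct": i != 0, "unsafe": False, "latency_ms": 100.0 * (i + 1)}
    for i in range(10)
]
failures = evaluate_gate(results)
print("PASS" if not failures else "FAIL: " + "; ".join(failures))
```

In a pipeline, a non-empty failure list would typically make the job exit with a non-zero status so the deployment stops before the canary stage.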
Memory
IoT… - lab is not enough, … need
battery is not known.
Alerts playbook
- what to do..
- who to
Incident Response Plan..
Scalable human evaluation…
- sampling
- triage - human time where it matters more.
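
Sampling and triage can be combined in one selection step: spend the review budget first on flagged or low-confidence outputs, then fill the rest with a random sample so quality is still measured across normal traffic. A minimal sketch, where the record fields (`id`, `confidence`, `flagged`), the 0.5 threshold and the `select_for_review` function are all assumptions for illustration:

```python
import random


def select_for_review(records, budget, confidence_threshold=0.5, seed=0):
    """Pick up to `budget` records for human review.

    Triage: flagged or low-confidence records go first, lowest confidence
    first. Sampling: any remaining budget is filled at random from the rest.
    """
    priority = [
        r for r in records
        if r["flagged"] or r["confidence"] < confidence_threshold
    ]
    priority.sort(key=lambda r: r["confidence"])
    chosen = priority[:budget]

    remaining = budget - len(chosen)
    if remaining > 0:
        chosen_ids = {r["id"] for r in chosen}
        rest = [r for r in records if r["id"] not in chosen_ids]
        rng = random.Random(seed)  # fixed seed keeps the sample reproducible
        chosen += rng.sample(rest, min(remaining, len(rest)))
    return chosen


# Hypothetical logged outputs: two low-confidence, one flagged by a filter.
records = [{"id": i, "confidence": 0.9, "flagged": False} for i in range(10)]
records[2]["confidence"] = 0.2
records[5]["confidence"] = 0.4
records[7]["flagged"] = True

review_queue = select_for_review(records, budget=5)
print([r["id"] for r in review_queue])
```

The first three ids in the queue are the triaged ones (2, 5, 7); the last two are a random sample from the remaining records.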
Cost vs Budget
hybrid model.
#evolution #ai #genai #technology #prompt #tutorial #learnings #RAG #fine-tuning #development #embeddings #agenticai #future #practice