Writing
How I exposed Gmail as AI-native tooling over IMAP/SMTP (no OAuth dance, no browser automation), so any MCP-compatible agent can search, read, and send email through standardised tool calls.
Four specialised agents (Researcher, Judge, Content Builder, Orchestrator) coordinate over Google's A2A protocol to generate structured courses. Here's how the quality loop and SSE streaming work under the hood.
Qwen2.5:3B vs Gemma2:2B vs Llama3.2, measured on TTFT, tokens/sec, and quality across 6 prompt categories. Spoiler: the fastest model isn't the best, and the best isn't even close on speed.
Three serving backends, one GPU, one model, benchmarked across practical, overload, and extreme concurrency regimes. TRT-LLM wins at c=64 by 41%; vLLM and SGLang reclaim the lead when the queue never empties.
Serving a 120B-parameter model from a single card, launching the vLLM server with AITER kernels, wiring an OpenAI-compatible API, and stress-testing with bench serve, bench latency, and bench throughput.
From a single pod to production: GPU Operator setup, MetalLB bare-metal load balancing, request-driven autoscaling, Prometheus observability, and the four failure modes that stop most deployments before a token is served.
Every environment variable and vLLM serve flag explained: AITER master switch, FP8 KV-cache, INT4 all-reduce quantization, chunked prefill trade-offs, and what each knob does to throughput and latency on CDNA3.