Writing
How I exposed Gmail as AI-native tooling over IMAP/SMTP (no OAuth dance, no browser automation), so any MCP-compatible agent can search, read, and send email through standardised tool calls.
Four specialised agents (Researcher, Judge, Content Builder, Orchestrator) coordinate over Google's A2A protocol to generate structured courses. Here's how the quality loop and SSE streaming work under the hood.
Qwen2.5:3B vs Gemma2:2B vs Llama3.2, measured on TTFT, tokens/sec, and quality across 6 prompt categories. Spoiler: the fastest model isn't the best, and the best isn't even close on speed.
Three serving backends, one GPU, one model, benchmarked across practical, overload, and extreme concurrency regimes. TRT-LLM wins at c=64 by 41%; vLLM and SGLang reclaim the lead when the queue never empties.
Serving a 120B-parameter model from a single card, launching the vLLM server with AITER kernels, wiring an OpenAI-compatible API, and stress-testing with bench serve, bench latency, and bench throughput.
From a single pod to production: GPU Operator setup, MetalLB bare-metal load balancing, request-driven autoscaling, Prometheus observability, and the four failure modes that stop most deployments before a token is served.
Every environment variable and vLLM serve flag explained: AITER master switch, FP8 KV-cache, INT4 all-reduce quantization, chunked prefill trade-offs, and what each knob does to throughput and latency on CDNA3.