Draft Outline
- Use vLLM for high-throughput serving and production-style inference workloads (first sketch below)
- Use Hugging Face tooling for experimentation, model access, fine-tuning workflows, and ecosystem breadth (second sketch below)
- Use llama.cpp for lightweight local inference, quantized models, and CPU-only or GPU-constrained environments (third sketch below)
- Compare latency, throughput, hardware needs, ecosystem maturity, and operational complexity (timing sketch after this list)
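
For the vLLM bullet, a minimal sketch of offline batched generation through vLLM's Python `LLM` API, which is where its continuous-batching throughput shows up. The model name here is a placeholder; any checkpoint vLLM supports would do.

```python
# Sketch: batched offline generation with vLLM's Python API.
# The model name is a placeholder; substitute any supported checkpoint.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=128)

# Passing many prompts at once lets vLLM batch them on the GPU,
# which is the high-throughput case the outline points at.
prompts = [f"Summarize point {i} of the outline." for i in range(32)]
outputs = llm.generate(prompts, params)
for out in outputs:
    print(out.outputs[0].text)
```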
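For the Hugging Face bullet, a sketch of the quick-experimentation workflow via `transformers.pipeline`; the `gpt2` checkpoint stands in for whatever model is under evaluation.

```python
# Sketch: quick experimentation with a Hub checkpoint via transformers.
# "gpt2" is a stand-in model chosen only because it is small.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
result = generator("Serving LLMs efficiently requires", max_new_tokens=40)
print(result[0]["generated_text"])
```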
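For the llama.cpp bullet, a sketch using the `llama-cpp-python` bindings against a quantized GGUF file; the model path and quantization level are placeholder assumptions.

```python
# Sketch: lightweight local inference over a quantized GGUF model via
# llama-cpp-python. The model path below is a placeholder.
from llama_cpp import Llama

llm = Llama(model_path="./models/model-Q4_K_M.gguf", n_ctx=2048)
out = llm("Q: Why quantize a model for local inference? A:", max_tokens=64)
print(out["choices"][0]["text"])
```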
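For the comparison bullet, a rough timing harness against an OpenAI-compatible endpoint (both vLLM's `vllm serve` and llama.cpp's `llama-server` expose one). The base URL, API key, and model name are assumptions, and serial requests approximate single-stream latency rather than peak throughput.

```python
# Sketch: crude latency/throughput probe against an OpenAI-compatible
# server. base_url, api_key, and model name are illustrative assumptions.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
prompts = ["Explain paged attention briefly."] * 8

start = time.perf_counter()
total_tokens = 0
for p in prompts:
    resp = client.completions.create(model="served-model", prompt=p, max_tokens=64)
    total_tokens += resp.usage.completion_tokens
elapsed = time.perf_counter() - start

# Serial requests: mean latency is meaningful, throughput is a floor.
print(f"mean latency: {elapsed / len(prompts):.2f}s/request")
print(f"throughput:   {total_tokens / elapsed:.1f} tokens/s (serial)")
```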