How Zyphra went all-in on AMD + Why Devs feel faster with AI but are slower — with Quentin Anthony
In this episode, we dive into the technical and strategic decisions behind building high-performance AI models on alternative hardware, as Quentin Anthony shares insights from his work at Zyphra and EleutherAI. From rethinking GPU ecosystems to pioneering edge-optimized architectures, the conversation reveals how low-level engineering choices are shaping the future of accessible, efficient AI deployment.
Quentin Anthony discusses Zyphra's shift to AMD MI300X GPUs, whose superior memory capacity and bandwidth enable faster training than NVIDIA H100s for certain workloads. He emphasizes writing custom kernels in ROCm, and even in assembly when necessary, bypassing high-level frameworks for maximum control. Zyphra's hybrid Mamba-transformer models, like Zamba 2, achieve competitive performance at smaller parameter counts, making them ideal for edge devices from phones to desktops. Quentin reflects on being a rare success case in the METR AI productivity study, advocating disciplined AI use: avoiding context rot by restarting chats and delegating only well-scoped subtasks. He critiques overreliance on AI for core reasoning and highlights the value of velocity-focused engineering teams. Open-source research at EleutherAI prioritizes interpretability, while he argues companies should invest in developer ecosystems over traditional marketing to drive real innovation.
02:27
MI300X can beat H100 in non-FP8 dense transformers due to high VRAM and memory bandwidth
05:08
Instead of using Triton, we write kernels directly in ROCm and expose them via PyTorch.
12:29
Kernel datasets could improve model training but are not a silver bullet due to validation challenges.
19:38
Training inference-efficient models while considering future hardware compatibility.
26:51
High-quality models are preferred over faster local inference when there's a performance trade-off.
29:11
AI speeds up work only in specific cases, and only with disciplined digital hygiene practices.
45:23
Blindly trusting AI in development can lead to high-cost mistakes.
47:25
Physicists are the "embryonic stem cells" of engineers due to their problem-solving adaptability.
