public-talks

Slides, videos, and supporting files for my public talks

Unlocking Potential of Large Models in Production

Abstract:

The recent paradigm shift from traditional ML to GenAI and LLMs has brought a new set of non-trivial LLMOps challenges around deployment, scaling, and operations, making it an open problem to build an inference platform that meets all business requirements. This talk highlights these new challenges along with best practices and solutions for building large, scalable, and reliable inference platforms on top of cloud-native technologies such as Kubernetes, Kubeflow, KServe, and Knative. Which tools help effectively benchmark and assess the quality of an LLM? What kinds of storage and caching solutions enable fast autoscaling and quick model downloads? How can you ensure your model is optimized for the specialized accelerators running in your cluster? How can A/B testing or rolling upgrades be accomplished with limited compute? What exactly do you monitor in an LLM? In this session we will use KServe as a case study to answer these questions and more.
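
As a taste of the KServe case study, the sketch below uses the KServe Python SDK to touch on the rollout question above: setting canaryTrafficPercent splits traffic between the stable revision and a candidate, while replica bounds keep the rollout within a limited compute budget. This follows the documented v1beta1 SDK pattern, but the service name, namespace, and storage URI are illustrative placeholders, not anything specific to the talk.

```python
# Minimal canary-rollout sketch with the KServe Python SDK (pip install kserve).
# All names and the storage URI below are illustrative placeholders.
from kubernetes import client
from kserve import (
    KServeClient,
    V1beta1InferenceService,
    V1beta1InferenceServiceSpec,
    V1beta1PredictorSpec,
    V1beta1SKLearnSpec,
    constants,
)

isvc = V1beta1InferenceService(
    api_version=constants.KSERVE_GROUP + "/v1beta1",
    kind=constants.KSERVE_KIND,
    metadata=client.V1ObjectMeta(name="demo-model", namespace="default"),
    spec=V1beta1InferenceServiceSpec(
        predictor=V1beta1PredictorSpec(
            canary_traffic_percent=10,  # send 10% of traffic to the new revision
            min_replicas=1,             # keep at least one replica warm
            max_replicas=4,             # cap scale-out on limited compute
            sklearn=V1beta1SKLearnSpec(
                storage_uri="gs://example-bucket/models/demo/v2"  # placeholder
            ),
        )
    ),
)

kserve_client = KServeClient()
# Patching an existing InferenceService with a new storage_uri while
# canary_traffic_percent is set performs the gradual rollout; KServe keeps
# the previously promoted revision serving the remaining 90% of traffic.
kserve_client.patch("demo-model", isvc, namespace="default")
```

Promoting the canary is then a matter of raising the traffic percentage step by step (and ultimately removing the field), which is what makes A/B-style comparisons feasible without doubling the cluster's accelerator footprint.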
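On the monitoring question, the short sketch below illustrates the kind of LLM-specific serving metrics the talk alludes to, using the standard prometheus_client library; the metric names, labels, and bucket boundaries are assumptions for the example, not an established convention.

```python
# Illustrative LLM serving metrics exposed for Prometheus scraping.
# Metric names, labels, and buckets are hypothetical choices for this sketch.
import time

from prometheus_client import Counter, Histogram, start_http_server

# Time to first token: the latency users feel most directly in streaming UIs.
TTFT = Histogram(
    "llm_time_to_first_token_seconds",
    "Time from request arrival to the first generated token",
    buckets=(0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0),
)

# Token throughput, split by direction, for capacity planning and cost tracking.
TOKENS = Counter(
    "llm_tokens_total",
    "Tokens processed by the server",
    ["direction"],  # "prompt" or "completion"
)


def record_request(prompt_tokens: int, completion_tokens: int, ttft_s: float) -> None:
    """Record the metrics for one completed inference request."""
    TTFT.observe(ttft_s)
    TOKENS.labels(direction="prompt").inc(prompt_tokens)
    TOKENS.labels(direction="completion").inc(completion_tokens)


if __name__ == "__main__":
    start_http_server(8000)  # expose /metrics on port 8000 for Prometheus
    record_request(prompt_tokens=512, completion_tokens=128, ttft_s=0.42)
    time.sleep(60)  # keep the endpoint alive long enough for a scrape
```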