At our recent Generative London event, Gemma Garriga (Technical Director, Office of the CTO, Google Cloud) and Meryem Arik (Co-Founder of TitanML) spoke with Antoine Moyroud, Partner at Lightspeed, to decode the intricacies of optimizing large language model (LLM) inference. As interest in deploying LLMs for real-world applications intensifies, the panel shared its thinking on quantization, fine-tuning, hardware choices, benchmarking challenges, and the ever-evolving landscape of model architectures.
Here are four areas of AI technology the panel covered in their discussion:
Quantization and Inference Tooling
The panel dived into quantization techniques, exploring methods to slim down models while preserving accuracy. Meryem highlighted activation-aware weight quantization (AWQ) as a standout, capable of achieving substantial compression without sacrificing accuracy. The discussion also touched on serving options such as TensorRT-LLM, vLLM, Titan Takeoff Inference Server, and Google Cloud’s Vertex AI, along with their varying trade-offs, giving developers an overview of the available tooling.
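To make the compression idea concrete, here is a minimal sketch of group-wise 4-bit weight quantization in plain Python. It is not AWQ itself: AWQ additionally uses activation statistics to protect the most important weight channels, whereas this sketch only does plain round-to-nearest. The group size of 128, the 16-bit scale per group, and all function names are assumptions chosen for illustration.

```python
import random


def quantize_group(weights, bits=4):
    """Symmetric round-to-nearest quantization of one group of weights."""
    qmax = 2 ** (bits - 1) - 1  # 7 for signed 4-bit codes
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_group(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


def quantize(weights, group_size=128, bits=4):
    """Quantize a flat list of weights, one scale per group (assumed layout)."""
    return [
        quantize_group(weights[i:i + group_size], bits)
        for i in range(0, len(weights), group_size)
    ]


def size_in_bytes(n_weights, bits, group_size=None, scale_bits=16):
    """Storage for the weights, plus one scale per group when grouped."""
    total_bits = n_weights * bits
    if group_size:
        n_groups = -(-n_weights // group_size)  # ceiling division
        total_bits += n_groups * scale_bits
    return total_bits / 8


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(1024)]  # toy "layer"

groups = quantize(weights)
restored = [w for codes, scale in groups for w in dequantize_group(codes, scale)]
max_err = max(abs(a - b) for a, b in zip(weights, restored))

fp16_bytes = size_in_bytes(len(weights), 16)                 # 2048.0
int4_bytes = size_in_bytes(len(weights), 4, group_size=128)  # 528.0
print(f"fp16: {fp16_bytes:.0f} B, int4: {int4_bytes:.0f} B, max error: {max_err:.5f}")
```

On this toy layer the 4-bit version is roughly 3.9 times smaller than 16-bit storage, at the price of a small reconstruction error per weight. Whether that error matters for a real model is exactly what needs to be measured on the target workload.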
Fine-Tuning vs Prompting
Drawing from practical examples, Gemma shed light on the efficacy of fine-tuning and prompting strategies. She shared real-life examples of leveraging fine-tuning and other techniques to deliver complex automated experiences, e.g. for a leading quick-service restaurant in the USA. Gemma highlighted how fine-tuning lets smaller models become more precise at specialized tasks while offering better serving performance. Meryem’s work demonstrated how prompting techniques still excel in tasks like knowledge injection or generating structured outputs, notably JSON that conforms to a schema. We also discussed existing solutions such as Outlines and the JSON mode announced at the latest OpenAI DevDay. Meryem shared her view that fine-tuning will gradually die out as base models and alternative techniques improve.
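As a rough illustration of the prompting route to structured output, the sketch below asks for JSON, validates the reply, and retries on failure. The `generate` callable stands in for any LLM call and is hypothetical, as are the order fields and the retry wording; the fake model exists only so the example runs without a real endpoint.

```python
import json

# Hypothetical schema for a food order: field name -> expected Python type.
REQUIRED_FIELDS = {"item": str, "quantity": int}


def build_prompt(user_text):
    """Ask the model for JSON only, describing the expected shape."""
    return (
        "Extract the order from the text below. "
        'Reply with JSON only, in the form {"item": string, "quantity": integer}.\n'
        f"Text: {user_text}"
    )


def parse_order(raw):
    """Return the parsed dict if raw is valid JSON with the expected fields, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    for name, expected_type in REQUIRED_FIELDS.items():
        value = data.get(name)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected_type):
            return None
    return data


def extract_order(generate, user_text, max_attempts=3):
    """Call the model until it returns a valid order or attempts run out."""
    prompt = build_prompt(user_text)
    for _ in range(max_attempts):
        order = parse_order(generate(prompt))
        if order is not None:
            return order
        prompt += "\nYour previous reply was not valid. Reply with JSON only."
    raise ValueError("model did not return valid JSON")


def make_fake_model():
    """Stand-in for a real LLM: first reply is prose, second is valid JSON."""
    replies = iter(["Sure! Here is the order.", '{"item": "burger", "quantity": 2}'])
    return lambda prompt: next(replies)


order = extract_order(make_fake_model(), "Two burgers please")
print(order)  # {'item': 'burger', 'quantity': 2}
```

This validate-and-retry loop checks the output after generation. Tools such as Outlines take a different route and constrain the generation step itself, which avoids the retries at the cost of tighter coupling to the inference stack.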
Optimizing Total Cost of Ownership
Beyond shrinking model sizes, the conversation covered the broader spectrum of cost optimization. Shorter prompts, model pruning, and runtime tweaks emerged as additional avenues to reduce expenses. The panel also underlined the importance of benchmarking on actual workloads rather than in confined lab tests. Gemma stressed striking the right balance between leveraging the latest open source technologies and considering total cost of ownership, and she recommended evaluating techniques against real-world enterprise use cases. Reducing the cost of inference might end up increasing the total cost of ownership of a solution, for instance when self-hosting adds engineering and maintenance overhead, and a smarter way to start testing product-market fit is often simply to use managed APIs that already integrate with MLOps platforms.
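The point that cheaper inference can still mean a higher total cost of ownership can be shown with some back-of-the-envelope arithmetic. Every number below (token price, GPU rate, engineering cost) is a made-up placeholder, not a figure from the panel, and the model ignores many real factors such as capacity limits and utilization.

```python
from dataclasses import dataclass

HOURS_PER_MONTH = 24 * 30


@dataclass
class ManagedApi:
    price_per_1k_tokens: float  # placeholder price, pay per use

    def monthly_cost(self, tokens_per_month):
        return tokens_per_month / 1000 * self.price_per_1k_tokens


@dataclass
class SelfHosted:
    gpu_hourly_rate: float      # placeholder cloud GPU price
    gpus: int
    engineering_monthly: float  # people time to run and maintain the stack

    def monthly_cost(self, tokens_per_month):
        # Simplification: fixed capacity, assumed sufficient for the traffic.
        infra = self.gpu_hourly_rate * self.gpus * HOURS_PER_MONTH
        return infra + self.engineering_monthly


def break_even_tokens(api, hosted):
    """Monthly token volume at which both options cost the same."""
    return hosted.monthly_cost(0) / api.price_per_1k_tokens * 1000


api = ManagedApi(price_per_1k_tokens=0.002)
hosted = SelfHosted(gpu_hourly_rate=2.0, gpus=1, engineering_monthly=5000.0)

for tokens in (100_000_000, 5_000_000_000):
    print(
        f"{tokens:>13,} tokens/month: "
        f"API ${api.monthly_cost(tokens):,.0f} vs "
        f"self-hosted ${hosted.monthly_cost(tokens):,.0f}"
    )
print(f"break-even at about {break_even_tokens(api, hosted):,.0f} tokens/month")
```

With these placeholder figures the self-hosted option only wins above roughly 3.2 billion tokens a month, which is why starting on a managed API can be the cheaper way to test product-market fit. Plugging in measured numbers from a real workload is what makes the comparison meaningful.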
Adapting to the Shifting Model Landscape
Acknowledging the continued relevance of transformers, the panel stressed the evolving nature of model architectures. Encouraging developers to focus on real-world use cases, Meryem noted that models evolve rapidly: transformers themselves emerged unexpectedly only a few years ago. She pointed out that we have not yet reached the limits of what can be achieved through increased compute and training on larger datasets. The panel agreed it is difficult to predict what the next major breakthrough may be. Companies must balance extracting value from today’s techniques with future-proofing their systems to some degree. As new architectures appear, lab benchmarks may also quickly become outdated, highlighting the importance of real-world use case evaluation.
As the generative AI journey unfolds, the emphasis remains on real-world optimization, marrying technical innovation with practical deployment. Continuous advancements in quantization, runtime optimization, and more pave the way for exciting prospects in deploying large language models at scale. The intent of the discussion was to provide a compass for developers and founders, directing their focus towards the pragmatic implementation of generative AI while keeping an eye on the latest developments.