10/06/2023

Enterprise

The Challenge of Building LLMs

Generative Paris

Lightspeed recently kicked off the Generative Europe event series with its inaugural meetup in Paris. This first edition gathered an impressive audience of founders, technical experts, researchers, and PhD graduates working with Large Language Models. Panelists Thomas Scialom, lead researcher on Llama 2, and Guillaume Lample, Chief Science Officer at Mistral AI and lead researcher on LLaMA, were joined on stage by moderator Antoine Moyroud, a Partner at Lightspeed focusing on AI.

Throughout the talk, the three addressed the latest technical developments and challenges in building and evaluating state-of-the-art LLMs through an in-depth exploration of pre-training techniques, software optimization strategies, benchmarking methodologies, and the interdependence of hardware and software advancements.

Here are four takeaways from the event, and a reminder to join us at one of our next Generative Europe sessions in Berlin or London!

1. The Critical Role of Software Optimization in Pre-training and Compute Optimization

Guillaume began by emphasizing that even on the same hardware, software optimizations can lead to dramatic training speedups. He cited the example of early CUDA implementations for LSTMs being slow until a single software update from Nvidia accelerated training 3x overnight on the same GPUs.

Both Thomas and Guillaume discussed at length how difficult it is for new hardware providers to replicate the mature software stacks, with optimizations like those in CUDA, that Nvidia has built up over ten years of supporting the deep learning community.

They analyzed why new architectures struggle without years of investment in a full software ecosystem around them: optimized libraries, tools, and frameworks developed through open collaboration. We also discussed the risk of teams not optimizing their Python code to leverage internal GPU clusters effectively, which can leave hardware underutilized and performance on the table. Both delved into examples of why software optimization is so challenging yet critical, such as ensuring models maximize parallelism, and they debated at length why achieving good performance requires considering both how models are architected to utilize hardware and software-level optimizations.
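To make the point concrete, here is a minimal, illustrative PyTorch sketch (not code shown at the event) of two purely software-level changes, mixed-precision training and kernel fusion via torch.compile, that can speed up a training loop on exactly the same GPU. It assumes PyTorch 2.x and a CUDA device.

```python
# Illustrative sketch: software-level optimizations on unchanged hardware.
# Assumes PyTorch 2.x with a CUDA GPU available.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)).cuda()
model = torch.compile(model)  # graph capture / kernel fusion, no hardware change
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()  # loss scaling for mixed-precision training

x = torch.randn(64, 1024, device="cuda")
target = torch.randn(64, 1024, device="cuda")

for step in range(100):
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast(dtype=torch.float16):  # run matmuls in fp16
        loss = nn.functional.mse_loss(model(x), target)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```

Neither change touches the model's architecture or the GPU itself, which is exactly the kind of "same hardware, better software" speedup the panel described.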

The panelists explored the close intertwining of hardware and software. Better pre-trained models enable more parameter-efficient fine-tuning, reducing computational needs. Meanwhile, optimized software stacks are required to scale models and fully leverage growing hardware capabilities.
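As a rough illustration of what parameter-efficient fine-tuning looks like in practice (our example, not the panelists' code), the LoRA-style sketch below freezes a pre-trained linear layer and trains only a small low-rank adapter on top of it.

```python
# Illustrative LoRA-style parameter-efficient fine-tuning sketch.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the pre-trained weights
        self.lora_a = nn.Parameter(torch.randn(base.in_features, rank) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(rank, base.out_features))
        self.scaling = alpha / rank

    def forward(self, x):
        # Frozen base projection plus a trainable low-rank update.
        return self.base(x) + (x @ self.lora_a @ self.lora_b) * self.scaling

layer = LoRALinear(nn.Linear(768, 768))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable params: {trainable} / {total}")  # a small fraction of the total
```

Because the base weights stay frozen, only the adapter matrices are updated, which is why stronger pre-trained models can be adapted at a fraction of the original compute cost.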

They analyzed the chicken-and-egg problem new architectures face: they need software support to demonstrate their capabilities, yet without community adoption there is little justification for dedicating time to “less common” architectures. We also discussed Sara Hooker’s “The Hardware Lottery” paper on this topic.

2. Benchmarking Complexities

Guillaume explored in greater depth the sources of noise and inconsistencies between implementations in current benchmarks. Factors like prompt formatting, minor code changes, and hardware-dependent optimizations can produce variance in results, calling into question the reliability of performance comparisons. Both advocated for standardized, reproducible evaluation pipelines. They delved into technical obstacles that make benchmark development difficult, such as the sensitivity of metrics to small variations.
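To illustrate one such source of noise, the snippet below (a hypothetical sketch, not the panel's evaluation code) renders the same multiple-choice question with two different prompt templates; `query_model` is a placeholder for whatever model API is being benchmarked, and measured accuracy can shift simply because the template changes.

```python
# Hypothetical sketch of prompt-format sensitivity in benchmarks.
QUESTION = "What is the capital of France?"
CHOICES = ["Berlin", "Madrid", "Paris", "Rome"]

def template_a(question, choices):
    opts = "\n".join(f"{chr(65 + i)}. {c}" for i, c in enumerate(choices))
    return f"Question: {question}\n{opts}\nAnswer:"

def template_b(question, choices):
    opts = " / ".join(choices)
    return f"{question} Options: {opts}. Reply with the option letter only."

def evaluate(query_model, template, dataset):
    """Score accuracy under one template; `gold` is the correct option letter.
    Re-running with a different template can change the score for the same model."""
    correct = 0
    for question, choices, gold in dataset:
        prediction = query_model(template(question, choices)).strip()
        correct += prediction.startswith(gold)
    return correct / len(dataset)

# The two renderings of the same item that a model would actually see:
print(template_a(QUESTION, CHOICES))
print(template_b(QUESTION, CHOICES))
```

Standardized, reproducible pipelines are essentially an agreement to pin down details like these so that reported numbers are comparable.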

Thomas provided examples of new complexities introduced by fine-tuning, like its multi-task nature and its reliance on human evaluation. The speakers analyzed the difficulty of experimentally assessing models as they tackle more nuanced problems.

We discussed the need for continued research to address these benchmarking limitations, touching on the recent release of Microsoft Research’s Phi-1.5.

We discussed open challenges like developing techniques to evaluate models beyond traditional metrics as capabilities exceed human performance. The speakers also debated approaches for automated evaluations that introduce less variability than ratings from different human evaluators.

Overall, the conversation underscored benchmarking as an important area still requiring significant progress and emphasized that evaluation will remain a challenge as models push into more complex problem spaces that strain existing methodologies. Continued collaboration was seen as key to advancing benchmarking standards and practices.

3. The Global State of Compute

Guillaume acknowledged that European providers currently lag behind their American counterparts in available compute and data resources. However, he highlighted growing public investment from European governments looking to strengthen domestic AI capabilities. Countries are building out national supercomputing clusters and dedicated language model facilities, such as the Leonardo cluster, the Jean Zay supercomputer, and the future Jules Verne initiative supported by the EuroHPC Joint Undertaking.

Approaches like federated learning could help make better use of Europe’s many smaller clusters, but the panelists speculated that current algorithms are not yet on par with centralized alternatives in such a decentralized scenario. Limited interconnectivity and heterogeneous hardware were also discussed as limiting factors.
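For readers unfamiliar with the idea, here is a minimal federated-averaging (FedAvg-style) sketch in PyTorch, purely illustrative and not a system discussed on stage: each cluster trains its own copy of the model, and a coordinator averages the resulting parameters.

```python
# Illustrative FedAvg-style round across a few "clusters" (here, local batches).
import copy
import torch
import torch.nn as nn

def local_step(model, data, targets, lr=1e-3):
    # One local gradient step on this cluster's data.
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    opt.zero_grad()
    nn.functional.mse_loss(model(data), targets).backward()
    opt.step()

def federated_average(global_model, cluster_batches):
    local_states = []
    for data, targets in cluster_batches:
        local = copy.deepcopy(global_model)  # each cluster trains its own copy
        local_step(local, data, targets)
        local_states.append(local.state_dict())
    averaged = {
        key: torch.stack([state[key] for state in local_states]).mean(dim=0)
        for key in local_states[0]
    }
    global_model.load_state_dict(averaged)  # coordinator merges the updates
    return global_model

model = nn.Linear(16, 1)
batches = [(torch.randn(32, 16), torch.randn(32, 1)) for _ in range(3)]
model = federated_average(model, batches)
```

In practice, the interconnect and hardware-heterogeneity constraints mentioned above are exactly what makes this harder, and slower, than the sketch suggests.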

Both Guillaume and Thomas saw opportunities for Europe if these infrastructure gaps can be addressed. They agreed challenges remain as different regions work to establish competitive homegrown AI industries and contribute to collaborative research at scale. The recent Iliad Group x Scaleway announcements are a notable step in that regard.

4. A Path Forward Through Open Collaboration

The panel shared their optimism that open collaboration and rapidly advancing hardware will help propel the entire field forward. They analyzed how open sharing of research through conferences and preprints has already accelerated progress tremendously. While individual closed models may achieve state-of-the-art performance for periods of time, over the long arc of progress open approaches tend to catch up and eventually surpass them by building on each other’s work. They were confident that as hardware capabilities expand exponentially in the coming years, open source models will match or exceed their currently closed counterparts through the cumulative, collaborative efforts of a global research community tackling challenges openly.

Thomas and Guillaume emphasized their view that international cooperation, rather than competition, holds the greatest promise to advance the frontier of language model techniques for the benefit of all. Coincidence (or not), the Mistral team recently released their Mistral 7B model, which the open source community has quickly started testing, benchmarking, and fine-tuning for different applications. Discover more about the Mistral 7B model here.
