The year 2024 brought numerous exciting conferences – one of them was ICLR (the International Conference on Learning Representations) in Vienna. In this article, Jens Brandt, a PhD student at the IDE+A Institute, shares his impressions of the conference, research highlights, and personal experiences from the world of machine learning.
ICLR 2024 Primer
Dominant Keywords:
– Large Language Model, Reinforcement Learning, Diffusion Model, Graph Neural Network, Generative Model, Interpretability
KPIs
– 7262 Submissions, 2260 Accepted
– 6533 Total Attendees
– 8950 Reviewers
– 1647 USA, 814 China, 494 Germany
– 41.1% of all Submissions Written with LLM Support
Summary
This year’s ICLR was dominated by research on large language models and visual and multi-modal foundation models. Much of the work focused on in-context learning, fine-tuning strategies, model size reduction, and the mechanistic understanding of the underlying dynamics of large foundation models. Beyond that, Reinforcement Learning, Data-Centric ML and AI for Science (e.g. PINNs) were also very prominent in the poster sessions.
Keynote Highlights
In her keynote talk – Why your work matters for climate in more ways than you think – Priya L. Donti (Assistant Professor, MIT; Co-Founder and Chair, Climate Change AI) elaborated on the influence of the AI community on the climate. AI applications, ranging from forecasting models to enhancing machine efficiency, offer promising avenues for understanding and mitigating climate change. In contrast, there are also AI applications that increase emissions, e.g. efficiency improvements in the oil industry. Gains in hardware efficiency, which historically kept pace with the demand for computing power, have slowed down, while in recent years humanity has required ever more compute (AI applications included). Moreover, the systemic societal impacts of AI continue to expand. Highlighting a disparity between the prevailing AI paradigm and the reality on the ground, she argued that this reality is often characterized by little data that is difficult to move, limited computing power (e.g. edge devices), and the need to save energy. Methodological frontiers critical for addressing climate concerns include physics-informed ML, safe and robust ML, interpretable ML, uncertainty quantification, generalization, causality, energy-efficient ML, TinyML, and AutoML.
In her keynote talk – Learning through AI’s winters and springs: unexpected truths on the road to AGI – Raia Hadsell (Senior Director of Research and Robotics at DeepMind) reflected on the past decades of AI research. She argued that the heyday of Reinforcement Learning is over, since learning from scratch is extremely challenging and no sufficient level of generalization currently seems achievable. She sees the biggest chance to break through these hard walls in deploying many agents in real-world environments instead of simulations, since reality is far more complex and noisy than any simulation. Furthermore, she discussed whether models should be multitudes or monoliths and highlighted the advantages of multitudes with respect to distributed training on differing hardware (see DiLoCo and DiPaCo). Lastly, she talked about the great possibilities in AI for Science (e.g. GenCast).
In his keynote talk – The ChatGLM’s Road to AGI – Tang Jie (Professor, Tsinghua University) talked about a suite of models from Zhipu AI that match OpenAI’s offerings. He highlighted that most of their models are open source and elaborated on their idea of AGI and some findings regarding emergent abilities.
In his keynote talk – The emerging science of benchmarks – Moritz Hardt (Director, Max Planck Institute for Intelligent Systems) analyzed the dynamics of the publicly available benchmarks that are used extensively in the research community. He showed that the vault assumption does not hold for these test sets, because there is a closed feedback loop between the ML community and the benchmarks, which in theory should reduce the lifespan of such benchmark sets. Luckily, the competitive nature of these benchmarks seems to act as a kind of regularization and “recovers” the vault assumption. Furthermore, he showed that the extensive data cleaning and curation of ImageNet was probably not necessary [1]. Finally, he discussed the polymorphic era of benchmarks, multi-task benchmarks, concerns about static benchmarks, and the ambitious plan to develop dynamic benchmarks that are continuously adversarially extended.
Test of Time Award
For the ICLR Test of Time Award, the program chairs examined papers from ICLR 2013 and 2014 and looked for ones with long-lasting impact. The winner is Auto-Encoding Variational Bayes by Diederik Kingma and Max Welling [2]. This paper gave rise to the variational autoencoder (VAE) and to the integration of deep learning with scalable probabilistic inference [3].
Research Highlights
A lot of works investigated how the size of a model can be reduced without losing much performance in the process. Pruning approaches, which try to remove less important weights from a model, and initialization approaches, which initialize a small model with weights derived from a trained large model, seemed especially promising [4-8]. The work of A. Bair et al. stood out in particular because it proposes a pruning method that searches for a sparse model that does not lose its out-of-distribution robustness (see Fig. 1) [9].
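To make the idea of pruning concrete, here is a minimal, generic sketch of magnitude pruning in numpy, the simplest member of this family. It is only an illustration, not the specific method of any of the cited papers [4-9]; the function name and sparsity level are made up.

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    if k == 0:
        return weights.copy()
    # k-th smallest magnitude serves as the pruning threshold
    threshold = np.partition(flat, k - 1)[k - 1]
    mask = np.abs(weights) > threshold
    return weights * mask

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64))        # stand-in for a trained weight matrix
w_pruned = magnitude_prune(w, 0.5)   # exactly half the entries become zero
```

Real methods differ mainly in how importance is scored (e.g. using activations or robustness criteria rather than raw magnitude) and in whether the network is fine-tuned afterwards.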

My favorite paper and presentation of the conference, Vision Transformers Need Registers by T. Darcet et al., investigated artifacts in the attention maps of vision transformers, especially high-norm tokens in low-informative background areas. The authors assume that those tokens are used by the model to store some kind of global information for internal computation. The simple fix is to append additional register tokens to the input that serve as storage for this global information. The work of M. Sun et al. investigated this phenomenon as well (see Fig. 2). They analyzed what happens when these high activations are removed during inference and found that the magnitude of the activations matters, but not their exact values. They therefore interpreted these activations as a kind of learned bias. To fix the issue, which might cause stability problems in training, they found an even simpler remedy that does not require appending additional tokens. [10,11]
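For intuition, the register fix amounts to nothing more than concatenating a few extra learnable tokens to the patch sequence and discarding them at the output. A minimal numpy sketch with made-up dimensions (the transformer blocks themselves are omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_patches, n_registers = 16, 196, 4

# Patch embeddings for one image (illustrative random values).
patch_tokens = rng.normal(size=(n_patches, d_model))

# Learnable register tokens: appended to the input, discarded at the output.
# The idea is that the model uses them as scratch space for global information
# instead of hijacking low-informative background patches.
register_tokens = rng.normal(size=(n_registers, d_model))

# Input to the transformer blocks: [patches; registers]
x = np.concatenate([patch_tokens, register_tokens], axis=0)

# ... transformer blocks would process x here ...

# At the output, only the patch tokens are kept for downstream use.
out_patches = x[:n_patches]
```

In the actual ViT setup the register tokens are trained parameters shared across images (like the class token), not per-image random vectors as in this sketch.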

2nd Workshop on Mathematical and Empirical Understanding of Foundation Models
The workshop kicked off with a talk by Sasha Rush (Associate Professor, Cornell University; Hugging Face). He contrasted new sequence models like Mamba (a state space model) and xLSTM with transformers. The biggest benefit of the former lies in their faster, less memory-hungry computation, which especially comes in handy when training models with very long context lengths. After that, he presented MambaByte, a language model that works without tokenization (on the byte level), which would not be feasible with transformer models because their compute grows quadratically with sequence length. Lastly, he introduced DiffuSSM, an SSM for image generation.
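The efficiency argument can be illustrated with a toy linear state space recurrence: each step only updates a fixed-size state, so time and memory grow linearly with sequence length, whereas self-attention compares all pairs of tokens. This is a generic discrete SSM sketch, not Mamba's actual (selective) parameterization:

```python
import numpy as np

def ssm_scan(A, B, C, u):
    """Discrete linear SSM: h_t = A h_{t-1} + B u_t, y_t = C h_t.
    One fixed-size state update per step, so cost is O(sequence length),
    unlike attention's O(sequence length^2) pairwise interactions."""
    h = np.zeros(A.shape[0])
    ys = []
    for u_t in u:
        h = A @ h + B * u_t
        ys.append(C @ h)
    return np.array(ys)

rng = np.random.default_rng(0)
d = 8
A = np.eye(d) * 0.9          # toy stable state transition
B = rng.normal(size=d)
C = rng.normal(size=d)
u = rng.normal(size=100)     # a length-100 scalar input sequence
y = ssm_scan(A, B, C, u)     # one output per input step
```

Because the recurrence is linear, models like Mamba can additionally compute it with a parallel scan at training time; this sequential loop is just the simplest way to show the per-step cost.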
After a talk by Yuandong Tian (Meta, FAIR) about recent findings with respect to the self-attention mechanism, Hannaneh Hajishirzi (University of Washington) presented OLMo, a state-of-the-art, truly open LLM and framework with a corresponding cleaned training dataset. If you are interested in all the details of training modern LLMs, this work is for you. There were further interesting presentations about the inner workings of transformer models in the afternoon. If you are curious, I can provide some more material on this, but my train ride home is about to end soon, so I will stop here.
Sources
1. https://arxiv.org/abs/2404.02112
2. https://arxiv.org/abs/1312.6114
3. https://blog.iclr.cc/2024/05/07/iclr-2024-test-of-time-award/
4. https://arxiv.org/pdf/2306.11695
5. https://openreview.net/pdf?id=dyrGMhicMw
6. https://openreview.net/pdf?id=ldJXXxPE0L
7. https://openreview.net/pdf?id=kOBkxFRKTA
8. https://arxiv.org/abs/2303.04947
9. https://openreview.net/forum?id=QFYVVwiAM8