Growing machine learning models

Most of the groundbreaking results we have seen lately in machine learning rely on enormous neural networks containing hundreds of millions of parameters. Training such a model is not only time-consuming but also carries a considerable environmental cost. Moreover, training ChatGPT, for example, is estimated to have cost around one million dollars.

These issues make it all but impossible for an individual, or even a smaller company, to conduct research on such models, which slows the overall pace of technological advancement below what it could be.

Therefore, at TechnoLynx, we believe that lowering the barrier to entry for these technologies is of utmost importance, and part of that is optimizing the training process.

Luckily, research is being done on this topic, and we welcome LiGO, a new method that trains a large model by growing it from a smaller one. LiGO is reported to outperform previous model-growth techniques, let alone training from scratch. As a surprising bonus, networks trained using LiGO in many cases show better results than those trained with traditional techniques.

Why model growth matters in practice

The cost wall around frontier models is not only financial. It shapes who gets to do the work, what gets attempted, and how long each iteration takes. When a single training run consumes the kind of budget that only well-capitalised labs can absorb, the search space for new architectures narrows sharply. Smaller groups end up fine-tuning what already exists rather than asking the harder structural questions about how these systems should be built in the first place.

Model-growth techniques like LiGO change the arithmetic. Instead of initialising a large network from random weights and paying the full training bill, the approach reuses a smaller trained network as scaffolding — expanding it along width and depth while preserving what was already learned. The result is fewer wasted gradient steps and, in the cases reported, a final model that sometimes generalises better than its from-scratch equivalent. That second outcome is the more interesting one. It hints that the path a network takes through optimisation matters as much as its final shape.
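To make the idea of expanding a network while preserving what it has learned more concrete, here is a minimal sketch in Python with NumPy. It is not LiGO itself, which learns how to map the small model's parameters onto the large one. Instead it shows a simpler, hand-written form of width growth: hidden units are duplicated and their outgoing weights are split, so the wider network computes exactly the same function as the smaller one before any further training. The two-layer network and the names mlp_forward and grow_width are hypothetical, chosen only for this illustration.

```python
import numpy as np


def mlp_forward(x, w1, b1, w2, b2):
    """Two-layer MLP with a ReLU hidden layer (illustrative toy model)."""
    hidden = np.maximum(x @ w1 + b1, 0.0)
    return hidden @ w2 + b2


def grow_width(w1, b1, w2, new_width, rng):
    """Widen the hidden layer to new_width units without changing the output.

    Hypothetical helper: extra units are copies of randomly chosen existing
    units, and each unit's outgoing weights are divided by its number of
    copies so the copies together contribute what the original unit did.
    """
    old_width = w1.shape[1]
    if new_width < old_width:
        raise ValueError("new_width must be at least the current width")

    # Keep every original unit, then pick existing units to duplicate.
    extra = rng.integers(0, old_width, size=new_width - old_width)
    mapping = np.concatenate([np.arange(old_width), extra])
    counts = np.bincount(mapping, minlength=old_width)

    w1_big = w1[:, mapping]                            # copy incoming weights
    b1_big = b1[mapping]                               # copy biases
    w2_big = w2[mapping, :] / counts[mapping][:, None]  # split outgoing weights
    return w1_big, b1_big, w2_big


# Stand-in for a small trained network (random weights, for illustration only).
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))
w1, b1 = rng.normal(size=(4, 8)), rng.normal(size=8)
w2, b2 = rng.normal(size=(8, 3)), rng.normal(size=3)

small_out = mlp_forward(x, w1, b1, w2, b2)
w1_big, b1_big, w2_big = grow_width(w1, b1, w2, new_width=12, rng=rng)
big_out = mlp_forward(x, w1_big, b1_big, w2_big, b2)

# The wider network starts from the same function as the small one.
print(np.allclose(small_out, big_out))
```

In a sketch like this, the larger network begins training from the point the smaller one had already reached rather than from random weights, which is where the savings in gradient steps would come from.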

Credits: MIT News