Function-Preserving Expansion and the Unsolved Composition Problem in Continual Learning
Catastrophic forgetting remains a fundamental obstacle in the deployment of large language models. When practitioners adapt a pre-trained model to specialized domains, whether medical diagnosis or scientific reasoning, standard fine-tuning procedures overwrite the parameter values that encode general world knowledge. The resulting model may excel at its narrow specialty while losing basic arithmetic or linguistic competencies, rendering it unsuitable for real-world applications that require both expertise and general competence.
Existing solutions broadly fall into two categories, each with distinct limitations. Regularization-based methods, such as elastic weight consolidation, penalize deviations from the original parameter values. However, these approaches impose a zero-sum constraint: within a fixed capacity, any computational resources dedicated to preserving past knowledge necessarily reduce the capacity for acquiring new skills. Alternatively, capacity growth methods add new parameters for new tasks while freezing the original model, but they introduce a dilemma between training stability and learning efficiency. Randomly initialized expansion modules ensure function preservation at initialization but ignore the rich knowledge embedded in pre-trained weights. Conversely, methods that reuse pre-trained weights typically violate the function-preserving constraint, leading to unstable training dynamics.
In "Grow, Don't Overwrite: Fine-tuning Without Forgetting," Adila, Mazzawi, Dherin, and Gonzalvo from Google Research and the University of Wisconsin-Madison propose a novel function-preserving expansion technique that resolves this stability-efficiency trade-off. Their approach expands model capacity by replicating pre-trained parameters within transformer MLP submodules while applying a scaling correction that guarantees mathematical identity to the original model at initialization.
The Mechanics of Stable Capacity Expansion
The core innovation lies in a surgical modification of the MLP submodules within transformer architectures. A standard MLP layer consists of an up-projection matrix W^(1) that expands the hidden dimension from size h to an intermediate dimension p, followed by a nonlinearity and a down-projection matrix W^(2) that compresses back to dimension h. The authors' method expands the internal capacity by creating k copies of the up-projection weights, effectively multiplying the intermediate dimension by k. To ensure the expanded model remains functionally identical to the original at initialization, they apply a compensatory scaling to the down-projection: its weights are tiled to match the widened intermediate dimension and divided by the replication factor k. This scaling exactly cancels the k-fold duplication of the internal activation, preserving the mapping from input to output.
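The identity at initialization is easy to verify numerically. Below is a minimal numpy sketch of this construction (the helper names `expand_mlp` and `mlp` are ours, not the authors', and the real implementation surely differs in detail):

```python
import numpy as np

# Sketch of function-preserving MLP expansion: replicate the up-projection
# k times (widening the intermediate dim from p to k*p), then tile the
# down-projection's columns with a 1/k scale so the expanded MLP computes
# exactly the same function as the original at initialization.
def expand_mlp(W1, W2, k):
    W1_big = np.concatenate([W1] * k, axis=0)        # (k*p, h)
    W2_big = np.concatenate([W2] * k, axis=1) / k    # (h, k*p)
    return W1_big, W2_big

def mlp(W1, W2, x):
    # Standard two-layer MLP with a ReLU nonlinearity (stand-in for the
    # activation actually used in the paper's models).
    return W2 @ np.maximum(W1 @ x, 0.0)

rng = np.random.default_rng(0)
h, p, k = 8, 32, 3
W1 = rng.normal(size=(p, h))
W2 = rng.normal(size=(h, p))
x = rng.normal(size=h)

W1_big, W2_big = expand_mlp(W1, W2, k)
# Each of the k widened activation blocks is identical, and each carries a
# 1/k-scaled copy of W2, so their contributions sum back to the original output.
assert np.allclose(mlp(W1, W2, x), mlp(W1_big, W2_big, x))
```

The assertion passes because the k duplicated activation blocks each contribute (1/k)·W^(2)·a, summing to the original output.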
The paper introduces two training variants. In G-Freeze, only the newly added parameters are trained while all original parameters remain frozen. In G-Train, the entire up-projection matrix is trained while the down-projection matrix remains frozen. Both variants ensure that the model can learn new task-specific representations without destabilizing the pre-trained knowledge embedded in the frozen components. Notably, even when expanding every MLP layer in the network, this approach requires training only 60% of the original model's parameter count, offering substantial computational savings compared to standard fine-tuning, which modifies 100% of the parameters.
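Mechanically, the two variants differ only in which blocks of the expanded up-projection receive gradient updates. The numpy sketch below models this with boolean masks over a plain SGD step; this is our illustrative construction, not the authors' training code:

```python
import numpy as np

def masked_sgd_step(param, grad, mask, lr=0.1):
    # Update only where mask is True; frozen entries stay bit-identical.
    return np.where(mask, param - lr * grad, param)

p, h, k = 4, 3, 2
W1_big = np.ones((k * p, h))  # rows [0:p] = original copy, rows [p:] = new copies

# G-Freeze: only the newly added rows train; the original copy is frozen.
freeze_mask = np.zeros_like(W1_big, dtype=bool)
freeze_mask[p:] = True

# G-Train: the entire up-projection trains (the down-projection stays frozen).
train_mask = np.ones_like(W1_big, dtype=bool)

grad = np.full_like(W1_big, 1.0)

updated = masked_sgd_step(W1_big, grad, freeze_mask)
assert np.array_equal(updated[:p], W1_big[:p])        # original rows untouched
assert not np.array_equal(updated[p:], W1_big[p:])    # new rows moved

updated_all = masked_sgd_step(W1_big, grad, train_mask)
assert not np.array_equal(updated_all[:p], W1_big[:p])  # everything moves
```

In a real framework the same effect is typically achieved by setting `requires_grad` per parameter tensor rather than masking gradients by hand.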
Empirical Results and the Promise of Modularity
The authors evaluate their method across diverse benchmarks including question-answering, machine translation, and mathematical reasoning tasks. Their results demonstrate that the expansion approach matches the performance of standard full fine-tuning on downstream tasks while exhibiting almost zero performance degradation on proxy benchmarks measuring original capabilities. This effectively eliminates the catastrophic forgetting phenomenon without compromising plasticity.
Perhaps most intriguing is the demonstration of modularity through selective expansion. The authors show that expanding and fine-tuning only a targeted subset of layers, rather than the entire network, achieves performance equivalent to full expansion at a fraction of the computational cost. This suggests that the model's capacity for new skills can be augmented through targeted surgical interventions rather than global modification, opening possibilities for efficient, task-specific adaptations.
The Composition Problem: Beyond Single Task Adaptation
While the technical elegance of function-preserving expansion is compelling, the paper's evaluation framework reveals a critical limitation that the field must address. The experiments focus exclusively on single-task adaptation scenarios: taking a base pre-trained model, expanding it for one specific task, and evaluating performance on both the new task and original capabilities. However, real deployment scenarios typically require continual or lifelong learning, where models must accumulate capabilities across multiple sequential tasks.
This raises what we might term the composition problem. If we expand a base model for task A using this method, then subsequently expand that already-expanded model for task B, does the second expansion interfere with the capabilities learned in the first? The current method provides no mechanism for algebraic composition of multiple expansions. Unlike adapter-based methods or low-rank adaptation techniques that can be added, removed, or merged through simple arithmetic operations, the scaling corrections in this expansion method create complex dependencies between layers. When multiple expansion operations stack, the scaling factors compound in ways that may not preserve the functional integrity of earlier adaptations.
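The compounding of scaling factors is concrete enough to sketch. In the numpy illustration below (our construction, extrapolating the single-expansion rule to a stacked setting), expanding twice with factors k1 and k2 still yields an identity at initialization, but every surviving task-A column of the down-projection now carries an effective 1/(k1·k2) scale, coupling task-B training to task-A's frozen computation:

```python
import numpy as np

# Same expansion rule as before: replicate the up-projection, tile and
# rescale the down-projection by 1/k.
def expand(W1, W2, k):
    return (np.concatenate([W1] * k, axis=0),
            np.concatenate([W2] * k, axis=1) / k)

rng = np.random.default_rng(1)
h, p = 6, 10
W1 = rng.normal(size=(p, h))
W2 = rng.normal(size=(h, p))
x = rng.normal(size=h)

def out(A, B):
    return B @ np.maximum(A @ x, 0.0)

W1a, W2a = expand(W1, W2, 2)      # expansion for task A (k1 = 2)
W1b, W2b = expand(W1a, W2a, 3)    # stacked expansion for task B (k2 = 3)

# Still function-preserving at initialization...
assert np.allclose(out(W1, W2), out(W1b, W2b))

# ...but the original down-projection block is now scaled by 1/(2*3), so
# task-B gradient updates flow through a shared, rescaled down-projection
# that also implements task A's (and the base model's) computation.
assert np.allclose(W2b[:, :p] * 6, W2)
```

The identity check shows that stacking is not unstable at initialization; the difficulty is that, unlike detachable adapters, the stacked scaling leaves no factored structure that would let task A's and task B's contributions be separated, merged, or removed after training.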
Furthermore, the modularity demonstrated through selective layer expansion, while computationally efficient, creates a proliferation of model variants. Each task-specific expansion produces a distinct monolithic architecture with expanded dimensions in specific layers.