Current multimodal foundation models (MMFMs) face a critical dilemma in modality expansion: joint multimodal learning locks models into fixed modality combinations, while progressive multimodal learning enables modality expansion but risks catastrophic forgetting due to parameter overwriting. We believe the root cause of this challenge is the tight coupling of cross-modal general knowledge and modality-specific knowledge. Inspired by Piaget's Theory of Cognitive Development, we propose the Phased Knowledge Consolidation (PKC) principle: early-stage modality learning should prioritize cross-modal general knowledge, while later-stage learning should focus on modality-specific knowledge. Based on PKC, we introduce AllSparkv2, a progressive multimodal learning framework that decouples cross-modal general knowledge from modality-specific knowledge at both the architecture and training-strategy levels. At the architecture level, we propose the Modal Mixture of Experts (M-MoE), in which dedicated experts handle different modalities to decouple the parameter space, and experts for new modalities inherit cross-modal general knowledge by being initialized from existing ones. At the training level, we adopt a hierarchical modality learning strategy, starting with vision as the initial modality and following with point clouds as the successive modality. AllSparkv2 undergoes full-parameter training on vision to acquire strong cross-modal general knowledge, whereas for point clouds only the modality-specific experts are trained, preserving existing knowledge. Experiments show that AllSparkv2 can progressively integrate new modalities. Although only the point cloud experts are trained, AllSparkv2's performance on the 3D-MM-Vet benchmark remains nearly identical to full-parameter training (34.05 vs. 34.0). Meanwhile, freezing the old-modality parameters fundamentally prevents the catastrophic forgetting observed under full-parameter training (0 vs. -4.6). Moreover, we observe that point cloud performance improves in tandem with vision, demonstrating a cross-modal enhancement effect. AllSparkv2 not only presents a novel multimodal learning framework but also offers fresh insights into multimodal expansion and coordination. The code and models are available at https://github.com/GeoX-Lab/AllSparkv2.
Overview of AllSparkv2.
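To make the M-MoE and hierarchical training ideas above concrete, below is a minimal PyTorch-style sketch: each modality owns a dedicated expert, a new modality's expert is initialized by copying an existing (vision) expert, and all old-modality parameters are frozen during the successive stage. The names (`ModalMoELayer`, `add_modality`, `freeze_old_modalities`) and the hard modality-tag routing are hypothetical illustrations under these assumptions, not the AllSparkv2 API; see the repository for the actual implementation.

```python
import copy
import torch
import torch.nn as nn

class ModalMoELayer(nn.Module):
    """Modality-routed mixture of experts: each modality owns a dedicated FFN expert.

    Hypothetical sketch of the M-MoE idea, not the AllSparkv2 implementation.
    """

    def __init__(self, d_model: int, d_ff: int, modalities=("vision",)):
        super().__init__()

        def make_expert():
            return nn.Sequential(
                nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
            )

        self.experts = nn.ModuleDict({m: make_expert() for m in modalities})

    def add_modality(self, new_modality: str, init_from: str = "vision"):
        # The new modality's expert inherits cross-modal general knowledge by
        # initializing from an existing expert (here: a deep copy of its weights).
        self.experts[new_modality] = copy.deepcopy(self.experts[init_from])

    def forward(self, x: torch.Tensor, modality: str) -> torch.Tensor:
        # Hard routing by modality tag keeps the per-modality parameter spaces decoupled.
        return x + self.experts[modality](x)


def freeze_old_modalities(model: nn.Module, trainable_modality: str):
    """Freeze every parameter except the experts of the newly added modality."""
    for name, param in model.named_parameters():
        param.requires_grad = f"experts.{trainable_modality}" in name


# Stage 1: full-parameter training on vision (not shown) builds general knowledge.
# Stage 2: add point clouds, train only the new experts, keep vision frozen.
layer = ModalMoELayer(d_model=768, d_ff=3072, modalities=("vision",))
layer.add_modality("point_cloud", init_from="vision")
freeze_old_modalities(layer, trainable_modality="point_cloud")

tokens = torch.randn(2, 16, 768)            # (batch, sequence, d_model)
out = layer(tokens, modality="point_cloud")  # routed through the point-cloud expert
```

Because the old-modality experts never receive gradients in the second stage, their behavior on vision inputs is unchanged by construction, which is what prevents the catastrophic forgetting reported in the abstract.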