The DeepSeek team just published a paper on Manifold-Constrained Hyper-Connections (mHC). It addresses a pretty specific bottleneck we are seeing with recent attempts to scale the residual stream.

The core issue they are tackling is that while widening the residual stream (Hyper-Connections, or HC) buys you better performance by adding more information capacity, it usually breaks the identity-mapping property that makes ResNets and Transformers trainable in the first place. If you let those connection matrices learn freely, signal magnitudes drift out of control as the network gets deeper, which leads to exploding gradients.
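To make that concrete, here's a tiny toy demo (my own illustration in PyTorch, not anything from the paper): stack a few layers of free-form mixing matrices on the widened stream and watch the signal scale run away.

```python
import torch

torch.manual_seed(0)
n_streams, depth = 4, 32
x = torch.randn(n_streams, 512)  # 4 residual streams of toy 512-dim activations

# Free-form mixing: each "layer" multiplies the streams by an unconstrained
# matrix, so the signal norm drifts exponentially as depth grows.
for layer in range(1, depth + 1):
    x = torch.randn(n_streams, n_streams) @ x
    if layer % 8 == 0:
        print(f"layer {layer:2d}: signal norm ~ {x.norm():.2e}")
```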

Their solution is actually quite elegant. They force the learnable matrices to live on a particular manifold, the Birkhoff polytope. Practically, this means they use the Sinkhorn-Knopp algorithm to ensure the connection matrices are “doubly stochastic,” meaning all rows and columns sum to 1. This is clever because it turns signal propagation into a weighted average rather than an unbounded linear transformation, which preserves the signal mean and keeps gradient norms stable even in very deep networks.
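Here's a minimal sketch of that projection (again my own toy PyTorch version of Sinkhorn-Knopp, not the paper's kernel, and the names are made up). Mixing the streams through the resulting doubly stochastic matrices keeps the signal scale bounded at any depth:

```python
import torch

def sinkhorn(logits: torch.Tensor, n_iters: int = 20, eps: float = 1e-8) -> torch.Tensor:
    """Map an unconstrained n x n parameter matrix toward the Birkhoff polytope:
    alternately normalize rows and columns until both sum to (approximately) 1."""
    M = torch.exp(logits)  # keep all entries positive
    for _ in range(n_iters):
        M = M / (M.sum(dim=-1, keepdim=True) + eps)  # rows sum to 1
        M = M / (M.sum(dim=-2, keepdim=True) + eps)  # columns sum to 1
    return M

torch.manual_seed(0)
n_streams, depth = 4, 32
x = torch.randn(n_streams, 512)

# Doubly stochastic mixing is a weighted average of the streams (spectral norm <= 1),
# so unlike the unconstrained case the signal scale never blows up with depth.
for _ in range(depth):
    x = sinkhorn(torch.randn(n_streams, n_streams)) @ x
print(f"after {depth} layers: signal norm ~ {x.norm():.2e}")
```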

What I found most interesting, though, was the engineering side. These multi-stream ideas usually die because of memory bandwidth rather than FLOPs: widening the stream several-fold means every layer has to read and write that much more activation data, which creates a massive I/O bottleneck. They managed to get around this with some heavy kernel fusion and a modified pipeline schedule they call DualPipe to overlap communication with compute.

The results look solid. They trained a 27B model and showed that mHC matches the stability of standard baselines while keeping the performance gains of the wider connections. It only added about 6.7% time overhead compared to a standard baseline, which is a decent trade-off for the gains they are seeing on reasoning benchmarks like GSM8K and MATH. It basically makes the “wider residual stream” idea practical for actual large-scale pre-training.

Expanding the residual stream adds more pathways for information to flow, which helps with training on constrained hardware by decoupling the model’s capacity from its computational cost. Usually, if you want a model to be “smarter” or to carry more state from layer to layer, you have to increase the hidden dimension, which makes the attention and feed-forward layers roughly quadratically more expensive to run. The mHC approach lets you widen that information highway without touching the expensive compute layers: the extra connections are just small linear mixings, computationally negligible next to the heavy matrix multiplications in the rest of the network.
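Some rough back-of-the-envelope numbers to show just how lopsided that is (assumed sizes, not the paper's actual configuration):

```python
# Per-layer cost of the extra connections vs. the existing compute.
d_model, n_streams = 4096, 4

attn_params = 4 * d_model * d_model       # Q, K, V, O projections
ffn_params = 2 * d_model * (4 * d_model)  # up/down FFN projections
hc_params = 3 * n_streams * n_streams     # a few tiny n x n mixing matrices

print(f"attention + FFN per layer: {attn_params + ffn_params:,} parameters")
print(f"hyper-connection mixing:   {hc_params:,} parameters")

# For contrast: getting more capacity the old way, by doubling d_model,
# roughly quadruples the attention/FFN cost instead.
print(f"FFN if d_model doubled:    {2 * (2 * d_model) * (4 * 2 * d_model):,} parameters")
```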

They further combined this technique with a Mixture-of-Experts (MoE) architecture, which is the component that actually reduces the number of active parameters in any single forward pass. The mHC method ensures that even with that sparsity the signal remains stable and gradients have a mathematically sound path to flow through. To keep VRAM usage from exploding, the intermediate states of the extra streams are discarded during the forward pass and recomputed on the fly during the backward pass. Together this allows you to train a model that behaves like a much larger dense network while fitting into the memory constraints of cheaper hardware clusters.
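That recompute trick is essentially activation checkpointing applied to the widened streams. Here's a minimal sketch with PyTorch's built-in utility, where hc_block is a hypothetical stand-in for one mHC layer rather than anything from the paper:

```python
import torch
from torch.utils.checkpoint import checkpoint

def run_layer(hc_block: torch.nn.Module, streams: torch.Tensor) -> torch.Tensor:
    # Don't keep the widened-stream activations after the forward pass;
    # recompute them on the fly when the backward pass needs their gradients.
    # A bit of extra compute in exchange for a much smaller peak VRAM footprint.
    return checkpoint(hc_block, streams, use_reentrant=False)
```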

  • mub@lemmy.ml · 12 hours ago

    It is rare that I fail to get the gist of what is being said in these technical explanations, but this one has me actually wondering about the gist of the gist. Some of it made me feel like it was made up nonsense.

    • ☆ Yσɠƚԋσʂ ☆@lemmy.ml (OP) · 11 hours ago

      It seemed pretty clear to me. If you have any clue on the subject then you presumably know about the interconnect bottleneck in traditional large models. The data moving between layers often consumes more energy and time than the actual compute operations, and the surface area for data communication explodes as models grow to billions of parameters. The mHC paper introduces a new way to link neural pathways by constraining hyper-connections to a low-dimensional manifold.

      In a standard transformer architecture, every neuron in layer N potentially connects to every neuron in layer N+1. This is mathematically exhaustive, which makes it computationally inefficient. Manifold-constrained connections operate on the premise that most of this high-dimensional space is noise. DeepSeek basically found a way to significantly reduce networking bandwidth for a model by using manifolds to route communication.

      Not really sure what you think the made up nonsense is. 🤷

      • mub@lemmy.ml · 4 hours ago

        Thx. That is more helpful.

        I don’t actually think it was nonsense, it just sounded like it.

  • Paul Sutton (zleap)@techhub.social · 1 day ago

    @yogthos

    It looks like the West has been caught with its pants down. China is developing what seems to be far more efficient AI tech; perhaps while big tech is motivated by money and IP theft, China has just got on with developing better ideas.

    This also seems to be a major win for open source models, which is a good thing. Could also be a good thing for the EU, which wants to develop its own AI solutions to break away from US big tech.

        • just another dev@lemmy.my-box.dev · 21 hours ago

          Snark aside, thanks for clarifying which kind of IP theft was meant, because this is not the kind of IP theft that is normally associated with training models.

          It would have been incredibly impressive if they managed to train it without ~~stealing~~ acquiring tons of data.

            • ☆ Yσɠƚԋσʂ ☆@lemmy.ml (OP) · 21 hours ago

            I’m personally against copyrights as a concept and absolutely don’t care about this aspect, especially when it comes to open models. The way I look at it is that the model is unlocking this content and making this knowledge available to humanity.

        • just another dev@lemmy.my-box.dev · 1 day ago

          I was thinking about the training data, of which you need massive amounts. And as far as I know, pretty much all companies have worked on a scraping basis, rather than paying for it (or even asking).

          What kind of ip theft were you thinking of?

            • hitmyspot@aussie.zone · 13 hours ago

            I was referring to both scraping to create the models and using the models to create infringing content.