They’re probably stacks of 8-NPU Huawei servers, all cooperatively serving the same few models.
As an older example, I believe DeepSeek V3 was most efficiently served with ~384 GPUs in a single cluster, before they switched to Chinese NPUs. So they’d have some software that ties all of these together as one “server,” and maybe several of those “servers” handling API requests for one endpoint.
But it doesn’t actually need all 384 in each “server.” Many models will fit in a single 8-GPU/NPU server; the software pools more of them just to get better utilization out of the hardware.
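To put rough numbers on "will fit in a single 8-GPU/NPU server," here is a back-of-the-envelope sketch. Everything in it is an assumption for illustration: the function name `weights_fit`, the 64 GB of memory per device, and the 30% reserve for KV cache and activations are made up, not specs of any real hardware or model.

```python
def weights_fit(params_billion: float, bytes_per_param: float,
                device_mem_gb: float, n_devices: int = 8,
                reserve_fraction: float = 0.3) -> bool:
    """Return True if the weights fit in the pooled memory left after the reserve.

    All arguments are hypothetical inputs; reserve_fraction is a guess at how
    much memory to hold back for KV cache, activations, and runtime overhead.
    """
    # 1e9 parameters * bytes per parameter / 1e9 bytes per GB = GB of weights
    weights_gb = params_billion * bytes_per_param
    usable_gb = device_mem_gb * n_devices * (1.0 - reserve_fraction)
    return weights_gb <= usable_gb


# A hypothetical 70B-parameter model at 2 bytes/param on 8 devices of 64 GB each
print(weights_fit(70, 2, 64))
# A hypothetical 700B-parameter model on the same single server
print(weights_fit(700, 2, 64))
# The same 700B model spread over a 384-device pool
print(weights_fit(700, 2, 64, n_devices=384))
```

This only checks capacity. It says nothing about why a bigger pool gets better utilization, which depends on batching and how the model is split across devices.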
If one server fails, the system would return a few requests as empty and have to restart the serving software, but… that’s fine. All the data is ephemeral. Even if the whole 24MW unit fails, they can just route API requests somewhere else, and a few failed generations aren’t a big deal.
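The "just route somewhere else" part could look something like the sketch below. This is a toy under stated assumptions, not how any real provider does it: `route`, `PoolDown`, `broken_pool`, and `healthy_pool` are hypothetical names, and each "pool" is just a Python callable standing in for a whole serving cluster.

```python
from typing import Callable, Optional, Sequence


class PoolDown(Exception):
    """Raised by a (hypothetical) pool that cannot serve a request right now."""


def route(prompt: str, pools: Sequence[Callable[[str], str]]) -> Optional[str]:
    """Try each pool in order and return the first successful generation.

    Returns None if every pool fails, i.e. the caller sees a failed generation.
    """
    for pool in pools:
        try:
            return pool(prompt)
        except PoolDown:
            # Nothing is persisted per request, so retrying elsewhere is safe.
            continue
    return None


def broken_pool(prompt: str) -> str:
    # Stand-in for a pool whose serving software is restarting.
    raise PoolDown("serving software restarting")


def healthy_pool(prompt: str) -> str:
    # Stand-in for a working pool; a real one would run the model.
    return f"generated:{prompt}"


print(route("hi", [broken_pool, healthy_pool]))  # falls through to healthy_pool
print(route("hi", [broken_pool]))                # every pool down, so None
```

The point is that there is no state to recover: a dead pool is skipped, and the worst case is that the client gets nothing back and retries.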