Key Takeaways
• Moore Threads launches the MTT S5000 GPU, delivering 1000 TFLOPS of single-card performance with native FP8 precision
• A 10 EFLOPS computing cluster is operational, achieving 60% MFU in dense model training with 95% linear scaling efficiency
• Third-party validation on RoboBrain 2.5 training shows a 0.6% relative deviation in training loss versus H100 clusters
Opening
The Chinese AI computing landscape is shifting as Moore Threads announces the MTT S5000, a flagship GPU aimed at large model training with hardware-level FP8 acceleration and 1000 TFLOPS of single-card performance.
Hardware Specifications
The MTT S5000 carries 80 GB of video memory with 1.6 TB/s of bandwidth, offers 784 GB/s of inter-card interconnect bandwidth, and natively supports precisions from FP8 through FP64. FP8 halves the data width relative to BF16/FP16, which cuts the memory footprint of FP8-stored tensors by 50% and theoretically doubles compute throughput.
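The 50% figure follows directly from element width. A minimal sketch of the arithmetic is below; the 70-billion-parameter model is a hypothetical example and not a figure from the announcement, and the byte widths are the standard ones for each format.

```python
# Standard storage width, in bytes, of one element in each format.
BYTES_PER_ELEMENT = {"fp32": 4, "bf16": 2, "fp16": 2, "fp8": 1}


def tensor_gib(num_elements: int, dtype: str) -> float:
    """Return the storage needed for num_elements values of dtype, in GiB."""
    return num_elements * BYTES_PER_ELEMENT[dtype] / 2**30


# Hypothetical example: the weights of a 70-billion-parameter model.
params = 70_000_000_000
bf16_gib = tensor_gib(params, "bf16")
fp8_gib = tensor_gib(params, "fp8")

print(f"BF16 weights: {bf16_gib:.1f} GiB")
print(f"FP8 weights:  {fp8_gib:.1f} GiB")
print(f"FP8 / BF16 ratio: {fp8_gib / bf16_gib:.2f}")
```

The halving applies only to tensors actually stored in FP8. Mixed-precision training commonly keeps some state, such as master weights and optimizer moments, in wider formats, so the end-to-end saving is usually smaller than 50%.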
Performance Benchmarks
Testing shows the MTT S5000 delivering a 30% training performance improvement over the H100 in multimodal large model fine-tuning. With 16k-token long-sequence inputs, single-card prefill throughput reaches 2.5 times that of the H20. According to industry sources, the S5000 surpasses the H100 on certain precision-specific metrics and approaches Blackwell-architecture levels.
Architecture and Software Stack
Built on the fourth-generation MUSA architecture, the S5000 integrates hardware-level FP8 Tensor Core acceleration units and fully supports DeepSeek, Qwen, and other frontier model architectures. The MUSA full-stack software platform is natively compatible with the PyTorch, Megatron-LM, vLLM, and SGLang frameworks, enabling zero-cost code migration while maintaining compatibility with the CUDA ecosystem.
Training Cluster Performance
A 10 EFLOPS computing cluster built on the S5000 is now in operation. Dense model training reaches 60% Model FLOPs Utilization (MFU), MoE models hold around 40% MFU, and linear scaling efficiency in training reaches 95%. When scaling from 64 to 1024 cards, the system maintains above 90% linear scaling efficiency, so training speed grows almost in step with added compute.
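As a rough illustration of how these metrics are usually defined, the sketch below computes MFU as achieved FLOP/s divided by aggregate peak FLOP/s, and scaling efficiency as measured throughput divided by ideal linear throughput. The measured values are hypothetical, chosen only to land on round numbers. The announcement does not state which precision the 1000 TFLOPS peak refers to, so using it as the MFU denominator is an assumption.

```python
PEAK_FLOPS_PER_CARD = 1000e12  # 1000 TFLOPS per card, as quoted in the article
CLUSTER_PEAK_FLOPS = 10e18     # 10 EFLOPS cluster, as quoted in the article


def mfu(achieved_flops_per_s: float, num_cards: int,
        peak_flops_per_card: float = PEAK_FLOPS_PER_CARD) -> float:
    """Model FLOPs Utilization: useful model FLOP/s over aggregate peak FLOP/s."""
    return achieved_flops_per_s / (num_cards * peak_flops_per_card)


def scaling_efficiency(base_cards: int, base_throughput: float,
                       scaled_cards: int, scaled_throughput: float) -> float:
    """Measured throughput relative to perfect linear scaling from the baseline."""
    ideal_throughput = base_throughput * scaled_cards / base_cards
    return scaled_throughput / ideal_throughput


# Card count implied by the two quoted peak figures.
implied_cards = round(CLUSTER_PEAK_FLOPS / PEAK_FLOPS_PER_CARD)
print(f"Implied cluster size: {implied_cards} cards")

# Hypothetical measurement: 1024 cards sustaining 614.4 PFLOP/s of model compute.
example_mfu = mfu(614.4e15, 1024)
print(f"MFU: {example_mfu:.1%}")

# Hypothetical measurement: 100 samples/s on 64 cards, 1472 samples/s on 1024 cards.
example_eff = scaling_efficiency(64, 100.0, 1024, 1472.0)
print(f"Scaling efficiency: {example_eff:.1%}")
```

The implied size of 10,000 cards is consistent with the ten-thousand-card cluster mentioned in the conclusion.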
Third-Party Validation
In January 2026, Zhipu Research Institute completed end-to-end training and alignment validation of RoboBrain 2.5, a frontier agentic model, on a thousand-card S5000 cluster. Training loss showed a relative deviation of only 0.6% compared with H100 clusters. With the same volume of data, downstream task evaluation scores surpassed those obtained on the H100, supporting the numerical precision of the large-scale cluster.
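The article does not say exactly how the 0.6% figure was computed. One common reading is the largest pointwise relative gap between two loss curves logged at the same training steps. The sketch below assumes that definition, and its loss values are hypothetical.

```python
from typing import Sequence


def max_relative_deviation(candidate: Sequence[float],
                           reference: Sequence[float]) -> float:
    """Largest |candidate - reference| / |reference| over aligned training steps."""
    if len(candidate) != len(reference):
        raise ValueError("loss curves must be logged at the same steps")
    return max(abs(c - r) / abs(r) for c, r in zip(candidate, reference))


# Hypothetical loss values at three matching checkpoints.
h100_loss = [2.500, 2.000, 1.600]
s5000_loss = [2.510, 2.012, 1.604]

deviation = max_relative_deviation(s5000_loss, h100_loss)
print(f"Max relative loss deviation: {deviation:.2%}")
```

Other definitions, such as the deviation in final loss only or the mean over all steps, would give different numbers from the same curves.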
Inference Performance
The S5000 also performs strongly in inference. In December 2025, joint testing by Moore Threads and SiliconFlow on the full-parameter DeepSeek-V3 671B model achieved single-card prefill throughput above 4000 tokens/s and decode throughput above 1000 tokens/s, setting new records for domestic GPU inference. For workloads that require high-frequency communication between agents and near-instant generation of code blocks, the S5000 is reported to far exceed industry benchmarks in DeepSeek frontier model inference.
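To put the throughput figures in practical terms, the sketch below estimates the card time one request consumes. It assumes the quoted per-card rates hold regardless of batch size and sequence length, which real serving systems do not guarantee, and the request shape is hypothetical.

```python
def card_seconds(prompt_tokens: int, output_tokens: int,
                 prefill_tps: float = 4000.0,
                 decode_tps: float = 1000.0) -> float:
    """Card time consumed by one request: prefill time plus decode time.

    The default rates are the per-card figures quoted in the article.
    """
    return prompt_tokens / prefill_tps + output_tokens / decode_tps


# Hypothetical request: a 16,000-token prompt followed by 500 generated tokens.
cost = card_seconds(16_000, 500)
print(f"Estimated card time: {cost:.1f} s")
```

Because the article reports both rates as lower bounds ("exceeding" and "surpassing"), this estimate is an upper bound on card time under the stated assumptions. Per-card decode throughput of this size is typically an aggregate over many concurrent requests, so the result is a measure of capacity consumed and not the latency a single user would observe.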
Scientific Computing Capabilities
With native FP64 double-precision compute, the S5000 outperforms the H100 in scientific computing scenarios. In the SPONGE simulation engine, performance reaches 1.7 times that of the H100. In tests with the molecular docking tool DSDP, it reaches 8.1 times H100 performance.
Conclusion
The Moore Threads MTT S5000 offers a viable domestic computing option that covers the full range of large model training needs. From FP8 precision support and 1000 TFLOPS of single-card performance to real-world results on a ten-thousand-card cluster and validation by third-party institutions, the product shows that domestic GPUs can not only run inference effectively but already meet the compute requirements of large-scale model training.