Clockwork.io's TorchPass software prevents AI training crashes by enabling live GPU migration, saving millions of dollars annually in large AI clusters.
Clockwork.io has launched TorchPass, a software solution that enables live GPU migration and fault tolerance in large AI training clusters, avoiding costly job restarts after hardware failures, network issues, or driver bugs.
The system maintains training continuity without checkpointing, supports reactive, proactive, and maintenance-based failover, and can save over $6 million annually in a 2,048-GPU setup.
As failure rates rise in massive clusters, with mean time to failure dropping to just 1.8 hours in a 16,384-GPU system, TorchPass improves reliability, GPU utilization, and model training efficiency.
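One way to see why failures become so frequent at scale is a back-of-the-envelope calculation. The sketch below assumes that GPUs fail independently at a constant rate, so that cluster mean time to failure (MTTF) is roughly the per-GPU MTTF divided by the GPU count. That model is an assumption made here for illustration and is not stated in the announcement; the function names are hypothetical.

```python
def implied_per_gpu_mttf(cluster_mttf_hours: float, num_gpus: int) -> float:
    """Per-GPU MTTF implied by a cluster-level MTTF.

    Assumes independent failures at a constant rate (illustrative model only).
    """
    return cluster_mttf_hours * num_gpus


def cluster_mttf(per_gpu_mttf_hours: float, num_gpus: int) -> float:
    """Cluster-level MTTF under the same independence assumption."""
    return per_gpu_mttf_hours / num_gpus


# Start from the figure quoted in the text: 1.8 hours at 16,384 GPUs.
per_gpu = implied_per_gpu_mttf(1.8, 16_384)

# Under this model, halving the cluster size doubles the cluster MTTF.
for n in (2_048, 4_096, 8_192, 16_384):
    print(f"{n:>6} GPUs: cluster MTTF ~ {cluster_mttf(per_gpu, n):.1f} hours")
```

Under this simplified model, any single GPU fails only about once every few years, yet a cluster of thousands sees a failure every few hours. Real clusters have correlated failures (shared networks, power, drivers), so these numbers are only indicative.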
Early adopters report improved throughput, resilience, and service-level agreement (SLA) performance, which positions TorchPass as a software-driven fix for a major cost barrier in AI infrastructure.