KIOSHIROI's CS-learning Road


February 1, 2026

1 min read

Notes - Lecture 02: Development Infrastructure & Tooling

This note is hosted on Notion: Lecture 02: Development Infrastructure & Tooling


Dream: project spec + sample data → a continuously improving system that can be deployed at scale

Reality:

Frameworks & Distributed Training

PyTorch, JAX, TensorFlow, ONNX, Hugging Face, timm

DDP, Horovod

Sharded Data-Parallel

ZeRO-3 (Fully Sharded Data Parallel)

Model-Parallel (supported by DeepSpeed, FairScale, and PyTorch)

Tensor Parallel (Megatron-LM)

trainer = Trainer(strategy="ddp_sharded")
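
To make the sharding options above more concrete, here is a rough sketch of the per-GPU memory needed for model states (parameters, gradients, optimizer states) at each ZeRO stage. It assumes mixed-precision training with Adam and the byte counts used in the ZeRO paper, and it ignores activations and temporary buffers. The 7.5B-parameter model and 64 GPUs are illustrative numbers, and `model_state_gb` is a hypothetical helper, not a DeepSpeed or PyTorch API.

```python
# Rough per-GPU memory for model states under mixed-precision Adam,
# following the accounting in the ZeRO paper (activations and buffers ignored).
BYTES_FP16_PARAMS = 2   # fp16 copy of the weights
BYTES_FP16_GRADS = 2    # fp16 gradients
BYTES_OPTIMIZER = 12    # fp32 weights + Adam momentum + Adam variance


def model_state_gb(num_params: float, num_gpus: int, stage: int) -> float:
    """Return approximate model-state memory per GPU in GB (1 GB = 1e9 bytes)."""
    params, grads, opt = BYTES_FP16_PARAMS, BYTES_FP16_GRADS, BYTES_OPTIMIZER
    if stage >= 1:      # stage 1: shard optimizer states across GPUs
        opt /= num_gpus
    if stage >= 2:      # stage 2: also shard gradients
        grads /= num_gpus
    if stage >= 3:      # stage 3: also shard parameters (fully sharded)
        params /= num_gpus
    return num_params * (params + grads + opt) / 1e9


if __name__ == "__main__":
    # Hypothetical setup: a 7.5B-parameter model trained on 64 GPUs.
    for stage in range(4):
        print(f"stage {stage}: {model_state_gb(7.5e9, 64, stage):.1f} GB per GPU")
```

Stage 0 here corresponds to plain DDP, where every GPU holds a full replica of all model states.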

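Acceleration methods:

Mixed-precision distributed training
- 16-bit mixed precision;
- single GPU / multi-GPU / multi-node
Ultra-fast dense transformer kernels
Sparse attention
- memory- and compute-efficient sparse attention;
- supports sequences 10x longer than dense attention;
- flexible support for multiple sparse structures
1-bit Adam, 0/1 Adam, 1-bit LAMB
Additional Memory & Bandwidth Optimizations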
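
As a small illustration of why 16-bit training is "mixed" rather than pure fp16, the sketch below shows a tiny gradient underflowing to zero in half precision, and the same value surviving when it is multiplied by a loss scale first. Loss scaling is a common companion to mixed precision, but the scale factor and gradient value here are assumptions chosen for the demo, and `to_fp16` is a hypothetical helper built on the standard library only.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to IEEE half precision and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


LOSS_SCALE = 65536.0   # hypothetical scale factor (a power of two)
tiny_grad = 1e-8       # hypothetical gradient, below the smallest fp16 value

# Stored directly in fp16, the gradient underflows to zero.
unscaled = to_fp16(tiny_grad)

# Scaling first keeps it representable; unscaling is done in higher precision.
recovered = to_fp16(tiny_grad * LOSS_SCALE) / LOSS_SCALE

if __name__ == "__main__":
    print("fp16 without scaling:", unscaled)
    print("fp16 with loss scaling:", recovered)
```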

Compute

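Three factors:

How much data the GPU can process
How fast the GPU processes that data
Communication time between the CPU and GPU, and between GPUs

A GPU with a higher price per hour can still lower the total cost of an experiment, because it may finish the run in far fewer hours.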
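
The cost argument can be checked with simple arithmetic. The prices and run times below are made-up numbers, not benchmarks of any real GPU, and `experiment_cost` is a hypothetical helper.

```python
def experiment_cost(price_per_hour: float, hours: float) -> float:
    """Total cost of one training run on a single GPU."""
    return price_per_hour * hours


# Hypothetical numbers: the second GPU costs 3x more per hour but is 4x faster.
cheap_gpu = experiment_cost(price_per_hour=1.0, hours=10.0)
fast_gpu = experiment_cost(price_per_hour=3.0, hours=2.5)

if __name__ == "__main__":
    print(f"cheap GPU: ${cheap_gpu:.2f}, fast GPU: ${fast_gpu:.2f}")
```

Under these assumptions the pricier GPU is cheaper overall, and the result also arrives sooner.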

Resource Management

Slurm, Docker, Kubernetes, Kubeflow, Anyscale, Grid.ai, Determined.ai

Experiment Management

TensorBoard, Weights & Biases

Hyperparameter Optimization:

W&B Sweeps: define a hyperparameter sweep in a YAML file
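
A sweep configuration might look roughly like the following. The dictionary mirrors the top-level layout of a W&B sweep YAML file (`method`, `metric`, `parameters`), written as a Python dict so it can be run without extra dependencies. The parameter names and values are hypothetical, and `expand_grid` is only an illustrative stand-in for what a grid sweep enumerates, not part of the W&B API.

```python
from itertools import product

# Mirrors the layout of a W&B sweep YAML file; names and values are hypothetical.
sweep_config = {
    "method": "grid",
    "metric": {"name": "val_loss", "goal": "minimize"},
    "parameters": {
        "learning_rate": {"values": [1e-4, 1e-3, 1e-2]},
        "batch_size": {"values": [32, 64]},
    },
}


def expand_grid(config: dict) -> list:
    """Enumerate every combination a grid search over `values` would try."""
    names = list(config["parameters"])
    value_lists = [config["parameters"][name]["values"] for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


if __name__ == "__main__":
    for run_config in expand_grid(sweep_config):
        print(run_config)
```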


Notes - Lecture 02: Development Infrastructure & Tooling

https://kioshiroi.github.io/blog/fsdl2

Author KIOSHIROI

Published at February 1, 2026

Copyright CC BY-NC-SA 4.0
