Testing Software

Before writing tests, get clear on two questions:

Which real bugs are our tests supposed to catch?
Which false alarms will our tests raise?

| Test result | Code has a bug | Code is OK |
| --- | --- | --- |
| Pass (√) | Missed Alarm | Smooth Sailing |
| Fail (×) | Caught Bug | False Alarm |

In some settings, correctness matters a great deal:

Cases:
- Medical diagnosis & intervention
- Self-driving vehicles
- Banking, finance
Patterns:
- Autonomous systems
- High stakes

Use testing tools, but don't chase full coverage

pytest, doctest, Codecov
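A minimal sketch of what the first two tools check, using a hypothetical `clamp` helper that is not from the lecture. The docstring examples are what doctest runs, and the `test_*` function is what pytest collects.

```python
def clamp(value, low, high):
    """Clamp value into the closed interval [low, high].

    >>> clamp(5, 0, 10)
    5
    >>> clamp(-3, 0, 10)
    0
    >>> clamp(42, 0, 10)
    10
    """
    return max(low, min(value, high))


def test_clamp_keeps_values_inside_range():
    # pytest collects functions named test_* and reports any failing assert.
    assert clamp(5, 0, 10) == 5
    assert clamp(-3, 0, 10) == 0
    assert clamp(42, 0, 10) == 10
```

Assuming the code is saved as `example.py`, `python -m doctest example.py` runs the docstring examples and `pytest example.py` runs the test function.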

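Standardize on a uniform code style

Black (formatter), Flake8, ShellCheck

Style rules that are forced on people are frustrating

Recommendations:

Limit the rules to the minimal style requirements needed to meet your goals (sticking to standards, avoiding arguments, keeping version control history clean).
Make rules “opt-in” and expand coverage over time, especially for existing codebases.

Codebase management workflow: GitHub Actions, pre-commit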

The 80/20 principle

Testing ML Systems

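In software engineering, we compile source code into a program.

In ML, training “compiles” data into a model.

Every part of this is harder to test in ML.

Data: huge and hard to understand
Training: complex, poorly specified, immature
Model: worse tools for debugging and inspection

So we mostly use “smoke” tests

They let you know where something went wrong and help you understand the problem.

These tests are easy to implement and still very effective.

This follows the 80/20 principle: 20% of the effort delivers 80% of the value.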

Data Test

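For data, use expectation testing

To start: test data by checking basic properties

Check properties of the rows and columns (ranges, missing values, etc.)

For numeric values, make the allowed range as wide as is reasonable to avoid false alarms

great_expectations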
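A rough sketch of expectation testing in plain Python. It does not use the great_expectations API; the rows, column names, and ranges are all hypothetical. Each check returns the offending rows, so a failure points at the data that caused it.

```python
import math

# Hypothetical rows standing in for a real table.
ROWS = [
    {"age": 34, "height_cm": 172.0, "label": "healthy"},
    {"age": 61, "height_cm": 158.5, "label": "sick"},
    {"age": 8, "height_cm": 127.0, "label": "healthy"},
]


def is_null(value):
    """Treat None and NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def expect_not_null(rows, column):
    """Return the rows where `column` is missing."""
    return [r for r in rows if is_null(r.get(column))]


def expect_between(rows, column, low, high):
    """Return the rows whose non-missing `column` value is outside [low, high]."""
    return [
        r for r in rows
        if not is_null(r.get(column)) and not (low <= r[column] <= high)
    ]


def expect_in_set(rows, column, allowed):
    """Return the rows whose `column` value is not one of `allowed`."""
    return [r for r in rows if r.get(column) not in allowed]


def check_data(rows):
    """Run every expectation and return only the ones that failed."""
    results = {
        "age is not null": expect_not_null(rows, "age"),
        # Deliberately wide ranges: they catch garbage without raising false alarms.
        "age in [0, 130]": expect_between(rows, "age", 0, 130),
        "height_cm in [20, 300]": expect_between(rows, "height_cm", 20, 300),
        "label is known": expect_in_set(rows, "label", {"healthy", "sick"}),
    }
    return {name: bad for name, bad in results.items() if bad}
```

`check_data(ROWS)` returns an empty dict for the sample rows. A row with `age` set to `-5` would show up under `"age in [0, 130]"`.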

Training Test

For training, use memorization testing

Train on a small dataset and check that the model can memorize it.
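A minimal sketch of a memorization test, assuming a toy one-dimensional linear model trained with plain gradient descent. The model, learning rate, step count, and loss threshold are illustrative choices, not values from the lecture. If training cannot drive the loss close to zero on four points, something in the training code is broken.

```python
def train_linear(xs, ys, lr=0.05, steps=2000):
    """Fit y = w * x + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2 * sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    loss = sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / n
    return w, b, loss


def test_memorizes_tiny_batch():
    # Four points that lie exactly on y = 2x + 1, so a perfect fit exists.
    xs = [0.0, 1.0, 2.0, 3.0]
    ys = [1.0, 3.0, 5.0, 7.0]
    _, _, loss = train_linear(xs, ys)
    assert loss < 1e-3, f"training failed to memorize a tiny batch (loss={loss})"
```

The same pattern applies to a real model: take one small batch, train on it for a fixed budget, and assert on the final loss.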

Model Test

For models, use regression testing

A model is really just a function, so test it the way you would test a function.

Use test-driven development (TDD)
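A small sketch of testing a model like a function. `predict_sentiment` is a hypothetical keyword-based stand-in for a trained model, and the cases are made up for illustration. The suite pins input/output pairs the model must keep getting right.

```python
def predict_sentiment(text):
    """Hypothetical stand-in for a trained model: text in, label out."""
    negative_words = {"bad", "awful", "terrible"}
    words = [w.strip(".,!?") for w in text.lower().split()]
    return "negative" if any(w in negative_words for w in words) else "positive"


# Pinned examples the model must keep getting right after every change.
REGRESSION_CASES = [
    ("This movie was great", "positive"),
    ("Terrible plot, awful acting.", "negative"),
    ("I had a lovely time!", "positive"),
]


def run_suite(model, cases):
    """Return (input, expected, actual) for every case the model gets wrong."""
    failures = []
    for text, expected in cases:
        actual = model(text)
        if actual != expected:
            failures.append((text, expected, actual))
    return failures


def test_sentiment_regressions():
    assert run_suite(predict_sentiment, REGRESSION_CASES) == []
```

Because `run_suite` takes the model as an argument, the same cases can be replayed against each newly trained version.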

Treat the loss as a “fuzzy” test signal

Treat the loss as “how badly this test was failed”: the higher the loss, the worse the model is doing on the test.
During training, our model changes to do better on the test: it adjusts its parameters to lower the loss.
The model stops changing once it passes the tests well enough: once the loss is low enough, no large updates are needed.

**Conclusion:** gradient descent is TDD.

When testing, collect the examples with high loss

Find the data points with highest loss and put them in a suite labeled “hard”.
But note that problem could be in the model OR it could be in the data!

Aggregate individual failures into named suites. This is the ML version of “regression tests”.
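A rough sketch of building a “hard” suite from per-example losses. The model, loss function, and dataset are toy placeholders, and `collect_hard_examples` is a hypothetical helper name.

```python
import heapq


def squared_error(prediction, target):
    return (prediction - target) ** 2


def collect_hard_examples(model, loss_fn, dataset, k=2):
    """Return the k (loss, x, y) triples with the highest loss, worst first."""
    scored = [(loss_fn(model(x), y), x, y) for x, y in dataset]
    return heapq.nlargest(k, scored, key=lambda item: item[0])


def double(x):
    """Toy model: predicts y = 2x."""
    return 2 * x


# Toy dataset; the last label looks wrong, which is the kind of thing
# a high-loss suite tends to surface.
DATASET = [(1, 2), (2, 4), (3, 9), (4, 8), (5, 0)]

HARD_SUITE = collect_hard_examples(double, squared_error, DATASET, k=2)
```

Here the worst example is `(5, 0)`, which looks more like a labeling error than a model failure. That is the “model OR data” ambiguity noted above.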

Running tests in production matters for ML (you need infrastructure and tooling to quickly fix errors that show up in production)

Summary:

Focus on low-hanging fruit and “smoke” tests for ML.
Expectation tests for data
Memorization tests for training
Regression tests for models

Troubleshooting models

Key Ideas:

“Make it run” by heading off common errors: shape, OOM, numerics
“Make it fast” by profiling and removing bottlenecks
“Make it right” by scaling model/data and sticking with proven architectures

Make it run

The vast majority of ML failures come from a small set of bugs:

Shape errors: Tensors don’t match.
Out of memory errors: Tensors don’t fit on GPU.
Numerical errors: Tensors have NaNs or +-infs.

Shape errors:

Step through in a debugger.
Annotate tensor shapes in the code.
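A minimal sketch of shape annotations. To stay runnable without a tensor library it uses nested Python lists and a hypothetical `shape` helper; with real tensors you would read `.shape` instead. The point is the comments and the early assert: a mismatch fails at the line that caused it, with both shapes in the message.

```python
def shape(nested):
    """Shape of a non-empty rectangular nested list, e.g. [[1, 2, 3]] -> (1, 3)."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)


def linear(x, w):
    # x: (B, D_in)   w: (D_in, D_out)   ->   out: (B, D_out)
    assert shape(x)[1] == shape(w)[0], f"shape mismatch: {shape(x)} @ {shape(w)}"
    return [
        [sum(xi * wi for xi, wi in zip(row, col)) for col in zip(*w)]
        for row in x
    ]


x = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]    # (B=2, D_in=3)
w = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # (D_in=3, D_out=2)
out = linear(x, w)                        # (B=2, D_out=2)
```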

OOM errors:

--precision=16
--auto_scale_batch_size
--accumulate_grad_batches

Tensor Parallelism
Gradient checkpoints
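A toy sketch of the idea behind gradient accumulation: process a large batch as several small micro-batches and average their gradients, so only one micro-batch has to fit in memory at a time. This illustrates the concept only, not how any particular flag is implemented. The one-parameter model and the data are made up.

```python
def grad_mse(w, batch):
    """Gradient of mean squared error with respect to w for the model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w, batch, micro_batch_size):
    """Average the gradients of equal-sized micro-batches, one at a time."""
    assert len(batch) % micro_batch_size == 0, "micro-batches must be equal-sized"
    total = 0.0
    num_chunks = 0
    for start in range(0, len(batch), micro_batch_size):
        chunk = batch[start:start + micro_batch_size]
        total += grad_mse(w, chunk)
        num_chunks += 1
    return total / num_chunks


BATCH = [(1.0, 2.0), (2.0, 4.5), (3.0, 5.5), (4.0, 8.5)]
full = grad_mse(1.5, BATCH)
accumulated = accumulated_grad(1.5, BATCH, micro_batch_size=2)
```

With equal-sized micro-batches and a mean loss, `full` and `accumulated` agree. The trade-off is more sequential steps per parameter update.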

Numerical errors:

--track_grad_norm
--precision=64

Look into normalization tracking for your architecture.
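A small sketch of the kind of check that gradient-norm tracking enables, in plain Python. `grad_norm` and `check_finite` are hypothetical helpers, and the gradient values are made up. Logging the norm at every step shows where it starts to blow up; the finiteness check fails fast on the first NaN or inf.

```python
import math


def grad_norm(grads):
    """L2 norm of a flat list of gradient values."""
    return math.sqrt(sum(g * g for g in grads))


def check_finite(name, values):
    """Raise as soon as any value is NaN or +/-inf, naming the offending indices."""
    bad = [i for i, v in enumerate(values) if not math.isfinite(v)]
    if bad:
        raise FloatingPointError(f"{name} has non-finite values at indices {bad}")


healthy = [3.0, 4.0]
exploding = [1e200, 1e200]         # squares overflow to inf in float64
poisoned = [0.1, float("nan")]

norms = [grad_norm(g) for g in (healthy, exploding, poisoned)]
```

The first norm is finite, the second overflows to inf, and the third is NaN. Calling `check_finite("grads", poisoned)` raises and points at index 1.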

Make it fast

--profiler
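A minimal profiling sketch using Python's standard-library `cProfile`. It is a generic stand-in, not what the `--profiler` flag itself runs, and `slow_sum_of_squares` is a made-up workload. The report lists functions sorted by cumulative time, which is usually enough to spot the bottleneck.

```python
import cProfile
import io
import pstats


def slow_sum_of_squares(n):
    """Made-up workload to profile."""
    return sum(i * i for i in range(n))


def profile_call(fn, *args):
    """Run fn(*args) under cProfile and return (result, text report)."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        result = fn(*args)
    finally:
        profiler.disable()
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(5)  # top 5 entries only
    return result, stream.getvalue()


result, report = profile_call(slow_sum_of_squares, 1000)
```

Printing `report` shows call counts and timings per function. For GPU training code, a framework-specific profiler is the better tool.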

Make it right

ML models are always wrong.

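They are never perfect in production.
If a non-zero loss counts as a partial failure, then our tests are always partially failing.

So it is impossible to truly “make it right”.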

Solve all your problems with “scale”:

Overfitting? Scale up data.
Under-fitting? Scale up model.
Distribution shift? Scale up both.
Can’t afford scale? Fine-tune a model trained at scale.


Notes - Lecture 03: Troubleshooting & Testing