Contents
- Introduction
- Class-wise self-knowledge distillation (CS-KD)
- Class-wise regularization
- Effects of class-wise regularization
- Experiments
- Classification accuracy
- References
Introduction
- To mitigate overfitting, the authors propose Class-wise self-knowledge distillation (CS-KD), which distills the predicted class probabilities of other samples from the same class into the model itself, so that the model produces more meaningful and more consistent predictions.
Class-wise self-knowledge distillation (CS-KD)
Class-wise regularization

- Class-wise regularization loss: it forces the predicted probability distributions of samples belonging to the same class to be close to each other, which amounts to distilling the model's own dark knowledge (i.e., the knowledge on wrong predictions):
$$\mathcal{L}_{\mathrm{cls}}(\mathbf{x}, \mathbf{x}'; \theta, T) := \mathrm{KL}\left(P(y \mid \mathbf{x}'; \tilde{\theta}, T) \,\|\, P(y \mid \mathbf{x}; \theta, T)\right)$$
where $\mathbf{x}, \mathbf{x}'$ are different samples belonging to the same class,
$$P(y \mid \mathbf{x} ; \theta, T)=\frac{\exp \left(f_y(\mathbf{x} ; \theta) / T\right)}{\sum_{i=1}^C \exp \left(f_i(\mathbf{x} ; \theta) / T\right)}$$
is the temperature-scaled softmax, and $T$ is the temperature parameter. Note that $\tilde{\theta}$ is a fixed copy of the parameters $\theta$: gradients are not back-propagated through $\tilde{\theta}$, which prevents model collapse (cf. Miyato et al.).
- Total training loss: the cross-entropy loss on $\mathbf{x}$ plus the class-wise regularization term weighted by a hyperparameter $\lambda_{\mathrm{cls}}$ (see the sketch after this list).
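
As a concrete illustration, below is a minimal PyTorch-style sketch of the class-wise regularization term and the total training loss. The function name `cs_kd_loss`, the way the same-class pair `(x, x_prime)` is provided, and the default values of `T` and `lamb_cls` are illustrative assumptions, not taken from the paper; the official implementation in the repository linked in the references may differ in details (e.g., how `x_prime` is sampled, or whether the KL term is rescaled by the temperature).

```python
import torch
import torch.nn.functional as F

def cs_kd_loss(model, x, x_prime, targets, T=4.0, lamb_cls=1.0):
    """Sketch of the CS-KD training loss (names and defaults are illustrative).

    x, x_prime : batches of inputs where x[i] and x_prime[i] share the label targets[i]
    T          : softmax temperature
    lamb_cls   : weight of the class-wise regularization term
    """
    logits = model(x)  # student branch: gradients flow through theta

    # "Teacher" logits from another sample of the same class.
    # torch.no_grad() plays the role of the fixed copy \tilde{theta}:
    # no gradient is back-propagated through this branch.
    with torch.no_grad():
        logits_prime = model(x_prime)

    # KL( P(y | x'; \tilde{theta}, T) || P(y | x; theta, T) )
    log_p_student = F.log_softmax(logits / T, dim=1)
    p_teacher = F.softmax(logits_prime / T, dim=1)
    l_cls = F.kl_div(log_p_student, p_teacher, reduction="batchmean")

    # Total loss: standard cross-entropy on x plus the class-wise term.
    l_ce = F.cross_entropy(logits, targets)
    return l_ce + lamb_cls * l_cls
```

In training, `x_prime[i]` would be drawn from the other training samples that share the label of `x[i]`, e.g., via a class-aware batch sampler.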


Effects of class-wise regularization
- Reducing the intra-class variations.
- Preventing overconfident predictions. CS-KD avoids overconfident predictions by using the predicted class distribution of other samples from the same class as soft labels; these soft labels are more 'realistic' than the ones produced by standard label smoothing (see the sketch below).
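
For reference, a minimal sketch in standard notation of the two kinds of soft targets being compared; the smoothing factor $\epsilon$ and the symbol $q$ are generic notation, not taken from the paper:
$$q^{\mathrm{LS}}_i = (1-\epsilon)\,\delta_{i,y} + \frac{\epsilon}{C} \quad \text{(label smoothing: uniform soft target)}$$
$$q^{\mathrm{CS\text{-}KD}}_i = P(i \mid \mathbf{x}'; \tilde{\theta}, T) \quad \text{(CS-KD: soft target from another sample of the same class)}$$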


Experiments
Classification accuracy
- Comparison with output regularization methods.

- Comparison with self-distillation methods.

- Evaluation on large-scale datasets.

- Compatibility with other regularization methods.


- Ablation study.
(1) Feature embedding analysis.
(2) Hierarchical image classification.

- Calibration effects.


References
- Yun, Sukmin, et al. “Regularizing class-wise predictions via self-knowledge distillation.” Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2020.
- code: https://github.com/alinlab/cs-kd