CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features

Author: Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, Youngjoon Yoo
Date: May 13, 2019
URL: https://arxiv.org/abs/1905.04899

Introduction

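CNNs are widely used for computer vision problems.
Techniques such as data augmentation and regularization are applied to train them efficiently and reach high accuracy.
Dropout and regional dropout are used to keep the network from focusing too much on (overfitting to) a specific part of the input.
Other methods, such as filling a region with zeros or noise, or reducing the informative pixels, have also improved performance. But CNNs are data-hungry, so the authors question whether it makes sense to throw information away.
The paper proposes CutMix, which cuts out part of an image and replaces it with a patch from another image.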

CutMix

Algorithm

Assume only two classes, A and B, exist.

$$(x, y): \text{Training image, label}$$ $$(A, B): \text{Training class}$$ $$(x_A, y_A), (x_B, y_B): \text{Training sample}$$

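Generate a binary mask (M) that marks which region to mix.
Obtain the mixing ratio lambda from the generated mask (the fraction of the image kept from $x_A$).
For the label, combine the two one-hot encodings with that ratio, so the label becomes the proportion of each class in the mixed image.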

$$\mathrm{M}: \text{Binary mask where to drop out and fill}$$ $$\lambda: \text{Combination ratio}$$ $$\tilde{x} = \mathrm{M} \odot x_A + (1 - \mathrm{M}) \odot x_B$$ $$\tilde{y} = \lambda y_A + (1 - \lambda) y_B$$
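The mixing step can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function name `cutmix_combine`, the channel-last image layout, and the choice to compute lambda as the mean of the mask are assumptions made for this example.

```python
import numpy as np

def cutmix_combine(x_a, x_b, y_a, y_b, mask):
    """Mix two samples with a binary mask M (1 keeps x_a, 0 takes x_b).

    x_a, x_b: images of shape (H, W, C); y_a, y_b: one-hot labels;
    mask: array of shape (H, W) with values in {0, 1}.
    """
    mask = mask.astype(x_a.dtype)
    # Assumption: lambda is the fraction of pixels kept from x_a.
    lam = float(mask.mean())
    m = mask[..., None]  # broadcast the mask over the channel axis
    x_mix = m * x_a + (1.0 - m) * x_b
    y_mix = lam * y_a + (1.0 - lam) * y_b
    return x_mix, y_mix, lam

# Toy example: a 4x4 image pair where a 2x2 corner comes from x_b.
H, W = 4, 4
x_a = np.ones((H, W, 3), dtype=np.float32)
x_b = np.zeros((H, W, 3), dtype=np.float32)
y_a = np.array([1.0, 0.0])
y_b = np.array([0.0, 1.0])
mask = np.ones((H, W))
mask[:2, :2] = 0  # 4 of 16 pixels are replaced by x_b

x_mix, y_mix, lam = cutmix_combine(x_a, x_b, y_a, y_b, mask)
```

In this toy case 12 of the 16 pixels come from $x_A$, so lambda is 0.75 and the mixed label is 0.75 for class A and 0.25 for class B.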

Extract the bounding-box coordinates (B) from M.
The x and y coordinates are drawn from a uniform distribution.
Crop the region B from $x_B$ and paste it onto the region of $x_A$ that B covers.

$$\mathrm{B}: \text{Bounding box coordinates } (r_x, r_y, r_w, r_h)$$ $$r_x \sim \text{Unif }(0, W), r_w = W\sqrt{1-\lambda},$$ $$r_y \sim \text{Unif } (0, H), r_h = H\sqrt{1-\lambda}$$
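With these sizes the box covers $r_w r_h / (WH) = 1 - \lambda$ of the image, which is why the label weight for $x_B$ is $1 - \lambda$. A minimal NumPy sketch of the sampling follows. Several details are assumptions made for this example rather than statements from the notes above: $(r_x, r_y)$ is treated as the box center, the box is clipped to the image border, lambda is drawn from Beta(1, 1) (uniform on (0, 1)), and the names `sample_bbox` and `box_to_mask` are hypothetical.

```python
import numpy as np

def sample_bbox(height, width, lam, rng):
    """Sample a box whose unclipped area is about (1 - lam) * height * width."""
    cut_ratio = np.sqrt(1.0 - lam)
    r_w = int(width * cut_ratio)
    r_h = int(height * cut_ratio)
    # Assumption: (r_x, r_y) is the box center, sampled uniformly.
    r_x = int(rng.integers(0, width))
    r_y = int(rng.integers(0, height))
    # Clip so the box stays inside the image.
    x1 = int(np.clip(r_x - r_w // 2, 0, width))
    x2 = int(np.clip(r_x + r_w // 2, 0, width))
    y1 = int(np.clip(r_y - r_h // 2, 0, height))
    y2 = int(np.clip(r_y + r_h // 2, 0, height))
    return x1, y1, x2, y2

def box_to_mask(height, width, box):
    """Binary mask M: 0 inside the box (filled from x_B), 1 elsewhere."""
    x1, y1, x2, y2 = box
    mask = np.ones((height, width), dtype=np.float32)
    mask[y1:y2, x1:x2] = 0.0
    return mask

rng = np.random.default_rng(0)
H, W = 32, 32
lam = rng.beta(1.0, 1.0)  # assumed Beta(1, 1) sampling
box = sample_bbox(H, W, lam, rng)
mask = box_to_mask(H, W, box)
# Clipping can shrink the box, so recompute lambda from the actual mask.
lam_adjusted = float(mask.mean())
```

Recomputing lambda from the clipped mask keeps the label weights consistent with the pixels that were actually replaced.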