Authors:
(1) Nora Schneider, Computer Science Department, ETH Zurich, Zurich, Switzerland ([email protected]);
(2) Shirin Goshtasbpour, Computer Science Department, ETH Zurich, Zurich, Switzerland and Swiss Data Science Center, Zurich, Switzerland ([email protected]);
(3) Fernando Perez-Cruz, Computer Science Department, ETH Zurich, Zurich, Switzerland and Swiss Data Science Center, Zurich, Switzerland ([email protected]).
Table of Links
2 Background
3.1 Comparison to C-Mixup and 3.2 Preserving nonlinear data structure
4 Experiments and 4.1 Linear synthetic data
4.2 Housing nonlinear regression
4.3 In-distribution Generalization
4.4 Out-of-distribution Robustness
5 Conclusion, Broader Impact, and References
A Additional information for Anchor Data Augmentation
Abstract
We propose a novel algorithm for data augmentation in nonlinear over-parametrized regression. Our algorithm borrows from the causality literature and extends the recently proposed Anchor Regression (AR) method to data augmentation, in contrast to current state-of-the-art domain-agnostic solutions, which rely on the Mixup literature. Our Anchor Data Augmentation (ADA) uses several replicas of the modified samples in AR to provide more training examples, leading to more robust regression predictions. We apply ADA to linear and nonlinear regression problems using neural networks. ADA is competitive with state-of-the-art C-Mixup solutions. [1]
1 Introduction
Data augmentation is one of the key ingredients of any successful application of a machine learning classifier. The first example that typically comes to mind is the in-depth description of the data augmentation in the now-famous AlexNet paper [26]. Data augmentation algorithms come in different flavors, and they mostly rely on the expectation that small perturbations, invariances, or symmetries applied to the input will not change the class label. That way, we can present ‘fresh new’ samples as alterations of the available examples for training. These transformations modify the input distribution to make the algorithm more robust in cases where the distribution of the test set differs from that of the training set. We refer the reader to the related work section (Section 2.1) for an overview and description of different data augmentation strategies.
The literature on data augmentation in regression is slim. The paper on Mixup augmentation [51] proposes a simple and general scheme for data augmentation using convex combinations of samples. The authors apply their proposal only to classification problems. They conjecture in the discussion that the application to regression is straightforward; however, this is not the case in practice. Mixup is theoretically analyzed in [5, 52] as a regularization technique for classification and regression problems, but it is illustrated only on classification problems.
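To make the "convex combination of samples" concrete, the following is a minimal sketch of a single Mixup step for a regression pair. Drawing the mixing weight from a Beta(alpha, alpha) distribution follows the usual Mixup recipe. The function name `mixup_pair` and its parameters are hypothetical and chosen only for illustration.

```python
import random


def mixup_pair(x_i, y_i, x_j, y_j, alpha=1.0, rng=random):
    """Return a convex combination of two (input, target) pairs.

    `alpha` is the Beta-distribution parameter (an illustrative default).
    """
    lam = rng.betavariate(alpha, alpha)  # mixing weight in [0, 1]
    # Mix inputs coordinate-wise and mix the scalar targets with the same weight.
    x_mix = [lam * a + (1.0 - lam) * b for a, b in zip(x_i, x_j)]
    y_mix = lam * y_i + (1.0 - lam) * y_j
    return x_mix, y_mix, lam
```

Because the inputs and the target are mixed with the same weight, the synthetic target is a faithful label only if the regression function is close to linear between the two inputs. This is one way to see why a blind application to regression can be problematic.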
The Mixup algorithm has been extended to regression problems in [18, 49], in which the authors explain that Mixup cannot be blindly applied to regression problems. To our knowledge, these are the only two papers in which data augmentation for regression is proposed. RegMix [18] relies on a hard-to-train prior neural network controller before augmenting the data using a Mixup strategy. C-Mixup [49], a more recently proposed method, solves some of the issues limiting the standard Mixup algorithm for regression problems. The authors propose to mix only samples that are nearby in the output space (i.e., samples whose labels are close enough). This strategy is only valid when the target variables are monotonic with the input and is applied in a transformed space. The authors present comprehensive results in data augmentation for in-distribution generalization, task generalization and out-of-distribution robustness.
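The idea of mixing only samples with nearby labels can be sketched as a partner-sampling step. This illustration weights candidate partners with a Gaussian kernel on label distance. The kernel choice, the `bandwidth` parameter, the exclusion of the sample itself, and the name `sample_partner` are assumptions made here, not details taken from [49].

```python
import math
import random


def sample_partner(i, targets, bandwidth=1.0, rng=random):
    """Pick a mixing partner for sample i, favouring samples with similar labels."""
    weights = [
        # Exclude the sample itself; weight others by closeness in label space.
        0.0 if j == i else math.exp(-((targets[i] - y) ** 2) / (2.0 * bandwidth ** 2))
        for j, y in enumerate(targets)
    ]
    # random.choices draws an index with probability proportional to its weight.
    return rng.choices(range(len(targets)), weights=weights, k=1)[0]
```

With a small bandwidth, samples with distant labels are effectively never selected, so the mixed target stays close to both original targets.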
In this paper, we rely on the causality literature to provide a different avenue for augmenting data in regression problems. Causal discovery finds the causes of a response variable among a given set of observations or helps to recognize the causal relations between a set of variables [39]. These causes allow us to understand how these relations would change if we were to intervene on a subset of the (input) variables, or what the effect on the output would be. As a result, a regression model built on these causes is, in general, robust to perturbations in the input variables, making its predictions less sensitive to changes in the distribution of the test set. For example, the authors in [40] use the invariance property for prediction to perform causal inference. In turn, Anchor Regression (AR) builds upon the causality literature to obtain robust regression solutions when the input variables have been perturbed [42]. The procedure relies on anchor variables capturing the heterogeneity within a dataset and a parameter γ that measures the deviation with respect to the least-squares solution. Once the values of the anchors are known, AR modifies the data and obtains the least-squares solution, as detailed in Section 2.2.
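For orientation, the role of γ can be written out using the finite-sample AR objective as it is commonly stated in the AR literature [42]. The symbols below are not defined in this section and are introduced here as assumptions: $A$ is the matrix of anchor values, $\Pi_A$ is the orthogonal projection onto the column space of $A$, and $W_\gamma$ is the resulting data transformation. Section 2.2 remains the authoritative statement.

```latex
\begin{align}
b^{\gamma} &= \arg\min_{b}\; \lVert (I - \Pi_A)(Y - Xb) \rVert_2^2
              + \gamma\, \lVert \Pi_A (Y - Xb) \rVert_2^2 \\
           &= \arg\min_{b}\; \lVert W_\gamma (Y - Xb) \rVert_2^2,
              \qquad W_\gamma = I - (1 - \sqrt{\gamma})\,\Pi_A .
\end{align}
```

The second line follows because $W_\gamma r = (I - \Pi_A) r + \sqrt{\gamma}\,\Pi_A r$ splits any residual $r$ into two orthogonal parts. In this form, AR is ordinary least squares on the modified data $W_\gamma X$ and $W_\gamma Y$, and $\gamma = 1$ gives $W_\gamma = I$, which recovers the least-squares solution.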
In this paper, we propose Anchor Data Augmentation (ADA) to augment the training dataset with several replicas of the available data. We use a simple clustering of the data to encode homogeneous groups of observations and use different values of γ to robustify the solution to different strengths of potential distribution shifts. In every minibatch, we sample γ from a predetermined range around γ = 1. As AR was developed for linear regression, the data augmentation strategy needs to be modified accordingly for nonlinear regression. We validate ADA for in-distribution generalization and out-of-distribution robustness under the same conditions proposed in C-Mixup [49], as well as on some illustrative linear and nonlinear regression examples. In the replicated experiments, ADA is competitive with or superior to other augmentation strategies such as C-Mixup, although on some datasets the performance gain is marginal.
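The following sketch illustrates the basic mechanics under stated assumptions, and it is not the authors' implementation. It assumes that cluster labels are already available (for example from k-means) and that anchors are one-hot cluster indicators, in which case projecting onto the anchors replaces each row by its cluster mean. It applies the linear AR-style map v → v + (√γ − 1)·cluster_mean(v) to both inputs and targets, and it draws γ uniformly from a hypothetical range. The modification for nonlinear regression mentioned above is not shown.

```python
import math
import random


def cluster_means(rows, labels):
    """Per-cluster mean of `rows` (a list of equal-length lists of floats)."""
    sums, counts = {}, {}
    for row, lab in zip(rows, labels):
        prev = sums.get(lab, [0.0] * len(row))
        sums[lab] = [s + v for s, v in zip(prev, row)]
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: [s / counts[lab] for s in sums[lab]] for lab in sums}


def anchor_augment(X, y, labels, gamma):
    """Shift every sample along its cluster mean by a factor sqrt(gamma) - 1."""
    shift = math.sqrt(gamma) - 1.0
    x_means = cluster_means(X, labels)
    y_means = cluster_means([[t] for t in y], labels)
    X_aug = [
        [v + shift * m for v, m in zip(row, x_means[lab])]
        for row, lab in zip(X, labels)
    ]
    y_aug = [t + shift * y_means[lab][0] for t, lab in zip(y, labels)]
    return X_aug, y_aug


def sample_gamma(low=0.5, high=2.0, rng=random):
    """Draw gamma from a range around 1 (the bounds are illustrative assumptions)."""
    return rng.uniform(low, high)
```

In a training loop, one would call `sample_gamma()` once per minibatch and train on the output of `anchor_augment`. With γ = 1 the data is returned unchanged, values above 1 push samples further along their cluster mean, and values below 1 pull them back.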
The rest of the paper is organized as follows: First, we provide background information in Section 2. We give a brief overview of related work on data augmentation in Section 2.1 and summarize the key concepts on Anchor Regression in Section 2.2. Second, Section 3 shows how we extend Anchor Regression and introduces ADA. Section 4 reports empirical evidence that our approach can improve predictions, especially in over-parameterized settings. We conclude the paper in Section 5.
This paper is available on arXiv under the CC0 1.0 DEED license.
[1] Our Python implementation of ADA is available at: https://github.com/noraschneider/anchordataaugmentation/