year:
paper: https://arxiv.org/pdf/2008.07545
website:
code:
connections: whitening, second order optimization, generalization


Finish reading paper

Focus on theoretical insights by creating random tasks:

In this work, we take an intermediate step by augmenting existing datasets, in effect increasing the breadth of the task distribution based on existing task regularities. We generate a large number of tasks by taking existing supervised learning datasets, randomly projecting their inputs and permuting their classification labels. While the random projection removes spatial structure from the inputs, this structure is not believed to be central to the task (for instance, the performance of SGD-trained fully connected networks is invariant to projection by a random orthogonal matrix (Wadia et al., 2021)). Task augmentation allows us to investigate fundamental questions about learning-to-learn in the regime of many tasks without relying on huge amounts of existing tasks or elaborate schemes to generate those.
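The task-augmentation recipe above (random orthogonal projection of inputs plus label permutation) can be sketched in a few lines of numpy. This is my own illustrative sketch, not the paper's code; the function name and API are assumptions.

```python
import numpy as np

def augment_task(X, y, num_classes, seed=0):
    """Derive a new task from an existing dataset: project inputs by a
    random orthogonal matrix and consistently permute the class labels.
    Illustrative sketch only -- not the paper's implementation."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # Random orthogonal matrix via QR decomposition of a Gaussian matrix.
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    X_new = X @ Q                 # removes spatial structure, preserves norms
    perm = rng.permutation(num_classes)
    y_new = perm[y]               # relabel classes with a random bijection
    return X_new, y_new

# Example: two distinct tasks derived from one small synthetic dataset.
X = np.random.default_rng(1).standard_normal((5, 8))
y = np.array([0, 1, 2, 1, 0])
X_a, y_a = augment_task(X, y, num_classes=3, seed=1)
X_b, y_b = augment_task(X, y, num_classes=3, seed=2)
```

Because the projection is orthogonal, sample norms (and pairwise distances) are preserved, which is why a fully connected network's trainability is unaffected while the spatial structure is scrambled.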

Link to original

Whitening in ML

Whitening is a data preprocessing step that removes correlations between input features. It is used across many scientific disciplines, including geology, physics, machine learning, linguistics, and chemistry. It has a particularly rich history in neuroscience, where it has been proposed as a mechanism by which biological vision realizes Barlow's redundancy reduction hypothesis.

Whitening is often recommended since, by standardizing the variances in each direction in feature space, it typically speeds up the convergence of learning algorithms, and causes models to better capture contributions from low variance feature directions. Whitening can also encourage models to focus on more fundamental higher-order statistics in data, by removing second-order statistics. Whitening has further been a direct inspiration for deep learning techniques such as batch normalization and dynamical isometry.
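A minimal PCA-whitening sketch in numpy (a generic textbook version, not code from the paper): after the transform, the empirical covariance of the features is approximately the identity, i.e. variances are standardized and correlations removed.

```python
import numpy as np

def whiten(X, eps=1e-12):
    """PCA-whiten a data matrix X (rows = samples). Generic sketch."""
    Xc = X - X.mean(axis=0)                  # center each feature
    cov = Xc.T @ Xc / Xc.shape[0]            # d x d feature covariance
    eigvals, eigvecs = np.linalg.eigh(cov)
    W = eigvecs / np.sqrt(eigvals + eps)     # columns scaled by 1/sqrt(eigval)
    return Xc @ W

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 4)) @ rng.standard_normal((4, 4))  # correlated features
Xw = whiten(X)
cov_w = Xw.T @ Xw / Xw.shape[0]   # approximately the 4 x 4 identity
```

Note that removing the second-order statistics this way is exactly what the argument below says can destroy usable information.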

Our argument proceeds in two parts: First, we prove that when a model with a fully connected first layer whose weights are initialized isotropically is trained with either gradient descent or stochastic gradient descent (SGD), information in the data covariance matrix is the only information that can be used to generalize. This result is agnostic to the choice of loss function and to the architecture of the model after the first layer. Second, we show that whitening always destroys information in the data covariance matrix.

Whitening the data and then training with gradient descent or SGD therefore results in either diminished or nonexistent generalization properties compared to the same model trained on unwhitened data. The seriousness of the effect varies with the difference between the number of datapoints and the number of features, worsening as this difference gets smaller.

Empirically, we find that this effect holds even when the first layer is not fully connected and when its weight initialization is not isotropic - for example, in a convolutional network trained from a Xavier initialization.
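The extreme case is easy to check numerically. In the regime with no more samples than features (n ≤ d), whitening with the pseudo-inverse square root of the second moment matrix maps *any* dataset to one whose Gram matrix is n·I, so no dataset-specific second-order information survives. This is my own numpy illustration of that degenerate case, not the paper's code.

```python
import numpy as np

def whiten_pseudo(X):
    """Whiten via the SVD: X V diag(sqrt(n)/S) V^T == sqrt(n) * U @ Vt.
    Equivalent to multiplying by the pseudo-inverse square root of X^T X / n."""
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    return np.sqrt(X.shape[0]) * U @ Vt

rng = np.random.default_rng(0)
n, d = 10, 50                                 # fewer samples than features

Xw1 = whiten_pseudo(rng.standard_normal((n, d)))
Xw2 = whiten_pseudo(rng.standard_normal((n, d)) * 5.0)   # very different data
G1, G2 = Xw1 @ Xw1.T, Xw2 @ Xw2.T             # both equal n * I
```

Two unrelated datasets become indistinguishable at the level of inner products between samples, which is the sense in which whitening "destroys information" here.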

Second Moment Matrix

The second moment matrix captures the pairwise relationships between the features of a dataset. For a data matrix $X$ where each row is a sample, it is computed as:

$$M = \frac{1}{n} X^\top X$$

For $n$ samples with $d$ features:

  • $X$ is an $n \times d$ matrix
  • The resulting second moment matrix $M$ is $d \times d$
  • Entry $M_{jk}$ is the (scaled) dot product between feature columns $j$ and $k$, and the diagonal entries represent the second moment of each feature.
  • This is just the uncentered version of the covariance matrix.
Link to original
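A tiny worked example of the second moment matrix and its relation to the covariance matrix (my own illustration; the normalization by n follows the uncentered-covariance reading above):

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])          # n = 3 samples, d = 2 features
n = X.shape[0]

M = X.T @ X / n                     # d x d second moment matrix
cov = np.cov(X, rowvar=False, bias=True)   # centered counterpart
mean = X.mean(axis=0)

# Uncentered second moment = covariance + outer product of the mean:
#   E[x x^T] = Cov[x] + E[x] E[x]^T
assert np.allclose(M, cov + np.outer(mean, mean))
```

The identity in the last line is why the second moment matrix reduces to the covariance matrix exactly when the data is mean-centered.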