URL Source: https://arxiv.org/html/2301.13104
Equivariant Differentially Private Deep Learning:
Why DP-SGD Needs Sparser Models
Florian A. Hölzl, Daniel Rueckert, Georgios Kaissis
Institute for Artificial Intelligence in Medicine, Technical University of Munich
{florian.hoelzl, daniel.rueckert, g.kaissis}@tum.de
Abstract
Differentially Private Stochastic Gradient Descent (DP-SGD) limits the amount of private information deep learning models can memorize during training. This is achieved by clipping and adding noise to the model’s gradients, and thus networks with more parameters require proportionally stronger perturbation. As a result, large models have difficulties learning useful information, rendering training with DP-SGD exceedingly difficult on more challenging tasks. Recent research has focused on combating this challenge through training adaptations such as heavy data augmentation and large batch sizes. However, these techniques further increase the computational overhead of DP-SGD and reduce its practical applicability. In this work, we propose using the principle of sparse model design to solve precisely such complex tasks with fewer parameters, higher accuracy, and in less time, thus serving as a promising direction for DP-SGD. We achieve such sparsity by design by introducing equivariant convolutional networks for model training with Differential Privacy (DP). Using equivariant networks, we show that small and efficient architecture design can outperform current state-of-the-art (SOTA) with substantially lower computational requirements. On CIFAR-10, we achieve an increase of up to 9% in accuracy while reducing the computation time by more than 85%. Our results are a step towards efficient model architectures that make optimal use of their parameters and bridge the privacy-utility gap between private and non-private deep learning for computer vision.
Figure 1: Sparsity by design indicates that fewer overall parameters are required to learn the same representations from an input image. Conventional CNNs achieve this by applying convolutions with small kernels across the image instead of flattening and densely connecting all pixels. Equivariant convolutions introduce additional transformations on the kernel, allowing for the detection of features independent of their pose, e.g. rotation and/or reflection. As a result, equivariant networks need even fewer parameters to learn the same information and are thus even sparser by design. Image taken from the ImageNet dataset.
1 Introduction and Related Work
Artificial Intelligence is increasingly applied to fields where extremely sensitive data is used, such as medicine or the social sciences. Previous research has demonstrated that sensitive information can be reverse-engineered from unprotected machine learning models [1, 2]. This renders privacy concerns a major hurdle to developing and deploying machine learning systems in fields where the protection of sensitive data is necessary, e.g. due to privacy regulations, intellectual property requirements, or other ethical considerations, and necessitates steps to prevent the exposure of such data to unauthorized third parties.
Privacy-enhancing technologies allow one to derive insights from sensitive datasets while quantitatively bounding the risk of information leakage about the training samples. By giving formal guarantees on data protection, they represent the best chance to date to incentivize data sharing in an ethical and responsible manner. Differential Privacy (DP) [3], the gold-standard technique for privacy preservation, is most commonly applied in the field of deep learning by utilizing DP-SGD [4], which imposes an upper bound on how much information can be extracted from the model’s weights about the individuals whose data was used for training. This guarantee is achieved during model training through a modified optimization algorithm, which includes clipping the gradient vector based on a predefined threshold of its ℓ₂-norm and then adding noise to it (see Sec. 2). Such a gradient is then considered privatized. A predictive model trained with privatized gradients is subsequently private as well, allowing analysts to share the model while retaining the ability to bound the capacity of adversaries to derive information from the data used for training.
However, using gradients privatized with DP-SGD often leads to a sharp reduction in prediction performance. This problem is called the privacy-utility trade-off and is due to two main factors: On one hand, clipping the gradient diminishes its information content and biases its direction; moreover, the total noise magnitude scales proportionally to the ℓ₂-norm of the gradient, and thus to the number of model parameters. This leads to a decrease in the signal-to-noise ratio, in that the true gradient’s signal is diminished relative to the noise introduced by DP-SGD. Consequently, the effective training of large models, typically used to achieve SOTA results from scratch in non-private training, is rendered disproportionately difficult and falls far short of attaining comparable results to non-private training. In this paper, we propose to address the privacy-utility trade-off by introducing sparse model design through Equivariant CNN (ECNN) architectures.
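The clip-and-noise step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the helper names (`clip_l2`, `privatize`) and the flat-list gradient representation are our own simplifications.

```python
import math
import random

def clip_l2(grad, max_norm):
    """Scale one per-sample gradient so its l2-norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / (norm + 1e-12))
    return [g * scale for g in grad]

def privatize(per_sample_grads, max_norm, noise_multiplier, rng=random):
    """Clip each per-sample gradient, sum, add Gaussian noise, and average."""
    n = len(per_sample_grads)
    dim = len(per_sample_grads[0])
    clipped = [clip_l2(g, max_norm) for g in per_sample_grads]
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    # the noise std is fixed per coordinate, so the total noise power
    # sigma^2 * dim grows linearly with the number of model parameters
    sigma = noise_multiplier * max_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [x / n for x in noisy]
```

The per-coordinate noise scale is independent of the model size, which is exactly why the total noise power, and hence the signal-to-noise penalty, grows with the number of parameters.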
1.1 Computational Considerations of DP-SGD
Recent advancements in DP deep learning have focused largely on optimizing the training regime of larger models to mitigate the previously mentioned difficulties. The current SOTA for training CIFAR-10 from scratch was recently established by [5] and incrementally improved by [6]. These works’ techniques focus on extensive training adaptations (which likely have an effect on the smoothness of the optimisation landscape) to improve prediction performance. Among the most important adaptations are (1) large-batch training [7], (2) averaging per-sample gradients across multiple augmentations of the same image before clipping (augmentation multiplicity) [8, 9], and (3) temporal parameter averaging techniques [10]. This results in remarkable accuracy gains over the previous SOTA on over-parameterized models, which until recently seemed out of reach for models such as the WideResNet (WRN) that incorporate several million parameters. However, the accuracy gains presented by the authors of the aforementioned work come at a high cost: the massive computational resources and time required to train those models have an overbearing financial (and environmental) impact and impose great difficulties in reproducing the results. As a result, further progress in this direction is cumbersome and inefficient, and renders proposing improvements on top of the presented results out of reach, especially for scientific institutions without access to large-scale computational resources. In summary, current SOTA models trained with DP-SGD (1) in part still substantially lag behind their non-private counterparts and (2) have an impracticable computational burden that makes research improvements difficult.
Orthogonal to the aforementioned works, other studies in DP deep learning have introduced training regimes which additionally leverage publicly available data, which is ostensibly usable without any privacy constraints [11, 12]. By starting from a well-performing non-private base model, fine-tuning on private datasets with DP-SGD can lead to promising results close to non-private training from scratch. However, Tramèr et al. question whether using public data for pre-training should actually be considered differential-privacy-preserving, as pre-trained models can leak private information contained within the public dataset and thus negatively impact the public perception of the field [13]. Relying on public data in many cases thus opposes the very foundation of privacy-enhancing technologies, i.e. obtaining access to important insights while minimizing access to the sensitive data required. Even more importantly for practical application, relying on pre-training as a solution is not a panacea. Large quantities of public data are not available in areas such as medicine, where data, at least from the same distribution, still comes with the same privacy considerations and (even when available in sufficient quantity) is difficult to access. Not only has the importance of in-distribution data been shown in previous works [14, 15], but relying on pre-training alone increases the barrier to extending DP-SGD to new modalities and novel prediction tasks. Pre-training and its positive impact showcased on benchmark datasets is thus an interesting observation, but will not solve the previously mentioned challenges that currently hold back DP-SGD from widespread adoption in practice.
1.2 Motivation
We thus contend that an optimal solution to the aforementioned dilemmas will require a fundamental reconsideration, which will marry high prediction performance when training from scratch with high computational efficiency. The guiding hypothesis of our work is that networks exhibiting designed sparsity can achieve the aforementioned goal. As we demonstrate in this work, the notion of designed sparsity, introduced by [16], combines two major characteristics beneficial for DP-SGD: (1) increased representational efficiency and (2) a reduced set of (possibly redundant) model parameters. The requirement for DP models to learn high-quality features to achieve parity with non-private models has been previously discussed [17]. However, the aforementioned work utilises a cascade of static feature extractors, which lack the flexibility of their learned counterparts and whose capability to scale to more complex problems remains an open research question. In contrast, we show that networks that are sparse by design remain trainable with a similar or higher degree of expressiveness, but with fewer parameters than comparable deep learning models. In fact, even conventional CNNs are an example of such networks, since a single kernel is used to compute the features of a whole image. They inherently exhibit parameter sharing with respect to translations and can thus be seen as a sparse version of a fully connected layer. We argue that, through additional improvements in model architectures that lead to higher designed sparsity, networks can learn features more efficiently, offering a promising direction for DP-SGD. To evaluate this hypothesis, we introduce ECNNs for DP-SGD. As shown in Fig. 1, ECNNs further increase parameter sharing by being equivariant to transformations such as rotations and reflections. They thus offer an even higher degree of designed sparsity and a possible solution for efficient training under DP.
While rotational equivariance can be approximated with no formal guarantee by conventional (non-equivariant) CNNs through an increase in model width, dataset size and augmentation techniques, i.e. the exact techniques mentioned earlier, this approximation has two important drawbacks. They (1) massively increase the computation time of the (already very demanding) DP-SGD, as e.g. each additional set of simultaneous augmentations increases the time complexity almost linearly. And (2), naïvely scaling the number of parameters proportionally increases the total noise power of the added Gaussian noise, thus risking “drowning out” the learning signal. Network layers equivariant to rotation and reflection transformations preserve the relative pose of features in addition to the translational equivariance preserved by standard convolutions. The resulting additional parameter sharing thus avoids the redundant learning of identical convolutional filters for multiple poses. In addition to their high parameter efficiency, ECNN architectures are known to exhibit increased data efficiency and improved generalization, especially in domains with high degrees of intra-image symmetry [18, 19]. So far, however, no works have investigated how to combine equivariant layers with DP training nor analyzed the potential beneficial changes to the training regime, even though their characteristics render them highly attractive for this use-case.
In this work, we show the need for sparse model architectures under DP-SGD, and that ECNNs are a promising step in this direction. Their higher designed sparsity compared to standard CNNs allows them to outperform larger models while simultaneously decreasing the required computation time. As we will demonstrate, this renders them especially interesting for training with DP-SGD, particularly under tighter privacy bounds and low-data regimes, both desirable traits for privacy-preserving techniques.
Our main contributions are summarized as follows:
• We introduce the methodology necessary to train ECNNs with DP-SGD in Sec. 3. As part of this contribution, we propose novel normalization layers for discrete and continuous symmetry groups that preserve the equivariance property and fulfill the DP condition.
• By leveraging the notion of designed sparsity, we substantially improve the current SOTA on DP-SGD imaging benchmarks without additional data in Sec. 4.2. We experimentally demonstrate ECNNs as a promising architecture that satisfies this concept by offering improved results while requiring fewer parameters than a conventional CNN. Among others, we show an increase of 9.2% under (2, 10⁻⁵)-DP on CIFAR-10 and an increase of more than 10% on Tiny-ImageNet-200.
• We provide insights into the model calibration of our approach, since poor calibration is a known weakness of DP-SGD. We find that the proposed equivariant architecture improves model calibration with an on average 17% lower Brier score across all evaluated datasets compared to the conventional network.
• We experimentally show that equivariant networks are more robust to key hyperparameter choices from recent SOTA results, in particular augmentations in the input domain and batch size. Additionally, we analyze how equivariant-specific hyperparameter choices, such as the symmetry groups, affect training with DP-SGD in Sec. 5.1.
2 Background
2.1 Differential Privacy
At its core, differential privacy provides a way to answer questions about a dataset while limiting the risk of revealing sensitive information about specific samples in the dataset. It achieves this by introducing controlled randomness into the data analysis process. In other words, DP is a strong stability condition on randomised algorithms mandating that outputs are approximately invariant under inclusion or exclusion of a single individual from the input database. For a mechanism (randomized algorithm) ℳ, all datasets D and D′ that differ in one element, and all measurable subsets S of the range of ℳ, (ε, δ)-DP requires that:
Pr(ℳ(D) ∈ S) ≤ e^ε Pr(ℳ(D′) ∈ S) + δ, (1)
where the privacy guarantees of the algorithm are parameterised by ε ≥ 0 and δ ∈ [0, 1). This privacy constraint is in practice typically realised by the addition of noise. For a comprehensive overview of DP, we refer to [3]. Its application to deep learning came with the introduction of DP-SGD by [4]. In our work, we utilise Gaussian noise and the aforementioned DP-SGD algorithm to privatise gradient updates in neural network training. We use Rényi-DP accounting [20, 21] to track the privacy loss throughout the training. Rényi-DP (RDP) is often used when training deep neural networks as it massively facilitates the composition of sequences of private algorithms executed on sub-samples of a larger dataset, such as in SGD, where we iteratively update the model parameters using randomly selected subsamples of the training data. The RDP privacy condition is:
D_α(ℳ(D) ‖ ℳ(D′)) ≤ ρ, (2)
where D_α is the Rényi divergence of order α ≥ 1. We note that the Rényi divergence is symmetric in the Gaussian noise setting, as the DP guarantee is required to be symmetric. RDP can be converted to (ε, δ)-DP for a given δ. In the rest of this paper, we will refer to the converted ε̄ simply as ε. We note that we refer to “sampling” rather than “mini-batches” in DP-SGD, since privacy amplification by sampling requires subsets of the training set to be drawn using e.g. a Poisson sampling technique [4].
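The conversion and composition behind such accounting can be sketched as follows. This is a simplified illustration that omits privacy amplification by subsampling; the function names and the grid of orders are our own choices, and production accountants (e.g. in Opacus) track the subsampled Gaussian mechanism much more tightly.

```python
import math

def gaussian_rdp(alpha, sigma, sensitivity=1.0):
    """RDP parameter rho of the Gaussian mechanism at order alpha."""
    return alpha * sensitivity ** 2 / (2.0 * sigma ** 2)

def rdp_to_dp(rho, alpha, delta):
    """Standard conversion: (alpha, rho)-RDP implies (eps, delta)-DP
    with eps = rho + log(1/delta) / (alpha - 1)."""
    return rho + math.log(1.0 / delta) / (alpha - 1.0)

def eps_for_steps(steps, sigma, delta, alphas=range(2, 128)):
    """Compose `steps` Gaussian mechanisms (RDP parameters simply add up),
    then take the tightest (eps, delta) conversion over a grid of orders."""
    return min(rdp_to_dp(steps * gaussian_rdp(a, sigma), a, delta)
               for a in alphas)
```

The additive composition in RDP is what makes iterative training tractable: the privacy loss of T noisy gradient steps is just T times the per-step RDP cost before a single final conversion to (ε, δ).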
2.2 Equivariant CNNs
Equivariance describes the mathematical property of a structure-preserving mapping. This means that there exist two transformations T_g and T′_g that lead to the same result when applying a mapping φ to an input x such that:
φ(T_g x) = T′_g φ(x). (3)
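Eq. 3 can be checked numerically in the simplest non-trivial setting: the cyclic group C₄ acting by cyclic shifts (its regular representation), with φ a cross-correlation over the group. Equivariance then means shifting the input shifts the output identically. The helper names below are our own; this is a sketch, not the paper's code.

```python
def shift(v, s):
    """Regular representation of C_N on a signal over the group:
    a cyclic shift by s positions (this plays the role of T_g)."""
    n = len(v)
    return [v[(i - s) % n] for i in range(n)]

def group_corr(x, k):
    """Cross-correlation over the cyclic group C_N: the simplest group
    convolution (this plays the role of phi). Its weight sharing is
    exactly what makes it shift-equivariant."""
    n = len(x)
    return [sum(x[(g + h) % n] * k[h] for h in range(n)) for g in range(n)]
```

For any kernel k and any group element s, `group_corr(shift(x, s), k)` equals `shift(group_corr(x, k), s)`, which is Eq. 3 with T_g = T′_g = `shift`.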
For image classification, this formal guarantee was first introduced for rotations and reflections in Regular Group CNNs by [18] and later extended to ECNNs with steerable filters [19, 22, 23]. The general approach to equivariant convolutions is centered around the idea of representations, describing the transformation laws of a given feature space. The feature fields in this space are mappings that transform according to the corresponding representation. Each layer’s input and output space must be compatible with the corresponding transformation law. In order to guarantee this behavior, a convolution kernel K is subject to a linear constraint, given by
K(g · x) = ρ_out(g) K(x) ρ_in(g)⁻¹  ∀ g ∈ G, x ∈ ℝⁿ, (4)
with group actions g ∈ G, depending on the associated group representations ρ. For our case (planar images), we focused on representations where the group actions are rotations and reflections acting on the parameters of a learned kernel, i.e. the group elements. These group actions can be discrete, with the number of rotations typically denoted as N, or continuous in SO(2) or O(2). Finite groups are commonly represented through regular representations, where the corresponding transformation matrix has dimensionality equal to the order of the group, e.g. ℝ^N. There are different ways to approximate continuous groups to work with steerable convolutions. We follow the results of [24], where the SO(2) group has been shown to give the most promising results of all continuous groups for non-private planar image classification. The performance of the SO(2) group is evaluated using irreducible representations. An irreducible representation ψ of a group is a representation that cannot be reduced or decomposed into smaller independent representations. Formally, for our group G and the feature space Y, an irreducible representation of G on Y is a linear map ψ: G ↦ GL(Y) to the general linear group GL that preserves the group structure. In our case, the kernel K is a linear map, with its basis given by breaking down a group representation into a direct sum of irreducible representations.
This is called an irreducible representation decomposition and allows us to construct an equivariant map between pairs of irreducible representations, i.e. ρ_in and ρ_out. To solve the corresponding kernel constraint in our equivariant convolutions, we employ the general solution from [25].
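In the special case where ρ_in and ρ_out are both trivial, Eq. 4 reduces to K(g · x) = K(x), i.e. a rotation-invariant kernel, and the constraint can be solved for C₄ simply by averaging a kernel over its rotation orbit. The sketch below illustrates this one special case with our own helper names; it is not the general solution of [25].

```python
def rot90(m):
    """Rotate a square kernel by 90 degrees (the generator of C4
    acting on the kernel's spatial support)."""
    n = len(m)
    return [[m[n - 1 - c][r] for c in range(n)] for r in range(n)]

def c4_project(k):
    """Project a kernel onto the C4-invariant subspace by averaging over
    its rotation orbit, i.e. solve K(g.x) = K(x) for trivial rho_in/rho_out."""
    rots = [k]
    for _ in range(3):
        rots.append(rot90(rots[-1]))
    n = len(k)
    return [[sum(r[i][j] for r in rots) / 4.0 for j in range(n)]
            for i in range(n)]
```

The projected kernel is (up to floating-point error) fixed by a further 90-degree rotation, so it satisfies the reduced constraint by construction.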
3 Methodology
In this section, we describe our method and introduce novel layers that allow training ECNNs with DP-SGD. The proposed layers preserve the notion of orientation in our ECNNs without violating the DP constraint. Maintaining the equivariance property, not only within a single layer but throughout the whole network, is necessary to fully leverage the additional feature information during training. Our approach is based on the equivariant frameworks established by [26, 25].
3.1 Convolution Layers
To satisfy the kernel constraint in Eq. 4, we define the transformation law of each convolution by its input and output representations ρ_in and ρ_out. For the roto-translational ECNNs proposed in this work, the representations correspond to rotation matrices of a specific symmetry group (i.e. C_N, D_N or SO(2)). The resulting transformation law of the feature space is a constant linear mapping that only has to be computed once during initialization. While lower-level features intrinsically exhibit a higher degree of symmetry, natural images often have a sense of orientation at a global level. To account for this varying level of equivariance, we choose an initial symmetry group and restrict ρ_in and ρ_out for the last residual block. The order of the finite groups is reduced to N/2, with N being the group order, for the last residual block. When choosing SO(2) as the initial symmetry group, we restrict the representations in the last residual block to be invariant. Additionally, we adjust the number of channels for each convolutional layer, such that our equivariant networks have a similar number of parameters as their non-equivariant counterparts.
This is done by multiplying the number of channels by √(1.5N)/N to end up with a similar number of parameters for all symmetry groups. Moreover, to improve signal propagation, we apply weight standardization in the convolutional layer [27] and switch the order of the normalization and activation layers as proposed in [6].
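As a back-of-the-envelope check of this scaling, assume (our own simplified parameter model, not the paper's exact counting) that a regular-representation layer stores one shared k×k filter per group element and per pair of input/output fields, so its parameter count scales as N · c_in · c_out · k². Scaling channels by √(1.5N)/N then keeps the per-layer budget roughly constant across group orders:

```python
import math

def scaled_channels(base_channels, group_order):
    """Scale channel counts by sqrt(1.5*N)/N so regular-representation
    layers of different group orders N get comparable parameter budgets."""
    factor = math.sqrt(1.5 * group_order) / group_order
    return max(1, round(base_channels * factor))

def regular_conv_params(c_in, c_out, group_order, kernel_size=3):
    """Rough per-layer parameter count under the assumption of one shared
    k x k filter per (group element, input field, output field) triple."""
    return c_in * c_out * group_order * kernel_size ** 2
```

Under this model the scaled layer lands near 1.5× the baseline c² · k² count for every N, so different symmetry groups compete on a roughly equal parameter footing.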
3.2 Normalization and Nonlinearities
For DP, we are required to compute per-sample gradients, which are incompatible with the batch normalization typically used in the ECNN literature. We thus propose two novel equivariant normalization layers for DP-ECNN training. Firstly, for trivial and regular representations used for the discrete groups C_N and D_N, we introduce the (naïve) equivariant group normalisation layer. The additional pose information is maintained throughout the network by increasing the channel dimension C of the feature vector to include the fields of each representation. Before normalization, we split the channel dimension C of our 4-dimensional feature vector such that the representation fields of each channel are in a separate dimension. The resulting feature vector (N, Y, C, W, H), with the spatial dimensions H, W and the batch axis N, is used for normalization. Each feature field y ∈ [1, Y], corresponding to a representation’s transformation, is then normalized across a group of channels C. This implementation allows us to utilize existing 3-dimensional group normalisation layers. After the group normalization, for consistency, we reduce our feature vector back to 4 dimensions by stacking the feature fields of each channel.
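A minimal sketch of this naïve equivariant group normalisation, written over nested Python lists purely for illustration (the learnable affine parameters and the reshape back to 4 dimensions are omitted, and the function name is our own):

```python
import math

def equivariant_group_norm(x, eps=1e-5):
    """Naive equivariant group norm on a (N, Y, C, H, W) nested-list tensor,
    where Y indexes the representation fields split out of the channel axis.
    Each (sample, field) slice is normalized over its C*H*W entries, so all
    channels of one field share statistics and equivariance is preserved."""
    out = []
    for sample in x:
        fields = []
        for field in sample:  # field has shape (C, H, W)
            vals = [v for ch in field for row in ch for v in row]
            mean = sum(vals) / len(vals)
            var = sum((v - mean) ** 2 for v in vals) / len(vals)
            inv = 1.0 / math.sqrt(var + eps)
            fields.append([[[(v - mean) * inv for v in row]
                            for row in ch] for ch in field])
        out.append(fields)
    return out
```

Because the statistics are pooled over whole fields rather than single channels, permuting the channels within a field (as a group action on a regular representation does) leaves the computed mean and variance unchanged.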
The aforementioned layer is, however, unsuitable for use with continuous groups, which require irreducible representations to function. To thus satisfy the equivariance property for the irreducible representations of the SO(2) group, we introduce an i.i.d. instance normalisation layer. To normalize the feature fields, we require an estimate of their expected values and variance when transforming them by a group action. In the following, we derive how to calculate the two values. For the former, we compute the expectation over the whole group G that is acting on our features. This expectation can be written by using a normalized Haar measure μ over G,
𝔼_{g∈G}[ψ(g)] = ∫_G dμ(g) ψ(g) (5)
with the irreducible representation ψ of G [28]. Due to the orthogonality of ψ(g) ∀ g ∈ G, the integral is always zero for all non-trivial irreducible representations. As a consequence, the mean of any vector transforming according to those representations is also zero. It follows that the expected value of ρ(g) is a null matrix with non-zero values on the diagonal if and only if the corresponding ψ_i is a trivial representation. In practice, we can thus pre-compute 𝔼_{g∈G}[ψ(g)] when initializing our equivariant layers to calculate the estimated mean based on the trivial representations. During training, we then simply subtract it from our calculated feature fields.
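That the Haar integral vanishes for non-trivial irreducibles can be checked numerically by discretizing Eq. 5 over a uniform angle grid for SO(2), whose frequency-m irreps are 2×2 rotation blocks (illustrative sketch; `irrep_avg` is our own helper):

```python
import math

def irrep_avg(freq, samples=360):
    """Average the frequency-`freq` irrep of SO(2) over a uniform angle grid,
    a discretization of the Haar integral in Eq. 5. Returns the averaged
    2x2 block flattened as (a, b, c, d); freq = 0 is the trivial irrep."""
    a = b = c = d = 0.0
    for i in range(samples):
        t = 2.0 * math.pi * i / samples * freq
        a += math.cos(t)
        b += -math.sin(t)
        c += math.sin(t)
        d += math.cos(t)
    return (a / samples, b / samples, c / samples, d / samples)
```

Every non-zero frequency averages to the null matrix, while the trivial (frequency-0) representation averages to the identity, matching the statement above.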
To derive the expected variance, we substitute the transformation in Eq. 3 with our representation ρ(g) to obtain the feature field y = ρ(g)φ(x) of an input x. The covariance of y can then be calculated with
𝔼_{y∈Y}[y yᵀ] = 𝔼_{g∈G}[ρ(g) φ(x) φ(x)ᵀ ρ(g)ᵀ]. (6)
Due to the fact that our representations ρ(g) are orthonormal and irreducible, the covariance matrix must be an orthogonal matrix. In addition, the covariance matrix is symmetric and positive semi-definite, making it a multiple of the identity matrix, i.e. 𝔼_{y∈Y}[y yᵀ] = λI. We can then compute this multiple independently of our representation simply as
dλ = Tr(𝔼_{y∈Y}[y yᵀ]) = 𝔼_{y∈Y}[‖y‖₂²]. (7)
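Eq. 7 can be verified numerically for the frequency-1 irrep of SO(2): with d = 2 it predicts λ = ‖y‖²/2, and the Haar average of Eq. 6 should be λI (sketch with our own helper name, discretizing the group integral on an angle grid):

```python
import math

def avg_covariance(v, samples=720):
    """Haar-average R(g) v v^T R(g)^T over SO(2) for a fixed 2-D feature
    vector v, discretizing the integral over a uniform angle grid."""
    m = [[0.0, 0.0], [0.0, 0.0]]
    for i in range(samples):
        t = 2.0 * math.pi * i / samples
        c, s = math.cos(t), math.sin(t)
        y = [c * v[0] - s * v[1], s * v[0] + c * v[1]]  # y = R(t) v
        for r in range(2):
            for q in range(2):
                m[r][q] += y[r] * y[q] / samples
    return m
```

For v = (3, 4) the result is (numerically) 12.5 · I, i.e. λ = ‖v‖²/d = 25/2, exactly as Eq. 7 prescribes.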
Finally, we also apply a learnable weight and bias parameter as with most normalization layers.
For trivial and regular representations, standard activation functions, such as Mish and ReLU [29, 30], can be used as non-linearities. When working with the SO(2) group, we use an Inverse Fourier Transform on the feature space to then apply the non-linearity in the group domain. Afterwards, we compute the Fourier Transform to again obtain the coefficients of the irreducible representations.
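This Fourier-domain nonlinearity can be sketched for a single band-limited scalar feature on SO(2). The sketch below is a simplified stand-in for the steerable implementation; the coefficient layout and helper name are our own assumptions.

```python
import math

def fourier_nonlinearity(coeffs, samples=16):
    """Apply a pointwise ReLU to an SO(2) feature in the group domain:
    synthesize the band-limited signal on an angle grid (inverse Fourier
    transform), apply ReLU, and project back onto the same frequencies.
    coeffs = [a0, a1, b1, a2, b2, ...] for
    f(t) = a0 + sum_k a_k cos(kt) + b_k sin(kt)."""
    n_freq = (len(coeffs) - 1) // 2
    grid = [2.0 * math.pi * i / samples for i in range(samples)]
    # inverse transform: evaluate the signal on the grid, then ReLU
    f = []
    for t in grid:
        v = coeffs[0]
        for k in range(1, n_freq + 1):
            v += coeffs[2 * k - 1] * math.cos(k * t)
            v += coeffs[2 * k] * math.sin(k * t)
        f.append(max(0.0, v))
    # forward transform: recover coefficients on the same frequencies
    out = [sum(f) / samples]
    for k in range(1, n_freq + 1):
        out.append(2.0 * sum(f[i] * math.cos(k * grid[i])
                             for i in range(samples)) / samples)
        out.append(2.0 * sum(f[i] * math.sin(k * grid[i])
                             for i in range(samples)) / samples)
    return out
```

When the synthesized signal is everywhere positive, the ReLU is the identity and the round trip returns the input coefficients; only where the signal dips below zero does the nonlinearity change (and spread) the spectrum.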
4 Results
4.1 Equivariant Models are Sparse by Design
4.1.1 Fewer Model Parameters
In previous work by Klause et al. [31], the ResNet-9 architecture was shown to be performant with DP-SGD, despite its small size. We are interested in how this efficiency can be further increased and thus use it as our baseline. We find that the equivariance property of the Equivariant-ResNet-9 allows us to further increase the validation accuracy on CIFAR-10 by reducing the number of convolution channels per layer. In fact, Fig. 2 indicates that there is an “optimal” model layout of (16, 32, 64) channels, denoting the filters per layer in the three residual blocks. For larger layouts, the performance starts decreasing again. This leads to an optimal total of ≈250k parameters, i.e. ten times fewer than the original (conventional) CNN. In comparison, the validation accuracy of the standard ResNet-9 does not substantially increase when the model size is reduced. This is likely due to the fewer parameters not being able to learn features effectively. The equivariant network, on the other hand, does not have to learn redundant features for different orientations in separate channels due to the additional pose information in our feature space. The experiments indicate that this allows the network to still detect a sufficient amount of information in the image for prediction, even when reducing the parameters in the network. How much of this additional information can be learned per parameter, also called parameter utilization [19], depends on the chosen symmetry group. While the equivariance property is better satisfied for the symmetry groups SO(2) and O(2), the expressiveness of the trained kernel can suffer due to the corresponding tighter kernel constraints as introduced in Sec. 2.
Our ablation studies in Sec. 5.1 show that the dihedral group D_4 offers a "sweet spot" between accuracy and computational cost, and it is thus used throughout this section. To summarize, we found that the equivariance property allows us to reduce the width of our network while maintaining or even increasing the classification accuracy with DP-SGD by increasing the parameter utilization in the network.
Figure 2: In comparison to the non-equivariant ResNet-9, our Equivariant-ResNet-9 has a sweet spot on CIFAR-10 at an intermediate model size of 16, 32, and 64 channels per layer for the three residual blocks under (8, 10⁻⁵)-DP.
4.1.2 More Parameters Contribute to Model Prediction
In addition to an overall smaller model size, we also evaluate how the sparse design of the architecture affects the gradients and parameters, and correspondingly their impact on training and prediction. Due to the increased parameter sharing, we expect the smaller model to use more of its parameters when making predictions. To analyze this, we utilize the ℓ₀^ε "norm", i.e. the number of parameter or gradient entries with a magnitude smaller than ε [32]; note that this ε differs from the one used to describe DP guarantees. Fig. 3 shows that the percentage of parameters with absolute values ≈0 during training is substantially lower for the equivariant network than for the standard ResNet-9. Thus, the equivariant network has fewer redundant parameters that do not actually contribute to the network's prediction. This is further visible when we compare the Equivariant-ResNet-9 to an Equivariant-WRN-40: even though both networks are equivariant, the larger WRN-40 still has more unused parameters and subsequently is not able to improve on the smaller Equivariant-ResNet-9 (Eq.-ResNet-9: 16k parameters; Eq.-WRN-40: 45k parameters).
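The ℓ₀^ε "norm" residual used in this analysis is straightforward to compute. As a minimal sketch (the function name and threshold value are our own, chosen for illustration):

```python
import numpy as np

def l0_eps_sparsity(values, eps=1e-3):
    """Fraction of entries with magnitude below eps: higher means more
    near-zero, effectively unused parameter or gradient entries.

    Note: this eps is a magnitude threshold, unrelated to the DP epsilon.
    """
    values = np.asarray(values)
    return float(np.mean(np.abs(values) < eps))

# Example: a vector where half the entries are (near) zero.
v = np.array([0.0, 1e-4, 0.5, -2.0])
assert l0_eps_sparsity(v, eps=1e-3) == 0.5
```

Applied to a flattened parameter or gradient vector at each logging step, this yields the dashed and continuous curves of Fig. 3, respectively.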
4.1.3 Sparser Gradient Vectors
In addition to the network's weights, we also investigate how information is transmitted through the model. If the features learned in the network show increased representational efficiency, they should extract useful information from the available data faster and thus require less time to converge to an optimum. Analyzing the sparsity of the gradient vector in Fig. 3, we see the ECNN's gradient converge faster during training (i.e. more close-to-zero entries), updating fewer weights. This can indicate that the model has incorporated most of the relevant information and has approached a minimum. Moreover, gradient sparsity increasing more quickly during training also indicates more efficient information transfer at an early stage. This observation cannot be made for the conventional ResNet-9, implying less efficient training: the corresponding gradient sparsity increases more slowly, and the model keeps updating more parameters throughout the training regime. We consider additional research on the role of sparse gradient vectors in DP, as done e.g. in [33] and [34], a promising direction for future work. Our current results show that ECNNs with increased designed sparsity have more parameters contributing to predictions and converge faster, with a smaller percentage of parameters to update during training.
Figure 3: The Equivariant-ResNet-9 converges to a better optimum during training, indicated by a higher gradient ℓ₀^ε "norm" (continuous lines), while also having a substantially lower percentage of parameters with values < ε (dashed lines).
4.2 Improved Performance on Image Classification with Equivariant Convolutional Networks
To validate that our notion of designed sparsity can in fact help improve performance with DP-SGD, we compare the equivariant network to current SOTA models on multiple common benchmark classification datasets when training from scratch under varying privacy parameters. The GPU hours (h) are measured on an NVIDIA A100 40GB.
Table 1: CIFAR-10 and CIFAR-100 test accuracies of our Equivariant-ResNet-9 with symmetry group D_4 trained from scratch, compared to the previous state of the art under (ε, 10⁻⁵)-DP. We report the median and standard deviation calculated across 5 independent runs.
4.2.1 CIFAR-10
We begin with experiments on CIFAR-10, which is still considered a challenging dataset for training with DP-SGD. The previous works' results were obtained by splitting the training set into 45k train and 5k validation samples. Equivalently to these works, our stated test accuracy is achieved by training our model on the full training set and evaluating it once on the held-out test set. Our equivariant models are benchmarked against the current SOTA models on CIFAR-10 by [5]. For a fair comparison of prediction results and computation time, we reproduce the previous SOTA results on our hardware with the exact same setup and code provided by the authors (https://github.com/deepmind/jax_privacy). The reproduced results differ in test accuracy by ⪅3% from the results of the original paper, as similarly observed by [6]. The exact hyperparameters and implementation details are summarized with further ablation studies in Sec. A.1 and Sec. A.2.
Table 1 shows that our equivariant model with the D_4 group substantially outperforms the current SOTA on CIFAR-10 by up to 9.2% at ε = 2. Most notably, this result is achieved substantially faster, with a decrease in computation time of ≈85% (35 h), in large part due to the reduced number of augmentations required. Moreover, the corresponding model consists of only 256k parameters and is thus 35 times smaller than the previous SOTA model (8.7M parameters fewer than the WRN-40-4). This superior performance of the Equivariant-ResNet-9 is consistent across all evaluated ε-values, and is even more pronounced under a tighter privacy budget, as seen in Tab. 6. This shows that a sparser model architecture, with a lower-dimensional gradient and thus less added noise, is particularly beneficial for privacy-preserving applications.
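The dimension argument can be made concrete: DP-SGD clips each per-sample gradient to a fixed norm C and adds isotropic Gaussian noise, whose L2 norm concentrates around σC√d for d parameters. A small numerical sketch (parameter counts approximate those of the two models above; σ and C are illustrative, not the paper's values):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, clip = 1.0, 1.0  # illustrative noise multiplier and clipping norm C

def noise_norm(d):
    """One draw of the L2 norm of the DP-SGD noise N(0, (sigma*C)^2 I_d);
    for large d it concentrates tightly around sigma * clip * sqrt(d)."""
    return float(np.linalg.norm(rng.normal(0.0, sigma * clip, size=d)))

# Per-sample gradients are clipped to norm <= C, so the signal magnitude
# is bounded while the noise norm grows with the parameter count:
small = noise_norm(256_000)    # ~ sqrt(256000)  ≈ 506
large = noise_norm(8_900_000)  # ~ sqrt(8.9e6)   ≈ 2983
```

With the same clipping budget for the signal, the 256k-parameter model thus faces roughly a six-fold smaller noise-to-signal ratio than the 8.9M-parameter WRN-40-4.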
4.2.2 CIFAR-100
The CIFAR-100 dataset is particularly interesting, as it has 10 times fewer images per class than CIFAR-10. This allows us to evaluate whether the ability of ECNNs to learn better from less data also holds under DP. Table 1 confirms that this is indeed the case. We train both models, the Equivariant-ResNet-9 and the WRN-40-4, under the same setup as for CIFAR-10. This time, however, our equivariant network is not only substantially better at lower ε-values, but also outperforms the WRN-40-4 by more than 7% at ε = 8. In addition, the computational performance remains superior, with our equivariant network reducing computation time by ≈88% (67.2 h). This performance can be attributed to the increased feature efficiency of ECNNs (see Sec. 4.1.3), which allows them to detect more relevant input features than non-equivariant models. In addition to the benefits provided by their sparse model architecture (as seen at low ε-values for CIFAR-10), the feature efficiency of ECNNs is a key characteristic making them useful for training with DP-SGD, as privacy concerns mandate using as little data as possible. To summarize, our ECNNs achieve SOTA performance on CIFAR-10 and CIFAR-100 with substantially smaller models and in a fraction of the computation time, and are especially dominant under a tighter privacy budget.
Table 2: Top-1 test accuracies on Tiny-ImageNet-200 and ImageNette for the Equivariant-ResNet-9 and a PyTorch reproduction of the WRN-40-4 from De et al. [5], compared to the previous SOTA from [31], under (ε, 8·10⁻⁷)-DP.
4.2.3 Tiny-ImageNet-200
Large-scale image classification with DP has recently come into focus, and promising initial results have lately been demonstrated on the ImageNet dataset [35, 5, 36]. The key drawback of the aforementioned works is that DP training on ImageNet is exceedingly costly in terms of the computational budget required; efficient yet accurate approaches are thus of great interest. Even though our equivariant networks cannot fully solve this issue, the previous experiments indicate that they offer a first step towards achieving results similar to or better than previous SOTA approaches while reducing computation time. The Tiny-ImageNet-200 dataset is particularly interesting, as it has fewer samples per class than the larger ImageNet-1k. In addition to comparing to the previous SOTA model [31], we also reproduce the approach from De et al. with the WRN-40-4 from [5]. For comparison, we use the substantially smaller Equivariant-ResNet-9 with the D_4 group and train both models for 12000 update steps. We also adapt the learning rate to 4, due to the larger batch size. For this experiment, we use the official validation set as the unseen test set. Table 2 shows that, without further adaptations or hyperparameter tuning, the Equivariant-ResNet-9 achieves a Top-1 test accuracy of 34.1%, beating our baseline by 6.3% and improving on the previous SOTA by [31] by more than 14.7% under a tighter privacy budget. This is achieved while reducing the computation time by ≈77% (540 h). We additionally construct an equivariant version of the WRN-40 model, the architecture previously used for SOTA results [5].
As equivariant convolutions can be used in arbitrary network architectures, we only adapt the width of the equivariant model to 1 instead of 4, equivalent to the results in Sec. 4.1 for the ResNet-9. This model is also able to outperform the previous SOTA result and the baseline model, but does not reach the performance of the smaller Equivariant-ResNet-9 (Eq.-WRN-40: 28.1%). Looking at the ℓ₀^ε "norm", we can see that both models use almost all of their parameters (Eq.-ResNet-9: 2.5k; Eq.-WRN-40: 7k). Simply increasing the number of parameters in the equivariant models thus does not improve training proportionally. Instead, to boost performance with DP-SGD, future research could explore other designs that use their additional parameters efficiently. Such architectures with even greater designed sparsity may offer a promising approach for surpassing the performance of the proposed Equivariant-ResNet-9 model.
4.2.4 ImageNette
The preceding subsection demonstrated that ECNNs, despite their compact model size, possess the capacity to learn complex features. To assess their effectiveness as the image size increases, we conduct further experiments on an additional ImageNet subset, ImageNette, featuring images with a dimension of 160 by 160 pixels. As before, we compare the Equivariant-ResNet-9 to the previous SOTA model by [31] and a custom baseline that reproduces the WRN-40-4 setup from [5]. In accordance with previous results, our equivariant network outperforms the previous SOTA and our baseline in Tab. 2 by more than 5%. In addition, the Equivariant-ResNet-9 takes less than half the computation time of the baseline approach (a reduction of 26 h). Our proposed Equivariant-ResNet-9 thus outperforms all previous approaches independent of image size and number of images per class. The accuracy improvement is particularly substantial under a tighter privacy budget, making the notion of designed sparsity a promising direction for further research on solving the privacy-utility trade-off.
Table 3: Compared to the standard ResNet-9, the Equivariant-ResNet-9 exhibits better model calibration, shown by a lower Brier score averaged across the test set under (8, 10⁻⁵)-DP.
4.3 Privacy-Calibration Trade-Off
While privacy guarantees are important in sensitive domains, other model characteristics should not be neglected in fields where trust in a model's predictions is essential. In particular, it has been shown that DP-SGD has a negative effect on a model's uncertainty calibration [37, 38]: a model's overconfidence in its predictions when trained with DP-SGD can give a false sense of accuracy and prevent a reliable estimate of potential errors in the results. We therefore examine, by way of example, whether our proposed techniques provide improved calibration in addition to higher accuracy by evaluating the Brier score [39]. Indeed, on the CIFAR-10 test set, we found that the Equivariant-ResNet-9 had a ≈27% lower Brier score than its non-equivariant counterpart (Tab. 3). This result is consistent across the other evaluated datasets, with an average reduction of 17%. The prediction improvements of the ECNN thus do not further increase the overconfidence arising from DP-SGD; instead, the results indicate superior model calibration compared to standard convolutional models. While an in-depth investigation of the calibration of equivariant models is outside the scope of our current study, we consider this finding encouraging and intend to expand upon it in future work.
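The Brier score [39] can be computed as follows; this is the standard multi-class formulation (the paper does not spell out its exact variant, so take this as an assumed implementation):

```python
import numpy as np

def brier_score(probs, labels):
    """Multi-class Brier score: mean squared error between the predicted
    probability vector and the one-hot encoding of the true label
    (lower is better calibrated)."""
    probs = np.asarray(probs, dtype=float)
    onehot = np.eye(probs.shape[1])[labels]  # one-hot encode the labels
    return float(np.mean(np.sum((probs - onehot) ** 2, axis=1)))

# A confidently correct prediction scores near 0; a confidently wrong
# one approaches 2, the worst case for one-hot targets.
good = brier_score([[0.9, 0.1]], [0])  # -> 0.02
bad = brier_score([[0.1, 0.9]], [0])   # -> 1.62
```

Unlike accuracy, the score penalizes overconfident errors quadratically, which is why it is a common proxy for calibration.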
5 Ablation Studies
5.1 Training Adaptations for Improved Performance of Sparse Models under DP
To analyze the performance advantage of ECNNs in more detail, we evaluate the impact of different hyperparameter choices in this section. The results demonstrate key benefits of ECNNs compared to the conventional CNNs used for DP training in previous works.
Figure 4: The Equivariant-ResNet-9 benefits from increased batch sizes similarly to the non-equivariant WRN-16-4 from De et al. [5], but has a substantially better validation accuracy across the board under (8, 10⁻⁵)-DP.
5.1.1 Improved Accuracy Across All Batch Sizes
Previous work has generally shown that (very) large batch sizes lead to substantial improvements in accuracy for DP training [7, 36]. We thus investigate whether the superior performance of ECNNs is maintained across batch sizes, or whether non-equivariant training can match our performance through batch size tuning. We find that the latter is not the case. In fact, our ECNNs consistently outperformed the non-equivariant SOTA models, whereby larger batch sizes generally led to increased accuracy, similar to the non-equivariant CNNs. Interestingly, the accuracy gains from increasing the batch size were "steeper" for ECNNs than for the baseline, as indicated by the slope of the curve between batch sizes 256 and 2048 in Fig. 4, which is ≈15% steeper for the equivariant network. We attribute this finding to the robustness of the features learned by equivariant kernels, which is synergistic with large batch sizes in making the gradient resilient to clipping and noise addition. In summary, the equivariant architecture outperformed the non-equivariant model across all batch sizes, with an average increase in validation accuracy of ≈5%.
Figure 5: The advantage of the equivariant network is maintained when adding augmentation multiplicities up to a value of 16. Further increasing the number of augmentations only increases computation time while actually decreasing performance.
| Model | Group | ε | Test Acc. Median [%] | Std. Dev. | Parameters | GPU Hours |
|---|---|---|---|---|---|---|
| Dörmann et al. (2021) | {e} | 1.93 | 58.6 | (0.38) | – | – |
| De et al. (2022) | {e} | 2 | 65.9 | (0.5) | 8.9M | – |
| De et al. (reproduction) | {e} | 2 | 62.6 | (0.62) | 8.9M | 42.27 |
| Klause et al. (2022) | {e} | 2.89 | 65.6 | – | 2.4M | – |
| Tramèr and Boneh (2021) | {e} | 3 | 69.3 | (0.2) | 187k | – |
| Equivariant-ResNet-9 (ours) | C_4 | 2 | 69.57 | (0.48) | 258k | 4.5 |
| | C_8 | 2 | 68.97 | (0.19) | 256k | 5.8 |
| | C_16 | 2 | 66.31 | (0.46) | 244k | 8.9 |
| | D_4 | 2 | 71.86 | (0.71) | 256k | 5.8 |
| | D_8 | 2 | 69.04 | (0.30) | 244k | 8.7 |
| | D_16 | 2 | 67.68 | (0.39) | 238k | 14.8 |
| | SO(2) | 2 | 45.65 | (1.14) | 309k | 6.7 |
Table 4: CIFAR-10 test accuracy of our Equivariant-ResNet-9 with different symmetry groups trained from scratch, compared to the previous state of the art without equivariance (i.e. the trivial group {e}). We report the median and the standard deviation calculated across 5 independent runs. The GPU hours are measured on an NVIDIA A100 40GB. The highest accuracy is observed using the Equivariant-ResNet-9 with the D_4 group, at only 256k parameters and 5.8 GPU hours of computation.
5.1.2 Fewer Augmentations Required
One of the main ingredients of the technique proposed by [5] is augmentation multiplicity, i.e. performing multiple augmentations per sample and averaging the resulting gradients before privatization. In this section, we address the question of whether ECNNs can supplant this technique (and thus drastically reduce the required computation time, which scales with the number of simultaneous augmentations). After all, equivariant convolutions learn information decoupled from a feature's pose and thus reduce the need for augmentations, especially rotations and reflections, leading to an immediate reduction in the heavy computational burden of augmentation multiplicity. Fig. 5 shows that the equivariant network still benefits from augmentations to some extent, even though it already incorporates pose information in its features. This corroborates the finding by Weiler et al. [24] in the non-DP setting, who show that augmentations can further improve the prediction performance of equivariant networks for dihedral groups. Compared to the current state of the art on CIFAR-10, however, the equivariant models reach close-to-optimal performance with an augmentation multiplicity of only 4, whereas the SOTA model requires 8 times as many simultaneous augmentations to achieve the same accuracy. In fact, our results at 4 augmentations outperform the 32 augmentation multiplicities of [5] while drastically reducing the computation time by the same factor of 8. Corroborating the theoretical advantages of ECNNs, diminishing returns set in earlier than for conventional CNNs.
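The order of operations is what keeps augmentation multiplicity privacy-neutral: each sample's per-augmentation gradients are averaged first, and clipping and noising are applied to the per-sample average, so the sensitivity (and the DP accounting) is independent of the multiplicity K. A simplified NumPy sketch of one privatized batch gradient (function name and array shapes are our own):

```python
import numpy as np

rng = np.random.default_rng(0)

def private_grad_with_multiplicity(per_aug_grads, clip=1.0, sigma=1.0):
    """DP-SGD batch gradient with augmentation multiplicity.

    per_aug_grads: array of shape (batch, K, d) holding, for each of the
    `batch` samples, the gradients of its K augmented copies.
    """
    g = per_aug_grads.mean(axis=1)  # (batch, d): average each sample's augmentations
    norms = np.linalg.norm(g, axis=1, keepdims=True)
    g = g * np.minimum(1.0, clip / np.maximum(norms, 1e-12))  # per-sample clipping
    noise = rng.normal(0.0, sigma * clip, size=g.shape[1])    # single noise draw
    return g.sum(axis=0) + noise
```

For example, with `sigma=0` and a single sample whose averaged gradient is (3, 4), the output is the clipped vector (0.6, 0.8). Computation time, however, still scales linearly with K, since every augmentation requires its own forward and backward pass.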
5.2 Hyperparameter Choices for Equivariant Layer
As mentioned above, models utilizing the D_4 group exhibited the highest accuracy and were used for the experimental results above. To demonstrate how we arrived at this choice and how the chosen symmetry group influences the accuracy of the model, we evaluated our equivariant models with commonly used symmetry groups from the literature [24] under (ε, δ) = (2, 10⁻⁵)-DP. As introduced in Sec. 2, symmetry groups are mathematical groups that capture the symmetries or transformations applied to an input. The symmetry group C_8, for example, describes the group of planar rotations by multiples of 45°; the continuous group SO(2) extends this to all rotations in 2-dimensional space. We use irreducible representations and an angular frequency of 1 to describe the SO(2) group convolutions; for an in-depth analysis of other frequencies of the SO(2) group, we refer to Sec. A.2.2. Based on the results from Sec. 5.1, we reduce the model layout from (64, 128, 256) channels to (16, 32, 64). To maintain comparability, the number of channels per layer, and thus the total number of parameters, is fixed according to our approach described in Sec. 3. The results in Tab. 4 show that we achieve our best median test accuracy on CIFAR-10 of 71.86% with the dihedral group D_4.
The dihedral groups perform slightly better on average than the cyclic groups, probably due to the intrinsic horizontal symmetry of the images in the dataset, which is better captured by their ability to represent reflections. This is corroborated by the fact that increasing the rotation order N did not improve prediction performance. As the network uses kernels of size 3×3, this phenomenon could also be attributed to the difficulty of discretizing small rotations at this kernel size without losing information. We observe a similar effect for the continuous rotations in SO(2), which perform substantially worse than all other groups. This is in line with the non-private experiments conducted by [24], indicating that the kernel constraint is too restrictive and that the lack of expressiveness cannot be compensated by the more pronounced equivariance properties of the kernel. We consider combining our technique with larger receptive fields, e.g. through atrous (dilated) convolutions, a promising direction for future work. Notably, all discrete groups C_N and D_N were able to outperform previous SOTA models. The C_4 and D_4 groups in particular showed the best results, offering the best trade-off between accuracy and computation time of all candidates. We thus recommend favoring these groups over continuous groups in practice.
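The discrete groups compared above are small and easy to enumerate. As a hypothetical sketch (representation by 2×2 orthogonal matrices; helper name is ours), D_n consists of n rotations plus their reflected counterparts:

```python
import numpy as np

def dihedral_group(n):
    """The 2n elements of D_n as 2x2 orthogonal matrices: n planar
    rotations by multiples of 2*pi/n, plus each rotation composed with
    a reflection about the horizontal axis."""
    elems = []
    for k in range(n):
        a = 2 * np.pi * k / n
        rot = np.array([[np.cos(a), -np.sin(a)],
                        [np.sin(a),  np.cos(a)]])
        elems.append(rot)                         # pure rotation (the C_n subgroup)
        elems.append(rot @ np.diag([1.0, -1.0]))  # rotation composed with reflection
    return elems

# D_4, the group used for the main results, has 8 elements: rotations by
# multiples of 90 degrees and their 4 reflected counterparts.
assert len(dihedral_group(4)) == 8
```

This also illustrates why larger N increases computation time in Tab. 4: the kernel constraint and feature fields grow with the group order 2N (for D_N) or N (for C_N).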
6 Conclusion
The broad application of private deep learning has so far been impeded by privacy-utility trade-offs. Recent works have partially addressed these limitations and presented techniques to bridge the accuracy gap, but introduced a new trade-off between accuracy and efficiency. Ultimately, we contend that both trade-offs must be addressed to facilitate large-scale research in DP deep learning. The remarkable performance gains that Equivariant CNNs enable are an important step towards this goal. Their capability to outperform previous approaches in a low-data regime and under a tight privacy budget, as well as their improved calibration and capability to capture intrinsic image symmetries, renders them particularly interesting for DP-SGD.
With extensive benchmark experiments, we showed that sparse model designs are also promising for overcoming computational overhead concerns in DP. At a time when breakthroughs are usually achieved by solving engineering problems on large-scale systems, finding ways to render the foundations of deep learning itself more efficient is, in our opinion, a promising and sustainable direction for solving long-term challenges. The introduction of additional structural prior information, such as the presence of symmetries in images, tackles exactly such challenges. In addition, we regard the provision of formal guarantees of model behavior, which both equivariance and DP represent, as a solid foundation for systems that fulfill the notion of "trustworthy AI".
References
- [1]B.Balle, G.Cherubin, and J.Hayes, “Reconstructing Training Data with Informed Adversaries,” 2022 IEEE Symposium on Security and Privacy (SP), pp. 1138–1156, 2022.
- [2] N.Carlini, F.Tramèr, E.Wallace, M.Jagielski, A.Herbert-Voss, K.Lee, A.Roberts, T.B. Brown, D.X. Song, Ú.Erlingsson, A.Oprea, and C.Raffel, “Extracting Training Data from Large Language Models,” in 30th USENIX Security Symposium (USENIX Security 21).USENIX Association, Aug. 2021, pp. 2633–2650. [Online]. Available: https://www.usenix.org/conference/usenixsecurity21/presentation/carlini-extracting
- [3] C.Dwork and A.Roth, “The Algorithmic Foundations of Differential Privacy,” Found. Trends Theor. Comput. Sci., vol.9, pp. 211–407, 2014.
- [4] M.Abadi, A.Chu, I.Goodfellow, H.B. McMahan, I.Mironov, K.Talwar, and L.Zhang, “Deep Learning with Differential Privacy,” in Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, ser. CCS ’16.New York, NY, USA: Association for Computing Machinery, 2016, p. 308–318. [Online]. Available: https://doi.org/10.1145/2976749.2978318
- [5] S.De, L.Berrada, J.Hayes, S.L. Smith, and B.Balle, “Unlocking High-Accuracy Differentially Private Image Classification through Scale,” ArXiv, vol. abs/2204.13650, 2022.
- [6] T.Sander, P.Stock, and A.Sablayrolles, “TAN without a Burn: Scaling Laws of DP-SGD,” ArXiv, vol. abs/2210.03403, 2022.
- [7] F.Dörmann, O.Frisk, L.N. Andersen, and C.F. Pedersen, “Not All Noise is Accounted Equally: How Differentially Private Learning Benefits from Large Sampling Rates,” 2021 IEEE 31st International Workshop on Machine Learning for Signal Processing (MLSP), pp. 1–6, 2021.
- [8] E.Hoffer, T.Ben-Nun, I.Hubara, N.Giladi, T.Hoefler, and D.Soudry, “Augment Your Batch: Improving Generalization through Instance Repetition,” 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8126–8135, 2020.
- [9] S.Fort, A.Brock, R.Pascanu, S.De, and S.L. Smith, “Drawing Multiple Augmentation Samples Per Image During Training Efficiently Decreases Test Error,” ArXiv, vol. abs/2105.13343, 2021.
- [10] B.Polyak and A.B. Juditsky, “Acceleration of Stochastic Approximation by Averaging,” Siam Journal on Control and Optimization, vol.30, pp. 838–855, 1992.
- [11] A.Golatkar, A.Achille, Y.-X. Wang, A.Roth, M.Kearns, and S.Soatto, “Mixed Differential Privacy in Computer Vision,” 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8366–8376, 2022.
- [12] D.Yu, H.Zhang, W.Chen, and T.-Y. Liu, “Do Not Let Privacy Overbill Utility: Gradient Embedding Perturbation for Private Learning,” ArXiv, vol. abs/2102.12677, 2021.
- [13] F.Tramèr, G.Kamath, and N.Carlini, “Considerations for Differentially Private Learning with Large-Scale Public Pretraining,” ArXiv, vol. abs/2212.06470, 2022.
- [14] K.He, R.Girshick, and P.Dollar, “Rethinking ImageNet Pre-Training,” in IEEE/CVF International Conference on Computer Vision, 2019, pp. 4917–4926.
- [15] X.Mei, Z.Liu, P.M. Robson, B.Marinelli, M.Huang, A.Doshi, A.Jacobi, C.Cao, K.E. Link, T.Yang, Y.Wang, H.Greenspan, T.Deyer, Z.A. Fayad, and Y.Yang, “RadImageNet: An Open Radiologic Deep Learning Research Dataset for Effective Transfer Learning,” Radiology: Artificial Intelligence, vol.0, no.ja, p. e210315, 0. [Online]. Available: https://doi.org/10.1148/ryai.210315
- [16] T.Hoefler, D.Alistarh, T.Ben-Nun, N.Dryden, and A.Peste, “Sparsity in Deep Learning: Pruning and Growth for Efficient Inference and Training in Neural Networks,” J. Mach. Learn. Res., vol.22, pp. 241:1–241:124, 2021.
- [17] F.Tramèr and D.Boneh, “Differentially Private Learning Needs Better Features (or Much More Data),” in Proceedings of the International Conference on Learning Representations (ICLR), 2021.
- [18] T.Cohen and M.Welling, “Group Equivariant Convolutional Networks,” in Proceedings of The 33rd International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, M.F. Balcan and K.Q. Weinberger, Eds., vol.48.New York, New York, USA: PMLR, 20–22 Jun 2016, pp. 2990–2999. [Online]. Available: https://proceedings.mlr.press/v48/cohenc16.html
- [19] ——, “Steerable CNNs,” in International Conference on Learning Representations, 2017.
- [20] I.Mironov, “Rényi Differential Privacy,” 2017 IEEE 30th Computer Security Foundations Symposium (CSF), pp. 263–275, 2017.
- [21] I. Mironov, K. Talwar, and L. Zhang, “Rényi Differential Privacy of the Sampled Gaussian Mechanism,” ArXiv, vol. abs/1908.10530, 2019.
- [22] T. Cohen, M. Geiger, J. Köhler, and M. Welling, “Spherical CNNs,” ArXiv, vol. abs/1801.10130, 2018.
- [23] M. Weiler, F. A. Hamprecht, and M. Storath, “Learning Steerable Filters for Rotation Equivariant CNNs,” 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 849–858, 2018.
- [24] M. Weiler and G. Cesa, “General E(2)-Equivariant Steerable CNNs,” Advances in Neural Information Processing Systems, vol. 32, 2019.
- [25] G. Cesa, L. Lang, and M. Weiler, “A Program to Build E(N)-Equivariant Steerable CNNs,” in ICLR, 2022.
- [26] M. Geiger and T. Smidt, “e3nn: Euclidean Neural Networks,” 2022. [Online]. Available: https://arxiv.org/abs/2207.09453
- [27] S. Qiao, H. Wang, C. Liu, W. Shen, and A. L. Yuille, “Weight Standardization,” ArXiv, vol. abs/1903.10520, 2019.
- [28] G. Chirikjian, A. Kyatkin, and A. Buckingham, “Engineering Applications of Noncommutative Harmonic Analysis: With Emphasis on Rotation and Motion Groups,” ASME Appl. Mech. Rev., vol. 54, no. 6, pp. B97–B98, November 2001.
- [29] D. Misra, “Mish: A Self Regularized Non-Monotonic Activation Function,” in BMVC, 2020.
- [30] A. F. Agarap, “Deep Learning using Rectified Linear Units (ReLU),” ArXiv, vol. abs/1803.08375, 2018.
- [31] H. Klause, A. Ziller, D. Rueckert, K. Hammernik, and G. Kaissis, “Differentially Private Training of Residual Networks with Scale Normalisation,” in ICML Theory and Practice of Differential Privacy Workshop, 2022.
- [32] D. L. Donoho and M. Elad, “Maximal Sparsity Representation via l1 Minimization,” Proceedings of the National Academy of Sciences of the United States of America, vol. 100, no. 5, pp. 2197–2202, 2003.
- [33] J. Zhu and M. B. Blaschko, “Differentially Private SGD with Sparse Gradients,” ArXiv, vol. abs/2112.00845, 2021.
- [34] R. Ito, S. P. Liew, T. Takahashi, Y. Sasaki, and M. Onizuka, “Scaling Private Deep Learning with Low-Rank and Sparse Gradients,” ArXiv, vol. abs/2207.02699, 2022.
- [35] P. Chrabaszcz, I. Loshchilov, and F. Hutter, “A Downsampled Variant of ImageNet as an Alternative to the CIFAR Datasets,” ArXiv, vol. abs/1707.08819, 2017.
- [36] A. Kurakin, S. Chien, S. Song, R. Geambasu, A. Terzis, and A. Thakurta, “Toward Training at ImageNet Scale with Differential Privacy,” ArXiv, vol. abs/2201.12328, 2022.
- [37] M. Knolle, A. Ziller, D. Usynin, R. F. Braren, M. R. Makowski, D. Rueckert, and G. Kaissis, “Differentially Private Training of Neural Networks with Langevin Dynamics for Calibrated Predictive Uncertainty,” ArXiv, vol. abs/2107.04296, 2021.
- [38] H. Zhang, X. Li, P. Sen, S. Roukos, and T. Hashimoto, “A Closer Look at the Calibration of Differentially Private Learners,” ArXiv, vol. abs/2210.08248, 2022.
- [39] G. W. Brier, “Verification of Forecasts Expressed in Terms of Probability,” Monthly Weather Review, vol. 78, pp. 1–3, 1950.
Appendix A
A.1 Implementation Details
All experiments in this paper except for the reproduction of [5] are implemented in PyTorch. The privacy accounting is done with Opacus, and additional performance improvements are achieved through vectorization with functorch. The aforementioned reproduction is implemented with the code taken from the official repository of the original authors in JAX/Haiku. The hyperparameters used for our equivariant networks and the reproduction are summarized in Tab. 5. The equivariant models are trained for 2160 update steps with DP-SGD, using a clipping norm and learning rate of 2.0, an exponential moving average decay of 0.999, a batch size of 8192, and 4 augmentation multiplicities (original image + 3 augmentations). We use random reflections and cropping with a two-sided reflection padding of 4 pixels for our augmentations.
Table 5: Hyperparameters used in all of our equivariant experiments when measuring test-set accuracy for training from scratch.
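The per-example clipping and noising that DP-SGD applies at each update can be sketched in a few lines. The following is an illustrative NumPy sketch with a hypothetical `dp_sgd_step` helper, not the Opacus implementation used in our experiments; the default clipping norm of 2.0 mirrors the setting above, and gradients are assumed to be flattened to one vector per example:

```python
import numpy as np

def dp_sgd_step(per_example_grads, clip_norm=2.0, noise_multiplier=1.0, rng=None):
    """One DP-SGD aggregation step: clip each per-example gradient to an
    L2 norm of at most `clip_norm`, sum, add Gaussian noise, and average.
    `per_example_grads` has shape (batch_size, num_params)."""
    rng = np.random.default_rng() if rng is None else rng
    batch_size = per_example_grads.shape[0]
    norms = np.linalg.norm(per_example_grads, axis=1)
    # Scale factors <= 1, so every clipped gradient has norm <= clip_norm.
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = per_example_grads * scale[:, None]
    # Noise standard deviation is proportional to the clipping norm, which is
    # why larger models (longer gradient vectors) suffer stronger perturbation.
    noise = rng.normal(0.0, noise_multiplier * clip_norm,
                       size=per_example_grads.shape[1])
    return (clipped.sum(axis=0) + noise) / batch_size
```

Since each clipped gradient has norm at most the clipping norm, the averaged (noise-free) update is likewise bounded by it, which is the sensitivity bound the Gaussian mechanism relies on.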
A.2 Additional Experimental Results
A.2.1 Results Across Different ε-Values
For easier reproducibility, the exact results for different ε-values are given in Tab. 6. All experiments are run with the setting described in Sec. 4.2 and the hyperparameters summarized in Tab. 5. The standard deviation for our equivariant network is measured across 5 independent runs, while the reproduction was run 3 times.
Table 6: CIFAR-10 median test accuracies under different (ε, 10⁻⁵) values, measured across 5 independent runs for the Equivariant-ResNet-9 with symmetry group D₄ and 3 independent runs for the WRN-40-4 from [5].
A.2.2 Frequencies of the SO(2) Group
For the continuous SO(2) group, instead of the number of discrete rotations, we have to choose the angular frequency used to construct the kernel as an additional hyperparameter. This choice not only changes the expressiveness of the kernel but also the number of parameters, which scales with the angular frequency. As in earlier sections, we thus adapt the model width to obtain a comparable number of parameters. Tab. 7 shows that, when training under DP, the increased expressiveness slightly improves prediction performance, with a frequency of 3 performing best under (2, 10⁻⁵)-DP. However, conforming with the results in Sec. 5.2, the SO(2) group performs worse than the discrete groups independent of the chosen angular frequency. Our results thus suggest using the discrete C_N and D_N groups when training ECNNs with DP-SGD.
Table 7: CIFAR-10 test accuracies for different angular frequencies of the steerable kernel for the SO(2) group, using an adapted layout with a similar number of parameters.
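The parameter sharing behind the discrete rotation groups can be illustrated with a minimal C₄ lifting convolution: a single base filter is rotated to generate the whole filter bank, so four output channels share nine free parameters, and rotating the input rotates each response map while cyclically permuting the channels. This is an illustrative NumPy sketch of the group-convolution principle, not the steerable-CNN implementation used in our experiments:

```python
import numpy as np

def corr2d(image, kernel):
    """Valid-mode 2-D cross-correlation."""
    n, k = image.shape[0], kernel.shape[0]
    m = n - k + 1
    out = np.empty((m, m))
    for i in range(m):
        for j in range(m):
            out[i, j] = np.sum(image[i:i + k, j:j + k] * kernel)
    return out

rng = np.random.default_rng(0)
image = rng.standard_normal((8, 8))
base = rng.standard_normal((3, 3))

# C4 orbit of one base filter: 4 output channels from 9 free parameters.
bank = [np.rot90(base, r) for r in range(4)]
responses = [corr2d(image, w) for w in bank]

# Equivariance: rotating the input by 90 degrees rotates every response map
# and shifts the rotation channels by one step.
rotated = [corr2d(np.rot90(image), w) for w in bank]
for r in range(4):
    assert np.allclose(rotated[r], np.rot90(responses[(r - 1) % 4]))
```

The same construction extends to D_N by additionally reflecting the base filter, which is how equivariant designs spend their parameter budget more efficiently than unconstrained convolutions.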