Title: HypeBoy: Generative Self-Supervised Representation Learning on Hypergraphs

URL Source: https://arxiv.org/html/2404.00638

License: arXiv.org perpetual non-exclusive license
arXiv:2404.00638v1 [cs.LG] 31 Mar 2024

HypeBoy: Generative Self-Supervised Representation Learning on Hypergraphs

Sunwoo Kim², Shinhwan Kang², Fanchen Bu³, Soo Yong Lee², Jaemin Yoo³, Kijung Shin²,³
²Kim Jaechul Graduate School of AI, ³School of Electrical Engineering
Korea Advanced Institute of Science and Technology (KAIST)
{kswoo97, shinhwan.kang, boqvezen97, syleetolow, jaemin, kijungs}@kaist.ac.kr

Abstract

Hypergraphs are marked by complex topology, expressing higher-order interactions among multiple nodes with hyperedges, and better capturing the topology is essential for effective representation learning. Recent advances in generative self-supervised learning (SSL) suggest that hypergraph neural networks learned from generative self-supervision have the potential to effectively encode the complex hypergraph topology. Designing a generative SSL strategy for hypergraphs, however, is not straightforward. Questions remain with regard to its generative SSL task, connection to downstream tasks, and empirical properties of learned representations. In light of the promises and challenges, we propose a novel generative SSL strategy for hypergraphs. We first formulate a generative SSL task on hypergraphs, hyperedge filling, and highlight its theoretical connection to node classification. Based on the generative SSL task, we propose a hypergraph SSL method, HypeBoy. HypeBoy learns effective general-purpose hypergraph representations, outperforming 16 baseline methods across 11 benchmark datasets. Code and datasets are available at https://github.com/kswoo97/hypeboy.

1 Introduction

Many real-world interactions occur among multiple entities, such as online group discussions on social media, academic collaboration of researchers, and joint item purchases (Benson et al., 2018; Kim et al., 2023a; Lee et al., 2024). Representing such higher-order interactions with an ordinary graph can cause topological information loss (Dong et al., 2020; Yoon et al., 2020). Thus, hypergraphs have emerged as an effective tool for representing high-order interactions in various domains, including recommender systems (Xia et al., 2022; Yu et al., 2021), financial analysis (Sawhney et al., 2021; 2020), and drug analysis (Ruan et al., 2021; Saifuddin et al., 2023).

For representation learning on such complex topology, hypergraph neural networks (HNNs) have been developed. Training HNNs has primarily relied on task-related label supervision, such as external node labels. However, learning only from external supervision may limit HNNs' ability to capture more complex patterns in hypergraph topology. Hence, incorporating topology-related self-supervision can substantially improve HNNs' representation learning.

Particularly, generative self-supervised learning (SSL) holds promise for effective hypergraph representation learning. Generative SSL has recently shown remarkable success in encoding complex patterns in multiple domains. GPT (OpenAI, 2023) in natural language processing and Masked Autoencoder (He et al., 2022) in computer vision are notable examples. With generative self-supervision, HNNs may encode complex topology more effectively, leading to improved representation learning.

However, designing a generative SSL strategy for hypergraphs is not straightforward. First, questions remain unanswered for generative SSL tasks: (1.a) What should be the target generative SSL task for HNNs? (1.b) How does the generative SSL task relate to downstream tasks (e.g., node classification with external labels)? Second, even after determining the task, details of the SSL method (based on the task) remain unspecified. (2.a) What empirical properties should the method aim to satisfy? (2.b) How can the method achieve effective general-purpose hypergraph representations? Moreover, if not carefully designed, the generative SSL strategy can suffer from severe computational burden, as the number of potential hyperedges grows exponentially with the number of nodes.

In light of these promises and challenges, we systematically investigate and propose a hypergraph generative SSL strategy. Our contributions and the rest of the paper are organized as follows:

SSL Task: We formulate a generative SSL task, hyperedge filling, for hypergraph representation learning. Notably, we establish its theoretical connections to node classification (Section 3).

SSL Method: Based on the hyperedge filling task, we propose HypeBoy, a novel hypergraph SSL method. HypeBoy is designed to satisfy desirable properties of hypergraph SSL, mitigating (1) over-emphasized proximity, (2) dimensional collapse, and (3) non-uniformity/-alignment of learned representations (Section 4).

Experiments: We demonstrate that HypeBoy learns effective general-purpose hypergraph representations. It significantly outperforms SSL-based HNNs in both node classification and hyperedge prediction across 11 benchmark hypergraph datasets (Section 5; code and datasets are available at https://github.com/kswoo97/hypeboy).

2 Related Work

In this section, we review the literature on hypergraph neural networks and self-supervised learning.

Hypergraph neural networks (HNNs). HNNs learn hypergraph representations. Converting hyperedges into cliques (fully connected subgraphs) allows graph neural networks to be applied to hypergraphs  (Feng et al., 2019; Yadati et al., 2019). Such conversion, however, may result in topological information loss, since high-order interactions (hyperedges) are reduced to pair-wise interactions (edges). As such, most HNNs pass messages through hyperedges to encode hypergraphs. Some notable examples include HNHN (Dong et al., 2020) with hyperedge encoders, UniGNN (Huang & Yang, 2021) with generalized message passing functions for graphs and hypergraphs, AllSet (Chien et al., 2022) with set encoders, ED-HNN (Wang et al., 2023a) with permutation-equivariant diffusion operators, and PhenomNN (Wang et al., 2023b) with hypergraph-regularized energy functions.

Self-supervised learning (SSL). SSL strategies aim to learn representation from the input data itself, without relying on external labels. They can largely be categorized into contrastive or generative types. Contrastive SSL aims to maximize the agreement between data obtained from diverse views (Chen et al., 2020; Grill et al., 2020; You et al., 2020). Generative SSL, on the other hand, predicts or reconstructs parts of the input data. The success of generative SSL demonstrates its strengths in learning complex input data, in domains including natural language processing (Devlin et al., 2019; OpenAI, 2023) and computer vision (He et al., 2022; Tong et al., 2022). Recently, generative SSL for graphs has gained significant attention, with their main focuses on reconstructing edges (Tan et al., 2023; Li et al., 2023) or node features (Hou et al., 2022; 2023).

Self-supervised learning on hypergraphs. The interest in SSL for hypergraphs is on the rise. Early hypergraph SSL strategies mainly targeted specific downstream tasks, such as group recommendation (Zhang et al., 2021) and session-based recommendation (Xia et al., 2022). Recent ones aim to obtain general-purpose representations. TriCL (Lee & Shin, 2023a) utilizes a tri-directional contrastive loss, which consists of node-, hyperedge-, and membership-level contrast. Kim et al. (2023c) enhances the scalability of TriCL with a partitioning technique. HyperGCL (Wei et al., 2022) utilizes neural networks to generate views for contrast and empirically demonstrates its superiority in node classification, outperforming rule-based view generation methods. HyperGRL (Du et al., 2022) is a generative SSL method that shares a similar spirit with our approach; however, the underlying motivations and methodological designs are markedly distinct. We provide a detailed comparison in Appendix C.1.

3 Proposed Task and Theoretical Analysis

Figure 1: Overview of (a) the hyperedge filling task and (b) HypeBoy, our proposed SSL method based on the task. The goal of the task is to find the missing node for a given query subset (i.e., the other nodes in a hyperedge). HypeBoy trains HNNs to correctly predict the missing node.

In this section, after providing some preliminaries, we formulate hyperedge filling, our generative SSL task on hypergraphs. Then, we establish a theoretical connection between hyperedge filling and node classification, which is a widely-considered important downstream task.

Preliminaries. A hypergraph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ is defined by a node set $\mathcal{V}$ and a hyperedge set $\mathcal{E}$. Each hyperedge $e_j \in \mathcal{E}$ is a non-empty set of nodes (i.e., $\emptyset \neq e_j \subseteq \mathcal{V}$, $\forall e_j \in \mathcal{E}$). Each node $v_i \in \mathcal{V}$ is equipped with a feature vector $\mathbf{x}_i \in \mathbb{R}^d$, and $\mathbf{X} \in \mathbb{R}^{|\mathcal{V}| \times d}$ denotes the node feature matrix whose $i$-th row $\mathbf{X}_i$ corresponds to $\mathbf{x}_i$.

A hypergraph neural network (HNN) $f_\theta$ is a function that receives a node feature matrix $\mathbf{X}$ and a set of hyperedges $\mathcal{E}$ as inputs and returns node embeddings $\mathbf{Z} \in \mathbb{R}^{|\mathcal{V}| \times k}$ (i.e., $\mathbf{Z} = f_\theta(\mathbf{X}, \mathcal{E})$, where $\theta$ is a set of learnable parameters).

3.1 Proposed task: Hyperedge filling

We formulate hyperedge filling, a generative SSL task for hypergraph representation learning. We first define the task and discuss the superiority of the hyperedge filling task over alternatives. An illustration of the hyperedge filling task is provided in Figure 1(a).

Task definition. Given a set of nodes, hyperedge filling aims to predict the node that is likely to complete the set into a hyperedge. Specifically, for each hyperedge $e_j \in \mathcal{E}$, we divide it into a (missing) node $v_i \in e_j$ and a (query) subset $q_{ij} = e_j \setminus \{v_i\}$. The target of the task is to correctly fill the missing node $v_i$ for each given subset $q_{ij}$. This can be formalized as maximizing the probability of $v_i$ correctly completing $q_{ij}$, denoted by $p_{(\mathbf{X}, \mathcal{E}, \Theta)}(v_i \mid q_{ij})$, where $\Theta$ is the set of parameters we aim to optimize in this task. We further elaborate on our design of $p_{(\mathbf{X}, \mathcal{E}, \Theta)}(\cdot)$ in Section 4.3.

Advantage over alternatives. Potential alternatives include naive extensions of generative SSL tasks for ordinary graphs: (a) generating hyperedges from scratch and (b) classifying given sets of nodes into hyperedges and non-hyperedges. Compared to (a), by shifting the focus of prediction from the set level (the hyperedge itself) to the node level, the hyperedge filling task reduces the prediction space from a computationally prohibitive $O(2^{|\mathcal{V}|})$ to an affordable $O(|\mathcal{V}|)$. Compared to (b), the hyperedge filling task provides richer and more diversified generative SSL signals. Specifically, for each hyperedge $e_j$, our task offers $|e_j|$ distinct node-subset combinations that can serve as SSL signals. In contrast, classifying the mere existence of $e_j$ yields a singular and, thus, limited signal.
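The signal count above can be sketched in a few lines of Python (an illustrative enumeration, not the authors' implementation): each hyperedge $e_j$ yields $|e_j|$ (missing node, query subset) pairs, versus a single label per hyperedge for binary classification.

```python
# Illustrative sketch: enumerating hyperedge-filling SSL instances.
# A hyperedge of size s contributes s (missing node, query subset) pairs.

def filling_instances(hyperedges):
    """Yield (missing_node, query_subset) pairs for every hyperedge."""
    for e in hyperedges:
        for v in e:
            yield v, frozenset(e) - {v}

hyperedges = [{0, 1, 2}, {2, 3}]
pairs = list(filling_instances(hyperedges))
print(len(pairs))  # 3 + 2 = 5 instances from only 2 hyperedges
```

In contrast, task (b) would produce only two (hyperedge, label) pairs from the same input, which is the "singular and limited signal" noted above.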

3.2 Theoretical results on hyperedge filling

To demonstrate the effectiveness of hyperedge filling as a general SSL task, we present its theoretical connection to node classification. In essence, we demonstrate that node representations optimized for the hyperedge filling task can improve node classification accuracy.

3.2.1 Basic setting

First, we assume a data model of a hypergraph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ where (1) each node belongs to a single class, (2) the features of each node are generated from a Gaussian distribution, and (3) each hyperedge is generated according to a given hyperedge affinity parameter $\mathcal{P} \in [0, 1]$.

Assumption 1 (Node classes and features). Assume that there are $2N$ nodes and node classes $C_1$ and $C_0$ such that $C_1 \cup C_0 = \mathcal{V}$, $C_1 \cap C_0 = \emptyset$, and $|C_1| = |C_0| = N$. Each node feature vector $\mathbf{x}_i$ is independently generated from $\mathcal{N}(\mathbf{x}; \boldsymbol{\mu}_1, \boldsymbol{\Sigma})$ if $v_i \in C_1$, and from $\mathcal{N}(\mathbf{x}; \boldsymbol{\mu}_0, \boldsymbol{\Sigma})$ if $v_i \in C_0$. For simplicity, we assume $\boldsymbol{\mu}_1 = (0.5)_{i=1}^{d}$, $\boldsymbol{\mu}_0 = (-0.5)_{i=1}^{d}$, and $\boldsymbol{\Sigma} = \mathbf{I}$, where $\mathbf{I}$ is the $d$-by-$d$ identity matrix.

Assumption 2 (Hypergraph topology). Assume that the number of hyperedges and the size of each hyperedge are given, and there is no singleton hyperedge (i.e., $|e_j| \geq 2$, $\forall e_j \in \mathcal{E}$). Let $B$ denote the binomial distribution. Each hyperedge $e_j \in \mathcal{E}$ has a class membership $c_j \in \{0, 1\}$, where $c_j \sim B(1, 0.5)$. Given the number $|e_j|$ of nodes and the class $c_j \in \{0, 1\}$ of a hyperedge, the number of its members belonging to $C_1$ satisfies $|e_j \cap C_1| \sim B\left(|e_j|, \mathcal{P}^{c_j}(1 - \mathcal{P})^{1 - c_j}\right)$.

Note that the tendency of each hyperedge to contain nodes of the same class is symmetric about $\mathcal{P} = 0.5$ under the binary-class setting. Specifically, as $\mathcal{P}$ approaches $1$, each hyperedge $e_j$ is more likely to contain nodes of the same class (spec., node class $c_j$); as $\mathcal{P}$ approaches $0$, each hyperedge $e_j$ is again more likely to contain nodes of the same class (spec., node class $1 - c_j$).
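A minimal simulation of this data model might look as follows. This is my reading of Assumptions 1 and 2, with a fixed hyperedge size of 4 for simplicity (the model allows any sizes $\geq 2$); the function name and default values are illustrative, not from the paper.

```python
# Illustrative sketch of the synthetic data model in Assumptions 1-2:
# Gaussian features per class, and a Binomial draw for how many members
# of each hyperedge come from class C1, with rate P^c * (1 - P)^(1 - c).
import numpy as np

rng = np.random.default_rng(0)

def sample_hypergraph(N=50, d=8, num_edges=20, edge_size=4, P=0.9):
    C1 = list(range(N))            # first N nodes belong to class 1
    C0 = list(range(N, 2 * N))     # remaining N nodes belong to class 0
    mu1, mu0 = np.full(d, 0.5), np.full(d, -0.5)
    X = np.vstack([rng.normal(mu1, 1.0, size=(N, d)),    # features of C1
                   rng.normal(mu0, 1.0, size=(N, d))])   # features of C0
    edges = []
    for _ in range(num_edges):
        c = rng.binomial(1, 0.5)                  # hyperedge class c_j
        rate = P if c == 1 else 1.0 - P           # P^c * (1 - P)^(1 - c)
        k = rng.binomial(edge_size, rate)         # number of C1 members
        e = (list(rng.choice(C1, size=k, replace=False)) +
             list(rng.choice(C0, size=edge_size - k, replace=False)))
        edges.append(set(e))
    return X, edges

X, edges = sample_hypergraph()
print(X.shape, len(edges))  # (100, 8) and 20 hyperedges
```

With `P` near 1 (or near 0), most hyperedges are class-homogeneous, which is the homophilic regime discussed after Theorem 2.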

Second, we describe how node representations are updated via the hyperedge filling task. In this theoretical analysis, we define the updating process of node representations as follows:

(F1) The filling probability $p_{(\mathbf{X}, \mathcal{E}, \Theta)}(\cdot)$ is defined on each node-subset pair as:

$$p_{(\mathbf{X}, \mathcal{E}, \Theta)}(v_i \mid q_{ij}) := \frac{\exp\left(\mathbf{x}_i^T \left(\sum_{v_k \in q_{ij}} \mathbf{x}_k\right)\right)}{\sum_{v_t \in \mathcal{V}} \exp\left(\mathbf{x}_t^T \left(\sum_{v_k \in q_{ij}} \mathbf{x}_k\right)\right)}. \quad (1)$$

(F2) The node representation $\mathbf{z}_i$ is obtained via gradient descent with respect to $\mathbf{x}_i$ from $\mathcal{L}$, the negative log-likelihood of Eq. (1) (i.e., $\mathcal{L} = -\log p_{(\mathbf{X}, \mathcal{E}, \Theta)}(v_i \mid q_{ij})$). For ease of analysis, we assume $\mathbf{z}_i = \mathbf{x}_i - \gamma \nabla_{\mathbf{x}_i} \mathcal{L}$, where $\gamma \in \mathbb{R}^{+}$ is a fixed constant.

At last, we assume a Gaussian naive Bayes classifier $\mathcal{F}$ (Bishop, 2006), which is defined as:

$$\mathcal{F}(\mathbf{x}_i) = \arg\max_{k \in \{0, 1\}} f(\mathbf{x}_i; \boldsymbol{\mu}_k, \mathbf{I}), \quad \text{where } f \text{ is the probability density function of } \mathcal{N}(\mathbf{x}; \boldsymbol{\mu}_k, \mathbf{I}). \quad (2)$$
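The update in (F1) and (F2) can be made concrete with a small numeric sketch. This is an assumed implementation for illustration only, and it takes the gradient of the negative log-likelihood with the query-subset sum held fixed (which gives $\nabla_{\mathbf{x}_i} \mathcal{L} = -(1 - p_i)\, \mathbf{s}$ for $\mathbf{s} = \sum_{v_k \in q_{ij}} \mathbf{x}_k$).

```python
# Illustrative sketch of (F1)-(F2): softmax filling probability over
# inner products with the query-subset sum, then one gradient step.
import numpy as np

def fill_prob(X, query):
    """Eq. (1): p(v_t | q) = softmax_t( x_t^T sum_{k in q} x_k )."""
    s = X[list(query)].sum(axis=0)       # query-subset feature sum
    logits = X @ s
    logits = logits - logits.max()       # numerical stability
    p = np.exp(logits) / np.exp(logits).sum()
    return p, s

def refined_representation(X, i, query, gamma=0.1):
    """(F2): z_i = x_i - gamma * grad of -log p(v_i | q) w.r.t. x_i."""
    p, s = fill_prob(X, query)
    grad = -(1.0 - p[i]) * s             # gradient with s held fixed
    return X[i] - gamma * grad

X = np.array([[1.0, 0.5], [0.8, 0.4], [-1.0, -0.6]])
z0 = refined_representation(X, 0, {1})  # fill node 0 given query {1}
print(z0)
```

Here the step moves $\mathbf{z}_0$ toward the query-subset sum, which is the alignment effect the theory relies on.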

3.2.2 Hyperedge filling helps node classification

Our goal is to show that, for accurate classification of $v_i$, the representation $\mathbf{z}_i$ obtained for hyperedge filling as described in (F1) and (F2) is more effective than the original feature $\mathbf{x}_i$. First, we assume a node $v_i$ belonging to the class $C_1$ (i.e., $v_i \in C_1$), and we later generalize the result to $C_0$. The effectiveness of an original feature is defined as the expected accuracy of the classifier $\mathcal{F}$ with $\mathbf{x}_i$ (i.e., $\mathbb{E}_{\mathbf{x}}[\mathbf{1}_{\mathcal{F}(\mathbf{x}_i) = 1}] := P_{\mathbf{x}}\left(f(\mathbf{x}_i; \boldsymbol{\mu}_1, \mathbf{I}) > f(\mathbf{x}_i; \boldsymbol{\mu}_0, \mathbf{I})\right)$). Similarly, that of a derived representation is defined as $\mathbb{E}_{\mathbf{z}}[\mathbf{1}_{\mathcal{F}(\mathbf{z}_i) = 1}] = P_{\mathbf{z}}\left(f(\mathbf{z}_i; \boldsymbol{\mu}_1, \mathbf{I}) > f(\mathbf{z}_i; \boldsymbol{\mu}_0, \mathbf{I})\right)$. Below, we show that the effectiveness of a derived representation $\mathbf{z}_i$ is greater than that of an original feature $\mathbf{x}_i$ under a certain condition (Theorem 1).

Theorem 1 (Improvement in effectiveness). Assume a hyperedge $e_j$ s.t. $e_j \cap C_1 \neq \emptyset$ and node features $\mathbf{X}$ generated under Assumption 1. For any node $v_i \in e_j \cap C_1$, the following holds:

$$\left(\vec{\mathbf{1}}^{\,T} \sum_{v_k \in q_{ij}} \mathbf{x}_k > 0\right) \;\Rightarrow\; \mathbb{E}_{\mathbf{z}}[\mathbf{1}_{\mathcal{F}(\mathbf{z}_i) = 1}] > \mathbb{E}_{\mathbf{x}}[\mathbf{1}_{\mathcal{F}(\mathbf{x}_i) = 1}], \quad \text{where } \vec{\mathbf{1}} \text{ denotes } (1)_{k=1}^{d}. \quad (3)$$

Proof. A full proof is provided in Appendix A.1. ∎

Theorem 1 states that when the condition in the parentheses of Eq. (3) is met, the effectiveness of $\mathbf{z}_i$ is greater than that of $\mathbf{x}_i$. This result implies that node representations refined with the objective of the hyperedge filling task support more accurate node classification than the original node features.

While Theorem 1 demonstrates the usefulness of the hyperedge filling task for node classification, its validity relies on the condition in the parentheses of Eq. (3). We further analyze the probability that this condition is met by a stochastically generated hypergraph $\mathcal{G}$ under Assumptions 1 and 2 for a given $\mathcal{P}$.

Theorem 2 (Realization of condition). Assume node features $\mathbf{X}$ and a hyperedge $e_j$ s.t. (i) they are generated under Assumptions 1 and 2, respectively, and (ii) $e_j \cap C_1 \neq \emptyset$. For any $q_{ij}$ where $v_i \in e_j \cap C_1$, the following holds:

1. $P_{\mathbf{x}, e}\left(\vec{\mathbf{1}}^{\,T} \sum_{v_k \in q_{ij}} \mathbf{x}_k > 0 \,\middle|\, \mathcal{P}\right) \geq 0.5$, $\forall \mathcal{P} \in [0, 1]$.

2. $P_{\mathbf{x}, e}\left(\vec{\mathbf{1}}^{\,T} \sum_{v_k \in q_{ij}} \mathbf{x}_k > 0 \,\middle|\, \mathcal{P}\right)$ is a strictly decreasing function w.r.t. $\mathcal{P} \in [0, 0.5]$ and a strictly increasing function w.r.t. $\mathcal{P} \in [0.5, 1]$.

Proof. A full proof is provided in Appendix A.2. ∎

Theorem 2 states that the probability of the condition being satisfied is at least 0.5 for any affinity parameter $\mathcal{P} \in [0, 1]$. Moreover, the probability strictly increases w.r.t. $\mathcal{P} \in [0.5, 1]$ and strictly decreases w.r.t. $\mathcal{P} \in [0, 0.5]$. Thus, as the probability of each hyperedge containing nodes of the same class increases, so does the likelihood of the condition being met. Notably, many real-world group interactions exhibit homophilic traits (Laakasuo et al., 2020; Khanam et al., 2023). Therefore, the hyperedge filling task can improve node classification in many real-world scenarios, as evidenced by our theoretical findings together with these real-world characteristics.

Generalization to the class $C_0$. The above results generalize easily to the class $C_0$ due to symmetry. Specifically, for each node $v_i \in C_0$, the effectiveness (spec., the expected accuracy of the classifier $\mathcal{F}$) of a derived representation $\mathbf{z}_i$ is greater than that of the original feature $\mathbf{x}_i$ under an analogous condition. The probability of that condition holding strictly increases w.r.t. $\mathcal{P} \in [0.5, 1]$ and strictly decreases w.r.t. $\mathcal{P} \in [0, 0.5]$. We show this formally in Appendices A.1 and A.2.

4 Proposed Method for Hyperedge Filling

In this section, we present HypeBoy (Hypergraphs, build own hyperedges), a hypergraph generative SSL method based on the hyperedge filling task (Section 3). HypeBoy exhibits the following desirable properties of hypergraph generative SSL, as empirically demonstrated later:

Property 1. Does not over-emphasize proximity information (Veličković et al., 2019).

Property 2. Avoids dimensional collapse (Jing et al., 2022) in node representations.

Property 3. Learns node representations that are aligned and uniform (Wang & Isola, 2020).

Our proposed method, HypeBoy, is illustrated in Figure 1(b). After presenting each of its steps, we discuss its role in satisfying the above properties. Lastly, we introduce a two-stage training scheme for further enhancing HypeBoy.

Figure 2: Analysis of Property 2 (prevention of dimensional collapse) and Property 3 (representation uniformity and alignment) of HypeBoy. (a) Representation spectrum analysis: a sudden singular-value drop implies dimensional collapse. While HypeBoy without projection heads (red) suffers from dimensional collapse, HypeBoy (blue) does not, demonstrating the necessity of the projection head. (b) Representations on the unit hypersphere: uniformly distributed representations achieve uniformity, and representations of same-class nodes located close to each other achieve alignment. Representations from an HNN trained by HypeBoy meet both uniformity and alignment, justifying our design choice of the loss function. Experiments are conducted on the Cora dataset.

4.1 Step 1: Hypergraph augmentation

HypeBoy first obtains an augmented feature matrix $\mathbf{X}'$ and an augmented hyperedge set $\mathcal{E}'$ by using augmentation functions $\tau_{\mathbf{x}}$ and $\tau_{\mathcal{E}}$, respectively. The feature augmentation function $\tau_{\mathbf{x}}: (\mathbf{X}, p_v) \mapsto \mathbf{X}'$ masks certain entries of $\mathbf{X}$ based on Bernoulli sampling (spec., $\mathbf{X}' = \mathbf{X} \odot \mathbf{M}$, where $\odot$ is the element-wise product and $\mathbf{M}_{ij} \sim B(1, 1 - p_v)$, $\forall i \in [|\mathcal{V}|], j \in [d]$). The topology augmentation function $\tau_{\mathcal{E}}: (\mathcal{E}, p_e) \mapsto \mathcal{E}'$ samples $\lceil |\mathcal{E}|(1 - p_e) \rceil$ hyperedges uniformly at random from $\mathcal{E}$. Note that the magnitudes of the feature and topology augmentations are proportional to $p_v, p_e \in [0, 1]$, respectively.
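The two augmentation functions can be sketched as follows, assuming NumPy arrays for features and Python sets for hyperedges (an illustrative sketch; names and hyperparameter values are not from the paper).

```python
# Illustrative sketch of Step 1's augmentations: Bernoulli feature
# masking (tau_x) and uniform hyperedge subsampling (tau_E).
import math
import numpy as np

rng = np.random.default_rng(0)

def tau_x(X, p_v):
    """Mask entries of X: M_ij ~ B(1, 1 - p_v), X' = X * M."""
    M = rng.binomial(1, 1.0 - p_v, size=X.shape)
    return X * M

def tau_E(E, p_e):
    """Keep ceil(|E| * (1 - p_e)) hyperedges, uniformly at random."""
    keep = math.ceil(len(E) * (1.0 - p_e))
    idx = rng.choice(len(E), size=keep, replace=False)
    return [E[i] for i in idx]

X = np.ones((4, 3))
E = [{0, 1}, {1, 2, 3}, {0, 3}]
Xp, Ep = tau_x(X, p_v=0.5), tau_E(E, p_e=0.5)
print(Xp.shape, len(Ep))  # (4, 3) and ceil(3 * 0.5) = 2 hyperedges
```

Larger `p_v` and `p_e` mask more features and drop more hyperedges, matching the proportionality noted above.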

Role. For hypergraph SSL, augmentations are crucial for mitigating over-emphasized proximity information. That is, using all hyperedges and/or features for both message passing and the construction of the objective function may lead HNNs to rely heavily on direct neighborhood information (Tan et al., 2023). Many graph SSL strategies mask certain input edges and/or node features, preventing their encoder models from overfitting to the neighbor distribution and features (Hou et al., 2022; Li et al., 2023; Tan et al., 2023). Motivated by these findings, we adopt the augmentation step. In Appendix F.1, we demonstrate that augmentation enhances the node classification performance of HypeBoy.

4.2 Step 2: Hypergraph encoding

After augmentation, HypeBoy obtains node and query-subset representations. First, HypeBoy obtains node embeddings $\mathbf{Z} \in \mathbb{R}^{|\mathcal{V}| \times d'}$ by feeding the augmented hypergraph into an encoder HNN: $\mathbf{Z} = f_\theta(\mathbf{X}', \mathcal{E}')$. Then, HypeBoy obtains projected representations of query subsets (i.e., $q_{ij}$, $v_i \in e_j$, $e_j \in \mathcal{E}$) and nodes. To this end, we utilize a node projection head $f'_\phi$ and a set projection head $f''_\rho$. Specifically, the projected embeddings of node $v_i$ and query subset $q_{ij}$ are $\mathbf{h}_i = f'_\phi(\mathbf{z}_i) \in \mathbb{R}^k$ and $\mathbf{q}_{ij} = f''_\rho\left(\sum_{v_t \in q_{ij}} \mathbf{z}_t\right) \in \mathbb{R}^k$, respectively. Here, the design choice of the set projection head is motivated by Deep Sets (Zaheer et al., 2017).
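As a shape-level sketch, the two heads might be implemented as below. This simplifies each head to a single linear-plus-ReLU layer (the paper uses MLPs for both heads, per Section 5); all names and dimensions here are illustrative.

```python
# Illustrative sketch of Step 2's projection heads: a node head on each
# embedding, and a Deep Sets-style head on the query-subset sum (sum
# pooling makes the set representation permutation-invariant).
import numpy as np

rng = np.random.default_rng(0)
d_embed, k = 16, 8
W_node = rng.normal(size=(d_embed, k))   # stands in for f'_phi
W_set = rng.normal(size=(d_embed, k))    # stands in for f''_rho

def node_proj(z):
    """h_i = f'_phi(z_i), here a single linear + ReLU layer."""
    return np.maximum(z @ W_node, 0.0)

def set_proj(Z, query):
    """q_ij = f''_rho(sum of z_t over the query subset)."""
    pooled = Z[list(query)].sum(axis=0)  # permutation-invariant pooling
    return np.maximum(pooled @ W_set, 0.0)

Z = rng.normal(size=(10, d_embed))       # embeddings from the encoder HNN
h = node_proj(Z[0])
q = set_proj(Z, {1, 2, 3})
print(h.shape, q.shape)  # (8,) (8,)
```

Summing before projecting is what makes this a Deep Sets-style encoder: the output depends only on which nodes are in the query subset, not on their order.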

Role. We investigate the role of the projection heads, which are non-trivial components, in the context of the dimensional collapse of embeddings. Dimensional collapse is a phenomenon in which embedding vectors occupy only the lower dimensional sub-space of their full dimension (Jing et al., 2022). This is identified by observing whether or not certain singular values of the embedding covariance matrix drop to zero. To prevent dimensional collapse, we employ projection heads in HypeBoy, and this is in line with the prior findings of Jing et al. (2022) and Song et al. (2023). Figure 2(a) illustrates that an HNN trained using HypeBoy avoids dimensional collapse, whereas an HNN trained with a HypeBoy variant (without projection heads) does not. Results on more datasets are in Appendix F.2. Furthermore, we provide a theoretical analysis of why HypeBoy without projection heads may suffer from dimensional collapse in Appendix B.2. Note that this distinction leads to a performance discrepancy in node classification (Section 5.3).

4.3 Step 3: Hyperedge filling loss

The last step is to compute the SSL loss based on the hyperedge filling probability. We design $p_{(\mathbf{X}, \mathcal{E}, \Theta)}(v_k \mid q_{ij})$ to be normalized over $\mathcal{V}$ (i.e., $\sum_{v_k \in \mathcal{V}} p_{(\mathbf{X}, \mathcal{E}, \Theta)}(v_k \mid q_{ij}) = 1$). To this end, we utilize a softmax function to model probabilities. In sum, with the projected embeddings $\mathbf{h}_i$ and $\mathbf{q}_{ij}$ (Section 4.2), the probability of a node $v_i$ completing a query subset $q_{ij}$ is defined as:

$$p_{(\mathbf{X}, \mathcal{E}, \Theta)}(v_i \mid q_{ij}) := \frac{\exp\left(\mathrm{sim}(\mathbf{h}_i, \mathbf{q}_{ij})\right)}{\sum_{v_k \in \mathcal{V}} \exp\left(\mathrm{sim}(\mathbf{h}_k, \mathbf{q}_{ij})\right)}, \quad (4)$$

where $\mathrm{sim}$ is the cosine similarity function (other similarity functions are also applicable). Lastly, HypeBoy is optimized for all possible hyperedge filling cases (i.e., $\prod_{e_j \in \mathcal{E}} \prod_{v_i \in e_j} p_{(\mathbf{X}, \mathcal{E}, \Theta)}(v_i \mid q_{ij})$). To sum up, HypeBoy minimizes the negative log-likelihood over all possible cases:

$$\mathcal{L} := -\sum_{e_j \in \mathcal{E}} \sum_{v_i \in e_j} \log \frac{\exp\left(\mathrm{sim}(\mathbf{h}_i, \mathbf{q}_{ij})\right)}{\sum_{v_k \in \mathcal{V}} \exp\left(\mathrm{sim}(\mathbf{h}_k, \mathbf{q}_{ij})\right)}. \quad (5)$$

Note that the set $\Theta$ of all parameters consists of the parameters of the encoder HNN $f_\theta$, the node projection head $f'_\phi$, and the set projection head $f''_\rho$ (i.e., $\Theta = (\theta, \phi, \rho)$). They are updated by gradient descent to minimize the loss $\mathcal{L}$ defined in Eq. (5).
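The loss in Eq. (5) can be sketched directly from the projected embeddings. Below is a loop-based illustrative version, not the authors' (presumably batched) implementation; the query-subset vectors are assumed to be precomputed by the set projection head.

```python
# Illustrative sketch of Eq. (5): for each (missing node, query subset)
# pair, a softmax over cosine similarities to every node, then the
# summed negative log-likelihood.
import numpy as np

def filling_loss(H, Q):
    """H: (|V|, k) projected node embeddings.
    Q: list of (i, q_vec) pairs, one per hyperedge-filling instance."""
    Hn = H / np.linalg.norm(H, axis=1, keepdims=True)  # unit rows
    loss = 0.0
    for i, q in Q:
        qn = q / np.linalg.norm(q)
        sims = Hn @ qn                       # cosine sim to every node
        logp = sims[i] - np.log(np.exp(sims).sum())
        loss -= logp                         # -log p(v_i | q_ij)
    return loss

rng = np.random.default_rng(0)
H = rng.normal(size=(6, 4))                  # 6 nodes, k = 4
Q = [(0, H[1] + H[2]), (3, H[4])]            # two filling instances
print(filling_loss(H, Q))
```

The numerator term (`sims[i]`) pulls the missing node toward its query subset, while the log-sum-exp denominator pushes all other nodes away, matching the alignment/uniformity discussion below.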

Role. Our design choice of $p_{(\mathbf{X}, \mathcal{E}, \Theta)}(\cdot)$ ensures that representations learned by HypeBoy achieve both alignment and uniformity (Wang & Isola, 2020). In our case, alignment means that node embeddings belonging to the same class are located close to each other, and uniformity means that embeddings are distributed uniformly over the embedding space. Through the numerator of Eq. (5), HypeBoy pulls together the representations of a missing node and its query subset, promoting alignment, as discussed in our theoretical analysis (Section 3.2). At the same time, the denominator of Eq. (5) pushes every node representation away from each query-subset representation, encouraging representations to be uniformly distributed. This intuition is supported by the findings of Wang & Isola (2020): the denominators of their contrastive loss, which push representations away from each other, encourage uniformity. As shown in Figure 2(b), node representations from HypeBoy achieve both alignment and uniformity. Results on more datasets are in Appendix F.3.

4.4 Two-stage training scheme for further enhancement

In our preliminary study, we observed that, with a randomly initialized encoder, HypeBoy tends to rely too heavily on the projection heads rather than on the encoder parameters, leading to sub-optimal training. Thus, we propose a two-stage training scheme to enhance the effectiveness of HypeBoy. In a nutshell, we first train an HNN $f_\theta$ to reconstruct masked node features (inspired by Hou et al. (2022)). Then, we use the trained $f_\theta$ parameters to initialize HypeBoy's encoder for learning hyperedge filling (Sections 4.1-4.3). This feature warm-up for the HNN encoder repeats the following three steps for a fixed number of epochs, which we set to 300 in all our experiments.

Step 1: Encoding. For each feature-reconstruction training epoch, we sample a certain number of nodes uniformly at random from $\mathcal{V}$, denoted by $\mathcal{V}' \subseteq \mathcal{V}$. Then, we mask the features of the sampled nodes with a learnable input token $\mathbf{m}^{(I)} \in \mathbb{R}^d$ and use the resulting masked node feature matrix $\mathbf{X}'' \in \mathbb{R}^{|\mathcal{V}| \times d}$ as the input node feature matrix (i.e., $\mathbf{X}''_i = \mathbf{m}^{(I)}$, $\forall v_i \in \mathcal{V}'$ and $\mathbf{X}''_j = \mathbf{X}_j$, $\forall v_j \in \mathcal{V} \setminus \mathcal{V}'$). We also obtain augmented hyperedges $\mathcal{E}'$ by using the strategy described in Section 4.1. Then, we obtain node embeddings $\mathbf{Z}' \in \mathbb{R}^{|\mathcal{V}| \times d'}$ as follows: $\mathbf{Z}' = f_\theta(\mathbf{X}'', \mathcal{E}')$.

Step 2: Decoding. We again mask the embeddings of the nodes sampled earlier, this time with a learnable embedding token $\mathbf{m}^{(E)} \in \mathbb{R}^{d'}$, obtaining the masked node embedding matrix $\mathbf{Z}''$ (i.e., $\mathbf{Z}''_i = \mathbf{m}^{(E)}$, $\forall v_i \in \mathcal{V}'$ and $\mathbf{Z}''_j = \mathbf{Z}'_j$, $\forall v_j \in \mathcal{V} \setminus \mathcal{V}'$). Then, we acquire reconstructed node features $\hat{\mathbf{X}} \in \mathbb{R}^{|\mathcal{V}| \times d}$ by using a decoder HNN $g_\psi$ as follows: $\hat{\mathbf{X}} = g_\psi(\mathbf{Z}'', \mathcal{E}')$.

Step 3: Updating. After obtaining the reconstructed node features $\hat{\mathbf{X}}$, we maximize the similarity between the original and reconstructed features of the masked nodes. To this end, we minimize the following loss function, proposed by Hou et al. (2022): $\mathcal{L}' = (1 / |\mathcal{V}'|) \sum_{v_i \in \mathcal{V}'} \left(1 - \hat{\mathbf{X}}_i^T \mathbf{X}_i / (\|\hat{\mathbf{X}}_i\| \|\mathbf{X}_i\|)\right)$. Using gradient descent to minimize $\mathcal{L}'$, we update the parameters of the encoder HNN $f_\theta$, the decoder HNN $g_\psi$, the input token $\mathbf{m}^{(I)}$, and the embedding token $\mathbf{m}^{(E)}$.
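The warm-up loss $\mathcal{L}'$ is a mean cosine error over the masked nodes. A stand-alone sketch, assuming the original and reconstructed feature matrices are available as NumPy arrays (the encoder/decoder forward passes and gradient updates are omitted):

```python
# Illustrative sketch of the Step 3 reconstruction loss:
# L' = mean over masked nodes of (1 - cosine(X_hat_i, X_i)).
import numpy as np

def recon_loss(X_hat, X, masked):
    terms = []
    for i in masked:
        cos = X_hat[i] @ X[i] / (np.linalg.norm(X_hat[i]) *
                                 np.linalg.norm(X[i]))
        terms.append(1.0 - cos)          # 0 when perfectly aligned
    return sum(terms) / len(masked)

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_hat = np.array([[1.0, 0.1], [0.0, 2.0], [-1.0, -1.0]])
print(recon_loss(X_hat, X, masked=[0, 1, 2]))
```

Each term lies in $[0, 2]$: it is 0 when a reconstruction points in the same direction as the original feature and 2 when it points in the opposite direction, so minimizing $\mathcal{L}'$ drives directional agreement rather than exact recovery of magnitudes.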

5 Experimental Results

We now evaluate the efficacy of HypeBoy as a technique for (1) pre-training hypergraph neural networks (HNNs) for node classification (Section 5.1) and (2) learning general-purpose representations (Section 5.2). Then, we justify each of its components through an ablation study (Section 5.3).

Datasets. For experiments, we use 11 benchmark hypergraph datasets. The hypergraph datasets are from diverse domains, expressing co-citation, co-authorship, computer graphics, movie-actor, news, and political membership relations. In Appendix D, we detail their statistics and descriptions.

Baseline methods. We utilize 16 baseline methods. They include (a) 10 (semi-)supervised HNNs, including 2 state-of-the-art HNNs (ED-HNN (Wang et al., 2023a) and PhenomNN (Wang et al., 2023b)), (b) 2 generative SSL strategies for ordinary graphs (GraphMAE2 (Hou et al., 2023) and MaskGAE (Li et al., 2023)), and (c) 4 SSL strategies for hypergraphs (TriCL (Lee & Shin, 2023a), HyperGCL (Wei et al., 2022), HyperGRL (Du et al., 2022), and H-GD, a direct extension of an ordinary-graph SSL method (Zheng et al., 2022) to hypergraphs). We use UniGCNII (Huang & Yang, 2021) and GCN (Kipf & Welling, 2017) as the backbone encoders for hypergraph SSL methods and ordinary-graph SSL methods, respectively. In Appendix E, we provide their details, including their implementations, training, and hyperparameters.

HypeBoy. We utilize UniGCNII as an encoder of HypeBoy, which is the same as that of other hypergraph SSL methods. For both node- and set projection heads, we use an MLP. Further details of HypeBoy, including its implementations and hyperparameters, are provided in Appendix E.4.

Table 1: Efficacy as pre-training techniques: AVG (STD) of accuracy in node classification under the fine-tuning protocol. The best value per dataset is in bold. R.G. and A.R. denote the random guess and the average ranking among all methods, respectively. O.O.T. means that training did not complete within 24 hours. HypeBoy outperforms all baseline methods on 8 datasets and obtains the best average ranking overall.

| Method | Citeseer | Cora | Pubmed | Cora-CA | DBLP-P | DBLP-A | AMiner | IMDB | MN-40 | 20News | House | A.R. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **(Semi-)supervised** |  |  |  |  |  |  |  |  |  |  |  |  |
| R.G. | 18.1 (0.9) | 17.4 (1.0) | 36.0 (0.7) | 17.8 (0.7) | 18.9 (0.2) | 25.6 (1.0) | 10.2 (0.2) | 33.9 (0.7) | 3.7 (0.2) | 50.1 (1.3) | 26.7 (0.3) | 17.8 |
| MLP | 32.5 (7.0) | 27.9 (7.0) | 62.1 (3.7) | 34.8 (5.1) | 73.5 (1.0) | 56.0 (4.9) | 22.3 (1.7) | 39.1 (2.4) | 89.4 (1.5) | 73.1 (1.4) | **72.2 (3.9)** | 14.2 |
| HGNN | 41.9 (7.8) | 50.0 (7.2) | 72.9 (5.0) | 50.2 (5.7) | 85.3 (0.8) | 67.1 (6.0) | 30.3 (2.5) | 42.2 (2.9) | 88.0 (1.4) | 76.4 (1.9) | 52.7 (3.8) | 10.7 |
| HyperGCN | 31.4 (9.5) | 33.1 (10.2) | 63.5 (14.4) | 37.1 (9.1) | 53.5 (11.6) | 68.2 (14.4) | 26.4 (3.6) | 37.9 (4.5) | 55.1 (7.8) | 67.0 (9.4) | 49.8 (3.5) | 15.6 |
| HNHN | 43.1 (8.7) | 50.0 (7.9) | 72.1 (5.4) | 48.3 (6.2) | 84.6 (0.9) | 62.6 (4.8) | 30.0 (2.4) | 42.3 (3.4) | 86.1 (1.6) | 74.2 (1.5) | 49.7 (2.2) | 12.6 |
| UniGCN | 44.2 (8.1) | 49.1 (8.4) | 74.4 (3.9) | 51.3 (6.3) | 86.9 (0.6) | 65.1 (4.7) | 32.7 (1.8) | 41.6 (3.5) | 89.1 (1.0) | 77.2 (1.2) | 51.1 (2.4) | 9.6 |
| UniGIN | 40.4 (9.1) | 47.8 (7.7) | 69.8 (5.6) | 48.3 (6.1) | 83.4 (0.8) | 63.4 (5.1) | 30.2 (1.4) | 41.4 (2.7) | 88.2 (1.8) | 70.6 (1.8) | 51.1 (3.0) | 13.9 |
| UniGCNII | 44.2 (9.0) | 48.5 (7.4) | 74.1 (3.9) | 54.8 (7.5) | 87.4 (0.6) | 65.8 (3.9) | 32.5 (1.7) | 42.5 (3.9) | 90.8 (1.1) | 70.9 (1.0) | 50.8 (4.3) | 9.4 |
| AllSet | 43.5 (8.0) | 47.6 (4.2) | 72.4 (4.5) | 57.5 (5.7) | 85.9 (0.6) | 65.3 (3.9) | 29.3 (1.2) | 42.3 (2.4) | 92.1 (0.6) | 71.9 (2.5) | 54.1 (3.4) | 10.4 |
| ED-HNN | 40.3 (8.0) | 47.6 (7.7) | 72.7 (4.7) | 54.8 (5.4) | 86.2 (0.8) | 65.8 (4.8) | 30.0 (2.1) | 41.4 (3.0) | 90.7 (0.9) | 76.2 (1.2) | 71.3 (3.7) | 10.4 |
| PhenomNN | 49.8 (9.6) | 56.4 (9.6) | 76.1 (3.5) | 60.8 (6.2) | 88.1 (0.4) | 72.3 (4.1) | 33.8 (2.0) | 44.1 (3.7) | **95.9 (0.8)** | 74.0 (1.5) | 70.4 (5.6) | 3.9 |
| **Self-supervised** |  |  |  |  |  |  |  |  |  |  |  |  |
| GraphMAE2 | 41.1 (10.0) | 49.3 (8.3) | 72.9 (4.2) | 55.4 (8.4) | 86.6 (0.6) | 69.5 (4.4) | 32.8 (1.9) | 43.3 (2.7) | 90.1 (0.7) | 71.9 (1.3) | 52.8 (3.5) | 8.9 |
| MaskGAE | 49.6 (10.1) | 57.1 (8.8) | 72.8 (4.3) | 57.8 (5.9) | 86.3 (0.5) | 74.8 (3.1) | 33.7 (1.6) | 44.5 (2.5) | 90.0 (0.9) | O.O.T. | 51.8 (3.3) | 7.7 |
| TriCL | 51.7 (9.8) | 60.2 (7.9) | 76.2 (3.6) | 64.3 (5.5) | 88.0 (0.4) | 79.7 (2.9) | 33.1 (2.2) | 46.9 (2.9) | 90.3 (1.0) | 77.2 (1.0) | 69.7 (4.9) | 3.4 |
| HyperGCL | 47.0 (9.2) | 60.3 (7.4) | 76.8 (3.7) | 62.0 (5.1) | 87.6 (0.5) | 79.7 (3.8) | 33.2 (1.6) | 43.9 (3.6) | 91.2 (0.8) | **77.8 (0.8)** | 69.2 (4.9) | 3.5 |
| H-GD | 45.4 (9.9) | 50.6 (8.2) | 74.5 (3.5) | 58.8 (6.2) | 87.3 (0.5) | 75.1 (3.6) | 32.6 (2.2) | 43.0 (3.3) | 90.0 (1.0) | 77.2 (1.0) | 69.7 (5.1) | 6.1 |
| HyperGRL | 42.3 (9.3) | 49.1 (8.8) | 73.0 (4.3) | 55.8 (8.0) | 86.7 (0.6) | 70.8 (3.7) | 33.0 (1.8) | 43.1 (2.7) | 90.1 (0.8) | O.O.T. | 52.5 (3.3) | 9.2 |
| HypeBoy | **56.7 (9.8)** | **62.3 (7.7)** | **77.0 (3.4)** | **66.3 (4.6)** | **88.2 (0.4)** | **80.6 (2.3)** | **34.1 (2.2)** | **47.6 (2.5)** | 90.4 (0.9) | 77.6 (0.9) | 70.4 (4.8) | **1.7** |

5.1 Efficacy as a pre-training technique (fine-tuned evaluation)

Setup. Following Wei et al. (2022), we randomly split the nodes into training/validation/test sets with a ratio of 1%/1%/98%, respectively. For reliability, we assess each method on 20 data splits across 5 random model initializations, following Lee & Shin (2023a), and report the average (AVG) and standard deviation (STD) of test accuracy on each dataset. Specifically, for each SSL strategy, including HypeBoy, we pre-train a backbone encoder with the corresponding SSL scheme and then fine-tune the encoder in a (semi-)supervised manner.
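For concreteness, the split protocol above can be sketched as follows. This is an illustrative sketch only; the function name, seed handling, and node count are ours, not taken from the released code.

```python
import random

def split_nodes(num_nodes, train_ratio=0.01, val_ratio=0.01, seed=0):
    """Randomly split node indices into train/validation/test sets
    (1%/1%/98% by default, matching the setup described above)."""
    rng = random.Random(seed)
    nodes = list(range(num_nodes))
    rng.shuffle(nodes)
    n_train = max(1, int(num_nodes * train_ratio))
    n_val = max(1, int(num_nodes * val_ratio))
    train = nodes[:n_train]
    val = nodes[n_train:n_train + n_val]
    test = nodes[n_train + n_val:]
    return train, val, test

# 20 data splits (different seeds); each split is then paired with
# 5 random model initializations (100 runs per dataset in total).
splits = [split_nodes(2000, seed=s) for s in range(20)]
```
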

Results. As shown in Table 1, HypeBoy achieves the best average ranking among all 18 methods. Two points stand out. First, pre-training an HNN with HypeBoy generally improves node classification: compared to UniGCNII (HypeBoy's backbone encoder), HypeBoy obtains gains of up to 12.5 accuracy points on 10 out of 11 datasets. Second, HypeBoy outperforms all other SSL strategies; compared to the second-best method (TriCL), the accuracy gap is up to 5.0 points. In addition, the suboptimal performance of the state-of-the-art generative SSL strategies for graphs (i.e., GraphMAE2 and MaskGAE) implies the importance of preserving higher-order interactions when learning hypergraph representations. In summary, HypeBoy serves as an effective SSL strategy for pre-training HNNs for node classification.

5.2 Efficacy as a general-purpose embedding technique (linear evaluation)

Setup. We assess the generalizability of the representations learned by HypeBoy on two downstream tasks: node classification and hyperedge prediction. Given this objective, we limit the baselines to SSL strategies, which yield embeddings independent of downstream tasks, and to the original node features (i.e., naive $\mathbf{X}$). We use the linear evaluation protocol (i.e., the embeddings are used as fixed inputs to the classifiers for each task). For node classification, we use the same settings described in Section 5.1. For hyperedge prediction, we split hyperedges into training/validation/test sets with a ratio of 60%/20%/20%. For evaluation, we sample as many negative hyperedges as there are ground-truth hyperedges. We report the average (AVG) and standard deviation (STD) of test AUROC values on each dataset. Further experimental details about hyperedge prediction, including negative hyperedge sampling, are provided in Appendix E.3.
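A simple size-preserving uniform sampler illustrates the "equal number of negatives" setup. Note that this scheme is purely illustrative; the negative-sampling procedure actually used is specified in the paper's Appendix E.3, which is not reproduced here.

```python
import random

def sample_negative_hyperedges(hyperedges, num_nodes, seed=0):
    """For each ground-truth hyperedge, draw one negative hyperedge of the
    same size uniformly at random, resampling on collision with a positive.
    Illustrative only; see the paper's Appendix E.3 for the actual scheme."""
    rng = random.Random(seed)
    positives = {tuple(sorted(e)) for e in hyperedges}
    negatives = []
    for e in hyperedges:
        while True:
            cand = tuple(sorted(rng.sample(range(num_nodes), len(e))))
            if cand not in positives:
                negatives.append(cand)
                break
    return negatives

pos = [(0, 1, 2), (2, 3), (1, 4, 5, 6)]     # toy ground-truth hyperedges
neg = sample_negative_hyperedges(pos, num_nodes=50)
```
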

Results. As shown in Table 2, HypeBoy has the best average ranking in both node classification and hyperedge prediction. Specifically, in node classification, compared to the second-best method (TriCL), the accuracy gap is up to 6.3 points. This demonstrates that HypeBoy is more effective in learning general-purpose hypergraph representations than the other SSL strategies on hypergraphs.

Table 2: Efficacy as general-purpose embedding techniques: AVG and STD of accuracy/AUROC values in node classification and hyperedge prediction tasks under the linear evaluation protocol. In each downstream task, the best and second-best performances are colored green and yellow, respectively. A.R. denotes the average ranking among all methods. O.O.T. means that training was not completed within 24 hours. HypeBoy obtains the best average ranking in both downstream tasks.

| Method | Citeseer | Cora | Pubmed | Cora-CA | DBLP-P | DBLP-A | AMiner | IMDB | MN-40 | 20News | House | A.R. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **Node classification** | | | | | | | | | | | | |
| Naive $\mathbf{X}$ | 27.8 (7.0) | 32.4 (4.6) | 62.8 (2.8) | 31.9 (5.5) | 69.4 (0.7) | 54.7 (4.7) | 21.4 (1.2) | 38.1 (1.9) | 91.9 (1.1) | 70.6 (1.9) | 71.3 (5.4) | 5.5 |
| GraphMAE2 | 29.2 (6.5) | 37.5 (7.0) | 55.5 (9.5) | 38.2 (9.1) | 75.6 (1.7) | 57.5 (5.6) | 27.3 (2.7) | 36.6 (3.5) | 89.1 (1.8) | 62.3 (2.3) | 51.7 (3.5) | 6.3 |
| MaskGAE | 47.2 (11.1) | 56.8 (9.3) | 62.6 (5.5) | 56.0 (4.8) | 84.8 (0.7) | 75.1 (3.5) | 33.2 (2.0) | 44.1 (3.9) | 90.5 (0.9) | O.O.T. | 50.0 (2.8) | 4.5 |
| TriCL | 53.3 (10.0) | 62.1 (8.8) | 74.5 (4.1) | 63.6 (5.2) | 87.1 (0.7) | 80.9 (3.2) | 35.0 (3.6) | 48.0 (3.2) | 80.0 (5.1) | 67.2 (4.0) | 69.1 (5.5) | 2.6 |
| HyperGCL | 42.6 (8.6) | 61.8 (8.3) | 67.6 (8.0) | 58.1 (6.3) | 56.6 (5.2) | 79.8 (3.8) | 33.3 (2.2) | 47.5 (2.8) | 84.1 (2.8) | 71.2 (3.4) | 67.1 (5.4) | 4.0 |
| H-GD | 35.6 (7.8) | 37.6 (6.8) | 58.0 (8.2) | 48.6 (7.4) | 73.3 (1.3) | 74.0 (3.3) | 33.8 (5.0) | 35.2 (2.9) | 76.6 (4.4) | 54.8 (7.4) | 68.3 (5.7) | 5.5 |
| HyperGRL | 35.3 (8.2) | 35.4 (8.8) | 50.2 (8.7) | 39.4 (8.1) | 78.7 (1.2) | 62.7 (5.1) | 28.0 (2.8) | 34.8 (3.0) | 89.4 (1.5) | O.O.T. | 52.0 (3.7) | 6.1 |
| HypeBoy | 59.6 (9.9) | 63.5 (9.4) | 75.0 (3.4) | 66.0 (4.6) | 87.9 (0.5) | 81.2 (2.7) | 34.3 (3.2) | 48.8 (1.8) | 89.2 (2.2) | 75.7 (2.1) | 69.4 (5.4) | 1.5 |
| **Hyperedge prediction** | | | | | | | | | | | | |
| Naive $\mathbf{X}$ | 63.3 (2.1) | 75.5 (1.6) | 88.3 (0.6) | 55.0 (1.9) | 90.0 (0.4) | 72.1 (1.3) | 80.0 (1.1) | 39.5 (1.9) | 99.5 (0.1) | 97.7 (2.9) | 54.8 (5.0) | 6.4 |
| GraphMAE2 | 73.3 (2.7) | 76.4 (1.7) | 81.6 (1.1) | 76.3 (3.1) | 85.2 (0.4) | 68.3 (1.8) | 80.7 (0.9) | 53.7 (2.6) | 99.5 (0.1) | 90.1 (5.7) | 62.9 (3.8) | 6.1 |
| MaskGAE | 86.1 (1.6) | 88.5 (1.4) | 92.9 (0.5) | 81.8 (2.7) | 93.2 (0.5) | 79.3 (2.0) | 84.6 (0.1) | 58.1 (2.5) | 99.3 (0.1) | O.O.T. | 87.0 (3.4) | 4.1 |
| TriCL | 90.5 (1.2) | 90.7 (1.3) | 91.9 (0.5) | 87.8 (1.5) | 94.8 (0.2) | 87.9 (1.4) | 90.4 (0.6) | 58.9 (2.1) | 99.6 (0.1) | 98.2 (3.0) | 90.0 (2.6) | 1.8 |
| HyperGCL | 73.9 (2.6) | 85.4 (1.5) | 89.6 (0.5) | 81.1 (1.9) | 83.6 (0.6) | 83.5 (1.0) | 82.1 (7.6) | 53.8 (2.4) | 99.4 (0.1) | 96.7 (7.0) | 76.3 (6.3) | 5.0 |
| H-GD | 72.2 (5.0) | 71.9 (3.1) | 87.2 (0.7) | 73.2 (4.0) | 91.6 (1.0) | 81.4 (1.9) | 84.9 (2.1) | 53.1 (1.8) | 99.5 (0.1) | 83.9 (2.1) | 87.9 (3.1) | 5.3 |
| HyperGRL | 83.2 (8.0) | 84.0 (1.7) | 78.3 (1.8) | 80.0 (3.3) | 88.2 (0.3) | 80.8 (1.4) | 81.5 (0.7) | 54.7 (2.8) | 98.8 (0.4) | O.O.T. | 88.2 (3.3) | 5.5 |
| HypeBoy | 91.1 (1.1) | 91.9 (1.1) | 95.1 (0.3) | 88.1 (1.4) | 95.5 (0.1) | 87.3 (1.3) | 89.8 (0.5) | 59.4 (2.1) | 99.7 (0.1) | 99.0 (1.6) | 87.0 (2.8) | 1.5 |

Table 3: The ablation study with four variants of HypeBoy on node classification under the fine-tuning protocol. The best and second-best performances are colored green and yellow, respectively. F.R., H.F., and P.H. denote Feature Reconstruction, Hyperedge Filling, and Projection Heads, respectively. A.R. denotes the average ranking among all methods. NA denotes no pre-training. HypeBoy outperforms the others in most datasets, justifying each of its components.

| Variant | F.R. | H.F. | P.H. | Citeseer | Cora | Pubmed | Cora-CA | DBLP-P | DBLP-A | AMiner | IMDB | MN-40 | 20News | House | A.R. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NA | ✗ | ✗ | ✗ | 44.2 (9.0) | 48.5 (7.4) | 74.1 (3.9) | 54.8 (7.5) | 87.4 (0.6) | 65.8 (3.9) | 32.5 (1.7) | 42.5 (3.9) | 90.8 (1.1) | 70.9 (1.0) | 50.8 (4.3) | 5.5 |
| V1 | ✗ | ✔ | ✗ | 51.6 (11.2) | 60.7 (8.2) | 76.2 (3.6) | 63.5 (6.0) | 88.1 (0.5) | 78.5 (2.9) | 33.5 (2.8) | 46.8 (3.1) | 90.0 (1.1) | 77.4 (0.9) | 68.5 (4.5) | 4.3 |
| V2 | ✗ | ✔ | ✔ | 52.7 (9.6) | 59.7 (9.2) | 76.7 (3.2) | 63.5 (6.0) | 88.2 (0.5) | 79.1 (2.5) | 33.8 (2.2) | 46.9 (3.3) | 90.6 (1.0) | 77.0 (0.9) | 69.6 (4.9) | 3.3 |
| V3 | ✔ | ✗ | ✗ | 52.0 (9.3) | 58.9 (8.2) | 74.1 (3.9) | 61.2 (6.6) | 87.8 (0.4) | 79.9 (2.3) | 33.9 (2.1) | 46.3 (2.7) | 91.4 (0.9) | 77.5 (0.9) | 70.1 (4.8) | 3.6 |
| V4 | ✔ | ✔ | ✗ | 56.0 (9.9) | 61.8 (8.5) | 76.5 (3.1) | 65.3 (4.3) | 88.0 (0.4) | 80.3 (2.4) | 34.0 (2.0) | 47.5 (2.3) | 90.8 (1.0) | 77.4 (1.0) | 69.3 (5.0) | 2.5 |
| Ours | ✔ | ✔ | ✔ | 56.7 (9.8) | 62.3 (7.7) | 77.0 (3.4) | 66.3 (4.6) | 88.2 (0.4) | 80.6 (2.3) | 34.1 (2.2) | 47.6 (2.5) | 90.4 (0.9) | 77.6 (0.9) | 70.4 (4.8) | 1.4 |

5.3 Ablation study

We analyze the necessity of each component of HypeBoy, specifically, (a) the hyperedge filling task (Section 3.1), (b) projection heads (Section 4.2), and (c) the feature reconstruction warm-up (Section 4.4). To this end, we utilize four variants of HypeBoy: (V1) without the feature reconstruction warm-up and projection heads, (V2) without the feature reconstruction warm-up, (V3) without the hyperedge filling process, and (V4) without projection heads. Here, projection heads are used only for methods with the hyperedge filling process. Note that V3 is a method that utilizes only the feature reconstruction process, which is described in Section 4.4.

As shown in Table 3, HypeBoy, equipped with all of its components, outperforms the others in most datasets, demonstrating the effectiveness of our design choices. There are two other notable results. First, the necessity of projection heads is evidenced by the superior performance of V2 (compared to V1) and ours (compared to V4). Second, the advantage of the hyperedge filling task over feature reconstruction is manifested by the better average rank of V2 compared to V3.

6 Conclusion

In this work, we conduct a comprehensive analysis of generative self-supervised learning on hypergraphs. Our contribution is three-fold. First, we formulate the hyperedge filling task, a generative self-supervised learning task on hypergraphs, and investigate its theoretical connection to node classification (Section 3). Second, we present HypeBoy, a generative SSL method for solving the task (Section 4). Third, we demonstrate the superiority of HypeBoy over existing SSL methods on hypergraphs through extensive experiments (Section 5).

Acknowledgements

This work was supported by Samsung Electronics Co., Ltd. and Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2022-0-00871, Development of AI Autonomy and Knowledge Enhancement for AI Agent Collaboration) (No. 2019-0-00075, Artificial Intelligence Graduate School Program (KAIST)).

References

Behrouz et al. (2023) Ali Behrouz, Farnoosh Hashemi, Sadaf Sadeghian, and Margo Seltzer. CAT-Walk: Inductive hypergraph learning via set walks. In NeurIPS, 2023.

Benson et al. (2018) Austin R. Benson, Rediet Abebe, Michael T. Schaub, Ali Jadbabaie, and Jon Kleinberg. Simplicial closure and higher-order link prediction. Proceedings of the National Academy of Sciences, 115(48):E11221–E11230, 2018.

Bishop (2006) Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.

Chen et al. (2020) Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In ICML, 2020.

Chien et al. (2022) Eli Chien, Chao Pan, Jianhao Peng, and Olgica Milenkovic. You are AllSet: A multiset function framework for hypergraph neural networks. In ICLR, 2022.

Choe et al. (2023) Minyoung Choe, Sunwoo Kim, Jaemin Yoo, and Kijung Shin. Classification of edge-dependent labels of nodes in hypergraphs. In KDD, 2023.

Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL, 2019.

Dong et al. (2020) Yihe Dong, Will Sawin, and Yoshua Bengio. HNHN: Hypergraph networks with hyperedge neurons. In ICML Workshop on Graph Representation Learning and Beyond (GRL+), 2020.

Du et al. (2022) Boxin Du, Changhe Yuan, Robert Barton, Tal Neiman, and Hanghang Tong. Self-supervised hypergraph representation learning. In BigData, 2022.

Dua et al. (2017) Dheeru Dua, Casey Graff, et al. UCI Machine Learning Repository. 2017.

Feng et al. (2018) Yifan Feng, Zizhao Zhang, Xibin Zhao, Rongrong Ji, and Yue Gao. GVCNN: Group-view convolutional neural networks for 3D shape recognition. In CVPR, 2018.

Feng et al. (2019) Yifan Feng, Haoxuan You, Zizhao Zhang, Rongrong Ji, and Yue Gao. Hypergraph neural networks. In AAAI, 2019.

Grill et al. (2020) Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent: A new approach to self-supervised learning. In NeurIPS, 2020.

He et al. (2022) Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In CVPR, 2022.

Hou et al. (2022) Zhenyu Hou, Xiao Liu, Yukuo Cen, Yuxiao Dong, Hongxia Yang, Chunjie Wang, and Jie Tang. GraphMAE: Self-supervised masked graph autoencoders. In KDD, 2022.

Hou et al. (2023) Zhenyu Hou, Yufei He, Yukuo Cen, Xiao Liu, Yuxiao Dong, Evgeny Kharlamov, and Jie Tang. GraphMAE2: A decoding-enhanced masked self-supervised graph learner. In WWW, 2023.

Huang & Yang (2021) Jing Huang and Jie Yang. UniGNN: A unified framework for graph and hypergraph neural networks. In IJCAI, 2021.

Jing et al. (2022) Li Jing, Pascal Vincent, Yann LeCun, and Yuandong Tian. Understanding dimensional collapse in contrastive self-supervised learning. In ICLR, 2022.

Khanam et al. (2023) Kazi Zainab Khanam, Gautam Srivastava, and Vijay Mago. The homophily principle in social network analysis: A survey. Multimedia Tools and Applications, 82(6):8811–8854, 2023.

Kim et al. (2023a) Sunwoo Kim, Fanchen Bu, Minyoung Choe, Jaemin Yoo, and Kijung Shin. How transitive are real-world group interactions? Measurement and reproduction. In KDD, 2023a.

Kim et al. (2023b) Sunwoo Kim, Minyoung Choe, Jaemin Yoo, and Kijung Shin. Reciprocity in directed hypergraphs: Measures, findings, and generators. Data Mining and Knowledge Discovery, 37(6):2330–2388, 2023b.

Kim et al. (2023c) Sunwoo Kim, Dongjin Lee, Yul Kim, Jungho Park, Taeho Hwang, and Kijung Shin. Datasets, tasks, and training methods for large-scale hypergraph learning. Data Mining and Knowledge Discovery, 37(6):2216–2254, 2023c.

Kingma & Ba (2015) Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.

Kipf & Welling (2017) Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In ICLR, 2017.

Laakasuo et al. (2020) Michael Laakasuo, Anna Rotkirch, Max Van Duijn, Venla Berg, Markus Jokela, Tamas David-Barrett, Anneli Miettinen, Eiluned Pearce, and Robin Dunbar. Homophily in personality enhances group success among real-life friends. Frontiers in Psychology, 11:710, 2020.

Lee & Shin (2023a) Dongjin Lee and Kijung Shin. I'm me, we're us, and I'm us: Tri-directional contrastive learning on hypergraphs. In AAAI, 2023a.

Lee & Shin (2023b) Geon Lee and Kijung Shin. Temporal hypergraph motifs. Knowledge and Information Systems, 65(4):1549–1586, 2023b.

Lee et al. (2024) Geon Lee, Fanchen Bu, Tina Eliassi-Rad, and Kijung Shin. A survey on hypergraph mining: Patterns, tools, and generators. arXiv preprint arXiv:2401.08878, 2024.

Li et al. (2023) Jintang Li, Ruofan Wu, Wangbin Sun, Liang Chen, Sheng Tian, Liang Zhu, Changhua Meng, Zibin Zheng, and Weiqiang Wang. What's behind the mask: Understanding masked graph modeling for graph autoencoders. In KDD, 2023.

Nair & Hinton (2010) Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML, 2010.

OpenAI (2023) OpenAI. GPT-4 technical report. 2023.

Patil et al. (2020) Prasanna Patil, Govind Sharma, and M. Narasimha Murty. Negative sampling for hyperlink prediction in networks. In PAKDD, 2020.

Ruan et al. (2021) Ding Ruan, Shuyi Ji, Chenggang Yan, Junjie Zhu, Xibin Zhao, Yuedong Yang, Yue Gao, Changqing Zou, and Qionghai Dai. Exploring complex and heterogeneous correlations on hypergraph for the prediction of drug-target interactions. Patterns, 2(12):100390, 2021.

Saifuddin et al. (2023) Khaled Mohammed Saifuddin, Briana Bumgardner, Farhan Tanvir, and Esra Akbas. HyGNN: Drug-drug interaction prediction via hypergraph neural network. In ICDE, 2023.

Sawhney et al. (2020) Ramit Sawhney, Shivam Agarwal, Arnav Wadhwa, and Rajiv Ratn Shah. Spatiotemporal hypergraph convolution network for stock movement forecasting. In ICDM, 2020.

Sawhney et al. (2021) Ramit Sawhney, Shivam Agarwal, Arnav Wadhwa, Tyler Derr, and Rajiv Ratn Shah. Stock selection via spatiotemporal hypergraph attention network: A learning to rank approach. In AAAI, 2021.

Song et al. (2023) Zeen Song, Xingzhe Su, Jingyao Wang, Wenwen Qiang, Changwen Zheng, and Fuchun Sun. Towards the sparseness of projection head in self-supervised learning. arXiv preprint arXiv:2307.08913, 2023.

Su et al. (2015) Hang Su, Subhransu Maji, Evangelos Kalogerakis, and Erik Learned-Miller. Multi-view convolutional neural networks for 3D shape recognition. In CVPR, 2015.

Tan et al. (2023) Qiaoyu Tan, Ninghao Liu, Xiao Huang, Soo-Hyun Choi, Li Li, Rui Chen, and Xia Hu. S2GAE: Self-supervised graph autoencoders are generalizable learners with graph masking. In WSDM, 2023.

Tang et al. (2015) Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. LINE: Large-scale information network embedding. In WWW, 2015.

Tong et al. (2022) Zhan Tong, Yibing Song, Jue Wang, and Limin Wang. VideoMAE: Masked autoencoders are data-efficient learners for self-supervised video pre-training. In NeurIPS, 2022.

Veldt et al. (2023) Nate Veldt, Austin R. Benson, and Jon Kleinberg. Combinatorial characterizations and impossibilities for higher-order homophily. Science Advances, 9(1):eabq3200, 2023.

Veličković et al. (2019) Petar Veličković, William Fedus, William L. Hamilton, Pietro Liò, Yoshua Bengio, and R. Devon Hjelm. Deep graph infomax. In ICLR, 2019.

Wang et al. (2023a) Peihao Wang, Shenghao Yang, Yunyu Liu, Zhangyang Wang, and Pan Li. Equivariant hypergraph diffusion neural operators. In ICLR, 2023a.

Wang & Isola (2020) Tongzhou Wang and Phillip Isola. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In ICML, 2020.

Wang et al. (2019) Xiao Wang, Houye Ji, Chuan Shi, Bai Wang, Yanfang Ye, Peng Cui, and Philip S. Yu. Heterogeneous graph attention network. In WWW, 2019.

Wang et al. (2023b) Yuxin Wang, Quan Gan, Xipeng Qiu, Xuanjing Huang, and David Wipf. From hypergraph energy functions to hypergraph neural networks. In ICML, 2023b.

Wei et al. (2022) Tianxin Wei, Yuning You, Tianlong Chen, Yang Shen, Jingrui He, and Zhangyang Wang. Augmentations in hypergraph contrastive learning: Fabricated and generative. In NeurIPS, 2022.

Wu et al. (2015) Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 3D ShapeNets: A deep representation for volumetric shapes. In CVPR, 2015.

Xia et al. (2022) Lianghao Xia, Chao Huang, and Chuxu Zhang. Self-supervised hypergraph transformer for recommender systems. In KDD, 2022.

Yadati et al. (2019) Naganand Yadati, Madhav Nimishakavi, Prateek Yadav, Vikram Nitin, Anand Louis, and Partha Talukdar. HyperGCN: A new method for training graph convolutional networks on hypergraphs. In NeurIPS, 2019.

Yadati et al. (2020) Naganand Yadati, Vikram Nitin, Madhav Nimishakavi, Prateek Yadav, Anand Louis, and Partha Talukdar. NHP: Neural hypergraph link prediction. In CIKM, 2020.

Yoon et al. (2020) Se-eun Yoon, Hyungseok Song, Kijung Shin, and Yung Yi. How much and when do we need higher-order information in hypergraphs? A case study on hyperedge prediction. In WWW, 2020.

You et al. (2020) Yuning You, Tianlong Chen, Yongduo Sui, Ting Chen, Zhangyang Wang, and Yang Shen. Graph contrastive learning with augmentations. In NeurIPS, 2020.

Yu et al. (2021) Junliang Yu, Hongzhi Yin, Jundong Li, Qinyong Wang, Nguyen Quoc Viet Hung, and Xiangliang Zhang. Self-supervised multi-channel hypergraph convolutional network for social recommendation. In WWW, 2021.

Zaheer et al. (2017) Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Russ R. Salakhutdinov, and Alexander J. Smola. Deep sets. In NeurIPS, 2017.

Zhang et al. (2019) Chuxu Zhang, Dongjin Song, Chao Huang, Ananthram Swami, and Nitesh V. Chawla. Heterogeneous graph neural network. In KDD, 2019.

Zhang et al. (2021) Junwei Zhang, Min Gao, Junliang Yu, Lei Guo, Jundong Li, and Hongzhi Yin. Double-scale self-supervised hypergraph learning for group recommendation. In CIKM, 2021.

Zheng et al. (2022) Yizhen Zheng, Shirui Pan, Vincent Lee, Yu Zheng, and Philip S. Yu. Rethinking and scaling up graph contrastive learning: An extremely efficient approach with group discrimination. In NeurIPS, 2022.

Appendix A

In this section, we provide a proof of each theorem. In addition, we extend the results from Section 3.2, which assume $v_i \in C_1$, to cases where $v_i \in C_0$.

A.1 Proof of Theorem 1

Proof. We first analyze a node $v_i$ that belongs to $C_1$ (i.e., $v_i \in C_1$). By definition, $\mathbb{E}_{x}\left[\mathbf{1}_{\mathcal{F}(x_i) = 1}\right] = P_{x}\left(f(x_i; \mu_1, \mathbf{I}) > f(x_i; \mu_0, \mathbf{I})\right)$ holds. Thus, $\mathbb{E}_{x}[\mathbf{1}_{\mathcal{F}(x_i) = 1}]$ is determined by the distribution of $f(x_i; \mu_1, \mathbf{I}) - f(x_i; \mu_0, \mathbf{I})$, where the random variable is $x_i$. We provide the exact formula for this inequality:

$$f(x_i; \mu_1, \mathbf{I}) > f(x_i; \mu_0, \mathbf{I}),$$
$$\exp\left(-(x_i - \mu_1)^T (x_i - \mu_1)\right) > \exp\left(-(x_i - \mu_0)^T (x_i - \mu_0)\right),$$
$$(x_i - \mu_1)^T (x_i - \mu_1) < (x_i - \mu_0)^T (x_i - \mu_0),$$
$$x_i^T x_i - 2\mu_1^T x_i + \mu_1^T \mu_1 < x_i^T x_i - 2\mu_0^T x_i + \mu_0^T \mu_0,$$
$$(\mu_1 - \mu_0)^T x_i + \mu_0^T \mu_0 - \mu_1^T \mu_1 > 0 \;\equiv\; \vec{\mathbf{1}}^T x_i > 0, \quad \because \text{definition of } \mu_1 \text{ and } \mu_0 \text{ in Assumption 1}.$$

Thus, $\mathbb{E}_{x}[\mathbf{1}_{\mathcal{F}(x_i) = 1}]$ is equivalent to $P_{x}(\vec{\mathbf{1}}^T x_i > 0)$. In a similar sense, $\mathbb{E}_{x}[\mathbf{1}_{\mathcal{F}(z_i) = 1}]$ is equivalent to $P_{x}(\vec{\mathbf{1}}^T z_i > 0)$, since, as mentioned in Section 3.2, $z_i$ is a function of $x_k, \forall v_k \in \mathcal{V}$.

Now, we formalize $z_i$. Note that $z_i = x_i - \gamma \nabla_{x_i} \mathcal{L}$ holds by (F2). To further elaborate on this equation, we first detail $\nabla_{x_i} \mathcal{L}$:

$$\nabla_{x_i} \mathcal{L} = \frac{\partial \left(-\log\left(p_{(\mathbf{X}, \mathcal{E}, \Theta)}(v_i \mid q_{ij})\right)\right)}{\partial x_i} = \frac{\partial \left(-\log\left(\frac{\exp\left(x_i^T \left(\sum_{v_k \in q_{ij}} x_k\right)\right)}{\sum_{v_t \in \mathcal{V}} \exp\left(x_t^T \left(\sum_{v_k \in q_{ij}} x_k\right)\right)}\right)\right)}{\partial x_i}, \qquad (6)$$
$$= -\frac{\partial \left(x_i^T \left(\sum_{v_k \in q_{ij}} x_k\right)\right)}{\partial x_i} + \frac{\partial \left(\log\left(\sum_{v_t \in \mathcal{V}} \exp\left(x_t^T \left(\sum_{v_k \in q_{ij}} x_k\right)\right)\right)\right)}{\partial x_i}, \qquad (7)$$
$$= -\left(\sum_{v_k \in q_{ij}} x_k\right) + \frac{\exp\left(x_i^T \left(\sum_{v_k \in q_{ij}} x_k\right)\right)}{\sum_{v_t \in \mathcal{V}} \exp\left(x_t^T \left(\sum_{v_k \in q_{ij}} x_k\right)\right)} \left(\sum_{v_k \in q_{ij}} x_k\right), \qquad (8)$$
$$= -\left(1 - \frac{\exp\left(x_i^T \left(\sum_{v_k \in q_{ij}} x_k\right)\right)}{\sum_{v_t \in \mathcal{V}} \exp\left(x_t^T \left(\sum_{v_k \in q_{ij}} x_k\right)\right)}\right) \left(\sum_{v_k \in q_{ij}} x_k\right). \qquad (9)$$

By Eq. (9), $z_i$ can be expressed as a function of $x_k, \forall v_k \in \mathcal{V}$, as follows:

$$z_i = x_i + \gamma \Bigg(1 - \underbrace{\frac{\exp\left(x_i^T \left(\sum_{v_k \in q} x_k\right)\right)}{\sum_{v_t \in \mathcal{V}} \exp\left(x_t^T \left(\sum_{v_k \in q} x_k\right)\right)}}_{\text{(Term 1)}}\Bigg) \left(\sum_{v_k \in q} x_k\right). \qquad (10)$$

Let $x'_q$ denote $\sum_{v_k \in q_{ij}} x_k$, and denote (Term 1) in Eq. (10) by $f(x) \in (0, 1)$. Then, we rewrite Eq. (10) as $z_i = x_i + \gamma (1 - f(x)) x'_q$. We finally rewrite $\vec{\mathbf{1}}^T z_i$ as follows:

$$\vec{\mathbf{1}}^T z_i = \vec{\mathbf{1}}^T x_i + \gamma (1 - f(x)) \, \vec{\mathbf{1}}^T x'_q. \qquad (11)$$
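As a quick numerical sanity check (ours, not part of the original derivation), the closed-form gradient in Eq. (9) can be compared against finite differences of the negative log-softmax loss on toy features. All names and the toy values below are illustrative.

```python
import math

def partial_sum(X, q, d):
    # c = sum of x_k over the partial hyperedge q (q excludes the target node)
    return [sum(X[k][t] for k in q) for t in range(d)]

def softmax_prob(X, i, c):
    # p_i = exp(x_i^T c) / sum_t exp(x_t^T c), with max-shift for stability
    scores = [sum(X[t][u] * c[u] for u in range(len(c))) for t in range(len(X))]
    m = max(scores)
    exps = [math.exp(v - m) for v in scores]
    return exps[i] / sum(exps), scores, m

def grad_eq9(X, i, q):
    # Closed form of Eq. (9): -(1 - p_i) * c
    d = len(X[0])
    c = partial_sum(X, q, d)
    p_i, _, _ = softmax_prob(X, i, c)
    return [-(1.0 - p_i) * c[t] for t in range(d)]

def loss(X, i, q):
    # -log p(v_i | q): negative log-softmax over all nodes
    d = len(X[0])
    c = partial_sum(X, q, d)
    _, scores, m = softmax_prob(X, i, c)
    logZ = m + math.log(sum(math.exp(v - m) for v in scores))
    return logZ - scores[i]

X = [[0.2, -0.1], [0.4, 0.3], [-0.5, 0.1], [0.3, 0.2]]  # toy node features
i, q = 0, [1, 2]                                        # candidate v_0, q = e \ {v_0}
g = grad_eq9(X, i, q)
eps = 1e-6
fd = []
for t in range(len(X[0])):
    Xp = [row[:] for row in X]; Xp[i][t] += eps
    Xm = [row[:] for row in X]; Xm[i][t] -= eps
    fd.append((loss(Xp, i, q) - loss(Xm, i, q)) / (2 * eps))
```
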

Let $\beta$ denote $\vec{\mathbf{1}}^T x'_q$. Note that by the statement of Theorem 1, $\vec{\mathbf{1}}^T x'_q > 0 \equiv \beta > 0$ holds. Thus, our main interest, which is $P_{x}(\vec{\mathbf{1}}^T z_i > 0)$, can be rewritten as follows:

$$P_{x}\left(\vec{\mathbf{1}}^T z_i > 0\right) = P_{x}\left(\vec{\mathbf{1}}^T x_i + \gamma (1 - f(x)) \beta > 0\right). \qquad (12)$$

Note that when $\vec{\mathbf{1}}^T x_i > 0$ holds, $\vec{\mathbf{1}}^T z_i > 0$ holds as well, since $\gamma (1 - f(x)) \beta > 0$ holds. Thus, Eq. (12) is split as follows:

$$P_{x}\left(\vec{\mathbf{1}}^T z_i > 0\right) = P_{x}\left(\vec{\mathbf{1}}^T x_i > 0\right) + P_{x}\left(-\gamma (1 - f(x)) \beta < \vec{\mathbf{1}}^T x_i < 0\right),$$
$$\mathbb{E}_{x}\left[\mathbf{1}_{\mathcal{F}(z_i) = 1}\right] = \underbrace{\mathbb{E}_{x}\left[\mathbf{1}_{\mathcal{F}(x_i) = 1}\right]}_{\text{(a) Expected accuracy of naive } x_i} + \underbrace{P_{x}\left(-\gamma \beta < \frac{\vec{\mathbf{1}}^T x_i}{1 - f(x)} < 0\right)}_{\text{(b) Additional gain via hyperedge filling}}. \qquad (13)$$

Note that the gain term, which is the (b) term of Eq. (13), is always greater than zero.

Generalization to $v_i \in C_0$. Now, we analyze a node $v_i$ that belongs to $C_0$ (i.e., $v_i \in C_0$). In this case, the previous condition $\vec{\mathbf{1}}^T x'_q > 0 \equiv \beta > 0$ becomes $\vec{\mathbf{1}}^T x'_q < 0 \equiv \beta < 0$. In a similar sense, the expected accuracy $P_{x}(\vec{\mathbf{1}}^T z_i > 0)$ is changed to $P_{x}(\vec{\mathbf{1}}^T z_i < 0)$. In this setting, we can directly extend the result of Eq. (12) as follows:

$$P_{x}\left(\vec{\mathbf{1}}^T z_i < 0\right) = P_{x}\left(\vec{\mathbf{1}}^T x_i + \gamma (1 - f(x)) \beta < 0\right). \qquad (14)$$

By employing the above proof, we can obtain the following result (note that $\beta < 0$ in this case):

$$P_{x}\left(\vec{\mathbf{1}}^T z_i < 0\right) = P_{x}\left(\vec{\mathbf{1}}^T x_i < 0\right) + P_{x}\left(0 < \vec{\mathbf{1}}^T x_i < -\gamma (1 - f(x)) \beta\right),$$
$$\mathbb{E}_{x}\left[\mathbf{1}_{\mathcal{F}(z_i) = 0}\right] = \underbrace{\mathbb{E}_{x}\left[\mathbf{1}_{\mathcal{F}(x_i) = 0}\right]}_{\text{(a) Expected accuracy of naive } x_i} + \underbrace{P_{x}\left(0 < \frac{\vec{\mathbf{1}}^T x_i}{1 - f(x)} < -\gamma \beta\right)}_{\text{(b) Additional gain via hyperedge filling}}. \qquad (15)$$

Thus, we can also derive the same result for $v_i \in C_0$. ∎

A.2 Proof of Theorem 2

Remark. We have denoted the affinity parameter in Assumption 2 as $\mathcal{P}$ to distinguish it from the hyperedge filling probability $p_{(\mathbf{X}, \mathcal{E}, \Theta)}$. In this proof, with a slight abuse of notation, we let $\mathcal{P} := p$, since $p_{(\mathbf{X}, \mathcal{E}, \Theta)}$ does not appear in this proof. In addition, we let $e_j := e$ and $q_{ij} := q$.

Proof. We first derive the functional form of $P_{x, e}\left(\vec{\mathbf{1}}^T \left(\sum_{v_k \in q} x_k\right) > 0 \,\middle|\, p\right)$. By $v_i \in e_j \cap C_1$, which is the condition of the theorem, the following holds:

$$P_{x, e}\left(\vec{\mathbf{1}}^T \left(\sum_{v_k \in q} x_k\right) > 0 \,\middle|\, p\right) = \frac{P_{x, e}\left(\vec{\mathbf{1}}^T \left(\sum_{v_k \in q} x_k\right) > 0, \; v_i \in e \,\middle|\, p\right)}{P_{e}\left(v_i \in e \,\middle|\, p\right)}. \qquad (16)$$

It is important to note that once the number of nodes belonging to both $C_1$ and $e_j$ is decided, denoted by $s$ (i.e., $s := |e \cap C_1|$), the distribution of $\vec{\mathbf{1}}^T \left(\sum_{v_k \in q} x_k\right)$ is automatically decided, since each $x_t, \forall v_t \in \mathcal{V}$, is independently generated from a Gaussian distribution whose mean vector is determined by the class of $v_t$. Specifically, $\sum_{v_k \in q} x_k \sim \mathcal{N}\left((2s - 1 - S)\mu_1, (S - 1)\mathbf{I}\right)$, where $S$ is the size of $e$ for a given $s$ (i.e., $|e| = S$). From this result, the following two results are induced:

$$\vec{\mathbf{1}}^T \left(\sum_{v_k \in q} x_k\right) \sim \mathcal{N}\left((2s - 1 - S)\frac{d}{2}, \; (S - 1) d\right), \qquad (17)$$
$$P_{x}\left(\vec{\mathbf{1}}^T \left(\sum_{v_k \in q} x_k\right) > 0 \,\middle|\, s\right) = \Phi\left((2s - S - 1)\sqrt{\frac{d}{4(S - 1)}}\right), \qquad (18)$$

where $\Phi(\cdot)$ is the C.D.F. of the standard Gaussian distribution. Then, we rewrite Eq. (16) as follows:


$$= \sum_{s = 0}^{S} \frac{P_{x, e}\left(\vec{\mathbf{1}}^T \left(\sum_{v_k \in q} x_k\right) > 0, \; s, \; v_i \in e\right)}{P_{e}\left(v_i \in e \,\middle|\, p\right)}, \qquad (19)$$
$$= \sum_{s = 0}^{S} P_{x}\left(\vec{\mathbf{1}}^T \left(\sum_{v_k \in q} x_k\right) > 0 \,\middle|\, s\right) \times \frac{P_{e}\left(s, \; v_i \in e\right)}{P_{e}\left(v_i \in e \,\middle|\, p\right)}. \qquad (20)$$
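The Gaussian reduction in Eqs. (17) and (18) can be checked by simulation. This is our own sanity check, assuming, consistently with Eq. (17), that $\vec{\mathbf{1}}^T \mu_1 = d/2$ (realized below with per-dimension means of $\pm 1/2$ and unit variance).

```python
import math, random

def Phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

d, S, s = 4, 6, 5                     # dimension d, |e| = S, |e ∩ C_1| = s
pred = Phi((2 * s - S - 1) * math.sqrt(d / (4.0 * (S - 1))))   # Eq. (18)

# Simulate 1^T sum_{v_k in q} x_k with q = e \ {v_i}: (s - 1) class-1 nodes
# (per-dimension mean +1/2) and (S - s) class-0 nodes (mean -1/2).
rng = random.Random(0)
trials, hits = 50000, 0
for _ in range(trials):
    total = 0.0
    for _ in range(s - 1):
        total += sum(rng.gauss(0.5, 1.0) for _ in range(d))
    for _ in range(S - s):
        total += sum(rng.gauss(-0.5, 1.0) for _ in range(d))
    hits += total > 0
empirical = hits / trials
```
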

By Assumption 2, each $P_{e}(s, v_i \in e)$ is derived as follows:

$$P_{e}\left(s, \; v_i \in e \,\middle|\, p\right) = \binom{S}{s} \Bigg(\underbrace{\frac{p^s (1 - p)^{S - s}}{2}}_{\text{prob. of } s \text{ at } c = 1} + \underbrace{\frac{(1 - p)^s p^{S - s}}{2}}_{\text{prob. of } s \text{ at } c = 0}\Bigg) \underbrace{\left(1 - \frac{\binom{N - 1}{s}}{\binom{N}{s}}\right)}_{\text{prob. of } v_i \in C_1 \cap e}. \qquad (21)$$

By using Eq. (21), we further detail $P_{e}(v_i \in e \mid p)$ as follows:

$$P_{e}\left(v_i \in e \,\middle|\, p\right) = \sum_{s = 0}^{S} \binom{S}{s} \left(\frac{p^s (1 - p)^{S - s}}{2} + \frac{(1 - p)^s p^{S - s}}{2}\right) \left(1 - \frac{\binom{N - 1}{s}}{\binom{N}{s}}\right), \qquad (22)$$
$$= \sum_{s = 0}^{S} \binom{S}{s} \left(\frac{p^s (1 - p)^{S - s}}{2} + \frac{(1 - p)^s p^{S - s}}{2}\right) \frac{s}{N}, \qquad (23)$$
$$= \frac{1}{N} \sum_{s = 0}^{S} \binom{S}{s} \Bigg(\underbrace{\frac{s \, p^s (1 - p)^{S - s}}{2}}_{s \sim B(S, p)} + \underbrace{\frac{s \, (1 - p)^s p^{S - s}}{2}}_{s \sim B(S, 1 - p)}\Bigg), \qquad (24)$$
$$= \frac{1}{N} \left(\frac{S p}{2} + \frac{S (1 - p)}{2}\right), \quad \because \text{expectation of the Binomial distribution}, \qquad (25)$$
$$= \frac{S}{2N}. \qquad (26)$$
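The collapse of Eqs. (22) through (26) to $S/(2N)$, independently of $p$, admits an exact numerical check (a verification sketch of ours, not part of the proof):

```python
from math import comb

def prob_vi_in_e(S, N, p):
    """Left-hand side of Eq. (22), using the identity
    1 - comb(N-1, s)/comb(N, s) = s/N from Eq. (23)."""
    total = 0.0
    for s in range(S + 1):
        w = comb(S, s) * (p**s * (1 - p)**(S - s) + (1 - p)**s * p**(S - s)) / 2
        total += w * (s / N)
    return total
```

The mixture of the two binomial means, $(Sp + S(1-p))/2 = S/2$, is what removes the dependence on $p$.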

With the result of Eq. (26) and Eq. (18), we rewrite Eq. (20) as follows:

$$\equiv \frac{2N}{S} \sum_{s = 0}^{S} \binom{S}{s} \left(\frac{p^s (1 - p)^{S - s}}{2} + \frac{(1 - p)^s p^{S - s}}{2}\right) \frac{s}{N} \, \Phi\left((2s - 1 - S)\sqrt{\frac{d}{4(S - 1)}}\right), \qquad (27)$$
$$= \frac{1}{S} \sum_{s = 0}^{S} \binom{S}{s} s \left(p^s (1 - p)^{S - s} + (1 - p)^s p^{S - s}\right) \Phi\left((2s - 1 - S)\sqrt{\frac{d}{4(S - 1)}}\right). \qquad (28)$$

Note that our main function, which is $P_{x, e}\left(\vec{\mathbf{1}}^T \left(\sum_{v_k \in q} x_k\right) > 0 \,\middle|\, p\right)$, is equal to Eq. (28).

It is important to note that Eq. (28) is symmetric about $p = 0.5$ on $p \in [0, 1]$. Thus, once we prove that Eq. (28) is strictly increasing on $p \in [0.5, 1]$, it immediately follows that Eq. (28) is strictly decreasing on $p \in [0, 0.5]$. Hence, we prove the second statement of the theorem by showing that Eq. (28) is strictly increasing on $p \in [0.5, 1]$.
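Both properties of Eq. (28), the symmetry about $p = 0.5$ and the monotonicity on each half of $[0, 1]$, can be observed directly by evaluating the expression (a numerical illustration of ours, with arbitrary choices of $S$ and $d$):

```python
import math
from math import comb

def Phi(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def eq28(p, S=9, d=4):
    """Eq. (28): hyperedge-filling accuracy as a function of the affinity
    parameter p, for hyperedge size |e| = S and feature dimension d."""
    K = math.sqrt(d / (4.0 * (S - 1)))
    total = 0.0
    for s in range(S + 1):
        w = comb(S, s) * s * (p**s * (1 - p)**(S - s) + (1 - p)**s * p**(S - s))
        total += w * Phi((2 * s - 1 - S) * K)
    return total / S
```
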

Now, we show the first and second statements of the theorem. We first show the second statement, and from that result, we further prove the first statement.

Second statement. We first rewrite Eq. (28) by adequately mixing the two binomial functions. For simplicity, we denote $\Phi\left((2s - 1 - S)\sqrt{\frac{d}{4(S - 1)}}\right)$ by $\phi(s)$, since $s$ is the only quantity that varies inside $\Phi(\cdot)$; the other terms are fixed constants. From this, Eq. (28) is rewritten as follows:

$$= \frac{1}{S} \sum_{s = 0}^{S} \binom{S}{s} s \left(p^s (1 - p)^{S - s} + (1 - p)^s p^{S - s}\right) \phi(s), \qquad (29)$$
$$= \underbrace{\frac{1}{S} \sum_{s = 0}^{S} \binom{S}{s} s \, p^s (1 - p)^{S - s} \phi(s)}_{\text{(Term 1)}} + \underbrace{\frac{1}{S} \sum_{s = 0}^{S} \binom{S}{s} s \, (1 - p)^s p^{S - s} \phi(s)}_{\text{(Term 2)}}. \qquad (30)$$

We define $s' := S - s$. Due to the symmetric characteristic of the binomial function and the standard Gaussian distribution, we can rewrite (Term 2) in Eq. (30) as follows:

$$\equiv \frac{1}{S} \sum_{s' = 0}^{S} \binom{S}{s'} p^{s'} (1 - p)^{S - s'} (S - s') \, \Phi\left((S - 2s' - 1)\sqrt{\frac{d}{4(S - 1)}}\right). \qquad (31)$$

Since we can regard $s$ and $s'$ as equivalent terms in our framework, we add (Term 1) of Eq. (30) and Eq. (31) as follows:

$$\sum_{s = 0}^{S} \frac{\binom{S}{s} p^s (1 - p)^{S - s}}{S} \underbrace{\left[(S - s) \, \Phi\left((S - 2s - 1)\sqrt{\tfrac{d}{4(S - 1)}}\right) + s \, \Phi\left((2s - S - 1)\sqrt{\tfrac{d}{4(S - 1)}}\right)\right]}_{\text{Term (a)}}. \qquad (32)$$

We aim to show that Eq. (32) is a strictly increasing function w.r.t. $p \in [0.5, 1]$. For simplicity, we let $K := \sqrt{d / (4(S - 1))}$ and discard the constant factor $1/S$ from Eq. (32). Note that without Term (a), Eq. (32) is equal to 1, since it is then a sum of binomial probability mass functions.

That is, we can think of Eq. (32) as a weighted average of the following values:

$$(S - s) \, \Phi\left((S - 2s - 1) K\right) + s \, \Phi\left((2s - S - 1) K\right), \quad \forall s \in \{0, 1, \cdots, S\}. \qquad (33)$$

Specifically, we let $\binom{S}{s} p^s (1 - p)^{S - s}$ be the weight of $\left[(S - s) \Phi((S - 2s - 1) K) + s \Phi((2s - S - 1) K)\right]$. Note that the values of Eq. (33) and their weights are both symmetric about $s = S/2$; that is, $s$ and $S - s$ have the same weight value. Thus, we can rewrite Eq. (32) for an odd number $S$ as:


$$\sum_{s = \lfloor S/2 \rfloor}^{S} \binom{S}{s} \left(p^s (1 - p)^{S - s} + (1 - p)^s p^{S - s}\right) \times \left[(S - s) \, \Phi\left((S - 2s - 1) K\right) + s \, \Phi\left((2s - S - 1) K\right)\right]. \qquad (34)$$

While this formulation is for an odd number $S$, we extend the subsequent results to an even number $S$ later.

We first show that Eq. (33) is an increasing function w.r.t. $s \in \{\lfloor S/2 \rfloor, \lfloor S/2 \rfloor + 1, \cdots, S\}$. For the two cases $s = k$ and $s = k + 1$, Eq. (33) is expressed as follows:

$$(S - k - 1) \, \Phi\left((S - 2k - 3) K\right) + (k + 1) \, \Phi\left((2k - S + 1) K\right) \qquad (35)$$
$$\geq (S - k) \, \Phi\left((S - 2k - 1) K\right) + k \, \Phi\left((2k - S - 1) K\right), \qquad (36)$$
$$\equiv k \left(\Phi\left((2k - S + 1) K\right) - \Phi\left((2k - S - 1) K\right)\right) + \Phi\left((2k - S + 1) K\right) \qquad (37)$$
$$\geq (S - k) \left(\Phi\left((S - 2k - 1) K\right) - \Phi\left((S - 2k - 3) K\right)\right) + \Phi\left((S - 2k - 3) K\right). \qquad (38)$$

By the condition $s \geq \lfloor S/2 \rfloor$, $k \geq S - k$ holds. In addition, by the characteristic of the C.D.F. of the standard normal distribution, $\Phi((2k - S + 1)K) - \Phi((2k - S - 1)K) \geq \Phi((S - 2k - 1)K) - \Phi((S - 2k - 3)K)$ also holds. Moreover, since $s \geq \lfloor S/2 \rfloor$, $\Phi((2k - S + 1)K) \geq \Phi((S - 2k - 3)K)$ also holds. In sum, Eq. (35) $\geq$ Eq. (36) holds, and this result implies that Eq. (33) is an increasing function w.r.t. $s \in \{\lfloor S/2 \rfloor, \lfloor S/2 \rfloor + 1, \cdots, S\}$.
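The monotonicity of Eq. (33) on the upper half of the range of $s$ can be illustrated numerically (our own check; note that the first pair, $s = \lfloor S/2 \rfloor$ and $s = \lceil S/2 \rceil$, is tied by the symmetry of Eq. (33) about $s = S/2$, so the sequence is non-decreasing rather than strictly increasing at that single step):

```python
import math

def Phi(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def eq33(s, S, K):
    """Eq. (33): (S - s) Phi((S - 2s - 1)K) + s Phi((2s - S - 1)K)."""
    return (S - s) * Phi((S - 2 * s - 1) * K) + s * Phi((2 * s - S - 1) * K)

S, d = 11, 4
K = math.sqrt(d / (4.0 * (S - 1)))
vals = [eq33(s, S, K) for s in range(S // 2, S + 1)]
```
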

Since a greater $s$ has a higher value of Eq. (33), we can conclude that Eq. (34), our goal function, is an increasing function w.r.t. $p$ if, as $p$ increases, more weight is assigned to the terms with higher $s$. From this rationale, we show the following statement: $\exists s^* \in \{\lfloor S/2 \rfloor, \cdots, S\}$ s.t. $\partial w(p, s) / \partial p > 0, \forall s \geq s^*$ and $\partial w(p, s) / \partial p < 0, \forall s \leq s^*$, where $w(p, s) = \binom{S}{s} \left(p^s (1 - p)^{S - s} + (1 - p)^s p^{S - s}\right)$. We first derive the first derivative of $w(p, s)$:

$$\frac{\partial w(p, s)}{\partial p} = \binom{S}{s} \Big(\underbrace{s \, p^{s - 1} (1 - p)^{S - s}}_{(a)} - \underbrace{(S - s) \, p^s (1 - p)^{S - s - 1}}_{(b)} \qquad (39)$$
$$- \underbrace{s \, (1 - p)^{s - 1} p^{S - s}}_{(c)} + \underbrace{(S - s) \, (1 - p)^s p^{S - s - 1}}_{(d)}\Big). \qquad (40)$$

Then, we show that $\exists s^* \in \{\lfloor S/2 \rfloor, \cdots, S\}$ s.t. $((a) - (b)) \geq ((c) - (d)), \forall s \geq s^*$ and $((a) - (b)) < ((c) - (d)), \forall s \leq s^*$. We elaborate on $(a) - (b)$ and $(c) - (d)$ as follows:

$$(a) - (b) = p^{s - 1} (1 - p)^{S - s - 1} \left(s (1 - p) - (S - s) p\right), \qquad (41)$$
$$= p^{s - 1} (1 - p)^{S - s - 1} \left(s - s p - S p + s p\right) = p^{s - 1} (1 - p)^{S - s - 1} (s - S p), \qquad (42)$$
$$(c) - (d) = (1 - p)^{s - 1} p^{S - s - 1} \left(s p - (S - s)(1 - p)\right), \qquad (43)$$
$$= (1 - p)^{s - 1} p^{S - s - 1} \left(s p - S + s + S p - s p\right) = (1 - p)^{s - 1} p^{S - s - 1} (s - S + S p). \qquad (44)$$

Remark. When $p = 1/2$, $((a) - (b)) - ((c) - (d)) = 0$ holds, which indicates that if Eq. (32) turns out to be an increasing function w.r.t. $p$, Eq. (32) attains its minimum value at $p = 1/2$. Thus, we narrow down the range of $p$ to $p > 1/2$.

We first formalize the inequality $(a)-(b) \ge (c)-(d)$. To this end, we utilize the logarithm function to simplify the computation (note that $\binom{S}{s}$ is omitted):

$$\equiv p^{s-1}(1-p)^{S-s-1}(s - Sp) \ \ge\ (1-p)^{s-1}p^{S-s-1}(s - S + Sp), \tag{45}$$
$$\equiv (s-1)\log p + (S-s-1)\log(1-p) + \log(s - Sp) \ \ge\ (s-1)\log(1-p) + (S-s-1)\log p + \log(s - S + Sp),$$
$$\equiv (s-1)\log\frac{p}{1-p} + (S-s-1)\log\frac{1-p}{p} + \log\frac{s-Sp}{s-S(1-p)} \ \ge\ 0,$$
$$\equiv \underbrace{(2s-S)\log\frac{p}{1-p}}_{\text{Term (a)}} + \underbrace{\log\frac{s-Sp}{s-S(1-p)}}_{\text{Term (b)}} \ \ge\ 0. \tag{46}$$

To show the required conditions, it is enough to show that (A) Eq (46) $< 0$ holds at $s = S/2$, and (B) Eq (46) $> 0$ holds at $s = S$. This is because, since Eq (46) is an increasing function w.r.t. $s$, the above two conditions imply that $s^*$, which is our interest, exists between $S/2$ and $S$. When plugging $S/2$ into $s$, $2s - S$ becomes zero. Since $p > 1/2$ is given (see the above Remark), $\log\frac{s-Sp}{s-S(1-p)}$ becomes negative. Thus, (A) is proven. We now plug $S$ into $s$ and obtain the following result:

$$\equiv (2S - S)\log\frac{p}{1-p} + \log\frac{S - Sp}{S - S(1-p)}, \tag{47}$$
$$\equiv S\log\frac{p}{1-p} + \log\frac{S(1-p)}{Sp} \ \equiv\ S\log\frac{p}{1-p} - \log\frac{p}{1-p}, \tag{48}$$
$$\equiv (S-1)\log\frac{p}{1-p} \ >\ 0, \quad \because\ p > 1/2. \tag{49}$$

Thus, we can conclude that Eq (32), which is our interest, is an increasing function w.r.t. $p > 1/2$, and has its minimum value at $p = 1/2$.

Now, we show that the above proof can also be applied to an even number $S$. Note that for an even number $S$, Eq (32) is equal to the sum of Eq (A.2) starting from $s = S/2 + 1$ and Eq (50):

$$w(p, s) = \binom{S}{S/2} \times \frac{S}{2} \times p^{S/2}(1-p)^{S/2}. \tag{50}$$

We derive $\partial w(p,s)/\partial p$ as follows:

$$\frac{\partial w(p,s)}{\partial p} = \left(\binom{S}{S/2} \times \frac{S}{2}\right) \times \left(\frac{S}{2}\, p^{\frac{S-2}{2}}(1-p)^{\frac{S-2}{2}}(1 - 2p)\right) \le 0, \quad \forall p \in [0.5, 1]. \tag{51}$$

Since $\partial w(p,s)/\partial p \le 0$ holds for $s = S/2$, our proof is still valid for an even number $S$. We finalize the proof of Second statement.

First statement. From the above proof, we now derive the lower bound of Eq (27), which is the first statement. Note that the value of Eq (27) (equivalently, Eq (32)) is minimized at $p = 0.5$. Thus, we prove the statement by finding the lower bound of Eq (28) at $p = 0.5$. For simplicity, we again use the notation $K := \sqrt{\frac{d}{4(S-1)}}$.

$$= \frac{1}{S}\sum_{s=0}^{S}\binom{S}{s}\, s\left(p^s(1-p)^{S-s} + (1-p)^s p^{S-s}\right)\Phi\!\left((2s-1-S)\sqrt{\tfrac{d}{4(S-1)}}\right), \tag{52}$$
$$= \frac{1}{S}\sum_{s=0}^{S}\binom{S}{s}\, 2s\left(\tfrac{1}{2}\right)^{S}\Phi\big((2s-S-1)K\big). \tag{53}$$

First, consider an even number $S$. Then, for $\Phi\big((2s-S-1)K\big), \forall s \in \{1, \cdots, S\}$, consider the below relations:

$$1\cdot\underbrace{\Phi\big((-S+1)K\big)}_{(a1)} + 2\cdot\underbrace{\Phi\big((-S+3)K\big)}_{(b1)} + \cdots + (S-1)\cdot\underbrace{\Phi\big((S-3)K\big)}_{(b2)} + S\cdot\underbrace{\Phi\big((S-1)K\big)}_{(a2)}. \tag{54}$$

In Eq (54), $(a1) + (a2) = 1$ holds, and $(b1) + (b2) = 1$ holds as well. In this manner, we can pair all the terms in Eq (54). In a general sense, each pair has the form $k_{11}k_{12} + k_{21}k_{22}$ such that $k_{11} + k_{21} = S + 1$, $k_{12} + k_{22} = 1$, $k_{11} < k_{21}$, and $k_{12} < k_{22}$ hold. Note that $k_{11}k_{12} + k_{21}k_{22}$ is lower-bounded by $\frac{k_{11} + k_{21}}{2}$, since distributing the weight of a larger value to a smaller value decreases the overall value. From this result, we can obtain a lower bound of Eq (54) as follows:

$$\frac{1}{S}\sum_{s=0}^{S}\binom{S}{s}\, 2s\left(\tfrac{1}{2}\right)^{S}\Phi\big((2s-S-1)K\big) \ \ge\ \frac{1}{S}\sum_{s=0}^{S}\binom{S}{s}\, 2s\left(\tfrac{1}{2}\right)^{S}\cdot\frac{1}{2} \ =\ 0.5. \tag{55}$$

Now, we consider an odd number $S$. We can extend the proof of the even-$S$ case while additionally considering the term at $s = \frac{S+1}{2}$, whose factor $\Phi(0)$ does not have any pair. On the other hand, since $\Phi(0) = 0.5$, Eq (55) still holds in this case as well. Thus, for both even and odd $S$, the value of Eq (28) at $p = 0.5$ is lower-bounded by $0.5$, which is a global lower bound of Eq (28). We finalize the proof of First statement.

Moreover, as described, Eq (28) is symmetric w.r.t. $p = 0.5$ on $p \in [0, 1]$. In conclusion, we have verified that Eq (28) is strictly decreasing on $\mathcal{P} \in [0, 0.5]$, by extending the result on $\mathcal{P} \in [0.5, 1.0]$.

Generalization to $v_i \in C_0$. Note that the required condition for the improvement in $v_i \in C_0$ is $\vec{\mathbf{1}}^T\sum_{v_k \in q}\boldsymbol{x}_k < 0$. Thus, we further investigate $P_{\boldsymbol{x},e}\big(\vec{\mathbf{1}}^T\sum_{v_k \in q}\boldsymbol{x}_k < 0 \,\big|\, \mathcal{P}\big)$, where $v_i \in e$ and $q = e \setminus \{v_i\}$ (refer to Remark). First, let $s$ denote the number of nodes that belong to both $e_j$ and $C_0$. Note that we aim to investigate the following probability:

$$P_{\boldsymbol{x},e}\Big(\vec{\mathbf{1}}^T\big(\textstyle\sum_{v_k \in q}\boldsymbol{x}_k\big) < 0 \,\Big|\, p\Big) = \frac{P_{\boldsymbol{x},e}\Big(\vec{\mathbf{1}}^T\big(\sum_{v_k \in q}\boldsymbol{x}_k\big) < 0,\ v_i \in e \,\Big|\, p\Big)}{P_{e}(v_i \in e \mid p)}. \tag{56}$$

Here, the following holds:

$$\sum_{v_k \in q}\boldsymbol{x}_k \sim \mathcal{N}\big((2s-1-S)\boldsymbol{\mu}_0,\ (S-1)\mathbf{I}\big), \tag{57}$$
$$\vec{\mathbf{1}}^T\Big(\sum_{v_k \in q}\boldsymbol{x}_k\Big) \sim \mathcal{N}\Big(-\tfrac{(2s-1-S)d}{2},\ (S-1)d\Big), \tag{58}$$
$$P_{\boldsymbol{x}}\Big(\vec{\mathbf{1}}^T\big(\textstyle\sum_{v_k \in q}\boldsymbol{x}_k\big) < 0 \,\Big|\, s\Big) = \Phi\!\left((2s-S-1)\sqrt{\tfrac{d}{4(S-1)}}\right). \tag{59}$$

Note that Eq. (59) is equivalent to Eq. (18). Moreover, the following also holds:

$$P_e(s,\ v_i \in e \mid p) = \binom{S}{s}\Bigg(\underbrace{\frac{p^s(1-p)^{S-s}}{2}}_{\text{prob. of } s \text{ at } c=0} + \underbrace{\frac{(1-p)^s p^{S-s}}{2}}_{\text{prob. of } s \text{ at } c=1}\Bigg)\underbrace{\left(1 - \frac{\binom{N-1}{s}}{\binom{N}{s}}\right)}_{\text{prob. of } v_i \in C_0 \cap e}. \tag{60}$$

Here again, Eq. (21) is equivalent to Eq. (60). Finally, we guarantee that Eq. (56) is expressed as follows:

$$= \sum_{s=0}^{S} P_{\boldsymbol{x}}\Big(\vec{\mathbf{1}}^T\big(\textstyle\sum_{v_k\in q}\boldsymbol{x}_k\big) < 0 \,\Big|\, s\Big)\times\frac{P_e(s,\ v_i \in e)}{P_e(v_i \in e \mid p)}, \tag{61}$$
$$= \frac{2N}{S}\sum_{s=0}^{S}\binom{S}{s}\left(\frac{p^s(1-p)^{S-s}}{2} + \frac{(1-p)^s p^{S-s}}{2}\right)\frac{s}{N}\,\Phi\!\left((2s-1-S)\sqrt{\tfrac{d}{4(S-1)}}\right), \tag{62}$$
$$= \frac{1}{S}\sum_{s=0}^{S}\binom{S}{s}\, s\left(p^s(1-p)^{S-s} + (1-p)^s p^{S-s}\right)\Phi\!\left((2s-1-S)\sqrt{\tfrac{d}{4(S-1)}}\right). \tag{63}$$

Importantly, Eq. (63) is equal to Eq. (28). This result implies that Theorem 2 and its proof are generalizable to the case of $v_i \in C_0$ as well.

A.3 Empirical demonstration of the proof

To empirically validate the proof provided above, we conduct toy experiments. We consider every combination of $S \in \{2, 3, 4, 5, 6, 7, 8\}$, the size of $e_j$ (i.e., $|e_j|$), and $d \in \{2, 3, 4, 5, 6, 7, 8\}$, the dimension of the input node features $\boldsymbol{x}_i \in \mathbb{R}^d, \forall v_i \in \mathcal{V}$. Then, we visualize how Eq (27) varies w.r.t. $S$, $d$, and $\mathcal{P} \in [0, 1]$. Note that the $X$-axis of each plot corresponds to $\mathcal{P}$. As shown in Figure 3, for all cases, we can verify that the value of $P_{\boldsymbol{x},e}\big(\vec{\mathbf{1}}^T\big(\sum_{v_k\in q}\boldsymbol{x}_k\big) > 0 \,\big|\, \mathcal{P}\big)$ is strictly increasing w.r.t. $\mathcal{P} \in [0.5, 1]$ and strictly decreasing w.r.t. $\mathcal{P} \in [0, 0.5]$ (statement 2), and the value is lower-bounded by 0.5 (statement 1). Thus, we can ensure that Theorem 2 is valid through empirical verification.
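The toy verification described above can be reproduced with a short numerical sketch. The code below (a minimal check; the function names are ours) evaluates the probability of Eq (28) with the standard normal CDF and asserts the two statements on a few $(S, d)$ combinations.

```python
import math

def Phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def eq28(p, S, d):
    """Evaluate Eq (28) as a function of the class homophily p,
    hyperedge size S, and feature dimension d."""
    K = math.sqrt(d / (4.0 * (S - 1)))
    total = 0.0
    for s in range(S + 1):
        w = math.comb(S, s) * (p**s * (1 - p)**(S - s) + (1 - p)**s * p**(S - s))
        total += s * w * Phi((2 * s - 1 - S) * K)
    return total / S

for S in (2, 4, 5, 8):
    for d in (2, 8):
        # Statement 1: lower-bounded by 0.5 (minimum attained at p = 0.5).
        assert eq28(0.5, S, d) >= 0.5 - 1e-9
        # Statement 2: non-decreasing on p ∈ [0.5, 1].
        vals = [eq28(p, S, d) for p in (0.5, 0.6, 0.7, 0.8, 0.9, 1.0)]
        assert all(a <= b + 1e-12 for a, b in zip(vals, vals[1:]))
```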

Figure 3: Empirical demonstration of Theorem 2. Note that $S$ denotes the size of a hyperedge, and $d$ denotes the dimension of features. As stated, $P_{\boldsymbol{x},e}\big(\vec{\mathbf{1}}^T\big(\sum_{v_k\in q}\boldsymbol{x}_k\big) > 0 \,\big|\, \mathcal{P}\big)$ is strictly increasing in $\mathcal{P} \in [0.5, 1]$ (statement 2) and lower-bounded by 0.5 (statement 1).

Appendix B Additional Theoretical Analysis

B.1 Existence of reasonable solutions in the hyperedge filling task

In this section, we investigate the existence of a reasonable solution to the proposed hyperedge filling task. To this end, we first formalize what it means for the hyperedge filling task to be solved.

Definition 1 (Solved hyperedge filling task).

For a hypergraph $G = (\mathcal{V}, \mathcal{E})$, the hyperedge filling task is solved when the following holds for a probabilistic model $p_{(\mathbf{X}, \mathcal{E}, \Theta)}(\cdot)$:

$$p_{(\mathbf{X}, \mathcal{E}, \Theta)}\big(v_i \mid e_j \setminus \{v_i\}\big) > p_{(\mathbf{X}, \mathcal{E}, \Theta)}\big(v_k \mid e_j \setminus \{v_i\}\big),$$
$$\forall v_i \in e_j,\ \forall v_k \in \big\{v_s : \{v_s\} \cup (e_j \setminus \{v_i\}) \notin \mathcal{E},\ v_s \in \mathcal{V}\big\},\ \forall e_j \in \mathcal{E}.$$

Roughly, Definition 1 indicates that, for all possible hyperedge filling cases in a hypergraph $G$, a probabilistic model $p$ always assigns a higher probability to filling the correct node for a given query subset than to filling a wrong node. It is important to note that our goal is to identify whether there exists a reasonable solution for $p(\cdot)$, not merely a trivial one. Thus, we formalize the concept of a reasonable solution in the following definition:

Definition 2 (Reasonable solution).

Let $\mathcal{Q} := \{e_j \setminus \{v_i\} : \forall v_i \in e_j, \forall e_j \in \mathcal{E}\}$. For a hypergraph $G = (\mathcal{V}, \mathcal{E})$, a reasonable solution of the hyperedge filling task is a pair of node representations $\mathbf{Z} \in \mathbb{R}^{|\mathcal{V}| \times d}$ and query set representations $\mathbf{Q} \in \mathbb{R}^{|\mathcal{Q}| \times d}$ such that:

$$\boldsymbol{z}_i^T\boldsymbol{q}_j > \boldsymbol{z}_k^T\boldsymbol{q}_j, \quad \forall v_i \in S_j,\ \forall v_k \in \mathcal{V}\setminus S_j,\ \forall q_j \in \mathcal{Q},$$
$$\text{where } S_j = \big\{v_s : \big(\{v_s\} \cup q_j \in \mathcal{E}\big) \wedge \big(v_s \notin q_j\big),\ v_s \in \mathcal{V}\big\}.$$

Now, we analyze whether there exists a reasonable solution for any hypergraph $G = (\mathcal{V}, \mathcal{E})$.

Theorem 3 (Existence of a reasonable solution).

For any hypergraph $G = (\mathcal{V}, \mathcal{E})$, there exists a reasonable solution for some embedding dimension $d$.

Proof.

Consider a binary matrix $\mathbf{B} \in \{0, 1\}^{|\mathcal{V}| \times |\mathcal{Q}|}$, where each column $j \in \{1, \cdots, |\mathcal{Q}|\}$ is the one-hot encoding of $S_j$ over $\mathcal{V}$ (i.e., $\mathbf{B}_{ij} = \mathbf{1}[v_i \in S_j]$, where $\mathbf{1}[\cdot]$ is an indicator function). Here, we consider a matrix decomposition of $\mathbf{B}$, specifically its singular value decomposition $\mathbf{B} = \mathbf{U}\Sigma\mathbf{V}^T$, where $\mathbf{U} \in \mathbb{R}^{|\mathcal{V}| \times \text{rank}(\mathbf{B})}$, $\Sigma \in \mathbb{R}^{\text{rank}(\mathbf{B}) \times \text{rank}(\mathbf{B})}$, and $\mathbf{V} \in \mathbb{R}^{|\mathcal{Q}| \times \text{rank}(\mathbf{B})}$. Let $\mathbf{Z} := \mathbf{U}\Sigma$ and $\mathbf{Q} := \mathbf{V}$. In such a case, the following holds:

$$\boldsymbol{z}_i^T\boldsymbol{q}_j = \begin{cases} 1 & \text{if } v_i \in S_j, \\ 0 & \text{otherwise.}\end{cases} \tag{64}$$

Thus, $\boldsymbol{z}_i^T\boldsymbol{q}_j > \boldsymbol{z}_k^T\boldsymbol{q}_j,\ \forall v_i \in S_j, \forall v_k \in \mathcal{V}\setminus S_j, \forall q_j \in \mathcal{Q}$ always holds, which implies that $\mathbf{Z} := \mathbf{U}\Sigma$ and $\mathbf{Q} := \mathbf{V}$ constitute a reasonable solution. ∎

From Theorem 3, we demonstrate that there exists a reasonable solution for any hypergraph 𝐺 .
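The constructive proof above can be replayed numerically. The sketch below (illustrative; the toy hypergraph and variable names are ours) builds the binary matrix $\mathbf{B}$ for a small hypergraph, factorizes it with an SVD, and checks that $\mathbf{Z} := \mathbf{U}\Sigma$ and $\mathbf{Q} := \mathbf{V}$ reproduce the 0/1 pattern required by Definition 2.

```python
import numpy as np

# Toy hypergraph: 4 nodes, hyperedges {0,1,2} and {1,2,3}.
nodes = [0, 1, 2, 3]
edges = [frozenset({0, 1, 2}), frozenset({1, 2, 3})]

# Query subsets q_j = e_j \ {v_i}, and S_j = nodes completing q_j to a hyperedge.
queries = [e - {v} for e in edges for v in e]
S_sets = [
    {v for v in nodes if v not in q and frozenset(q | {v}) in set(edges)}
    for q in queries
]

# B[i, j] = 1 iff v_i ∈ S_j (one-hot encoding of S_j over the node set).
B = np.array([[1.0 if i in Sj else 0.0 for Sj in S_sets] for i in nodes])

# Thin SVD B = U Σ V^T; set Z := UΣ (node reps) and Q := V (query-set reps).
U, sigma, Vt = np.linalg.svd(B, full_matrices=False)
Z = U * sigma   # shape |V| × min(|V|, |Q|)
Q = Vt.T        # shape |Q| × min(|V|, |Q|)

# Z Q^T recovers B, so z_i^T q_j = 1 > 0 = z_k^T q_j whenever v_i ∈ S_j, v_k ∉ S_j.
assert np.allclose(Z @ Q.T, B)
```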

B.2 Dimensional collapse of HypeBoy without projection heads

In this subsection, we provide a theoretical analysis of why HypeBoy without projection heads may suffer from dimensional collapse. Jing et al. (2022) have suggested that one reason for dimensional collapse in contrastive learning lies in the gradient flow (specifically, the mechanism by which representations are updated via the contrastive loss). Consider a linear model that yields representations of $n$ data points by $\mathbf{Z} = \mathbf{XW}$, where $\mathbf{X} \in \mathbb{R}^{n \times d}$ are node features and $\mathbf{W} \in \mathbb{R}^{d \times k}$ is a learnable parameter matrix. When we update the parameters $\mathbf{W}$ by using contrastive losses, the $k$-th biggest singular value $\sigma_k$ of the weight matrix evolves proportionally to its own value (e.g., $\tilde{\sigma}_k \leftarrow \sigma_k \times c$, where $\tilde{\sigma}_k$ is the $k$-th biggest singular value of the updated $\mathbf{W}$ and $c$ is a positive constant). Due to this mechanism, small singular values grow slowly or diminish, causing dimensional collapse.

In this analysis, motivated by the analysis of Jing et al. (2022), we track how the singular values of node representations change while learning the hyperedge filling task. Following the setting of Jing et al. (2022), we consider a linear model $\mathbf{Z} = \mathbf{XW}$. Note that $\mathbf{X}$ contains the given features of the data points. Thus, the change of the singular values of the embeddings $\mathbf{Z}$ depends on $\mathbf{W}$.

We first define the loss of hyperedge filling as follows:

$$\mathcal{L} := -\sum_{e_j \in \mathcal{E}}\sum_{v_i \in e_j}\log\left(\frac{\exp\big(\boldsymbol{z}_i^T\big(\sum_{v_s \in q_{ij}}\boldsymbol{z}_s\big)\big)}{\sum_{v_t \in \mathcal{V}}\exp\big(\boldsymbol{z}_t^T\big(\sum_{v_s \in q_{ij}}\boldsymbol{z}_s\big)\big)}\right). \tag{65}$$
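Under this linear-model setting, the loss of Eq (65) can be sketched directly (a minimal numpy version; the variable names are ours, not from the released code):

```python
import numpy as np

def hyperedge_filling_loss(Z, edges):
    """Eq (65): for every (target node v_i, query subset e_j \\ {v_i}) pair,
    softmax cross-entropy of filling the correct node over all nodes."""
    loss = 0.0
    for e in edges:
        for v_i in e:
            q = [v for v in e if v != v_i]
            q_sum = Z[q].sum(axis=0)        # Σ_{v_s ∈ q_ij} z_s
            logits = Z @ q_sum              # z_t^T (Σ z_s) for all v_t
            logits -= logits.max()          # numerical stability
            log_prob = logits[v_i] - np.log(np.exp(logits).sum())
            loss -= log_prob
    return loss

rng = np.random.default_rng(0)
Z = rng.normal(size=(5, 3))                 # 5 nodes, 3-dim embeddings
edges = [[0, 1, 2], [1, 2, 3], [2, 3, 4]]
loss = hyperedge_filling_loss(Z, edges)
assert loss > 0.0
```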

Now, we obtain the closed form of the following update equation, which is based on the gradient descent: 𝐖 ← 𝐖 − 𝛾 ⁢ ∇ 𝐖 ℒ , where 𝛾 is a fixed learning rate. We first derive ( ∂ ℒ / ∂ 𝐙 ) , and utilize the chain rule: ( ∂ ℒ / ∂ 𝐙 ) × ( ∂ 𝐙 / ∂ 𝐖 ) .

The derivative of the numerators of Eq (65) can be written as follows:

$$-\frac{\partial\big(\sum_{e_j \in \mathcal{E}}\sum_{v_i \in e_j}\boldsymbol{z}_i^T\big(\sum_{v_s \in q_{ij}}\boldsymbol{z}_s\big)\big)}{\partial\mathbf{Z}} = -\mathbf{EZ},$$
$$\text{where } \mathbf{E} \in \mathbb{R}^{|\mathcal{V}| \times |\mathcal{V}|} \text{ s.t. } \mathbf{E}_{ij} = 2 \times \big|\{e_k : \{v_i, v_j\} \subseteq e_k,\ e_k \in \mathcal{E}\}\big|.$$

We now derive the derivative of the denominators of Eq (65):

$$\frac{\partial\Big(\sum_{e_j\in\mathcal{E}}\sum_{v_i\in e_j}\log\big(\sum_{v_t\in\mathcal{V}}\exp\big(\boldsymbol{z}_t^T\big(\sum_{v_s\in q_{ij}}\boldsymbol{z}_s\big)\big)\big)\Big)}{\partial\mathbf{Z}}. \tag{66}$$

We denote a multiset of every possible query subset as $\mathcal{Q}$ (i.e., $\mathcal{Q} = \{e_j \setminus \{v_i\} : v_i \in e_j, e_j \in \mathcal{E}\}$). The reason for a multiset is that the resulting subsets can be duplicated due to overlapping hyperedges (e.g., when we omit a node $v_1$ from $\{v_1, v_2, v_3\}$ and $v_4$ from $\{v_4, v_2, v_3\}$, the remaining sets become identical).

We split the derivation to the node level. In addition, we denote the sub-multiset of $\mathcal{Q}$ whose members include a node $v_k$ as $\mathcal{Q}_k$. Then, for a node $v_k$, its gradient is computed as follows:

$$\sum_{q_j \in \mathcal{Q}}\frac{\exp\big(\boldsymbol{z}_k^T\big(\sum_{v_s\in q_j}\boldsymbol{z}_s\big)\big)}{\sum_{v_t\in\mathcal{V}}\exp\big(\boldsymbol{z}_t^T\big(\sum_{v_s\in q_j}\boldsymbol{z}_s\big)\big)}\times\Big(\sum_{v_s\in q_j}\boldsymbol{z}_s\Big) \tag{67}$$
$$+\ \sum_{q_l \in \mathcal{Q}_k}\sum_{v_i\in\mathcal{V}}\frac{\exp\big(\boldsymbol{z}_i^T\big(\sum_{v_s\in q_l}\boldsymbol{z}_s\big)\big)}{\sum_{v_t\in\mathcal{V}}\exp\big(\boldsymbol{z}_t^T\big(\sum_{v_s\in q_l}\boldsymbol{z}_s\big)\big)}\,\boldsymbol{z}_i. \tag{68}$$

We define two more matrices to express Eq. (67) and (68) in a simple matrix form. By assigning an index $j \in \{1, \cdots, |\mathcal{Q}|\}$ to each $q_j \in \mathcal{Q}$, we define a binary matrix $\mathbf{Q} \in \mathbb{R}^{|\mathcal{Q}| \times |\mathcal{V}|}$ that represents $\mathcal{Q}$:

$$\mathbf{Q}_{ji} = \begin{cases} 1 & \text{if } v_i \in q_j, \\ 0 & \text{otherwise.}\end{cases}$$

In addition, we denote by $\mathbf{A} \in \mathbb{R}^{|\mathcal{Q}| \times |\mathcal{V}|}$ a score matrix that models the probability of filling each node $v_i \in \mathcal{V}$ into each query subset $q_j \in \mathcal{Q}$, which is defined as follows:

$$\mathbf{A}_{ji} = \frac{\exp\big(\boldsymbol{z}_i^T\big(\sum_{v_s\in q_j}\boldsymbol{z}_s\big)\big)}{\sum_{v_t\in\mathcal{V}}\exp\big(\boldsymbol{z}_t^T\big(\sum_{v_s\in q_j}\boldsymbol{z}_s\big)\big)}. \tag{69}$$

By using $\mathbf{A}$ and $\mathbf{Q}$, we can express Eq (67) and (68) for all nodes as follows:

$$\Big(\underbrace{\mathbf{Q}^T\mathbf{A}}_{\text{Eq (67) of all nodes}} + \underbrace{\mathbf{A}^T\mathbf{Q}}_{\text{Eq (68) of all nodes}}\Big)\mathbf{Z}. \tag{70}$$

Finally, we express $\partial\mathcal{L}/\partial\mathbf{Z}$ as follows:

$$\frac{\partial\mathcal{L}}{\partial\mathbf{Z}} = \big(-\mathbf{E} + \mathbf{Q}^T\mathbf{A} + \mathbf{A}^T\mathbf{Q}\big)\mathbf{Z}. \tag{71}$$

In addition, by using the chain rule and matrix derivative rules, $\partial\mathcal{L}/\partial\mathbf{W}$ is derived as follows:

$$\frac{\partial\mathcal{L}}{\partial\mathbf{W}} = \mathbf{X}^T\big(-\mathbf{E} + \mathbf{Q}^T\mathbf{A} + \mathbf{A}^T\mathbf{Q}\big)\mathbf{XW}. \tag{72}$$

We denote the updated $\mathbf{W}$ as $\tilde{\mathbf{W}}$. To sum up, the update rule $\tilde{\mathbf{W}} := \mathbf{W} - \gamma\nabla_{\mathbf{W}}\mathcal{L}$ is expressed as follows:

$$\tilde{\mathbf{W}} := \big(\mathbf{I} + \gamma\,\mathbf{X}^T\big(\mathbf{E} - \mathbf{Q}^T\mathbf{A} - \mathbf{A}^T\mathbf{Q}\big)\mathbf{X}\big)\mathbf{W}. \tag{73}$$

Here, we denote the $k$-th biggest singular values of $\tilde{\mathbf{W}}$ and $\mathbf{W}$ as $\tilde{\sigma}_k$ and $\sigma_k$, respectively.

Note that the matrix multiplying $\mathbf{W}$ in Eq (73) is a function of $\mathbf{X}$, $\mathcal{E}$, and $\mathbf{W}$. Thus, we denote this term as $g(\mathbf{X}, \mathcal{E}, \mathbf{W}) \in \mathbb{R}^{d \times d}$.

Taking a closer look at $g(\mathbf{X}, \mathcal{E}, \mathbf{W})\mathbf{W}$, $\tilde{\mathbf{W}}$ is a near-linear transformation of $\mathbf{W}$ itself. In such a case, from the min-max characterization of singular values, the following inequality holds:

$$\tilde{\sigma}_k \le \sigma'_1 \times \sigma_k, \ \text{ where } \sigma'_1 \text{ is the biggest singular value of } g(\mathbf{X}, \mathcal{E}, \mathbf{W}). \tag{74}$$

Eq (74) indicates that $\tilde{\sigma}_k$ is upper-bounded by the product of the biggest singular value of $g(\mathbf{X}, \mathcal{E}, \mathbf{W})$ and $\sigma_k$. While this is not an exact equality, it implies that an updated singular value is likely to be correlated with its previous value. Hence, small singular values are likely to grow slowly, similar to the findings of Jing et al. (2022), which may cause dimensional collapse.
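The bound of Eq (74) is an instance of the standard inequality $\sigma_k(\mathbf{GW}) \le \sigma_1(\mathbf{G})\,\sigma_k(\mathbf{W})$ and can be checked numerically (an illustrative sketch; the random matrices stand in for $g(\mathbf{X}, \mathcal{E}, \mathbf{W})$ and $\mathbf{W}$):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 6
G = rng.normal(size=(d, d))   # stands in for g(X, E, W)
W = rng.normal(size=(d, d))   # current weight matrix

sig_W = np.linalg.svd(W, compute_uv=False)        # σ_1 ≥ ... ≥ σ_d of W
sig_GW = np.linalg.svd(G @ W, compute_uv=False)   # singular values of the update
sig1_G = np.linalg.svd(G, compute_uv=False)[0]    # biggest singular value of G

# Eq (74): each updated singular value is bounded by σ'_1 × σ_k.
assert np.all(sig_GW <= sig1_G * sig_W + 1e-8)
```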

Appendix C Discussions

C.1 Comparison With HyperGRL

In this section, we compare our HypeBoy against HyperGRL (Du et al., 2022). Both methodologies utilize a strategy that leverages a node and a subset of a hyperedge for self-supervised learning. Specifically, this involves using a pair that consists of a node 𝑣 𝑖 that belongs to a hyperedge 𝑒 𝑗 , and the subset of 𝑒 𝑗 excluding 𝑣 𝑖 ( 𝑒 𝑗 ∖ { 𝑣 𝑖 } ), for training purposes. However, HypeBoy significantly differs from HyperGRL in various aspects.

Motivation. The motivation of HypeBoy is markedly distinct from that of HyperGRL. The hyperedge filling task is a task that aims to correctly fill the missing node for a given (query) subset. The underlying motivation is to train a machine learning model to generate hyperedges, thereby identifying entities that are forming group interactions. This may enable the model to understand the complex, higher-order relationships within a given hypergraph. This objective aligns with the goals of question-answering approaches in NLP (Devlin et al., 2019). Conversely, the key motivation of HyperGRL, which performs node-level pretext task introduced by Du et al. (2022), is described as follows: “the node inside one specific hyperedge should be distinguished from nodes outside this hyperedge given the context of node”. Here, the motivations behind Du et al. (2022) are akin to those of proximity-based approaches, including LINE (Tang et al., 2015).

Method design. The design choices of HypeBoy and HyperGRL differ significantly. Below, we outline these technical distinctions.

Input: HypeBoy receives the original hypergraph structure, while HyperGRL receives a clique-expanded graph.

Backbone encoder: HypeBoy utilizes hypergraph neural networks, while HyperGRL employs graph neural networks.

Augmentation: HypeBoy augments node features and hyperedges, while HyperGRL does not utilize an augmentation strategy.

Projection head: HypeBoy utilizes different projection head parameters for node-projection and query subset-projection, while HyperGRL utilizes the same parameters for both.

Negative samples: HypeBoy utilizes all nodes as negative samples for each hyperedge filling case, while HyperGRL samples only several nodes.

SSL signal: HypeBoy utilizes all possible (node, query subset) pairs of a given hypergraph, while HyperGRL samples a single pair for each hyperedge.

Feature reconstruction: HypeBoy utilizes node feature reconstruction warm-up, while HyperGRL utilizes hyperedge prediction loss as another SSL signal.

Performance comparison. It is important to note that HypeBoy is superior to HyperGRL as an SSL strategy for hypergraphs. We provide empirical evidence below:

Under the fine-tuning protocol, HypeBoy outperforms HyperGRL in every dataset (refer to Section 5.1).

Under the linear-evaluation protocol with the node classification task, HypeBoy outperforms HyperGRL in every dataset (refer to Section 5.2).

Under the linear-evaluation protocol with the hyperedge prediction task, HypeBoy outperforms HyperGRL in 10 out of 11 datasets (refer to Section 5.2).

C.2 Potential Extensions of HypeBoy

While this work mainly assumes static and undirected hypergraphs, many real-world group interactions have directions and temporal dynamics. Directed hypergraphs (Kim et al., 2023b) and temporal hypergraphs (Lee & Shin, 2023b) are utilized to include directional information and temporal information in hypergraphs, respectively. HypeBoy can be extended to such hypergraphs by adequately considering their unique characteristics. Moreover, while we assume a transductive hypergraph setting, there is a recent effort to perform representation learning on hypergraphs under an inductive case (Behrouz et al., 2023). Considering this, HypeBoy can also be extended to an inductive case, incorporating the idea of Behrouz et al. (2023). Lastly, HypeBoy can be enhanced to capture more complex characteristics of a hypergraph (e.g., hyperedge-dependency of nodes (Choe et al., 2023)).

Appendix D Datasets

Table 4: Statistics of the 11 datasets we utilize in our experiments.

| | Citeseer | Cora | Pubmed | Cora-CA | DBLP-P | DBLP-A | AMiner | IMDB | MN-40 | 20News | House |
|---|---|---|---|---|---|---|---|---|---|---|---|
| $\|\mathcal{V}\|$ | 1,458 | 1,434 | 3,840 | 2,388 | 41,302 | 2,591 | 20,201 | 3,939 | 12,311 | 16,242 | 1,290 |
| $\|\mathcal{E}\|$ | 1,079 | 1,579 | 7,963 | 1,072 | 22,263 | 2,690 | 8,052 | 2,015 | 12,311 | 100 | 341 |
| $\sum_{e \in \mathcal{E}}\|e\|$ | 3,453 | 4,786 | 34,629 | 4,585 | 99,561 | 6,201 | 32,028 | 9,560 | 61,555 | 65,451 | 11,843 |
| Classes | 6 | 7 | 3 | 7 | 6 | 4 | 12 | 3 | 40 | 4 | 2 |
| Features | 3,703 | 1,433 | 500 | 1,433 | 1,425 | 334 | 500 | 3,066 | 100 | 100 | 2 |

In this section, we provide details of the used datasets. In our experiments, we utilize 11 hypergraph benchmark datasets. Following Lee & Shin (2023a), we remove nodes that do not belong to any hyperedge. Data statistics of each dataset are reported in Table 4.

Source of each dataset. The sources of each dataset used in this work are as follows:

The Cora, Citeseer, Pubmed, Cora-CA, and DBLP-P datasets are from the work of Yadati et al. (2019).

The DBLP-A and IMDB datasets are from the work of Wang et al. (2019).

The AMiner dataset is from the work of Zhang et al. (2019).

The ModelNet-40 (MN-40) dataset is from the work of Wu et al. (2015).

The 20Newsgroups (20News) dataset is from the work of Dua et al. (2017).

The House dataset is from the work of Chien et al. (2022).

Co-citation datasets. We utilize three co-citation datasets: Cora, Citeseer, and Pubmed. In these datasets, each node represents a publication and each hyperedge represents a set of publications co-cited by a publication. For example, if a publication has cited three publications that correspond to 𝑣 𝑖 , 𝑣 𝑗 , and 𝑣 𝑘 , respectively, these publications (nodes) are grouped as a hyperedge { 𝑣 𝑖 , 𝑣 𝑗 , 𝑣 𝑘 } ∈ ℰ . Node features are bag-of-words of the corresponding publication (node). Node class indicates the academic category of the corresponding publication.

Co-authorship datasets. We utilize four co-authorship datasets: Cora-CA, DBLP-P, DBLP-A, and AMiner. In Cora-CA, DBLP-P, and AMiner, each node represents a publication, and a set of publications that are written by the same author is grouped as a hyperedge. Node features are bag-of-words of the corresponding publication (node). Node class indicates the academic category of the corresponding publication. Conversely, in DBLP-A, each node represents an author, and the co-authors of a publication are grouped as a hyperedge. Node features are bag-of-words regarding the research keywords of the corresponding author. Node class indicates the primary research area of the corresponding author.

Computer graphics dataset. We utilize one computer graphics dataset: ModelNet-40 (MN-40). In this dataset, each node represents a visual object, and hyperedges are synthetic hyperedges created from a k-NN graph constructed according to the features of each data point, following Feng et al. (2019) and Chien et al. (2022). Node features are embeddings (obtained via GVCNN (Feng et al., 2018) and MVCNN (Su et al., 2015)) of the corresponding visual object. Node class indicates the category of the corresponding visual object.

Movie-Actor dataset. We utilize one movie-actor relational dataset: IMDB. In this dataset, each node indicates a movie, and the filmography of an actor is grouped as a hyperedge. Node features are bag-of-words of the plot of the corresponding movie. Node class indicates the genre of the movie.

News dataset. We utilize one news dataset: 20NewsGroups (20News). In this dataset, each node represents a document of 20 newsgroups and a set of documents containing a particular word is grouped as a hyperedge. Node features are TF-IDF representations of news messages of the corresponding news document. Node class indicates the category of the corresponding news document.

Political membership dataset. We utilize one political membership dataset: House. In this dataset, each node represents a member of the US House of Representatives, and members belonging to the same committee are grouped as a hyperedge. Node features are generated by adding noise to the one-hot encoded label vector, following Chien et al. (2022). Node class indicates the political party of the corresponding member.

Appendix E Experimental Details

In this section, we provide our experimental details. We first describe the machines we use in our experiments. Then, we provide details of HypeBoy and baseline methods, including their implementation and hyperparameter settings.

E.1 Machines

All experiments are conducted on a machine with NVIDIA RTX 8000 D6 GPUs (48GB memory) and two Intel Xeon Silver 4214R processors.

E.2 Details of baseline methods

Neural networks. We utilize 10 (semi-)supervised baseline models: MLP, HGNN (Feng et al., 2019), HyperGCN (Yadati et al., 2019), HNHN (Dong et al., 2020), three unified hypergraph networks (UniGCN, UniGIN, UniGCNII) (Huang & Yang, 2021), AllSet (Chien et al., 2022), ED-HNN (Wang et al., 2023a), and PhenomNN (Wang et al., 2023b). For all the methods, we fix the hidden dimension to 128 and the number of layers to 2.

Graph SSL. To utilize two generative graph SSL methods, GraphMAE2 (Hou et al., 2023) and MaskGAE (Li et al., 2023), and one generative SSL method for hypergraphs, HyperGRL (Du et al., 2022), we transform the input hypergraph into a graph by using clique expansion, which converts each hyperedge into a clique of a graph (Dong et al., 2020). Moreover, we utilize a 2-layer GCN (Kipf & Welling, 2017) with hidden dimension 128 as the backbone encoder of these SSL methods.
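The clique expansion used above can be sketched in a few lines (a minimal version; not from the authors' released code):

```python
from itertools import combinations

def clique_expand(hyperedges):
    """Convert each hyperedge into a clique: every node pair inside a
    hyperedge becomes an (undirected) graph edge."""
    graph_edges = set()
    for e in hyperedges:
        for u, v in combinations(sorted(e), 2):
            graph_edges.add((u, v))
    return sorted(graph_edges)

edges = clique_expand([{0, 1, 2}, {2, 3}])
assert edges == [(0, 1), (0, 2), (1, 2), (2, 3)]
```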

H-GD. One of our baseline methods H-GD directly extends the group discrimination approach, an SSL technique designed on ordinary graphs (Zheng et al., 2022). We strictly follow the overall structure suggested by Zheng et al. (2022), while the feature augmentation and topology augmentation have been replaced into 𝝉 𝑥 and 𝝉 𝑒 that are described in Section 4.1, respectively.

E.3 Details of settings

Hyperedge prediction setting. Note that we need to obtain negative hyperedge samples to evaluate a model on the hyperedge prediction task. In our experiments, we utilize negative hyperedge samples generated by the Sized Negative Sample strategy (Patil et al., 2020). Specifically, we first sample the size of the current hyperedge from the real-world hyperedge size distribution of the corresponding dataset. Then, we sample nodes uniformly at random to fill the corresponding hyperedge. By using this strategy, we obtain the same number of negative hyperedges as ground-truth hyperedges and utilize them for training/validation/testing.
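The sampling procedure described above can be sketched as follows (a hedged sketch; the exact sampling details of Patil et al. (2020) may differ, and the function name is ours):

```python
import random

def sized_negative_samples(hyperedges, num_nodes, n_neg, seed=0):
    """Draw negative hyperedges: sample a size from the empirical
    hyperedge-size distribution, then fill it with uniform random nodes."""
    rng = random.Random(seed)
    sizes = [len(e) for e in hyperedges]
    positives = set(map(frozenset, hyperedges))
    negatives = []
    while len(negatives) < n_neg:
        k = rng.choice(sizes)                          # size from real distribution
        cand = frozenset(rng.sample(range(num_nodes), k))
        if cand not in positives:                      # reject true hyperedges
            negatives.append(cand)
    return negatives

pos = [{0, 1, 2}, {1, 3}, {2, 3, 4}]
neg = sized_negative_samples(pos, num_nodes=10, n_neg=3)
assert len(neg) == 3
assert all(len(e) in (2, 3) for e in neg)
```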

Hyperedge embedding. To perform hyperedge prediction, we require a representation of each hyperedge. However, since the outputs of the utilized HNNs and GNNs are node embeddings, we need an additional technique to acquire hyperedge representations from node embeddings. To this end, we utilize a max-min pooling aggregation function (Yadati et al., 2020). For example, for a hyperedge $e_i = \{v_j, v_k, v_\ell\}$, whose constituent node embeddings are $\boldsymbol{z}_j, \boldsymbol{z}_k, \boldsymbol{z}_\ell \in \mathbb{R}^d$, respectively, its hyperedge embedding $\boldsymbol{h}_i^{(e)}$ is obtained as follows:

$$\boldsymbol{h}_i = \Big(\max_{t \in \{j, k, \ell\}} z_{ts} - \min_{t \in \{j, k, \ell\}} z_{ts}\Big)_{s=1}^{d}. \tag{75}$$

Finally, we utilize $\boldsymbol{h}_i$ as the representation of $e_i$.
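The max-min pooling of Eq (75) can be sketched as (a minimal version; the function name is ours):

```python
import numpy as np

def max_min_pool(node_embs):
    """Eq (75): hyperedge embedding = coordinate-wise max minus min
    over the embeddings of the hyperedge's constituent nodes."""
    Z = np.asarray(node_embs)          # shape (|e_i|, d)
    return Z.max(axis=0) - Z.min(axis=0)

z = [[1.0, -2.0, 0.5],
     [0.0,  3.0, 0.5],
     [2.0,  1.0, -1.0]]
h = max_min_pool(z)
assert np.allclose(h, [2.0, 5.0, 1.5])
```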

E.4 Details of implementation

Feature reconstruction warm-up of HypeBoy. As described in Section 4.4, before training an encoder HNN with HypeBoy, we perform a feature reconstruction process, which directly extends GraphMAE (Hou et al., 2022) to hypergraph structure. This process mitigates the encoder’s over-reliance on the projection heads, improving the performance in downstream tasks (Section 5.3). For the decoder HNN 𝑔 𝜓 , we utilize a 2-layer UniGCNII (Huang & Yang, 2021), which has the same architecture as the encoder HNN. We mask 50 % of input nodes and set hyperedge augmentation ratio 𝑝 𝑒 as 0.2.

Projection heads of HypeBoy. We utilize a two-layer MLP model with a ReLU (Nair & Hinton, 2010) activation function without dropout- and normalization-layers, for both node projection head 𝑓 𝜙 ′ and set projection head 𝑓 𝜌 ′′ (Section 4.2). Note that these projection heads are updated via the hyperedge filling task together with the encoder HNN.

Table 5: Hyperparameter combination of HypeBoy that shows the best validation accuracy on each dataset. $p_v \in [0, 1]$ indicates the magnitude of feature augmentation, $p_e$ indicates the magnitude of topological augmentation, and SSL epochs indicates the number of training epochs of the hyperedge filling task.

| | Citeseer | Cora | Pubmed | Cora-CA | DBLP-P | DBLP-A | AMiner | IMDB | MN-40 | 20News | House |
|---|---|---|---|---|---|---|---|---|---|---|---|
| $p_v$ | 0.2 | 0.4 | 0.0 | 0.4 | 0.0 | 0.1 | 0.2 | 0.3 | 0.4 | 0.2 | 0.4 |
| $p_e$ | 0.9 | 0.9 | 0.5 | 0.8 | 0.6 | 0.9 | 0.6 | 0.9 | 0.5 | 0.5 | 0.8 |
| SSL epochs | 120 | 200 | 180 | 200 | 180 | 120 | 100 | 180 | 180 | 80 | 140 |

Training details and hyperparameters. We train all models using the Adam optimizer (Kingma & Ba, 2015) with a fixed weight decay rate of 10 − 6 . We fix the hidden dimension and dropout rate of all models as 128 and 0.5, respectively. When training any neural network for downstream tasks, we train a model for 200 epochs, and for every 10 epochs, we evaluate the validation accuracy of the model. Then, we utilize the model’s checkpoint that yields the best validation accuracy to perform the final evaluation of the model on the test dataset.

For a linear evaluation protocol of node classification, we utilize a logistic classifier, with a learning rate 0.001 . For a linear evaluation protocol of hyperedge classification, we utilize a two-layer MLP, with a learning rate of 0.001 and a dropout rate of 0.5 .

Regarding hyperparameter tuning, for each model and dataset, we find the hyperparameter combination that gives the best validation performance. Then, with the selected hyperparameters, we train the model on a training dataset and evaluate the trained model on a test dataset.

For all the supervised models, we tune the learning rate as a hyperparameter within { 0.05 , 0.01 , 0.005 , 0.001 , 0.0005 , 0.0001 } . Moreover, especially for PhenomNN, which is a state-of-the-art hypergraph neural network, we additionally tune their other hyperparameters. Specifically, along with the learning rate, we additionally tune 𝜆 0 and 𝜆 1 within { 0 , 1 , 10 } , which are weights of different message-passing functions, and 𝛼 within { 0 , 1 , 10 } , which is the weight of a node’s own feature.

For all the SSL baseline methods, we set broader search spaces, which are as follows:

TriCL (Lee & Shin, 2023a): We tune their feature augmentation magnitude within { 0.2 , 0.3 , 0.4 } , hyperedge augmentation magnitude within { 0.2 , 0.3 , 0.4 } , and learning rate of its all components within { 0.01 , 0.001 , 0.0001 } .

HyperGCL (Wei et al., 2022): We tune learning rate of an encoder within { 0.01 , 0.001 , 0.0001 } , that of a view generator within { 0.01 , 0.001 , 0.0001 } , and the weight of contrastive loss 𝛽 within { 0.5 , 1 , 2 } .

HGD, a variant of GD (Zheng et al., 2022): We tune their feature augmentation magnitude within { 0.2 , 0.3 , 0.4 } , hyperedge augmentation magnitude within { 0.2 , 0.3 , 0.4 } , and learning rate of its all components within { 0.01 , 0.001 , 0.0001 } .

GraphMAE2 (Hou et al., 2023): We tune mask ratio within { 0.25 , 0.5 , 0.75 } and learning rate of its all components within { 0.01 , 0.001 , 0.0001 } .

MaskGAE (Li et al., 2023): We tune the degree loss weight $\alpha$ within $\{0.001, 0.002, 0.003\}$ and the learning rate of all its components within $\{0.01, 0.001, 0.0001\}$.

HyperGRL (Du et al., 2022): We tune the negative-sampling adjacency matrix order $k$ within $\{1, 2, 3\}$ and the learning rate of all its components within $\{0.01, 0.001, 0.0001\}$.

In addition, we set self-supervised learning epochs of the above methods as hyperparameters, searching within { 50 , 100 , 150 , ⋯ , 450 , 500 } .

For HypeBoy, we tune the feature augmentation magnitude $p_x$ within $\{0.0, 0.1, 0.2, 0.3, 0.4\}$ and the hyperedge augmentation magnitude $p_e$ within $\{0.5, 0.6, 0.7, 0.8, 0.9\}$. We fix the learning rate and training epochs of the feature reconstruction warm-up as 0.001 and 300, respectively.

Regarding the overall training process of HypeBoy, we first train an encoder HNN with the feature reconstruction process for 300 epochs. Then, we train the encoder HNN with the hyperedge filling task. The number of hyperedge filling training epochs is a hyperparameter, which we search within $\{20, 40, \cdots, 180, 200\}$. We report the hyperparameter combination for each dataset that shows the best validation accuracy in our fine-tuning protocol experiment (Section 5.1) in Table 5.
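The sweep described above amounts to a grid search over $(p_x, p_e, \text{SSL epochs})$; a minimal sketch follows (the scoring function here is a hypothetical stand-in for validation accuracy, not the actual evaluation):

```python
from itertools import product

def grid_search(score_fn):
    """Pick the (p_x, p_e, epochs) combination maximizing a validation score."""
    p_x_grid = [0.0, 0.1, 0.2, 0.3, 0.4]
    p_e_grid = [0.5, 0.6, 0.7, 0.8, 0.9]
    epoch_grid = list(range(20, 201, 20))
    return max(product(p_x_grid, p_e_grid, epoch_grid), key=score_fn)

# Hypothetical score peaking at p_x=0.2, p_e=0.9, epochs=120.
best = grid_search(
    lambda c: -abs(c[0] - 0.2) - abs(c[1] - 0.9) - abs(c[2] - 120) / 100
)
assert best == (0.2, 0.9, 120)
```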

Appendix F Additional Experimental Results

Table 6: Comparison between HypeBoy (denoted by w/ Aug.) and a variant of HypeBoy that does not use the augmentation strategy (denoted by w/o Aug.) in the node classification task under two protocols. The best performance is colored green. In most of the settings, HypeBoy outperforms its variant.

| Protocol | Method | Citeseer | Cora | Pubmed | Cora-CA | DBLP-P | DBLP-A | AMiner | IMDB | MN-40 | 20News | House |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Fine-tuning | w/o Aug. | 53.5 (8.7) | 58.6 (8.3) | 75.5 (4.0) | 62.9 (6.0) | 87.8 (0.4) | 79.7 (2.3) | 33.9 (2.1) | 47.2 (2.5) | 90.7 (0.7) | 77.5 (0.9) | 69.9 (4.9) |
| | w/ Aug. | 56.7 (9.8) | 62.3 (7.7) | 77.0 (3.4) | 66.3 (4.6) | 88.2 (0.4) | 80.6 (2.3) | 34.1 (2.2) | 47.6 (2.5) | 90.4 (0.9) | 77.6 (0.9) | 70.4 (4.8) |
| Linear evaluation | w/o Aug. | 55.8 (9.0) | 60.6 (7.3) | 72.7 (3.4) | 63.3 (4.6) | 87.6 (0.5) | 81.4 (2.5) | 34.2 (2.7) | 47.8 (2.0) | 89.7 (1.9) | 75.1 (1.6) | 67.5 (1.6) |
| | w/ Aug. | 59.6 (9.9) | 63.5 (9.4) | 75.0 (3.4) | 66.0 (4.6) | 87.9 (0.5) | 81.2 (2.7) | 34.3 (3.2) | 48.8 (1.8) | 89.2 (2.2) | 75.7 (2.1) | 69.4 (5.4) |

In this section, we present additional experimental results that were excluded from the main paper due to space constraints.

F.1 Comparison against HypeBoy without augmentation

As described in Section 4.1, augmentation (specifically, masking) has played a key role in various generative SSL methods for obtaining effective representations. Motivated by these findings, we incorporate the augmentation processes 𝝉_x and 𝝉_e into HypeBoy. To assess the impact of augmentation, we analyze a variant of HypeBoy that omits the augmentation step. In this variant, the input to HypeBoy's encoder HNN consists of the ground-truth features and topology, ( 𝐗 , ℰ ) . We then compare the performance of this variant against HypeBoy on node classification, under both the fine-tuning and linear evaluation protocols. As shown in Table 6, HypeBoy consistently outperforms the variant across most settings, demonstrating the importance of augmentation to the model's performance.
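For illustration, a masking-style augmentation of this kind can be sketched as below. The exact semantics of 𝝉_x and 𝝉_e in HypeBoy may differ (e.g., whether hyperedges are dropped entirely or only partially masked); here 𝝉_x zeroes a random fraction of feature entries and 𝝉_e drops each hyperedge independently, which should be read as one plausible reading of the augmentation magnitudes, not the paper's exact implementation.

```python
import numpy as np

def augment(X, E, p_x, p_e, rng):
    """Illustrative masking augmentation (a sketch of tau_x, tau_e).

    X:  (n, d) node-feature matrix.
    E:  list of hyperedges (sets of node indices).
    p_x: probability of zeroing each feature entry.
    p_e: probability of dropping each hyperedge (assumed semantics).
    """
    # tau_x: zero out each feature entry independently with prob. p_x.
    feat_mask = rng.random(X.shape) >= p_x
    X_aug = X * feat_mask
    # tau_e: drop each hyperedge independently with prob. p_e.
    E_aug = [e for e in E if rng.random() >= p_e]
    return X_aug, E_aug
```

With `p_x = p_e = 0`, the function returns the input unchanged, matching the no-augmentation variant discussed above.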

Figure 4: Analyzing dimensional collapse of HypeBoy with/without projection heads on five benchmark datasets. While HypeBoy does not suffer from dimensional collapse, its variant without projection heads does.

F.2 Dimensional collapse analysis

In Section 4.2, we discuss the role of projection heads in terms of dimensional collapse: projection heads encourage an encoder HNN to avoid dimensional collapse. We further analyze this phenomenon on five benchmark hypergraph datasets: Citeseer, Pubmed, Cora-CA, DBLP-P, and DBLP-A. As shown in Figure 4, across all these datasets, certain singular values of the embeddings obtained via HypeBoy without projection heads drop to zero, empirically demonstrating that dimensional collapse occurs in such cases (refer to the red lines). Notably, this issue is not observed in HypeBoy (refer to the blue lines).
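The diagnostic itself is simple to reproduce: given a learned embedding matrix, one inspects the singular value spectrum of the centered embeddings, where trailing values near zero signal dimensional collapse. A minimal sketch (assuming access to the embeddings as a numpy array):

```python
import numpy as np

def singular_value_spectrum(Z):
    """Singular values of the centered node-embedding matrix Z (n x d).

    Trailing singular values near zero indicate that the embeddings
    occupy a lower-dimensional subspace, i.e., dimensional collapse.
    """
    Zc = Z - Z.mean(axis=0, keepdims=True)   # center the embeddings
    return np.linalg.svd(Zc, compute_uv=False)
```

For a collapsed representation (e.g., rank-1 embeddings), every singular value beyond the first is numerically zero, whereas a healthy representation keeps all singular values well above zero.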

Table 7:Performance of SSL methods in various HNNs under fine-tuning protocol. The best and second-best performances are colored green and yellow, respectively. NA indicates the corresponding encoder without pre-training. HypeBoy outperforms other hypergraph SSL methods in most of the settings. Method Citeseer Cora Pubmed Cora-CA DBLP-P HGNN NA 41.9 (7.8) 50.0 (7.2) 72.9 (5.0) 50.2 (5.7) 85.3 (0.8) TriCL 49.3 (10.3) 56.9 (10.0) 74.3 (4.1) 57.2 (6.1) 87.4 (5.0) HyperGCL 47.2 (9.3) 59.5 (7.6) 76.3 (4.3) 53.3 (6.9) 85.8 (0.4) HypeBoy 52.1 (9.1) 61.1 (9.8) 76.8 (4.3) 60.0 (6.0) 87.4 (0.5) MeanPoolingConv NA 39.9 (10.0) 48.5 (8.9) 72.4 (4.0) 50.6 (7.5) 85.7 (0.7) TriCL 46.6 (8.3) 59.8 (8.9) 74.0 (3.8) 57.3 (4.8) 87.0 (5.0) HyperGCL 43.2 (9.4) 59.6 (7.6) 74.2 (4.3) 55.0 (6.7) 85.7 (0.9) HypeBoy 54.5 (8.3) 61.2 (7.8) 77.3 (3.4) 63.0 (4.4) 86.8 (0.5) SetGNN NA 43.6 (6.3) 48.9 (9.2) 69.8 (4.5) 50.7 (6.7) 82.8 (1.0) TriCL 48.9 (7.6) 57.7 (8.4) 72.0 (4.5) 57.8 (4.9) 85.1 (0.5) HyperGCL 46.8 (8.5) 54.6 (7.5) 73.1 (3.9) 56.9 (5.3) 83.6 (0.7) HypeBoy 53.8 (9.1) 62.3 (6.4) 73.4 (3.4) 61.0 (3.9) 84.2 (0.7) Figure 5: Analyzing alignment and uniformity of representations obtained via HypeBoy. In most cases, representations obtained by HypeBoy achieve both alignment and uniformity. F.3Alignment and uniformity analysis

In Section 4.3, we discuss the role of our 𝑝 ( 𝐗 , ℰ , Θ ) ⁢ ( ⋅ ) design choice in promoting alignment and uniformity of node representations. We further analyze this characteristic across five datasets: Citeseer, Pubmed, Cora-CA, DBLP-P, and DBLP-A. As shown in Figure 5, node representations obtained from HypeBoy achieve both uniformity (first column) and class alignment (other columns) in most cases.
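The two properties have commonly used quantitative forms: alignment as the mean squared distance between normalized embeddings of same-class pairs, and uniformity as the log of the average Gaussian kernel over all pairs on the hypersphere. Whether the paper's figures use exactly these definitions is an assumption; the sketch below shows the standard versions.

```python
import numpy as np

def alignment(Z, labels):
    """Mean squared distance between normalized embeddings of
    same-class node pairs (lower = better class alignment)."""
    Z = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    per_class = []
    for c in np.unique(labels):
        Zc = Z[labels == c]
        diff = Zc[:, None, :] - Zc[None, :, :]
        per_class.append((diff ** 2).sum(-1).mean())
    return float(np.mean(per_class))

def uniformity(Z, t=2.0):
    """Log of the mean Gaussian-kernel value over all embedding pairs
    (more negative = more uniform on the unit hypersphere)."""
    Z = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    sq = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return float(np.log(np.exp(-t * sq).mean()))
```

Perfectly aligned classes drive the alignment score to zero, while embeddings spread evenly over the sphere drive the uniformity score down.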

F.4 Encoder analysis

In this section, we experimentally verify that the efficacy of HypeBoy extends beyond UniGCNII (Huang & Yang, 2021). For this purpose, we employ three HNNs: HGNN (Feng et al., 2019), MeanPoolingConv (Lee & Shin, 2023a), and SetGNN (Chien et al., 2022). Notably, the latter two encoders, MeanPoolingConv and SetGNN, serve as backbone encoders in TriCL (Lee & Shin, 2023a) and HyperGCL (Wei et al., 2022), respectively. The experimental setting mirrors that of the fine-tuning experiments detailed in Section 5.1. As shown in Table 7, HypeBoy outperforms other hypergraph SSL methods in most of the settings, validating its broad effectiveness across a diverse range of HNNs.

Table 8: Efficacy of pretext tasks under the fine-tuning protocol. The best performances are colored green. Among the three tasks, the hyperedge filling task shows superior performance.

| Method | Citeseer | Cora | Pubmed | Cora-CA | DBLP-P |
|---|---|---|---|---|---|
| Naive contrastive learning | 59.2 (6.7) | 44.5 (10.2) | 75.2 (4.0) | 62.1 (5.3) | 86.7 (0.5) |
| Naive feature reconstruction | 58.6 (8.0) | 51.5 (9.2) | 74.7 (5.1) | 61.9 (7.3) | 87.3 (0.6) |
| Naive hyperedge filling | 60.7 (8.2) | 51.6 (11.8) | 76.2 (3.6) | 63.5 (6.0) | 88.1 (0.5) |

F.5 Pretext task analysis

In this section, we evaluate the efficacy of different SSL pretext tasks. Specifically, we assess our hyperedge filling task by comparing it with two other SSL tasks: contrastive learning and feature reconstruction. To ensure a fair comparison and minimize the impact of the underlying SSL method design choices, we streamline each method as described below:

Naive contrastive learning: This method is a variant of TriCL (Lee & Shin, 2023a) that does not use projection head parameters. All other settings are the same as those of TriCL (e.g., the contrastive losses and the view-generation process).

Naive feature reconstruction: This method is a variant of the hypergraph feature reconstruction method (Section 4.4), that does not use a decoder HNN.

Naive hyperedge filling: This method is a variant of HypeBoy that does not use feature reconstruction warm-up and projection heads. Specifically, this method is equivalent to V1, which is described in Section 5.3.

As shown in Table 8, the naive hyperedge filling method outperforms the other two methods across all settings, highlighting the superior effectiveness of the hyperedge filling task.

Table 9: Comparing HypeBoy against other topological SSL methods under the fine-tuning protocol. The best performances are colored green. Among all four methods, HypeBoy shows the best performance.

| Data | Method | Citeseer | Cora | Pubmed | Cora-CA | DBLP-P |
|---|---|---|---|---|---|---|
| Graph | Edge filling | 44.1 (11.3) | 55.7 (8.2) | 73.6 (3.8) | 55.5 (8.2) | 86.9 (0.5) |
| Graph | Hyperedge filling | 42.8 (9.2) | 56.4 (6.0) | 73.6 (4.0) | 55.0 (8.6) | 86.7 (0.5) |
| Hypergraph | Hyperedge prediction | 43.9 (7.6) | 56.9 (9.7) | 73.0 (4.9) | 59.1 (6.5) | 85.8 (0.6) |
| Hypergraph | Hyperedge filling (HypeBoy) | 56.7 (9.8) | 62.3 (7.7) | 77.0 (3.4) | 66.3 (4.6) | 88.2 (0.4) |

F.6 Topological generative pretext task analysis

In this section, we compare HypeBoy with other topology generative SSL methods. To this end, we adopt the following three baseline methods:

Graph edge filling: Utilizes edge filling on a clique-expanded graph (Appendix E.2), employing the same method HypeBoy used to fill size-2 hyperedges.

Graph hyperedge filling: Applies hyperedge filling on a clique-expanded graph (Appendix E.2), using the same approach as HypeBoy for filling hyperedges.

Hyperedge prediction: Undertakes the hyperedge prediction task as an SSL task, leveraging sized negative samples (Patil et al., 2020).
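The clique expansion used by the first two baselines (Appendix E.2) replaces every hyperedge with a clique over its members, so that graph-based pretext tasks become applicable. A minimal sketch, assuming node identifiers are orderable:

```python
from itertools import combinations

def clique_expand(hyperedges):
    """Clique expansion: each hyperedge becomes a clique of pairwise
    edges, losing higher-order structure but yielding an ordinary graph."""
    edges = set()
    for e in hyperedges:
        for u, v in combinations(sorted(e), 2):
            edges.add((u, v))   # deduplicate edges shared by hyperedges
    return sorted(edges)

# Example: {0,1,2} expands to the triangle (0,1), (0,2), (1,2).
print(clique_expand([{0, 1, 2}, {2, 3}]))  # → [(0, 1), (0, 2), (1, 2), (2, 3)]
```

Note that the expansion is lossy: distinct hypergraphs (e.g., one size-3 hyperedge versus three size-2 hyperedges on the same nodes) can produce the same expanded graph, which is precisely why the hypergraph-native tasks in Table 9 can outperform the graph-based ones.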

As shown in Table 9, HypeBoy outperforms the other three topological generative SSL methods in all settings, highlighting the effectiveness of the hyperedge filling task.

F.7 Homophily analysis

In this section, we demonstrate that HypeBoy achieves strong node classification performance across varying degrees of homophily, on both homophilic and non-homophilic hypergraphs. To this end, we analyze both real-world and semi-synthetic hypergraphs.

Real-world hypergraphs. We analyze the homophily characteristic of three hypergraph benchmark datasets: Cora-Cite, AMiner, and IMDB, which are utilized in this work. Note that HypeBoy shows the best performance among all the methods in these datasets (refer to Table 1).

Figure 6: Homophily analysis of three benchmark datasets: Cora-Cite, AMiner, and IMDB. Each color represents the class (label) of nodes. An upward trend in a class indicates homophily, whereas a downward trend indicates non-homophily. Cora-Cite demonstrates homophily, IMDB shows non-homophily, and AMiner falls in between, with some classes exhibiting homophily and others not.

For the homophily analysis, we utilize the visualization-based method suggested by Veldt et al. (2023). The analysis is conducted separately for each hyperedge size, with the results for each size presented in a separate plot. In these plots, the line color represents the node class. A brief interpretation of the resulting plots is as follows:

An upward trend line indicates that nodes of the corresponding class demonstrate higher-order homophilic characteristics within hyperedges of the same size.

A downward trend line indicates that nodes of the corresponding class display higher-order non-homophilic characteristics within hyperedges of the same size.

As shown in Figure 6, the three datasets (Cora-Cite, AMiner, and IMDB) exhibit different homophilic patterns. Specifically, the Cora-Cite dataset demonstrates homophilic tendencies, the IMDB dataset shows a non-homophilic pattern, and the AMiner dataset falls in between, with some classes being homophilic and others not. Given that HypeBoy outperforms other methods on these datasets, we conclude that its efficacy holds for both homophilic and non-homophilic hypergraphs.

Table 10: Analysis of performance changes under variations in homophilic extent. The smallest drop from the original performance is highlighted in green. NA indicates the UniGCNII encoder without pre-training. As shown, HypeBoy is the most robust of the evaluated hypergraph SSL methods against homophily violation.

| Dataset | Method | Original | 𝑇 = 50 | 𝑇 = 100 | 𝑇 = 150 | 𝑇 = 200 |
|---|---|---|---|---|---|---|
| Cora | NA | 48.5 (-) | 46.7 (-1.8) | 44.7 (-3.8) | 43.5 (-5.0) | 40.9 (-7.6) |
| Cora | TriCL | 60.2 (-) | 59.1 (-1.1) | 56.1 (-4.1) | 55.7 (-4.5) | 53.6 (-6.5) |
| Cora | HyperGCL | 60.3 (-) | 57.4 (-2.9) | 53.7 (-6.6) | 52.0 (-8.3) | 50.0 (-10.3) |
| Cora | HypeBoy | 62.3 (-) | 61.7 (-0.6) | 59.2 (-3.1) | 58.0 (-4.3) | 56.7 (-5.6) |
| Citeseer | NA | 44.2 (-) | 41.4 (-2.8) | 39.0 (-5.2) | 35.9 (-8.3) | 33.8 (-10.4) |
| Citeseer | TriCL | 51.7 (-) | 51.1 (-0.6) | 47.5 (-4.2) | 45.2 (-6.5) | 41.8 (-9.9) |
| Citeseer | HyperGCL | 47.0 (-) | 45.8 (-1.2) | 43.3 (-3.7) | 40.9 (-6.1) | 37.7 (-9.3) |
| Citeseer | HypeBoy | 56.7 (-) | 56.5 (-0.2) | 54.2 (-2.5) | 50.8 (-5.9) | 48.6 (-8.1) |

Semi-synthetic hypergraphs. We analyze how the performance of HypeBoy varies with the homophilic extent of hypergraphs. To this end, we corrupt the hypergraph topology of two homophilic datasets, Cora and Citeseer, using the node-swapping method detailed in Algorithm 1. Note that the more nodes we swap in a hypergraph, the less homophilic it becomes.

As shown in Table 10, HypeBoy is the most robust among the hypergraph SSL methods evaluated, exhibiting the smallest performance drop in scenarios where homophily is violated.

Algorithm 1: Node swapping algorithm

Input: A set of hyperedges ℰ, the number of iterations 𝑇
Output: Shuffled hyperedges ℰ′

1: ℰ′ ← ℰ
2: for 𝑘 ← 1 to 𝑇 do
3:   Sample two hyperedges e′_i and e′_j from ℰ′ uniformly at random
4:   Sample a node v′_i from e′_i and a node v′_j from e′_j uniformly at random
5:   e′′_i ← (e′_i ∖ {v′_i}) ∪ {v′_j}
6:   e′′_j ← (e′_j ∖ {v′_j}) ∪ {v′_i}
7:   ℰ′ ← (ℰ′ ∖ {e′_i} ∖ {e′_j}) ∪ {e′′_i} ∪ {e′′_j}
Return: Swapped hyperedges ℰ′

F.8 Additional data splits
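Algorithm 1 translates directly into code. The sketch below follows the pseudocode step by step; skipping the (rare) case where the same hyperedge is sampled twice is a minor implementation choice of ours, not specified in the pseudocode.

```python
import random

def node_swap(hyperedges, T, seed=0):
    """Algorithm 1: repeatedly swap one node between two randomly
    chosen hyperedges; larger T makes the hypergraph less homophilic."""
    rng = random.Random(seed)
    E = [set(e) for e in hyperedges]          # E' <- E
    for _ in range(T):
        i = rng.randrange(len(E))             # sample two hyperedges
        j = rng.randrange(len(E))
        if i == j:                            # implementation choice: skip self-swaps
            continue
        v_i = rng.choice(sorted(E[i]))        # sample one node from each
        v_j = rng.choice(sorted(E[j]))
        E[i] = (E[i] - {v_i}) | {v_j}         # lines 5-7: exchange the nodes
        E[j] = (E[j] - {v_j}) | {v_i}
    return E
```

The swap preserves the number of hyperedges and the overall node set, only reshuffling which nodes co-occur, so any loss of homophily is attributable to the topology change alone.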

In this section, we investigate the efficacy of HypeBoy in various node-label-split settings. To this end, we sample a certain number of nodes for each class as training nodes and validation nodes. Specifically, we utilize the following two settings:

For each node class, use 5 nodes/5 nodes/the rest as train/validation/test nodes, respectively.

For each node class, use 10 nodes/10 nodes/the rest as train/validation/test nodes, respectively.
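A per-class split of this kind can be sketched as follows (a straightforward reading of the two settings above, not code from the paper):

```python
import numpy as np

def per_class_split(labels, n_train, n_val, seed=0):
    """Sample n_train / n_val nodes per class; the rest become test nodes."""
    rng = np.random.default_rng(seed)
    train, val, test = [], [], []
    for c in np.unique(labels):
        idx = rng.permutation(np.where(labels == c)[0])
        train += idx[:n_train].tolist()
        val += idx[n_train:n_train + n_val].tolist()
        test += idx[n_train + n_val:].tolist()
    return train, val, test
```

For example, `per_class_split(labels, 5, 5)` reproduces the first setting and `per_class_split(labels, 10, 10)` the second.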

Other experimental settings are the same as those of the fine-tuning experiments (Section 5.1). As shown in Table 11, HypeBoy outperforms the other hypergraph SSL methods in all the settings. Thus, the efficacy of HypeBoy is not restricted to a particular label-split setting.

Table 11: Results in the various label-split settings under the fine-tuning protocol. The best and second-best performances are colored green and yellow, respectively. NA indicates the UniGCNII encoder without pre-training. As shown, HypeBoy outperforms the other hypergraph SSL methods in all the settings.

| Split | Method | Citeseer | Cora | Pubmed | Cora-CA | DBLP-P |
|---|---|---|---|---|---|---|
| 5 nodes | NA | 53.7 (6.8) | 63.7 (3.8) | 71.6 (4.6) | 60.7 (4.9) | 77.8 (4.7) |
| 5 nodes | TriCL | 60.1 (5.2) | 71.0 (3.1) | 71.9 (3.8) | 67.6 (2.9) | 80.5 (2.5) |
| 5 nodes | HyperGCL | 57.0 (5.6) | 70.3 (2.9) | 72.2 (4.4) | 61.5 (4.4) | 79.5 (3.7) |
| 5 nodes | HypeBoy | 63.1 (4.6) | 72.0 (2.4) | 72.9 (3.2) | 67.8 (2.4) | 81.5 (2.4) |
| 10 nodes | NA | 62.7 (3.8) | 72.8 (2.2) | 73.7 (3.0) | 67.7 (3.0) | 82.4 (1.9) |
| 10 nodes | TriCL | 66.3 (2.6) | 75.7 (2.3) | 74.5 (3.1) | 69.7 (2.2) | 83.6 (1.9) |
| 10 nodes | HyperGCL | 64.4 (3.9) | 74.9 (2.1) | 74.3 (3.1) | 69.0 (3.4) | 83.6 (1.6) |
| 10 nodes | HypeBoy | 67.7 (2.6) | 75.8 (7.8) | 74.9 (2.7) | 70.9 (1.7) | 84.6 (1.5) |

F.9 Heterogeneous hypergraph experiment

In this section, we investigate the efficacy of HypeBoy as a pre-training strategy on a heterogeneous hypergraph dataset, which contains hyperedges with various semantics. To this end, we utilize the ACM dataset, where nodes correspond to publications. There are two types of hyperedges: authorship and subject relations.

Authorship: Publications authored by the same person are represented as a hyperedge.

Subject relations: Publications sharing a subject are represented as a hyperedge.

Other experimental settings are the same as those of the fine-tuning experiments (Section 5.1). As shown in Table 12, HypeBoy outperforms the other hypergraph SSL methods on the ACM dataset, demonstrating its efficacy beyond homogeneous hypergraphs.

Table 12: Results in the ACM dataset under the fine-tuning protocol. The best performances are colored green. Among the four methods, HypeBoy shows the best performance.

| Method | UniGCNII | TriCL | HyperGCL | HypeBoy |
|---|---|---|---|---|
| Accuracy | 40.0 (3.1) | 40.3 (2.4) | 40.8 (3.4) | 41.7 (2.6) |

Appendix G Potential Applications of Hyperedge Filling

In this section, we provide two potential applications of the hyperedge filling task.

Email recipient recommendation: Given recipients of an email, recommend a user likely to be added as a recipient of the email. Nodes in a hypergraph indicate users, and each hyperedge indicates an email, consisting of a sender, recipients, and CCs.

Item recommendation: Given the items in a shopping cart, recommend an item likely to be co-purchased with them. Nodes in a hypergraph indicate items, and each hyperedge contains a set of co-purchased items.
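In both applications, inference amounts to scoring every node as a candidate completion of a partial hyperedge. The sketch below is a simplified stand-in for HypeBoy's 𝑝 ( 𝐗 , ℰ , Θ ) ⁢ ( ⋅ ) — a softmax over inner products between the mean embedding of the given members and all node embeddings — and is our illustration, not the paper's exact scoring function.

```python
import numpy as np

def fill_scores(Z, partial):
    """Score each node for completing a partial hyperedge.

    Z:       (n, d) node-embedding matrix from a pre-trained encoder.
    partial: set of node indices already in the hyperedge (e.g., the
             current email recipients or cart items).
    Returns a probability vector over nodes; existing members get 0.
    """
    members = list(partial)
    q = Z[members].mean(axis=0)       # aggregate the partial hyperedge
    logits = Z @ q                    # similarity of every node to it
    logits[members] = -np.inf         # members cannot be recommended
    exp = np.exp(logits - np.max(logits[np.isfinite(logits)]))
    return exp / exp.sum()
```

For email recipient recommendation, `partial` would hold the current recipients and the top-scoring node is the suggested addition; the item-recommendation case is identical with items as nodes.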
