Data Collaboration Analysis Framework Using Centralization of Individual Intermediate Representations for Distributed Data Sets

Abstract: This paper proposes a data collaboration analysis framework for distributed data sets. The proposed framework involves centralized machine learning while the original data sets and models remain distributed over a number of institutions. Recently, data have become larger and more distributed with decreasing costs of data collection. Centralizing distributed data sets and analyzing them as one data set can allow for novel insights and attainment of higher prediction performance than that of analyzing distributed data sets individually. However, it is generally difficult to centralize the original data sets because of large data sizes or privacy concerns. This paper proposes a data collaboration analysis framework that does not involve sharing the original data sets, circumventing these difficulties. The proposed framework centralizes only intermediate representations constructed individually, rather than the original data sets. The proposed framework does not use privacy-preserving computations or model centralization. In addition, this paper proposes a practical algorithm within the framework. Numerical experiments reveal that the proposed method achieves higher recognition performance for artificial and real-world problems than individual analysis. DOI: 10.1061/AJRUA6.0001058. This work is made available under the terms of the Creative Commons Attribution 4.0 International license, http://creativecommons.org/licenses/by/4.0/.


Introduction
Dimensionality reduction methods that project high-dimensional data to a low-dimensional space are successfully applied in several application areas to improve prediction performance and accelerate machine learning algorithms, including gene expression data analysis (Tarca et al. 2006), chemical sensor data analysis (Jurs et al. 2000), social network analysis (Tichy et al. 1979), and infrastructure analysis (Attoh-Okine 2018, 2020). Recently, data sets have become larger and more distributed, and the costs of data collection have decreased. Centralizing distributed data sets and analyzing them as one data set, which we refer to as centralized analysis, can enable us to obtain novel insights and achieve higher prediction performance than individual analysis on each distributed data set. However, it is generally difficult to centralize the original data sets because of large data sizes or privacy concerns.
For example, in the case of medical data analysis, the data sets in each medical institution may not be sufficient for generating a high-quality prediction result because of the insufficiency and imbalance of the data samples. However, it is difficult to centralize the original medical data samples with those from other institutions because of privacy concerns. If the original data is transformed into another (e.g., low-dimensional) space by an appropriate mapping, however, the mapped data, which is referred to as an intermediate representation, can be centralized fairly easily because each feature of the intermediate representation lacks any physical interpretation.
Examples of overcoming the difficulties of centralized analysis include the usage of privacy-preserving computation based on cryptography (Jha et al. 2005; Kerschbaum 2012; Cho et al. 2018; Gilad-Bachrach et al. 2016) and differential privacy (Abadi et al. 2016; Ji et al. 2014; Dwork 2006). Federated learning (Konečnỳ et al. 2016; McMahan et al. 2016), in which a model is centralized while the original data sets remain distributed, has also been studied in this context.
In contrast to these existing methods, this paper proposes a data collaboration analysis framework for distributed data sets that centralizes only individually constructed intermediate representations.
The proposed framework assumes that each institution uses a different mapping function for constructing intermediate representations.
The framework does not centralize the mapping functions, to avoid the risk of approximating the original data samples from their intermediate representations by using the (approximate) inverses of the mapping functions. The proposed data collaboration analysis framework also does not use privacy-preserving computation. Instead, using sharable data such as public data and randomly constructed dummy data, the proposed framework achieves a data collaboration analysis by mapping individual intermediate representations to incorporable representations referred to as collaboration representations.
This paper additionally proposes a practical algorithm and a practical operation strategy regarding the problem of privacy preservation. Using numerical experiments on artificial and real-world data sets, the recognition performance of the proposed method is evaluated and compared with centralized and individual analyses.
• The proposed framework differs from existing approaches as it does not use privacy-preserving computations or model centralization.
• The proposed data collaboration analysis achieves higher recognition performance than that produced by individual analysis.

Data Collaboration Analysis Framework
In this section, we discuss the case that there are multiple institutions and each institution has an individual data set. We propose a data collaboration analysis framework for distributed data sets that does not centralize the original data. The proposed method can be considered a dimensionality reduction method for distributed data sets. The distributed original data sets are transformed into the collaboration representations via the intermediate representations.
Therefore, after constructing the collaboration representations, we can use any machine learning algorithm, including unsupervised, supervised, and semi-supervised learning. Let $d$ be the number of institutions, let $m$ and $n_i$ be the numbers of features and training data samples of the $i$th institution, and let $n = \sum_{i=1}^{d} n_i$ be the total number of training data samples. In addition, let $X_i = [x_{i1}, x_{i2}, \ldots, x_{in_i}] \in \mathbb{R}^{m \times n_i}$ be the training data set of the $i$th institution. For supervised learning, we additionally let $L_i = [l_{i1}, l_{i2}, \ldots, l_{in_i}] \in \mathbb{R}^{l \times n_i}$ be the ground truth for the training data. Also, let $s_i$ be the number of test data samples of the $i$th institution, $s = \sum_{i=1}^{d} s_i$, and $Y_i = [y_{i1}, y_{i2}, \ldots, y_{is_i}] \in \mathbb{R}^{m \times s_i}$ be the test data set of the $i$th institution.
We do not centralize the original data set X i (and Y i in supervised learning). Instead, we centralize the intermediate representations constructed individually from X i . We also do not centralize the mapping function for the intermediate representation to reduce the risk of approximating the original data.
In the remainder of this section, we introduce a fundamental concept of the data collaboration analysis framework and propose a practical algorithm. In addition, we consider a practical operation strategy regarding privacy concerns.

Fundamental Concept and Framework
Instead of centralizing the original data set $X_i$, we consider centralizing the intermediate representation $\tilde{X}_i = f_i(X_i) \in \mathbb{R}^{l_i \times n_i}$ constructed individually in each institution, where $f_i$ is a linear or nonlinear column-wise mapping function. Because each mapping function $f_i$ is constructed using $X_i$ individually, $f_i$ and its dimensionality $l_i$ depend on $i$. Examples of the mapping function include unsupervised dimensionality reductions, such as principal component analysis (PCA) (Pearson 1901; Jolliffe 1986) and locality preserving projections (LPP) (He and Niyogi 2004), and supervised dimensionality reductions, such as Fisher discriminant analysis (FDA) (Fisher 1936), local FDA (LFDA) (Sugiyama 2007), semi-supervised LFDA (SELF) (Sugiyama et al. 2010), locality adaptive discriminant analysis (LADA) (Li et al. 2017), and complex moment-based supervised eigenmap (CMSE) (Imakura et al. 2019). One can also consider a partial structure of a deep neural network. The proposed framework aims to avoid the difficulties of centralized analysis by achieving collaboration analysis while the original data set $X_i$ and the mapping function $f_i$ remain distributed in each institution.
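As an illustrative sketch (the PCA maps, dimensions, and random data below are assumptions for demonstration, not the paper's experimental setup), the following shows how individually constructed mapping functions $f_i$ yield intermediate representations that are not directly comparable across institutions:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_pca_map(X, dim):
    """Build a PCA-based mapping function f_i from institution-local data X (features x samples)."""
    mean = X.mean(axis=1, keepdims=True)
    # Left singular vectors of the centered data give the principal directions.
    U, _, _ = np.linalg.svd(X - mean, full_matrices=False)
    P = U[:, :dim]
    return lambda A: P.T @ (A - mean)  # column-wise map to the low-dimensional space

# Two institutions with different local data sets -> different maps f_1, f_2.
X1 = rng.normal(size=(20, 50))   # m = 20 features, n_1 = 50 samples
X2 = rng.normal(size=(20, 60))
f1, f2 = make_pca_map(X1, 5), make_pca_map(X2, 5)

x = rng.normal(size=(20, 1))     # an identical sample held by both institutions
# f_1(x) != f_2(x): the two intermediate representations are not directly comparable.
print(np.allclose(f1(x), f2(x)))
```

Because each map is fit to different local data, even an identical sample receives different intermediate representations, which is exactly why the collaboration representations $g_i$ are needed.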
Because $f_i$ depends on the institution $i$, even when each institution has an identical data sample $x$, the intermediate representation of the data differs; that is,
$$f_i(x) \neq f_j(x) \quad (i \neq j)$$
In addition, the relationship between the original data samples $x$ and $y$ is generally not preserved across different institutions; that is,
$$D(f_i(x), f_j(y)) \not\approx D(x, y) \quad (i \neq j)$$
where $D(\cdot, \cdot)$ denotes a relationship between data samples, such as a distance or similarity. Therefore, one cannot analyze the intermediate representations as one data set, even if their dimensionalities are identical. To overcome this difficulty, we transform the individual intermediate representations into incorporable representations again as follows:
$$\hat{X}_i = g_i(\tilde{X}_i)$$
Here, $g_i$ is a column-wise mapping function such that
$$D(g_i(f_i(x)), g_j(f_j(y))) \approx D(x, y) \quad (i \neq j) \qquad (6)$$
Because the relationships of the original data set are preserved, one can analyze the obtained $\hat{X}_i$ $(i = 1, 2, \ldots, d)$ as one data set:
$$\hat{X} = [\hat{X}_1, \hat{X}_2, \ldots, \hat{X}_d]$$
Because the mapping function $f_i$ for the intermediate representation is not centralized, the function $g_i$ cannot be constructed only from the centralized intermediate representations $\tilde{X}_i$. To construct the mapping function $g_i$, we introduce sharable data, referred to as an anchor data set, consisting of public data or randomly constructed dummy data:
$$X^{\mathrm{anc}} = [x_1^{\mathrm{anc}}, x_2^{\mathrm{anc}}, \ldots, x_r^{\mathrm{anc}}] \in \mathbb{R}^{m \times r}$$
where $r \geq l_i$. Applying each mapping function $f_i$ to the anchor data, we obtain the $i$th intermediate representation of the anchor data set:
$$\tilde{X}_i^{\mathrm{anc}} = f_i(X^{\mathrm{anc}})$$
Then, we centralize $\tilde{X}_i^{\mathrm{anc}}$ and construct $g_i$ such that
$$g_i(\tilde{X}_i^{\mathrm{anc}}) \approx g_j(\tilde{X}_j^{\mathrm{anc}}) \quad (i \neq j) \qquad (12)$$
The fundamental procedure in the proposed data collaboration analysis framework is as follows:

1. Construction of intermediate representations: Each institution constructs intermediate representations individually and centralizes them.
2. Construction of collaboration representations: From the centralized intermediate representations, the collaboration representations are constructed.
3. Collaboration analysis: The collaboration representations obtained from the individual original data sets are analyzed as one data set.

Proposal for Practical Algorithm
A fundamental component of the proposed framework involves constructing the collaboration representations using the anchor data (Phase 2). The mapping function $g_i$ can be constructed using the following two steps: setting a target $Z$ (Step 1) and constructing each map $g_i$ toward that target (Step 2).

Map function construction
We construct the mapping function $g_i$ such that
$$Z \approx g_i(\tilde{X}_i^{\mathrm{anc}}) \quad (i = 1, 2, \ldots, d) \qquad (14)$$
There may be several ways of computing Steps 1 and 2. This paper assumes $g_i$ to be a linear map. Considering only Eq. (12) for Step 1, we propose a practical algorithm.
Because the map function $g_i$ is a linear map, using a matrix $G_i \in \mathbb{R}^{l \times l_i}$, we have
$$\hat{X}_i^{\mathrm{anc}} = g_i(\tilde{X}_i^{\mathrm{anc}}) = G_i \tilde{X}_i^{\mathrm{anc}}$$
Then, to achieve Eq. (12), we address the following minimization problem:
$$\min_{G'_1, G'_2, \ldots, G'_d} \sum_{i,j} \| G'_i \tilde{X}_i^{\mathrm{anc}} - G'_j \tilde{X}_j^{\mathrm{anc}} \|_F^2$$
This problem is difficult to solve directly. Instead, we consider solving the following minimal perturbation problem:
$$\min_{E_i, G'_i \, (i = 1, 2, \ldots, d), \, Z} \sum_{i=1}^{d} \| E_i \|_F^2 \quad \text{s.t.} \quad (\tilde{X}_i^{\mathrm{anc}} + E_i)^T (G'_i)^T = Z^T \qquad (17)$$
The minimal perturbation problem Eq. (17) with $d = 2$ is called the total least squares problem and is solved by singular value decomposition (SVD) (Ito and Murota 2016). In the same manner, one can solve Eq. (17) with $d > 2$ using SVD. Let
$$[(\tilde{X}_1^{\mathrm{anc}})^T, (\tilde{X}_2^{\mathrm{anc}})^T, \ldots, (\tilde{X}_d^{\mathrm{anc}})^T] = [U_1, U_2] \begin{bmatrix} \Sigma_1 & \\ & \Sigma_2 \end{bmatrix} \begin{bmatrix} V_1^T \\ V_2^T \end{bmatrix} \qquad (18)$$
be the SVD of the matrix combining the $\tilde{X}_i^{\mathrm{anc}}$, where $U_1 \in \mathbb{R}^{r \times l}$ and $\Sigma_1$ holds the larger part of the singular values. Then, we have
$$Z^T = U_1 C$$
where $C \in \mathbb{R}^{l \times l}$ is a nonsingular matrix. Next, setting $Z = U_1^T$, we compute $G_i$ from Eq. (14). The matrix $G_i$ can be computed individually by solving the following linear least squares problem:
$$\min_{G_i} \| Z - G_i \tilde{X}_i^{\mathrm{anc}} \|_F^2, \quad \text{i.e.,} \quad G_i = Z (\tilde{X}_i^{\mathrm{anc}})^{\dagger}$$
where $(\tilde{X}_i^{\mathrm{anc}})^{\dagger}$ denotes the Moore-Penrose pseudo-inverse of the matrix $\tilde{X}_i^{\mathrm{anc}}$. Algorithm 1 summarizes the algorithm of the proposed method for supervised learning.
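A minimal numerical sketch of Steps 1 and 2, using random linear maps as stand-ins for the private functions $f_i$ (all dimensions and names below are illustrative assumptions, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(42)

m, r, l, d = 20, 50, 5, 3    # features, anchor samples, collaboration dim, institutions
li = 8                       # dimensionality of each intermediate representation (equal here)

# Shared anchor data; each institution applies its own private map f_i to it.
X_anc = rng.normal(size=(m, r))
F = [rng.normal(size=(li, m)) for _ in range(d)]   # stand-ins for the private maps f_i
Xt_anc = [Fi @ X_anc for Fi in F]                  # centralized anchor representations, l_i x r

# Step 1: SVD of [(X~_1^anc)^T, ..., (X~_d^anc)^T]; Z = U_1^T from the top-l left singular vectors.
stacked = np.hstack([Xt.T for Xt in Xt_anc])       # r x (l_1 + ... + l_d)
U, s, Vt = np.linalg.svd(stacked, full_matrices=False)
Z = U[:, :l].T                                     # l x r target with orthonormal rows

# Step 2: G_i = Z (X~_i^anc)^+ solves min_G ||Z - G X~_i^anc||_F for each institution.
G = [Z @ np.linalg.pinv(Xt) for Xt in Xt_anc]

# The collaboration representation of any new intermediate data is then G_i @ f_i(X_i).
print([round(np.linalg.norm(Z - Gi @ Xt), 3) for Gi, Xt in zip(G, Xt_anc)])
```

Each $G_i$ is the least squares solution toward the common target $Z$, so the residual $Z - G_i \tilde{X}_i^{\mathrm{anc}}$ is orthogonal to the row space of $\tilde{X}_i^{\mathrm{anc}}$; how small it is depends on how much structure the institutions' representations share.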
One of the main computational costs of the proposed method is the SVD in Eq. (18), which depends on the number of anchor data samples $r$ and the dimensionality of the intermediate representations $l_i$. Approximation algorithms, including randomized SVD (Halko et al. 2011), can be used to reduce this cost. On the other hand, the anchor data $X^{\mathrm{anc}}$ also strongly affects the recognition performance of the proposed method. A simple approach is to set $X^{\mathrm{anc}}$ as a random matrix. If the anchor data has the same statistics as the original data set, it may improve the recognition performance of the proposed method. We intend to investigate practical techniques for constructing suitable anchor data in the future.
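For large anchor sets, the full SVD can be replaced by a randomized approximation. Below is a minimal sketch of the basic randomized SVD of Halko et al. (2011); the function name and parameters are illustrative, not from the paper:

```python
import numpy as np

def randomized_svd(A, k, oversample=10, seed=None):
    """Basic randomized SVD (Halko et al. 2011): estimate the dominant range of A
    with a random projection, then take an exact SVD of the small projected matrix."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    # Range finder: the columns of A @ Omega capture the dominant column space of A.
    Omega = rng.normal(size=(n, k + oversample))
    Q, _ = np.linalg.qr(A @ Omega)
    # Exact SVD of the small matrix B = Q^T A, then lift back with Q.
    Ub, s, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)
    U = Q @ Ub
    return U[:, :k], s[:k], Vt[:k]

# Sanity check on an exactly rank-8 matrix: the rank-8 approximation recovers it.
rng = np.random.default_rng(0)
A = rng.normal(size=(200, 8)) @ rng.normal(size=(8, 100))
U, s, Vt = randomized_svd(A, k=8)
print(np.linalg.norm(A - U * s @ Vt) / np.linalg.norm(A))
```

When the spectrum decays quickly, as is typical for the stacked anchor representations, a small oversampling parameter already gives a good approximation at a fraction of the full SVD cost.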

Practical Operation Strategy Regarding Privacy Concerns
Here, we consider a practical operation strategy regarding privacy concerns based on the proposed framework for supervised learning. In this paper, privacy is said to be preserved when each entry of the corresponding data cannot be (approximately) obtained by others.
Here, this paper does not consider the privacy of data set statistics. Based on this definition, regarding the original data $X_i$ in each institution, privacy is preserved if the data collaboration analysis satisfies the following operation strategies:
1. There are two roles: users, who have the training and test data sets individually, and an analyst, who centralizes the intermediate representations and analyzes them.
   a. The users and analyst possess some of the data, as illustrated in Tables 1 and 2.
   b. Each step of Algorithm 1 is executed by the corresponding role, as demonstrated in Fig. 1.
2. Each mapping function $f_i$ is constructed with the following requirements:
   a. The original data can be approximated only with an intermediate representation and the mapping function $f_i$ or its approximation.
   b. The mapping function $f_i$ can be approximated only with both the input and output data of $f_i$.
3. The analyst does not collude with user(s) to obtain the original data of other users.
In this operation strategy, each user does not possess the intermediate representations of other users, and the analyst does not possess the original anchor data $X^{\mathrm{anc}}$. Therefore, the original data set $X_i$ cannot be (approximately) obtained by others; this proves that the privacy of the original data $X_i$ is preserved under our definition.

Related Works
One possibility for achieving a high-quality analysis while avoiding the difficulties of centralized analysis involves the usage of privacy-preserving computation. There are two typical types of privacy-preserving computation techniques: those based on cryptography (Jha et al. 2005; Kerschbaum 2012; Cho et al. 2018; Gilad-Bachrach et al. 2016) and those based on differential privacy (Abadi et al. 2016; Ji et al. 2014; Dwork 2006).
Cryptographic privacy-preserving (or secure multi-party) computations can compute a function over distributed data while retaining the privacy of the data. Fully homomorphic encryption (FHE) (Gentry 2009) can compute any given function; however, it is impractical for large data sets with respect to computational cost even using the latest implementations (Chillotti et al. 2016). Differential privacy is another type of privacy-preserving computation that protects the privacy of the original data sets by randomization. In terms of computational cost, these computations are more efficient than cryptographic computations; however, they may have low prediction accuracy because of the noise added for protecting privacy.
Federated learning, which centralizes a model, has also been studied in this context (Konečnỳ et al. 2016; McMahan et al. 2016). Federated learning achieves a high-quality analysis while avoiding the difficulties of centralized analysis by centralizing a model function instead of using cryptography or randomization. However, it may carry a risk of exposing the original data set as a result of centralizing a model from each institution. Therefore, in practice, federated learning is used in conjunction with privacy-preserving computations (Yang 2019).
Our proposed framework differs from these existing approaches in that it does not use privacy-preserving computations or model centralization.

Numerical Experiments
This section presents an evaluation of the recognition performance of the proposed data collaboration analysis method and compares it with that of centralized and individual analyses for classification problems. It should be noted that, in our target situation, centralized analysis is only an ideal because one cannot share the original data sets $X_i$. The proposed data collaboration analysis should achieve a recognition performance higher than that of individual analysis and close to, though typically lower than, that of centralized analysis.
We used kernel ridge regression (Saunders et al. 1998) for the individual and centralized analyses and for Step 8 of the proposed method (Algorithm 1). In the proposed method, each intermediate representation is constructed from $X_i$ by kernel LPP (K-LPP) (He and Niyogi 2004). We note that K-LPP is an unsupervised dimensionality reduction; however, the constructed map $f_i$ depends on $i$ because it depends on the data set $X_i$. The anchor data set is constructed as a random matrix.
In the training phase, we use the ground truth $L$ as a binary matrix whose $(i, j)$ entry is 1 if the training data sample $x_j$ is in class $i$. This type of ground truth $L$ is used for several classification algorithms, including ridge regression and deep neural networks (Bishop 2006). All numerical experiments were performed using MATLAB 2018b.
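A minimal sketch of the classifier used in the experiments: kernel ridge regression with a Gaussian kernel and a binary ground-truth matrix $L$ as described above. The regularization parameter, kernel width, and toy data below are assumptions for illustration:

```python
import numpy as np

def gauss_kernel(A, B, sigma=1.0):
    """Gaussian kernel matrix between column-wise data sets A (m x n) and B (m x s)."""
    sq = (A * A).sum(0)[:, None] + (B * B).sum(0)[None, :] - 2 * A.T @ B
    return np.exp(-sq / (2 * sigma ** 2))

def kernel_ridge_fit(X, L, lam=1e-2, sigma=1.0):
    """Fit kernel ridge regression: solve (K + lam I) W = L^T for the dual coefficients."""
    K = gauss_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * np.eye(K.shape[0]), L.T)

def kernel_ridge_predict(X, W, Y, sigma=1.0):
    """Predict the class of each test column of Y as the largest regression output."""
    scores = W.T @ gauss_kernel(X, Y, sigma)   # l x s score matrix
    return scores.argmax(axis=0)

# Toy usage: two well-separated classes with a binary ground-truth matrix L.
rng = np.random.default_rng(1)
X = np.hstack([rng.normal(0, 0.1, (2, 20)), rng.normal(3, 0.1, (2, 20))])
L = np.zeros((2, 40)); L[0, :20] = 1; L[1, 20:] = 1
W = kernel_ridge_fit(X, L)
print(kernel_ridge_predict(X, W, X[:, [0, -1]]))
```

In the proposed method, the same regression would be applied to the collaboration representations $\hat{X}_i$ rather than to the raw data.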
Fig. 1. Practical operation strategy: algorithm flow. (e) data collaboration analysis (user 1 has test data set); (f) data collaboration analysis (user 2 has test data set); and (g) data collaboration analysis (user 3 has test data set).

Artificial Data
In this experiment, we used a three-class classification of 10-dimensional artificial data. Fig. 2(a) illustrates the first two dimensions of the ground truth. Figs. 2(b-d) illustrate the first two dimensions of the 40 training data points of each user with the corresponding labels: ∘, •, and +. For the test data set, we used 201 × 201 data points whose first two dimensions were square grid points in [-1, 1] × [-1, 1]. The remaining eight dimensions of the training and test data sets were random values in [-0.1, 0.1] generated by the Mersenne Twister. The Gaussian kernel was used for all methods.
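The synthetic setup can be sketched as follows. The dimensions and value ranges follow the text, but the class-boundary rule is an illustrative stand-in, since the paper specifies the ground truth only graphically (Fig. 2):

```python
import numpy as np

rng = np.random.RandomState(0)   # RandomState uses the Mersenne Twister, as in the text

def make_samples(n):
    """10-dimensional samples: 2 informative dimensions in [-1, 1] and
    8 noise dimensions in [-0.1, 0.1], per the experiment description."""
    informative = rng.uniform(-1, 1, (2, n))
    noise = rng.uniform(-0.1, 0.1, (8, n))
    return np.vstack([informative, noise])

def label(X):
    """Hypothetical three-class rule on the first two dimensions
    (an illustrative stand-in for the paper's graphical ground truth)."""
    angle = np.arctan2(X[1], X[0])
    return np.digitize(angle, [-np.pi / 3, np.pi / 3])  # classes 0, 1, 2

X_train = make_samples(40)       # 40 training points per user, as in the experiment
y_train = label(X_train)

# 201 x 201 grid of test points over [-1, 1] x [-1, 1] in the first two dimensions.
g = np.linspace(-1, 1, 201)
gx, gy = np.meshgrid(g, g)
X_test = np.vstack([gx.ravel(), gy.ravel(), rng.uniform(-0.1, 0.1, (8, 201 * 201))])
print(X_test.shape)
```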
The accuracy (ACC) of the centralized analysis and the average ACC over the three users for the individual and proposed data collaboration analyses are 92.3, 79.8, and 91.3, respectively. Fig. 3 presents the recognition results. In each subfigure, the white markers ∘, •, and + denote training data points. Comparing the results of the centralized and individual analyses, we observe that the recognition results of individual analysis are significantly poorer than those of centralized analysis because of the insufficiency of data samples. In contrast, the proposed data collaboration analysis achieves results comparable to those of centralized analysis.

Handwritten Digits Data (MNIST)
In this experiment, we used a 10-class classification of handwritten digits (MNIST) (LeCun 1998), where the number of features was m = 784. Here, we set 100 data samples for each user and evaluated the recognition performance in terms of normalized mutual information (NMI) (Strehl and Ghosh 2002), accuracy (ACC), and Rand index (RI) (Rand 1971) for 1,000 test data samples, increasing the number of users from 1 to 50. We used the Gaussian kernel for all methods. Fig. 4 presents the average and standard error of the recognition performance over 20 trials for each method. The recognition performance of the proposed data collaboration analysis increases with an increasing number of users and is significantly higher than that of individual analysis.

Gene Expression Data
In this numerical experiment, we used a three-class classification problem for cancer data from a previous study (Golub et al. 1999). The data set has 38 training and 34 test data samples with m = 7,129 features. Here, we considered the case of two users and allocated 19 data samples to each user. Then, we evaluated the recognition performance over 20 trials. A linear kernel was used for all methods. Fig. 5 presents a three-dimensional visualization of the training (+) and test (∘) data samples for each method. Table 3 summarizes the recognition performance (average ± standard error). In the three-dimensional visualization, the three classes are well separated in the low-dimensional space constructed by the proposed data collaboration analysis, as they are for centralized analysis. We observed that the proposed data collaboration analysis achieved higher recognition performance than individual analysis for this real-world problem.

Remarks of Numerical Results
The results of numerical experiments reveal that the proposed data collaboration analysis achieves higher recognition performance for artificial and real-world data sets than individual analysis. It should be noted that because centralized analysis is ideal, the recognition performance of the proposed data collaboration analysis is not required to be higher than that of centralized analysis.

Conclusions
This paper has proposed a data collaboration analysis framework for distributed data sets based on centralizing individual intermediate representations, while the original data sets and mapping functions remain distributed. This paper has also proposed a practical algorithm within the framework and a practical operation strategy regarding privacy concerns. The proposed framework differs from existing approaches in that it does not use privacy-preserving computations and does not centralize mapping functions. Numerical experiments demonstrate that the proposed method achieves higher recognition performance for artificial and real-world data sets than individual analysis.
In future work, we will investigate the usage of a nonlinear mapping function g_i and how to construct anchor data to improve recognition performance for large real-world problems.

Data Availability Statement
Some or all of the data, models, or code generated or used during the study are available from the corresponding author by request. Available items: program codes and the data sets used in the numerical experiments.