Data fusion using optimal scaling
This doc is going to show how to do data fusion using optimal scaling.
Contents
Add current folder to the path
clear all;
current_fold = pwd;
addpath(genpath(current_fold));
The Simulation of coupled binary and quantitative data sets
The simulation process can be refereed to the accompanied software of the paper "Generalized Simultaneous Component Analysis of Binary and Quantitative data" in https://arxiv.org/abs/1807.04982. We use the logit transform of the empirical marginal probabilities of the binary CNA data set as the simulated offset term to simulate imbalanced binary data . The number of samples, binary variables and quantitative variables are
,
,
respectively. The noise level
in simulating quantitative data set
is 1. The simulated low rank is 10. The SNRs in generating
and
are both 1.
% % import real binary CNA data set load('X1_CNA.mat'); mu1_fixed = logit(mean(X1_CNA,1)); % imbalanced binary data simulation %mu1_fixed = zeros(1,size(X1_CNA,2)); % balanced binary data simulation % data simulation SNRs = [1,1]; % SRNs in simulating X1 and X2 K = 3; % simulated low rank link = 'logit'; % use logit link %link = 'probit';% use logit link seed = 1234; % set seed to reproduce the example [X1,X2,Theta_simu,mu_simu,Z_simu,E_simu] = GSCA_data_simulation(mu1_fixed,SNRs,K,link,seed); % plot of the simulated offset term figure plot(mu_simu); title('simulated offset'); xlabel('variables');

Optimal scaling of binary data set X1 and quantitative data set X2
Optimal scaling of binary data equivalent to do auto scaling on data set X1; optimal scaling of quantitative data set X2 is itself.
% muX1 = mean(X1,1); stdX1 = std(X1,1); [m, n1] = size(X1); Z1 = (X1 - ones(m,1)*muX1)./(ones(m,1)*stdX1); % we use the auto scaled version of X2 muX2 = mean(X2,1); stdX2 = std(X2,1); [~, n2] = size(X2); Z2 = (X2 - ones(m,1)*muX2)./(ones(m,1)*stdX2);
Data fusion of optimal scaled X1 and X2
The data fusion of optimal scaled X1 and X2 is simple doing PCA on the cmobined (full) data sets.
% % SCA on Z1 Z2 R = 3; % number of PCs Z = [Z1, Z2]; % row cancatation for SCA [U,S,V] = svd(Z,'econ'); AHat = U(:,1:R); % score matrix BHat = V(:,1:R)*S(1:R, 1:R); % loading matrix
Variation explained ratios
The variation explained ratios are computed in the same way as a standard PCA model.
% comput the variation explained raito for the full data set S = diag(S); % extract singular values S_square = S.^2; varExp_PCs = S_square./sum(S_square); varExp_PCs = varExp_PCs(1:R) * 100; varExp_total = sum(varExp_PCs(1:R)); % Variation explained in binary X1 by the model [~,S1,~] = svd(Z1,'econ'); S1 = diag(S1); % extract singular values S1_square = S1.^2; varExp_PCs_X1 = S1_square./sum(S1_square); varExp_PCs_X1 = varExp_PCs_X1(1:R) * 100; varExp_X1 = sum(varExp_PCs_X1(1:R)); % Variation explained in quantitative X2 by the model [~,S2,~] = svd(Z2,'econ'); S2 = diag(S2); % extract singular values S2_square = S2.^2; varExp_PCs_X2 = S2_square./sum(S2_square); varExp_PCs_X2 = varExp_PCs_X2(1:R); varExp_X2 = sum(varExp_PCs_X2(1:R)); disp('total variation explained for the full data sets or each single data set:'); [varExp_total, varExp_X1, varExp_X2] disp('variaiton explained for different PCs for the full data sets or each single data set:'); [varExp_PCs, varExp_PCs_X1, varExp_PCs_X2]
total variation explained for the full data sets or each single data set: ans = 33.3834 19.0310 0.4040 variaiton explained for different PCs for the full data sets or each single data set: ans = 21.8388 10.6723 0.2646 9.3282 4.7845 0.1132 2.2164 3.5742 0.0263
Score and loading plots
B1Hat = BHat(1:n1,:); % loading matrix for Binary data X1 B2Hat = BHat(n1+1:end,:); % loading matrix for quantitative data X2 figure subplot(1,3,1) plot(AHat(:,1), AHat(:,2), 'o'); xlabel('PC1'); ylabel('PC2'); title('score plot'); %xlim([-0.2,0.2]); ylim([-0.2,0.2]); subplot(1,3,2) plot(B1Hat(:,1), B1Hat(:,2), 'or'); xlabel('PC1'); title('loading plot of X1'); %xlim([0, 26]); ylim([0, 26]); subplot(1,3,3); plot(B2Hat(:,1), B2Hat(:,2), 'o'); xlabel('PC1'); title('loading plot of X2'); %xlim([0, 150]); ylim([0, 150]);
