Data fusion using optimal scaling

This doc is going to show how to do data fusion using optimal scaling.

Contents

Add current folder to the path

clear all;
current_fold = pwd;
addpath(genpath(current_fold));

The Simulation of coupled binary and quantitative data sets

The simulation process can be refereed to the accompanied software of the paper "Generalized Simultaneous Component Analysis of Binary and Quantitative data" in https://arxiv.org/abs/1807.04982. We use the logit transform of the empirical marginal probabilities of the binary CNA data set as the simulated offset term to simulate imbalanced binary data $\mathbf{X}_1$. The number of samples, binary variables and quantitative variables are $m=160$, $n_1=410$, $n_2=1000$ respectively. The noise level $\sigma^2$ in simulating quantitative data set $\mathbf{X}_2$ is 1. The simulated low rank is 10. The SNRs in generating $\mathbf{X}_1$ and $\mathbf{X}_2$ are both 1.

%
% import real binary CNA data set
load('X1_CNA.mat');
mu1_fixed = logit(mean(X1_CNA,1)); % imbalanced binary data simulation
%mu1_fixed = zeros(1,size(X1_CNA,2)); % balanced binary data simulation

% data simulation
SNRs = [1,1];   % SRNs in simulating X1 and X2
K    = 3;       % simulated low rank
link = 'logit'; % use logit link
%link = 'probit';% use logit link
seed = 1234;    % set seed to reproduce the example
[X1,X2,Theta_simu,mu_simu,Z_simu,E_simu] = GSCA_data_simulation(mu1_fixed,SNRs,K,link,seed);

% plot of the simulated offset term
figure
plot(mu_simu); title('simulated offset'); xlabel('variables');

Optimal scaling of binary data set X1 and quantitative data set X2

Optimal scaling of binary data equivalent to do auto scaling on data set X1; optimal scaling of quantitative data set X2 is itself.

%
muX1  = mean(X1,1);
stdX1 = std(X1,1);
[m, n1] = size(X1);
Z1    = (X1 - ones(m,1)*muX1)./(ones(m,1)*stdX1);

% we use the auto scaled version of X2
muX2  = mean(X2,1);
stdX2 = std(X2,1);
[~, n2] = size(X2);
Z2    = (X2 - ones(m,1)*muX2)./(ones(m,1)*stdX2);

Data fusion of optimal scaled X1 and X2

The data fusion of optimal scaled X1 and X2 is simple doing PCA on the cmobined (full) data sets.

%
% SCA on Z1 Z2
R = 3;              % number of PCs
Z = [Z1, Z2];       % row cancatation for SCA
[U,S,V] = svd(Z,'econ');

AHat = U(:,1:R);           % score matrix
BHat = V(:,1:R)*S(1:R, 1:R);   % loading matrix

Variation explained ratios

The variation explained ratios are computed in the same way as a standard PCA model.

% comput the variation explained raito for the full data set
S = diag(S); % extract singular values
S_square = S.^2;
varExp_PCs = S_square./sum(S_square);
varExp_PCs = varExp_PCs(1:R) * 100;
varExp_total = sum(varExp_PCs(1:R));

% Variation explained in binary X1 by the model
[~,S1,~] = svd(Z1,'econ');
S1 = diag(S1); % extract singular values
S1_square = S1.^2;
varExp_PCs_X1 = S1_square./sum(S1_square);
varExp_PCs_X1 = varExp_PCs_X1(1:R) * 100;
varExp_X1 = sum(varExp_PCs_X1(1:R));

% Variation explained in quantitative X2 by the model
[~,S2,~] = svd(Z2,'econ');
S2 = diag(S2);            % extract singular values
S2_square = S2.^2;
varExp_PCs_X2 = S2_square./sum(S2_square);
varExp_PCs_X2 = varExp_PCs_X2(1:R);
varExp_X2 = sum(varExp_PCs_X2(1:R));

disp('total variation explained for the full data sets or each single data set:');
[varExp_total, varExp_X1, varExp_X2]

disp('variaiton explained for different PCs for the full data sets or each single data set:');
[varExp_PCs, varExp_PCs_X1, varExp_PCs_X2]
total variation explained for the full data sets or each single data set:

ans =

   33.3834   19.0310    0.4040

variaiton explained for different PCs for the full data sets or each single data set:

ans =

   21.8388   10.6723    0.2646
    9.3282    4.7845    0.1132
    2.2164    3.5742    0.0263

Score and loading plots

B1Hat = BHat(1:n1,:);     % loading matrix for Binary data X1
B2Hat = BHat(n1+1:end,:); % loading matrix for quantitative data X2

figure
subplot(1,3,1)
plot(AHat(:,1), AHat(:,2), 'o');
xlabel('PC1'); ylabel('PC2'); title('score plot');
%xlim([-0.2,0.2]); ylim([-0.2,0.2]);
subplot(1,3,2)
plot(B1Hat(:,1), B1Hat(:,2), 'or');
xlabel('PC1'); title('loading plot of X1');
%xlim([0, 26]); ylim([0, 26]);
subplot(1,3,3);
plot(B2Hat(:,1), B2Hat(:,2), 'o');
xlabel('PC1'); title('loading plot of X2');
%xlim([0, 150]); ylim([0, 150]);