Data fusion using representation matrices

This doc is going to show how to do data fusion using representation matrices.

Contents

Add current folder to the path

clear all;
current_fold = pwd;
addpath(genpath(current_fold));

The Simulation of coupled binary and quantitative data sets

The simulation process can be refereed to the accompanied software of the paper "Generalized Simultaneous Component Analysis of Binary and Quantitative data" in https://arxiv.org/abs/1807.04982. We use the logit transform of the empirical marginal probabilities of the binary CNA data set as the simulated offset term to simulate imbalanced binary data $\mathbf{X}_1$. The number of samples, binary variables and quantitative variables are $m=160$, $n_1=410$, $n_2=1000$ respectively. The noise level $\sigma^2$ in simulating quantitative data set $\mathbf{X}_2$ is 1. The simulated low rank is 10. The SNRs in generating $\mathbf{X}_1$ and $\mathbf{X}_2$ are both 1.

%
% import real binary CNA data set
load('X1_CNA.mat');
mu1_fixed = logit(mean(X1_CNA,1)); % imbalanced binary data simulation
%mu1_fixed = zeros(1,size(X1_CNA,2)); % balanced binary data simulation

% data simulation
SNRs = [1,1];   % SRNs in simulating X1 and X2
K    = 3;       % simulated low rank
link = 'logit'; % use logit link
%link = 'probit';% use logit link
seed = 1234;    % set seed to reproduce the example
[X1,X2,Theta_simu,mu_simu,Z_simu,E_simu] = GSCA_data_simulation(mu1_fixed,SNRs,K,link,seed);

% plot of the simulated offset term
figure
plot(mu_simu); title('simulated offset'); xlabel('variables');

Representation matrices of binary data set X1

Represent each binary variable in X1 by a representation matrix, then combine all $n_1$ representation matrices together into a $m \times m \times n_1$ three way array. Here we use the matricized form $m \times m*n_1$ of the three way array.

% An example of representaion matrix of binary data
xj = X1(:,5); % index out the 5th binary variable
Sj = representationBinVar(xj); % represent matrix of xj

% Representation matrices of binary data X1
[m, n1] = size(X1);
SBin = zeros(m, m*n1);
for j = 1:n1
    xj = X1(:,j);
    Sj = representationBinVar(xj);
    SBin(:,(j-1)*m+1:j*m) = Sj;
end

Representation matrices of quantitative data set X2

%
% An example of representation matrix of quantitative data
xj = X2(:,5);
Sj = representationQuantiVar( xj );

% Representation matrice of quantitative data X2
[~,n2] = size(X2);
SQuanti = zeros(m,m*n2);
for j = 1:n2
    xj = X2(:,j);
    Sj = representationQuantiVar(xj);
    SQuanti(:,(j-1)*m+1:j*m) = Sj;
end

Using INDSCAL or INDORT model to analysis the three way array

S = [SBin, SQuanti]; % similarity matrices
%size(S)             % check the dimensions
R      = 3;     % number of PCs
start  = 1;     % using 0 as initilization
conv   = 1e-8;  % convergence criteria
method = 1;     % using alternating minimization algorithm

%[AHat, BHat, out] = INDS(S, R, conv, start, method); % INDSCAL model
[AHat, BHat, out] = INDORT(S, R, conv, start);       % INDORT model
r
     3

conv
   1.0000e-08

start
     1

 total ssq is 
    35623400

 indort value at iter 1 is 33494622.482448958000 ;  
 indort value at iter 2 is 31970714.671496097000 ;  
 indort value at iter 3 is 31798195.084478199000 ;  
 indort value at iter 4 is 31757477.398271054000 ;  
 indort value at iter 5 is 31726247.219856549000 ;  
 indort value at iter 6 is 31689102.647111282000 ;  
 indort value at iter 7 is 31644724.743494246000 ;  
 indort value at iter 8 is 31595499.671277210000 ;  
 indort value at iter 9 is 31545781.406881247000 ;  
 indort value at iter 10 is 31500298.525592636000 ;  
 indort value at iter 11 is 31462437.008664176000 ;  
 indort value at iter 12 is 31433398.817583825000 ;  
 indort value at iter 13 is 31412537.935417455000 ;  
 indort value at iter 14 is 31398264.026650302000 ;  
 indort value at iter 15 is 31388826.991981395000 ;  
 indort value at iter 16 is 31382731.375309817000 ;  
 indort value at iter 17 is 31378854.034704395000 ;  
 indort value at iter 18 is 31376412.091776252000 ;  
 indort value at iter 19 is 31374883.905362945000 ;  
 indort value at iter 20 is 31373931.410590142000 ;  
 indort value at iter 21 is 31373339.252923090000 ;  
 indort value at iter 22 is 31372971.709731728000 ;  
 indort value at iter 23 is 31372743.815576706000 ;  
 indort value at iter 24 is 31372602.602756131000 ;  
 indort value at iter 25 is 31372515.137895375000 ;  
 indort value at iter 26 is 31372460.978183053000 ;  
 indort value at iter 27 is 31372427.447423723000 ;  
 indort value at iter 28 is 31372406.690600850000 ;  
 indort value at iter 29 is 31372393.842303269000 ;  
 indort value at iter 30 is 31372385.889717877000 ;  
 indort value at iter 31 is 31372380.967552528000 ;  
 indort value at iter 32 is 31372377.921104252000 ;  
 indort value at iter 33 is 31372376.035614111000 ;  
 indort value at iter 34 is 31372374.868671089000 ;  
 indort value at iter 35 is 31372374.146447916000 ;  
 indort value at iter 36 is 31372373.699465435000 ;  
 indort value at iter 37 is 31372373.422830053000 ;  
 Final indort value at iter 37 is 31372373.422830053000 
 Elapsed time 

ans =

    8.1870

Convergence behavior of the algorithm

figure
plot(out.obj(2:end));
xlabel('iterations');
ylabel('object loss');
title('convergence behavior');

Variation explained ratios

Variation explained of the model and PCs

[varExp_total, varExp_PCs] = representation_varExp(S,R,AHat,BHat);
varExp_total = varExp_total*100;
varExp_PCs  = varExp_PCs*100;

% Variation explained in binary X1 by the model
[varExp_X1, varExp_PCs_X1] = representation_varExp(SBin,R,AHat,BHat(1:n1,:));
varExp_X1 = varExp_X1*100;
varExp_PCs_X1 = varExp_PCs_X1*100;

% Variation explained in quantitative X2 by the model
[varExp_X2, varExp_PCs_X2] = representation_varExp(SQuanti,R,AHat,BHat(n1+1:end,:));
varExp_X2 = varExp_X2*100;
varExp_PCs_X2 = varExp_PCs_X2*100;

disp('total variation explained for the full data sets or each single data set:');
[varExp_total, varExp_X1, varExp_X2]

disp('variaiton explained for different PCs for the full data sets or each single data set:');
[varExp_PCs', varExp_PCs_X1', varExp_PCs_X2']
total variation explained for the full data sets or each single data set:

ans =

   11.9332    3.0104   15.5836

variaiton explained for different PCs for the full data sets or each single data set:

ans =

    0.1676    0.0476    0.2166
    9.4014    2.4013   12.2651
    2.3643    0.5615    3.1018

Score and loading plots

B1Hat = BHat(1:n1,:);     % loading matrix for Binary data X1
B2Hat = BHat(n1+1:end,:); % loading matrix for quantitative data X2

figure
subplot(1,3,1)
plot(AHat(:,1), AHat(:,2), 'o');
xlabel('PC1'); ylabel('PC2'); title('score plot');
%xlim([-0.2,0.2]); ylim([-0.2,0.2]);
subplot(1,3,2)
plot(B1Hat(:,1), B1Hat(:,2), 'or');
xlabel('PC1'); title('loading plot of X1');
%xlim([0, 26]); ylim([0, 26]);
subplot(1,3,3);
plot(B2Hat(:,1), B2Hat(:,2), 'o');
xlabel('PC1'); title('loading plot of X2');
%xlim([0, 150]); ylim([0, 150]);