Data fusion using representation matrices
This doc is going to show how to do data fusion using representation matrices.
Contents
- Add current folder to the path
- The Simulation of coupled binary and quantitative data sets
- Representation matrices of binary data set X1
- Representation matrices of quantitative data set X2
- Using INDSCAL or INDORT model to analysis the three way array
- Convergence behavior of the algorithm
- Variation explained ratios
- Score and loading plots
Add current folder to the path
clear all;
current_fold = pwd;
addpath(genpath(current_fold));
The Simulation of coupled binary and quantitative data sets
The simulation process can be refereed to the accompanied software of the paper "Generalized Simultaneous Component Analysis of Binary and Quantitative data" in https://arxiv.org/abs/1807.04982. We use the logit transform of the empirical marginal probabilities of the binary CNA data set as the simulated offset term to simulate imbalanced binary data . The number of samples, binary variables and quantitative variables are
,
,
respectively. The noise level
in simulating quantitative data set
is 1. The simulated low rank is 10. The SNRs in generating
and
are both 1.
% % import real binary CNA data set load('X1_CNA.mat'); mu1_fixed = logit(mean(X1_CNA,1)); % imbalanced binary data simulation %mu1_fixed = zeros(1,size(X1_CNA,2)); % balanced binary data simulation % data simulation SNRs = [1,1]; % SRNs in simulating X1 and X2 K = 3; % simulated low rank link = 'logit'; % use logit link %link = 'probit';% use logit link seed = 1234; % set seed to reproduce the example [X1,X2,Theta_simu,mu_simu,Z_simu,E_simu] = GSCA_data_simulation(mu1_fixed,SNRs,K,link,seed); % plot of the simulated offset term figure plot(mu_simu); title('simulated offset'); xlabel('variables');

Representation matrices of binary data set X1
Represent each binary variable in X1 by a representation matrix, then combine all representation matrices together into a
three way array. Here we use the matricized form
of the three way array.
% An example of representaion matrix of binary data xj = X1(:,5); % index out the 5th binary variable Sj = representationBinVar(xj); % represent matrix of xj % Representation matrices of binary data X1 [m, n1] = size(X1); SBin = zeros(m, m*n1); for j = 1:n1 xj = X1(:,j); Sj = representationBinVar(xj); SBin(:,(j-1)*m+1:j*m) = Sj; end
Representation matrices of quantitative data set X2
% % An example of representation matrix of quantitative data xj = X2(:,5); Sj = representationQuantiVar( xj ); % Representation matrice of quantitative data X2 [~,n2] = size(X2); SQuanti = zeros(m,m*n2); for j = 1:n2 xj = X2(:,j); Sj = representationQuantiVar(xj); SQuanti(:,(j-1)*m+1:j*m) = Sj; end
Using INDSCAL or INDORT model to analysis the three way array
S = [SBin, SQuanti]; % similarity matrices %size(S) % check the dimensions R = 3; % number of PCs start = 1; % using 0 as initilization conv = 1e-8; % convergence criteria method = 1; % using alternating minimization algorithm %[AHat, BHat, out] = INDS(S, R, conv, start, method); % INDSCAL model [AHat, BHat, out] = INDORT(S, R, conv, start); % INDORT model
r 3 conv 1.0000e-08 start 1 total ssq is 35623400 indort value at iter 1 is 33494622.482448958000 ; indort value at iter 2 is 31970714.671496097000 ; indort value at iter 3 is 31798195.084478199000 ; indort value at iter 4 is 31757477.398271054000 ; indort value at iter 5 is 31726247.219856549000 ; indort value at iter 6 is 31689102.647111282000 ; indort value at iter 7 is 31644724.743494246000 ; indort value at iter 8 is 31595499.671277210000 ; indort value at iter 9 is 31545781.406881247000 ; indort value at iter 10 is 31500298.525592636000 ; indort value at iter 11 is 31462437.008664176000 ; indort value at iter 12 is 31433398.817583825000 ; indort value at iter 13 is 31412537.935417455000 ; indort value at iter 14 is 31398264.026650302000 ; indort value at iter 15 is 31388826.991981395000 ; indort value at iter 16 is 31382731.375309817000 ; indort value at iter 17 is 31378854.034704395000 ; indort value at iter 18 is 31376412.091776252000 ; indort value at iter 19 is 31374883.905362945000 ; indort value at iter 20 is 31373931.410590142000 ; indort value at iter 21 is 31373339.252923090000 ; indort value at iter 22 is 31372971.709731728000 ; indort value at iter 23 is 31372743.815576706000 ; indort value at iter 24 is 31372602.602756131000 ; indort value at iter 25 is 31372515.137895375000 ; indort value at iter 26 is 31372460.978183053000 ; indort value at iter 27 is 31372427.447423723000 ; indort value at iter 28 is 31372406.690600850000 ; indort value at iter 29 is 31372393.842303269000 ; indort value at iter 30 is 31372385.889717877000 ; indort value at iter 31 is 31372380.967552528000 ; indort value at iter 32 is 31372377.921104252000 ; indort value at iter 33 is 31372376.035614111000 ; indort value at iter 34 is 31372374.868671089000 ; indort value at iter 35 is 31372374.146447916000 ; indort value at iter 36 is 31372373.699465435000 ; indort value at iter 37 is 31372373.422830053000 ; Final indort value at iter 37 is 31372373.422830053000 Elapsed time ans = 8.1870
Convergence behavior of the algorithm
figure plot(out.obj(2:end)); xlabel('iterations'); ylabel('object loss'); title('convergence behavior');

Variation explained ratios
Variation explained of the model and PCs
[varExp_total, varExp_PCs] = representation_varExp(S,R,AHat,BHat); varExp_total = varExp_total*100; varExp_PCs = varExp_PCs*100; % Variation explained in binary X1 by the model [varExp_X1, varExp_PCs_X1] = representation_varExp(SBin,R,AHat,BHat(1:n1,:)); varExp_X1 = varExp_X1*100; varExp_PCs_X1 = varExp_PCs_X1*100; % Variation explained in quantitative X2 by the model [varExp_X2, varExp_PCs_X2] = representation_varExp(SQuanti,R,AHat,BHat(n1+1:end,:)); varExp_X2 = varExp_X2*100; varExp_PCs_X2 = varExp_PCs_X2*100; disp('total variation explained for the full data sets or each single data set:'); [varExp_total, varExp_X1, varExp_X2] disp('variaiton explained for different PCs for the full data sets or each single data set:'); [varExp_PCs', varExp_PCs_X1', varExp_PCs_X2']
total variation explained for the full data sets or each single data set: ans = 11.9332 3.0104 15.5836 variaiton explained for different PCs for the full data sets or each single data set: ans = 0.1676 0.0476 0.2166 9.4014 2.4013 12.2651 2.3643 0.5615 3.1018
Score and loading plots
B1Hat = BHat(1:n1,:); % loading matrix for Binary data X1 B2Hat = BHat(n1+1:end,:); % loading matrix for quantitative data X2 figure subplot(1,3,1) plot(AHat(:,1), AHat(:,2), 'o'); xlabel('PC1'); ylabel('PC2'); title('score plot'); %xlim([-0.2,0.2]); ylim([-0.2,0.2]); subplot(1,3,2) plot(B1Hat(:,1), B1Hat(:,2), 'or'); xlabel('PC1'); title('loading plot of X1'); %xlim([0, 26]); ylim([0, 26]); subplot(1,3,3); plot(B2Hat(:,1), B2Hat(:,2), 'o'); xlabel('PC1'); title('loading plot of X2'); %xlim([0, 150]); ylim([0, 150]);
