de Rigo, D. (2012). Detecting general multi-dimensional nonlinear correlations: the module "dist_corr" of the Mastrave modelling library. In: Semantic Array Programming with Mastrave - Introduction to Semantic Computational Modelling. http://mastrave.org/doc/mtv_m/dist_corr

## Detecting general multi-dimensional nonlinear correlations: the module "dist_corr" of the Mastrave modelling library

Daniele de Rigo

Abstract: Linear correlation analysis of complex nonlinear physical or computationally derived quantities, although straightforward to implement with basic numerical tools, may be far from optimal in assessing the actual strength of the relationships between quantities. Moreover, in many applications the interest lies not only in the correlation between pairs of quantities, but also in the more general correlation between one group of quantities and another. Multi-dimensional nonlinear correlation analysis may offer elegant and concise ways of exploring unknown complex relationships either among sets of mono-dimensional quantities or among logically connected groups of quantities. Brownian Distance Correlation (BDC) has been proven to be a powerful generalization of the classical linear correlation, able to detect “all kinds of possible relationships” between real-valued random variables. A semantic array programming implementation of BDC within the Mastrave modelling library, extended to also support multi-dimensional analysis between groups of quantities, is described here. The implementation also allows user-defined metrics to be explored.

The file dist_corr.m is part of Mastrave.

Mastrave is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

Mastrave is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with Mastrave. If not, see http://www.gnu.org/licenses/.

#### Function declaration

    [d_cor, d_cov, dists, centered_dists] =
        dist_corr( values                                        ,
                   dims      = []                                ,
                   dist_func = @(x,y) sum(abs(x-y).^2,2).^(1/2)  )



#### Description

Correlation analysis of complex nonlinear physical or computationally derived quantities may be misleading when only linear components are considered. Linear correlation analysis, although straightforward to implement with basic numerical tools, may be far from optimal in assessing the actual strength of the relationships between quantities.

Moreover, in many applications the interest lies not only in the correlation between pairs of quantities, but also in the more general correlation between one group of quantities and another. Multi-dimensional nonlinear correlation analysis may offer elegant and concise ways of exploring unknown complex relationships either among sets of mono-dimensional quantities or among logically connected groups of quantities.

Brownian Distance Correlation, BDC (Szekely et al. 2007; Szekely and Rizzo 2009) has been proven to be an elegant and powerful generalization of the classical linear correlation:
"While the standardized product-moment covariance, Pearson
correlation, measures the degree of linear relationship between
two real-valued variables, [...] standardized Brownian covariance
measures the degree of all kinds of possible relationships between
two real-valued random variables" (Szekely and Rizzo 2009).

Since several multi-dimensional metric spaces might be suitable for assessing BDC-based statistics (Lyons, 2011), it may be useful to treat the metric as a conceptual parameter of the BDC analysis and to explore custom metrics.

A semantic array programming (de Rigo 2012 b; 2012 c) implementation of BDC, extended to also support multi-dimensional analysis between groups of quantities, is described here (de Rigo, 2012 a). The implementation also allows user-defined metrics to be explored.

@dist_corr is a module for computing the Brownian Distance Correlation d_cor of a numeric matrix values. The matrix values is composed of N row vectors representing N points in an n-dimensional space (so that the n adjacent columns of values refer to n different dimension coordinates). If the module is invoked by passing only the dataset values, a square matrix of distance correlations with n rows and n columns is returned as d_cor, together with a matrix of distance covariances of the same size as d_cov.
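For two mono-dimensional quantities, the sample distance correlation as defined by Szekely et al. (2007) can be sketched in a few lines of GNU Octave/MATLAB. This is only an illustration under stated assumptions, not the Mastrave implementation: the helper names pair_dist, centre and dcov2 are hypothetical, and the toy data are chosen so that the Pearson correlation is exactly zero although y is a deterministic function of x.

```matlab
% Illustrative sketch only (hypothetical helper names, not Mastrave code).
x = (-5:5).';                 % N-by-1 sample, symmetric around zero
y = x.^2;                     % nonlinear function of x

% Pearson (linear) correlation, computed by hand: exactly 0 for these data
xc = x - mean( x );
yc = y - mean( y );
r  = sum( xc .* yc ) / sqrt( sum( xc.^2 ) * sum( yc.^2 ) );

% N-by-N matrix of pairwise distances |v_j - v_k|
pair_dist = @(v) abs( bsxfun( @minus, v, v.' ) );
% Double centering: subtract row and column means, add the grand mean
centre    = @(D) D - bsxfun( @plus, mean( D, 1 ), mean( D, 2 ) ) + mean( D(:) );
% Squared sample distance covariance: mean of the element-wise products
dcov2     = @(A,B) mean( A(:) .* B(:) );

A = centre( pair_dist( x ) );
B = centre( pair_dist( y ) );

d_cov_xy = sqrt( dcov2( A, B ) );
d_cor_xy = sqrt( dcov2( A, B ) / sqrt( dcov2( A, A ) * dcov2( B, B ) ) );
```

Here r is zero while d_cor_xy is strictly positive, which is the kind of nonlinear dependence that the linear coefficient cannot detect.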

It is possible to pass as input a cell-array dims of subsets of the n dimensions, each subset being considered as a separate multi-dimensional quantity. For example, assuming that values has 7 dimensions, if dims were { 1:3 , 4 , 6 , [5 7] } then the 7 dimensions would be considered as composing 4 multi-dimensional quantities (a 3-dimensional quantity made up of the first three dimensions, two mono-dimensional quantities made up of the fourth and sixth dimensions respectively, and a final 2-dimensional quantity made up of the fifth and seventh dimensions). The resulting d_cor and d_cov matrices would have size 4x4.
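How dims partitions the columns of values can be illustrated without calling the library. The sketch below uses toy data and hypothetical variable names (groups, widths) to extract the sub-matrix of each multi-dimensional quantity. The final commented call assumes that Mastrave is on the path and follows the signature given in the function declaration.

```matlab
% Illustrative sketch: toy data, hypothetical variable names.
N      = 5;
values = reshape( 1:N*7, N, 7 );        % N points in a 7-dimensional space
dims   = { 1:3 , 4 , 6 , [5 7] };       % grouping used in the text above

% One N-by-numel(dims{i}) sub-matrix per multi-dimensional quantity
groups = cellfun( @(d) values( :, d ), dims, 'UniformOutput', false );
widths = cellfun( @(g) size( g, 2 ), groups );    % [3 1 1 2]

% With Mastrave available, the corresponding analysis would be:
%    [ d_cor, d_cov ] = dist_corr( values, dims );   % both of size 4x4
```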

##### Extension to other metric spaces

Lyons (2011) extended the class of metric spaces for which the distance covariance of a pair of random variables { X, Y }, where X and Y have finite first moments, is 0 if and only if X and Y are independent. Valid metric spaces are those having strong negative type (a strengthening of the strict negative type condition). While Lp spaces for p in [1, 2] have negative type, R^n spaces with l-p metric are not of negative type when n is in [3, inf) and p is in (2, inf). If n == 2 and p == 1 then the metric space is not of strict negative type.

Euclidean spaces, i.e. p == 2 and n in [1, inf), have strong negative type. Therefore, by default the module uses a Euclidean distance function dist_func to compute distances for the multi-dimensional quantities defined by dims.
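The l-p family of distances discussed above can be written as function handles with the same calling convention as the default dist_func. The following is a minimal sketch: lp_dist is a hypothetical helper, not part of Mastrave, and whether a given combination of p and n satisfies the conditions required by the theory should be checked against Lyons (2011) before use.

```matlab
% Hypothetical helper: returns an l-p distance handle for a given p.
lp_dist  = @(p) @(x,y) sum( abs( x - y ).^p, 2 ).^( 1/p );

d_euclid = lp_dist( 2 );     % same values as the default dist_func
d_taxi   = lp_dist( 1 );     % l-1 ("taxicab") distance

x = [ 0 0 ; 1 1 ];           % two points, one per row
y = [ 3 4 ; 1 1 ];

d2 = d_euclid( x, y );       % [5; 0]
d1 = d_taxi( x, y );         % [7; 0]
```

Each handle operates row-wise, so it returns one distance per pair of corresponding rows of x and y.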

Custom distance functions can be passed as an input argument. To be computationally efficient, the array-programming version of the algorithm relies on two assumptions which reduce the generality of the externally provided distance functions. First, the distance function dist_func is expected to be symmetric, so that, considering the generic i-th sub-space

    val_i = values( : , dims{i} )

the distance between its j-th and k-th points (rows), computed with

    dist_func( val_i( j , : ) , val_i( k , : ) )

must be equivalent to

    dist_func( val_i( k , : ) , val_i( j , : ) )

Second, the distance function is expected to be invariant to the addition of zero-filled dimensions, so that, after appending z zero-filled dimensions to the i-th sub-space

    val_iz = [ val_i , zeros( N, z ) ]

the distance computed with

    dist_func( val_i( j , : ) , val_i( k , : ) )

must be equivalent to

    dist_func( val_iz( j , : ) , val_iz( k , : ) )

These are usually reasonable assumptions, and they are always satisfied within Euclidean metric spaces.
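Both assumptions can be checked numerically on a candidate distance function before passing it to the module. The sketch below is only an illustration with toy data: the names is_symmetric, is_pad_invariant, bad_func and bad_ok are hypothetical, and a check on a few points can reveal a violation but cannot prove that the assumptions hold in general.

```matlab
% Candidate distance: the default from the function declaration
dist_func = @(x,y) sum( abs( x - y ).^2, 2 ).^( 1/2 );

N = 6;  z = 2;
val_i  = [ (1:N).' , ( (N:-1:1).' ).^2 ];   % toy sub-space, N-by-2
val_iz = [ val_i , zeros( N, z ) ];         % same points, z zero-filled columns

j = [1 2 3];  k = [4 5 6];                  % pairs of points (rows) to compare
d_jk = dist_func( val_i(  j, : ), val_i(  k, : ) );
d_kj = dist_func( val_i(  k, : ), val_i(  j, : ) );
d_z  = dist_func( val_iz( j, : ), val_iz( k, : ) );

tol = 1e-12;
is_symmetric     = all( abs( d_jk - d_kj ) <= tol );   % first assumption
is_pad_invariant = all( abs( d_jk - d_z  ) <= tol );   % second assumption

% Counter-example: averaging over the columns depends on their number,
% so appending zero-filled columns changes the result.
bad_func = @(x,y) mean( abs( x - y ), 2 );
bad_ok   = all( abs( bad_func( val_i(  j, : ), val_i(  k, : ) ) - ...
                     bad_func( val_iz( j, : ), val_iz( k, : ) ) ) <= tol );
```

With these data the default distance passes both checks, while bad_func fails the zero-padding check.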

##### References

[1] Szekely, G.J., Rizzo, M.L., Bakirov, N.K. (2007): Measuring and
testing dependence by correlation of distances. The Annals of
Statistics 2007, Vol. 35, No. 6, pp. 2769-2794.
DOI: 10.1214/009053607000000505.
Free access version:
http://personal.bgsu.edu/~mrizzo/energy/AOS0283-reprint.pdf

[2] de Rigo, D. (2012 a). Multidimensional Distance Correlation
Analysis with User-defined Metrics. Free Software and Semantic
Array Programming Research (submitted).

[3] de Rigo, D. (2012 b). Semantic Array Programming with Mastrave
- Introduction to Semantic Computational Modelling. The Mastrave
project. http://mastrave.org/doc/MTV-1.012-1 (to appear).

[4] de Rigo, D. (2012 c). Semantic Array Programming for Environmental
Modelling: Application of the Mastrave Library. 6th International
Congress on Environmental Modelling and Software, (iEMSs 2012)
Managing resources of a limited planet (accepted).

[5] Szekely, G.J., Rizzo, M.L. (2009): Brownian Distance Covariance.
The Annals of Applied Statistics 2009, Vol. 3, No. 4, pp. 1236-1265.
DOI: 10.1214/09-AOAS312.
Free access version:
http://personal.bgsu.edu/~mrizzo/energy/AOAS312.pdf

[6] Szekely, G.J., Rizzo, M.L. (2009b): Rejoinder: Brownian Distance
Covariance. The Annals of Applied Statistics 2009, Vol. 3, No. 4,
pp. 1303-1308. DOI: 10.1214/09-AOAS312REJ.
Free access version:
http://personal.bgsu.edu/~mrizzo/energy/AOAS312REJ.pdf

[7] Lyons, R. (2011): Distance Covariance in Metric Spaces.
Arxiv preprint. arXiv: 1106.5758.
Free access version:
http://arxiv.org/pdf/1106.5758

#### Input arguments


values            ::numeric,matrix::
Numeric matrix, each row of which represents a
point (vector) in an n-dimensional space (so that
the n adjacent columns of values are expected to refer
to n different dimension coordinates).

dims              ::cellindex::
Cell-array of dimension sets (or just a single dimension
set). The i-th set lists which of the n dimensions of
values (i.e. which of its n columns) are to be
considered as part of the same multi-dimensional quantity.
If dims is an empty matrix [], each dimension is
considered separately, so that [] is equivalent to
{ 1, 2, ... , n }.
If omitted, the default value is an empty matrix: [].

dist_func         ::function_handle::
Distance function to compute the distance between two
instances of a given coordinate (i.e. a given column)
referring to two points (i.e. two rows) of values.
If omitted, the default value is the Euclidean
distance:
@(x,y) sum(abs(x-y).^2,2).^(1/2)
The default distance works with real or complex
numbers; x and y may be matrices of the same size.
A custom distance function must also accept
matrices. If both inputs x and y have size [N,n],
dist_func must return a column vector of size [N,1].
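The input and output contract of dist_func can be verified directly on a small example. The sketch below applies the default handle to complex-valued toy inputs (the variable names are illustrative); the commented call at the end assumes that Mastrave is on the path and uses the argument order of the function declaration.

```matlab
% Default distance function, as given in the function declaration
dist_func = @(x,y) sum( abs( x - y ).^2, 2 ).^( 1/2 );

% Two [N,n] inputs with N = 2 points and n = 3 coordinates
x = [ 0 0 0 ; 1+1i 0 0 ];
y = [ 2 3 6 ; 1-1i 0 0 ];

d = dist_func( x, y );       % one distance per row: [7; 2]
size( d )                    % [N,1], here [2 1]

% A handle respecting this contract would be passed as the third argument:
%    d_cor = dist_corr( values, dims, dist_func );
```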



#### Example of usage


    % Basic usage
    %
    % Validate the array-programming oriented implementation against
    % the reference results reported by Szekely and Rizzo (2009).
    %
    % Reference data:
    % Eckerle, K., NIST (1979). Circular Interference Transmittance
    % Study. Free access at
    %    http://www.itl.nist.gov/div898/strd/nls/data/eckerle4.shtml
    [ d, url, ref, m, p ] = get_reference_data( 'Eckerle4' );
    [ y, x ]              = mdeal( d );
    y_est                 = m{1,1}( p{1}, x );   % estimated transmittance
    r                     = y_est - y;           % residuals

    subplot( 1, 2, 1); plot( x, y, 'o' )
    xlabel( 'wavelength' ); ylabel( 'transmittance' );
    subplot( 1, 2, 2); plot( x, r, 'o-' )
    xlabel( 'wavelength' ); ylabel( 'residuals'     );

    % dist_corr returns a 2x2 matrix here: element (2) is the
    % off-diagonal distance correlation between the two columns.
    dcor_xy               = dist_corr( [ x, y ] )
    % Expected result: 0.4275431 (Szekely and Rizzo 2009, p. 1254)
    % Compare with a tolerance of half a unit in the 7th decimal
    % instead of testing floating-point values for exact equality.
    assert( abs( dcor_xy(2) - 0.4275431 ) < 5e-8 )

    dcor_ry               = dist_corr( [ r, y ] )
    % Expected result: 0.4285534 (Szekely and Rizzo 2009, p. 1254)
    assert( abs( dcor_ry(2) - 0.4285534 ) < 5e-8 )


See also:
train_pca

Keywords:
brownian distance correlation, correlation

Version: 0.4.8

#### Support

The Mastrave modelling library is committed to providing reusable and general, but also robust and scalable, modules for research modellers dealing with computational science.  You can help the Mastrave project by providing feedback on unexpected behaviours of this module.  Despite all efforts, all of us, whether developers or users, (should) know that errors are unavoidable.  However, the free software paradigm successfully highlights that scientific knowledge freedom also implies an impressive opportunity to collectively evolve the tools and ideas upon which our daily work is based.  Reporting a problem that you found using Mastrave may help the developer team to find a possible bug.  Please be aware that Mastrave is entirely based on voluntary efforts: in order for your help to be as effective as possible, please read carefully the section on reporting problems.  Thank you for your collaboration.

Copyright (C) 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016 Daniele de Rigo