de Rigo, D. (2012). *Detecting general multi-dimensional nonlinear correlations: the module "dist_corr" of the Mastrave modelling library.* In: Semantic Array Programming with Mastrave - Introduction to Semantic Computational Modelling. http://mastrave.org/doc/mtv_m/dist_corr

## Detecting general multi-dimensional nonlinear correlations: the module "dist_corr" of the Mastrave modelling library

**Daniele de Rigo**

#### Copyright and license notice of the function dist_corr

Copyright © 2009,2010,2011,2012 Daniele de Rigo

The file dist_corr.m is part of Mastrave.

Mastrave is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

Mastrave is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with Mastrave. If not, see http://www.gnu.org/licenses/.

#### Function declaration

[d_cor,d_cov,dists,centered_dists] = dist_corr(values,dims= [] ,dist_func= @(x,y) sum(abs(x-y).^2,2).^(1/2) )

#### Description

Correlation analysis of complex nonlinear quantities, whether physical or computationally derived, may be misleading when only linear components are considered. Linear correlation analysis, although straightforward to implement with basic numerical tools, may be far from optimal in assessing the actual strength of existing relationships between quantities.

Moreover, in many applications the correlation of interest is not only that between pairs of quantities, but also the more general correlation between one group of quantities and another. Multi-dimensional nonlinear correlation analysis may offer elegant and concise ways of exploring unknown complex relationships, either among the quantities of a dataset or among logically connected groups of quantities.

Brownian Distance Correlation, BDC (Szekely et al. 2007; Szekely and Rizzo 2009), has been proven to be an elegant and powerful generalization of the classical linear correlation: "While the standardized product-moment covariance, Pearson correlation, measures the degree of linear relationship between two real-valued variables, [...] standardized Brownian covariance measures the degree of all kinds of possible relationships between two real-valued random variables" (Szekely and Rizzo 2009).
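
For reference, the sample statistics behind this quotation can be written compactly. The following is a sketch based on the sample definitions given by Szekely et al. (2007) for N paired observations (X_k, Y_k); the symbols a, b, A, B are notation chosen here for illustration and are not taken from the Mastrave source. When the denominator of the last expression is zero, the distance correlation is defined as zero.

```latex
\begin{aligned}
a_{kl} &= \lVert X_k - X_l \rVert, &
b_{kl} &= \lVert Y_k - Y_l \rVert, \qquad k,l = 1,\dots,N \\
A_{kl} &= a_{kl} - \bar{a}_{k\cdot} - \bar{a}_{\cdot l} + \bar{a}_{\cdot\cdot}, &
B_{kl} &= b_{kl} - \bar{b}_{k\cdot} - \bar{b}_{\cdot l} + \bar{b}_{\cdot\cdot} \\
\mathrm{dCov}_N^2(X,Y) &= \frac{1}{N^2} \sum_{k,l=1}^{N} A_{kl}\, B_{kl}, &
\mathrm{dVar}_N^2(X) &= \mathrm{dCov}_N^2(X,X) \\
\mathrm{dCor}_N^2(X,Y) &= \frac{\mathrm{dCov}_N^2(X,Y)}
      {\sqrt{\mathrm{dVar}_N^2(X)\,\mathrm{dVar}_N^2(Y)}}
\end{aligned}
```

Here the barred terms are the row mean, column mean and grand mean of the corresponding distance matrix.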

Since several multi-dimensional metric spaces may be suitable for computing BDC-based statistics (Lyons, 2011), it may be useful to treat the metric as a configurable parameter of the BDC analysis.

A semantic array programming (de Rigo 2012 b; 2012 c) implementation of BDC, extended to also consider multi-dimensional analysis between groups of quantities, is described here (de Rigo, 2012 a). The implementation also allows user-defined metrics to be explored.

`dist_corr` is a module for computing the Brownian Distance Correlation `d_cor` of a numeric matrix **values**. The matrix **values** is composed of N row-vectors representing N vectorial points in an n-dimensional space (so that the n adjacent columns of **values** refer to n different dimension coordinates). If invoked by passing only the dataset **values**, a square matrix of distance correlations with n rows and columns will be returned as **d_cor**, and a matrix of distance covariances with the same size as **d_cov**.

It is possible to pass as input a cell-array **dims** of subsets of the n dimensions to be considered as composing separate multi-dimensional quantities. For example, assuming that **values** has 7 dimensions, if **dims** were `{ 1:3 , 4 , 6 , [5 7] }` then the 7 dimensions would be considered as composing 4 multi-dimensional quantities (a 3-dimensional quantity made by the first three dimensions, two other monodimensional quantities respectively made by the fourth and sixth dimensions, and a final 2-dimensional quantity made by the fifth and seventh dimensions). The resulting **d_cor** and **d_cov** matrices would have 4x4 size.
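
The grouping of dimensions described above can be illustrated with a short, self-contained sketch in plain Octave. It is not the code of `dist_corr.m`: the toy dataset, the helper `center`, and the variable names `C`, `dcov2` and `dvar2` are assumptions introduced here, and the sketch works with the squared sample distance covariance without claiming which convention the module uses for its **d_cov** output.

```octave
% Toy data (assumption): N = 8 points in n = 7 dimensions, deterministic.
N = 8;  n = 7;
values    = sin( ( 1:N )' * ( 1:n ) );
dims      = { 1:3, 4, 6, [5 7] };                        % 4 quantities
dist_func = @(x,y) sum( abs( x - y ).^2, 2 ).^( 1/2 );   % default metric

% Index matrices enumerating all ordered pairs ( j, k ) of points.
[ j, k ] = ndgrid( 1:N, 1:N );

% Double centering: subtract row and column means, add the grand mean.
center = @(D) bsxfun( @minus, bsxfun( @minus, D, mean( D, 1 ) ), ...
                      mean( D, 2 ) ) + mean( D(:) );

G = numel( dims );
C = cell( 1, G );                 % centred distance matrices, one per quantity
for g = 1:G
   val_g = values( :, dims{g} );                 % N points of the g-th quantity
   D     = dist_func( val_g( j(:), : ), val_g( k(:), : ) );
   C{g}  = center( reshape( D, N, N ) );
end

% Squared sample distance covariance for every pair of quantities
% (clipped at zero to guard against round-off).
dcov2 = zeros( G );
for g = 1:G
   for h = 1:G
      dcov2( g, h ) = max( mean( C{g}(:) .* C{h}(:) ), 0 );
   end
end

dvar2 = diag( dcov2 );                               % squared distance variances
d_cor = sqrt( dcov2 ./ sqrt( dvar2 * dvar2' ) );     % G x G distance correlations
disp( size( d_cor ) )
```

With four entries in `dims`, the resulting matrix is 4x4, symmetric, and has ones on its diagonal, since each quantity is perfectly dependent on itself.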

##### Extension to other metric spaces

Lyons (2011) extended the class of metric spaces for which the distance covariance of a pair of random variables { X, Y } (where X and Y have finite first moments) is 0 if and only if X and Y are independent. Valid metric spaces are those having strong negative type (a strengthening of the strict negative type condition). While Lp spaces for p in [1, 2] have negative type, R^n spaces with l-p metric are not of negative type when n is in [3, inf) and p is in (2, inf). If n == 2 and p == 1 then the metric space is not of strict negative type.

Euclidean spaces (i.e. p == 2, n in [1, inf)) have strong negative type. Therefore, the utility uses by default a Euclidean distance function `dist_func` to compute distances for multi-dimensional quantities as defined by **dims**.

Custom distance functions can be passed as input argument. To be computationally efficient, the array-programming version of the algorithm relies on two assumptions which reduce the generality of the externally provided distance functions.
First, the distance function ` dist_func ` is expected to be symmetric
so that considering the generic i-th sub-space

` val_i = values( : , dims{i} )`

the distance computed between its j-th and k-th points (rows) with

` dist_func( val_i( j , : ) , val_i( k , : ) ) `

must be equivalent to

` dist_func( val_i( k , : ) , val_i( j , : ) ) `

Second, the distance function is expected to be invariant in case additional dimensions were added and filled with zeros, so that adding z zero-filled dimensions to the i-th sub-space

` val_iz = [ val_i , zeros( N, z ) ]`

the distance computed with

` dist_func( val_i( j , : ) , val_i( k , : ) ) `

must be equivalent to

` dist_func( val_iz( j , : ) , val_iz( k , : ) ) `

These are usually reasonable assumptions which are always satisfied within Euclidean metric spaces.
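
Before passing a custom distance function, both assumptions can be checked numerically on a small sample. The sketch below is a hypothetical helper, not part of Mastrave: the names `is_symmetric`, `is_pad_invariant`, `manhattan` and `mean_abs` and the sample points are introduced here only for illustration.

```octave
% Sample points (assumption): N = 4 points of a 2-dimensional sub-space.
val_i  = [ 0 1; 2 -1; 3 5; -4 2 ];
N      = size( val_i, 1 );
z      = 3;                                   % number of zero-filled dimensions
val_iz = [ val_i, zeros( N, z ) ];

% All ordered pairs ( j, k ) of points.
[ j, k ] = ndgrid( 1:N, 1:N );  j = j(:);  k = k(:);

% First assumption: swapping the two arguments leaves distances unchanged.
is_symmetric = @(f) max( abs( f( val_i( j, : ), val_i( k, : ) ) - ...
                              f( val_i( k, : ), val_i( j, : ) ) ) ) < 1e-12;
% Second assumption: zero-filled extra dimensions leave distances unchanged.
is_pad_invariant = @(f) max( abs( f( val_i( j, : ),  val_i( k, : ) ) - ...
                                  f( val_iz( j, : ), val_iz( k, : ) ) ) ) < 1e-12;

euclid    = @(x,y) sum( abs( x - y ).^2, 2 ).^( 1/2 );   % default of dist_corr
manhattan = @(x,y) sum( abs( x - y ), 2 );
mean_abs  = @(x,y) mean( abs( x - y ), 2 );   % rescales with the column count

% One row per distance function: [ symmetric, invariant to zero padding ]
checks = [ is_symmetric( euclid ),    is_pad_invariant( euclid );
           is_symmetric( manhattan ), is_pad_invariant( manhattan );
           is_symmetric( mean_abs ),  is_pad_invariant( mean_abs ) ];
disp( checks )
```

In this sketch the averaged distance `mean_abs` is symmetric but fails the second check, because its value depends on the number of columns. These checks only cover the two computational assumptions; they say nothing about whether the metric space has strong negative type.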

##### References

[1] Szekely, G.J., Rizzo, M.L., Bakirov, N.K. (2007): Measuring and testing dependence by correlation of distances. The Annals of Statistics 2007, Vol. 35, No. 6, pp. 2769-2794. DOI: 10.1214/009053607000000505. Free access version: http://personal.bgsu.edu/~mrizzo/energy/AOS0283-reprint.pdf

[2] de Rigo, D. (2012 a). Multidimensional Distance Correlation Analysis with User-defined Metrics. Free Software and Semantic Array Programming Research (submitted).

[3] de Rigo, D. (2012 b). Semantic Array Programming with Mastrave - Introduction to Semantic Computational Modelling. The Mastrave project. http://mastrave.org/doc/MTV-1.012-1 (to appear).

[4] de Rigo, D. (2012 c). Semantic Array Programming for Environmental Modelling: Application of the Mastrave Library. 6th International Congress on Environmental Modelling and Software (iEMSs 2012), Managing resources of a limited planet (accepted).

[5] Szekely, G.J., Rizzo, M.L. (2009): Brownian Distance Covariance. The Annals of Applied Statistics 2009, Vol. 3, No. 4, pp. 1236-1265. DOI: 10.1214/09-AOAS312. Free access version: http://personal.bgsu.edu/~mrizzo/energy/AOAS312.pdf

[6] Szekely, G.J., Rizzo, M.L. (2009b): Rejoinder: Brownian Distance Covariance. The Annals of Applied Statistics 2009, Vol. 3, No. 4, pp. 1303-1308. DOI: 10.1214/09-AOAS312REJ. Free access version: http://personal.bgsu.edu/~mrizzo/energy/AOAS312REJ.pdf

[7] Lyons, R. (2011): Distance Covariance in Metric Spaces. arXiv preprint. arXiv: 1106.5758. Free access version: http://arxiv.org/pdf/1106.5758

#### Input arguments

**values** `::numeric,matrix::` Numeric matrix, each row of which represents a vectorial point in an n-dimensional space (so that the n adjacent columns of **values** are expected to refer to n different dimension coordinates).

**dims** `::cellindex::` Cell-array of dimension sets (or just a single dimension set), the i-th of which lists which of the n dimensions of **values** (i.e. which of its n columns) need to be considered as part of the same multi-dimensional quantity. If **dims** is an empty matrix [], all dimensions will be separately included, so that [] is considered equivalent to { 1, 2, ... , n }. If omitted, the default value is an empty matrix: [].

**dist_func** `::function_handle::` Distance function to compute the distance between two instances of a given coordinate (i.e. a given column) referring to two points (i.e. two rows) of **values**. If omitted, the default value is the Euclidean distance: `@(x,y) sum(abs(x-y).^2,2).^(1/2)`. The default distance is suitable for real or complex numbers, and x and y may be matrices having the same size. A custom-provided distance function must be able to work with matrices too. If the size of both inputs x and y is [N,n], the expected output of **dist_func** must be a column vector of size [N,1].
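
The size contract of the distance function can be seen on a few hand-picked rows. The sketch below only evaluates the default handle quoted in the function declaration; the sample inputs and the Chebyshev-like handle `cheb` are assumptions added for illustration.

```octave
% Default distance function, as given in the function declaration.
dist_func = @(x,y) sum( abs( x - y ).^2, 2 ).^( 1/2 );

% Real inputs of size [N,n] = [3,2]: one distance per row is returned.
x = [ 0 0; 1 1; 1 2 ];
y = [ 3 4; 1 1; 4 6 ];
d_real = dist_func( x, y )        % column vector of size [3,1]: 5, 0, 5

% Complex inputs of size [2,1]: abs() yields the modulus of the difference.
xc = [ 3+4i; 1i ];
yc = [ 0;    1i ];
d_cplx = dist_func( xc, yc )      % column vector of size [2,1]: 5, 0

% A custom handle must follow the same contract, e.g. the maximum
% coordinate-wise difference (hypothetical example).
cheb   = @(x,y) max( abs( x - y ), [], 2 );
d_cheb = cheb( x, y )             % column vector of size [3,1]: 4, 0, 4
```

Reducing along the second dimension (the `2` passed to `sum` or `max`) is what keeps the output a column vector with one distance per pair of rows.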

#### Example of usage

    % Basic usage
    %
    % Validating the array-programming oriented implementation
    % w.r.t the reference data used in Szekely and Rizzo (2009).
    %
    % Reference data:
    %    Eckerle, K., NIST (1979). Circular Interference Transmittance
    %    Study. Free access at
    %    http://www.itl.nist.gov/div898/strd/nls/data/eckerle4.shtml
    [ d, url, ref, m, p ] = get_reference_data( 'Eckerle4' );
    [ y, x ] = mdeal( d );
    y_est    = m{1,1}( p{1}, x );
    r        = y_est - y;
    subplot( 1, 2, 1); plot( x, y, 'o' )
    xlabel( 'wavelength' ); ylabel( 'transmittance' );
    subplot( 1, 2, 2); plot( x, r, 'o-' )
    xlabel( 'wavelength' ); ylabel( 'residuals' );

    dcor_xy = dist_corr( [ x, y ] )
    % Expected result: 0.4275431 (Szekely and Rizzo 2009, p. 1254)
    assert( 0.4275431 == round( dcor_xy(2) * 10^7 )/10^7 )

    dcor_ry = dist_corr( [ r, y ] )
    % Expected result: 0.4285534 (Szekely and Rizzo 2009, p. 1254)
    assert( 0.4285534 == round( dcor_ry(2) * 10^7 )/10^7 )
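
When the Mastrave reference data are not at hand, the difference between linear and distance correlation can still be seen on a tiny synthetic sample. The following is a minimal sketch in plain Octave, assuming the Euclidean distance and scalar quantities; it does not call `dist_corr`, and the helpers `pdist_1d`, `center` and `dcov2` are names introduced here for illustration.

```octave
% Synthetic sample (assumption): y depends on x only nonlinearly.
x = ( -2:2 )';
y = x.^2;

% Pairwise distance matrix of a column vector (N x N).
pdist_1d = @(v) abs( bsxfun( @minus, v, v.' ) );
% Double centering: subtract row and column means, add the grand mean.
center   = @(D) bsxfun( @minus, bsxfun( @minus, D, mean( D, 1 ) ), ...
                        mean( D, 2 ) ) + mean( D(:) );
% Squared sample distance covariance of two centred distance matrices.
dcov2    = @(P,Q) mean( P(:) .* Q(:) );

A = center( pdist_1d( x ) );
B = center( pdist_1d( y ) );
d_cor = sqrt( dcov2( A, B ) / sqrt( dcov2( A, A ) * dcov2( B, B ) ) );

% Pearson linear correlation, computed from centred values.
xc = x - mean( x );
yc = y - mean( y );
pearson = sum( xc .* yc ) / sqrt( sum( xc.^2 ) * sum( yc.^2 ) );

fprintf( 'Pearson: %.4f   distance correlation: %.4f\n', pearson, d_cor );
```

For this symmetric sample the Pearson correlation vanishes, while the distance correlation is clearly positive and so reveals the dependence of y on x.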

See also: train_pca

Keywords: brownian distance correlation, correlation

Version: 0.4.8

#### Support

The Mastrave modelling library is committed to providing reusable and general, but also robust and scalable, modules for research modellers dealing with computational science. You can help the Mastrave project by providing feedback on unexpected behaviours of this module. Despite all efforts, all of us, developers and users alike, (should) know that errors are unavoidable. However, the free software paradigm successfully highlights that scientific knowledge freedom also implies an impressive opportunity to collectively evolve the tools and ideas upon which our daily work is based. Reporting a problem that you found using Mastrave may help the developer team to find a possible bug. Please be aware that Mastrave is entirely based on voluntary efforts: in order for your help to be as effective as possible, please read carefully the section on reporting problems. Thank you for your collaboration.