ICA and ISA Using Schweizer-Wolff Measure of Dependence

Sergey Kirshner          sergey@cs.ualberta.ca
Barnabás Póczos          poczos@cs.ualberta.ca
AICML, Department of Computing Science, University of Alberta, Edmonton, Alberta, Canada T6G 2E8

Abstract

We propose a new algorithm for independent component and independent subspace analysis problems. This algorithm uses a contrast based on the Schweizer-Wolff measure of pairwise dependence (Schweizer & Wolff, 1981), a non-parametric measure computed on pairwise ranks of the variables. Our algorithm frequently outperforms state-of-the-art ICA methods in the normal setting, is significantly more robust to outliers in the mixed signals, and performs well even in the presence of noise. Our method can also be used to solve independent subspace analysis (ISA) problems by grouping signals recovered by ICA methods. We provide an extensive empirical evaluation using simulated, sound, and image data.

1. Introduction

Independent component analysis (ICA) (Comon, 1994) deals with the problem of blind source separation under the assumptions that the sources are independent and that they are linearly mixed. ICA has been used in the context of blind source separation and deconvolution, feature extraction, and denoising, and has been successfully applied to many domains including finance, neurobiology, and the processing of fMRI, EEG, and MEG data. For a review of ICA, see Hyvärinen et al. (2001).

Independent subspace analysis (ISA) (also called multi-dimensional ICA and group ICA) is a generalization of ICA that assumes that certain sources depend on each other, while the dependent groups of sources are still independent of each other, i.e., the independent groups are multidimensional. The ISA task has been the subject of extensive research (e.g., Cardoso, 1998; Theis, 2005; Bach & Jordan, 2003; Hyvärinen & Köster, 2006; Póczos & Lőrincz, 2005) and applied, for instance, to EEG-fMRI data.

Our contribution, SWICA, is a new ICA algorithm based on the Schweizer-Wolff (SW) non-parametric dependence measure. SWICA has the following properties:

· SWICA performs comparably to other state-of-the-art ICA methods, outperforming them in a large number of test cases.
· SWICA is extremely robust to outliers as it uses rank values of the signals rather than their actual values.
· SWICA suffers less from the presence of noise than other algorithms.
· The SW measure can be used as the cost function to solve ISA problems by grouping sources recovered by ICA methods.
· SWICA is simple to implement, and the Matlab/C++ code is available for public use.
· On the negative side, SWICA is slower than other methods, limiting its use to sources of moderate dimension, and it requires more samples to demix sources with near-Gaussian distributions.

The paper is organized as follows. An overview of the ICA and ISA problems and methods is presented in Section 2. Section 3 motivates and describes the Schweizer-Wolff dependence measure. Section 4 describes a 2-source version of SWICA, extends it to the d-source problem, describes an application to ISA, and mentions possible approaches for accelerating SWICA. Section 5 provides a thorough empirical comparison of SWICA to other ICA algorithms under different settings and data types. The paper is concluded with a summary in Section 6.
2. ICA and ISA

We consider the following problem. Assume we have d independent one-dimensional sources (random variables) denoted by S^1, ..., S^d. We assume each source emits N i.i.d. samples denoted by s^i_1, ..., s^i_N. Let S = (s^i_j) ∈ R^{d×N} be the matrix of these samples. We assume that these sources are hidden, and that only a matrix X of mixed samples can be observed:

    X = AS,

where A ∈ R^{d×d}. (We further assume that A has full rank d.) The task is to recover the sample matrix S of the hidden sources by finding a demixing matrix W such that Y = WX = (WA)S and the estimated sources Y^1, ..., Y^d are mutually independent. The solution can be recovered only up to a scale and a permutation of the components; thus we assume that the data has been pre-whitened, and it is sufficient to search for an orthogonal matrix W (e.g., Hyvärinen et al., 2001). Additionally, since jointly Gaussian sources are not identifiable under linear transformations, we assume that no more than one source is normally distributed.

There are many approaches to solving the ICA problem, differing both in the objective function designed to measure the independence between the unmixed sources (sometimes referred to as a contrast function) and in the optimization method for that function. The most commonly used objective function is the mutual information (MI)

    J(W) = I(Y^1, ..., Y^d) = \sum_{i=1}^{d} h(Y^i) - h(Y^1, ..., Y^d),        (1)

where h is the differential entropy. Alternatively, one can minimize the sum \sum_{i=1}^{d} h(Y^i) of the univariate entropies, as the joint entropy is constant (e.g., Hyvärinen et al., 2001). Neither of these quantities can be evaluated directly, so approximations are used instead. Among effective methods falling in the former category is KernelICA (Bach & Jordan, 2002); RADICAL (Learned-Miller & Fisher, 2003) and FastICA (Hyvärinen, 1999) approximate the sum of the univariate entropies. There are other possible cost functions, including maximum likelihood, moment-based methods, and correlation-based methods.

While the ICA problem has been well studied in the above formulation, there are a number of variations of it that are the subject of active research. One such formulation is a noisy version of ICA,

    X = AS + ε,        (2)

where the multivariate noise ε is often assumed to be normally distributed. Another related problem occurs when the mixed samples X are corrupted by the presence of outliers. There are many other possibilities that go beyond the scope of this paper.

Of special note is a generalization of ICA in which some of the sources are dependent: independent subspace analysis (ISA). In this case, the mutual information and Shannon entropies from Equation 1 involve multivariate random vectors instead of scalars. The resulting multidimensional entropies are exponentially more difficult to estimate than their scalar counterparts, making the ISA problem more difficult than ICA. However, Cardoso (1998) conjectured that the ISA problem can be solved by first preprocessing the mixtures X with an ICA algorithm and then grouping the estimated components with highest dependence. While the extent of this conjecture is still an open issue, it has been rigorously proven for some distribution types (Szabó et al., 2007). Even without a proof for the general case, a number of algorithms apply this heuristic with success (Cardoso, 1998; Theis, 2007; Bach & Jordan, 2003). There are ISA methods that do not rely on Cardoso's conjecture (e.g., Hyvärinen & Köster, 2006), although they are susceptible to getting trapped in local minima.
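To make the setup concrete, here is a minimal sketch (not part of the original paper; the Laplace source distribution and the variable names are our own choices) that generates independent sources, mixes them with a random full-rank matrix, and pre-whitens the mixtures so that only an orthogonal demixing matrix remains to be found:

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 2, 1000
S = rng.laplace(size=(d, N))                 # independent non-Gaussian sources (rows)
A = rng.normal(size=(d, d))                  # unknown full-rank mixing matrix
X = A @ S                                    # observed mixtures, X = AS

# Pre-whitening: after this step the remaining demixing matrix can be taken orthogonal.
Xc = X - X.mean(axis=1, keepdims=True)
eigval, eigvec = np.linalg.eigh(np.cov(Xc))
Xw = eigvec @ np.diag(eigval ** -0.5) @ eigvec.T @ Xc
print(np.allclose(np.cov(Xw), np.eye(d)))    # whitened mixtures have identity covariance
```

After whitening, any of the contrasts discussed below only needs to search over rotations.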
3. Non-parametric Rank-Based Approach

Most ICA algorithms use an approximation to mutual information (MI) as their objective function, and the quality of the solution thus depends on how accurate the corresponding approximation is. The problem with using MI is that without a parametric assumption on the functional form of the joint distribution, MI cannot be evaluated exactly, and numerical estimation can be both inaccurate and computationally expensive. In this section, we explore other measures of pairwise association as possible ICA contrasts.

We note that the most commonly used measure of correlation, Pearson's linear correlation coefficient, cannot be used as it is invariant under rotations once the data has been centered and whitened. Instead, we focus on measures of dependence of the ranks. Ranks have a number of desirable properties: they are invariant under monotonic transformations of the individual variables, insensitive to outliers, and not very sensitive to small amounts of noise. We found that a dependence measure defined on copulas (e.g., Nelsen, 2006), probability distributions on continuous ranks, has the right properties to be used as a contrast for ICA demixing.

Figure 1. Absolute values of the sample versions of Pearson's ρ_p (solid thin, brown), Kendall's τ (dashed, red), Spearman's ρ (dash-dotted, blue), and Schweizer-Wolff σ (solid thick, black) as a function of the rotation angle θ. Data was obtained by rotating 1000 samples from a uniform distribution on I^2 by π/4 (left), with added outliers (center), and with added noise (right).

3.1. Ranks and Copulas

Let a pair of random variables (X, Y) ∈ R^2 be distributed according to a bivariate probability distribution P. Assume we are given N samples of (X, Y), D = {(x_1, y_1), ..., (x_N, y_N)}. Let the rank r_x(x) be the number of x_i, i = 1, ..., N such that x > x_i, and let r_y(y) be defined similarly.

Many non-linear dependence measures are based on ranks. Among the most commonly used are Kendall's τ and Spearman's ρ rank correlation coefficients. Kendall's τ measures the difference between the proportions of concordant pairs ((x_i, y_i) and (x_j, y_j) such that (x_i - x_j)(y_i - y_j) > 0) and discordant pairs. Spearman's ρ measures the linear correlation between the ranks r_x(x) and r_y(y). Both τ and ρ have a range of [-1, 1] and are equal to 0 (in the limit) if X and Y are independent. However, the converse is not true, and both τ and ρ can be 0 even if X and Y are not independent. While they are robust to outliers, neither τ nor ρ makes a good ICA contrast, as they provide a noisy estimate of dependence from moderately sized data sets when the dependence is weak (see Figure 1 for an illustration, and the short numerical sketch below).
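The weakness just mentioned is easy to verify numerically. In the following illustrative snippet (ours, not the authors'; it uses SciPy's rank-correlation routines), independent uniform sources are rotated by π/4, so the resulting coordinates are clearly dependent, yet the sample Spearman ρ and Kendall τ stay close to zero:

```python
import numpy as np
from scipy.stats import kendalltau, spearmanr

rng = np.random.default_rng(0)
S = rng.uniform(size=(2, 1000))                  # independent sources, uniform on I^2
theta = np.pi / 4
R = np.array([[np.cos(theta), np.sin(theta)],
              [-np.sin(theta), np.cos(theta)]])
X = R @ S                                        # 45-degree mixture: dependent but uncorrelated
rho, _ = spearmanr(X[0], X[1])
tau, _ = kendalltau(X[0], X[1])
print(rho, tau)                                  # both close to 0 despite the dependence
```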
Rank correlations can be extended from samples to distributions with the help of copulas, distributions over continuous multivariate ranks. We will devise an effective, robust contrast for ICA using a measure of dependence for copulas which is closely related to Spearman's ρ.

Let I denote the unit interval [0, 1]. A bivariate copula C is a probability distribution function (cdf) defined on the unit square, C : I^2 → I, such that its univariate marginals are uniform, i.e., C(u, 1) = u, C(1, v) = v for all u, v ∈ I.[1] Let U = P_x(X) and V = P_y(Y) denote the corresponding cdfs for the previously defined random variables X and Y. The variables X = P_x^{-1}(U) and Y = P_y^{-1}(V) can be defined in terms of the inverses of the marginal cdfs. Then, for (u, v) ∈ I^2, define C as

    C(u, v) = P(P_x^{-1}(u), P_y^{-1}(v)).

It is easy to verify that C is a copula. Sklar's theorem (Sklar, 1959) states that such a copula exists for any distribution P, and that it is unique on the range of values of the marginal distributions. A copula can be thought of as binding the univariate marginals P_x and P_y to make a distribution P.

[1] While we restrict our attention to bivariate copulas, many of the definitions and properties described in this section can be extended to the d-variate case.

Copulas can also be viewed as a canonical form of multivariate distributions, as they preserve the multivariate dependence properties of the corresponding families of distributions. For example, the mutual information of the joint distribution is equal to the negentropy of its copula, restricted to the region on which the copula density function (denoted in this paper by c(u, v)) is defined:

    c(u, v) = \frac{\partial^2 C(u, v)}{\partial u \, \partial v} = \frac{p(x, y)}{p_x(x) p_y(y)};        I(X, Y) = \int_{I^2} c(u, v) \ln c(u, v) \, du \, dv.

This negentropy is minimized when C(u, v) = Π(u, v) = uv. The copula Π is referred to as the product copula and is equivalent to the variables U and V (and the original variables X and Y) being mutually independent. This copula will play a central part in the definition of contrasts in the next subsection.

Copulas can also be viewed as a joint distribution over univariate ranks, and therefore preserve all of the rank statistics of the corresponding multivariate distributions; rank-based statistics can be expressed in terms of the copula alone. For example, Spearman's ρ has a convenient functional form in terms of the corresponding copula (e.g., Nelsen, 2006):

    ρ = 12 \int_{I^2} (C(u, v) - Π(u, v)) \, du \, dv.        (3)

As the true distribution P and its copula C are not known, the rank statistics can be estimated from the available samples using an empirical copula (Deheuvels, 1979). For a data set {(x_1, y_1), ..., (x_N, y_N)}, the empirical copula C_N is given by

    C_N(i/N, j/N) = #{(x_k, y_k) s.t. x_k ≤ x_(i) and y_k ≤ y_(j)} / N,        (4)

where x_(i) and y_(j) denote order statistics. Well-known sample versions of several non-linear dependence measures can be obtained using the empirical copula (e.g., Nelsen, 2006). For example, the sample version r of Spearman's ρ is a grid-integration evaluation of its expression in terms of a copula (Equation 3):

    r = \frac{12}{N^2 - 1} \sum_{i=1}^{N} \sum_{j=1}^{N} \left( C_N(i/N, j/N) - \frac{i}{N} \cdot \frac{j}{N} \right).        (5)

3.2. Schweizer-Wolff σ and σ_∞

Part of the problem with Kendall's τ and Spearman's ρ as contrasts for ICA is the property that their value may be 0 even though the corresponding variables X and Y are not independent. Instead, we suggest using Schweizer-Wolff σ, a measure of dependence between two continuous random variables (Schweizer & Wolff, 1981):

    σ = 12 \int_{I^2} |C(u, v) - uv| \, du \, dv.        (6)

σ can be viewed as an L1 norm between the copula of the distribution and the product copula. It has a range of [0, 1], with the important property that σ = 0 if and only if the corresponding variables are mutually independent, i.e., C = Π. The latter property suggests an ICA algorithm for a pair of variables: pick a rotation angle θ such that the corresponding demixed data set has its σ minimized. A sample version of σ is similar to that of ρ (Equation 5):

    s = \frac{12}{N^2 - 1} \sum_{i=1}^{N} \sum_{j=1}^{N} \left| C_N(i/N, j/N) - \frac{i}{N} \cdot \frac{j}{N} \right|.        (7)

We note that other measures of dependence could potentially be used as an ICA contrast. We also experimented with an L_∞ version of σ, σ_∞ = 4 sup_{I^2} |C(u, v) - uv|, a dependence measure similar to the Kolmogorov-Smirnov univariate statistic (Schweizer & Wolff, 1981), with results similar to σ.
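A direct, if unoptimized, implementation of Equations 4 and 7 is short. The sketch below is illustrative only (our own function name, O(N^2) memory, and continuous data with no ties assumed): it builds the empirical copula on the N × N grid via cumulative sums of a rank indicator matrix and then evaluates s.

```python
import numpy as np

def sw_s(x, y):
    """Sample Schweizer-Wolff dependence s (Eq. 7), built on the empirical copula (Eq. 4)."""
    n = len(x)
    # Ranks 1..n (assuming no ties, as for continuous data): position after sorting.
    rx = np.argsort(np.argsort(x)) + 1
    ry = np.argsort(np.argsort(y)) + 1
    # Empirical copula on the n x n grid: C_N(i/n, j/n) = #{k : rx_k <= i and ry_k <= j} / n.
    ind = np.zeros((n, n))
    ind[rx - 1, ry - 1] = 1.0
    C = ind.cumsum(axis=0).cumsum(axis=1) / n
    u = np.arange(1, n + 1) / n
    return 12.0 / (n * n - 1) * np.abs(C - np.outer(u, u)).sum()

# Quick check: near 0 for independent draws, near 1 for a deterministic relationship.
rng = np.random.default_rng(0)
a = rng.uniform(size=500)
print(sw_s(a, rng.uniform(size=500)))   # ~0
print(sw_s(a, a ** 3))                  # ~1
```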
4. SWICA: A New Algorithm for ICA and ISA

In this section, we present a new algorithm for ICA and ISA demixing. The algorithm uses Schweizer-Wolff σ estimates as a contrast for demixing pairs of variables; we named this algorithm Schweizer-Wolff contrast for ICA, or SWICA for short.

4.1. 2-dimensional Case

First, we tackle the case of a two-dimensional signal S mixed with a 2 × 2 matrix A. We further assume A is orthogonal (otherwise achievable by whitening). The problem is then reduced to finding a demixing rotation matrix

    W(θ) = [  cos θ   sin θ ]
           [ -sin θ   cos θ ].

For the objective function, we use s (Equation 7) computed on the 2 × N matrix Y(θ) = W(θ)X of rotated samples. Given an angle θ, s(Y(θ)) can be computed by first sorting each of the rows of Y(θ) and computing the row ranks for each entry of Y(θ), then computing the empirical copula C_N (Equation 4) for the ranks of Y(θ), and finally computing s(Y(θ)) (Equation 7). The solution is then found by finding the angle θ minimizing s(Y(θ)). Similar to RADICAL (Learned-Miller & Fisher, 2003), we find such a solution by searching over K values of θ in the interval [0, π/2). This algorithm is outlined in Figure 2.

Algorithm SWICA(X, K)
Inputs: X, a 2 × N matrix whose rows are mixed signals (centered and whitened);
        K, the number of equispaced evaluation angles in the interval [0, π/2)
For each of the K angles θ = πk/(2K), k = 0, ..., K - 1:
  · Compute the rotation matrix W(θ) = [ cos θ  sin θ ; -sin θ  cos θ ]
  · Compute the rotated signals Y(θ) = W(θ)X
  · Compute s(Y(θ)), a sample estimate of σ (Equation 7)
Find the best angle θ_m = arg min_θ s(Y(θ))
Output: rotation matrix W = W(θ_m), demixed signal Y = Y(θ_m), and estimated dependence measure s = s(Y(θ_m))

Figure 2. Outline of the SWICA algorithm (2-d case).

4.2. d-dimensional Case

A d-dimensional linear transformation described by a d × d orthogonal matrix W is equivalent to a composition of 2-dimensional rotations (called Jacobi or Givens rotations) (e.g., Comon, 1994). The transformation matrix itself can be written as a product of the corresponding rotation matrices, W = W_L × ... × W_1, where each matrix W_l, l = 1, ..., L is a rotation matrix (by angle θ_l) for some pair of dimensions (i, j). Thus a d-dimensional ICA problem can be solved by solving 2-dimensional ICA problems in succession. Given a current demixing matrix W_c = W_l × ... × W_1 and a current version of the signal X_c = W_c X, we find an angle θ corresponding to SWICA(X_c^(i,j), K), where X_c^(i,j) denotes the pair of rows i and j of X_c. Taking an approach similar to RADICAL, we perform a fixed number of successive sweeps through all possible pairs of dimensions (i, j).

We should note that while d-dimensional SWICA is not guaranteed to converge, it converges in practice the vast majority of the time. A likely explanation is that each 2-dimensional optimization finds a transformation that reduces the sum of entropies for the corresponding dimensions, reducing the overall sum of entropies. In addition to this, Learned-Miller and Fisher (2003) suggest that minimizing the overall sum of entropies in this fashion (by changing only two terms in the sum) may make it easier to escape local minima.
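The 2-d search of Figure 2 can be sketched as follows (illustrative only; it re-implements the estimator s from the previous snippet so the code is self-contained, and uses K = 180 as in the experiments reported below). It assumes the input rows are already centered and whitened.

```python
import numpy as np

def sw_s(x, y):
    """Sample Schweizer-Wolff estimate s (Eq. 7) for two 1-d signals (no ties assumed)."""
    n = len(x)
    rx = np.argsort(np.argsort(x)) + 1
    ry = np.argsort(np.argsort(y)) + 1
    ind = np.zeros((n, n))
    ind[rx - 1, ry - 1] = 1.0
    C = ind.cumsum(axis=0).cumsum(axis=1) / n          # empirical copula (Eq. 4)
    u = np.arange(1, n + 1) / n
    return 12.0 / (n * n - 1) * np.abs(C - np.outer(u, u)).sum()

def swica_2d(X, K=180):
    """Figure 2: search K equispaced angles in [0, pi/2) for the rotation minimizing s."""
    best_s, best_W = np.inf, np.eye(2)
    for k in range(K):
        theta = np.pi * k / (2 * K)
        W = np.array([[np.cos(theta), np.sin(theta)],
                      [-np.sin(theta), np.cos(theta)]])
        Y = W @ X
        s = sw_s(Y[0], Y[1])
        if s < best_s:
            best_s, best_W = s, W
    return best_W, best_W @ X, best_s
```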
4.3. Complexity Analysis and Acceleration Tricks

2-dimensional SWICA requires a search over K angles. For each angle, we first sort the data to compute the ranks of each data point (O(N log N)), and then use these ranks to compute s by computing the empirical copula and summing over the N × N grid (Equation 7), requiring O(N^2) additions. Therefore, the running-time complexity of 2-d SWICA is O(KN^2). Each sweep of a d-dimensional ICA problem solves a 2-dimensional ICA problem for each pair of variables, O(d^2) of them; S sweeps would then have O(S d^2 K N^2) complexity. In our experiments, we employed K = 180, S = 1 for d = 2, and K = 90, S = d for d > 2.

The most expensive computation in SWICA is the O(N^2) summation needed to compute s(Y(θ)). Reducing this complexity, either by approximation or perhaps by an efficient rearrangement of the sum, is left to future research. We used several other tricks to speed up the computation. One, for large N (N > 2500) we estimated s using only N_s^2 (N_s = 2500) terms in the sum, corresponding to equispaced gridpoints on I^2. Two, when searching for the θ minimizing s(Y(θ)), it is unnecessary to sum over all N^2 terms when evaluating a candidate θ if a partial sum already results in a value of s(Y(θ)) larger than the current best. This optimization translates into a 2-fold speed increase in practice. Three, it is unnecessary to complete all S sweeps if the algorithm has already converged. One possible measure of convergence is the Amari error (Equation 8) measured for the cumulative rotation matrix of the most recent sweep.

4.4. Using Schweizer-Wolff σ for ISA

Following Cardoso's conjecture, ISA problems can be solved by first finding a solution to an ICA problem, and then grouping the resulting sources that are not independent (Cardoso, 1998). We propose employing Schweizer-Wolff σ to measure the dependence of the sources of an ICA solution, as it provides a computationally effective alternative to mutual information, the commonly used measure of source dependence. Note that the ICA solution in the first step can be obtained using any approach, e.g., FastICA due to its computational speed for large d. One commonly used trick for grouping the variables is to apply a non-linear transformation to the variables to "amplify" their dependence, as independent variables remain independent under such transformations.[2]

[2] Such transformations are at the core of the KernelICA and JADE ICA algorithms.

5. Experiments

For the experimental evaluation of SWICA, we considered several settings. To evaluate the quality of a demixing solution matrix W, we computed the Amari error (Amari et al., 1996) for the resulting transformation matrix B = WA. The Amari error r(B) measures how different a matrix B is from a permutation matrix, and is defined as

    r(B) = α \sum_{i=1}^{d} \left( \frac{\sum_{j=1}^{d} |b_{ij}|}{\max_j |b_{ij}|} - 1 \right) + α \sum_{j=1}^{d} \left( \frac{\sum_{i=1}^{d} |b_{ij}|}{\max_i |b_{ij}|} - 1 \right),        (8)

where α = 1/(2d(d - 1)). r(B) ∈ [0, 1], and r(B) = 0 if and only if B is a permutation matrix. We compared SWICA to FastICA (Hyvärinen, 1999), KernelICA-KGV (Bach & Jordan, 2002), RADICAL (Learned-Miller & Fisher, 2003), and JADE (Cardoso, 1999).
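For reference, the Amari error of Equation 8 can be computed as in the following short sketch (our own function name, not the authors' code):

```python
import numpy as np

def amari_error(B):
    """Amari error r(B) from Eq. 8; 0 if and only if B is a permutation matrix (up to scale)."""
    B = np.abs(np.asarray(B, dtype=float))
    d = B.shape[0]
    rows = (B.sum(axis=1) / B.max(axis=1) - 1.0).sum()
    cols = (B.sum(axis=0) / B.max(axis=0) - 1.0).sum()
    return (rows + cols) / (2 * d * (d - 1))

# Example: a perfect demixing (up to permutation and sign/scale) gives an error of 0.
print(amari_error(np.array([[0.0, -2.0], [1.5, 0.0]])))   # 0.0
print(amari_error(np.eye(4) + 0.1))                       # small but nonzero
```

In the experiments below, this quantity would be evaluated on B = WA, where W is the estimated demixing matrix and A the true mixing matrix.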
For the simulated data experiments, we used 18 different one-dimensional densities to simulate sources. These test-bed densities (and some of the experiments below) were proposed by Bach and Jordan (2002) to test KernelICA and by Learned-Miller and Fisher (2003) to evaluate RADICAL; we omit the description of these densities due to lack of space, as they can be looked up in the above papers.

Table 1 summarizes the medians of the Amari errors for 2-dimensional problems where both sources had the same distribution. Samples from these sources were transformed by a random rotation and then demixed using the competing ICA algorithms. SWICA outperforms its competitors in 8 out of 18 cases, and performs comparably in several other cases. However, it performs poorly when the joint distribution of the sources is close to a Gaussian (e.g., (d), the t-distribution with 5 degrees of freedom). One possible explanation for why SWICA performs worse than its competitors in these cases is that by using ranks instead of the actual values, SWICA discards some of the information that may be essential to separating such sources. However, given a larger number of samples, SWICA is able to separate near-Gaussian sources (data not shown due to space constraints). SWICA also outperformed the other methods when the sources were not restricted to come from the same distribution (Table 2), and proved effective for multi-dimensional problems (d = 4, 8, 16).

Table 1. The Amari errors (multiplied by 100) for two-component ICA with 1000 samples. Each entry is the median of 100 replicates for each pdf, (a) to (r). The lowest (best) entry in each row is boldfaced.

pdf   SWICA   FastICA   RADICAL   KernelICA   JADE
a      3.74     3.01      2.18       2.09      2.67
b      2.39     4.87      2.31       2.50      3.47
c      0.79     1.91      1.60       1.54      1.63
d     10.10     5.63      4.10       5.05      3.94
e      0.47     4.75      1.43       1.21      3.27
f      0.78     2.85      1.39       1.34      2.77
g      0.74     1.49      1.19       1.11      1.19
h      3.66     5.32      4.01       3.54      3.36
i     10.21     7.38      6.95       7.70      6.41
j      0.86     4.64      1.29       1.21      3.38
k      2.10     5.58      2.65       2.38      3.53
l      4.09     7.68      3.61       3.65      5.21
m      1.11     3.41      1.43       1.23      2.58
n      2.08     4.05      2.10       1.56      4.07
o      5.07     3.81      2.86       2.92      2.78
p      1.24     2.92      1.81       1.53      2.70
q      3.01    12.84      2.30       1.67     10.78
r      3.32     4.30      3.06       2.65      3.32

Table 2. The Amari errors (multiplied by 100) for d-component ICA with N samples. Each entry is the median of 1000 replicates for d = 2 and of 100 replicates for d = 4, 8, 16. Source densities were chosen uniformly at random from (a)-(r). The lowest (best) entry in each row is boldfaced.

d    N       SWICA   FastICA   RADICAL   KernelICA   JADE
2    1000     1.53     4.31      2.13       1.97      3.47
4    2000     1.31     3.74      1.72       1.66      2.83
8    5000     1.20     2.58      1.31       1.25      2.25
16   10000    1.16     1.92      0.93       6.69      1.76

Figure 3 summarizes the performance of the ICA algorithms in the presence of outliers for the d-source case (d = 2, 4, 8). Distributions for the sources were chosen at random from the 18 distributions of the experiment in Table 1. The sources were mixed using a random rotation matrix. The mixed sources were then corrupted by adding +5 or -5 to a single component for a small number of samples (a short sketch of this corruption step follows the figure captions below). SWICA significantly outperforms the rest of the algorithms, as the contrast used by SWICA is insensitive to minor changes in the sample ranks introduced by a small number of outliers. For d = 2, we tested SWICA further by significantly increasing the number of outliers; the performance was virtually unaffected when the proportion of outliers was below 20%. SWICA is also less sensitive to noise than other ICA methods (Figure 4).

Figure 3. Amari errors (multiplied by 100) for 2-d (left), 4-d (center), and 8-dimensional (right) ICA problems in the presence of outliers. The plot shows the median values over R = 1000, 100, 100 replicas of N = 1000, 2000, 5000 samples for d = 2, 4, 8, respectively. Legend: SWICA - red dots (thick), RADICAL - blue x's, KernelICA - green pluses, FastICA - cyan circles, JADE - magenta triangles. The x-axis shows the number of outliers.

Figure 4. Amari errors (multiplied by 100) for 2-d (left), 4-d (center), and 8-dimensional (right) ICA problems in the presence of independent Gaussian noise applied to the mixed sources. The plot shows the median values over R = 1000, 100, 100 replicas of N = 1000, 2000, 5000 samples for d = 2, 4, 8, respectively. The abscissa shows the variance of the Gaussian noise, σ^2 ∈ {0, 0.3, 0.6, 0.9, 1.2, 1.5}. The legend is the same as in Figure 3.
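The outlier corruption described above is straightforward to reproduce; the following sketch (a hypothetical helper of our own, not from the paper) adds ±5 to one randomly chosen component of a few randomly chosen samples of a d × N mixture:

```python
import numpy as np

def add_outliers(X, num_outliers, magnitude=5.0, rng=None):
    """Add +/- magnitude to one randomly chosen component of a few randomly chosen samples."""
    rng = np.random.default_rng() if rng is None else rng
    Xc = X.copy()
    d, N = Xc.shape
    cols = rng.choice(N, size=num_outliers, replace=False)      # which samples to corrupt
    rows = rng.integers(0, d, size=num_outliers)                # which component of each
    Xc[rows, cols] += magnitude * rng.choice([-1.0, 1.0], size=num_outliers)
    return Xc
```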
We further tested SWICA on sound and image data. We mixed N = 1000 samples from 8 sound pieces of an ICA benchmark[3] by a random orthogonal 8 × 8 matrix. Then we added 20 outliers to this mixture in the same way as in the previously described outlier experiment and demixed it using the ICA algorithms. Figure 5 shows that SWICA outperforms the other methods on this task.

[3] http://www.cis.hut.fi/projects/ica/cocktail/cocktail_en.cgi

Figure 5. Box plot of the Amari errors (multiplied by 100) for the mixed sounds with outliers. The plot was computed over R = 100 replicas.

For the image experiment, we used 4 natural images[4] of size 128 × 256. The pixel intensities were normalized to the [0, 255] interval. Each image was considered a realization of a stochastic variable with 32768 sample points. We mixed these 4 images by a 4 × 4 random orthogonal mixing matrix, resulting in a mixture matrix of size 4 × 32768. We then added large +2000 or -2000 outliers to 3% of randomly selected points of these mixtures, and then selected 2000 samples at random from the 32768 vectors. We estimated the demixing matrix W using only these 2000 points, and then recovered the hidden sources for all 32768 samples using this matrix. SWICA significantly outperformed the other methods. Figure 7 shows an example of the demixing achieved by the different ICA algorithms.

[4] http://www.cis.hut.fi/projects/ica/data/images/

Finally, we applied Schweizer-Wolff σ in an ISA setting. We used 6 3-dimensional sources where each variable was sampled from a geometric shape (Figure 6a), resulting in 18 univariate hidden sources. These sources (N = 1000 samples) were then mixed with a random 18 × 18 orthogonal matrix (Figure 6b). Applying Cardoso's conjecture, we first processed the mixed sources using FastICA, and then clustered the recovered sources using σ computed on their absolute values (a non-linear transformation) (Figure 6c). The hidden subspaces were recovered with high precision, as indicated by the Hinton diagram of WA (Figure 6d).

Figure 6. ISA experiment for 6 3-dimensional sources: (a) original sources, (b) mixed sources, (c) estimated sources, (d) Hinton diagram of WA.

6. Conclusion

We proposed a new ICA and ISA method, SWICA, based on a non-parametric rank-based estimate of the dependence between pairs of variables. Our method frequently outperforms other state-of-the-art ICA algorithms, is very robust to outliers, and is only moderately sensitive to noise. On the other hand, it is somewhat slower than other ICA methods and requires more samples to separate near-Gaussian sources. In the future, we plan to investigate possible accelerations to the algorithm, and the statistical characteristics of the source distributions that affect the contrast.

Acknowledgements

This work has been supported by the Alberta Ingenuity Fund through the Alberta Ingenuity Centre for Machine Learning.
Figure 7. Separation of outlier-corrupted mixed images. (a) The original images. (b) The mixed images corrupted with outliers. (c)-(e) The separated images using the SWICA, FastICA, and RADICAL algorithms, respectively. The Amari errors of SWICA, FastICA, and RADICAL were 0.10, 0.30, and 0.29, respectively. The quality of KernelICA and JADE was similar to that of FastICA and RADICAL.

References

Amari, S., Cichocki, A., & Yang, H. (1996). A new learning algorithm for blind source separation. NIPS (pp. 757-763).

Bach, F. R., & Jordan, M. I. (2002). Kernel independent component analysis. JMLR, 3, 1-48.

Bach, F. R., & Jordan, M. I. (2003). Beyond independent components: Trees and clusters. JMLR, 4, 1205-1233.

Cardoso, J.-F. (1998). Multidimensional independent component analysis. Proc. ICASSP'98, Seattle, WA.

Cardoso, J.-F. (1999). High-order contrasts for independent component analysis. Neural Computation, 11, 157-192.

Comon, P. (1994). Independent component analysis, a new concept? Signal Processing, 36, 287-314.

Deheuvels, P. (1979). La fonction de dépendance empirique et ses propriétés, un test non paramétrique d'indépendance. Bulletin de l'Académie Royale de Belgique, Classe des Sciences, 274-292.

Hyvärinen, A. (1999). Fast and robust fixed-point algorithms for independent component analysis. IEEE Trans. on Neural Networks, 626-634.

Hyvärinen, A., Karhunen, J., & Oja, E. (2001). Independent component analysis. New York: John Wiley.

Hyvärinen, A., & Köster, U. (2006). FastISA: A fast fixed-point algorithm for independent subspace analysis. Proc. of ESANN.

Learned-Miller, E. G., & Fisher, J. W. (2003). ICA using spacings estimates of entropy. JMLR, 4, 1271-1295.

Nelsen, R. B. (2006). An introduction to copulas. Springer Series in Statistics. Springer, 2nd edition.

Póczos, B., & Lőrincz, A. (2005). Independent subspace analysis using geodesic spanning trees. Proc. of ICML-2005 (pp. 673-680).

Schweizer, B., & Wolff, E. F. (1981). On nonparametric measures of dependence for random variables. The Annals of Statistics, 9, 879-885.

Sklar, A. (1959). Fonctions de répartition à n dimensions et leurs marges. Publications de l'Institut de Statistique de l'Université de Paris, 8, 229-231.

Szabó, Z., Póczos, B., & Lőrincz, A. (2007). Undercomplete blind subspace deconvolution. JMLR, 8, 1063-1095.

Theis, F. J. (2005). Blind signal separation into groups of dependent signals using joint block diagonalization. Proc. of ISCAS (pp. 5878-5881).

Theis, F. J. (2007). Towards a general independent subspace analysis. Proc. of NIPS 19 (pp. 1361-1368).