Software

The methods of multivariate data analysis and clustering implemented in the following R packages are designed for numerical data, categorical data or mixed data (a mixture of numerical and categorical data).

The package ClustOfVar

Participants: Marie Chavent, Amaury Labenne, Benoît Liquet, Vanessa Kuentz, Jérôme Saracco

This R package is dedicated to the clustering of variables. Variables can be quantitative, qualitative or a mixture of both. It provides hierarchical and k-means clustering of a set of variables. The center of a cluster of variables is a synthetic variable, but it is not a 'mean' as in classical k-means or Ward clustering. This synthetic variable is the first principal component calculated by PCAmix. The homogeneity of a cluster of variables is defined as the sum of the correlation ratios (for qualitative variables) and the squared correlations (for quantitative variables) between the variables and the center of the cluster, which is in all cases a numerical variable. This package can handle datasets with thousands of variables, such as gene expression data, and can also be used for dimension reduction.
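The homogeneity criterion described above can be illustrated for the purely quantitative case, where the first principal component of PCAmix is assumed to coincide with the first principal component of an ordinary PCA on standardized variables. The following Python sketch is not the package's code: the function names (`standardize`, `first_principal_component`, `homogeneity`) and the toy data are hypothetical, and power iteration is used here only because it needs nothing beyond the standard library.

```python
import math


def standardize(column):
    """Center a non-constant column and scale it to unit (population) variance."""
    n = len(column)
    mean = sum(column) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in column) / n)
    return [(x - mean) / sd for x in column]


def correlation(u, v):
    """Pearson correlation between two columns of equal length."""
    su, sv = standardize(u), standardize(v)
    return sum(a * b for a, b in zip(su, sv)) / len(u)


def first_principal_component(columns, n_iter=500):
    """Scores of the first principal component of the standardized columns.

    The leading eigenvector of the correlation matrix is found by power
    iteration (an illustrative choice, assumed adequate for small examples).
    """
    z = [standardize(c) for c in columns]
    p, n = len(z), len(z[0])
    corr = [[sum(a * b for a, b in zip(z[j], z[k])) / n for k in range(p)]
            for j in range(p)]
    w = [1.0 + 0.1 * j for j in range(p)]  # arbitrary non-zero start vector
    for _ in range(n_iter):
        w = [sum(corr[j][k] * w[k] for k in range(p)) for j in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        w = [x / norm for x in w]
    return [sum(w[j] * z[j][i] for j in range(p)) for i in range(n)]


def homogeneity(columns):
    """Sum of squared correlations between each variable and the cluster center."""
    center = first_principal_component(columns)
    return sum(correlation(c, center) ** 2 for c in columns)


# Toy data: x1 and x2 are perfectly correlated, x3 is negatively related to both.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 4.0, 6.0, 8.0, 10.0]
x3 = [5.0, 3.0, 4.0, 1.0, 2.0]

print(homogeneity([x1, x2]))      # two perfectly correlated variables
print(homogeneity([x1, x2, x3]))  # less homogeneous cluster of three variables
```

In this quantitative-only setting the homogeneity equals the largest eigenvalue of the correlation matrix, so it ranges from 1 to the number of variables in the cluster.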

Download:

  • the version on CRAN is available here
  • the current development version from GitHub is available here
  • the JSS paper is available here
  • slides are available here and here

The package PCAmixdata

Participants: Marie Chavent, Amaury Labenne, Benoît Liquet, Vanessa Kuentz, Jérôme Saracco

This package is dedicated to factorial analysis and rotation of quantitative data, qualitative data, or mixed data. The PCAMIX method proposed in this package includes ordinary principal component analysis (PCA) and multiple correspondence analysis (MCA) as special cases. Orthogonal varimax rotation of the principal components of PCAMIX is also implemented. Multiple factor analysis (MFA) for data structured in groups of variables is also provided. Variables can be quantitative, qualitative or a mixture of both within each group.

Download:

  • the version on CRAN is available here
  • the current development version from GitHub is available here
  • an introductory vignette for the package is available here, and a vignette specific to supplementary observations, variables and groups is available here
  • the arXiv paper with the description of the methods and R code examples is here
  • the ADAC paper on rotation is available here
  • slides of useR! 2015 are available here

The package divclust

Participants: Marie Chavent, Marc Fuentes

DIVCLUS-T is a divisive hierarchical clustering algorithm based on a monothetic bipartitional approach, which allows the dendrogram of the hierarchy to be read as a decision tree. It is designed for numerical, categorical (ordered or not) or mixed data. Like the Ward agglomerative hierarchical clustering algorithm and the k-means partitioning algorithm, it is based on the minimization of the inertia criterion. However, it provides a simple and natural monothetic interpretation of the clusters: each cluster is described by a set of binary questions. The inertia criterion is calculated on all the principal components of PCAmix (and hence on standardized data in the numerical case).
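A single monothetic bipartition step can be sketched as follows for numerical data: among all binary questions of the form "is variable j at most c?", keep the one whose two resulting clusters have the smallest total within-cluster inertia. This Python sketch is an illustration under stated assumptions, not the divclust implementation: the function names and toy data are hypothetical, the rows are assumed to be already standardized, and candidate cut points are taken as midpoints between consecutive observed values.

```python
def inertia(rows):
    """Sum of squared Euclidean distances between the rows and their centroid."""
    if not rows:
        return 0.0
    p = len(rows[0])
    centroid = [sum(r[j] for r in rows) / len(rows) for j in range(p)]
    return sum((r[j] - centroid[j]) ** 2 for r in rows for j in range(p))


def best_binary_question(rows):
    """Exhaustive search for the binary question minimizing within-cluster inertia.

    Returns (within_inertia, variable_index, cut), meaning that the cluster is
    split by the question "is variable `variable_index` <= `cut`?".
    """
    best = None
    p = len(rows[0])
    for j in range(p):
        values = sorted({r[j] for r in rows})
        # Candidate cuts: midpoints between consecutive distinct values.
        for low, high in zip(values, values[1:]):
            cut = (low + high) / 2
            left = [r for r in rows if r[j] <= cut]
            right = [r for r in rows if r[j] > cut]
            within = inertia(left) + inertia(right)
            if best is None or within < best[0]:
                best = (within, j, cut)
    return best


# Toy data: the first variable separates two groups, the second does not.
rows = [(0.0, 5.0), (0.2, 1.0), (0.1, 3.0), (4.0, 2.0), (4.2, 4.0), (4.1, 0.0)]

within, variable, cut = best_binary_question(rows)
print(f"split on variable {variable} at {cut:.2f}, within-cluster inertia {within:.2f}")
```

Applying the same search recursively to each resulting cluster yields a hierarchy in which every node carries one binary question, which is why the dendrogram can be read as a decision tree.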

Download:

  • the current development version from GitHub is available here
  • slides of Rencontres R (in French) with a description of the method and R code are available here
  • the paper with the description of the methods (not for mixed data) is available here

The package ClustGeo

Participants: Marie Chavent, Amaury Labenne, Vanessa Kuentz, Jérôme Saracco

This R package is dedicated to the clustering of objects with geographical positions. The clustering method implemented in this package allows geographical proximity constraints to be taken into account within agglomerative hierarchical clustering.

Download:

  • the version on CRAN is available here
  • the current development version from GitHub and a vignette are available here
  • a vignette of introduction to the package is available here
  • the preprint of the paper is available here
  • slides of Rencontres R (in French) are available here

The package sparsePCA

Participant: Marie Chavent

This package performs sparse or group-sparse principal component analysis. Deflation and block algorithms are implemented. Five different definitions of explained variance for a set of non-orthogonal components are also implemented.

Download:

  • the version on GitHub is available here
  • the preprint of the paper is available here and slides in French are available here

The package modvarsel

Participants: Marie Chavent, Alexandre le Conanec, Marie-Pierre Ellies, Jérôme Saracco

This package implements a methodology for choosing, among several regression methods, the best one to predict a numerical response variable while simultaneously selecting the most relevant covariates.

Download:

  • the version on GitHub is available here
  • a vignette of introduction to the package is available here
  • the paper is available here

The package vimpclust

Participants: Marie Chavent, Jérôme Lacaille, Alex Mourer, Madalina Olteanou

This package performs sparse k-means clustering with a group penalty, so that it automatically selects groups of numerical features. It also performs sparse clustering and variable selection on mixed data (categorical and numerical features) by preprocessing each categorical feature as a group of numerical features. Several methods for visualizing and exploring the results are also provided.
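The idea of treating each categorical feature as a group of numerical features can be sketched with plain indicator (one-hot) coding, keeping track of which group each new column belongs to. This Python sketch is only an assumption about the general principle: the function `encode_mixed` and the toy records are hypothetical, and the actual preprocessing in vimpclust may code or scale the columns differently.

```python
def encode_mixed(records, categorical):
    """Recode mixed records as a numerical matrix with group labels.

    records     : list of dicts sharing the same keys
    categorical : set of keys to be treated as categorical features
    Returns (matrix, column_names, groups), where groups[k] is the index of
    the original feature that produced column k.
    """
    names, groups, columns = [], [], []
    for g, key in enumerate(records[0].keys()):
        if key in categorical:
            # One indicator column per level; all share the same group index.
            for level in sorted({r[key] for r in records}):
                names.append(f"{key}={level}")
                groups.append(g)
                columns.append([1.0 if r[key] == level else 0.0 for r in records])
        else:
            # A numerical feature forms a group of size one.
            names.append(key)
            groups.append(g)
            columns.append([float(r[key]) for r in records])
    matrix = [list(row) for row in zip(*columns)]
    return matrix, names, groups


# Toy data: one numerical and one categorical feature.
records = [
    {"age": 30, "colour": "red"},
    {"age": 40, "colour": "blue"},
    {"age": 50, "colour": "red"},
]

matrix, names, groups = encode_mixed(records, categorical={"colour"})
print(names)
print(groups)
print(matrix)
```

With such group labels, a group penalty can keep or discard all indicator columns of a categorical feature together, so that variable selection operates on the original features rather than on individual levels.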

Download:

  • the version on GitHub is available here
  • the version on CRAN is available here
  • a vignette for sparse weighted k-means of mixed data is available here
  • a vignette for group-sparse weighted k-means for numerical data is available here
  • the conference paper is available here