Software | Marie Chavent

The methods of multivariate data analysis and clustering implemented in the following R packages are designed for numerical data, categorical data or mixed data (a mixture of numerical and categorical data).

The package ClustOfVar is dedicated to the clustering of variables.
The package PCAmixdata implements PCAmix (principal component analysis), PCArot (orthogonal rotation) and MFAmix (multiple factorial analysis) for mixed data.
The package divclust implements a monothetic divisive hierarchical clustering algorithm.
The package ClustGeo implements hierarchical clustering with geographical constraints.
The package sparsePCA implements sparse and group-sparse principal component analysis.
The package modvarsel implements a computational methodology for model and variables selection.
The package vimpclust implements methods related to sparse clustering and variable importance.

The package ClustOfVar

Participants: Marie Chavent, Amaury Labenne, Benoît Liquet, Vanessa Kuentz, Jérôme Saracco

This R package is dedicated to the clustering of variables. Variables can be quantitative, qualitative or a mixture of both. It provides hierarchical and k-means clustering of a set of variables. The center of a cluster of variables is a synthetic variable, but it is not a 'mean' as in classical k-means or Ward clustering: it is the first principal component calculated by PCAmix. The homogeneity of a cluster of variables is defined as the sum of the correlation ratios (for qualitative variables) and the squared correlations (for quantitative variables) between the variables and the center of the cluster, which is in all cases a numerical variable. The package can handle datasets with thousands of variables, such as gene expression data, and can also be used for dimension reduction.
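
The homogeneity criterion described above can be illustrated with a minimal sketch. It is written in Python using only the standard library and is not the package's code. The function names are hypothetical, and the cluster center is assumed to be supplied by the caller, whereas in ClustOfVar it is the first principal component computed by PCAmix.

```python
from collections import defaultdict
from statistics import fmean


def squared_correlation(x, y):
    """Squared Pearson correlation between two numerical variables."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)


def correlation_ratio(levels, y):
    """Correlation ratio: share of the variance of y explained by the levels."""
    my = fmean(y)
    groups = defaultdict(list)
    for level, value in zip(levels, y):
        groups[level].append(value)
    between = sum(len(g) * (fmean(g) - my) ** 2 for g in groups.values())
    total = sum((v - my) ** 2 for v in y)
    return between / total


def homogeneity(quantitative, qualitative, center):
    """Sum of squared correlations and correlation ratios with the center.

    The center is assumed to be given here; it is not computed by PCAmix.
    """
    return (sum(squared_correlation(x, center) for x in quantitative)
            + sum(correlation_ratio(q, center) for q in qualitative))


# Toy cluster: one quantitative and one qualitative variable.
center = [1.0, 2.0, 3.0, 4.0]
print(homogeneity([[2.0, 4.0, 6.0, 8.0]], [["a", "a", "b", "b"]], center))
```

Each term lies between 0 and 1, so the homogeneity of a cluster is at most its number of variables, and it reaches that bound only when every variable is perfectly linked to the center.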

Download:

the version on CRAN is available here
the current development version from GitHub is available here
the JSS paper is available here
slides are available here and here

The package PCAmixdata

Participants: Marie Chavent, Amaury Labenne, Benoît Liquet, Vanessa Kuentz, Jérôme Saracco

This package is dedicated to factorial analysis and rotation of quantitative data, qualitative data, or mixed data. The PCAmix method proposed in this package includes ordinary principal component analysis (PCA) and multiple correspondence analysis (MCA) as special cases. Orthogonal varimax rotation of the PCAmix principal components is also implemented. Multiple factorial analysis (MFA) for data structured in groups of variables is also provided. Variables can be quantitative, qualitative or a mixture of both within each group.

Download:

the version on CRAN is available here
the current development version from GitHub is available here
an introductory vignette is available here, and a vignette on supplementary observations, variables and groups is available here
the arXiv paper with the description of the methods and R code examples is here
the ADAC paper on rotation is available here
slides of useR! 2015 are available here

The package divclust

Participants: Marie Chavent, Marc Fuentes

DIVCLUS-T is a divisive hierarchical clustering algorithm based on a monothetic bipartitional approach, which allows the dendrogram of the hierarchy to be read as a decision tree. It is designed for numerical, categorical (ordered or not) or mixed data. Like the Ward agglomerative hierarchical clustering algorithm and the k-means partitioning algorithm, it is based on the minimization of the inertia criterion. However, it provides a simple and natural monothetic interpretation of the clusters: each cluster is described by a set of binary questions. The inertia criterion is calculated on all the principal components of PCAmix (and therefore on standardized data in the numerical case).
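
A single monothetic split of the kind described above can be sketched as follows. The sketch is in Python, uses only the standard library, and is not the divclust implementation. It makes several simplifying assumptions: the data are purely numerical, they are used as given rather than standardized or projected on the PCAmix principal components, and only one bipartition is computed instead of a full hierarchy. The function names are hypothetical.

```python
from statistics import fmean


def inertia(rows):
    """Sum of squared Euclidean distances of the rows to their centroid."""
    if not rows:
        return 0.0
    centroid = [fmean(column) for column in zip(*rows)]
    return sum((v - c) ** 2 for row in rows for v, c in zip(row, centroid))


def best_binary_question(rows):
    """Find the question "is variable j <= cut?" with the smallest
    within-cluster inertia.

    Returns a tuple (variable index, cut value, within-cluster inertia).
    """
    best = None
    for j in range(len(rows[0])):
        values = sorted({row[j] for row in rows})
        # Candidate cuts are midpoints between consecutive observed values.
        for low, high in zip(values, values[1:]):
            cut = (low + high) / 2
            left = [row for row in rows if row[j] <= cut]
            right = [row for row in rows if row[j] > cut]
            within = inertia(left) + inertia(right)
            if best is None or within < best[2]:
                best = (j, cut, within)
    return best


# Toy data: the second variable separates the observations best.
data = [(1.0, 10.0), (2.0, 50.0), (9.0, 11.0), (10.0, 49.0)]
variable, cut, within = best_binary_question(data)
print(f"split on variable {variable} at {cut}, within inertia {within}")
```

Because every split is a question on a single variable, applying this step recursively to the resulting clusters yields a hierarchy that can be read as a decision tree.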

Download:

the current development version from GitHub is available here
slides of Rencontres R (in French) with a description of the method and R code are available here
the paper with the description of the methods (not for mixed data) is available here

The package ClustGeo

Participants: Marie Chavent, Amaury Labenne, Vanessa Kuentz, Jérôme Saracco

This R package is dedicated to the clustering of objects with geographical positions. The clustering method implemented in this package allows geographical proximity constraints to be taken into account within agglomerative hierarchical clustering.

Download:

the version on CRAN is available here
the current development version from GitHub and a vignette are available here
a vignette of introduction to the package is available here
the preprint of the paper is available here
slides of Rencontres R (in French) are available here

The package sparsePCA

Participant: Marie Chavent

This package performs sparse or group-sparse principal component analysis. Deflation and block algorithms are implemented. Five different definitions of explained variance for a set of non-orthogonal components are also implemented.

Download:

the version on GitHub is available here
the preprint of the paper is available here and slides in French are available here

The package modvarsel

Participants: Marie Chavent, Alexandre le Conanec, Marie-Pierre Ellies, Jérôme Saracco

This package implements a methodology for choosing, among several regression methods, the best one to predict a numerical response variable while simultaneously selecting the most relevant covariates.

Download:

the version on GitHub is available here
a vignette of introduction to the package is available here
the paper is available here

The package vimpclust

Participants: Marie Chavent, Jérôme Lacaille, Alex Mourer, Madalina Olteanu

The package performs sparse k-means clustering with a group penalty, so that it automatically selects groups of numerical features. It also performs sparse clustering and variable selection on mixed data (categorical and numerical features) by preprocessing each categorical feature as a group of numerical features. Several methods for visualizing and exploring the results are also provided.

Download:

the version on GitHub is available here
the version on CRAN is available here
a vignette for sparse weighted k-means of mixed data is available here
a vignette for group-sparse weighted k-means for numerical data is available here
the conference paper is available here