The functions purity and entropy respectively compute the purity and the entropy of a clustering given a priori known classes.
Entropy of a Clustering
purity(x, y, ...)
entropy(x, y, ...)
## S4 method for signature 'NMFfitXn,ANY'
purity(x, y, method = "best", ...)
## S4 method for signature 'NMFfitXn,ANY'
entropy(x, y, method = "best", ...)
predict, which gives the cluster membership for each sample.
x is a contingency table.
'best' or 'mean' to compute the best or mean purity respectively.
a single numeric value
the entropy (i.e. a single numeric value)
Purity and entropy measure the ability of a clustering method to recover known classes (e.g. when the true class label of each sample is known). Both measures are applicable even when the number of clusters differs from the number of known classes. Kim and Park (2007) used these measures to evaluate the performance of their alternating least-squares NMF algorithm.
Suppose we are given l categories, while the clustering method generates k clusters.
The purity of the clustering with respect to the known categories is given by:

Purity = \frac{1}{n} \sum_{q=1}^k \max_{1 \leq j \leq l} n_q^j,

where:
n is the total number of samples;
n_q^j is the number of samples in cluster q that belong to original class j (1 \leq j \leq l).

The purity is therefore a real number in [0,1]. The larger the purity, the better the clustering performance.
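As a minimal sketch of the purity formula, the value can be computed from a contingency table in base R. The helper name `purity_from_table` and the example table (18 samples, 2 clusters in rows, 3 classes in columns) are illustrative assumptions and not the NMF package's own code.

```r
# Hypothetical helper (not the NMF package's implementation):
# rows of `tab` are clusters q, columns are known classes j.
purity_from_table <- function(tab) {
  # for each cluster take the size of its dominant class, then divide by n
  sum(apply(tab, 1, max)) / sum(tab)
}

# Assumed example: cluster 1 holds classes 1 and 3, cluster 2 holds class 2
tab <- matrix(c(5, 0, 8,
                0, 5, 0),
              nrow = 2, byrow = TRUE,
              dimnames = list(cluster = c("1", "2"),
                              class = c("1", "2", "3")))

purity_from_table(tab)  # (8 + 5) / 18
```

Here the dominant classes contribute 8 and 5 samples, so the purity is 13/18, roughly 0.72.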
The entropy of the clustering with respect to the known categories is given by:

Entropy = -\frac{1}{n \log_2 l} \sum_{q=1}^k \sum_{j=1}^l n_q^j \log_2 \frac{n_q^j}{n_q},

where:
n is the total number of samples;
n_q is the total number of samples in cluster q (1 \leq q \leq k);
n_q^j is the number of samples in cluster q that belong to original class j (1 \leq j \leq l).

The smaller the entropy, the better the clustering performance.
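The entropy formula can be sketched in base R in the same way. The helper name `entropy_from_table`, the example tables, and the convention that empty cells contribute zero (0 * log2(0) = 0) are assumptions made for this illustration, not taken from the package source.

```r
# Hypothetical helper (not the NMF package's implementation):
# rows of `tab` are clusters q, columns are known classes j.
entropy_from_table <- function(tab) {
  n <- sum(tab)          # total number of samples
  l <- ncol(tab)         # number of known classes
  n_q <- rowSums(tab)    # cluster sizes
  # tab / n_q divides every row q by n_q[q] (vector recycled down the columns)
  terms <- tab * log2(tab / n_q)
  terms[tab == 0] <- 0   # assumed convention: empty cells contribute zero
  -sum(terms) / (n * log2(l))
}

# Assumed example: cluster 1 holds classes 1 and 3, cluster 2 holds class 2
tab <- matrix(c(5, 0, 8,
                0, 5, 0),
              nrow = 2, byrow = TRUE)
entropy_from_table(tab)

# A clustering that recovers the three classes exactly
perfect <- diag(c(5, 5, 8))
entropy_from_table(perfect)
```

The mixed table gives an entropy of roughly 0.438, while the table in which every cluster is pure gives 0, in line with smaller values indicating better clustering.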
entropy, signature(x = "table", y = "missing"): Computes the entropy directly from the contingency table x. This is the workhorse method that is eventually called by all other methods.
entropy, signature(x = "factor", y = "ANY"): Computes the entropy on the contingency table of x and y, with y coerced into a factor if necessary.
entropy, signature(x = "ANY", y = "ANY"): Default method that should work for results of clustering algorithms that have a suitable predict method returning the cluster membership vector: the entropy is computed between predict(x) and y.
entropy, signature(x = "NMFfitXn", y = "ANY"): Computes the best or mean entropy across all NMF fits stored in x.
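The idea of summarising several fits by their best or mean value can be sketched in base R on a plain list of cluster membership vectors. The helper names `entropy_of` and `summarise_entropy`, and the reading of "best" as the smallest entropy, are assumptions made for this illustration rather than the package's code.

```r
# Hypothetical helpers, not the NMF package's implementation.
entropy_of <- function(clusters, classes) {
  tab <- table(clusters, classes)   # rows: clusters, columns: classes
  n_q <- rowSums(tab)
  terms <- tab * log2(tab / n_q)
  terms[tab == 0] <- 0              # assumed convention for empty cells
  -sum(terms) / (sum(tab) * log2(ncol(tab)))
}

summarise_entropy <- function(fits, classes, method = c("best", "mean")) {
  method <- match.arg(method)
  values <- vapply(fits, entropy_of, numeric(1), classes = classes)
  # assumption: the "best" entropy is the smallest one
  switch(method, best = min(values), mean = mean(values))
}

classes <- rep(1:3, c(5, 5, 8))
fits <- list(
  rep(1:3, c(5, 5, 8)),                # recovers the classes exactly
  c(rep(1, 5), rep(2, 5), rep(1, 8))   # merges classes 1 and 3
)
summarise_entropy(fits, classes, "best")
summarise_entropy(fits, classes, "mean")
```

With these two toy clusterings the best entropy is 0 and the mean is roughly 0.219.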
purity, signature(x = "table", y = "missing"): Computes the purity directly from the contingency table x.
purity, signature(x = "factor", y = "ANY"): Computes the purity on the contingency table of x and y, with y coerced into a factor if necessary.
purity, signature(x = "ANY", y = "ANY"): Default method that should work for results of clustering algorithms that have a suitable predict method returning the cluster membership vector: the purity is computed between predict(x) and y.
purity, signature(x = "NMFfitXn", y = "ANY"): Computes the best or mean purity across all NMF fits stored in x.
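The dispatch chain described by these signatures (a table method doing the work, a factor method building the table, and a default method calling predict) can be sketched with a small S4 generic. The generic `my_purity`, the S3 class `toy_clustering` and its predict method are hypothetical names introduced for this sketch; they mirror the structure described above, not the package source.

```r
library(methods)

# Hypothetical generic mirroring the documented signatures
setGeneric("my_purity", function(x, y, ...) standardGeneric("my_purity"))

# Workhorse: purity computed directly from a contingency table
setMethod("my_purity", signature(x = "table", y = "missing"),
  function(x, y, ...) sum(apply(x, 1, max)) / sum(x))

# Factor method: build the contingency table, coercing y if necessary
setMethod("my_purity", signature(x = "factor", y = "ANY"),
  function(x, y, ...) {
    if (!is.factor(y)) y <- as.factor(y)
    my_purity(table(x, y), ...)
  })

# Default method: rely on predict() to return the cluster membership
setMethod("my_purity", signature(x = "ANY", y = "ANY"),
  function(x, y, ...) my_purity(predict(x), y, ...))

# Toy clustering result with an S3 predict method (assumed for illustration)
predict.toy_clustering <- function(object, ...) object$cluster
fit <- structure(
  list(cluster = factor(c(rep(1, 5), rep(2, 5), rep(1, 8)))),
  class = "toy_clustering")
truth <- rep(1:3, c(5, 5, 8))

my_purity(fit, truth)  # default method -> factor method -> table method
```

Routing every call through the table method keeps the formula in one place, so new input types only need to supply a way of producing cluster memberships.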
Kim H and Park H (2007). "Sparse non-negative matrix factorizations via alternating non-negativity-constrained least squares for microarray data analysis." _Bioinformatics (Oxford, England)_, *23*(12), pp. 1495-502. ISSN 1460-2059.
# generate a synthetic dataset with known classes: 50 features, 18 samples (5+5+8)
library(NMF)
n <- 50
counts <- c(5, 5, 8)
V <- syntheticNMF(n, counts)
# true class label of each sample (five 1s, five 2s, eight 3s)
cl <- unlist(mapply(rep, 1:3, counts))
# perform default NMF with rank=2
x2 <- nmf(V, 2)
purity(x2, cl)
## [1] 0.7222
entropy(x2, cl)
## [1] 0.438
# perform default NMF with rank=3
x3 <- nmf(V, 3)
purity(x3, cl)
## [1] 1
entropy(x3, cl)
## [1] 0
sparseness