Simulating Datasets


The function syntheticNMF generates random target matrices that follow a defined NMF model, and may be used to test NMF algorithms. It is designed to produce data with known or clear classes of samples.


syntheticNMF(n, r, p, offset = NULL, noise = TRUE, factors = FALSE, seed = NULL)


n: number of rows of the synthetic target matrix.
r: specification of the factorization rank. It may be a single numeric value, in which case argument p is required and r groups of samples are generated, with sizes given by a single draw from a multinomial distribution with equal probabilities. It may also be a numeric vector of integers that contains the number of samples in each class. In this case argument p is ignored and forced to be the sum of r.
p: number of columns of the synthetic target matrix. Not used if argument r is a vector (see description of argument r).
offset: specification of a common offset to be added to the synthetic target matrix, before noise is added. It may be a numeric vector of length n, or a single numeric value that is used as the standard deviation of a centred normal distribution from which the actual offset values are drawn.
noise: a logical that indicates whether noise should be added to the matrix.
factors: a logical that indicates whether the NMF factors should be returned together with the matrix.
seed: a single numeric value used to seed the random number generator before generating the matrix. The state of the RNG is restored on exit.
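
The two ways of specifying r described above can be sketched in base R as follows. This is only an illustration under stated assumptions, not the package's implementation: the helper group_sizes is hypothetical, and it makes no attempt to guard against a multinomial draw that yields an empty group.

```r
# Hypothetical helper (not part of the NMF package): resolve the group sizes
# implied by arguments r and p, following the description of argument r.
group_sizes <- function(r, p = NULL) {
  if (length(r) == 1L) {
    # single rank: p is required; sizes come from one multinomial draw
    # with equal probabilities across the r groups
    stopifnot(!is.null(p))
    as.vector(rmultinom(1, size = p, prob = rep(1 / r, r)))
  } else {
    # vector of class sizes: p is ignored and becomes sum(r)
    as.integer(r)
  }
}

set.seed(42)
sizes_random <- group_sizes(3, p = 18)   # three random sizes that sum to 18
sizes_fixed  <- group_sizes(c(5, 5, 8))  # sizes taken exactly as given
p_fixed      <- sum(sizes_fixed)         # number of columns implied by r
```

With a vector r the class sizes are fully controlled, which is why the examples on this page pass counts directly and omit p.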


a matrix, or a list if argument factors=TRUE.

When factors=FALSE, the result is a matrix object, with the following attributes set:

  1. coefficients: the true underlying coefficient matrix (i.e. H);
  2. basis: the true underlying basis matrix (i.e. W);
  3. offset: the offset, if any;
  4. pData: a list with one element 'Group' that contains a factor indicating the true groups of samples, i.e. the most contributing basis component for each sample;
  5. fData: a list with one element 'Group' that contains a factor indicating the true groups of features, i.e. the basis component to which each feature contributes the most.

Moreover, the result object is an ExposeAttribute object, which means that relevant attributes are accessible via $, e.g., res$coefficients. In particular, methods coef and basis will work as expected and return the true underlying coefficient and basis matrices respectively.
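
The 'Group' factors stored in pData and fData can be read as the dominant component of each sample and each feature. The base R sketch below shows one plausible way to derive such factors from a pair of factor matrices. The toy matrices W and H are made up for illustration, and taking the largest entry per column of H and per row of W is an interpretation of the description, not a statement of how the package computes them.

```r
# Toy factors (made up): 4 features, 2 basis components, 3 samples
W <- matrix(c(5, 4, 0, 1,
              0, 1, 6, 3), nrow = 4)  # basis matrix: features x components
H <- matrix(c(2, 0,
              1, 3,
              0, 4), nrow = 2)        # coefficient matrix: components x samples

# sample groups: component with the largest coefficient in each column of H
sample_group <- factor(apply(H, 2, which.max))

# feature groups: component with the largest entry in each row of W
feature_group <- factor(apply(W, 1, which.max))
```

Here the first sample falls in group 1 and the other two in group 2, while the first two features fall in group 1 and the last two in group 2.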


# generate a synthetic dataset with known classes: 50 features, 18 samples (5+5+8)
n <- 50
counts <- c(5, 5, 8)

# no noise
V <- syntheticNMF(n, counts, noise=FALSE)
## Not run: aheatmap(V)

# with noise
V <- syntheticNMF(n, counts)
## Not run: aheatmap(V)