Affinity propagation

From Dave's wiki
Jump to navigation Jump to search

Affinity propagation (AP) was developed by Frey and Dueck.[1] It is:

An algorithm that identifies exemplars among data points and forms clusters of data points around these exemplars. It operates by simultaneously considering all data point as potential exemplars and exchanging messages between data points until a good set of exemplars and clusters emerges.

AP only requires a similarity matrix and these values can be in any range, including negative and positive values; a value of negative infinity is interpreted as "absolute dissimilarity" and the higher a value, the more similar two samples are considered. Note that negative values do not indicate a negative correlation.

The most important input parameter of AP is the input preference, which is the tendency of a data sample to become an exemplar. The input preference can either be chosen individually for each data sample or it can be a single value shared among all data samples. Input preferences largely determine the number of clusters. To determine the parameter, Frey and Dueck introduced the following rule of thumb: "The shared value could be the median of the input similarities (resulting in a moderate number of clusters) or their minimum (resulting in a small number of clusters)."

Similarity matrices

AP does not make any specific assumption about the similarity measure only that the higher the value, the more similar two samples are. Negative squared distances must be used if one wants to minimise squared errors. When calculating a similarity matrix, it is assumed that the input data matrix is arranged such that rows are samples and columns are features.

Affinity propagation in R

In R, use the package apcluster. The package also allows exemplar based agglomerative clustering, which can be used as a clustering method on its own or for creating a hierarchy of clusters that have been computed previously by AP. There are several ways for clustering data, but the simplest way is:

apres1a <- apcluster(negDistMat(r = 2), x1)

The above code first computes a similarity matrix for the input data x1 using the similarity function passed as the first argument; the choice negDistMat(r = 2) is the standard similarity measure used by Frey and Dueck, which are negative squared distances. However, users should compute the similarity matrix beforehand.

s1 <- negDistMat(x1, r = 2)
apres1b <- apcluster(s1)

The function apcluster() creates an object belonging to the S4 class APResult; to get more information:

help(APResult)

Use details=TRUE as an argument in apcluster() if you want to keep a detailed log of the process.

apres1c <- apcluster(s1, details = TRUE)

This allows the plotting of the three performance measures that AP uses internally in each iteration. The performance measures are:

  1. Sum of exemplar preferences
  2. Sum of similarities of exemplars to their cluster members
  3. Net fitness: sum of the two former

The convits parameter is used to control how long AP waits for a chance until it terminates; the default is convits=100.

Input preference

The default input preference in apcluster is the median rule. The argument q can be used to set the input preference to certain quantile of the input similarities: the median is q = 0.5 and the minimum is q = 0. It should be noted that setting q = 0 does not necessarily result in a very small number of clusters. This is due to the fact that input preferences need not necessarily be exactly in the range of the similarities. To determine a meaningful range, an auxiliary function is available to calculate the minimum value (for which one or at most two clusters would be obtained) and a maximum value (for which as many clusters as data samples would be obtained). The computations are done approximately by default; to obtain the exact bounds supply exact = TRUE to the preferenceRange() function though this results in longer computational times.

preferenceRange(apres2b@sim)

The input preference used by AP can be recovered from the output object; it is stored in the @p slot.

apres2c@p

Defining the number of clusters

A search algorithm that adjusts input preferences in order to produce the desired number of clusters can be used to obtain a fixed number of clusters: apclusterK().

# force two clusters
apres2d <- apclusterK(negDistMat(r = 2), x2, K = 2, verbose = TRUE)

Exemplar-based agglomerative clustering

Exemplar-based agglomerative clustering can be used to merged clusters.

# an AggExResult object
aggres1a <- aggExCluster(negDistMat(r = 2), x1)

The output object aggres1a contains the complete cluster hierarchy, which can be shown via the plot() function. The heights of the merges in the dendrogram correspond to the merging objective: the higher the vertical bar of a merge, the less similar the two clusters.

Merging clusters obtained from affinity propagation

The aggExCluster() function can be used for creating a hierarchy of clusters starting from a set of clusters previously computed by AP. Exemplar-based agglomerative clustering on affinity propagation provides an additional measure for determining the right number of clusters.

# apply aggExCluster() on AP results
aggres2a <- aggExCluster(x = apres2c)

We can supply a k argument to plot() to visualise the clusters.

plot(aggres2a, x2, k = 5, main = "5 clusters")

Applying aggExCluster() to an AP result only makes sense if the number of clusters to start from is at least as large as the number of true clusters in the data set. For example, if the number of clusters is already too small, then merging will decrease the number of clusters. In each step, aggExCluster() considers all pairs of clusters in the current cluster set and joins that pair of clusters whose merging objective is maximal.

Leveraged affinity propagation

Leveraged AP is based on the idea that, for large data sets with many samples, the cluster structure is already visible on a subset of samples. Instead of evaluating the similarity matrix for all sample pairs, the similarities of all samples to a subset of samples are computed, which results in a non-square similarity matrix. Clustering is performed on this reduced similarity matrix, which clusters larger data sets more efficiently. Several rounds of AP are executed with different sample subsets, which iteratively improves the clustering result. The memory consumption is reduced and clustering is faster.

The two main parameters controlling leveraged AP clustering are the fraction of data points that should be used for clustering (frac) and the number of sweeps or repetitions of individual clustering runs (sweeps). Initially, a sample subset is selected randomly but for the subsequent repetitions, the exemplars of the previous run are kept in the sample subset and the other samples in the subset are randomly chosen again. The best result of all sweeps with the highest net similarity is kept as the final clustering result.

# p is the input preference
apres3 <- apclusterL(s = negDistMat(r = 2), x = x3, frac = 0.1, sweeps = 5, p = -0.2)

Results returned by leveraged AP can be further processed using exemplar-based agglomerative clustering.

Sparse affinity propagation

The functions apcluster() and apclusterK() can also handle similarity matrices stored in sparse format. As far as I'm aware, the original data matrix still needs to be a regular matrix. Use the function as.SparseSimilarityMatrix() to convert a dense matrix to a sparse matrix.

sdim <- negDistMat(x2, r = 2)
# converts to sparse matrix AND removes pairwise similarities that are -0.2 or lower
ssim <- as.SparseSimilarityMatrix(dsim, lower = -0.2)

Links

Official website: http://www.psi.toronto.edu/index.php?q=affinity%20propagation

References

  1. Clustering by Passing Messages Between Data Points https://www.ncbi.nlm.nih.gov/pubmed/17218491