From Dave's wiki

t-distributed stochastic neighbor embedding (t-SNE)

If X is the original data, P is a matrix that holds affinities (similarities; loosely, inverted distances) between the points of X in the high-dimensional (original) space, and Q is the matrix that holds affinities between the data points in the low-dimensional space. If we have n data samples, both P and Q are n by n matrices (an affinity from every point to every point, including itself). t-SNE uses a specific measure for each of these: one way to measure affinity between data points in the high-dimensional space (a Gaussian kernel), another for data points in the low-dimensional space (a Student-t kernel), and a third for the mismatch between P and Q (the Kullback-Leibler divergence, which the optimisation minimises). From the original paper: the similarity of data point x_j to data point x_i is the conditional probability, p_j|i, that x_i would pick x_j as its neighbour if neighbours were picked in proportion to their probability density under a Gaussian centred at x_i.
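The conditional probabilities p_j|i above can be sketched in a few lines of R. This is a simplified illustration, not the real algorithm: it uses a single Gaussian bandwidth `sigma` for every point, whereas t-SNE tunes a per-point bandwidth sigma_i so that each row of P matches the chosen perplexity.

```r
# Sketch: high-dimensional affinities p_{j|i}, one fixed sigma for all points
# (hypothetical helper name; real t-SNE fits sigma_i per point via perplexity).
high_dim_affinities <- function(X, sigma = 1) {
  D <- as.matrix(dist(X))^2        # squared Euclidean distances between rows
  P <- exp(-D / (2 * sigma^2))     # Gaussian kernel centred at each point
  diag(P) <- 0                     # a point never picks itself as a neighbour
  P / rowSums(P)                   # row i is the distribution p_{.|i}, sums to 1
}

set.seed(1)
X <- matrix(rnorm(20), nrow = 5)   # 5 toy samples in 4 dimensions
P <- high_dim_affinities(X)        # 5 x 5 affinity matrix
rowSums(P)                         # each row sums to 1
```

Each row of `P` is a probability distribution over the other points, which is why t-SNE can later compare it to the low-dimensional `Q` with a divergence rather than a plain distance.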

  • The algorithm accepts two inputs: the data itself and a parameter called the perplexity (Perp)
  • Perplexity controls how the optimisation balances its focus between the local structure (nearby points) and the global structure of the data; a higher perplexity makes each data point treat more points as its close neighbours, a lower one fewer
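The "number of close neighbours" reading of perplexity can be made concrete: the perplexity of a probability distribution is 2 raised to its Shannon entropy in bits, and for a uniform distribution over k neighbours this comes out to exactly k. A small sketch (the helper name is mine, not from any library):

```r
# Sketch: perplexity of one row of P, i.e. 2^H where H is the Shannon
# entropy (in bits) of the conditional distribution p_{.|i}.
perplexity <- function(p) {
  p <- p[p > 0]                    # treat 0 * log(0) as 0
  H <- -sum(p * log2(p))           # Shannon entropy in bits
  2^H
}

perplexity(rep(1/5, 5))            # uniform over 5 neighbours -> exactly 5
perplexity(c(0.97, 0.01, 0.01, 0.01))  # mass piled on one point -> near 1
```

t-SNE chooses each point's Gaussian bandwidth so that this quantity equals the perplexity you passed in, which is why a larger Perp spreads each point's attention over more neighbours.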

See https://distill.pub/2016/misread-tsne/ for an interactive demonstration of how perplexity affects the results

library(Rtsne)   # Barnes-Hut t-SNE implementation for R

# matrix: a numeric matrix, one row per sample, one column per feature.
# check_duplicates = FALSE skips Rtsne's duplicate-row check, which
# otherwise stops with an error when identical rows are present.
tsne_out <- Rtsne(matrix, check_duplicates = FALSE)
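A complete end-to-end sketch on R's built-in iris data, assuming the Rtsne package is installed; the 2-D embedding comes back in `tsne_out$Y`:

```r
library(Rtsne)

set.seed(42)                        # t-SNE is stochastic; fix the seed
X <- as.matrix(iris[, 1:4])         # 150 samples, 4 numeric features
# iris contains duplicate rows, so skip the duplicate check as above
tsne_out <- Rtsne(X, perplexity = 30, check_duplicates = FALSE)

dim(tsne_out$Y)                     # 150 x 2 embedding coordinates
plot(tsne_out$Y, col = iris$Species, pch = 19,
     xlab = "t-SNE 1", ylab = "t-SNE 2")
```

Try rerunning with different `perplexity` values (e.g. 5 vs 50) to see the local/global trade-off described above.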