T-SNE

From Dave's wiki

t-distributed stochastic neighbor embedding (t-SNE)

If X is the original data, P is a matrix that holds the affinities (similarities derived from distances) between the points of X in the original, high-dimensional space, and Q is a matrix that holds the affinities between the corresponding points in the low-dimensional space. With n data samples, both P and Q are n by n matrices with one entry for every pair of points; the diagonal entries (a point paired with itself) are set to zero.

t-SNE measures three things in three different ways: similarity between data points in the high-dimensional space (a Gaussian kernel), similarity between data points in the low-dimensional space (a Student t-distribution with one degree of freedom), and the mismatch between P and Q (the Kullback-Leibler divergence, which the optimisation minimises). From the original paper: the similarity of a point x_j to a point x_i is the conditional probability p_j|i that x_i would pick x_j as its neighbour if neighbours were picked in proportion to their probability density under a Gaussian centred at x_i.
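The three measures can be sketched in a few lines of base R. This is only an illustration, not the optimised implementation: the toy data, the random embedding Y, and the single fixed bandwidth sigma are assumptions made here (t-SNE itself tunes one sigma per point from the perplexity and then moves Y to reduce the cost).

```r
# Toy data: 5 points in 3 dimensions (made up for illustration).
set.seed(1)
X <- matrix(rnorm(15), nrow = 5)
Y <- matrix(rnorm(10), nrow = 5)   # a candidate 2-D embedding, random here

# High-dimensional side: Gaussian conditional probabilities p_{j|i}.
# sigma is fixed for simplicity (an assumption of this sketch).
sigma <- 1
D2_high <- as.matrix(dist(X))^2
P_cond <- exp(-D2_high / (2 * sigma^2))
diag(P_cond) <- 0                          # p_{i|i} = 0
P_cond <- P_cond / rowSums(P_cond)         # row i holds p_{j|i}
P <- (P_cond + t(P_cond)) / (2 * nrow(X))  # symmetrised joint p_ij

# Low-dimensional side: Student-t kernel with one degree of freedom.
D2_low <- as.matrix(dist(Y))^2
Q <- 1 / (1 + D2_low)
diag(Q) <- 0
Q <- Q / sum(Q)                            # joint q_ij

# Cost: Kullback-Leibler divergence KL(P || Q), summed over i != j.
off_diag <- row(P) != col(P)
kl <- sum(P[off_diag] * log(P[off_diag] / Q[off_diag]))
```

Both P and Q sum to 1 over all pairs, and kl shrinks towards zero as the low-dimensional affinities approach the high-dimensional ones.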

  • The algorithm takes two inputs: the data itself and a parameter called the perplexity (Perp)
  • Perplexity controls how the optimisation balances the local structure (close points) against the global structure of your data; a higher perplexity means each data point treats more points as its close neighbours, and a lower perplexity means fewer
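To make the "number of close neighbours" reading concrete, the sketch below computes the perplexity of one point's neighbour distribution as 2 raised to its Shannon entropy in bits, and searches for the Gaussian bandwidth sigma that hits a target perplexity. The function names (cond_probs, perplexity, find_sigma), the search bounds, and the example distances are hypothetical choices for illustration; real implementations differ in detail.

```r
# Conditional probabilities p_{j|i} for one point, given its squared
# distances to all other points and a Gaussian bandwidth sigma.
cond_probs <- function(d2, sigma) {
  w <- exp(-(d2 - min(d2)) / (2 * sigma^2))  # shift by min(d2) to avoid underflow
  w / sum(w)
}

# Perplexity = 2^(Shannon entropy in bits).
perplexity <- function(p) {
  p <- p[p > 0]
  2^(-sum(p * log2(p)))
}

# Bisection on sigma: perplexity grows with sigma, so narrow the bracket
# until it matches the target. The bounds are assumptions of this sketch.
find_sigma <- function(d2, target, lower = 1e-3, upper = 1e3, iters = 100) {
  for (k in seq_len(iters)) {
    sigma <- sqrt(lower * upper)             # geometric midpoint
    if (perplexity(cond_probs(d2, sigma)) > target) {
      upper <- sigma
    } else {
      lower <- sigma
    }
  }
  sigma
}

# Squared distances from one point to 9 others (illustrative values).
d2 <- (1:9)^2
sigma_5 <- find_sigma(d2, target = 5)
perp_5 <- perplexity(cond_probs(d2, sigma_5))   # approximately 5
perp_uniform <- perplexity(rep(1 / 9, 9))       # uniform over 9 neighbours: 9, up to rounding
```

A uniform distribution over k neighbours has perplexity k, which is why the parameter can be read as an effective neighbourhood size.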

See https://distill.pub/2016/misread-tsne/ for examples of how perplexity affects the results

https://medium.com/towards-data-science/reducing-dimensionality-from-dimensionality-reduction-techniques-f658aec24dfe

https://www.analyticsvidhya.com/blog/2017/01/t-sne-implementation-r-python/

http://www.wikicoursenote.com/wiki/Visualizing_Data_using_t-SNE

https://lvdmaaten.github.io/tsne/

https://www.oreilly.com/learning/an-illustrated-introduction-to-the-t-sne-algorithm

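library(Rtsne)
set.seed(42)  # t-SNE is stochastic; fix the seed for reproducible output
# data_matrix is a placeholder for your numeric matrix (rows = samples)
tsne_out <- Rtsne(data_matrix, check_duplicates = FALSE)
# tsne_out$Y holds the low-dimensional embedding, one row per sample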
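
A fuller sketch of the same call, comparing two perplexity values side by side. It assumes the Rtsne package is installed and uses the built-in iris measurements as stand-in data; the helper name embed and the perplexity values 5 and 40 are arbitrary choices for illustration.

```r
# Sketch: compare two perplexity values on the built-in iris data.
library(Rtsne)

# Drop duplicate rows rather than disabling the duplicate check.
X <- unique(as.matrix(iris[, 1:4]))

# Hypothetical helper: run t-SNE with a given perplexity, return the embedding.
embed <- function(perp) {
  set.seed(42)                        # same seed so only perplexity differs
  Rtsne(X, dims = 2, perplexity = perp)$Y
}

Y_low  <- embed(5)
Y_high <- embed(40)

# Plot the two embeddings next to each other.
op <- par(mfrow = c(1, 2))
plot(Y_low,  main = "perplexity = 5",  xlab = "t-SNE 1", ylab = "t-SNE 2")
plot(Y_high, main = "perplexity = 40", xlab = "t-SNE 1", ylab = "t-SNE 2")
par(op)
```

Rtsne rejects a perplexity that is too large for the number of samples, so smaller data sets need smaller values.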