
ZombieRickyB t1_iy70dqa wrote

Being good for visualization does not mean a space is good for working in. The nonisometric nature is the killer. Your space is fundamentally distorted, so what you input is not necessarily reflected in the visualizations.

Here is a Stack Exchange thread discussing the main issue (the same holds for UMAP), but I'll highlight a different example: https://stats.stackexchange.com/questions/263539/clustering-on-the-output-of-t-sne

Probably the most famous example of why you really can't assume meaning in general comes from the stereographic projection. This is a diffeomorphism between the sphere minus a point and the plane. Points that are close together on the punctured sphere, specifically around the puncture, absolutely need not be close on the plane; they'll actually be quite far apart. The higher the conformal distortion of the diffeomorphisms you work with, the worse your results get. Is there data that is reasonably "sphere-valued"? Absolutely, I see it a bunch. Any attempt to flatten it will destroy the geometry in some fashion. This is just one example.
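
To make the stretching near the puncture concrete, here is a minimal sketch in plain Python. It assumes the unit sphere, projection from the north pole (0, 0, 1) onto the plane z = 0, and pairs of points sitting on opposite sides of the pole; the function names are made up for illustration.

```python
import math

def sphere_point(colatitude, longitude):
    """Point on the unit sphere; colatitude is the angle from the north pole."""
    s = math.sin(colatitude)
    return (s * math.cos(longitude), s * math.sin(longitude), math.cos(colatitude))

def stereographic(p):
    """Project from the north pole onto the plane z = 0 (undefined at the pole)."""
    x, y, z = p
    return (x / (1.0 - z), y / (1.0 - z))

def great_circle_distance(p, q):
    """Geodesic distance between two points on the unit sphere."""
    dot = sum(a * b for a, b in zip(p, q))
    return math.acos(max(-1.0, min(1.0, dot)))

def stretch(colatitude):
    """Ratio of plane distance to sphere distance for two points at the
    same colatitude on opposite sides of the pole."""
    a = sphere_point(colatitude, 0.0)
    b = sphere_point(colatitude, math.pi)
    return math.dist(stereographic(a), stereographic(b)) / great_circle_distance(a, b)

if __name__ == "__main__":
    # Walk from near the puncture (north pole) to near the south pole.
    for colat in (0.001, 0.01, 0.1, 1.0, math.pi - 0.01):
        print(f"colatitude={colat:.3f}  plane/sphere distance ratio={stretch(colat):.3f}")
```

The ratio blows up as the pair approaches the puncture and settles near one half at the opposite pole, so a fixed distance on the plane corresponds to very different distances on the sphere depending on where you are.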

You have two things with these setups that put you at risk: you're taking a minimum of some sort, which need not be a smooth mapping, and generally a log of something appears somewhere, which fundamentally changes the geometry. It's there, it's just usually hidden because someone at some point is assuming something is approximately Gaussian. That's why the energies look the way they do.

2

olmec-akeru OP t1_iy7bv3u wrote

You can resolve the isometric constraint by using a local distance metric dependent on the local curvature: hint, look at the Riemann curvature tensor.

1

backtorealite t1_iy7z5lu wrote

But I feel like this is equivalent to a statistician telling me not to trust my XGBoost model with 99% accuracy while being fine with my linear model with 80% accuracy. If it works, it works. Unrealistic data transformations happen in all types of models, and as long as you aren't just selecting the prettiest picture that you arrived at by chance, I see no problem with relying on an unsupervised transformation that may consist of some unrealistic transformations if it is still highly effective at getting what you want. If I know my data has interactions and nonlinear effects but don't know which variables have them, a UMAP or t-SNE transformation to two dimensions seems like a perfectly reasonable option, and preferable to PCA in that situation. I feel like the problems you describe are mostly solved by just adjusting the parameters and making sure the clusters you find are robust to those alterations.
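
One way to make "robust to those alterations" checkable is to compare the cluster labels you get under different settings. A rough sketch in plain Python, assuming you already have one label list per run (the labels and run names below are placeholders, not real results), using the adjusted Rand index as the agreement score:

```python
from collections import Counter
from itertools import combinations
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Chance-corrected agreement between two labelings of the same points.
    1.0 means identical partitions (label names may differ); values near 0
    mean agreement no better than chance."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must cover the same points")
    n = len(labels_a)
    if n < 2:
        return 1.0
    # Count point pairs that share a cluster in both, in A only, in B only.
    pairs_joint = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    pairs_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = pairs_a * pairs_b / comb(n, 2)
    max_index = 0.5 * (pairs_a + pairs_b)
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings are one big cluster
    return (pairs_joint - expected) / (max_index - expected)

def min_pairwise_agreement(labelings):
    """Worst-case agreement across all pairs of runs."""
    return min(adjusted_rand_index(a, b) for a, b in combinations(labelings, 2))

if __name__ == "__main__":
    # Placeholder labels standing in for clusterings of three embedding runs.
    runs = {
        "run_a": [0, 0, 0, 1, 1, 1, 2, 2, 2],
        "run_b": [1, 1, 1, 0, 0, 0, 2, 2, 2],  # same partition, relabeled
        "run_c": [0, 0, 1, 1, 1, 1, 2, 2, 0],  # some points moved
    }
    print("worst pairwise ARI:", min_pairwise_agreement(list(runs.values())))
```

If the worst pairwise score drops sharply when a hyperparameter changes, the clusters are probably tied to that setting. Note this only measures stability across runs; it says nothing about how much the embedding distorted the original distances.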

1

ZombieRickyB t1_iy94v38 wrote

I mean, yeah, you're not wrong. If it works for you, it works for you. It's problem-space dependent, and there's virtually no research suggesting how much, if at all, things will be distorted in the process for given scenarios. For my work, I need theoretical bounds on conformal/isometric distortion; the distances involved are infinitely more important than classification accuracy. I work in settings where near-perfect classification accuracy is absolutely not expected, so well-separated clusters would just call the results into question.

There have been a number of cases, both theoretical and in practice, where t-SNE and UMAP give results of questionable reliability. I'm sure I could find an example in my space with little effort as well, and I'd rather go through some nonlinear transforms whose workings I know well than spend a bunch of time tuning optimization hyperparameters, which could take forever.

1