ProdigyManlet t1_iy5xbam wrote

Why's UMAP not good for downstream clustering?

2

ZombieRickyB t1_iy60acn wrote

A large amount of distortion is introduced in the last step of UMAP, namely the cross-entropy optimization. The output is designed to look pretty and separate things, at the risk of distorting distances and the underlying "space" in the process. Clustering quality risks being questionable.

3

backtorealite t1_iy6g1c8 wrote

Does this hold true for tsne? Can tsne be used for downstream clustering?

1

ZombieRickyB t1_iy6md04 wrote

Yep, same idea. You're optimizing a function, and there are zero guarantees on any particular distance-based distortion. You can use it for visualization but little else.

1

backtorealite t1_iy6uivr wrote

But saying it’s good for visualization is equivalent to saying it’s good for decomposing data into a 2D framework. So either it’s completely useless and shouldn’t be used for visualization or has some utility in downstream analysis. Doesn’t really make sense to say both. And we all know it’s not completely useless so I think it’s a bit unfair to say it should only ever be used for visualization.

2

ZombieRickyB t1_iy70dqa wrote

Being good for visualization does not mean it's good for working in. The nonisometric nature is the killer. Your space is fundamentally distorted; what you input is not necessarily reflected in the visualizations.

Here is a stackexchange thread discussing the main issue (same holds for UMAP) but I'll highlight a different example: https://stats.stackexchange.com/questions/263539/clustering-on-the-output-of-t-sne

Probably the most famous example showing why you really can't assume meaning in general comes from the stereographic projection. This is a diffeomorphism between the sphere without a point and the plane. Points close by on the punctured sphere, specifically around the puncture, absolutely need not be close on the plane; they'll actually be quite far apart. The more conformal distortion you allow in your diffeomorphisms, the worse your results get. Is there data that is reasonably "sphere-valued"? Absolutely, I see it a bunch. Any attempt to flatten it will destroy the geometry in some fashion. This is just one example.
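A minimal sketch of the stereographic example, assuming the unit sphere punctured at the north pole and projected onto the plane z = 0. The helper names (`stereographic`, `opposite_pair`, `distances`) are made up for illustration. It compares the same pair of nearby points placed once around the puncture and once around the opposite pole:

```python
import numpy as np

def stereographic(points):
    """Project unit-sphere points (north pole excluded) onto the plane z = 0."""
    points = np.asarray(points, dtype=float)
    x, y, z = points[..., 0], points[..., 1], points[..., 2]
    return np.stack([x / (1.0 - z), y / (1.0 - z)], axis=-1)

def opposite_pair(theta, pole=+1):
    """Two points at angle theta from a pole, on opposite sides of it.

    pole=+1 places them around the puncture (north pole),
    pole=-1 places them around the south pole.
    """
    s, c = np.sin(theta), np.cos(theta)
    return np.array([[s, 0.0, pole * c], [-s, 0.0, pole * c]])

def distances(pair):
    """Return (geodesic distance on the sphere, Euclidean distance on the plane)."""
    a, b = pair
    on_sphere = np.arccos(np.clip(np.dot(a, b), -1.0, 1.0))
    pa, pb = stereographic(pair)
    on_plane = np.linalg.norm(pa - pb)
    return on_sphere, on_plane

theta = 0.01  # illustrative choice: both pairs are 2 * theta apart on the sphere
near_puncture = distances(opposite_pair(theta, pole=+1))
far_from_puncture = distances(opposite_pair(theta, pole=-1))

print("near puncture:     sphere %.4f -> plane %.4f" % near_puncture)
print("far from puncture: sphere %.4f -> plane %.4f" % far_from_puncture)
```

Both pairs are 0.02 apart on the sphere. On the plane, the pair around the puncture lands about 400 apart, while the pair around the opposite pole lands about 0.01 apart, so equal input distances map to output distances that differ by several orders of magnitude.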

You have two things with these setups that put you at risk: you're taking a minimum of some sort, which never needs to be a smooth mapping, and generally a log of something appears somewhere, which fundamentally changes geometry. It's there, it's just usually hidden because someone at some point is assuming something is approximately Gaussian. That's why the energies look the way they do.

2

olmec-akeru OP t1_iy7bv3u wrote

You can resolve the isometric constraint by using a local distance metric dependent on the local curvature: hint, look at the Riemann curvature tensor.

1

backtorealite t1_iy7z5lu wrote

But I feel like this is equivalent to a statistician telling me to not trust my XGBoost model with 99% accuracy but is fine with my linear model with 80% accuracy. If it works, it works. Unrealistic model data transformations happen in all types of models, and as long as you aren’t just selecting the prettiest picture that you arrived at by chance, I see no problem with relying on an unsupervised transformation that may consist of some unrealistic transformations if it is still fundamentally highly effective in getting what you want. If I know my data has interaction and nonlinear effects but don’t know which variables will have such effects, it seems like a UMAP or tsne transformation to two dimensions is a perfectly reasonable option and preferable to PCA in that situation. I feel like the problems you describe are mostly solved by just adjusting the parameters and making sure the clusters you find are robust to those alterations.

1

ZombieRickyB t1_iy94v38 wrote

I mean, yeah, you're not wrong. If it works for you, it works for you. It's problem space dependent, and there's virtually no research that exists suggesting how much, if at all, things will be distorted in the process for given scenarios. For my work, I need to have theoretical bounds on conformal/isometric distortion; the distances involved are infinitely more important than classification accuracy. I work in settings where near perfect classification accuracy is absolutely not expected, so well-separated clusters would just call the results into question.

There have been a number of cases, both theoretical and in practice, where t-SNE and UMAP give results with questionable reliability. I'm sure I could get an example in my space with little effort as well, and I'd rather go through some nonlinear transforms I know well in terms of how they work than spend a bunch of time tuning optimization hyperparameters that could take forever.

1

trutheality t1_iy953rr wrote

It's "good" for visualization in the sense that it can give you something to look at, but it's not really good for visualization. You can't even guarantee that the nearest neighbor of a point in the projection is its nearest neighbor in the input space.

This paper demonstrates that you can make the output look like anything you want and still minimize the UMAP/tSNE objectives: https://www.biorxiv.org/content/10.1101/2021.08.25.457696v3

1

resented_ape t1_iybprpj wrote

FWIW, I don't think that is what the paper demonstrates. The Picasso method the authors introduce uses a totally different cost function based on distance reconstruction. For a specific set of metrics the authors are interested in, they say that Picasso produces results comparable with UMAP and t-SNE. But it's not the UMAP or t-SNE objective.

With the scRNAseq dataset in the python notebook at the github page for picasso, I found that for the metrics one might usually be interested in with UMAP and t-SNE, e.g. neighborhood preservation (what proportion of the k-nearest neighbors in the input and output space are preserved) or Spearman rank correlation (or triplet ordering preservation) of input vs output distances, Picasso did (quite a bit) worse than UMAP and t-SNE.
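For concreteness, here is a rough sketch of the two metrics mentioned above, k-nearest-neighbor preservation and Spearman rank correlation of input vs output distances. This is not the code from the Picasso notebook or from any UMAP/t-SNE library; the function names are made up, and the "embedding" at the bottom is just the first two coordinates of random data, standing in for whatever embedding you want to evaluate.

```python
import numpy as np

def pairwise_distances(X):
    """Dense Euclidean distance matrix; fine for a few thousand points."""
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def knn_indices(D, k):
    """Indices of the k nearest neighbors of each point, excluding itself."""
    D = D.copy()
    np.fill_diagonal(D, np.inf)
    return np.argsort(D, axis=1)[:, :k]

def neighborhood_preservation(X_high, X_low, k=10):
    """Mean fraction of each point's k nearest neighbors shared by both spaces."""
    nn_high = knn_indices(pairwise_distances(X_high), k)
    nn_low = knn_indices(pairwise_distances(X_low), k)
    overlap = [len(set(a) & set(b)) for a, b in zip(nn_high, nn_low)]
    return float(np.mean(overlap)) / k

def spearman_distance_correlation(X_high, X_low):
    """Spearman correlation between input and output pairwise distances.

    Ranks come from a double argsort, which assumes no tied distances
    (reasonable for continuous data, not for data with duplicates).
    """
    iu = np.triu_indices(len(X_high), k=1)
    d_high = pairwise_distances(X_high)[iu]
    d_low = pairwise_distances(X_low)[iu]
    r_high = np.argsort(np.argsort(d_high))
    r_low = np.argsort(np.argsort(d_low))
    return float(np.corrcoef(r_high, r_low)[0, 1])

# Stand-in data: 200 random points in 10-D, "embedded" by keeping two coordinates.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
Y = X[:, :2]

print("kNN preservation:", neighborhood_preservation(X, Y, k=10))
print("Spearman of distances:", spearman_distance_correlation(X, Y))
```

A perfect embedding scores 1.0 on both; swapping `Y` for the output of UMAP, t-SNE, or Picasso on the same `X` would let you compare methods on equal footing.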

This might not be relevant for downstream scRNAseq workflows; I will take the authors' word on that. At any rate, on my machine Picasso runs very slowly, and I found its output visually unsatisfactory with the default settings on the other datasets I tried it with (e.g. MNIST), so I have been unable to generate a similar analysis for a wide range of datasets. So take that for what it's worth.

1

proMatrixMultiplier t1_iy6hobk wrote

Does this point hold true for parametric umap as well?

1

ZombieRickyB t1_iy6qf71 wrote

No reason it wouldn't. You might be able to prove some bounds on the amount of distortion, but the fundamental issue is still present.

1

olmec-akeru OP t1_iy7bppp wrote

Said differently: it's not an embedding on a continuous manifold? The construction of simplexes is such that infinite curvature could exist?

1

ZombieRickyB t1_iy957ed wrote

I mean I imagine it is an embedding in the rigorous sense in some cases. Can't really prove it, though. The issue isn't really about curvature so much as distances, though. Can't flatten an object homeomorphic (not necessarily diffeomorphic) to a sphere without introducing mega distortion in the metric space sense.

1

trutheality t1_iy92ygu wrote

As u/ZombieRickyB said, the short answer is that it distorts distances to the point that you can't rely on them in downstream clustering.

There are two papers that do a really good deep dive into it:

This one: https://www.biorxiv.org/content/10.1101/2021.08.25.457696v1 where they both show that the distances pretty much have to be distorted and that the minimizer of the objective is such that you can make the output look pretty much like anything while minimizing.

And this one: https://jmlr.org/papers/volume22/20-1061/20-1061.pdf that studies which aspects of the objective functions of these methods affect the structure at different scales.

2