Submitted by olmec-akeru t3_z6p4yv in MachineLearning
new_name_who_dis_ t1_iy3jblq wrote
I’d say that PCA is still the most useful method. The fact that it’s very quick and is a linear transform makes it very easy to use and to interpret.
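A minimal sketch of that point, with illustrative data and variable names of my own (not from the thread): because PCA is a linear map, the reduction is just a centring followed by a projection onto the principal directions, which you can inspect directly.

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(500, 12)              # 500 samples, 12 features (illustrative)
pca = PCA(n_components=2).fit(X)

# The reduction is a linear transform: centre, then project onto the
# principal directions (the rows of components_).
X_reduced = (X - pca.mean_) @ pca.components_.T

# Each output dimension is an explicit weighted sum of the original features,
# so the weights themselves can be read off and interpreted.
print(pca.components_.shape)             # (2, 12)
np.testing.assert_allclose(X_reduced, pca.transform(X), atol=1e-8)
```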
cptsanderzz t1_iy4ni89 wrote
How does it make it more interpretable? I feel like wrapping 12 different features into 2 loses a lot of explainability. I’m just learning about the trade-offs of dimensionality reduction, so this is a genuine question.
ZombieRickyB t1_iy4ox9u wrote
PCA doesn't just give a visual embedding, it gives an explicit coordinate system. In a 2d dimensionality-reduction example, (1, 0) naturally corresponds to a particular unit vector in the initial space. If you know what your coordinates mean in that space, that gives guidance. Those unit vectors are eigenvectors (of the data's covariance matrix) in the usual sense.
A nice example of this: take a bunch of, say, black-and-white images of faces, vectorize them, perform PCA, take one of the eigenvectors, turn it back into an image, and display it; you get something resembling a face, at least for the first few components. By construction these vectors are orthogonal, so each one captures a mechanism of variation that tends to be at least a little interpretable.
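A hedged sketch of that "eigenfaces" example; the Olivetti faces dataset and the matplotlib display are my own choices for illustration, not the commenter's setup:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_olivetti_faces
from sklearn.decomposition import PCA

faces = fetch_olivetti_faces()              # 400 greyscale faces, 64x64, vectorised to 4096
pca = PCA(n_components=16).fit(faces.data)

# Each principal direction lives in the original 4096-d pixel space,
# so it can be reshaped back into a 64x64 image (an "eigenface").
fig, axes = plt.subplots(2, 4, figsize=(8, 4))
for i, ax in enumerate(axes.ravel()):
    ax.imshow(pca.components_[i].reshape(64, 64), cmap="gray")
    ax.set_title(f"PC {i + 1}")
    ax.axis("off")
plt.show()
```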
new_name_who_dis_ t1_iy4ol7g wrote
It depends on what data you're working with and what you're trying to do. For example for me, I've worked a lot with 3d datasets of meshes of faces and bodies that are in correspondence. And I actually used autoencoders to compress them to the same dimensions as PCA and compared the two.
Basically with the network I'd get less error on reconstruction (especially at lower dimensions). However, the beauty of the PCA reduction was that one dimension was responsible for the size of the nose on the face, another was responsible for how wide or tall the head is, etc.
And you don't get such nice properties from a fancy VAE latent space. Well, you can get a nicely disentangled latent space, but it usually doesn't happen for free; you often need to add even more complexity to get something that clean. With PCA, it's there by design.
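A sketch of the PCA side of the comparison described above: compress to k dimensions, reconstruct, and measure the error. The flattened-mesh data here is synthetic (the commenter's actual mesh datasets aren't available); an autoencoder with a bottleneck of the same size k would be trained to minimise the same kind of error.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 300))        # stand-in for e.g. 100 vertices x 3 coords, flattened

for k in (4, 16, 64):
    pca = PCA(n_components=k).fit(X)
    X_hat = pca.inverse_transform(pca.transform(X))   # project down, then reconstruct
    err = np.mean((X - X_hat) ** 2)
    print(f"k={k:3d}  mean reconstruction error: {err:.4f}")
```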
olmec-akeru OP t1_iy7ai6s wrote
>beauty of the PCA reduction was that one dimension was responsible for the size of the nose
I don't think this always holds true. You're just lucky that your dataset has its variation confined in such a way that individual eigenvectors line up with visual features. There is no mathematical property of PCA that makes your statement true.
There have been some attempts to formalise something like what you have described. The closest I've seen is the beta-VAE: https://lilianweng.github.io/posts/2018-08-12-vae/
new_name_who_dis_ t1_iy84a83 wrote
It’s not really luck. There is variation in sizes of noses (it’s one of the most varied features of the face) and so that variance is guaranteed to be represented in the eigenvectors.
And beta-VAEs are one of the possible things you can try to get a disentangled latent space yes, although they don’t really work that well in my experience.
olmec-akeru OP t1_iy8ajq0 wrote
> the beauty of the PCA reduction was that one dimension was responsible for the size of the nose
You posit that an eigenvector will represent the nose when there are meaningful variations of scale, rotation, and position?
This is very different to saying all variance will be explained across the full set of eigenvectors (which very much is true).
new_name_who_dis_ t1_iy8b0jr wrote
It was just an example. Sure, not all sizes of nose are found along the same eigenvector.
vikigenius t1_iy4tb05 wrote
That's not PCA's fault. No matter what technique you use, if you try to represent a 768-d vector in a 2d plane via dimensionality reduction, you will lose a lot of explainability.
The real question is which properties you care about preserving when doing dimensionality reduction, and validating that your technique preserves as much of them as possible.
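One concrete version of "validate what your reduction preserves" is to check how well local neighbourhoods survive the projection. The sketch below uses scikit-learn's trustworthiness score as one possible metric; the data is synthetic and just for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import trustworthiness

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 768))                 # high-dimensional points (synthetic)

X_2d = PCA(n_components=2).fit_transform(X)

# 1.0 means the nearest neighbours in 2d were also near neighbours in the
# original space; lower values mean the 2d neighbourhoods are less faithful.
print(trustworthiness(X, X_2d, n_neighbors=10))
```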
olmec-akeru OP t1_iy7a8fe wrote
So this may not be true: the surface of a Riemannian manifold is infinite, so you can encode infinite knowledge onto its surface. From there, the diffeomorphic property allows one to traverse the surface and generate explainable, differentiable vectors.
vikigenius t1_iy7kxuu wrote
Huh? Diffeomorphisms are dimensionality-preserving. You can't have a diffeomorphism from R^n to R^2 unless n = 2; that's the only way your differentiable mapping can be bijective.
So I am not sure how the diffeomorphisms guarantee that you can have lossless dimensionality reduction.
What can happen is that your data inherently lies on a lower-dimensional manifold. If you have a subset of R^n with an inherent dimensionality of just 2, then you can trivially represent it in 2 dimensions. For example, if you have a 3d space where the 3rd dimension is an exact linear combination of the 1st and 2nd, then its inherent dimensionality is 2 and you can obviously losslessly reduce it to 2d.
But most definitely not all datasets have an inherent dimensionality of 2.
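A sketch of that 3d example: the third coordinate is an exact linear combination of the first two, so the data's inherent dimension is 2 and PCA recovers a lossless 2d representation. The data and coefficients are made up for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
xy = rng.normal(size=(1000, 2))
z = 2.0 * xy[:, 0] - 0.5 * xy[:, 1]            # exact linear combination of the first two
X = np.column_stack([xy, z])                   # shape (1000, 3)

pca = PCA(n_components=3).fit(X)
print(pca.explained_variance_ratio_)           # third value is ~0: only 2 directions carry variance

X_2d = PCA(n_components=2).fit_transform(X)    # lossless 2d representation of this dataset
```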
gooblywooblygoobly t1_iydf4z2 wrote
A super trivial example is that a (hyper)plane is a Riemannian manifold. Since we know that PCA is lossy, and PCA projects onto a (hyper)plane, it can't be that projecting onto a manifold is enough to perfectly preserve information.
Dylan_TMB t1_iy6ie5a wrote
Compared to non-linear methods like kernel PCA and t-SNE it is super interpretable. Also, I don't know of any dimensionality-reduction technique that doesn't squeeze features into fewer features; that's kind of the point, no?
olmec-akeru OP t1_iy7amgl wrote
I'm not sure: if you think about t-SNE, it's trying to minimise some form of the Kullback–Leibler divergence. That means it's trying to group similar observations together in the embedding space. That's quite different from "more features into fewer features".
Dylan_TMB t1_iy7brke wrote
I would disagree. t-SNE takes points in a higher-dimensional space and attempts to find a transformation that places them in a lower-dimensional embedding space while preserving the similarities from the original space. In the end, each point has its original vector (more features) mapped to a lower-dimensional vector (fewer features). The mapping is non-linear, but that is still what the operation produces.
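A sketch of that point: t-SNE still maps each point's original feature vector to a lower-dimensional one, just via a non-linear, similarity-preserving optimisation. Scikit-learn's TSNE and the synthetic data are my own choices for illustration.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 50))                 # 300 points, 50 features each (synthetic)

X_2d = TSNE(n_components=2, perplexity=30, init="pca",
            random_state=0).fit_transform(X)

print(X.shape, "->", X_2d.shape)               # (300, 50) -> (300, 2)
```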
olmec-akeru OP t1_iy7i546 wrote
Heya! Appreciate the discourse, it's awesome!
As a starting point, I've shared the rough description from wikipedia on the t-SNE algorithm:
> The t-SNE algorithm comprises two main stages. First, t-SNE constructs a probability distribution over pairs of high-dimensional objects in such a way that similar objects are assigned a higher probability while dissimilar points are assigned a lower probability. Second, t-SNE defines a similar probability distribution over the points in the low-dimensional map, and it minimizes the Kullback–Leibler divergence (KL divergence) between the two distributions with respect to the locations of the points in the map. While the original algorithm uses the Euclidean distance between objects as the base of its similarity metric, this can be changed as appropriate.
So the algorithm is definitely trying to minimise the KL divergence. In trying to minimise the KL divergence between the two distributions, it is trying to find a mapping such that dissimilar points end up farther apart in the embedding space.
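For reference, the objective the quoted passage describes is the standard t-SNE cost, where the p_ij are the pairwise similarities in the original space, the q_ij those in the low-dimensional map, and the map positions y_i are optimised to minimise the divergence:

```latex
C = \mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}},
\qquad
q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}
              {\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}
```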