Submitted by Desi___Gigachad t3_126rgih in MachineLearning
FermiAnyon t1_jee03oc wrote
Reply to comment by turnip_burrito in [D] Turns out, Othello-GPT does have a world model. by Desi___Gigachad
My pretty tenuous grasp of the idea makes me thing stuff like... if you're measuring Euclidean distance or cosine similarity between two points that represent concepts that are completely unrelated, what would that distance or that angle be? And that, ideally, all things that are completely unrelated, if you did a pairwise comparison, would have that distance or that angle. And that the embedding space is large enough to accommodate that. And it sounds to me like kind of a limit property that it may only be possible to approximate because there's like lots of ideas and only so many dimensions to fit them in...
turnip_burrito t1_jee0mso wrote
Here's what GPT4 chimed in with (it lines up with what I've heard and read before):
===========
In an N-dimensional space with N >> 1, where M random vectors are dispersed with each coordinate sampled from a uniform distribution within a bounded range, we can make a few observations about the distances between these vectors:
High-dimensional space behavior: In high-dimensional spaces, the behavior of distances between random points differs significantly from what we would expect in lower-dimensional spaces like 2D or 3D. In high-dimensional spaces, most points tend to be far away from each other, and the distances between them are often more similar than they would be in lower-dimensional spaces.
Concentration of distances: As the dimensionality N increases, the pairwise distances between the M random vectors tend to concentrate around a specific value. The distances between points in high-dimensional spaces tend to be less varied than in lower-dimensional spaces, and the majority of the distances will be relatively close to the expected value. This phenomenon is known as the "concentration of measure."
Sparse representation: In high-dimensional spaces, the points are often sparsely distributed, which means that there is a lot of empty space between them. This can lead to a counterintuitive property, where increasing the number of dimensions might actually make it easier to distinguish between the points, as they become more "isolated" in the space.
Curse of dimensionality: While the above-mentioned properties might sometimes be useful, high-dimensional spaces can also give rise to the "curse of dimensionality." This term refers to various challenges that arise when working with high-dimensional data, such as increased computational complexity, difficulty in visualizing the data, and the need for a larger number of samples to obtain statistically meaningful results.
To summarize, in an N-dimensional space with N >> 1, the pairwise distances between M random vectors will generally be far from each other and concentrate around a specific value. High-dimensional spaces exhibit unique properties like the concentration of measure and sparse representation, but they also come with challenges like the curse of dimensionality
FermiAnyon t1_jee3nc1 wrote
What did you prompt it with? And what do you think of its answer?
turnip_burrito t1_jefysiz wrote
My prompt:
> Suppose I have an N>>1 dimensional space, finite in extent along any given axis, in which a set of M random vectors are dispersed (each coordinate of each vector is randomly sampled from a uniform distribution spanning some bounded range of the space). What can we say about the distances in this space between the M vectors?
I left my prompt open ended to not give it any ideas one way or another.
Its response makes sense to me. The standard deviation of a set of random samples from a uniform distribution centered at mean 0, which is proportional to the distance calculated here, should shrink as dimension N grows. If N is large, then the distribution of pairwise distances will narrow until nearly all points are roughly the same distance from each other. (The random sampling is a way to build in lack of correlation, like how you mentioned unrelated ideas)
Of course, the reverse is also true: if dimension N is small, then originally "far" points will become closer or farther (which one effect exactly is unpredictable depending on which dimensions are removed) because the averaging over random sample fluctuations disappears.
FermiAnyon t1_jegiycj wrote
Pretty neat stuff. Fits well with the conversation we were having. I guess a salient question how large an embedding space do you need before performance in any given task plateaus.
Except that they're not random vectors in the original context.
turnip_burrito t1_jegu7uk wrote
Yeah I made the simplification of random vectors myself just to approximate what uncorrelated "features" in an embedding space could be like.
One thing that's relevant for embedding space size Takens theorem: https://en.wikipedia.org/wiki/Takens%27s_theorem?wprov=sfla1
If you have an originally D dimensional system (measured using correlation or information dimension for example), and you time delay embed data from the system, you at most (can be lower) need 2*D+1 embedding dimensions to ensure no false nearest neighbors.
This sets an upper bound if you use time delays. Now, for a *non-*time delayed embedding, I don't know the answer. I asked GPT4 and it said no analytical method for determining embedding dimension M presently exists ahead of time. An experimental method does exist that you can perform before training a model: You need to grow the number of embedding dimensions M and calculate FNN every time M grows. Once FNN drops to near zero, then you've finally found a suitable M.
One neat part about all this is that if you have some complex D-dimensional manifold or distribution with features that "poke out" into different directions in the embedding space (imagine a wheel hub with spokes), then increasing the embedding space size M will also increase the distance between the spokes. If M gets large enough, all the spokes should be nearly equal in distance from each other, but points along a singular spoke are also far from each other in most directions except for just a small subset.
I don't think that making it super large would actually make learning on the data any easier though. Best to stick with close to the minimum embedding dimension M. If you get larger, then measurement noise in your data becomes more represented in the embedded distribution. These dynamics also unfold when you increase M, which means if you're trying to only predict the D-dimensional system, you'll have harder time because now you're predicting a (D+large#) dimensional system and the obviousness of the D-dimensional system distribution gets lost in the larger distribution.
Viewing a single comment thread. View all comments