Comments

You must log in or register to comment.

step21 t1_jearoln wrote

It means he says it has a representation of its world, not just statistics. He may or may not be right. (Also I didn’t read all of it yet, fing long.

2

Jadien t1_jeastxv wrote

I've only skimmed the link (and its sub-links), but the basic idea is this:

If you've trained a model to predict the next move in an Othello game, given the board state as an input, you can not necessarily conclude that the model also has the ability to perform similar tasks, like "Determine whether a given move is legal" or "Determine what the board state will be after executing a move". Those abilities might help a model predict the next move but are not required.

However:

> Context: A recent paper trained a model to play legal moves in Othello by predicting the next move, and found that it had spontaneously learned to compute the full board state - an emergent world representation.

In the process of optimizing the model's ability to predict moves, the model did also develop the ability to compute the next board state, given the initial state previous moves and predicted move (Thank you /u/ditchfieldcaleb).

The author's contribution:

> I find that actually, there's a linear representation of the board state! > This is evidence for the linear representation hypothesis: that models, in general, compute features and represent them linearly, as directions in space! (If they don't, mechanistic interpretability would be way harder)

Which is to say that the model's internal prediction of the next board state is fairly interpretable by humans: There's some square-ish set of activations in the model that correspond to the square-ish Othello board. That's another property of the model that is a reasonable outcome but isn't a foregone conclusion.

50

FermiAnyon t1_jeauumy wrote

This topic in general is super interesting...

So the big difference between humans and these large transformers, on paper, is that humans learn to model things in their environments whether it's tools or people or whatever and it's on that basis that we use analogy and make predictions about things. But we ultimately interact with a small number of inputs, basically our five senses... so the thing I find super interesting is the question of whether these models, even ones that just interact with text, are learning to model just the text itself or if they're actually learning models of things that, with more data/compute would enable them to model more things...

I guess the question at hand is whether this ability to model things and make analogies and abstract things is some totally separate process that we haven't started working with yet, or whether it's an emergent property of just having enough weights to basically be properly isotropic with regard to the actual complexity of the world we live in

27

yaosio t1_jebomzk wrote

A world model doesn't mean a model of the world, it means a model from the data it's been given. Despite not being told what an Othello board looks like there's an internal representation of an Othello board.

13

FermiAnyon t1_jebpsg3 wrote

Yeah, isotropic as in being the same in all directions. So we're probably all familiar with embedding space and the fact that the positional relationships between concepts in embedding space basically encodes information about those relationships. Isotropy in language models refers to the extent to which concepts which are actually unrelated appear unrelated in embedding space.

In other words, a model without this property might havre an embedding space that isn't large enough, but you're still teaching it things and the result is that you're cramming things into your embedding space that's too small, so unrelated concepts are no longer equidistant from other unrelated concepts, implying a relationship that doesn't really exist with the result being that the language model confuses things that shouldn't be confused.

Case in point: I asked chatgpt to give me an example build order for terrans in Broodwar and it proceeded to give me a reasonable sounding build order, except that it was mixing in units from Starcraft 2. Now no human familiar with the games would confuse units like that. I chalk that up to a lack of relevant training data, possibly mixed with an embedding space that's not large enough for the model to be isotropic.

That's my take anyway. I'm still learning ;) please someone chime in and fact check me :D

15

mattsverstaps t1_jedlch4 wrote

So is that saying that there is a kind of linear transformation happening between some space (the reality? Our personal model?) and the embedding space? I don’t know what embedding space is and I shouldn’t be here but you are saying interesting things.

5

turnip_burrito t1_jedrykf wrote

>In other words, a model without this property might havre an embedding space that isn't large enough, but you're still teaching it things and the result is that you're cramming things into your embedding space that's too small, so unrelated concepts are no longer equidistant from other unrelated concepts, implying a relationship that doesn't really exist with the result being that the language model confuses things that shouldn't be confused.

So False Nearest Neighbors?

5

turnip_burrito t1_jeds3mv wrote

>unrelated concepts are no longer equidistant from other unrelated concepts,

Are distances normally the same for all unrelated concepts in a very high dimensional space? Does this have to do with unrelated concepts having low correlation in coordinates, so random distances in each axis, and therefore on average the same distance between each pair of unrelated concepts as any other unrelated pair?

3

FermiAnyon t1_jee03oc wrote

My pretty tenuous grasp of the idea makes me thing stuff like... if you're measuring Euclidean distance or cosine similarity between two points that represent concepts that are completely unrelated, what would that distance or that angle be? And that, ideally, all things that are completely unrelated, if you did a pairwise comparison, would have that distance or that angle. And that the embedding space is large enough to accommodate that. And it sounds to me like kind of a limit property that it may only be possible to approximate because there's like lots of ideas and only so many dimensions to fit them in...

3

turnip_burrito t1_jee0mso wrote

Here's what GPT4 chimed in with (it lines up with what I've heard and read before):

===========

In an N-dimensional space with N >> 1, where M random vectors are dispersed with each coordinate sampled from a uniform distribution within a bounded range, we can make a few observations about the distances between these vectors:

High-dimensional space behavior: In high-dimensional spaces, the behavior of distances between random points differs significantly from what we would expect in lower-dimensional spaces like 2D or 3D. In high-dimensional spaces, most points tend to be far away from each other, and the distances between them are often more similar than they would be in lower-dimensional spaces.

Concentration of distances: As the dimensionality N increases, the pairwise distances between the M random vectors tend to concentrate around a specific value. The distances between points in high-dimensional spaces tend to be less varied than in lower-dimensional spaces, and the majority of the distances will be relatively close to the expected value. This phenomenon is known as the "concentration of measure."

Sparse representation: In high-dimensional spaces, the points are often sparsely distributed, which means that there is a lot of empty space between them. This can lead to a counterintuitive property, where increasing the number of dimensions might actually make it easier to distinguish between the points, as they become more "isolated" in the space.

Curse of dimensionality: While the above-mentioned properties might sometimes be useful, high-dimensional spaces can also give rise to the "curse of dimensionality." This term refers to various challenges that arise when working with high-dimensional data, such as increased computational complexity, difficulty in visualizing the data, and the need for a larger number of samples to obtain statistically meaningful results.

To summarize, in an N-dimensional space with N >> 1, the pairwise distances between M random vectors will generally be far from each other and concentrate around a specific value. High-dimensional spaces exhibit unique properties like the concentration of measure and sparse representation, but they also come with challenges like the curse of dimensionality

3

FermiAnyon t1_jee34lx wrote

Glad you're here. This would be a really interesting chat for like a bar or a meetup or stunting ;)

But yeah, I'm just giving my impressions. I don't want to make any claims of authority or anything as I'm self taught with this stuff...

But yeah, I have no idea how our brains do it, but when you're building a model whether it's a neural net or you're just factoring a matrix, you'll end up with a high dimensional representation that'll get used as an input to another layer or that'll just be used straight away for classification. It may be overly broad, but I think of all of those high dimensional representations as embeddings and the dimensionality available for encoding an embedding as the embedding space.

Like if you were into sports and you wanted to organize your room so that distance represents relationships between equipment. Maybe the baseball is right next to the softball and the tennis racket is close to the table tennis paddle, but they're a little farther away from the baseball stuff, then you've got some golf clubs and they're kind of in one area of the room because they all involve hitting things with another thing. Then your kite flying stuff and your fishing stuff and your street luge stuff is kind of as far apart as possible from the other stuff because it's not obvious to me anyway that they're related. Your room is a two dimensional embedding space.

When models do it, they just do it with more dimensions and more concepts, but they learn where to put things so that the relationships are properly represented and they just learn all that from lots of cleverly crafted examples.

4

FermiAnyon t1_jeetqrm wrote

I did say "basically". The point is it's finite and then we do lots of filtering and interpreting. But based on those inputs, we develop some kind of representation of the world and how we do that is completely mysterious to me, but I heard someone mention that maybe we use our senses to kind of "fact check" each other to develop more accurate models of our surroundings.

I figure multi modal models are really going to be interesting...

5

monks-cat t1_jefqotb wrote

Context radically changes the "distance" between concepts. So in your example isotropy isn't necessarily a desired property of a LLM. In poetry, for example, we combine two concepts that would seemingly be very far apart in the original space but should be mapped rather closely in the embedding.

​

The problem I see with this whole idea though is that a "concept" doesn't inherently seem to be represented by list of features. Two concepts interacting aren't necessarily the intersection of their features.

I'll try to see if I can come up with concrete examples in language.

2

mattsverstaps t1_jefvu45 wrote

So the extra dimensions are unnecessary? I just realised that there could be some situations in which non orthogonal dimensions are preferable. I can’t exactly think of them. Doesn’t it suggest a pattern in data if a mapping is found that reduces the dimension? Like I picture from linear algebra 101 finding a line that everything is a multiple of so one dimension would do and that line is a ‘pattern’? Sorry I’m high.

2

turnip_burrito t1_jefysiz wrote

My prompt:

> Suppose I have an N>>1 dimensional space, finite in extent along any given axis, in which a set of M random vectors are dispersed (each coordinate of each vector is randomly sampled from a uniform distribution spanning some bounded range of the space). What can we say about the distances in this space between the M vectors?

I left my prompt open ended to not give it any ideas one way or another.

Its response makes sense to me. The standard deviation of a set of random samples from a uniform distribution centered at mean 0, which is proportional to the distance calculated here, should shrink as dimension N grows. If N is large, then the distribution of pairwise distances will narrow until nearly all points are roughly the same distance from each other. (The random sampling is a way to build in lack of correlation, like how you mentioned unrelated ideas)

Of course, the reverse is also true: if dimension N is small, then originally "far" points will become closer or farther (which one effect exactly is unpredictable depending on which dimensions are removed) because the averaging over random sample fluctuations disappears.

2

Pas7alavista t1_jeg5dhh wrote

>so the extra dimensions are unnecessary

Yes one reason for embedding is to get extract relevant features.

Also, any finite dimensional inner product space has an orthonormal basis, and the math is easiest this way so there's not much of a reason to describe a space using non orthogonal dimensions. There is also nothing stopping you from doing so though.

>Doesn't it suggest a pattern in data if a mapping is found that reduces dimension

Yeah generally you wouldn't attempt to use ML methods on data where you think there is no pattern

>Something something Linear algebra

I think you might be thinking about the span and or basis but it's hard for me to interpret your question

2

mattsverstaps t1_jegdsq4 wrote

Yes the span, so if we discover that a set of points is actually all in the span of a line, that line is a kind of fact or pattern about the points. So probably there is an equivalent in higher dimensions. I am seeing there is a problem whereby we introduce our own bias in creating our model.

2

FermiAnyon t1_jegh3hd wrote

In this case, I'm using a fuzzy word "concept" to refer to anything that's differentiable from another thing. That includes things like context and semantics and whether a word is polysemantic and even whether things fit a rhyme scheme. Basically anything observable.

But again, I'm shooting from the hip

2

FermiAnyon t1_jegiycj wrote

Pretty neat stuff. Fits well with the conversation we were having. I guess a salient question how large an embedding space do you need before performance in any given task plateaus.

Except that they're not random vectors in the original context.

2

turnip_burrito t1_jegu7uk wrote

Yeah I made the simplification of random vectors myself just to approximate what uncorrelated "features" in an embedding space could be like.

One thing that's relevant for embedding space size Takens theorem: https://en.wikipedia.org/wiki/Takens%27s_theorem?wprov=sfla1

If you have an originally D dimensional system (measured using correlation or information dimension for example), and you time delay embed data from the system, you at most (can be lower) need 2*D+1 embedding dimensions to ensure no false nearest neighbors.

This sets an upper bound if you use time delays. Now, for a *non-*time delayed embedding, I don't know the answer. I asked GPT4 and it said no analytical method for determining embedding dimension M presently exists ahead of time. An experimental method does exist that you can perform before training a model: You need to grow the number of embedding dimensions M and calculate FNN every time M grows. Once FNN drops to near zero, then you've finally found a suitable M.

One neat part about all this is that if you have some complex D-dimensional manifold or distribution with features that "poke out" into different directions in the embedding space (imagine a wheel hub with spokes), then increasing the embedding space size M will also increase the distance between the spokes. If M gets large enough, all the spokes should be nearly equal in distance from each other, but points along a singular spoke are also far from each other in most directions except for just a small subset.

I don't think that making it super large would actually make learning on the data any easier though. Best to stick with close to the minimum embedding dimension M. If you get larger, then measurement noise in your data becomes more represented in the embedded distribution. These dynamics also unfold when you increase M, which means if you're trying to only predict the D-dimensional system, you'll have harder time because now you're predicting a (D+large#) dimensional system and the obviousness of the D-dimensional system distribution gets lost in the larger distribution.

2

Pas7alavista t1_jegu8de wrote

The span describes the entire space. It's a set of vectors that you can combine using addition and multiplication in order to obtain any other vector in the space. For example a spanning set over the real number plane would be {(1,0), (0,1)}. This particular set is also an orthonormal basis and you can think of each vector as representing two orthogonal dimensions. This is because their dot product is 0.

However, any set of two vectors that are not on the same line will span the real number plane. For example, {(1,1), (0,1)} spans the real number plane, but they are not orthogonal.

Overall though it is always important to be aware of your input space, and the features/dimensions that you use to represent it. You can easily introduce bias or just noise in a number of ways if you aren't thorough. One example would be not normalizing your data.

2

FermiAnyon t1_jeh5jr0 wrote

Yeah, I was just saying it's a limited number and that the specific number isn't important. The important thing is that there a limited number. That doesn't imply anything about infinity except that infinity is off the table as an option.

2