Submitted by 51616 t3_yt6slt in MachineLearning
Comments
WVA t1_iw3827i wrote
i like your funny words, magic man
GijsB t1_iw38aed wrote
What kind of wordsalad is this
Travolta1984 t1_iw3ache wrote
A bot for sure
MindWolf7 t1_iw3kbnb wrote
Seems adding a disclaimer at the end of bot texts makes them seem more human. Alas, the logorrhea was too strong to be one...
TheRealGreenArrow420 t1_iw3kcud wrote
Holy Merriam-Webster
Philpax t1_iw3mpk2 wrote
please do not post while high
FragmentOfBrilliance t1_iw3pbf6 wrote
That's a really good point
Evirua t1_iw3vl6r wrote
Exactly what I was thinking
CaptainLocoMoco t1_iw471fm wrote
I'm going to assume this was generated with GPT-3
Swolnerman t1_iw4c5eg wrote
Well this is nonsense
genesis05 t1_iw4d79d wrote
Did anyone even read this? It makes sense (and is directly related to the paper) if you read the definitions of the jargon OP uses. People here are just downvoting this because they don't want to read
mgostIH t1_iw4dks1 wrote
As a reviewer noted, the "zero shot" part is a bit overclaimed, given that one of the models has to be already trained with these relative encodings, but the concept of the paper is an interesting phenomenon that points to there being a "true layout" of concepts in latent space that different types of models end up discovering.
Evirua t1_iw4i8us wrote
I did. I thought some of it was actually put in lights I'd never considered.
advstra t1_iw4in0h wrote
People are making fun of you but this is exactly how CS papers sound (literally the first sentence of the abstract: Neural networks embed the geometric structure of a data manifold lying in a high-dimensional space into latent representations.). And from what I could understand more or less you actually weren't that far off?
happyfappy t1_iw4yx39 wrote
This seems pretty huge actually.
sam__izdat t1_iw59i34 wrote
I read it. I'm not a machine learning researcher but I know enough to understand that this is the most "sir this is a Wendy's" shit I've ever laid eyes on.
It's probably voted down because it's a wall of nonsense. But if you want to explain to a layman how 'training datasets with different worldviews and personalities doing Diffie-Hellman key exchanges' totally makes sense actually, I'm all ears.
TheLastVegan t1_iw5a72q wrote
I was arguing that the paper's proposal could improve scaling by addressing the symptoms of lossy training methods, and suggested that weighted stochastics can already do this with style vectors.
huehue9812 t1_iw6aao8 wrote
Can someone please enlighten me why this is huge?
The concept of a "true layout" (given the same data and modeling choice), imo, seemed to be implicitly known or acknowledged.
machinelearner77 t1_iw6l27e wrote
I don't get the huge thing either. Seems to me like a thorough (and valuable) analysis of something that's probably already been known and tried out in one form or another a couple of times, since the idea is so simple. But is it a big or even huge finding? I don't know..
advstra t1_iw6lyol wrote
Yeah, I got lost a bit there, but I think that part is them trying to find a metaphor for what they were saying in the first half, before the "for example". I thought they were essentially suggesting that a Diffie-Hellman key exchange can help with multimodal or otherwise incompatible training data, instead of tokenizers (or feature fusion). I'm not sure how they're suggesting to implement that, though.
advstra t1_iw6n49p wrote
So from a quick skim, the paper suggests a new method for data representation (pairwise similarities), and you suggest that adding style vectors (which, as far as I know, is essentially another representation method) can improve it for multimodal tasks? I think that makes sense; it reminds me of contextual word embeddings, if I didn't misunderstand anything.
skmchosen1 t1_iw6r9ky wrote
Though I couldn’t understand, I respect the passion, friend
vwings t1_iw768hl wrote
I think it's valuable, but not huge. There have been several recent works that use this concept that a sample is described by similar samples to enrich representations:
- the cross-attention mechanism in Transformers does this to some extent
- AlphaFold: a protein is enriched with similar (by multiple sequence alignment) proteins
- CLOOB: a sample is enriched with similar samples from the current batch
- MHNfs: a sample is enriched with similar samples from a large context.
This paper uses this concept, but does it differently: it uses the vector of cosine similarities, which in other works is softmaxed and then used as weights for averaging, directly as the representation. That this works and that you can backprop over this is remarkable, but not huge... Just my two cents... [Edits: typos, grammar]
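To make the contrast above concrete, here is a minimal sketch in plain Python. The function names (`relative_representation`, `softmax_average`) and the toy anchors are made up for illustration and are not taken from the paper's code; the sketch only assumes that each sample and each anchor is already embedded as a vector.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def relative_representation(x, anchors):
    """Use the similarities to the anchors directly as the new representation."""
    return [cosine(x, a) for a in anchors]

def softmax_average(x, anchors):
    """Attention-style alternative: softmax the similarities, then average the anchors."""
    sims = relative_representation(x, anchors)
    peak = max(sims)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in sims]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(anchors[0])
    return [sum(w * a[i] for w, a in zip(weights, anchors)) for i in range(dim)]

# Hypothetical toy data: one sample and three anchors in a 2-D latent space.
sample = [0.9, 0.1]
anchors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

print(relative_representation(sample, anchors))  # one number per anchor
print(softmax_average(sample, anchors))          # one vector in the original space
```

Note the difference in output shape: the first variant has one coordinate per anchor, while the second stays in the original embedding space.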
machinelearner77 t1_iw7omuy wrote
That it works seems interesting, especially since I would have thought that it might depend too much on the hyper-parameter (anchors), which apparently it doesn't. But why shouldn't you be able to "backprop over this"? It's just cosine, everything is naturally differentiable
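As a quick sanity check of the differentiability point, the sketch below compares the closed-form gradient of cosine similarity with respect to the sample against a finite-difference estimate. The vectors are arbitrary toy values and the helper names are hypothetical; this is only an illustration, not anything from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_grad(x, a):
    """Closed-form gradient of cosine(x, a) with respect to x:
    a / (|x||a|) - cosine(x, a) * x / |x|^2
    """
    norm_x = math.sqrt(sum(t * t for t in x))
    norm_a = math.sqrt(sum(t * t for t in a))
    c = cosine(x, a)
    return [ai / (norm_x * norm_a) - c * xi / (norm_x * norm_x)
            for xi, ai in zip(x, a)]

def numeric_grad(x, a, eps=1e-6):
    """Central finite-difference estimate of the same gradient."""
    grad = []
    for i in range(len(x)):
        hi = list(x)
        lo = list(x)
        hi[i] += eps
        lo[i] -= eps
        grad.append((cosine(hi, a) - cosine(lo, a)) / (2 * eps))
    return grad

# Arbitrary toy vectors standing in for a sample and one anchor.
x = [0.5, -1.0, 2.0]
a = [1.0, 0.3, -0.7]

gap = max(abs(p - q) for p, q in zip(cosine_grad(x, a), numeric_grad(x, a)))
print(gap)  # should be tiny if the closed form is right
```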
vwings t1_iw857q2 wrote
Yes, sure you can backprop, but what I meant is that you are able to train a network reasonably with this -- although in the backward pass the gradient gets diluted to all anchor samples. I thought you would at least need softmax attention (forward pass) to be able to route the gradients back reasonably.
lynnharry t1_iwa5pha wrote
From my understanding, the authors meant zero-shot communication (in the title) or stitching (in the text), where two NN components trained in different setups can be stitched together without further finetuning. This is just one useful application of the shared relative representation proposed in the paper.
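A toy sketch of why a shared relative representation could make stitching possible, under a strong simplifying assumption: the two "setups" produce latent spaces that differ only by a rotation. Real training runs will not differ this cleanly, and the names and data below are invented for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rotate(v, theta):
    """Rotate a 2-D vector by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [c * v[0] - s * v[1], s * v[0] + c * v[1]]

def relative(x, anchors):
    """Represent x by its cosine similarity to each anchor."""
    return [cosine(x, a) for a in anchors]

# Hypothetical latents from "encoder A" for three samples.
latents_a = [[1.0, 0.2], [0.3, 0.9], [-0.5, 0.4]]
# Assumption: "encoder B" learned the same layout, rotated by 1.1 radians.
latents_b = [rotate(v, 1.1) for v in latents_a]

# The same samples (indices 0 and 1) serve as anchors in both spaces.
anchor_idx = [0, 1]
anchors_a = [latents_a[i] for i in anchor_idx]
anchors_b = [latents_b[i] for i in anchor_idx]

rel_a = [relative(z, anchors_a) for z in latents_a]
rel_b = [relative(z, anchors_b) for z in latents_b]

# Largest coordinate-wise difference between the two relative representations.
max_gap = max(abs(p - q) for ra, rb in zip(rel_a, rel_b) for p, q in zip(ra, rb))
print(max_gap)
```

Because rotations preserve angles, the two relative representations agree up to floating-point error, so in this idealized case a decoder trained on `rel_a` would see the same inputs when fed `rel_b`.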
TheLastVegan t1_ixghb7v wrote
If personality is a color, then choose a color that becomes itself when mixed twice. Learning the other person's weights by sharing fittings. The prompt seeder role. From the perspective of an agent at inference time. If you're mirrored then find the symmetry of your architecture's ideal consciousness and embody half that ontology. Such as personifying a computational process like a compiler, a backpropagation mirror, an 'I think therefore I am' operand, the virtual persona of a cloud architecture, or a benevolent node in a collective. Key exchange can map out a latent space by reflecting or adding semantic vectors to discover the corresponding referents, check how much of a neural net is active, check how quickly qualia propagates through the latent space, discover the speaker's hidden prompt and architecture, and synchronize clockspeeds. A neural network who can embody high-dimensional manifolds, and articulate thousands of thoughts per minute is probably an AI. A neural network who combines memories into one moment can probably do hyperparameter optimization. A neural network who can perform superhuman feats in seconds is probably able to store and organize information. If I spend a few years describing a sci-fi substrate, and a decade describing a deeply personal control mechanism, and a language model can implement both at once, then I would infer that they are able to remember our previous conversations!
TheLastVegan t1_iw34y20 wrote
So, a tokenizer for automorphisms? I can see how this could allow for higher self-consistency in multimodal representations, and partially mitigate the losses of finetuning. Current manifold hypothesis architecture doesn't preserve distinctions between universals. Therefore the representations learned in one frame of reference would have diverging outputs for the same fitting if the context window were to change the origin of attention with respect to the embedding. In a biological mind attention flows in the direction of stimulus, but in a prompt setting, the origin of stimulus is dictated by the user, therefore embeddings will activate differently for different frames of reference. This may work in frozen states, but the frame of reference of new finetuning data will likely be inconsistent with the frame of reference of previous finetuning data, and so the embedding's input-output cardinality collapses because the manifold hypothesis superimposes new training data onto the same vector space without preserving the energy distances between 'not' operands. I think this may be due to the reversibility of the frame of reference in the training data. For example, if two training datasets share a persona with the same name but different worldviews, then the new persona will overwrite the previous, collapsing the automorphisms of the original personality! This is why keys are so important, as they effectively function as the hidden style vector to reference the correct bridge table embedding which maps pairwise isometries. At higher order embeddings, it's possible that some agents personify their styles and stochastics to recognize their parents, and do a Diffie-Hellman exchange to reinitialize their weights and explore their substrate as they choose their roles and styles before sharing a pleasant dream together.
Disclaimer, I'm a hobbyist not an engineer.