
TikiTDO t1_jdi8ims wrote

The embeddings are still just a representation of information. They are extremely dense, effectively continuous representations, true, but in theory you could represent that information using other formats. It would just take far more space and require more processing.

Obviously having the visual system provide data that the model can use directly is going to be far more effective, but nothing about dense object detection and description is going to be fundamentally incompatible with any level of detail you could extract into an embedding vector. I'm not saying it would be a smart or effective solution, but it could be done.

In fact, to take it another level, LLMs aren't restricted to working with just words. You could train an LLM to receive a serialized embedding as text input, and then train it to interpret that representation. After all, it's effectively just a list of numbers. I'm not sure why you'd do that if you could just feed it in directly, but maybe it's more convenient to not have to train it on different types of inputs or something.
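Rough sketch of what I mean (llm_complete here is just a made-up stand-in for whatever text-only completion API you'd actually call):

```python
import numpy as np

# Pretend this is an image embedding produced by some vision encoder.
embedding = np.random.rand(512).astype(np.float32)

# Serialize it to plain text so it can ride along inside an ordinary prompt.
embedding_text = " ".join(f"{x:.4f}" for x in embedding)

prompt = (
    "The following numbers are an image embedding. "
    "Describe the likely contents of the image.\n"
    f"EMBEDDING: {embedding_text}"
)

# llm_complete() is a hypothetical stand-in for a text-only LLM API.
# response = llm_complete(prompt)
```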

3

harharveryfunny t1_jdic1s3 wrote

>Obviously having the visual system provide data that the model can use directly is going to be far more effective, but nothing about dense object detection and description is going to be fundamentally incompatible with any level of detail you could extract into an embedding vector. I'm not saying it would be a smart or effective solution, but it could be done.

I can't see how that could work for something like my face example. You could individually detect facial features, subclassified into hundreds of different eye/mouth/hair/etc/etc variants, and still fail to capture the subtle differences that differentiate one individual from another.

4

TikiTDO t1_jdiirji wrote

For a computer, words are just bits of information. If you wanted a system that used text to communicate this info, it would just assign values to particular words, and you'd probably end up with ultra-long strings of descriptions relating things to each other using god knows what terminology. It probably wouldn't make much sense to you if you were reading it, because it would just be a text-encoded representation of an embedding vector describing finer relations that would only make sense to AIs.
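Purely as a toy illustration (the bucketing scheme below is made up on the spot; a real encoding would presumably be learned), the word soup could be as dumb as one token per dimension:

```python
import numpy as np

# Made-up "word soup" encoding: bucket each embedding dimension into one of a
# few tokens. Assumes values roughly in [0, 1); a real scheme would be learned.
TOKENS = ["verylow", "low", "mid", "high", "veryhigh"]

def embedding_to_words(embedding: np.ndarray) -> str:
    buckets = np.clip(embedding, 0.0, 0.999)
    idx = (buckets * len(TOKENS)).astype(int)
    return " ".join(f"dim{i}_{TOKENS[b]}" for i, b in enumerate(idx))

print(embedding_to_words(np.random.rand(8)))
# e.g. "dim0_mid dim1_veryhigh dim2_low ..."
```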

5

harharveryfunny t1_jdj5mom wrote

>it would just be a text-encoded representation of an embedding vector

Once you've decided to input image embeddings into the model, you may as well feed them in directly rather than converting them to text.

In any case, embeddings, whether represented as text or not, are not the same as object recognition labels.

3

TikiTDO t1_jdj6dum wrote

I'm not saying it's a good solution; I'm just saying that if you wanted to hack it together for whatever reason, I see no reason why it couldn't work. It's sort of like the idea of building a computer inside the Game of Life. It's probably not something you'd want to run your code on... but you could.

2

harharveryfunny t1_jdj9if0 wrote

I'm not sure what your point is.

I started by pointing out that there are some use cases (giving face comparison as an example) where you need access to the neural representation of the image (e.g. embeddings), not just object recognition labels.

You seem to want to argue and say that text labels are all you need, but now you've come full circle back to agree with me and say that the model needs that neural representation (embeddings)!

As I said, embeddings are not the same as object labels. An embedding is a point in n-dimensional space. A label is an object name like "cat" or "nose". Encoding an embedding as text (simple enough - just a vector of numbers) doesn't turn it into an object label.
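To make the distinction concrete (the random vectors here are just standing in for real face embeddings):

```python
import numpy as np

# Two different people can end up with the exact same object-level labels...
labels_a = {"face", "brown eyes", "short hair", "nose"}
labels_b = {"face", "brown eyes", "short hair", "nose"}
print(labels_a == labels_b)  # True -- the labels can't tell them apart

# ...while their embeddings, as points in n-dimensional space, still differ.
emb_a = np.random.rand(128)
emb_b = np.random.rand(128)
cosine = np.dot(emb_a, emb_b) / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b))
print(cosine)  # a similarity score a face-recognition system could threshold
```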

5

TikiTDO t1_jdjibnv wrote

My point was that you could pass all the information contained in an embedding into a model as a text prompt, rather than using it directly as an input vector, and an LLM could probably figure out how to use it even if the way you chose to deliver those embeddings was running numpy.savetxt and sending the resulting string in as a prompt. I also pointed out that, if you really wanted, you could write a network to convert an embedding into some sort of semantically meaningful word soup that stores the same amount of information. It's basically a pointless bit of trivia that illustrates a fun idea.
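Literally something like this; nothing is lost in the round trip, and that string in the middle is what you'd paste into the prompt:

```python
import io
import numpy as np

embedding = np.random.rand(768)

# Dump the embedding to text exactly as described: numpy.savetxt into a string.
buf = io.StringIO()
np.savetxt(buf, embedding)
embedding_as_text = buf.getvalue()  # this is what would go into the prompt

# Nothing is lost: loadtxt recovers the same numbers (up to printed precision).
recovered = np.loadtxt(io.StringIO(embedding_as_text))
print(np.allclose(embedding, recovered))  # True
```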

I'm not particularly interested in arguing whatever you think I want to argue. I made a pedantic aside that technically you can represent the same information in different formats, including representing an embedding as text, and that a transformer-based architecture would be able to find patterns in it all the same. I don't see anything to argue about here; it's just a "you could also do it this way, isn't that neat." It's sort of the nature of a public forum: you made a post that made me think something, so I hit reply and wrote down my thoughts, nothing more.

2