harharveryfunny t1_jdhkn99 wrote

> GPT-4 with image input can interpret any computer screen

Not necessarily - it depends how they've implemented it. If it's just dense object and text detection, then that's all you're going to get.

For the model to be able to actually "see" the image they would need to feed it into the model at the level of neural net representation, not post-detection object description.

For example, if you wanted the model to gauge whether two photos of someone not in its training set are the same person, then it'd need face embeddings to do that (to measure the distance between them). They could special-case all sorts of situations like this in addition to object detection, but you could always find something they missed.
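
To make the face example concrete, here is a minimal sketch of what "gauging distance" between two embeddings could look like. The 4-D vectors, the `same_person` helper and the 0.4 threshold are all made up for illustration; a real face encoder would output far more dimensions, and nothing here reflects how GPT-4 actually handles images.

```python
import numpy as np

def cosine_distance(a, b):
    """Return 1 - cosine similarity between two 1-D vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def same_person(emb_a, emb_b, threshold=0.4):
    """Hypothetical decision rule: small distance means same identity.

    The threshold is an illustrative value, not a tuned one.
    """
    return cosine_distance(emb_a, emb_b) < threshold

# Made-up "face embeddings" standing in for the output of a face encoder.
photo_1 = [0.9, 0.1, 0.3, 0.2]
photo_2 = [0.8, 0.2, 0.35, 0.15]  # points in nearly the same direction as photo_1
photo_3 = [-0.2, 0.9, -0.4, 0.1]  # points somewhere else entirely

print(same_person(photo_1, photo_2))
print(same_person(photo_1, photo_3))
```

The comparison only works because the inputs are points in a continuous space. A list of detected object names gives you nothing to measure a distance on.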

The back-of-a-napkin hand-drawn website sketch demo is promising, but could have been done via object detection.

In the announcement of GPT-4, OpenAI said they're working with another company on the image/vision tech, and gave a link to an assistive vision company... for that type of use maybe dense labelling is enough.

30

TikiTDO t1_jdi8ims wrote

The embeddings are still just a representation of information. They are extremely dense, effectively continuous representations, true, but in theory you could represent that information using other formats. It would just take far more space and require more processing.

Obviously having the visual system provide data that the model can use directly is going to be far more effective, but nothing about dense object detection and description is going to be fundamentally incompatible with any level of detail you could extract into an embedding vector. I'm not saying it would be a smart or effective solution, but it could be done.

In fact, going to another level, LLMs aren't restricted to working with just words. You could train an LLM to receive a serialized embedding as text input, and then train it to interpret those. After all, it's effectively just a list of numbers. I'm not sure why you'd do that if you could just feed it in directly, but maybe it's more convenient to not have to train it on different types of inputs or something.

3

harharveryfunny t1_jdic1s3 wrote

>Obviously having the visual system provide data that the model can use directly is going to be far more effective, but nothing about dense object detection and description is going to be fundamentally incompatible with any level of detail you could extract into an embedding vector. I'm not saying it would be a smart or effective solution, but it could be done.

I can't see how that could work for something like my face example. You could individually detect facial features, subclassified into hundreds of different eye/mouth/hair/etc/etc variants, and still fail to capture the subtle differences that differentiate one individual from another.

4

TikiTDO t1_jdiirji wrote

For a computer, words are just bits of information. If you wanted a system that used text to communicate this info, it would just assign some values to particular words, and you'd probably end up with ultra-long strings of descriptions relating things to each other using god knows what terminology. It probably wouldn't really make sense to you if you were reading it, because it would just be a text-encoded representation of an embedding vector describing finer relations that would only make sense to AIs.

5

harharveryfunny t1_jdj5mom wrote

>it would just be a text-encoded representation of an embedding vector

Once you've decided to input image embeddings into the model, you may as well enter them directly, not converted into text.

In any case, embeddings, whether represented as text or not, are not the same as object recognition labels.

3

TikiTDO t1_jdj6dum wrote

I'm not saying it's a good solution, I'm just saying if you want to hack it together for whatever reason, I see no reason why it couldn't work. It's sort of like the idea of building a computer using the game of life. It's probably not something you'd want to run your code on... But you could.

2

harharveryfunny t1_jdj9if0 wrote

I'm not sure what your point is.

I started by pointing out that there are some use cases (giving face comparison as an example) where you need access to the neural representation of the image (e.g. embeddings), not just object recognition labels.

You seem to want to argue and say that text labels are all you need, but now you've come full circle back to agree with me and say that the model needs that neural representation (embeddings)!

As I said, embeddings are not the same as object labels. An embedding is a point in n-dimensional space. A label is an object name like "cat" or "nose". Encoding an embedding as text (simple enough - just a vector of numbers) doesn't turn it into an object label.
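
A toy sketch of that distinction, under made-up assumptions: the 2-D space, the `PROTOTYPES` table and both helper functions are hypothetical, chosen only to show that a label collapses many different points to one name while a text-encoded vector keeps them apart.

```python
import numpy as np

# Hypothetical class prototypes in a toy 2-D embedding space.
PROTOTYPES = {
    "cat": np.array([1.0, 0.0]),
    "nose": np.array([0.0, 1.0]),
}

def to_label(embedding):
    """Collapse an embedding to the name of its nearest prototype."""
    return min(PROTOTYPES, key=lambda name: np.linalg.norm(embedding - PROTOTYPES[name]))

def to_text(embedding):
    """Serialize an embedding as a comma-separated string of numbers."""
    return ",".join(repr(float(x)) for x in embedding)

a = np.array([0.9, 0.1])
b = np.array([0.7, 0.3])

# Two different points get the same label...
print(to_label(a), to_label(b))
# ...but their text encodings are still two different vectors.
print(to_text(a))
print(to_text(b))
```

Both points come out labelled "cat", so the label alone can't tell them apart. The text form is still just the vector written out, which is why writing an embedding as text doesn't make it a label.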

5

TikiTDO t1_jdjibnv wrote

My point was that you could pass all the information contained in an embedding as a text prompt into a model, rather than using it directly as an input vector, and an LLM could probably figure out how to use it even if the way you chose to deliver those embeddings was doing a numpy.savetxt and then sending the resulting string as a prompt. I also pointed out that you could, if you really wanted to, write a network to convert an embedding to some sort of semantically meaningful word soup that stores the same amount of information. It's basically a pointless bit of trivia which illustrates a fun idea.
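
For what it's worth, the numpy.savetxt part is easy to sketch. This only shows that the numbers survive a round trip through a plain string; the helper names are made up, and whether an LLM could actually learn to use such a prompt is not something this code demonstrates.

```python
import io
import numpy as np

def embedding_to_prompt(embedding):
    """Write a 1-D embedding to a single line of text via numpy.savetxt."""
    buf = io.StringIO()
    # "%.18e" keeps enough digits for float64 values to be recovered exactly.
    np.savetxt(buf, np.atleast_2d(embedding), fmt="%.18e")
    return buf.getvalue()

def prompt_to_embedding(text):
    """Parse the text produced by embedding_to_prompt back into an array."""
    return np.loadtxt(io.StringIO(text), ndmin=1)

rng = np.random.default_rng(0)
embedding = rng.standard_normal(8)  # stand-in for a real image embedding

prompt = embedding_to_prompt(embedding)
restored = prompt_to_embedding(prompt)

print(prompt)
print(np.array_equal(embedding, restored))
```

The string is a lot bulkier than the raw floats, which is the "far more space" cost mentioned earlier in the thread.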

I'm not particularly interested in arguing whatever you think I want to argue. I made a pedantic aside that technically you can represent the same information in different formats, including representing embeddings as text, and that a transformer-based architecture would be able to find patterns in it all the same. I don't see anything to argue here, it's just a "you could also do it this way, isn't that neat." It's sort of the nature of a public forum; you made a post that made me think something, so I hit reply and wrote down my thoughts, nothing more.

2