
trnka t1_j3g5uer wrote

You're right that it's just a matrix multiply of a one-hot encoding, though representing it as an embedding layer (a lookup into the weight matrix) is faster.

I wouldn't call it a fully-connected layer, though. In a fully-connected layer, the input to the matrix multiply is the output of everything in the previous layer, not just the output of a single unit. And the weights that multiply the outputs of one unit are not the same weights that multiply the outputs of any other unit.

It's more like a length-1 convolution that projects the one-hot vocabulary down to the embedding space.
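Quick sketch of both equivalences in PyTorch (the sizes and token ids here are made up, just for illustration):

```python
import torch
import torch.nn.functional as F

V, E = 10, 4                            # vocab size, embedding dim (arbitrary)
emb = torch.nn.Embedding(V, E)

tokens = torch.tensor([7, 2, 7])                      # a short sequence of word ids
one_hot = F.one_hot(tokens, num_classes=V).float()    # (seq_len, V)

lookup = emb(tokens)                                  # (seq_len, E), the fast path
matmul = one_hot @ emb.weight                         # same numbers via the matrix multiply
print(torch.allclose(lookup, matmul))                 # True

# The "length-1 convolution" view: V input channels -> E output channels, kernel size 1
conv = torch.nn.Conv1d(V, E, kernel_size=1, bias=False)
with torch.no_grad():
    conv.weight.copy_(emb.weight.T.unsqueeze(-1))     # conv weight shape (E, V, 1)
conv_out = conv(one_hot.T.unsqueeze(0)).squeeze(0).T  # back to (seq_len, E)
print(torch.allclose(conv_out, lookup))               # True
```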

3

throwaway2676 t1_j3h780s wrote

> In a fully-connected layer, the input to the matrix multiply is the output of everything in the previous layer, not just the output of a single unit.

But if the previous layer is 0 everywhere except for one unit, the result is the same, no?

My mental picture is that input layer 0 has V = <token vocabulary size> neurons, and layer 1 has E_d = <embedding dimension> neurons. Layer 0 is 1 at exactly one neuron and 0 everywhere else, as one-hot encoding normally goes. The embedding layer 1 is then given by x@W, where x is layer 0 as a row vector and W is the weight matrix with dimensions V x E_d. The matrix multiplication then "picks out" the desired row. That would be a fully-connected linear layer with no bias.
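Something like this, in PyTorch (V, E_d, and the token index are arbitrary, just to show the picture):

```python
import torch

V, E_d = 10, 4
W = torch.randn(V, E_d)                # weight matrix, V x E_d

x = torch.zeros(1, V)                  # layer 0 as a row vector
x[0, 3] = 1.0                          # one-hot: token id 3

out = x @ W                            # (1, E_d): "picks out" row 3 of W
print(torch.allclose(out, W[3]))       # True

# Same computation as a fully-connected linear layer with no bias
linear = torch.nn.Linear(V, E_d, bias=False)
with torch.no_grad():
    linear.weight.copy_(W.T)           # nn.Linear stores its weight as (out_features, in_features)
print(torch.allclose(linear(x), out))  # True
```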

1

trnka t1_j3i3vk4 wrote

If your input is only ever a single word, that's right.

Usually, though, people work with texts, i.e. sequences of words. The embedding layer maps the sequence of words to a sequence of embedding vectors. It could equivalently be implemented as a sequence of one-hot encodings, each multiplied by the same W.
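For example, a rough PyTorch sketch (sizes and word ids invented):

```python
import torch
import torch.nn.functional as F

V, E = 10, 4
emb = torch.nn.Embedding(V, E)

# A batch of 2 texts, each a sequence of 3 word ids
batch = torch.tensor([[2, 7, 0],
                      [5, 5, 9]])
seq_embeddings = emb(batch)                           # shape (2, 3, E)

# Equivalent: one-hot encode every position and multiply each by the same W
one_hots = F.one_hot(batch, num_classes=V).float()    # shape (2, 3, V)
print(torch.allclose(one_hots @ emb.weight, seq_embeddings))  # True
```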

2