Submitted by AutoModerator t3_100mjlp in MachineLearning
throwaway2676 t1_j39vamk wrote
Is an embedding layer (or at least a simple/standard one) the same thing as a fully connected layer from one-hot encoded tokens to a hidden layer of length <embedding dimension>? The token embeddings would be the weight matrix, but with the biases set to 0.
trnka t1_j3g5uer wrote
You're right that it's just a matrix multiply of a one-hot encoding. Implementing it as an embedding lookup is just faster, though.
I wouldn't call it a fully-connected layer, though. In a fully-connected layer, the input to the matrix multiply is the output of every unit in the previous layer, not just the output of a single unit, and the weights applied to one unit's output are different from the weights applied to any other unit's output.
It's more like a length-1 convolution that projects the one-hot vocabulary down to the embedding space.
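A minimal sketch of both views (PyTorch assumed; the sizes here are made up for illustration):

```python
import torch
import torch.nn.functional as F

V, E_d = 10, 4                                # vocab size, embedding dimension (toy values)
emb = torch.nn.Embedding(V, E_d)
token_id = torch.tensor([3])

# Embedding lookup: just indexes a row of the weight matrix.
via_lookup = emb(token_id)                    # shape (1, E_d)

# Same result as multiplying a one-hot vector by the weight matrix.
one_hot = F.one_hot(token_id, num_classes=V).float()
via_matmul = one_hot @ emb.weight             # (1, V) @ (V, E_d) -> (1, E_d)

# The length-1 convolution view: channels = vocab, kernel size 1, same weights.
conv = torch.nn.Conv1d(V, E_d, kernel_size=1, bias=False)
with torch.no_grad():
    conv.weight.copy_(emb.weight.t().unsqueeze(-1))    # weight shape (E_d, V, 1)
via_conv = conv(one_hot.t().unsqueeze(0)).squeeze(-1)  # input (1, V, 1) -> output (1, E_d)

print(torch.allclose(via_lookup, via_matmul))  # True
print(torch.allclose(via_lookup, via_conv))    # True
```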
throwaway2676 t1_j3h780s wrote
> In a fully-connected layer, the input to the matrix multiply is the output of everything in the previous layer, not just the output of a single unit.
But if the previous layer is 0 everywhere except for one unit, the result is the same, no?
My mental picture is that input layer 0 has V = <token vocabulary size> neurons, and layer 1 has E_d = <embedding dimension> neurons. Layer 0 is 1 at exactly one neuron and 0 everywhere else, as one-hot encoding normally goes. The embedding layer 1 is then given by x@W, where x is layer 0 as a row vector and W is the weight matrix with dimensions V x E_d. The matrix multiplication then "picks out" the desired row. That would be a fully connected linear layer with no bias.
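A tiny NumPy sketch of that picture (the numbers are arbitrary):

```python
import numpy as np

V, E_d = 5, 3                                        # vocab size, embedding dimension
W = np.arange(V * E_d, dtype=float).reshape(V, E_d)  # weight matrix, V x E_d

token_id = 2
x = np.zeros(V)
x[token_id] = 1.0                                    # one-hot row vector

print(x @ W)           # [6. 7. 8.] -- the matmul "picks out" row 2 of W
print(W[token_id])     # [6. 7. 8.] -- same thing as a direct row lookup
```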
trnka t1_j3i3vk4 wrote
If your input is only ever a single word, that's right.
Usually people work with texts, or sequences of words. The embedding layer maps the sequence of words to a sequence of embedding vectors. It could be implemented as a sequence of one-hot encodings multiplied by the same W though.
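A minimal sketch of the sequence case (PyTorch assumed, toy sizes):

```python
import torch
import torch.nn.functional as F

V, E_d = 10, 4
emb = torch.nn.Embedding(V, E_d)

ids = torch.tensor([[1, 4, 4, 7]])                 # (batch=1, seq_len=4) token ids
via_lookup = emb(ids)                              # (1, 4, E_d)

one_hots = F.one_hot(ids, num_classes=V).float()   # (1, 4, V)
via_matmul = one_hots @ emb.weight                 # same W applied at every position

print(torch.allclose(via_lookup, via_matmul))      # True
```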