
mildresponse t1_j4xjmvw wrote

My interpretation is that a word should end up with a different effective embedding when it appears at different positions (i.e. in different contexts) in the input. Without a positional embedding, the learned word embedding would be forced into some kind of average over all the positions the word appears in. The positional offsets give the model more flexibility to treat the same token differently in different contexts.

Because the embeddings are high-dimensional vectors of floats, I'd guess the risk of degeneracy (i.e. that two embeddings end up overlapping with one another) is virtually zero.
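
As a minimal sketch of what I mean (illustrative PyTorch, not any particular model's code), the learned positional embedding is simply added to the shared word embedding, so the same token id yields a different combined vector at each position:

```python
import torch
import torch.nn as nn

# Sizes are arbitrary, just for illustration.
vocab_size, max_len, d_model = 100, 16, 8

tok_emb = nn.Embedding(vocab_size, d_model)   # shared word embeddings
pos_emb = nn.Embedding(max_len, d_model)      # learned positional embeddings

# The same token id (7) appears at two different positions...
token_ids = torch.tensor([[7, 3, 7]])                      # (batch=1, seq_len=3)
positions = torch.arange(token_ids.size(1)).unsqueeze(0)   # (1, 3)

# ...and gets two different combined vectors, because a position-dependent
# offset is added to the same underlying word embedding.
x = tok_emb(token_ids) + pos_emb(positions)   # (1, 3, d_model)
print(torch.allclose(x[0, 0], x[0, 2]))       # False (with overwhelming probability)
```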

1

mildresponse t1_j4xhvkg wrote

Are there any straightforward methods for moving ML models between frameworks, or does it come down to manually translating the parameters?

For instance, I am looking at a transformer model in PyTorch, whose parameters are stored within a series of nested objects of various types in an OrderedDict. I would like to extract all of these parameter tensors for use in a similar architecture constructed in TensorFlow or JAX. The naive method of manually collecting the parameters into a new dict seems tedious. And if the target is something like Haiku in JAX, the corresponding model will initialize its parameters into a new nested dict with some default naming structure, which will then have to be matched up with the interim dict created from PyTorch. Are there better ways of moving the parameters or models around?
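
To make the tedium concrete, here is a minimal sketch of the manual route, with a toy single-layer "model" standing in for the real transformer (the layer, names, and shapes are placeholders, not the actual model): dump the PyTorch state_dict to numpy, let Haiku initialize its own parameter tree to see the naming it expects, and then rebuild that tree with the converted tensors.

```python
import numpy as np
import torch
import torch.nn as nn
import jax
import jax.numpy as jnp
import haiku as hk

# Toy stand-in for the real model: a single linear layer in both frameworks,
# just to show the general recipe (state_dict -> numpy -> target param tree).
torch_model = nn.Linear(4, 2)

# 1. Flatten the PyTorch parameters to plain numpy arrays keyed by name.
np_params = {k: v.detach().cpu().numpy() for k, v in torch_model.state_dict().items()}

# 2. Let Haiku build its own (randomly initialized) parameter tree, which
#    reveals the nesting and default names it expects.
def forward(x):
    return hk.Linear(2)(x)

model = hk.transform(forward)
_ = model.init(jax.random.PRNGKey(0), jnp.zeros((1, 4)))

# 3. Rebuild the Haiku tree with the converted PyTorch tensors.  Note the
#    transpose: torch.nn.Linear stores (out, in), hk.Linear stores (in, out).
hk_params = {
    "linear": {
        "w": jnp.asarray(np_params["weight"].T),
        "b": jnp.asarray(np_params["bias"]),
    }
}

# Sanity check: both frameworks now compute the same function.
x = np.random.randn(1, 4).astype(np.float32)
out_torch = torch_model(torch.from_numpy(x)).detach().numpy()
out_jax = np.asarray(model.apply(hk_params, None, jnp.asarray(x)))
print(np.allclose(out_torch, out_jax, atol=1e-5))
```

The painful part for a full transformer is step 3: every PyTorch key has to be mapped onto the target framework's naming scheme (and sometimes transposed or reshaped) by hand.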

1

mildresponse t1_j46sh8k wrote

Why do some tokenizers assign a negative float to each token? For instance, I am looking at this json file, and the tokens start about a third of the way down the page. Each one is part of a two-element list with the structure "[<token>, negative decimal number with 15 digits of precision]".
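
A minimal sketch of loading entries like that, assuming a Hugging Face-style tokenizer.json where the vocab is stored under model.vocab as [token, score] pairs (the filename and the exact layout are assumptions):

```python
import json

# Inspect the first few [token, score] pairs in a tokenizer vocab file.
# "tokenizer.json" and the model.vocab layout are assumed, not taken from
# the file linked above.
with open("tokenizer.json", encoding="utf-8") as f:
    data = json.load(f)

vocab = data["model"]["vocab"]      # list of [token, score] pairs
for token, score in vocab[:5]:
    print(repr(token), score)
```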

1