ElectronicCress3132

ElectronicCress3132 t1_j629tix wrote

> implement a gradient descent optimization process at inference time

Could you expand on what this means? At inference time, I thought all the weights were frozen, so how could the attention layers somehow be performing gradient descent?

Edit: I read the paper in detail and understood it (walk through the math in Section 3). Basically, the input sentence X itself produces activations that flow through the attention layer (recall how attention works: it embeds the sentence, then multiplies it by the query, key, and value matrices). If you also feed it some in-context examples X' to learn from, then of course the attention layer produces terms for both X and X'. It turns out the terms coming from X' are mathematically equivalent to taking a gradient descent step, even though the actual model weights stay frozen.
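To make that concrete, here's a minimal numpy sketch of the flavor of equivalence (my own toy construction, not the paper's exact one, and all names here are made up): with softmax-free linear attention and zero initial weights, reading out the in-context examples with the test query gives the same prediction as one explicit gradient descent step of linear regression on those examples.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical in-context dataset: n examples (x_i, y_i) plus one test query x_test.
n, d = 8, 4
X = rng.normal(size=(n, d))          # in-context inputs  (the X' above)
y = X @ rng.normal(size=d)           # in-context targets (a linear task)
x_test = rng.normal(size=d)          # the query the model must predict for

eta = 0.1                            # learning rate of the implicit GD step

# Explicit gradient descent: one step on L(w) = 1/2 * sum_i (w . x_i - y_i)^2,
# starting from w0 = 0, then predict on x_test.
w0 = np.zeros(d)
grad = X.T @ (X @ w0 - y)            # gradient of the squared loss at w0
w1 = w0 - eta * grad
pred_gd = w1 @ x_test

# Linear (softmax-free) attention over the same examples:
# queries/keys are the inputs, values are eta * targets.
# The readout for the test token is sum_i (x_test . x_i) * eta * y_i.
pred_attn = (x_test @ X.T) @ (eta * y)

print(pred_gd, pred_attn)            # identical up to floating point error
assert np.allclose(pred_gd, pred_attn)
```

So nothing in the network is updated; the attention readout over X' just happens to compute the same quantity a gradient step would.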

24

ElectronicCress3132 t1_j29c108 wrote

Btw, one should take care not to implement the worst-case O(n) algorithm (Quickselect with Median-of-Medians pivot selection), because its high constant factors slow it down in the average case. Quickselect with random pivoting, or Introselect (the algorithm typically behind the C++ standard library function mentioned), have good average-case time complexity and rarely hit the worst case. A sketch of the random-pivot variant is below.
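For reference, here's a minimal Python sketch of Quickselect with a random pivot (illustrative only; not how std::nth_element is actually implemented):

```python
import random

def quickselect(a, k):
    """Return the k-th smallest element (0-indexed) of the sequence a.

    Random pivoting gives expected O(n) time; the O(n^2) worst case is
    only hit with vanishingly small probability.
    """
    a = list(a)                      # work on a copy
    lo, hi = 0, len(a) - 1
    while True:
        if lo == hi:
            return a[lo]
        # Pick a random pivot, then Lomuto-partition the range around it.
        p = random.randint(lo, hi)
        a[p], a[hi] = a[hi], a[p]
        pivot = a[hi]
        i = lo
        for j in range(lo, hi):
            if a[j] < pivot:
                a[i], a[j] = a[j], a[i]
                i += 1
        a[i], a[hi] = a[hi], a[i]
        # Recurse (iteratively) only into the side containing index k.
        if k == i:
            return a[i]
        elif k < i:
            hi = i - 1
        else:
            lo = i + 1

print(quickselect([7, 2, 9, 4, 1, 8], 2))  # -> 4 (third smallest)
```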

1