ElectronicCress3132
ElectronicCress3132 t1_j45f3yo wrote
Reply to comment by VirtualHat in [D] Has ML become synonymous with AI? by Valachio
> And integrating good-old-fashioned-ai (GOFAI) with more modern ML is becoming an area of increasing research interest.
Any papers you recommend on this topic?
ElectronicCress3132 t1_j2v4vy4 wrote
Reply to comment by learn-deeply in [R] Massive Language Models Can Be Accurately Pruned in One-Shot by starstruckmon
Could you elaborate what you mean by undertrained?
ElectronicCress3132 t1_j29c108 wrote
Reply to comment by RingoCatKeeper in [P]Run CLIP on your iPhone to Search Photos offline. by RingoCatKeeper
Btw, one should take care not to implement the worst-case O(n) algorithm (QuickSelect with median-of-medians pivot selection), because its high constant factors slow it down in the average case. QuickSelect with random pivots, or Introselect (what the C++ standard library function mentioned uses), have good average-case time complexity and rarely hit the worst case.
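A minimal sketch of QuickSelect with random pivots (my own illustration, not code from the linked project) — expected O(n), and the random pivot makes the O(n²) worst case vanishingly unlikely:

```python
import random

def quickselect(xs, k):
    """Return the k-th smallest element (0-indexed) of xs.

    Random pivots give expected O(n) time. The median-of-medians
    pivot rule would guarantee O(n) worst case, but its constant
    factors make it slower on average.
    """
    pivot = random.choice(xs)
    lt = [x for x in xs if x < pivot]   # strictly less than pivot
    eq = [x for x in xs if x == pivot]  # equal to pivot
    gt = [x for x in xs if x > pivot]   # strictly greater
    if k < len(lt):
        return quickselect(lt, k)
    if k < len(lt) + len(eq):
        return pivot
    return quickselect(gt, k - len(lt) - len(eq))
```

This list-based version copies on each level for clarity; an in-place partition (as in `std::nth_element`) avoids the extra allocations.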
ElectronicCress3132 t1_j29byah wrote
Reply to comment by Steve132 in [P]Run CLIP on your iPhone to Search Photos offline. by RingoCatKeeper
I think the one in the standard library is introselect, which is a hybrid of QuickSelect and a guaranteed-worst-case fallback.
ElectronicCress3132 t1_ir49w2b wrote
Curious - what are the primary differences between this, and the "information correction system" in LaMDA? https://arxiv.org/pdf/2201.08239.pdf figure 3
ElectronicCress3132 t1_j629tix wrote
Reply to comment by currentscurrents in [R] Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent as Meta-Optimizers by currentscurrents
> implement a gradient descent optimization process at inference time
Could you expand on what this means? At inference time, I thought all weights were frozen, so how could the attention layers be somehow performing gradient descent?
Edit: I read the paper in detail and understood it (the math is walked through in Section 3). Basically, the sentence X itself produces activations that flow through the attention layer (recall how attention works: the sentence is embedded, then multiplied by the key, query, and value matrices). If you also give it some demonstration examples X' to learn from, then of course the attention output has contributions from both X and X'. It turns out the contribution from X' is equivalent to applying a weight update of the same form a gradient-descent step would produce.
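The duality can be shown in a few lines of NumPy, loosely following the linear-attention simplification the paper uses in Section 3. Everything here (W_0, W_K, W_V, the shapes) is an assumption of this toy sketch, not the paper's actual model: the attention contribution from the demonstrations X' equals a sum of outer products added onto the frozen weights, which is exactly the shape of a gradient-descent update on a linear layer.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4

# Frozen, stand-in projection matrices for one linear-attention head
W_K = rng.normal(size=(d, d))
W_V = rng.normal(size=(d, d))
W_0 = rng.normal(size=(d, d))   # "zero-shot" path, from the query alone

X_demo = rng.normal(size=(d, 3))  # 3 demonstration tokens X' (columns)
q = rng.normal(size=(d,))         # the query token

# Linear attention over the demonstrations:
#   attn(q) = W_0 q + (W_V X') (W_K X')^T q
attn_out = W_0 @ q + (W_V @ X_demo) @ (W_K @ X_demo).T @ q

# Same computation, rewritten as a weight update Delta W on W_0:
# a sum of outer products, the form of a gradient-descent step
# (error signal ⊗ input) on a linear layer.
delta_W = sum(np.outer(W_V @ x, W_K @ x) for x in X_demo.T)
gd_out = (W_0 + delta_W) @ q

assert np.allclose(attn_out, gd_out)  # identical outputs
```

So the frozen weights never change; the demonstrations' key/value products act as an *implicit* update applied at inference time, which is the sense in which the paper says attention "performs gradient descent".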