firejak308 t1_iqvqnii wrote
Reply to comment by AristocraticOctopus in [D] - Why do Attention layers work so well? Don't weights in DNNs already tell the network how much weight/attention to give to a specific input? (High weight = lots of attention, low weight = little attention) by 029187
Thanks for this explanation! I've heard the general reasoning that "transformers have variable weights" before, but I didn't quite understand the significance of that until you provided the concrete example of relationships between x1 and x3 in one input, versus x1 and x2 in another input.
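
For anyone else who found that example helpful, here's a minimal numpy sketch (my own illustration, not from the parent comment) of what "variable weights" means: the learned projections W_q and W_k are fixed after training, but the attention weights they induce are recomputed from every input, so x1 can attend mostly to x3 in one sequence and mostly to x2 in another.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4  # toy embedding dimension

# Fixed, learned projections (hypothetical values) -- these do NOT change per input.
W_q = rng.normal(size=(d, d))
W_k = rng.normal(size=(d, d))

def attention_weights(X):
    """Row i gives how much token i attends to each token of X."""
    Q, K = X @ W_q, X @ W_k
    scores = Q @ K.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)           # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)   # row-wise softmax

# Two different 3-token inputs (x1, x2, x3), run through the SAME W_q / W_k.
X_a = rng.normal(size=(3, d))
X_b = rng.normal(size=(3, d))

print(attention_weights(X_a)[0])  # how x1 attends to (x1, x2, x3) in input A
print(attention_weights(X_b)[0])  # ...and in input B -- a different mixture
```

A plain dense layer, by contrast, applies the same mixing weights to every input; here the mixing coefficients themselves are a function of the data.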