That_Violinist_18 t1_j88ilse wrote on February 12, 2023 at 1:12 PM

Reply to comment by currentscurrents in The Inference Cost Of Search Disruption – Large Language Model Cost Analysis [D] by norcalnatv

I keep hearing this argument, but I also keep hearing that models are hitting 60%+ of peak throughput for GPUs when optimizations like FlashAttention and other things are considered.

So how much room is there for alternative architectures when the current hardware only leaves at most 40% of its peak performance on the table?

currentscurrents t1_j8agutn wrote on February 12, 2023 at 9:36 PM

GPU manufacturers are aware of the memory bandwidth limitation, so they don't put in more tensor cores than they would be able to feed with the available memory bandwidth.

>Moving away from transistors, the A100 has 6,912 FP32 CUDA cores, 3,456 FP64 CUDA cores and 422 Tensor cores. Compare that to the V100, which has 5,120 CUDA cores and 640 Tensor cores, and you can see just how much of an impact the new process has had in allowing NVIDIA to squeeze more components into a chip that’s only marginally larger than the one it replaces.

Notice that the A100 actually has less tensor cores than the V100. The tensor cores got faster, but they're still memory bottlenecked, so there's no advantage to having more of them.

That_Violinist_18 t1_j8ed3j9 wrote on February 13, 2023 at 6:15 PM

So should we expect much higher peak throughput numbers from more specialized hardware?

I have yet to hear of any startups in the ML hardware space advertising this.

currentscurrents t1_j8em94v wrote on February 13, 2023 at 7:15 PM

Samsung's working on in-memory processing. This is still digital logic and Von Neumann, but by putting a bunch of tiny processors inside the memory chip, each has their own memory bus they can access in parallel.

Most research on non-Von-Neumann architectures is focused on SNNs. Both startups and big tech are working on analog SNN chips. So far these are proof of concept; they work and achieve extremely low power usage, but they're not at a big enough scale to compete with GPUs.