That_Violinist_18 t1_j88ilse wrote
Reply to comment by currentscurrents in The Inference Cost Of Search Disruption – Large Language Model Cost Analysis [D] by norcalnatv
I keep hearing this argument, but I also keep hearing that models are reaching 60%+ of peak GPU throughput once optimizations like FlashAttention are applied.
So how much room is there for alternative architectures when the current hardware only leaves at most 40% of its peak performance on the table?
currentscurrents t1_j8agutn wrote
GPU manufacturers are aware of the memory bandwidth limitation, so they don't put in more tensor cores than they would be able to feed with the available memory bandwidth.
Notice that the A100 actually has fewer tensor cores than the V100. The tensor cores got faster, but they're still memory bottlenecked, so there's no advantage to having more of them.
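To put rough numbers on the memory-bottleneck point, here is a minimal roofline-style sketch. The spec figures are approximate public datasheet values (dense FP16 tensor throughput and HBM bandwidth) and should be treated as assumptions, as should the helper names `attainable_tflops` and `ridge_point` and the example intensity of 50 FLOP/byte, which are purely illustrative.

```python
def attainable_tflops(peak_tflops, bandwidth_gbs, flops_per_byte):
    """Roofline model: throughput is capped by compute or by memory traffic."""
    # GB/s * FLOP/byte = GFLOP/s; divide by 1000 to get TFLOP/s.
    memory_bound_tflops = bandwidth_gbs * flops_per_byte / 1000.0
    return min(peak_tflops, memory_bound_tflops)


def ridge_point(peak_tflops, bandwidth_gbs):
    """Arithmetic intensity (FLOP/byte) a kernel needs to become compute-bound."""
    return peak_tflops * 1000.0 / bandwidth_gbs


# Approximate figures, assumed for illustration only.
GPUS = {
    "V100": {"peak_tflops": 125.0, "bandwidth_gbs": 900.0},
    "A100": {"peak_tflops": 312.0, "bandwidth_gbs": 1555.0},
}

for name, spec in GPUS.items():
    ridge = ridge_point(**spec)
    # Hypothetical kernel doing 50 FLOPs per byte moved from memory.
    achieved = attainable_tflops(**spec, flops_per_byte=50.0)
    print(f"{name}: ridge = {ridge:.0f} FLOP/byte, "
          f"attainable at 50 FLOP/byte = {achieved:.1f} TFLOP/s")
```

Under these assumed figures, a kernel below the ridge point is limited by bandwidth rather than by the number of tensor cores, which is the sense in which adding more cores would not help.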
That_Violinist_18 t1_j8ed3j9 wrote
So should we expect much higher peak throughput numbers from more specialized hardware?
I have yet to hear of any startups in the ML hardware space advertising this.
currentscurrents t1_j8em94v wrote
Samsung's working on in-memory processing. This is still digital logic and von Neumann, but by putting a bunch of tiny processors inside the memory chip, each has its own memory bus it can access in parallel.
Most research on non-von-Neumann architectures is focused on spiking neural networks (SNNs). Both startups and big tech are working on analog SNN chips. So far these are proofs of concept; they work and achieve extremely low power usage, but they're not at a big enough scale to compete with GPUs.