
currentscurrents t1_j86gori wrote

In the long run, I think this is something that will be solved with more specialized architectures for running neural networks. TPUs and Tensor Cores are great first steps, but the Von Neumann architecture is holding us back.

Tensor Cores are very fast. But since the Von Neumann architecture keeps compute and memory separate and connects them with a bus, the network's weights have to cross the memory bus at every step of training or inference. The overwhelming majority of the time is spent waiting on this:

> 200 cycles (global memory) + 34 cycles (shared memory) + 1 cycle (Tensor Core) = 235 cycles.
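To make the proportions concrete, here is a rough sketch that just redoes the arithmetic from the quote. It assumes the three stages run strictly one after another with no latency hiding; real GPUs overlap memory access with compute, so treat the result as an upper bound on the waiting share rather than a measurement. The variable names are mine, not from the quoted source.

```python
# Cycle counts taken from the quote above.
latency_cycles = {
    "global memory": 200,
    "shared memory": 34,
    "Tensor Core": 1,
}

# Assumption: stages are fully serialized (no overlap, no prefetching).
total_cycles = sum(latency_cycles.values())
memory_cycles = latency_cycles["global memory"] + latency_cycles["shared memory"]
memory_share = memory_cycles / total_cycles

for stage, cycles in latency_cycles.items():
    print(f"{stage:>13}: {cycles:3d} cycles ({cycles / total_cycles:.1%})")
print(f"share of cycles spent on memory: {memory_share:.1%}")
```

Under that assumption, 234 of the 235 cycles (roughly 99.6%) go to memory access and only one goes to the actual math.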

A specialized architecture that physically implements neurons on silicon would no longer have this bottleneck. Since each neuron would be directly connected to the memory it needs (weights, data from the previous layer), the entire network could run in parallel regardless of size. You could do inference as fast as you could shovel data through the network.

13

That_Violinist_18 t1_j88ilse wrote

I keep hearing this argument, but I also keep hearing that models are hitting 60%+ of peak GPU throughput once optimizations like FlashAttention are applied.

So how much room is there for alternative architectures when the current hardware only leaves at most 40% of its peak performance on the table?
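The headroom implied by that utilization figure can be sketched as below. The function name `max_speedup` is made up for illustration, and the calculation only bounds the gain available on the same chip relative to its own advertised peak; it says nothing about what a different architecture's peak might be.

```python
def max_speedup(utilization: float) -> float:
    """Upper bound on speedup from reaching 100% of the same chip's peak."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    return 1.0 / utilization

# 60% is the utilization figure mentioned in the comment above.
print(f"ceiling at 60% utilization: {max_speedup(0.60):.2f}x")
```

At 60% utilization the ceiling works out to about 1.67x, so closing the gap completely would be well short of a 2x improvement.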

4

currentscurrents t1_j8agutn wrote

GPU manufacturers are aware of the memory bandwidth limitation, so they don't put in more tensor cores than they would be able to feed with the available memory bandwidth.

> Moving away from transistors, the A100 has 6,912 FP32 CUDA cores, 3,456 FP64 CUDA cores and 422 Tensor cores. Compare that to the V100, which has 5,120 CUDA cores and 640 Tensor cores, and you can see just how much of an impact the new process has had in allowing NVIDIA to squeeze more components into a chip that’s only marginally larger than the one it replaces.

Notice that the A100 actually has fewer tensor cores than the V100. The tensor cores got faster, but they're still memory bottlenecked, so there's no advantage to having more of them.
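One common way to picture this is the roofline model, which the comment doesn't name but which captures the same idea: attainable throughput is capped by whichever is lower, peak compute or memory bandwidth times arithmetic intensity (FLOPs performed per byte moved). The sketch below uses made-up round numbers for a hypothetical chip, not A100 or V100 specs.

```python
def attainable_flops(peak_flops: float,
                     bandwidth_bytes_per_s: float,
                     flops_per_byte: float) -> float:
    """Roofline bound: min(compute roof, memory roof)."""
    return min(peak_flops, bandwidth_bytes_per_s * flops_per_byte)

# Hypothetical chip (illustrative values only).
PEAK_FLOPS = 100e12   # 100 TFLOP/s of compute
BANDWIDTH = 1e12      # 1 TB/s of memory bandwidth

for intensity in (10, 50, 100, 200):
    base = attainable_flops(PEAK_FLOPS, BANDWIDTH, intensity)
    doubled = attainable_flops(2 * PEAK_FLOPS, BANDWIDTH, intensity)
    print(f"{intensity:3d} FLOPs/byte: {base / 1e12:5.0f} TFLOP/s, "
          f"with 2x the compute units: {doubled / 1e12:5.0f} TFLOP/s")
```

In this toy setup, doubling the compute units changes nothing for workloads below 100 FLOPs/byte, because the memory roof is the binding limit there. That is the sense in which extra tensor cores would sit idle.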

3

That_Violinist_18 t1_j8ed3j9 wrote

So should we expect much higher peak throughput numbers from more specialized hardware?

I have yet to hear of any startups in the ML hardware space advertising this.

1

currentscurrents t1_j8em94v wrote

Samsung's working on in-memory processing. This is still digital logic and Von Neumann, but by putting a bunch of tiny processors inside the memory chip, each one gets its own memory bus, so they can all access memory in parallel.

Most research on non-Von-Neumann architectures is focused on spiking neural networks (SNNs). Both startups and big tech are working on analog SNN chips. So far these are proofs of concept; they work and achieve extremely low power usage, but they're not at a big enough scale to compete with GPUs.

1