I keep hearing this argument, but I also keep hearing that models are hitting 60%+ of peak GPU throughput once optimizations like FlashAttention are factored in.
So how much room is there for alternative architectures when current hardware leaves at most 40% of its peak performance on the table?
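For context, the 60% figure usually refers to model FLOPs utilization (MFU): achieved FLOPs per second divided by the hardware's peak. A minimal sketch of that arithmetic, using the common ~6 × params FLOPs-per-token estimate for transformer training and assumed (not measured) numbers:

```python
# Back-of-envelope model FLOPs utilization (MFU).
# All concrete numbers below are illustrative assumptions, not measurements.

def mfu(tokens_per_sec: float, n_params: float, peak_flops: float) -> float:
    """Approximate MFU using the rough 6 * params FLOPs-per-token rule
    for transformer training."""
    achieved = 6 * n_params * tokens_per_sec  # FLOPs actually performed per second
    return achieved / peak_flops

# Hypothetical example: a 7B-parameter model training at 4,000 tokens/s
# on a GPU with an assumed 312 TFLOPS dense BF16 peak (A100 spec sheet).
util = mfu(tokens_per_sec=4_000, n_params=7e9, peak_flops=312e12)
print(f"MFU = {util:.0%}")  # roughly half of peak in this made-up scenario
```

The point of the exercise: whatever gap remains between achieved MFU and 100% is the ceiling on what a more specialized architecture could recover on the same workload.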
That_Violinist_18 t1_j8ed3j9 wrote
Reply to comment by currentscurrents in The Inference Cost Of Search Disruption – Large Language Model Cost Analysis [D] by norcalnatv
So should we expect much higher peak throughput numbers from more specialized hardware?
I have yet to hear of any startup in the ML hardware space advertising numbers like that.