norcalnatv OP t1_j84wfs7 wrote
"Our model is built from the ground up on a per-inference basis, but it lines up with Sam Altman’s tweet and an interview he did recently. We assume that OpenAI used a GPT-3 dense model architecture with a size of 175 billion parameters, hidden dimension of 16k, sequence length of 4k, average tokens per response of 2k, 15 responses per user, 13 million daily active users, FLOPS utilization rates 2x higher than FasterTransformer at <2000ms latency, int8 quantization, 50% hardware utilization rates due to purely idle time, and $1 cost per GPU hour. Please challenge our assumptions"
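To make the quoted assumptions easier to challenge, here is a rough back-of-envelope sketch of the arithmetic in Python. It is not the authors' model. It uses the common approximation of about 2 FLOPs per parameter per token for a dense forward pass and ignores attention cost over the 4k sequence. The peak GPU throughput (`gpu_peak_ops_per_s`) and the achieved compute utilization (`compute_utilization`) are not given in the quote, so both are placeholder assumptions to be replaced with your own numbers.

```python
def daily_inference_cost(
    params=175e9,                # from the quote: 175B dense parameters
    tokens_per_response=2_000,   # from the quote
    responses_per_user=15,       # from the quote
    daily_active_users=13e6,     # from the quote
    hardware_utilization=0.5,    # from the quote: 50% idle time
    cost_per_gpu_hour=1.0,       # from the quote: $1 per GPU hour
    gpu_peak_ops_per_s=624e12,   # ASSUMPTION: peak int8 throughput of one GPU
    compute_utilization=0.3,     # ASSUMPTION: fraction of peak actually achieved
):
    """Rough daily cost estimate; every default is an input, not a finding."""
    tokens_per_day = tokens_per_response * responses_per_user * daily_active_users
    # Approximation: ~2 operations per parameter per token (forward pass only).
    ops_per_day = 2 * params * tokens_per_day
    effective_ops_per_s = gpu_peak_ops_per_s * compute_utilization
    busy_gpu_hours = ops_per_day / effective_ops_per_s / 3600
    # Idle hardware is still paid for, so divide by the utilization rate.
    billed_gpu_hours = busy_gpu_hours / hardware_utilization
    return {
        "tokens_per_day": tokens_per_day,
        "billed_gpu_hours": billed_gpu_hours,
        "cost_per_day_usd": billed_gpu_hours * cost_per_gpu_hour,
    }


if __name__ == "__main__":
    for name, value in daily_inference_cost().items():
        print(f"{name}: {value:,.0f}")
```

The result scales inversely with both utilization figures, so the two placeholder values dominate the answer; that is where the estimate is most open to challenge.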
LetterRip t1_j85b07d wrote
Why not int4? Why not pruning? Why not various model compression tricks? int4 halves latency. At minimum they would do mixed int4/int8.
https://arxiv.org/abs/2206.01861
Why not distillation?
https://transformer.huggingface.co/model/distil-gpt2
NVIDIA, using FasterTransformer and Triton Inference Server, has a 32x speedup over baseline GPT-J.
I think their assumptions are at least an order of magnitude pessimistic.
As someone else notes, the vast majority of queries can be cached. Also there would likely be a Mixture of experts. No need for the heavy duty model when a trivial model can answer the question.
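The "int4 halves latency" point can be illustrated with a small sketch, assuming token generation is limited by how fast the weights can be read from GPU memory. That is a simplification: it ignores compute, activations, the KV cache, and multi-GPU communication. The bandwidth figure is a hypothetical placeholder, not a measured value.

```python
def weight_read_time_s(params, bits_per_weight, mem_bandwidth_bytes_per_s):
    """Time to stream all weights once, i.e. a lower bound per generated token
    when decoding is memory-bandwidth bound (ASSUMPTION)."""
    weight_bytes = params * bits_per_weight / 8
    return weight_bytes / mem_bandwidth_bytes_per_s


PARAMS = 175e9       # model size from the quoted assumptions
BANDWIDTH = 2e12     # ASSUMPTION: hypothetical aggregate bandwidth, bytes/s

int8_time = weight_read_time_s(PARAMS, 8, BANDWIDTH)
int4_time = weight_read_time_s(PARAMS, 4, BANDWIDTH)

if __name__ == "__main__":
    print(f"int8: {int8_time * 1000:.2f} ms per token")
    print(f"int4: {int4_time * 1000:.2f} ms per token")
    print(f"ratio: {int8_time / int4_time:.1f}x")
```

Under this assumption the ratio is 2x regardless of the bandwidth chosen, because only the bytes per weight change. Whether real deployments see the full benefit depends on how much of the latency is actually spent reading weights.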
norcalnatv OP t1_j84wt52 wrote
If the ChatGPT model were ham-fisted into Google’s existing search businesses, the impact would be devastating. There would be a $36 billion reduction in operating income. This is $36 billion of LLM inference costs.
Himalun t1_j8593ax wrote
It’s worth noting that both MS and Google own the data centers and hardware so it is likely cheaper for them to run. But still expensive.
Downchuck t1_j8500e1 wrote
Perhaps the number of unique queries is overstated: through vector similarity search and result caching, the vast majority of lookups would be duplicate searches already materialized. OpenAI has now introduced a "premium" option, which suggests there is a market for premium search and room for more cash inflows. This may change their spend strategy, perhaps spending less on marketing and more on hardware.
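A minimal sketch of the caching idea described above, assuming near-duplicate queries can be detected by cosine similarity between query vectors. The `embed` function here is a toy bag-of-words stand-in for a real embedding model, the linear scan stands in for a real vector index, and `SemanticCache`, `answer`, and the 0.8 threshold are hypothetical names and values chosen for illustration.

```python
import math
from collections import Counter


def embed(text):
    # ASSUMPTION: toy bag-of-words vector; a real system would use a learned embedding.
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    def __init__(self, threshold=0.8):
        self.threshold = threshold  # ASSUMPTION: arbitrary similarity cutoff
        self.entries = []           # list of (query_vector, response) pairs

    def lookup(self, query):
        query_vec = embed(query)
        best_score, best_response = 0.0, None
        for vec, response in self.entries:  # linear scan; an index would replace this
            score = cosine(query_vec, vec)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def store(self, query, response):
        self.entries.append((embed(query), response))


def answer(query, cache, model):
    """Return (response, was_cached); call the expensive model only on a miss."""
    cached = cache.lookup(query)
    if cached is not None:
        return cached, True
    response = model(query)
    cache.store(query, response)
    return response, False
```

How much this saves depends on the real share of near-duplicate queries and on how aggressive the threshold can be before users receive answers to a different question than the one they asked.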