Screye

Screye t1_jcmpd5i wrote

This is more derived from extensive personal experience with prompt engineering / fine tuning over the last 2 years.

Simply put:

  • The model learns what it sees. Put differently: throw enough data of a certain type at it, and emergent properties relating to that data will show up, given enough data & compute.
  • If it has never seen data past 8k tokens in the past (due to context window limitations), the model won't need to learn to reason over more than 8k tokens.
  • The source data (humans) have limitations on the complexity of thoughts that can be captured within 8k tokens vs 32k tokens
  • That's not to say the model doesn't reason over longer windows using latent knowledge, which makes its implicit 'reasoning window' much larger than just 8k tokens. But that is fundamentally different from explicitly reasoning over a 32k window.
  • The model today can only assemble a chain-of-thought prompt of 8k tokens. If there is never any human feedback or loss-landscape-optimization for when it fails to reason past 8k tokens, then any ability the model gains there will be purely incidental.
  • On the other hand, when chain-of-thought prompt chains are 32k tokens long, we can naturally expect them to contain more axioms, postulates and relationships between those postulates/axioms.
  • Those completions will get evaluated against human feedback & self-supervised scenarios, which should explicitly optimize the loss landscape for reasoning over far more complex logical statements.

Idk if that makes sense. Our field keeps moving away from math, and as embarrassing as it is to anthropomorphize the model, doing so does make the point easier to get across.

2

Screye t1_jcl549n wrote

Context length is also a hard limit on how many logical-hops the model can make.

If each back-n-forth takes 500-ish tokens, then the model can only reason over 16 hops within 8k tokens. With 32k tokens, it can reason over 64 hops. This might allow for emergent behavior on tasks previously deemed impossible because they need at least a minimum number of hops to reason about.
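That arithmetic can be sketched in a couple of lines; the 500-token cost per hop is an illustrative assumption, not a measured constant:

```python
# Rough hop-budget estimate: how many ~500-token reasoning "hops"
# fit inside a given context window. Integer division, since a
# partial hop doesn't complete a reasoning step.
def max_hops(context_tokens: int, tokens_per_hop: int = 500) -> int:
    return context_tokens // tokens_per_hop

print(max_hops(8_000))   # 16 hops in an 8k window
print(max_hops(32_000))  # 64 hops in a 32k window
```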

For what it's worth, I think memory retrieval will work just fine for 90% of scenarios and will stay relevant even for 32k tokens. Esp. if the wiki you are retrieving from is millions of lines.

3

Screye t1_j6tu8mc wrote

> in ten years?

10 years ago was 2012. Deep learning barely even existed as a field back then.

Tempting as it might be, I'd recommend caution in predicting the future of a field that went from non-existence to near-dominance within its profession in the last 10 years.

49

Screye t1_izauw5w wrote

He is the UIUC of deep learning's Mount Rushmore.

Just as people think of Stanford, MIT, CMU and Berkeley as the big CS universities and forget that UIUC is almost just as good... people reel off the names of Hinton, LeCun and Bengio and forget that Schmidhuber (and his lab) did a lot of important foundational work in deep learning.

Sadly, he is a curmudgeon who complains a lot and claims even more than he has actually achieved... so people have kind of soured on him lately.

19

Screye t1_ivzs8xs wrote

AFAIK, there aren't a lot of Series A or seed rounds happening.

But pre-established startups like Jasper are getting funded because premier investors have already invested a ton into them. In for a penny, in for a pound.

3

Screye t1_ivzcjb3 wrote

Big companies are firing the more non-essential members of the team, and research & unreliable money-makers get cut first.

So it makes sense that SWEs don't get fired: they maintain the systems. On the other hand, a lot of AI products aren't making a shit ton of money just yet, the research costs are very high, and AI scientists don't usually do the job of maintaining an AI service. So they get cut with higher priority than SWEs.

Now, in a downturn, cost cutting takes major priority.

  1. AI tools allow expensive humans to be replaced with cheaper algorithms
  2. 3rd party startups can sell their AI toolkit for lower prices than Azure AI / Google AI
  3. If you didn't expect the startup to make money for 3-5 years anyway, then the market conditions don't really matter that much
  4. All other startup industries are in the dumpster. Gig economy startups burn too much money. End users stop using convenience-based startups in times of high inflation. And don't even get me started on crypto. So really, health-tech and ML are the only 2 startup sectors where it still makes some sense to invest.

Those 4 things have made it a rather decent time to be in an ML startup, but a not-so-great time to be in ML at a bigtech company.

22