Submitted by floppy_llama t3_1266d02 in MachineLearning
Comments
EquipmentStandard892 t1_je7xyd9 wrote
I read your paper and it got me thinking about something interesting. I wonder whether this method could be used to fine-tune the model to query a vector database, without running up against its context length limit. It may sound stupid, but humans don't just say things. I'm not talking about CoT specifically, but I was curious whether, as our brains do, we could use another instance of the same LLM to generate small hypotheses about the ongoing conversation, store those in a vector database, and then use those generated hypotheses during reasoning. We humans also have a limited cognitive memory, so how do we overcome this? Great paper btw.
ghostfaceschiller t1_je8habj wrote
Could you elaborate on what you mean here? I'm not sure I'm following.
_Arsenie_Boca_ t1_je8km8c wrote
Very interesting work! Though I find the explanation of the concrete approach (how the additional parameters are used by the LM) to be a bit vague. Does anyone have a deeper understanding? Is it using regular adapters?
hailfire27 t1_je8l7id wrote
I think he's talking about how during conversations, there are different cognitive levels to a conversation. You are basically having a conversation with yourself about what to say and remembering things to talk about, while at the same time considering the context of the situation, such as the environment or activity.
So he's asking whether a model like this could be tuned so that it is able to give better answers in a conversation.
-_1_2_3_- t1_je8p5xx wrote
They are using gpt-4 to accelerate their work
DigThatData t1_je8pm87 wrote
I've decided to just lean into it and am literally just giving my ideas away. https://github.com/dmarx/bench-warmers
idontcareaboutthenam t1_je8rz5a wrote
Can you elaborate?
CasulaScience t1_je8tqrr wrote
In terms of practical application, is there any reason why someone would use this over Low Rank Adaptation?
Deep-Station-1746 t1_je8u12c wrote
This is interesting - compared to LoRA, it allows LLaMA to also accept images as inputs. And I believe it is orthogonal to LoRA, meaning they can possibly be used together. I'm unsure about the training stability though. I know that LoRA training allows ridiculously high learning rates (1e-5 for Text encoder), especially for dreambooth. Using LoRA for the frozen weights + the LLaMA adapter is an interesting thing to explore.
Edit: spelling
ahm_rimer t1_je8u2bi wrote
LoRA + PEFT + Zero-init attention adapter = 🤯
drizel t1_je8v9cj wrote
GPT-4 can parse millions of papers and help uncover new optimizations or other improvements much faster than without it. Not only that but you can brainstorm ideas with it.
9182763498761234 t1_je8wfs4 wrote
Work in a niche field instead! There are hundreds of smaller topics in ML that are still unexplored, with only a couple of people working on each. I’m working on one of those and it is awesome. The field is progressing, but slowly enough that I can make a valuable contribution without getting scooped all the time.
TheAdvisorZabeth t1_je90pzt wrote
Hi!~ <3
umm... I am just an uneducated idiot, but I've been having a lot of ideas lately and I think some of them might be real Science too.
But I have no credentials or anyone to discuss ideas with or to help fact-check me about stuff I don't have nearly the Time to Learn.
You seem like a kind person, (and like you might have more Time than Puzzles with which to fruitfully spend that Time on.), do you think you might care to chat about my ideas? Or possibly offer any sincere advice that is a bit more useful to an autistic puppy than: "That is not a real Theory."?
I never used Tumblr before, but Neil Gaiman made a point to explicitly state that he hangs out there a lot; and since there's no living Author who I have more respect for, I recently began posting my ideas there.
From my perspective I am writing 100% Non-Fiction.
From my perspective I am just a very strange Harmless-Holistic-Aberrant; who managed to dumb-luck their way into figuring out how to gain "Coherent-Root-Access-To-My-Own-Brain".
I am being fully sincere.
I would just ask that if you (or anyone else) thinks that I am just being a stupid Fool, that you please tell me gently, I am pretty sensitive.
Love ya either way!~
Keep on doin your awesome Science stuff no matter what! Cause it's just the coolest thing! (hehe, I wonder if they got that joke?)
hugs!~~~
bye!^!^(for, now...)
OH! I almost forgot to actually Hand You One End Of The Thread lol~
silva_p t1_je9aw63 wrote
Can you share your niche?
9182763498761234 t1_je9bhu7 wrote
I’ll dm you
saintshing t1_je9e5q1 wrote
Natural language processing for cats and dogs
gmork_13 t1_je9e6wu wrote
I'm wondering the same thing.
In the LoRA paper they had some pros vs cons on other adapters (where LoRA won out). Though you technically could do both, you'd probably pick one.
Indeed, this adapter wins out vs LoRA when looking at weight size, but since we're talking about MB it's an almost negligible difference (in this scenario). It's a shame they didn't include LoRA training time in their comparison.
They say 1hr on 8*A100, whereas the alpaca-LoRA github says 4-5 hrs on 1*4090.
8*A100s give 640GB of VRAM (assuming the 80GB variant) as opposed to the 4090's 24GB. There are also differences in speed, and the alpaca-LoRA github may have run the inference on an 8-bit quantized model.
Since the adapter paper says nothing about quantization, I'm assuming it's 640GB of VRAM used for the full fp32 7B model for one hour (or fp16?), compared to the alpaca-LoRA git, which uses 24GB of VRAM on an int8 7B model for 4.5 hrs.
They both train on the stanford dataset, but alpaca-LoRA-git trains for 3 epochs on the cleaned dataset whereas llama-adapter trains on the full dataset for 5 epochs.
That's a lot of small differences to account for if you're trying to figure out what's faster.
It can be done, but the question remains whether the end result is comparable and whether it was trained to an optimal point.
Since the authors trained alpaca-LoRA, why didn't they write how long alpaca-LoRA took in their comparison table? They trained on the same hardware and dataset, I assume.
If the only difference between this adapter and others is, as they mention in the paper, the gating, zero init and multi-modality then the downsides mentioned in the LoRA paper might still hold (bottlenecks). I'm no expert though.
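To put the figures above side by side, here is a back-of-the-envelope calculation using only the numbers quoted in this comment (1 hr on 8 A100s, assumed to be the 80GB variant, versus roughly 4.5 hrs on one 24GB 4090). It ignores per-GPU speed, precision and dataset differences, and an A100 GPU-hour is not equivalent to a 4090 GPU-hour, so treat it as a rough sketch and not a benchmark.

```python
# Figures quoted in the comment above; the 80GB A100 variant is an assumption.
adapter = {"gpus": 8, "vram_gb_per_gpu": 80, "hours": 1.0, "epochs": 5}
lora = {"gpus": 1, "vram_gb_per_gpu": 24, "hours": 4.5, "epochs": 3}


def summarize(run):
    """Return total VRAM, GPU-hours and GPU-hours per epoch for one run."""
    total_vram = run["gpus"] * run["vram_gb_per_gpu"]
    gpu_hours = run["gpus"] * run["hours"]
    return total_vram, gpu_hours, gpu_hours / run["epochs"]


for name, run in (("llama-adapter", adapter), ("alpaca-LoRA", lora)):
    vram, gpu_hours, per_epoch = summarize(run)
    print(f"{name}: {vram} GB VRAM, {gpu_hours} GPU-hours, {per_epoch:.2f} GPU-hours/epoch")
```

Even this crude normalization shows why the two headline times cannot be compared directly: the hardware, the epoch count and possibly the precision all differ.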
saintshing t1_je9fciu wrote
> I was curious if as our brains do, use another instance of the same LLM to generate little hypothesis about the ongoing conversation, and store those on a vector space database, then use those generated thesis during reasoning.
I just learned about LangChain recently. If I understand correctly, it has agents that integrate LLMs with external tools like internet search, SQL queries, and vector store queries. It also has a memory module to store the ongoing dialog and intermediate results.
They use the ReAct or MRKL framework to create subproblems, decide which tools to use, and decide how to react to the results returned by those tools.
example: https://tsmatz.files.wordpress.com/2023/03/20230307_paper_example.jpg?w=446&zoom=2
https://python.langchain.com/en/latest/getting_started/getting_started.html
https://tsmatz.wordpress.com/2023/03/07/react-with-openai-gpt-and-langchain/
https://twitter.com/yoheinakajima/status/1640934493489070080
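The "store hypotheses in a vector database, then retrieve them during reasoning" idea quoted at the top of this comment can be sketched in a few lines. This is a toy illustration only: `embed` is a bag-of-words stand-in for a real embedding model, and `ToyMemory` is a hypothetical class, not LangChain's API.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyMemory:
    """Hypothetical in-memory store for hypotheses about a conversation."""

    def __init__(self):
        self.items = []  # list of (text, vector) pairs

    def add(self, text):
        self.items.append((text, embed(text)))

    def query(self, text, k=1):
        """Return the k stored texts most similar to the query."""
        q = embed(text)
        ranked = sorted(self.items, key=lambda item: cosine(q, item[1]), reverse=True)
        return [stored for stored, _ in ranked[:k]]


memory = ToyMemory()
memory.add("the user is asking about context length limits")
memory.add("the user likes whales and animal communication")
print(memory.query("how long can the context be", k=1))
```

In a real system the retrieved texts would be placed back into the prompt, so the context limit still applies to whatever is retrieved at each step.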
EquipmentStandard892 t1_je9gc9y wrote
I've already seen LangChain and it's truly amazing. The issue I've encountered and was trying to overcome is more of an architectural problem: the token context span limit. I was looking to add a layer on top of the transformer architecture to bypass this limitation. I've seen that MRKL is able to handle longer context lengths, even claiming unlimited context span, although I need to study it more. I was not thinking about prompt engineering at all.
saintshing t1_je9iw85 wrote
Jeremy Howard tweeted about this new model that is an RNN but can be trained in parallel. I haven't read the details, but it seems people are hyped that it can bypass the context length limit.
>RWKV is an RNN with Transformer-level LLM performance, which can also be directly trained like a GPT transformer (parallelizable). And it's 100% attention-free. You only need the hidden state at position t to compute the state at position t+1. You can use the "GPT" mode to quickly compute the hidden state for the "RNN" mode.
>So it's combining the best of RNN and transformer - great performance, fast inference, saves VRAM, fast training, "infinite" ctx_len, and free sentence embedding (using the final hidden state).
https://github.com/BlinkDL/RWKV-LM#the-rwkv-language-model-and-my-tricks-for-lms
https://twitter.com/BlinkDL_AI/status/1638555109373378560
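The quoted claim that the same model can run step by step ("RNN" mode) or over the whole sequence at once ("GPT" mode) can be illustrated with a much simpler recurrence than RWKV's. The sketch below is only a toy exponentially decayed sum with an assumed decay constant of 0.9; it is not the RWKV formulation, which is described in the linked repository.

```python
import numpy as np


def recurrent_mode(x, decay):
    """Process one step at a time; only the previous state is needed."""
    state = np.zeros(x.shape[1])
    states = []
    for x_t in x:
        state = decay * state + x_t
        states.append(state)
    return np.stack(states)


def parallel_mode(x, decay):
    """Compute every state at once as a decay-weighted sum over the past."""
    t = np.arange(x.shape[0])
    diff = t[:, None] - t[None, :]
    # weights[i, j] = decay ** (i - j) for j <= i, and 0 for future positions
    weights = np.where(diff >= 0, decay ** np.maximum(diff, 0), 0.0)
    return weights @ x


rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4))  # 6 time steps, 4 features
same = np.allclose(recurrent_mode(x, 0.9), parallel_mode(x, 0.9))
print(same)
```

Both functions produce the same states. The parallel form suits training on full sequences, while the recurrent form needs only a fixed-size state at inference time, and older inputs fade as the decay compounds.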
EquipmentStandard892 t1_je9kmvi wrote
This is exactly what I was talking about. I'm studying llama.cpp to understand how this whole ML/LLM world works, and I've found it's pretty "simple" in terms of the programming itself. I'm a software engineer outside the ML field, and it was pretty interesting to do this deep dive. I'll take a deeper look into this RWKV proposal and maybe build something on top of it to test. If I find something interesting, I'll comment here 😊
jan_antu t1_je9me91 wrote
I read a few of your posts. It seems like you're having a break from reality. I'm a scientist but not a psychologist; I think you should speak with one, or a psychiatrist. Things may be fine for now but you don't want to end up hurting yourself or someone else by accident as this progresses.
saintshing t1_je9okpn wrote
Apparently some people managed to reconstruct images from brain activity using a Stable Diffusion technique. I wonder how it would apply to animals.
3z3ki3l t1_je9qt86 wrote
If this isn’t copypasta, you’re having a manic episode. See a doctor, please.
lxe t1_je9vkqx wrote
In what way is this different than the existing low rank adaptation method everyone is doing already?
unkz t1_je9wuzm wrote
Practically speaking, it does have a context limit — that RNN issue has not really been solved. It is a lot of fun to play with though.
DigThatData t1_jea10sg wrote
yeah... i hate to say it but I agree with the other commenters. If you have access to medical support, I strongly recommend you get seen by a physician. I'm concerned you might be experiencing some kind of psychiatric episode. If you're skeptical that's fine, you can even tell them that.
> "Strangers on the internet expressed concern that I might be experiencing a psychiatric episode of some kind. I don't see it, but enough people suggested it that I felt it merited a professional opinion, so here I am."
JustOneAvailableName t1_jea2dzf wrote
Software engineer perspective on attention (self quote):
> You have to think about searching. If you search, you have a query (the search term), some way to correlate the query to the actual (size unknown/indifferent) knowledge base, and the knowledge base itself. If you have to write this as a mathematical function, you need something that scores how similar a query is to some key and then returns the value corresponding to that key. The transformer equation is a pretty straightforward formula from that perspective. Each layer learns what it searches for, how it can be found and which value it wants to transfer when requested.
RWKV changes this by removing the query. So data is not requested anymore, only pushed. I am frankly surprised it seems to work thus far. Pushing data (self-determining how important something is for something else) is not dependent on other states, which enables it to be an RNN.
Edit: step I need to mention: in RWKV importance also fades over time, so it has a recency bias
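The search analogy in the quote maps onto scaled dot-product attention from the original transformer paper. A minimal NumPy sketch follows; the shapes and random inputs are made up for illustration, and the learned projections that produce queries, keys and values are left out.

```python
import numpy as np


def softmax(scores):
    """Row-wise softmax, shifted for numerical stability."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def attention(query, key, value):
    """Match each query against all keys, then return a weighted mix of values."""
    d_k = query.shape[-1]
    scores = query @ key.T / np.sqrt(d_k)  # how well each query matches each key
    weights = softmax(scores)              # normalize the matches into weights
    return weights @ value, weights


rng = np.random.default_rng(0)
q = rng.normal(size=(3, 8))   # 3 search terms
k = rng.normal(size=(5, 8))   # 5 entries in the "knowledge base"
v = rng.normal(size=(5, 16))  # the content stored under each key
out, w = attention(q, k, v)
print(out.shape, w.sum(axis=-1))
```

Each row of `w` says how much of every stored value a given query pulls in, which is the "request" step that the comment says RWKV removes.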
pier4r t1_jead39m wrote
As a semi-layman, while I was amazed by the progress in ML, I was skeptical of ever-increasing models needing more and more parameters to do well. I felt like "more parameters can improve things, then other factors follow".
I asked myself whether there was any effort toward being more efficient by shrinking things. Recently I read about LLaMA and realized that this direction is now being pursued as well.
Swolnerman t1_jead4wo wrote
How can it do that with a context window of 32k?
On top of that, I don’t think gpt4 can make informed decisions on picking between academic research papers as of yet
A_Light_Spark t1_jeaim48 wrote
The real vip is in the comments again. TIL about rwkv!
Now I just need to read up on it and see if it can do sequence classification...
currentscurrents t1_jean0il wrote
Other researchers are working on an LLM for whales.
Looks feasible to me, whale calls are no more alien to the computer than English is. The hard part is collecting enough data.
saintshing t1_jeaowjz wrote
I almost missed it too. There are too many new results.
The craziest thing is that it was all done by one person, while the big tech companies all work on transformer models.
EquipmentStandard892 t1_jeaqt6u wrote
I've already had that in mind. I've found an interesting article about integrating LLMs in a specific way designed to handle autonomous task execution given a direct objective/goal. Combining this with the RNN approach seems to be the way to go for increasing the cognitive development of the whole system. Using the RNN the way our subconscious works, and indexing its output into a vector space capable of hybrid search (or something like SPLADE search engines), or even building a neural attention graph network to store the rules that aggregate the raw tokens into the vector space, could drastically improve the performance of small language models, maybe leading to further optimization beyond the token limit span.
Article about integrating memory and task/objectives using multiple LLM instances: https://yoheinakajima.com/task-driven-autonomous-agent-utilizing-gpt-4-pinecone-and-langchain-for-diverse-applications/
DigThatData t1_jeb49b8 wrote
this is probably not a concern for whale vocalizations, but an issue for attempting to decode animal communications generally via LLMs is that they're probably communicating as much information (if not more) non-vocally. for example, if we wanted to train an LLM to "understand" dog communication, it'd probably be more important to provide it with signals corresponding to changes in body and face pose than vocalizations. interesting stuff in any event.
currentscurrents t1_jeb4shv wrote
Yeah, I think that's why they're starting with whales - they're an easy subject since their vocalizations can be heard through the water from miles away. They also seem to have a fairly complex vocal language, unlike for example songbirds with memorized mating calls.
tvetus t1_jebkax4 wrote
Did you see the paper on voice to voice for talking with whales?
currentscurrents t1_jebm09t wrote
No, do you have a link?
thecity2 t1_jebvmmo wrote
Lol I just figured out why it’s called LLaMA. Guess I have some catching up to do. 🫤😆
MasterEpictetus t1_jecc69k wrote
Why am I only hearing about this now? It sounds amazing!
Koda_20 t1_jecddbs wrote
I feel like they are starting with whales because it generates more publicity because Nemo lol
They are probably not but I thought it was funny
aliasaria t1_jefih93 wrote
A short answer is that it is "just different". It's another way to tweak an existing LLM to do another task, without having to finetune the whole system. Conceptually, this way is simpler than LoRA and seems to work as well or better.
In the paper, the authors mention that one advantage is that you can use this technique to add new modalities. The whole method works by adding to the prompt at the top most layer(s), so you can add not just words, you could add tokens that come from an image. They have an example on the top of page 4 with a picture of a baby opening a door.
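A rough sketch of the zero-init gating idea mentioned in this thread: learnable prompt vectors act as extra keys and values, and their contribution is scaled by a gate that starts at zero, so training begins from the frozen model's behavior. The function name, the shapes, the `tanh` on the gate and the omission of a causal mask are all assumptions made for illustration; this is not the paper's code.

```python
import numpy as np


def softmax(scores):
    """Row-wise softmax, shifted for numerical stability."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def gated_prompt_attention(q, k, v, prompt_k, prompt_v, gate):
    """Attention over the tokens plus a gated contribution from prompt vectors."""
    scale = np.sqrt(q.shape[-1])
    token_weights = softmax(q @ k.T / scale)          # the frozen model's attention
    prompt_weights = softmax(q @ prompt_k.T / scale)  # attention to the added prompt
    return token_weights @ v + np.tanh(gate) * (prompt_weights @ prompt_v)


rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(4, 8)) for _ in range(3))
prompt_k = rng.normal(size=(2, 8))  # could equally come from image features
prompt_v = rng.normal(size=(2, 8))

plain = softmax(q @ k.T / np.sqrt(8)) @ v
at_init = gated_prompt_attention(q, k, v, prompt_k, prompt_v, gate=0.0)
print(np.allclose(plain, at_init))
```

With the gate at zero the output matches plain attention, and as the gate is trained away from zero the prompt vectors gradually start to influence the result.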
aliasaria t1_jefj33h wrote
It's a very different way to finetune a model efficiently.
All these tools try to nudge an existing large model, without having to nudge all the weights.
A simplistic explanation of LoRA is that it keeps the whole pretrained model frozen and learns a small low-rank update for selected weight matrices, so only those updates get nudged.
This tool, instead, adds weights to the model (at the start of prompts) in addition to the existing model.
One advantage to LoRA, in this case, is that you can merge your LoRA finetuned weights into the original model and the result is a new model that is exactly the same size and shape as the original model. In the technique in this paper, however, the final model is a different shape from the original model. But the concept is sort of simpler.
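The merge property described above follows from how LoRA writes its update: a low-rank product B·A added to a frozen weight matrix W. A minimal NumPy sketch, with made-up sizes and random values standing in for trained factors (the scaling factor used in the LoRA paper is omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank = 64, 32, 4

W = rng.normal(size=(d_out, d_in))          # frozen pretrained weight
B = rng.normal(size=(d_out, rank)) * 0.01   # low-rank factors (random stand-ins)
A = rng.normal(size=(rank, d_in)) * 0.01

x = rng.normal(size=(d_in,))
unmerged = W @ x + B @ (A @ x)  # base path plus adapter path
W_merged = W + B @ A            # fold the update into the weight
merged = W_merged @ x

print(W_merged.shape == W.shape, np.allclose(unmerged, merged))
print("trainable:", A.size + B.size, "vs full:", W.size)
```

The merged matrix has the same shape as the original, so inference cost is unchanged. A prompt-style adapter adds extra vectors to the sequence instead, which is why it cannot be folded into the existing weights in the same way.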
Appropriate-Crab-379 t1_jefz9og wrote
There’s a ton of noise, not all techniques are worth knowing because in a few years a bunch of these concepts will be outdone by something new.
lxe t1_jeg2h5j wrote
Thank you. Much appreciate the explanation.
dreaming_geometry t1_je7wweh wrote
I've been thinking about trying something like this. Everything is moving so fast now in ML that I feel like nearly every new idea I have gets published before I even find the time to get started.