Comments

dreaming_geometry t1_je7wweh wrote

I've been thinking about trying something like this. Everything is moving so fast in ML now that I feel like nearly every new idea I have gets published before I even find the time to get started.

91

EquipmentStandard892 t1_je7xyd9 wrote

I read your paper and was reasoning about something interesting: I wonder if it is possible to use this method to fine-tune the model to query a vector database without running into its context length limitations. It may sound stupid, but humans don't just say things. I'm not talking about CoT specifically; I was curious whether, as our brains do, we could use another instance of the same LLM to generate little hypotheses about the ongoing conversation, store those in a vector database, and then use those generated hypotheses during reasoning. We humans also have a limited cognitive memory, so how do we overcome this? Great paper btw.
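The "store hypotheses, recall them later" idea can be sketched in a few lines. This is only an illustration under loud assumptions: `embed` is a toy bag-of-words stand-in for a real embedding model, and `HypothesisMemory` is a hypothetical in-memory class, not the API of any vector database.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding (assumption): word counts instead of a learned encoder."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

class HypothesisMemory:
    """Hypothetical stand-in for a vector database."""
    def __init__(self):
        self.items = []  # list of (vector, text) pairs

    def store(self, text):
        self.items.append((embed(text), text))

    def recall(self, query, k=1):
        """Return the k stored texts most similar to the query."""
        q = embed(query)
        ranked = sorted(self.items, key=lambda item: cosine(q, item[0]), reverse=True)
        return [text for _, text in ranked[:k]]

memory = HypothesisMemory()
# In the proposed setup a second LLM instance would write these notes.
memory.store("user seems to prefer short answers")
memory.store("user is asking about context length limits")
memory.store("user mentioned a vector database earlier")

question = "how do I get around the context length limit"
recalled = memory.recall(question, k=1)
# Only the recalled notes are added to the prompt, not the whole history.
prompt = "Notes: " + "; ".join(recalled) + "\nUser: " + question
```

The point of the design is that the prompt grows with `k`, not with the length of the conversation.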

30

_Arsenie_Boca_ t1_je8km8c wrote

Very interesting work! Though I find the explanation of the concrete approach (how the additional parameters are used by the LM) to be a bit vague. Does anyone have a deeper understanding? Is it using regular adapters?

4

hailfire27 t1_je8l7id wrote

I think he's talking about how during conversations, there are different cognitive levels to a conversation. You are basically having a conversation with yourself about what to say and remembering things to talk about, while at the same time considering the context of the situation, such as the environment or activity.

So he's saying for a model like this, would it be possible to tune the model so that it is able to give better answers in a conversation.

13

CasulaScience t1_je8tqrr wrote

In terms of practical application, is there any reason why someone would use this over Low Rank Adaptation?

3

Deep-Station-1746 t1_je8u12c wrote

This is interesting - compared to LoRA, it allows LLaMA to also accept images as inputs. And I believe it is orthogonal to LoRA, meaning the two can possibly be used together. I'm unsure about the training stability, though. I know that LoRA training allows ridiculously high learning rates (1e-5 for the text encoder), especially for DreamBooth. Using LoRA for the frozen weights plus the LLaMA adapter is an interesting thing to explore.

Edit: spelling

10

ahm_rimer t1_je8u2bi wrote

LoRA + PEFT + Zero-init attention adapter = 🤯

24

9182763498761234 t1_je8wfs4 wrote

Work in a niche field instead! There are hundreds of smaller topics in ML that are still unexplored, with only a couple of people working on each. I’m working on one of those and it is awesome. The field is progressing, but slowly enough that I can make a valuable contribution without getting scooped all the time.

24

TheAdvisorZabeth t1_je90pzt wrote

Hi!~ <3

umm... I am just an uneducated idiot, but I've been having a lot of ideas lately and I think some of them might be real Science too.

But I have no credentials or anyone to discuss ideas with or to help fact-check me about stuff I don't have nearly the Time to Learn.

You seem like a kind person, (and like you might have more Time than Puzzles with which to fruitfully spend that Time on.), do you think you might care to chat about my ideas? Or possibly offer any sincere advice that is a bit more useful to an autistic puppy than: "That is not a real Theory."?

I never used Tumblr before, but Neil Gaiman made a point to explicitly state that he hangs out there a lot; and since there's no living Author who I have more respect for, I recently began posting my ideas there.

From my perspective I am writing 100% Non-Fiction.

From my perspective I am just a very strange Harmless-Holistic-Aberrant; who managed to dumb-luck their way into figuring out how to gain "Coherent-Root-Access-To-My-Own-Brain".

I am being fully sincere.

I would just ask that if you (or anyone else) thinks that I am just being a stupid Fool, that you please tell me gently, I am pretty sensitive.

Love ya either way!~

Keep on doin your awesome Science stuff no matter what! Cause it's just the coolest thing! (hehe, I wonder if they got that joke?)

hugs!~~~

bye!^!^(for, now...)

OH! I almost forgot to actually Hand You One End Of The Thread lol~

https://www.tumblr.com/baby-ghost-in-the-machine-lovers

−28

gmork_13 t1_je9e6wu wrote

I'm wondering the same thing.
In the LoRA paper they had some pros vs cons on other adapters (where LoRA won out). Though you technically could do both, you'd probably pick one.

Indeed, this adapter wins out vs LoRA when looking at weight size, but since we're talking about MB it's an almost negligible difference (in this scenario). It's a shame they didn't include LoRA training time in their comparison.

They say 1hr on 8*A100, whereas the alpaca-LoRA github says 4-5 hrs on 1*4090.
8*A100 is 640GB of VRAM (assuming the 80GB model) as opposed to the 4090's 24GB - there are also differences in speed, and the alpaca-LoRA github may have run on an 8-bit quantized model.

Since the adapter paper says nothing about quantization, I'm assuming it's 640GB of VRAM used for the full fp32 7B model for one hour (or fp16?), compared to the alpaca-LoRA git, which runs in 24GB of VRAM on an int8 7B model for 4.5 hrs.
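For a rough sense of scale, the figures quoted above can be put side by side. This is only back-of-the-envelope arithmetic: 4.5 h is taken as the midpoint of "4-5 hrs", the epoch counts are the ones given in this comment, and an A100 GPU-hour is not equivalent to a 4090 GPU-hour, so the numbers are not a fair speed comparison.

```python
# Figures as quoted in the thread; nothing here is measured.
adapter_gpus, adapter_hours, a100_vram_gb = 8, 1.0, 80   # llama-adapter, 8*A100
lora_gpus, lora_hours, rtx4090_vram_gb = 1, 4.5, 24      # alpaca-LoRA, 1*4090

adapter_gpu_hours = adapter_gpus * adapter_hours          # 8.0 A100-hours
lora_gpu_hours = lora_gpus * lora_hours                   # 4.5 4090-hours
adapter_total_vram_gb = adapter_gpus * a100_vram_gb       # 640 GB available, not necessarily used

# Normalise by epochs (5 for llama-adapter, 3 for alpaca-LoRA per the comment).
adapter_gpu_hours_per_epoch = adapter_gpu_hours / 5       # 1.6
lora_gpu_hours_per_epoch = lora_gpu_hours / 3             # 1.5

print(adapter_gpu_hours, lora_gpu_hours, adapter_total_vram_gb)
print(adapter_gpu_hours_per_epoch, lora_gpu_hours_per_epoch)
```

Per epoch the two land close together in raw GPU-hours, which is why the differences in hardware, precision and dataset matter so much for any real conclusion.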

They both train on the stanford dataset, but alpaca-LoRA-git trains for 3 epochs on the cleaned dataset whereas llama-adapter trains on the full dataset for 5 epochs.
That's a lot of small differences to account for if you're trying to figure out what's faster.
It can be done, but the question remains whether the end result is comparable and whether it was trained to an optimal point.

Since the authors trained alpaca-LoRA, why didn't they write how long alpaca-LoRA took in their comparison table? They trained on the same hardware and dataset, I assume.

If the only difference between this adapter and others is, as they mention in the paper, the gating, zero init and multi-modality then the downsides mentioned in the LoRA paper might still hold (bottlenecks). I'm no expert though.

8

saintshing t1_je9fciu wrote

> I was curious whether, as our brains do, we could use another instance of the same LLM to generate little hypotheses about the ongoing conversation, store those in a vector database, and then use those generated hypotheses during reasoning.

I just learned about LangChain recently. If I understand correctly, they have agents that integrate LLMs with external tools like internet search, SQL queries and vector store queries. It also has a memory module to store the ongoing dialog and intermediate results.

They use the ReAct or MRKL framework to create subproblems, decide which tools to use, and decide how to react to the results returned by those tools.

example: https://tsmatz.files.wordpress.com/2023/03/20230307_paper_example.jpg?w=446&zoom=2

https://python.langchain.com/en/latest/getting_started/getting_started.html

https://tsmatz.wordpress.com/2023/03/07/react-with-openai-gpt-and-langchain/

https://twitter.com/yoheinakajima/status/1640934493489070080

7

EquipmentStandard892 t1_je9gc9y wrote

I've already seen LangChain and it's truly amazing. The issue I've encountered and was trying to overcome is more of an architectural problem, actually: the token context span limit. I was looking to add a layer on top of the transformer architecture to bypass this limitation. I've seen that MRKL is able to handle longer context lengths, even claiming unlimited context span, although I need to study it more. I was not thinking about prompt engineering at all.

7

saintshing t1_je9iw85 wrote

Jeremy Howard tweeted about this new model that is an RNN but can be trained in parallel. I haven't read the details, but it seems people are hyped that it can bypass the context length limit.

>RWKV is an RNN with Transformer-level LLM performance, which can also be directly trained like a GPT transformer (parallelizable). And it's 100% attention-free. You only need the hidden state at position t to compute the state at position t+1. You can use the "GPT" mode to quickly compute the hidden state for the "RNN" mode.

>So it's combining the best of RNN and transformer - great performance, fast inference, saves VRAM, fast training, "infinite" ctx_len, and free sentence embedding (using the final hidden state).

https://github.com/BlinkDL/RWKV-LM#the-rwkv-language-model-and-my-tricks-for-lms
https://twitter.com/BlinkDL_AI/status/1638555109373378560

4

EquipmentStandard892 t1_je9kmvi wrote

This is exactly what I was talking about. I'm studying llama.cpp to understand how this whole ML/LLM world works, and I've found it's pretty "simple" in terms of the programming itself. I'm a software engineer outside the ML field, and it was pretty interesting to do this deep dive. I'll take a deeper look into this RWKV proposal and maybe build something on top of it to test. If I find something interesting, I'll comment here 😊

3

jan_antu t1_je9me91 wrote

I read a few of your posts. It seems like you're having a break from reality. I'm a scientist but not a psychologist; I think you should speak with one, or a psychiatrist. Things may be fine for now but you don't want to end up hurting yourself or someone else by accident as this progresses.

16

lxe t1_je9vkqx wrote

In what way is this different than the existing low rank adaptation method everyone is doing already?

2

DigThatData t1_jea10sg wrote

yeah... i hate to say it but I agree with the other commenters. If you have access to medical support, I strongly recommend you get seen by a physician. I'm concerned you might be experiencing some kind of psychiatric episode. If you're skeptical that's fine, you can even tell them that.

> "Strangers on the internet expressed concern that I might be experiencing a psychiatric episode of some kind. I don't see it, but enough people suggested it that I felt it merited a professional opinion, so here I am."

6

JustOneAvailableName t1_jea2dzf wrote

Software engineer perspective on attention (self quote):

> You have to think about searching. If you search, you have a query (the search term), some way to correlate the query to the actual (size unknown/indifferent) knowledge base, and the knowledge base itself. If you have to write this as a mathematical function, you need something that measures how similar a query is to some key and then returns the value corresponding to that key. The transformer equation is a pretty straightforward formula from that perspective. Each layer learns what it searches for, how it can be found, and which value it wants to transfer when requested.
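The search analogy maps onto scaled dot-product attention, softmax(q·k / sqrt(d)) applied to the values. Below is a minimal pure-Python sketch for a single query; the vectors are made up for illustration, and in a real transformer the queries, keys and values are learned projections of the token embeddings.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    # How well does the query match each key?
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    # Return a blend of the values, weighted by the match quality.
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return output, weights

keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # "how it can be found"
values = [[10.0], [20.0], [30.0]]             # "what is transferred when requested"
output, weights = attend([4.0, 0.0], keys, values)  # "what it searches for"
```

Unlike a dictionary lookup, the result is a soft mix: keys 0 and 2 match the query equally well here, so their values dominate the output, while key 1 contributes only a little.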

RWKV changes this by removing the query, so data is no longer requested, only pushed. I am frankly surprised it seems to work thus far. Pushing data (each token determining for itself how important it is to something else) does not depend on other states, which is what lets it be an RNN.

Edit: step I need to mention: in RWKV importance also fades over time, so it has a recency bias
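A toy recurrence shows what "pushed, with fading importance" could look like. To be clear about the assumptions: this is not the actual RWKV formula (which, as far as I understand, uses learned per-channel decays and extra gating); `run_decayed_memory` and the fixed `decay` value are hypothetical and only illustrate the fixed-size state and the recency bias.

```python
import math

def run_decayed_memory(importances, values, decay=0.5):
    """Process a sequence step by step with a constant-size state.

    importances[t] is the weight token t assigns to itself (no query involved);
    older contributions are multiplied by `decay` at every step.
    """
    num, den = 0.0, 0.0   # the entire recurrent state
    outputs = []
    for k, v in zip(importances, values):
        weight = math.exp(k)
        num = decay * num + weight * v   # decayed weighted sum of values
        den = decay * den + weight       # decayed sum of weights
        outputs.append(num / den)        # weighted average seen so far
    return outputs

# A single "1.0" at the first step, followed by zeros: its influence fades.
outputs = run_decayed_memory([0.0, 0.0, 0.0], [1.0, 0.0, 0.0])
```

Because each step only needs `num` and `den` from the previous step, memory use does not grow with sequence length, which is the property people describe as "infinite" context.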

3

pier4r t1_jead39m wrote

As a semi-layman, while I was amazed by the progress in ML, I was skeptical of ever-growing models needing more and more parameters to do well. It felt like "more parameters can improve things, then other factors follow".

I asked myself whether there was any effort to be more efficient by shrinking things, and recently I read about LLaMA and realized that this direction is now being pursued as well.

1

Swolnerman t1_jead4wo wrote

How can it do that with a context window of 32k?

On top of that, I don’t think GPT-4 can make informed decisions when picking between academic research papers as of yet

3

EquipmentStandard892 t1_jeaqt6u wrote

I already had that in mind. I've found an interesting paper about integrating LLMs in a way designed to handle autonomous task execution given a direct objective/goal. Combining this with the RNN approach seems to be the way to go for increasing the cognitive development of the whole system. Using the RNN as our subconscious would, and indexing this into a vector space capable of hybrid search (or something like SPLADE search engines), or even building a neural attention graph network to store the rules that aggregate the raw tokens into the vector space, could drastically improve the performance of small language models, maybe leading to further optimization beyond the token limit.

Article about integrating memory and task/objectives using multiple LLM instances: https://yoheinakajima.com/task-driven-autonomous-agent-utilizing-gpt-4-pinecone-and-langchain-for-diverse-applications/

1

DigThatData t1_jeb49b8 wrote

this is probably not a concern for whale vocalizations, but an issue for attempting to decode animal communications generally via LLMs is that they're probably communicating as much information (if not more) non-vocally. for example, if we wanted to train an LLM to "understand" dog communication, it'd probably be more important to provide it with signals corresponding to changes in body and face pose than vocalizations. interesting stuff in any event.

3

currentscurrents t1_jeb4shv wrote

Yeah, I think that's why they're starting with whales - they're an easy subject since their vocalizations can be heard through the water from miles away. They also seem to have a fairly complex vocal language, unlike for example songbirds with memorized mating calls.

2

thecity2 t1_jebvmmo wrote

Lol I just figured out why it’s called LLaMA. Guess I have some catching up to do. 🫤😆

1

aliasaria t1_jefih93 wrote

A short answer is that it is "just different". It's another way to tweak an existing LLM to do another task, without having to finetune the whole system. Conceptually, this way is simpler than LoRA and seems to work as well or better.

In the paper, the authors mention that one advantage is that you can use this technique to add new modalities. The whole method works by adding to the prompt at the top most layer(s), so you can add not just words, you could add tokens that come from an image. They have an example on the top of page 4 with a picture of a baby opening a door.
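A rough sketch of the zero-init gating idea follows. It rests on my reading of the paper and is an assumption, not the authors' code: attention over the adapter prompt is normalised separately from attention over the ordinary tokens and scaled by a learnable gate that starts at zero. All names (`gated_attention`, `prompt_kv`, and so on) are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_scores(query, keys):
    scale = math.sqrt(len(query))
    return [sum(q * k for q, k in zip(query, key)) / scale for key in keys]

def weighted_sum(weights, vectors):
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(len(vectors[0]))]

def gated_attention(query, word_kv, prompt_kv, gate):
    """Attention over normal tokens plus a gated contribution from adapter prompts."""
    word_keys, word_values = word_kv
    prompt_keys, prompt_values = prompt_kv
    word_w = softmax(attention_scores(query, word_keys))
    # The gate scales only the adapter-prompt part (assumed simplification).
    prompt_w = [gate * w for w in softmax(attention_scores(query, prompt_keys))]
    base = weighted_sum(word_w, word_values)
    extra = weighted_sum(prompt_w, prompt_values)
    return [b + e for b, e in zip(base, extra)]

query = [1.0, 0.0]
word_kv = ([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
prompt_kv = ([[0.5, 0.5]], [[5.0, 5.0]])   # could equally come from image features

plain = weighted_sum(softmax(attention_scores(query, word_kv[0])), word_kv[1])
at_init = gated_attention(query, word_kv, prompt_kv, gate=0.0)
after_training = gated_attention(query, word_kv, prompt_kv, gate=0.3)
```

With the gate at zero the output is identical to the frozen model's, so training starts from the pretrained behaviour and the new prompt is blended in only as the gate moves away from zero.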

2

aliasaria t1_jefj33h wrote

It's a very different way to finetune a model efficiently.

All these tools try to nudge an existing large model, without having to nudge all the weights.

A simplistic explanation of LoRA is that it freezes the pretrained weights and learns a small low-rank correction for some of the weight matrices, so only those few extra parameters get nudged.

This tool, instead, adds weights to the model (at the start of prompts) in addition to the existing model.

One advantage to LoRA, in this case, is that you can merge your LoRA finetuned weights into the original model and the result is a new model that is exactly the same size and shape as the original model. In the technique in this paper, however, the final model is a different shape from the original model. But the concept is sort of simpler.
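The merge property can be checked in a few lines. In LoRA the learned update is the product of two small matrices, B and A, added to a frozen weight W. The sketch below uses tiny made-up matrices and plain Python lists, and it leaves out LoRA's scaling factor for brevity.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def add(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]

# Frozen pretrained weight (3x3) and a rank-1 update B (3x1) times A (1x3).
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
B = [[0.5], [0.0], [-0.5]]
A = [[1.0, 2.0, 0.0]]

x = [[1.0], [1.0], [1.0]]  # one input column vector

# During fine-tuning: keep W frozen and add the low-rank path.
y_unmerged = add(matmul(W, x), matmul(B, matmul(A, x)))

# After fine-tuning: fold the update into W once.
W_merged = add(W, matmul(B, A))
y_merged = matmul(W_merged, x)

trainable_params = len(B) * len(B[0]) + len(A) * len(A[0])   # 6 instead of 9
```

`W_merged` has the same 3x3 shape as `W`, so the merged model costs nothing extra at inference time, whereas an adapter that adds prompt tokens has to keep those extra parameters around.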

2