coconautico OP t1_ja8dbew wrote

No, I don't, because even if ChatGPT could answer my question correctly, that doesn't mean another assistant could.

Therefore, whenever I come up with a question that, from my point of view, could be challenging for a virtual assistant to answer, I end up typing it into OpenAssistant (again, just my question), regardless of whether I have already searched Google/Reddit/StackOverflow/ChatGPT/... for the answer.

2

coconautico OP t1_ja3ujgs wrote

According to OpenAI's terms of service, I'm the owner of the input (i.e., my question). They can use, modify, and distribute my input for the purpose of operating and improving the ChatGPT system, but they can't do anything to prevent me from using my own data in other systems.
Link: https://openai.com/terms/

6

coconautico OP t1_ja3nvs7 wrote

I have manually copy-pasted a few interesting questions (i.e., my input) that I had previously asked ChatGPT and that encouraged lateral thinking or required specialized knowledge.

However, I'm not so sure it would be a good idea to load thousands of questions indiscriminately. Just as we wouldn't phrase a question on Reddit the same way we would in person, when we ask ChatGPT (or Google) a question, we slightly change the way we talk to account for the weaknesses of the system. And given that we are looking for a high-quality dataset of natural conversations, I don't think this would be a very good strategy in the short term.

Moreover, we also have to consider that the project prioritizes quality above all else, so unless the number of volunteers ranking questions/replies increases considerably, the ratio of trees ready to export wouldn't increase much either.

3

coconautico OP t1_ja1kdu6 wrote

Neither. OpenAssistant is the initiative to build an open-source version of ChatGPT that will fit on a consumer GPU.

However, the goal of this website is to collaboratively create the specific type of dataset needed to transform an LLM such as GPT, OPT, Galactica, LLaMA, ... into a virtual assistant that we can talk to, like ChatGPT.

7

coconautico OP t1_ja1gd4g wrote

Indeed! Many of them are just copying and pasting answers out of laziness or because they don't know they're not supposed to. But you know what? That's okay! It doesn't matter. And it's all thanks to the magic of large-scale ranking! Let me explain.

If we had an LLM that just "read" text indiscriminately, we would end up with a model that could hardly be better than the average human (...as the average human is, by definition, average). However, the moment we have multiple answers per question, with hundreds of people upvoting/downvoting and ranking them relative to one another by quality (...and a few moderators, like on Reddit), we end up with a set of fairly high-quality question-answer pairs that are better than the average human answer, in the same way that a set of weak classifiers can be combined into a strong classifier (e.g., AdaBoost).
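The weak-to-strong intuition can be sketched with a quick simulation (illustrative only: the function name and probabilities here are made up, and plain majority voting is a simpler cousin of the weighted voting AdaBoost actually uses):

```python
import random

random.seed(0)

def majority_vote_accuracy(n_voters, p_correct, n_trials=10_000):
    """Estimate how often a majority of independent weak voters,
    each correct with probability p_correct, picks the right answer."""
    wins = 0
    for _ in range(n_trials):
        correct_votes = sum(random.random() < p_correct for _ in range(n_voters))
        if correct_votes > n_voters / 2:
            wins += 1
    return wins / n_trials

# A single "average" voter who is right 60% of the time stays near 0.6,
# but aggregating 101 such independent voters by majority vote pushes
# the group's accuracy far higher.
print(majority_vote_accuracy(1, 0.6))
print(majority_vote_accuracy(101, 0.6))
```

The same idea is why many noisy rankings from individual volunteers can still sort answers by quality quite reliably, as long as the raters are (somewhat) independent and better than chance.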

10