StablePunFusion t1_jd8qm6b wrote on March 22, 2023 at 5:17 PM

Thanks for releasing the training data (https://github.com/sahil280114/codealpaca/blob/master/data/code_alpaca_20k.json).

Where was the training data gathered from? Has the data been verified to be correct?

I'm a tad sad to see that most of the training data doesn't have the language tagged anywhere. Some entries do, but most don't, so I guess the resulting model might not be super useful, since it'll confuse languages.
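
For anyone who wants to put a rough number on the tagging, here is a minimal sketch. It assumes the JSON file is a list of Alpaca-style records with "instruction", "input" and "output" fields, and it treats a record as "tagged" if the instruction or input names a language. The keyword list and the function names are my own placeholders, not anything from the repo.

```python
import json
import re

# Hypothetical keyword list (an assumption, not part of the dataset); extend as needed.
LANGUAGES = ["python", "javascript", "java", "c++", "c#", "sql",
             "html", "css", "ruby", "php", "bash"]

# Match a language name only when it is not embedded in a longer token,
# so "java" does not fire inside "javascript".
_PATTERN = re.compile(
    r"(?<![a-z0-9+#])(?:"
    + "|".join(re.escape(name) for name in LANGUAGES)
    + r")(?![a-z0-9+#])"
)

def mentions_language(record):
    """Return True if the instruction or input text names a known language."""
    text = " ".join([
        record.get("instruction") or "",
        record.get("input") or "",
    ]).lower()
    return _PATTERN.search(text) is not None

def tagged_fraction(records):
    """Return the share of records that name a language (0.0 for an empty list)."""
    if not records:
        return 0.0
    return sum(mentions_language(r) for r in records) / len(records)

def load_records(path):
    """Load the dataset, assuming the file holds a JSON list of records."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

Something like tagged_fraction(load_records("code_alpaca_20k.json")) would then give the share of entries that name a language at all. Keyword matching is crude (it misses languages that are only implied by the code in the output), so treat the result as a rough estimate.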

immune_star OP t1_jd8qqsg wrote on March 22, 2023 at 5:17 PM

The data was generated using text-davinci-003; it has not been verified to be correct.

StablePunFusion t1_jd8r3xb wrote on March 22, 2023 at 5:20 PM

Do you (or anyone) know of any higher-quality training sets for code?

Good ones seemed to be lacking, at least the last time I searched around. Maybe it's time to spin up a community initiative around it?
