StablePunFusion t1_jd8qm6b wrote

Thanks for releasing the training data (https://github.com/sahil280114/codealpaca/blob/master/data/code_alpaca_20k.json).

Where was the training data gathered from? Has the data been verified to be correct?

I'm a tad sad to see that most of the training data doesn't have the language tagged anywhere. Some entries do, but most don't, so the resulting model might not be super useful, since it'll probably confuse languages.
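
For anyone who wants to check this rather than eyeball it, here is a rough sketch of how one might estimate the share of entries that name a language. It assumes each record is a dict with Alpaca-style `instruction` and `input` fields, which is my reading of the file and not something confirmed here. The `LANGUAGES` list and both function names are made up for illustration, and the sample records are invented, not taken from the dataset.

```python
import re

# Hypothetical list of language names to look for; extend as needed.
LANGUAGES = ["python", "javascript", "java", "sql", "html", "css", "ruby"]


def mentioned_languages(text, languages=LANGUAGES):
    """Return the language names that appear as whole words in text."""
    found = []
    for name in languages:
        # Letter-boundary lookarounds keep "java" from matching "javascript".
        pattern = r"(?<![A-Za-z])" + re.escape(name) + r"(?![A-Za-z])"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(name)
    return found


def tagged_fraction(records, languages=LANGUAGES):
    """Fraction of records whose instruction or input names a language."""
    if not records:
        return 0.0
    tagged = sum(
        1
        for r in records
        if mentioned_languages(
            r.get("instruction", "") + " " + r.get("input", ""), languages
        )
    )
    return tagged / len(records)


# Tiny invented sample in the assumed record shape (not real data).
sample = [
    {"instruction": "Write a Python function to reverse a string.", "input": "", "output": "..."},
    {"instruction": "Sort the array in ascending order.", "input": "[3, 1, 2]", "output": "..."},
]
print(tagged_fraction(sample))
```

To run it on the real file, load the downloaded JSON with `json.load` and pass the resulting list to `tagged_fraction`. Keyword matching like this only gives a rough lower bound, since it misses entries where the language is implied by the code itself.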

1

immune_star OP t1_jd8qqsg wrote

The data was generated using text-davinci-003 and has not been verified to be correct.

0

StablePunFusion t1_jd8r3xb wrote

Do you (or anyone) know of any higher quality sources of training sets for code?

Seems to be lacking, at least when I searched around last time. Maybe it's time to spin up a community initiative around it?

2