Submitted by immune_star t3_11yh8x8 in MachineLearning
StablePunFusion t1_jd8qm6b wrote
Thanks for releasing the training data (https://github.com/sahil280114/codealpaca/blob/master/data/code_alpaca_20k.json).
Where was the training data gathered from? Has the data been verified to be correct?
I'm a tad sad to see that most of the training data doesn't have the language tagged anywhere. Some entries do, but most don't, so I'd guess the resulting model will confuse languages and might not be super useful.
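For anyone who wants to put a rough number on the tagging gap, here is a minimal sketch. It assumes the JSON file is a list of Alpaca-style records with "instruction" and "input" string fields (I haven't confirmed the exact schema), and the language list and helper names are my own, purely for illustration. It only checks whether a language name appears in the prompt text, so treat the result as a crude estimate.

```python
import json
import re

# Hypothetical, incomplete list of language names to look for.
# Short ambiguous names (e.g. "go", "c", "r") are left out to limit false positives.
LANGUAGES = ["python", "javascript", "java", "c++", "c#", "sql",
             "html", "css", "ruby", "rust", "php", "bash"]


def mentioned_languages(text):
    """Return the set of known language names that appear as whole tokens in text."""
    lowered = text.lower()
    found = set()
    for lang in LANGUAGES:
        # Lookarounds instead of \b so that "c++" and "c#" match,
        # and "java" does not match inside "javascript".
        pattern = r"(?<![a-z0-9+#])" + re.escape(lang) + r"(?![a-z0-9+#])"
        if re.search(pattern, lowered):
            found.add(lang)
    return found


def tagged_fraction(records):
    """Fraction of records whose instruction or input names at least one language."""
    if not records:
        return 0.0
    tagged = 0
    for record in records:
        # Assumed field names; missing fields are treated as empty strings.
        prompt = record.get("instruction", "") + " " + record.get("input", "")
        if mentioned_languages(prompt):
            tagged += 1
    return tagged / len(records)


def load_records(path):
    """Load a JSON file that is assumed to contain a list of record dicts."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

With a local copy of the file, `tagged_fraction(load_records("code_alpaca_20k.json"))` would give the share of prompts that name a language explicitly. A keyword match like this will miss prompts where the language is only implied by the code itself.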
immune_star OP t1_jd8qqsg wrote
The data was generated using text-davinci-003 and has not been verified to be correct.
StablePunFusion t1_jd8r3xb wrote
Do you (or anyone) know of any higher quality sources of training sets for code?
They seemed to be lacking, at least when I last searched around. Maybe it's time to spin up a community initiative around it?