
in-your-own-words t1_irmn9d4 wrote

Randomly permute the rows of the table, and then take the first X% of them for training.

1

redditnit21 OP t1_irmniga wrote

Can you write the code for that?

−3

in-your-own-words t1_irmnurg wrote

Yes, there are dozens of ways of doing it. I encourage you to figure it out yourself. If you can't design and implement your own test & evaluation (T&E) experiments for ML, you will end up doing the world more harm than good by dabbling in it. The entire ML field suffers from extremely weak T&E, and lots of people are just learning to stuff inputs into functions.

Some hints:

  • There may be functions within standard machine learning software libraries that produce train/test splits given tabular data input.

  • There may be functions that will produce a random permutation of rows of a table.

  • There may be functions that produce a random permutation of the numbers from 0 to N-1, where you specify N. If N is the number of rows in your table, you could add one such permutation as a new column and then sort the table on that column.

  • You may want to consider class imbalance in your dataset. If the classes are imbalanced, apply your train/test split independently to class 1 and class 0, so that the train and test partitions each contain the same proportion of 1s and 0s as the full dataset.

  • Consider using an outer cross-validation approach, where you run your experiment for k different train/test splits. When you report your metrics, look at the distribution of each metric over the k trials: report the median, interquartile range, 5th and 95th percentiles, and any outliers.

  • Version-control your code and tag the commit that produces the results you report. Include this tag or the commit hash when you report results.
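
To make the split and reporting hints concrete, here is one possible sketch using only the Python standard library. It is not the only way to do it, and several pieces are assumptions for illustration: the toy (feature, label) rows, the helper names stratified_split, summarize, and threshold_accuracy, the 80/20 split, and k = 10. It uses repeated random stratified splits as a simple stand-in for k-fold cross-validation, and the "model" is just a threshold between the class means, so swap in your own training and metric code.

```python
import random
import statistics


def stratified_split(rows, label_of, train_fraction, rng):
    """Shuffle each class separately, then take the first train_fraction of each."""
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    train, test = [], []
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)  # random permutation of this class's rows
        cut = round(train_fraction * len(members))
        train.extend(members[:cut])
        test.extend(members[cut:])
    return train, test


def summarize(values):
    """Median, quartiles, and 5th/95th percentiles of one metric over k trials."""
    pct = statistics.quantiles(values, n=100, method="inclusive")
    return {
        "p5": pct[4],
        "q1": pct[24],
        "median": statistics.median(values),
        "q3": pct[74],
        "p95": pct[94],
    }


def threshold_accuracy(train, test):
    """Toy stand-in for a real model: threshold halfway between the class means."""
    mean0 = statistics.mean(x for x, y in train if y == 0)
    mean1 = statistics.mean(x for x, y in train if y == 1)
    threshold = (mean0 + mean1) / 2
    correct = sum((x > threshold) == (y == 1) for x, y in test)
    return correct / len(test)


# Hypothetical imbalanced dataset: 80 rows of class 0, 20 rows of class 1.
data_rng = random.Random(0)
rows = [(data_rng.gauss(0.0, 1.0), 0) for _ in range(80)]
rows += [(data_rng.gauss(2.0, 1.0), 1) for _ in range(20)]

k = 10
accuracies = []
for trial in range(k):
    rng = random.Random(trial)  # one seed per trial so each split is reproducible
    train, test = stratified_split(rows, lambda r: r[1], 0.8, rng)
    accuracies.append(threshold_accuracy(train, test))

report = summarize(accuracies)
print(report)
```

Each trial keeps the 4:1 class ratio in both partitions (64 + 16 rows for training, 16 + 4 for testing). The summary leaves out outlier detection; a box plot of the k values is one common way to spot those.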

3