in-your-own-words t1_irmn9d4 wrote on October 9, 2022 at 12:54 PM

Randomly permute the rows of the table, and then take the first X% of them for training.

redditnit21 OP t1_irmniga wrote on October 9, 2022 at 12:56 PM

Can you write the code for that?

in-your-own-words t1_irmnurg wrote on October 9, 2022 at 12:59 PM

Yes, there are dozens of ways of doing it. I encourage you to figure it out yourself. If you can't design and implement your own test & evaluation experiments for ML, you will end up doing the world more harm than good by dabbling in it. The entire ML field suffers from extremely weak T&E, and lots of people are just learning to stuff inputs into functions.

Some hints:

There may be functions within standard machine learning software libraries that produce train/test splits given tabular data input.
There may be functions that will produce a random permutation of rows of a table.
There may be functions that produce a random permutation of the integers from 0 to N-1, where you specify N. If N is the number of rows in your table, you could add a new column holding these random numbers and then sort the table on that column.
You may want to consider class imbalance in your dataset. If this is the case, apply your train/test split independently to class 1 and class 0 such that your resulting split contains the same proportion of 1 and 0 in both train and test partitions.
Consider using an outer cross-validation approach, where you run your experiment for k different train/test splits. When you report your metrics, look at the distribution of each metric over the k trials: report the median, interquartile range, 5th and 95th percentiles, and outliers for each metric.
Version-control your code and tag the commit that produces the results you report. Include this tag or the commit hash with your reporting of results.
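
For readers who want something concrete to compare their own attempt against, here is a minimal sketch of two of the hints above: the permute-then-slice split and the per-class (stratified) split. It uses only the Python standard library. The function names `permutation_split` and `stratified_split`, the `label_of` argument, and the toy table are all made up for illustration; they are not from any library. In practice, scikit-learn's `train_test_split` (with its `stratify` argument) covers the same ground.

```python
import random
from collections import defaultdict


def permutation_split(rows, train_fraction, seed=0):
    """Shuffle the row indices, then take the first train_fraction for training."""
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    indices = list(range(len(rows)))   # 0 .. N-1
    rng.shuffle(indices)               # random permutation of the row indices
    n_train = int(round(train_fraction * len(rows)))
    train = [rows[i] for i in indices[:n_train]]
    test = [rows[i] for i in indices[n_train:]]
    return train, test


def stratified_split(rows, label_of, train_fraction, seed=0):
    """Apply permutation_split separately within each class, then pool the results."""
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)
    train, test = [], []
    # Assumes labels are sortable; sorting only makes the class order deterministic.
    for offset, label in enumerate(sorted(by_class)):
        tr, te = permutation_split(by_class[label], train_fraction, seed + offset)
        train.extend(tr)
        test.extend(te)
    return train, test


# Hypothetical toy table of (feature, label) rows: 20 rows of class 1, 80 of class 0.
table = [(i, 1 if i % 5 == 0 else 0) for i in range(100)]
train, test = stratified_split(table, label_of=lambda r: r[1], train_fraction=0.7)
```

With this toy table, both partitions keep the original 20% share of class 1. Changing `seed` gives a different split, which is one simple way to produce the k repeated trials mentioned above.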