Submitted by redditnit21 t3_xzkmlr in MachineLearning
in-your-own-words t1_irmn9d4 wrote
Randomly permute the rows of the table, and then take the first X% of them for training.
redditnit21 OP t1_irmniga wrote
Can you write the code for that?
in-your-own-words t1_irmnurg wrote
Yes, there are dozens of ways of doing it. I encourage you to figure it out yourself. If you can't design and implement your own test & evaluation (T&E) experiments for ML, you will end up doing the world more harm than good by dabbling in it. The entire ML field suffers from extremely weak T&E, and from lots of people who are just learning to stuff inputs into functions.
Some hints:
- There may be functions within standard machine learning software libraries that produce train/test splits given tabular data input.
- There may be functions that will produce a random permutation of the rows of a table.
- There may be functions that produce random permutations of the numbers from 0 to N-1, where you specify N. If N is the number of rows in your table, you could create a new column of these random numbers and then sort the table on that column.
- You may want to consider class imbalance in your dataset. If the classes are imbalanced, apply your train/test split independently to class 1 and class 0, so that the train and test partitions each contain the same proportion of 1s and 0s.
- Consider using an outer cross-validation approach, where you run your experiment for k different train/test splits. When you report your metrics, look at the distribution of each metric over the k trials. Report the median, interquartile range, 5th and 95th percentiles, and outliers for each metric.
- Version control your code and tag the commit that produces the results you report. Include this tag or the commit hash when you report results.
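For anyone who wants something to check their own attempt against, here is a minimal sketch of the permutation split, the per-class split, and the percentile summary from the hints above. It uses only the Python standard library and assumes the table is a list of rows with comparable class labels. The names `permutation_split`, `stratified_split`, `summarize`, and `label_of` are made up for illustration, and the "metric" in the loop is just a stand-in for a real model score. It is one way of doing it, not the way.

```python
import random
import statistics
from collections import defaultdict


def permutation_split(rows, train_fraction, seed):
    """Shuffle a copy of the rows, then take the first train_fraction for training."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def stratified_split(rows, label_of, train_fraction, seed):
    """Apply permutation_split to each class separately, then merge the parts."""
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)
    train, test = [], []
    # Assumption: labels are sortable (e.g. 0 and 1), which keeps the order reproducible.
    for offset, label in enumerate(sorted(by_class)):
        part_train, part_test = permutation_split(
            by_class[label], train_fraction, seed + offset
        )
        train.extend(part_train)
        test.extend(part_test)
    return train, test


def summarize(values):
    """Median, quartiles, and 5th/95th percentiles of one metric over k trials."""
    # n=20 yields 19 cut points at 5%, 10%, ..., 95%.
    q = statistics.quantiles(values, n=20, method="inclusive")
    return {
        "median": statistics.median(values),
        "p25": q[4],
        "p75": q[14],
        "p5": q[0],
        "p95": q[18],
    }


# Toy table of (row_id, label) pairs: 20 positives, 80 negatives.
rows = [(i, 1 if i < 20 else 0) for i in range(100)]
train, test = stratified_split(rows, label_of=lambda r: r[1], train_fraction=0.8, seed=0)
# train holds 80 rows (16 positive); test holds 20 rows (4 positive).

# Repeat an unstratified split for k seeds. The "metric" here is only the
# share of positives in the test partition, a stand-in for a real model score.
k = 30
rates = []
for seed in range(k):
    _, held_out = permutation_split(rows, 0.8, seed)
    rates.append(sum(label for _, label in held_out) / len(held_out))
summary = summarize(rates)
```

Passing the seed explicitly means every split can be reproduced from the tagged commit. If you use a library instead, scikit-learn's `train_test_split` has a `stratify` argument that covers the per-class case.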