Submitted by redditnit21 t3_xzkmlr in MachineLearning
[removed]
Randomly permute the rows of the table, and then take the first X% of them for training.
Can you write the code for that?
Yes, there are dozens of ways of doing it. I encourage you to figure it out yourself. If you can't design and implement your own test & evaluation experiments for ML, you will end up doing the world more harm than good by dabbling in it. The entire ML field suffers from extremely weak T&E, and lots of people just learning to stuff inputs into functions.
Some hints:
There may be functions within standard machine learning software libraries that produce train/test splits given tabular data input.
There may be functions that will produce a random permutation of rows of a table.
There may be functions that produce random permutations of numbers from 0 to N, where you specify N. If N is the number of rows in your table, you could create a new column of these random numbers and then sort the table on that column.
You may want to consider class imbalance in your dataset. If this is the case, apply your train/test split independently to class 1 and class 0 such that your resulting split contains the same proportion of 1 and 0 in both train and test partitions.
Consider using an outer cross-validation approach, where you run your experiment for k different train/test splits. When you report your metrics, look at the distribution of each metric over the k trials: report the median, interquartile range, 5th and 95th percentiles, and outliers.
Version control your code and tag the commit that produces the results you report. Include this tag or the commit hash with your reporting of results.
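For anyone reading later, here is a rough sketch of the class-balanced, k-split idea from the hints above. It assumes scikit-learn's StratifiedKFold, which keeps the class proportions the same in every fold. The toy table, the column names and the evaluate function are made up for illustration; a real experiment would train and score a model there.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Hypothetical toy table: 100 rows with an imbalanced binary label (80 vs 20).
df = pd.DataFrame({
    "image": [f"img_{i:03d}.png" for i in range(100)],
    "label": [0] * 80 + [1] * 20,
})

def evaluate(train_df, test_df):
    """Placeholder for a real train-and-score step.

    Assumption: it just returns the fraction of class 1 in the test fold,
    which lets us check that the folds are stratified.
    """
    return float(test_df["label"].mean())

# k = 5 stratified splits; shuffle with a fixed seed so the run is repeatable.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, test_idx in skf.split(df, df["label"]):
    train_df, test_df = df.iloc[train_idx], df.iloc[test_idx]
    scores.append(evaluate(train_df, test_df))

# Summarise the distribution of the metric over the k trials.
scores = np.array(scores)
summary = {
    "median": float(np.median(scores)),
    "iqr": float(np.percentile(scores, 75) - np.percentile(scores, 25)),
    "p5": float(np.percentile(scores, 5)),
    "p95": float(np.percentile(scores, 95)),
}
print(summary)
```

With only five folds the percentiles are crude, so a larger k or repeated splits would give a more useful picture of the spread.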
You could shuffle the dataframe (df.sample(frac=1)) and then take the first 80% of the rows as train and the remaining 20% as test.
You could also use sklearn's train_test_split.
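A minimal sketch of both options, assuming a small made-up table with "image" and "label" columns (those names are not from the original post):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical table: one row per image, with a file name and a binary label.
df = pd.DataFrame({
    "image": [f"img_{i}.png" for i in range(10)],
    "label": [0, 1] * 5,
})

# Option 1: shuffle all rows, then cut at 80%.
shuffled = df.sample(frac=1, random_state=42).reset_index(drop=True)
n_train = int(0.8 * len(shuffled))
train_df = shuffled.iloc[:n_train]
test_df = shuffled.iloc[n_train:]

# Option 2: let scikit-learn shuffle and split in one call.
train_df2, test_df2 = train_test_split(df, test_size=0.2, random_state=42)

print(len(train_df), len(test_df), len(train_df2), len(test_df2))
```

Fixing random_state makes the split repeatable, which matters if you want to report results that someone else can reproduce.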
But the images are present in some other folder. Can you send me the code for that?
If all the images are in one folder, you can define a variable like path_to_dataset = "your_path". Split as usual, then prepend the dataset path to each image name.
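Something along these lines, assuming the CSV stores bare file names in a column called "image" (a made-up name) and "your_path" is still the placeholder for the real folder:

```python
from pathlib import Path

import pandas as pd

path_to_dataset = Path("your_path")  # placeholder, replace with the real folder

# Hypothetical training split that only stores file names.
train_df = pd.DataFrame({
    "image": ["cat_01.png", "dog_07.png"],
    "label": [0, 1],
})

# Build the full path for every row by joining folder and file name.
train_df["path"] = [str(path_to_dataset / name) for name in train_df["image"]]

print(train_df)
```

The images themselves never move; only the table is split, and the "path" column tells the loader where each file lives.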
Christ
What happened? What’s your problem? Am I not allowed to ask any question?
Well, you’re clearly way too lazy / unaware to look it up for yourself. You’re asking another commenter to actually write the code for you for this insanely trivial task. And then also, if this is a problem for you, actually doing anything remotely technical with respect to the actual machine learning will be way way way beyond you.
Sorry if I am way too lazy for you. Just some advice: please be kind to the world. You could have given me advice instead of being harsh with me. Be kind and humble!
No. Sometimes people need harsh words. Your attitude of getting people to do your work for you is pathetic, and you should feel ashamed about it.
I am not asking people to do my task. I was just asking them to tell me the command, and I have already told you I am sorry for that. You should be ashamed of behaving horribly instead of giving constructive criticism. Shame on you.
I am not remotely ashamed. I hope that my words encouraged you to apply yourself properly.
I saw one of your previous posts asking for datasets and some basic advice. Why were you asking such basic questions? You could have just searched the internet.
Because I have already searched for and categorised all the public datasets for such tasks and contacted the appropriate people about commercial licenses. I was asking, in order to find more people to talk to about licensing their private data.
Nice try though pal. Maybe just move on.
And I did search the internet for how to split the CSV file according to image paths, but I only found one method, which splits the images into different folders. I didn't find any solution based on pandas.
Nice try!
Your question is the ML equivalent of asking how to write a for loop
I must say, I find it really weird that someone who would ask people online to write trivially simple code for them would be this defensive. Can you not look at yourself and think, huh maybe something is wrong with my attitude?
u/in-your-own-words is explaining it nicely, but I feel you lack the basics, so here are my two cents:
Do not ask for code :) As I have said, pandas is mostly your friend.
My pedagogical method is more socratic and from an engineering perspective. I think discovering how to find the names of the mainstream tools, how to find the documentation, and how to learn to read, understand, and rely on it, is ultimately the most beneficial and empowering to the developer.
Thanks for such a good answer. I will keep in mind everything you said and start learning the basics. I don't know why the other guys are just straight up criticising me. What I did was split the images into two folders, train and test, and then sort them further into class folders (Class 1 and Class 2). Now I am thinking of using a train data generator for training.
You are welcome. My advice is not to do anything manually; do it with pandas. For example, you can use pandas' '.loc' indexer to filter the training data and write that data to the training folder, etc. If you get stuck at any point, search the internet or ask. Good luck, have a nice day :)
Just a last question, do you know any good resource to learn the basics of pandas?
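As a rough illustration of that '.loc' approach: the sketch below assumes the table already has a "split" column and copies each file into a split/class folder. All names here (the columns, the folder layout, the temporary directory with empty stand-in files) are made up so the example runs on its own.

```python
import shutil
import tempfile
from pathlib import Path

import pandas as pd

# Throwaway directory with empty stand-in "images" so the sketch runs anywhere.
root = Path(tempfile.mkdtemp())
src = root / "all_images"
src.mkdir()
names = ["a.png", "b.png", "c.png", "d.png"]
for name in names:
    (src / name).touch()

# Hypothetical table with a "split" column assigned earlier.
df = pd.DataFrame({
    "image": names,
    "label": [0, 1, 0, 1],
    "split": ["train", "train", "train", "test"],
})

for split in ["train", "test"]:
    # .loc with a boolean mask keeps only the rows of this split.
    subset = df.loc[df["split"] == split]
    for name, label in zip(subset["image"], subset["label"]):
        dest = root / split / f"class_{label}"
        dest.mkdir(parents=True, exist_ok=True)
        shutil.copy(src / name, dest / name)

print(sorted(str(p.relative_to(root)) for p in root.rglob("*.png")))
```

Copying (rather than moving) leaves the original folder untouched, so the split can be redone with a different seed if needed.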
Trust me it pays off well (even without the context of data science field, it gives you the ability to manage the tabular data effectively)
https://www.kaggle.com/learn/pandas
https://www.coursera.org/specializations/data-science-python
Thanks a lot man! I am really sorry for asking to write code.
PassionatePossum t1_irmr78w wrote
You seem to be quite new at this (no offense, but otherwise you wouldn't be asking for code for such a trivial task), so I would like to give you some advice on how to do this right. Others have already told you how to implement a random split, which is generally good advice. However, the underlying assumption is that the images themselves are not correlated with one another.
I've actually seen people take video frames (and of course each video frame doesn't look much different from the previous one), randomly sample these frames into training/test sets, and then brag about their incredibly good performance. Of course, any performance measurement you do on such a dataset is worthless, because near-duplicates of the test frames are in the training set.
So how you sample training/test data is something you should think about carefully (i.e. are the training/validation/test sets actually independent of one another?).
So, under the assumption that the images are independent of one another, a random split would be a good idea. If that isn't the case (and without more information, nobody here can tell you whether it is), you need some other way to split the data (e.g. by video).
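One way to split by video, sketched with scikit-learn's GroupShuffleSplit, which assigns whole groups to either train or test. The frame table and the "video_id" column are made up for illustration; any column that identifies the source of correlated images would play the same role.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical frame table: 4 videos with 5 frames each.
df = pd.DataFrame({
    "frame": [f"video{v}_frame{f}.png" for v in range(4) for f in range(5)],
    "video_id": [v for v in range(4) for _ in range(5)],
})

# test_size applies to groups here: 25% of 4 videos means 1 video held out.
gss = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(gss.split(df, groups=df["video_id"]))
train_df, test_df = df.iloc[train_idx], df.iloc[test_idx]

# No video should contribute frames to both partitions.
shared = set(train_df["video_id"]) & set(test_df["video_id"])
print(len(train_df), len(test_df), shared)
```

Because whole videos are held out, the test score reflects performance on footage the model has never seen, not on near-copies of training frames.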