Remote_Event_4290 t1_j3jrvah wrote

Hi! I am a student who has become very interested in the ways that bias can be removed from ML datasets. I have some ideas for how bias could hypothetically be reduced, but I am by no means an expert, so I would greatly appreciate any feedback, recommendations, or additions to the ideas I currently have.

Right now, there seems to be no single way to completely remove bias from ML datasets, so I have been trying to sketch a hypothetical design, or process, that reduces bias as much as possible.

First off, the quality of the raw data is the most important part of a machine learning dataset, but collecting good data is largely a statistical problem. Depending on what the model is trying to do, you would need to consult statisticians to judge the quality and validity of the data, and to decide whether you should draw a random sample or use all of the raw data. A basic check along these lines is sketched below.
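
To make that statistical step a bit more concrete, here is a minimal sketch of one basic representativeness check: a chi-square goodness-of-fit test comparing group counts in a sample against population proportions you are assumed to know. The group names and proportions below are hypothetical placeholders, and scipy is just one of several libraries that could do this.

```python
# Minimal sketch of a representativeness check, assuming you know
# (roughly) the true population proportions for some grouping variable.
# The group names and proportions below are hypothetical placeholders.
from collections import Counter

from scipy.stats import chisquare

sample_groups = ["a"] * 480 + ["b"] * 320 + ["c"] * 200  # stand-in sample
population_share = {"a": 0.50, "b": 0.30, "c": 0.20}     # assumed known

counts = Counter(sample_groups)
groups = sorted(population_share)
observed = [counts[g] for g in groups]
n = sum(observed)
expected = [population_share[g] * n for g in groups]

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"chi-square = {stat:.2f}, p = {p_value:.3f}")
# A small p-value suggests the sample's group mix differs from the
# population -- a signal to revisit how the data was collected.
```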

As for the learning model, I have formulated a few suggestions for the dataset itself:

  • One of the first ideas that came to mind is excluding 'sensitive' demographic data like age, sex, or race. This may work in certain cases but could also backfire: for example, one way of reducing bias is to use those very demographics to pre-filter the data and ensure groups are accurately represented, which becomes impossible once they are dropped.
  • Another thing you can do is create two datasets, one with the demographics and one without, run each through the machine learning model, then compare the results, audit for bias, and see if there is anything you can improve (see the sketch after this list).
  • In some cases it is also possible to include only the variables relevant to the task, but this could ultimately be harmful, as you discard more and more information.
  • You could also pick a subset of the data so that, for example, minority populations are properly represented, or alternatively create one dataset per group, run each through the model with known outcomes, and evaluate and/or train it against itself.
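
To make the second bullet concrete, here is a minimal sketch of the with/without comparison using scikit-learn. Everything in it is an illustrative assumption: the data is synthetic, the demographic is a made-up binary group, and logistic regression is just a stand-in model, not a prescribed method.

```python
# Minimal sketch: train the same model with and without a sensitive
# attribute, then compare accuracy per group. All data here is synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
group = rng.integers(0, 2, size=n)   # hypothetical binary demographic
x = rng.normal(size=(n, 3))          # non-sensitive features
# Hypothetical labels that partly depend on the group, so bias is visible.
y = (x[:, 0] + 0.8 * group + rng.normal(scale=0.5, size=n) > 0.5).astype(int)

X_with = np.column_stack([x, group])  # features including the demographic
X_without = x                         # features excluding the demographic

for name, X in [("with demographic", X_with),
                ("without demographic", X_without)]:
    X_tr, X_te, y_tr, y_te, g_tr, g_te = train_test_split(
        X, y, group, test_size=0.3, random_state=0)
    model = LogisticRegression().fit(X_tr, y_tr)
    pred = model.predict(X_te)
    for g in (0, 1):
        mask = g_te == g
        acc = (pred[mask] == y_te[mask]).mean()
        print(f"{name}: group {g} accuracy = {acc:.3f}")
```

If the two runs show very different per-group gaps, that is exactly the kind of finding the audit step is meant to surface.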

I also believe it is necessary to get input and opinions on the dataset from multiple professionals of different backgrounds, to counter any bias from the creator. Most importantly, there must always be frequent check-ups to monitor whether any bias has arisen and, if so, to find ways to remove it. One simple quantity such a check-up could track is sketched below.
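
For those frequent check-ups, one simple quantity to track over time is the demographic parity difference: the gap in positive-prediction rates between groups. A hand-rolled sketch follows; the example predictions and the 0.1 alert threshold are arbitrary placeholders, not standards.

```python
# Minimal sketch of one recurring bias check-up: demographic parity
# difference, i.e. the gap in positive-prediction rates between groups.
import numpy as np

def demographic_parity_difference(predictions: np.ndarray,
                                  groups: np.ndarray) -> float:
    """Largest gap in the rate of positive predictions across groups."""
    rates = [predictions[groups == g].mean() for g in np.unique(groups)]
    return max(rates) - min(rates)

# Hypothetical batch of model outputs from one monitoring run.
preds = np.array([1, 0, 1, 1, 0, 1, 0, 0, 0, 0])
grps = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

gap = demographic_parity_difference(preds, grps)
print(f"demographic parity difference = {gap:.2f}")
if gap > 0.1:  # arbitrary alert threshold for this sketch
    print("gap exceeds threshold -- investigate for bias")
```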

Does anyone have any feedback or suggestions for me?