xtof54

xtof54 t1_iyq2oxi wrote

good question, but it depends on whether this source of randomness occurs between the two models being compared at test time, or in other words, on what kind of generalization you want to support.

this contrasts with variability due to data sampling: there we all assume the data are iid, so a confidence interval is usually computed.

one way is to fix the seed, compare the models with that same seed, and report significance with respect to data sampling; then restart with a new seed, and finally report the proportion of seeds for which the difference is significant.
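
a rough sketch of that loop, in case it helps. everything here is an assumption for illustration: `toy_evaluate` is a hypothetical stand-in for "train both models with this seed and score them on the same test set", the paired sign-flip test is just one possible choice of significance test, and the 0.05 threshold is arbitrary.

```python
import random


def paired_sign_flip_pvalue(correct_a, correct_b, n_permutations=1000, rng=None):
    """Two-sided paired randomization test on per-example 0/1 correctness."""
    rng = rng or random.Random(0)
    diffs = [a - b for a, b in zip(correct_a, correct_b)]
    observed = abs(sum(diffs))
    hits = 0
    for _ in range(n_permutations):
        # Under the null, the sign of each paired difference is arbitrary.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped) >= observed:
            hits += 1
    # Add-one correction so the p-value is never exactly zero.
    return (1 + hits) / (1 + n_permutations)


def proportion_significant(evaluate, seeds, alpha=0.05):
    """Run the per-seed test and return (share of significant seeds, p-values)."""
    pvalues = []
    for seed in seeds:
        correct_a, correct_b = evaluate(seed)  # same seed for both models
        pvalues.append(
            paired_sign_flip_pvalue(correct_a, correct_b, rng=random.Random(seed))
        )
    return sum(p < alpha for p in pvalues) / len(pvalues), pvalues


def toy_evaluate(seed, n_examples=300):
    """Hypothetical: simulates two models scored on the same test set."""
    rng = random.Random(seed)
    correct_a = [int(rng.random() < 0.80) for _ in range(n_examples)]
    correct_b = [int(rng.random() < 0.70) for _ in range(n_examples)]
    return correct_a, correct_b


share, pvalues = proportion_significant(toy_evaluate, seeds=range(5))
print(f"difference significant for {share:.0%} of seeds")
```

in a real setup you would replace `toy_evaluate` with your own training and evaluation code, and the reported proportion is only as informative as the number of seeds you can afford.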

but we shouldn't pay too much attention to statistical significance. too many people use it as a 'flag of truth', while all experiments are biased anyway, so it's better to always stay suspicious and build confidence over time.
