Submitted by jsonathan t3_y5h8i4 in MachineLearning
Comments
Liorithiel t1_isk938r wrote
> I haven't really used Delaunay triangulation in this way, but from my basic understanding of the algorithm, doesn't it attempt to create an optimal triangulation, and wouldn't it therefore tend to output rather uniformly distributed internal points instead of learning the distribution of the input?
Delaunay triangulation itself? Not really, or at least not in a way that would do much harm. We use it for simulations of mobile networks, e.g. analyses at the boundary between urban areas (where the density of base stations is high) and rural areas (less dense). If each triangle contributes one additional point, then regardless of whether a triangle is large (rural) or small (urban), denser areas end up with more new points. It won't give you a smoothly changing density between more and less dense areas, but that would be an assumption you'd have to add on top of your data, not something inferred from the data themselves.
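A rough sketch of that effect (not the OP's code; just scipy and a synthetic "urban vs. rural" cloud for illustration):

```python
# One pass of centroid insertion on a synthetic cloud with a dense cluster
# ("urban") and a sparse background ("rural"), to check where new points land.
import numpy as np
from scipy.spatial import Delaunay

rng = np.random.default_rng(0)
urban = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(200, 2))  # dense cluster
rural = rng.uniform(low=-10.0, high=10.0, size=(50, 2))       # sparse background
points = np.vstack([urban, rural])

tri = Delaunay(points)
centroids = points[tri.simplices].mean(axis=1)  # one new point per triangle

near_urban = int((np.linalg.norm(centroids, axis=1) < 2.0).sum())
print(f"{len(centroids)} new points, {near_urban} of them near the dense cluster")
```

Since most of the triangles sit inside the cluster, most of the new points should land there too.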
Judging from the visualization, though, this algorithm seems to have a stopping condition dependent on the size of a triangle, which breaks this reasoning.
YaleLawJournal t1_iso0ffo wrote
Data visualization is such a powerful tool.
hellrail t1_isjl10j wrote
Well, first, the augmented points are totally correlated with the original points, so they add absolutely no new information. Second, this approach enlarges the input size, and typically one wants the opposite.
Therefore I'd say artificially densifying point clouds for training purposes is nonsense.
kakhaev t1_isjpe62 wrote
Your first point seems reasonable but isn't obvious to me; I would be convinced if a model trained with augmented point clouds performed better than one trained without them.
And it's not like we use all the points in our models anyway. For object detection from lidar, for example, you need a way to handle a variable number of points, because each iteration gives you a different number of points from the sensor. Of course you can do preprocessing, but I hope you get the point.
Usually augmentation lets you increase the sampling of your input/output space, which leads to a better mapping function for your model to learn.
I also take issue with the fact that the interpolation OP uses is linear, but nothing stops you from modifying the code yourself if necessary.
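For instance, a generic sketch (not based on the repo's actual code) of how one might jitter the generated points so they are no longer exact linear combinations of the originals:

```python
# Generic sketch: perturb synthetic points with small isotropic Gaussian noise
# so they are not exact linear combinations of the input points.
import numpy as np

def jitter(new_points: np.ndarray, scale: float = 0.01, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    return new_points + rng.normal(scale=scale, size=new_points.shape)
```

The `scale` would have to be tuned to the typical spacing of the cloud.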
VaporSprite t1_isk3ryn wrote
Correct me if I'm wrong, I'm far from an expert, but couldn't training a model with more data which doesn't inherently add information potentially lead to overfitting?
hellrail t1_iskgwjz wrote
No, why should it?
This densification can make it easier to reach a generalizing training state, but that state will probably perform worse than a well-generalized state reached without the augmentation, because the augmentation slightly changes the distribution being learned: it artificially imposes that a portion of the points are centroids of triangles formed from another portion of the points. That is not generally true of the sensor data that will come in, so the modified distribution has low relevance to the real distribution one wants to learn.
hellrail t1_iskhso6 wrote
> Usually augmentation lets you increase the sampling of your input/output space, which leads to a better mapping function for your model to learn.

More data gives better results in general, yes, but if the additional data is worthless, it's a bit of a scam. That will show up in a comparison against an equally well-trained state without the augmentation (which might be harder to reach), tested on relevant data.
Technically put: the learned distribution is altered toward a surrogate point cloud distribution that is quite similar to the relevant distribution of sensor data measured from the real world, but is no longer the same. That's the price for more training data with this approach, and I wouldn't pay it, because my primary goal is to capture the relevant distribution as closely as possible.
dingdongkiss t1_isl2pco wrote
Yeah, densifying seems pointless if the production inference data is going to be as sparse as the inputs here. Estimating the distribution of the points and sampling from it seems more useful.
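A minimal sketch of that alternative with scipy's Gaussian KDE (assuming the cloud is stored as an `(n, d)` array):

```python
# Sketch: fit a kernel density estimate to the cloud and draw new points
# from it, rather than interpolating between existing points.
import numpy as np
from scipy.stats import gaussian_kde

def resample_cloud(points: np.ndarray, n_new: int) -> np.ndarray:
    kde = gaussian_kde(points.T)   # gaussian_kde expects shape (d, n)
    return kde.resample(n_new).T   # back to (n_new, d)
```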
jsonathan OP t1_isjine3 wrote
I made this algorithm to "fill the gaps" in point clouds with synthetic data points to increase the density of the cloud. I figured this would be useful for reducing overfitting in machine learning models trained on point cloud data, or otherwise just enriching sparse point cloud datasets.
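The core idea, in simplified form: compute a Delaunay triangulation of the cloud and add the centroid of each triangle. A rough 2-D sketch of the idea (not the exact code in the repo; the `min_area` threshold is just illustrative):

```python
# Rough 2-D sketch of the idea (not the exact repo code): triangulate the
# cloud and add the centroid of each triangle, optionally skipping triangles
# that are already small (the min_area threshold is illustrative).
import numpy as np
from scipy.spatial import Delaunay

def densify(points: np.ndarray, min_area: float = 0.0) -> np.ndarray:
    """points: (n, 2) array; returns the cloud with one centroid added per kept triangle."""
    tri = Delaunay(points)
    verts = points[tri.simplices]                    # (n_triangles, 3, 2)
    # Triangle areas from the cross product of two edge vectors.
    a = verts[:, 1] - verts[:, 0]
    b = verts[:, 2] - verts[:, 0]
    areas = 0.5 * np.abs(a[:, 0] * b[:, 1] - a[:, 1] * b[:, 0])
    centroids = verts[areas > min_area].mean(axis=1)
    return np.vstack([points, centroids])
```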
Please let me know what y'all think! Here's the Github repository.
osedao t1_isk832i wrote
> Here's the Github repository.
Hi, thanks for your contribution and for sharing this. I have a question about whether this can be used with labeled data. Have you had a chance to look into that, or seen a similar method?
mansumi_ t1_iskhh3y wrote
You may want to consider using Alpha Shapes (https://en.m.wikipedia.org/wiki/Alpha_shape) instead of a pure Delaunay triangulation. Imagine you had a point cloud of a table: your proposed data augmentation would give you new points in the empty spaces of the domain, which would ruin the original point cloud. Excluding points outside the alpha shape would be a bit better. Even then, I'm not sure this augmentation scheme is "valid" for most shapes; I think it would probably harm training if anything.
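A crude 2-D sketch of that filtering idea using only scipy, keeping centroids only from triangles whose circumradius is below 1/alpha (the usual alpha-shape criterion; the alpha value would need tuning per cloud):

```python
# Crude sketch: keep only centroids of triangles whose circumradius is below
# 1/alpha, i.e. triangles that belong to the alpha complex, so no synthetic
# points are created across large empty gaps.
import numpy as np
from scipy.spatial import Delaunay

def filtered_centroids(points: np.ndarray, alpha: float) -> np.ndarray:
    tri = Delaunay(points)
    verts = points[tri.simplices]                           # (n_triangles, 3, 2)
    a = np.linalg.norm(verts[:, 0] - verts[:, 1], axis=1)   # edge lengths
    b = np.linalg.norm(verts[:, 1] - verts[:, 2], axis=1)
    c = np.linalg.norm(verts[:, 2] - verts[:, 0], axis=1)
    s = (a + b + c) / 2.0
    area = np.sqrt(np.maximum(s * (s - a) * (s - b) * (s - c), 1e-12))  # Heron's formula
    circumradius = a * b * c / (4.0 * area)
    return verts[circumradius < 1.0 / alpha].mean(axis=1)
```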
ThrowThisShitAway10 t1_iso59z3 wrote
What's the point? No pun intended.
MrWrodgy t1_ism3a5v wrote
Is this a kind of interpolation?
ZestyData t1_isk14sf wrote
I think one aspect that's really crucial and missing is the statistical/mathematical justification for using this. Before using a tool we'd need to be certain its behaviour is valid.
You mention that you use Delaunay triangulation (which should really be emphasized higher up, since it's the core of how this tool works). But can you point to references that justify Delaunay triangulation as an effective method for generating data that fits an existing statistical distribution?
I haven't really used Delaunay triangulation in this way, but from my basic understanding of the algorithm, doesn't it attempt to create an optimal triangulation, and wouldn't it therefore tend to output rather uniformly distributed internal points instead of learning the distribution of the input? And the more points you generate, the stronger that trend?
If that hypothesis holds, it wouldn't just be useless as an artificial data source; it would be harmful for the vast majority of use cases! I may well be wrong, but my main point is that you should definitely make note of the method's performance if you're advertising it as a solution.