AmalgamDragon

AmalgamDragon t1_jbuu2e8 wrote

Microsoft/Windows isn't a startup. They don't need an MVP for their start menu. It's already been around for decades and been used by billions.

> Building out all features to 100% is actually the exact model that failed day in and day out before the MVP and priority based model. You’d know that if you delivered software for a living

I do. I also know you are dead wrong that there is a single best way to deliver software.

Enjoy all your well-deserved downvotes.

7

AmalgamDragon t1_ja5lz5b wrote

This really comes down to how 'reward' is defined. I think we likely disagree on that definition, with yours being a lot narrower than mine. For example, during the cooking process there is usually a point before the meal is done where it 'smells good', which is a reward. There's dopamine release as well, which could be triggered by completing some of the steps (I don't know whether that's the case or not), but simply observing that a step is complete is rewarding for lots of folks.

> Pure RL will quickly teach you not to touch the burner, but it really struggles with tasks that involve planning or delayed rewards.

Depends on which algorithms you're using, but PPO can handle this quite well.
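
As a toy sketch of what I mean (not from this thread, and DelayedRewardEnv is made up purely for illustration, assuming gymnasium and stable-baselines3 are installed): the only reward arrives at the end of the episode, so the agent has to assign credit across the whole trajectory, which PPO's advantage estimation handles reasonably well.

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces
from stable_baselines3 import PPO


class DelayedRewardEnv(gym.Env):
    """Toy env: reward is zero on every step and only paid out when the episode ends."""

    def __init__(self, horizon: int = 50):
        super().__init__()
        self.horizon = horizon
        self.action_space = spaces.Discrete(2)
        self.observation_space = spaces.Box(0.0, 1.0, shape=(1,), dtype=np.float32)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.t = 0
        self.hits = 0
        return np.array([0.0], dtype=np.float32), {}

    def step(self, action):
        self.t += 1
        self.hits += int(action == 1)  # action 1 is the "right" choice
        done = self.t >= self.horizon
        # Delayed reward: nothing until the final step of the episode.
        reward = self.hits / self.horizon if done else 0.0
        obs = np.array([self.t / self.horizon], dtype=np.float32)
        return obs, reward, done, False, {}


model = PPO("MlpPolicy", DelayedRewardEnv(), verbose=1)
model.learn(total_timesteps=100_000)
```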

1

AmalgamDragon t1_j56uj5c wrote

I recently started using RL in my personal work on automated futures trading. After reviewing the libraries available in the RL space, I did try the one you linked to. Some of the samples were broken. While I did tweak the code to get the samples working, I found it more straightforward to get up and running with PPO from stable-baselines3.
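
A minimal sketch of that kind of quick start, assuming gymnasium and stable-baselines3 are installed; CartPole-v1 is just a stand-in here, since an actual futures environment would be a custom gymnasium.Env with market-feature observations and trade actions.

```python
import gymnasium as gym
from stable_baselines3 import PPO

# Standard task as a placeholder for a custom trading environment.
env = gym.make("CartPole-v1")

model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=50_000)
model.save("ppo_cartpole")

# Roll out the trained policy for one episode.
obs, _ = env.reset()
done = False
while not done:
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
```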

2

AmalgamDragon OP t1_iu25s20 wrote

> I'm not sure how it would really learn something from the input if you don't define a more useful task. How would this model penalize a "collapse" situation where both models always predict 0 for example or any random value?

Yeah, it may not work well. I haven't been able to track down whether this is something that has already been tried and found wanting.

1