I think of RL as a method that produces its training data from the model's own predictions. This directly leads the model to extend its output range, because the data becomes more diverse. However, RL fundamentally relies on bootstrapping and has a moving-target problem, which are the reasons for its poor stabili...
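
To make the bootstrapping and moving-target points concrete, here is a minimal sketch using tabular TD(0) on a small random-walk chain. The environment, the function name `td0_chain`, and all hyperparameters are assumptions chosen for illustration, not something from the comment itself. The update target is built from the model's own current estimate `V[s_next]`, so the target for the very same transition keeps shifting as learning proceeds.

```python
import random


def td0_chain(n_states=5, episodes=200, alpha=0.1, gamma=0.9, seed=0):
    """Tabular TD(0) on a hypothetical random-walk chain.

    The agent starts in the middle state and steps left or right at random.
    Leaving the chain on the right gives reward 1, on the left reward 0.
    Returns the value table and every target seen for the transition 1 -> 2.
    """
    rng = random.Random(seed)
    V = [0.0] * n_states
    targets_1_to_2 = []

    for _ in range(episodes):
        s = n_states // 2
        while True:
            s_next = s + rng.choice((-1, 1))
            done = s_next < 0 or s_next >= n_states
            reward = 1.0 if s_next >= n_states else 0.0

            # Bootstrapping: the target uses the current estimate V[s_next]
            # instead of a fixed ground-truth label.
            target = reward + (0.0 if done else gamma * V[s_next])

            # Same transition, same reward, but the target moves whenever
            # V[2] has been updated in the meantime.
            if s == 1 and s_next == 2:
                targets_1_to_2.append(target)

            V[s] += alpha * (target - V[s])
            if done:
                break
            s = s_next

    return V, targets_1_to_2


if __name__ == "__main__":
    values, targets = td0_chain()
    print("value estimates:", [round(v, 3) for v in values])
    print("distinct targets for transition 1 -> 2:", len(set(targets)))
```

In supervised learning the label for a given input stays fixed. Here the "label" for the transition 1 -> 2 takes many different values over training, which is one simple way to see the moving target the comment refers to.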

Source: [Hacker News](https://news.ycombinator.com/item?id=48624622)
