Other than Terminal-Bench, which doesn't quite map to my experience, what are some other benchmarks for seeing how different models do in different harnesses?

Source: [Hacker News](https://news.ycombinator.com/item?id=48614029)
