We gave our agent the exact metric definition. It still wrote the wrong SQL
1 point, 1 comment on Hacker News
Two years ago we released VQAScore: ask a VLM "does this image show {prompt}?" and use P(Yes) as the score. It became a go-to evaluation metric and reward model for image generation, replacing CLIPScore across the field (2M+ downloads on Hugging Face; used by groups at DeepMind, NVIDIA, ByteDan...
1 point, 0 comments on Hacker News