Working on evaluating some AI-generated outbound (SDR-style emails along with follow-ups), and I'm running into a weird problem. Everyone talks about better personalisation or higher reply rates, but when you actually try to benchmark quality it gets messy fast.

A few things we've looked at:

a) reply rate (obvious, but noisy with a delayed signal)
b) positive vs negative replies (hard to label cleanly at scale)
c) factual accuracy about the prospect/company
d) how much editing a human has to do before sending (a rough way to measure this is sketched below)
e) whethe
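For point (d), one cheap, automatable proxy is the normalized edit distance between the AI draft and the email the human actually sent. This is only a minimal sketch under assumptions not stated in the thread: the `edit_fraction` helper and the example pair are hypothetical, and character-level Levenshtein is just one choice (token-level or sentence-level diffs may track perceived editing effort better).

```python
# Sketch of metric (d): how much a human edits an AI draft before sending.
# Plain Levenshtein distance, normalized by the longer text's length, so
# 0.0 means "sent as-is" and values near 1.0 mean "effectively rewritten".

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def edit_fraction(draft: str, sent: str) -> float:
    """Fraction of the email a human changed before sending."""
    longest = max(len(draft), len(sent)) or 1
    return levenshtein(draft, sent) / longest

# Hypothetical usage: average edit fraction over (AI draft, human-sent) pairs.
pairs = [
    ("Hi Dana, saw your Series B announcement...",
     "Hi Dana, congrats on the Series B..."),
]
print(sum(edit_fraction(d, s) for d, s in pairs) / len(pairs))
```

Tracking this per template or per segment also gives a lagging-free signal you can compare across prompt versions without waiting on reply rates.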