Working on evaluating some AI-generated outbound (SDR-style emails along with follow-ups), and I'm running into a weird problem. Everyone talks about better personalisation or higher reply rates, but when you actually try to benchmark quality it gets messy fast.

A few things we've looked at:

a) reply rate (obvious, but noisy with a delayed signal)
b) positive vs negative replies (hard to label cleanly at scale)
c) factual accuracy about the prospect/company
d) how much editing a human has to do before sending (a rough way to measure this is sketched below)
e) whethe
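For point (d), one cheap, automatable proxy is the normalized edit distance between the AI draft and the email the human actually sent. This is only a minimal sketch under assumptions not stated in the thread: the `edit_fraction` helper and the example pair are hypothetical, and character-level Levenshtein is just one choice (token-level or sentence-level diffs may track perceived editing effort better).

```python
# Sketch of metric (d): how much a human edits an AI draft before sending.
# Plain Levenshtein distance, normalized by the longer text's length, so
# 0.0 means "sent as-is" and values near 1.0 mean "effectively rewritten".

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def edit_fraction(draft: str, sent: str) -> float:
    """Fraction of the email a human changed before sending."""
    longest = max(len(draft), len(sent)) or 1
    return levenshtein(draft, sent) / longest

# Hypothetical usage: average edit fraction over (AI draft, human-sent) pairs.
pairs = [
    ("Hi Dana, saw your Series B announcement...",
     "Hi Dana, congrats on the Series B..."),
]
print(sum(edit_fraction(d, s) for d, s in pairs) / len(pairs))
```

Tracking this per template or per segment also gives a lagging-free signal you can compare across prompt versions without waiting on reply rates.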