How AI candidate scoring is validated — the back-test method against placement outcomes
Most AI sourcing platforms claim their scoring is "validated." The word does a lot of work in vendor marketing without doing very much in vendor practice. The honest meaning of validation is back-testing scoring against actual placement outcomes — did the candidates the model scored highly actually get hired and stay — over a window long enough to settle the placement cycle. This guide walks through how we do that at Headhunt.AI, what the precision and recall trade-offs look like, and what to ask any vendor who claims their scoring is validated.
We back-test AI candidate scoring against placement outcomes — not against engagement, not against recruiter intuition, not against synthetic benchmarks. The slice we publish for external validation spans 3,852 resumes sent, 385 second-interview progressions, 165 final-round candidates, and 74 closed placements over a 25-month window. This is a representative validation sample drawn from our production data — not the firm’s complete placement record, which we don’t disclose. Precision is measured as placements among scoring-tier-1 candidates; recall is measured as the share of placements that came from scoring-tier-1 candidates rather than from lower tiers. The honest numbers are useful but not perfect: a model that produces 30% of placements from outside tier-1 is a model that’s still missing signal, and the 30% figure is one we report rather than hide.
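To make the two definitions concrete, here is a minimal sketch in Python. The record structure and field names are ours for illustration only, not the production schema.

```python
# Minimal sketch of the two definitions; the record structure and field
# names are illustrative, not the production schema.
from dataclasses import dataclass

@dataclass
class Candidate:
    tier: int        # scoring tier assigned at resume-sent: 1, 2, or 3
    placed: bool     # did this candidate reach a closed placement?

def tier1_precision(candidates: list[Candidate]) -> float:
    """Placement rate among tier-1 candidates actually sent to clients."""
    tier1 = [c for c in candidates if c.tier == 1]
    return sum(c.placed for c in tier1) / len(tier1)

def tier1_recall(candidates: list[Candidate]) -> float:
    """Share of all closed placements that were tier-1 at resume-sent."""
    placed = [c for c in candidates if c.placed]
    return sum(c.tier == 1 for c in placed) / len(placed)
```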
Why engagement-based validation is mostly noise
The most common form of "validated" in AI sourcing marketing is engagement validation: the model is tuned and reported against scout-mail open rates, reply rates, click-throughs, and profile views. These metrics are easier to capture (every platform already logs them) and produce big-looking numbers, but a reply rate measures outreach, not match quality. They also correlate weakly with the only outcome that matters in recruiting: did the candidate get hired and stay.
The disconnect is visible at the data layer. A high-engagement candidate is often someone actively job-searching who replies to many scouts and progresses through few — engagement up, placement rate down. A low-engagement candidate who eventually replies after the third touch is often someone passively interested who lands in finals at a higher rate — engagement down, placement rate up. A model optimized for engagement systematically over-weights the first profile and under-weights the second, which is the opposite of what an agency or in-house team needs.
The right unit for scoring validation is the placement, not the engagement signal. That’s harder to measure because it requires multi-month data per candidate and access to actual placement outcomes from clients. Most vendors don’t have that data, which is the operational reason engagement validation is dominant.
The published 25-month validation sample
The slice we publish for external validation spans March 2024 through March 2026 — 25 months drawn from a representative subset of ExecutiveSearch.AI K.K.’s corporate client portfolio. The published funnel: 3,852 resumes sent to clients, 385 candidates progressing to second interview, 165 candidates reaching final round, 74 closed placements. Each candidate in the published sample has a scoring artifact from the time of resume submission — the score the model assigned at the moment the recruiter decided to advance the candidate to the client. This is the validation slice we share publicly. We don’t disclose total firm-wide placement volume, the full client list, or the un-redacted production dataset; the published numbers are the slice clean enough to share without compromising client confidentiality or competitive position.
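For orientation, the stage-to-stage conversion rates implied by those published figures work out as follows. This is plain arithmetic on the numbers above, nothing modeled.

```python
# Stage-to-stage conversion implied by the published 25-month funnel.
funnel = {
    "resumes_sent": 3852,
    "second_interview": 385,
    "final_round": 165,
    "closed_placements": 74,
}

stages = list(funnel.items())
for (prev_name, prev_n), (name, n) in zip(stages, stages[1:]):
    print(f"{prev_name} -> {name}: {n / prev_n:.1%}")
# resumes_sent -> second_interview: 10.0%
# second_interview -> final_round: 42.9%
# final_round -> closed_placements: 44.8%
```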
The published sample has limitations we name explicitly. Sample sizes thin out at the placement layer: 74 placements is enough for directional reads on precision but not for tight confidence intervals on rare scoring tiers. The role mix is concentrated in Japan mid-career bilingual hiring, which is the bulk of our work but excludes some adjacent segments where placement dynamics may differ. Placement outcome data depends on client-side reporting, which is partial: some clients report all placements, some only major ones, some retroactively. The numbers below cover only the clients that report, and reporting reliability is a known noise source. The internal back-test that drives model decisions runs against larger, unredacted datasets; what we report here is the conservative public slice.
Precision — the placement rate inside scoring tier 1
We bucket the model’s scoring output into three tiers: tier 1 (top match), tier 2 (acceptable match), tier 3 (long-tail match). Precision asks: among the candidates the model assigned to tier 1 that the recruiter actually advanced to client introduction, what fraction reached placement?
In the published sample, tier-1 candidates progress from resume-sent to placement at roughly twice the rate of tier-2 and roughly four times the rate of tier-3. The absolute placement rates are small (recruiting is a low-base-rate game where most candidates don’t close), but the relative differential is consistent across the 25 months and across the role types in the sample. The tier-1 differential is the precision signal: the model successfully separates the top of the funnel from the rest. The same differential holds in the unpublished internal back-test against larger volumes; the public-sample numbers are directionally representative of what the model produces in production.
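The per-tier counts behind that differential aren’t published, so the sketch below uses invented counts chosen only to be consistent with the published totals and the roughly 2x / 4x ratios described above.

```python
# Invented per-tier counts, chosen only to be consistent with the published
# totals (3,852 sent, 74 placed) and the ~2x / ~4x differential in the text.
sent = {1: 1800, 2: 1000, 3: 1052}    # resumes sent per scoring tier
placed = {1: 52, 2: 15, 3: 7}         # closed placements per scoring tier

rates = {tier: placed[tier] / sent[tier] for tier in sent}
print(rates[1] / rates[2])   # ~1.9x the tier-2 placement rate
print(rates[1] / rates[3])   # ~4.3x the tier-3 placement rate
```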
Recall — placements that came from outside tier 1
Recall is the inverse question: of the 74 closed placements in the published sample, what share came from tier-1 candidates and what share came from tier-2 or tier-3? A perfect model would put all eventual placements in tier 1. A useless model would spread them evenly across tiers. We sit at roughly 70%: about seven in ten of the published placements came from candidates scored tier-1 at the moment of resume-sent, with the remaining 30% coming from tier-2 or tier-3. The same recall ratio holds in the internal unpublished back-test.
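In concrete terms (the exact tier split isn’t published, so the 52 below is simply the count implied by ~70% of 74):

```python
# Recall: the share of closed placements that were tier-1 at resume-sent.
placements_total = 74     # published
tier1_placements = 52     # implied by ~70%, not separately published
print(f"{tier1_placements / placements_total:.0%}")   # 70%
```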
The 30% from outside tier 1 is signal we’re missing. The honest interpretation is that the model has a real advantage at the top of the funnel (the precision differential) but doesn’t capture all the variance in placement outcomes (the recall ceiling). We report this number rather than hide it because the interpretation matters: a recruiter using the model should still review tier-2 candidates carefully, especially in role types where tier-2 placement rates run higher in our historical data. The tier system is a meaningful prioritization, not a substitute for recruiter judgment.
What we update when the back-test surfaces a problem
The back-test runs on a six-month rolling cycle against the firm’s full internal placement data — not just the published slice. Each cycle, we examine the placements from the prior six months and audit which scoring tier they came from. If tier-1 precision drops below the trailing-25-month average for a given role type, we investigate the model’s signal weighting on that role type. If recall drops — too many placements coming from outside tier 1 — we look at the role type’s bilingual signal, the tenure-pattern weighting, and the company-tier sequence handling, which are the three signals most likely to be miscalibrated for newer role types.
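A sketch of what that audit step looks like for one role type; the function, thresholds, and data shapes are illustrative, not our production pipeline.

```python
# Illustrative audit step: for one role type, compare the last six months
# against the trailing window and flag what to investigate. Names, thresholds,
# and data shapes are not the production pipeline.
def audit_role_type(recent: dict, trailing: dict, tolerance: float = 0.0) -> list[str]:
    """recent / trailing: {"precision": float, "recall": float}."""
    flags = []
    if recent["precision"] < trailing["precision"] - tolerance:
        flags.append("precision drop: re-examine signal weighting for this role type")
    if recent["recall"] < trailing["recall"] - tolerance:
        flags.append("recall drop: check bilingual, tenure-pattern, "
                     "and company-tier-sequence signals")
    return flags

audit_role_type({"precision": 0.021, "recall": 0.58},
                {"precision": 0.029, "recall": 0.70})
# -> both flags fire for this (invented) role type
```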
Updates to the model don’t ship live without a back-test on historical data. New scoring weights have to maintain or improve the trailing precision and recall numbers across the full internal dataset before going into production. This is the discipline most engagement-validated systems skip because the historical placement data isn’t there.
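The release gate itself reduces to a comparison like the one below, again as an illustrative sketch under the same caveat.

```python
# Illustrative release gate: new scoring weights ship only if the historical
# back-test maintains or improves both trailing numbers.
def may_ship(candidate: dict, baseline: dict) -> bool:
    return (candidate["precision"] >= baseline["precision"]
            and candidate["recall"] >= baseline["recall"])

may_ship({"precision": 0.031, "recall": 0.72},
         {"precision": 0.029, "recall": 0.70})   # True: eligible for production
```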
What this means for procurement
If you’re evaluating an AI sourcing vendor, ask three questions about their validation. First, what window do they back-test against: months or years? Second, what outcome do they validate against: engagement, recruiter rating, or actual placement data? Third, can they produce the precision and recall numbers (in the terms used here) for at least one role type close to yours? A vendor that answers "12-plus months," "actual placements," and "yes, here are the numbers for a role type like yours" has done the work. A vendor that answers "engagement," or "we don’t share that," or who pivots to "our customers tell us it works" has not.
We can’t promise our scoring will produce the same precision and recall on your role types as it does on ours — different role mixes have different signal availability and different outcome patterns. But we can promise that the methodology is honest, the numbers are reported including the parts that are uncomfortable, and the back-test cycle continues whether or not anyone is checking.
Frequently asked
Why do you report a 30% recall miss instead of hiding it?
Because hiding it would change the recruiter behavior in a way that costs placements. If recruiters believed tier-1 captured 100% of placements, they’d skip tier-2 review. The 30% from tier-2/3 says don’t do that. Reporting the number is what produces correct downstream behavior; hiding it would optimize the marketing surface at the cost of the actual placement rate.
Is 74 placements a large enough sample?
The 74 figure is the published validation sample, not the firm’s complete placement record. It’s enough for directional reads on precision and recall at the role-type aggregation level. The internal back-test that drives model recalibration runs against larger, un-redacted volumes; we share the published slice for external scrutiny while keeping the full production data confidential. We’re explicit about this — see our methodology page for the per-role-type published-sample sizes. We’re conservative about claims for under-sampled role types and report the placement count alongside the precision number rather than reporting precision alone.
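To show why the intervals are wide, here is a Wilson score interval for an illustrative low-volume tier. The 7-of-1,052 counts are invented, not taken from the published sample.

```python
# Why small counts give wide intervals: a Wilson score interval for an
# illustrative low-volume tier (7 placements out of 1,052 resumes sent;
# invented counts, not the published sample).
from math import sqrt

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    p = successes / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half, centre + half

print(wilson_interval(7, 1052))
# -> roughly (0.003, 0.014): a ~0.7% point estimate that could plausibly
#    be half or double its measured value
```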
How does this back-test apply to in-house TA teams using Headhunt.AI?
The methodology is the same; the data source changes. For in-house TA, the back-test runs against your own placement outcomes over time. We don’t see your placements unless you connect outcome data through the platform. Customers running Headhunt.AI for 12+ months and reporting placement outcomes get a per-role-type precision/recall report against their own data. The methodology is portable; the per-customer dataset isn’t, until enough months have passed.
What if my role types are different from yours?
We can’t promise the same precision and recall on role types our internal back-test doesn’t cover. The model’s signal weighting is calibrated against the available data; novel role types get weighted using closest-neighbor heuristics until enough placement data accumulates to recalibrate. The honest answer is the model gets sharper on a new role type after roughly 50 sourced candidates and 5+ placements within that role type — a few months of usage at typical recruiting volume.
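As a sketch, the sufficiency check reduces to something like the helper below; the thresholds are the ones named above, the helper itself is hypothetical.

```python
# Hypothetical sufficiency check using the thresholds named above:
# a role type gets its own recalibration once enough outcome data exists.
def ready_to_recalibrate(sourced: int, placements: int,
                         min_sourced: int = 50, min_placements: int = 5) -> bool:
    return sourced >= min_sourced and placements >= min_placements

ready_to_recalibrate(38, 3)   # False: keep closest-neighbor weighting
ready_to_recalibrate(64, 6)   # True: enough data to recalibrate this role type
```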
How often does the back-test cycle find a real problem?
Roughly half. Of the four cycles run over the published window, two produced a recalibration signal worth shipping a model update; the other two confirmed the existing weighting was performing within expected variance. The cycles that produce updates often correlate with the role mix shifting: a wave of new role types from a new client cohort, or a structural change in the market like the 2024 surge of life-sciences hiring in Japan, will show up first in the back-test as a recall drop on the new role types.
Sources
All numbers in this article come from a published 25-month validation sample drawn from ExecutiveSearch.AI K.K. internal operations (March 2024 – March 2026, 3,852 resumes, 385 second-interview progressions, 165 final-round candidates, 74 closed placements). This is a representative slice we share for external scrutiny; the firm’s complete placement record is not disclosed. The Decision Gap analysis (Mann-Kendall non-parametric trend test, p = 0.015) provides additional context on placement-funnel dynamics over the same period; see the Decision Gap briefing. Per-role-type published-sample sizes, statistical methods, and anonymization policy are documented on our methodology page. The five-dimension scoring framework is detailed in the AI candidate scoring cornerstone.
See the back-test in your role types
Run Headhunt.AI on your reqs for three weeks. Ten free credits. The scoring tiers are visible in the platform; the precision/recall against your placements becomes visible after enough months of outcome data are connected.