Hiring: hacks + pitfalls for candidate evaluation

Cait_Lion

6 min read · Sep 13, 2023

Comments 1

Richard Möhn

The outline structure makes this easy to skim. Thank you!

Hacks

Sharing customised-generic feedback

Rejected candidates often, very reasonably, want feedback. Sometimes you don’t have capacity to tailor feedback to each candidate, particularly at earlier stages of the process. If you have some brief criteria describing what stronger applicants or stronger trial task submissions should look like, and if those criteria are borne out in your decisions about who to progress, I suggest writing out a quick description of the abilities, traits or competencies the successful candidates tended to demonstrate. This might be as quick as “candidates who progressed to the next stage tended to demonstrate strong attention to detail in the trial task, write in a clear and direct style, and have professional experience in operations.” This shouldn’t take more than a few minutes to generate. My impression is that it’s a significant improvement for candidates over a fully generic response.

Consider borrowing assessment materials

Not sure how to test a trait for a given role? Other aligned organisations might have already created evaluation materials tailored to the competencies you want to evaluate. If so, that organisation might let you use their trial task for your recruitment round.
- Ideally, you can do this in a way that’s pretty win-win for both orgs (e.g. Org A borrows a trial task from Org B. Org A then asks their candidates to agree that, should they ever apply to Org B, Org A will send over the results of the assessment).
- I have done this in the past and it worked out well.

Beta test your trial tasks!

I’m a huge proponent of beta testing new evaluation materials. Testing your materials before sending them to candidates can save you a world of frustration down the road by helping you tweak unclear instructions, inappropriate time limits, and a whole host of other pitfalls.

Mistakes

Taken from our internal hiring resources, here are some mistakes we’ve made in the past with our evaluation materials:

Trial tasks or tests that are laborious to grade

Some types of work tests take a long time to grade effectively. Possible issues include a large amount to read, multiple sources or links to check for information, or a rubric that is complicated or difficult to apply. Every extra minute it takes you to grade one submission is multiplied by the number of submissions. The ideal work sample test is quick and clear to grade.
Possible solutions:
- Think backwards from grading when you create the task.
- Where appropriate, be willing to sacrifice some assessment accuracy for grading speed.
- Beta test!
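
To make the multiplication concrete, here is a minimal sketch of the arithmetic. The function name and the numbers (120 submissions, 12 versus 5 minutes each) are made up for illustration and are not figures from our rounds.

```python
def total_grading_hours(minutes_per_submission: float, submissions: int) -> float:
    """Total grading time in hours for one grader marking every submission."""
    if minutes_per_submission < 0 or submissions < 0:
        raise ValueError("inputs must be non-negative")
    return minutes_per_submission * submissions / 60


if __name__ == "__main__":
    # Hypothetical round: 120 submissions, slower rubric vs. streamlined rubric.
    slow = total_grading_hours(12, 120)
    fast = total_grading_hours(5, 120)
    print(f"slow: {slow:.1f} h, fast: {fast:.1f} h, saved: {slow - fast:.1f} h")
```

With these made-up numbers, trimming seven minutes per submission saves 14 hours of grading across the round.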

Tasks that require multiple interactions from the grader

Some versions of trial tasks we used in the past had a candidate submit something to which the grader had to respond before the candidate could complete the next step. This turned out to be inefficient and frustrating.
Solution: avoid this, particularly at early stages.

Too broad

Some work tests look for generalist ability but waste the opportunity to test a job-specific skill. The more you can make the task specific to the role, the more information you get. If fast, clear email drafting is critical, test that instead of generically testing communication skill.

Too hard / too easy

If you don’t feel like anyone is giving you a reasonable performance on your task, you may have made it too hard.
- A common driver of this failure mode is assuming context the candidate won’t have, or underrating the advantage conferred by context that your staff have but (most of?) your candidates lack.
Ceiling effects are perhaps a larger problem. If everyone is doing well, you won’t be able to sort applicants by performance.
Solution: beta test, ideally with people outside your org.
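
One rough way to spot either problem in beta-test results is to check how many scores bunch up near the top or bottom of the scale. This is only a sketch under assumptions: the function name, the 90% and 10% cut-offs, and the "half of testers" threshold are arbitrary choices for illustration, not an established standard.

```python
def difficulty_flags(scores, max_score, high_cut=0.9, low_cut=0.1, share=0.5):
    """Flag likely ceiling or floor effects in a list of beta-test scores.

    A ceiling is flagged when at least `share` of scores are at or above
    `high_cut * max_score`; a floor when at least `share` are at or below
    `low_cut * max_score`. All thresholds are illustrative assumptions.
    """
    if not scores:
        raise ValueError("need at least one score")
    n = len(scores)
    near_top = sum(s >= high_cut * max_score for s in scores) / n
    near_bottom = sum(s <= low_cut * max_score for s in scores) / n
    return {"ceiling": near_top >= share, "floor": near_bottom >= share}


if __name__ == "__main__":
    # Hypothetical beta-test scores out of 10.
    print(difficulty_flags([9, 10, 9, 10, 8, 10], max_score=10))
    print(difficulty_flags([3, 5, 7, 9, 2, 6], max_score=10))
```

If the ceiling flag comes up, the task probably can't separate your strongest candidates from each other; if the floor flag comes up, it may be assuming context your testers don't have.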

Not timed (with an enforcing tool)

Letting people self-time may be tempting, but this makes results harder to interpret. If someone has done well, you don’t want to have to spend time wondering if it’s because they spent more time than other applicants, or more than they said they spent. Some people will also forget to record their time spent, or forget (or “forget”) about their time limit and go over. If people spend hugely different amounts of time, you can find yourself comparing apples and oranges.
Possible solutions:
- Especially for early-stage tasks, I recommend using software like ClassMarker.
- Alternatively, you can use e.g. Google Docs and spot check results for time infractions.
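
If you go the spot-check route, the check itself is simple once you have a start and end time for each candidate. The sketch below assumes you have already collected those timestamps by hand (for example, from a document's version history); the function name, the data layout and the five-minute grace period are hypothetical choices, and nothing here calls a real document API.

```python
from datetime import datetime, timedelta


def over_time_limit(sessions, limit_minutes, grace_minutes=5):
    """Return the names of candidates whose session exceeded the limit.

    `sessions` maps a candidate name to a (start, end) pair of datetimes.
    The grace period is an assumed allowance for opening and submitting.
    """
    allowed = timedelta(minutes=limit_minutes + grace_minutes)
    return sorted(
        name for name, (start, end) in sessions.items() if end - start > allowed
    )


if __name__ == "__main__":
    # Hypothetical timestamps for a 60-minute task.
    t0 = datetime(2023, 9, 13, 10, 0)
    sessions = {
        "Candidate A": (t0, t0 + timedelta(minutes=58)),
        "Candidate B": (t0, t0 + timedelta(minutes=66)),
        "Candidate C": (t0, t0 + timedelta(minutes=95)),
    }
    print(over_time_limit(sessions, limit_minutes=60))
```

A flagged name is a prompt to look more closely, not proof of an infraction: someone may have opened the document early or left it open after finishing.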

Overburdened task

Trying to make a task measure too many things at once can create noisy data that is hard to interpret and grade. This trades off against making sure you're getting evidence for all key role criteria, but in my opinion it's often a worse mistake to create a task that tries to do too many things at once and therefore doesn’t accomplish any of them (as) successfully. If you have an assessment material that tests a single thing clearly and efficiently, you can give that assessment first and then test other key criteria later in the round.

Results are opaque to other staff

In the past, some work sample tests spat out results only the hiring manager knew how to interpret. If only one person can understand the results, other stakeholders have to defer rather than independently assessing the candidate’s performance. This can be particularly frustrating if people disagree about how strong a particular candidate is. Also, if the sole capable interpreter becomes capacity-constrained, your round is now bottlenecked.
- Sometimes skills may be so specialized that somewhat opaque results are a correct tradeoff.

Confusing task

Maybe everyone, or a subgroup of applicants, misunderstands the task in the same way. Or they answer differently from what you were looking for, or differently from other groups. This makes it hard to compare across answers.
It may be tempting to make “figure out what I want from you” a key part of what you’re testing for, but I recommend against this unless that’s a vital skill for success in the role. Otherwise, weakness on that "figuring out" trait causes ~complete failure, whatever other skills the candidate may have to offer.
Solution: again, beta test.

Evaluates candidates for the wrong role

Having a poorly scoped role vision can lead to this failure mode. If you’ve designed a role with too many constraints (hiring for an imaginary person!) or focused on a few aspects of the role to the exclusion of other important aspects, the work sample test may similarly target the wrong traits. I propose that the antidote to this failure mode is to spend significant time drawing up a role vision, and pressure test it. If people different than your imagined ideal could perform elegantly in this role, the work sample test (along with the rest of the process) should make it possible for less prototypical candidates to shine, too. More on that here.

Lack of clarity on what it measures / mistargeted tasks

Some tasks might test well for some key role qualities, but miss other important aspects of being able to do the job well (that could have been added into the task). If a role needs traits A, B, and C, but the work sample test only evaluates trait A, then people who would not perform well on the job will pass the work sample test. I think this is fine and often unavoidable for early-stage assessment, but you should be aware of the ways in which your assessment is incomplete. More on that in the second section here.
- After drafting your evaluation materials, you may also want to revisit your list of key competencies to see if there’s anything missing that you can easily add in.
One specific failure mode here is making success on the task totally reliant on a single trait (assuming that trait is not uncontroversially role-critical). For example, even if you’d love to find a candidate who speaks excellent Italian, if it’s possible to succeed in that role without that competency, don’t give the work sample test in Italian.

Including pet features

Beware of including any features that will bias you without adding information. If you include a reference to a movie you like and some candidates notice it and some miss the reference, can you be confident that that isn’t going to bias you towards the noticers, who are clearly awesome because they enjoy the same things you do? This is a special case of making sure you’re testing for (and evaluating on the basis of) the features you truly care about, and ideally those features alone.

Privileging similarity to self

In any type of evaluation, we as evaluators are likely to be biased towards people like ourselves. With work sample tests, there’s a temptation to make tests that look for a bunch of the virtues you most care about, which may be virtues you yourself possess.
Proposed solution: Read your work sample test and ask yourself, “Does this sound like a test for being me-like?” If so, be suspicious.