
We examine what factors predicted advancement in our engineering hiring round. We show two trends which seem common in EA hiring[1]: first, candidates with substantial experience (including at prestigious employers) were often unsuccessful, and second, candidates with limited experience and/or limited formal education were sometimes successful.

We sometimes hear of people being hesitant to apply to jobs out of a fear that they are hard to get. This post gives quantitative evidence that people can receive EA job offers even if their seemingly more qualified peers are rejected (and, indeed, traditional qualifications are almost uncorrelated with getting an offer).

In summary:

  • None of the factors we looked at were statistically significant.
  • Having previously worked at a Big Tech “FAANG” company was the only factor which had a consistently positive central estimate, although with confidence intervals that comfortably included both positive and negative effect sizes.
  • Years of experience, typographical errors, and the level of university qualification seemed to have little predictive power.

This builds on our previous post which found that participation in EA had limited ability to predict success in our hiring round.

Context

There were 85 applicants for the role. The success rates for candidates in each stage are shown below. Some candidates voluntarily withdrew between the screening interview and trial task, hence there are fewer people taking part in the trial task than passed the interview.        

| Stage | Number participating | Number passing | Success rate |
|---|---|---|---|
| Initial application sift | 85 | 48 | 57% |
| Screening interview | 48 | 45 | 94% |
| Trial task | 35 | 8 | 23% |

After the recruitment process was completed, we aggregated information about each applicant using the CVs and LinkedIn profiles they provided with their application. The metrics we were interested in were[2]:

  • Did any previous role include the word “senior” in its title?
  • Did any previous role include the word “manager” in its title?
  • How many years of experience did the candidate have?
  • How many typos were in the application?
  • What was the highest degree obtained by the candidate?
    - We coded this as: 1 for a bachelor-level degree, 2 for a master-level degree, and 3 for a doctoral degree.
  • Has the applicant ever worked at a FAANG company?
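
As an illustration only (the actual coding was done by hand from CVs and LinkedIn profiles), here is a minimal sketch of how these metrics could be represented per applicant; the function, field names, and input format are hypothetical:

```python
# Hypothetical sketch of coding one applicant's CV into the metrics above.
# Inputs are assumed to be pre-normalised strings; this is not the authors' actual process.
FAANG = {"facebook", "meta", "amazon", "apple", "netflix", "google"}
DEGREE_LEVELS = {"bachelor": 1, "master": 2, "phd": 3}

def code_applicant(role_titles, employers, degrees, years_experience, typo_count):
    """Return the predictor values used in the analysis for one applicant."""
    titles = " ".join(role_titles).lower()
    return {
        "has_senior_in_title": int("senior" in titles),
        "has_manager_in_title": int("manager" in titles),
        "years_of_experience": years_experience,
        "number_of_typos": typo_count,
        "highest_degree": max((DEGREE_LEVELS.get(d.lower(), 0) for d in degrees), default=0),
        "has_faang_company": int(any(e.lower() in FAANG for e in employers)),
    }
```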

This is not a rigorous analysis: a “proper” model would include as many explanatory factors as possible, and those factors should be independent. The limited predictive power of the eventual models reflects this.

Findings

We fitted logistic regression models to the data, with the dependent variable being whether a candidate passed a given stage and the independent variables being the factors listed above.

We then calculated modelled odds ratios and probabilities associated with each "predictor". The results of this are shown below, with a data table in the appendix.

  • For binary variables, such as “Has senior in title” or “Has FAANG company”, the odds ratio indicates how much the odds of success change for candidates with that attribute compared to candidates without it.
  • For continuous variables, such as “Years of experience” or “Highest degree”, the odds ratio indicates how much the odds of success change for each one-unit increase in the variable. (A sketch of how such a model can be fitted and converted to odds ratios follows this list.)
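
The post does not include the regression code, but a minimal sketch of the kind of model described above, assuming a pandas DataFrame with one row per applicant and a 0/1 outcome column per stage, might look like the following. The odds ratios come from exponentiating the coefficients and their confidence-interval bounds; all column names here are assumptions, not taken from the original analysis.

```python
# Sketch (not the authors' published code) of fitting one stage's logistic
# regression and converting coefficients to odds ratios with confidence intervals.
import numpy as np
import pandas as pd
import statsmodels.api as sm

PREDICTORS = ["years_of_experience", "has_senior_in_title", "has_manager_in_title",
              "number_of_typos", "has_faang_company", "highest_degree"]

def fit_stage_model(df: pd.DataFrame, outcome: str) -> pd.DataFrame:
    """`df` has one row per applicant; `outcome` is a 0/1 column such as
    'passed_initial_sift' (hypothetical name)."""
    X = sm.add_constant(df[PREDICTORS])
    result = sm.Logit(df[outcome], X).fit(disp=False)
    ci = result.conf_int()  # columns 0 and 1 are the lower/upper bounds
    return pd.DataFrame({
        "odds_ratio": np.exp(result.params),
        "ci_low": np.exp(ci[0]),
        "ci_high": np.exp(ci[1]),
        "p": result.pvalues,
    })
```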

Predictors for passing an initial sift

This model predicts whether all submitted applicants (N=85) would pass an initial sift and be invited to the screening interview, with sensitivity 56% and specificity 76%.

Predictors for passing screening interview

This model predicts whether all invited applicants (N=85) would pass the screening interview, with sensitivity 56% and specificity 70%.

Predictors for passing trial task

This model predicts whether applicants who did not withdraw prior to this point (N=83) would pass the trial task, with sensitivity 88% and specificity 58%.
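
Sensitivity and specificity here are the usual confusion-matrix quantities: the proportion of candidates who actually passed that the model labels as passing, and the proportion who actually failed that it labels as failing. A minimal sketch of that calculation, assuming a 0.5 classification threshold (the threshold is not stated in the post):

```python
# Sketch of computing sensitivity and specificity from model predictions;
# the 0.5 threshold is an assumption, not stated in the post.
def sensitivity_specificity(y_true, y_prob, threshold=0.5):
    y_pred = [int(p >= threshold) for p in y_prob]
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    sensitivity = tp / (tp + fn)   # true positives / actual passes
    specificity = tn / (tn + fp)   # true negatives / actual fails
    return sensitivity, specificity
```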

Commentary from Ben

Discourse about EA hiring is sometimes simplified to "EA jobs are hard to get" (and therefore you shouldn't bother applying unless you are very qualified) or "there is a big talent gap" (and therefore everyone should apply).

This post gives evidence that “hard versus easy” isn’t really the right axis: it's hard to get a job (in the sense that well-qualified applicants were rejected) but also easy (in the sense that applicants with limited qualifications were accepted). 

"When in doubt, just apply" continues to seem like good advice to me.

From the hiring manager’s perspective: This builds on our previous post which found that participation in EA had limited ability to predict success in our hiring rounds. Together, these posts make me pessimistic that simple automated screening criteria like “you need X years of experience” will be useful.

Appendix: Summary of modelled parameters

 

| Predictor | Passing initial sift (N=85): Odds Ratio (CI) | p | Passing screening interview (N=85): Odds Ratio (CI) | p | Passing trial task (N=75): Odds Ratio (CI) | p |
|---|---|---|---|---|---|---|
| (Intercept) | 1.06 (0.43 – 2.63) | 0.893 | 1.10 (0.45 – 2.74) | 0.829 | 0.36 (0.08 – 1.29) | 0.135 |
| Years of experience (mean=9.4) | 1.01 (0.95 – 1.08) | 0.781 | 1.01 (0.95 – 1.08) | 0.795 | 0.90 (0.74 – 1.03) | 0.181 |
| Has senior in title (n=17) | 2.23 (0.70 – 7.99) | 0.189 | 2.63 (0.82 – 9.53) | 0.116 | 0.81 (0.04 – 7.19) | 0.860 |
| Has manager in title (n=14) | 0.50 (0.14 – 1.67) | 0.263 | 0.44 (0.12 – 1.48) | 0.197 | 0.90 (0.04 – 7.09) | 0.925 |
| Number of typos (mean=0.9) | 0.93 (0.66 – 1.31) | 0.648 | 0.99 (0.70 – 1.43) | 0.968 | 0.90 (0.41 – 1.50) | 0.738 |
| Has FAANG company (n=7) | 1.92 (0.37 – 14.55) | 0.464 | 2.44 (0.46 – 18.88) | 0.325 | 2.04 (0.09 – 20.83) | 0.575 |
| Highest degree (mean=1.0) | 1.10 (0.52 – 2.36) | 0.807 | 0.85 (0.39 – 1.80) | 0.667 | 0.82 (0.19 – 2.85) | 0.765 |

  1. ^

     They seem common in the authors’ experience; we would appreciate feedback in the comments from other hiring managers about their own experience.

  2. ^

    We collected other factors but ultimately chose to exclude them from the analysis:
    - University rankings - we could not obtain these for enough candidates, which reduced the usable sample size and would have made the results less accurate.
    - Likely salaries in the candidate’s previous position - we used online sources to estimate the typical salary for the candidate’s most recent position and company, but again could not obtain this for enough candidates.
    - Whether candidates had worked in a company with more than 1000 employees - we excluded this in favour of looking at whether candidates had worked at a FAANG company; it was not possible to include both since the variables are not independent.

Comments (19)



Assets aren't showing up:

Images should be fixed now, thanks for pointing this out.

Yep, images are broken. My guess is the document was copy-pasted from a Google Doc, with the images hosted in a way that isn't publicly accessible.

Thanks for this post! I'll be interested in data from CEA hiring overall, even with the obvious caveat that hiring across different roles will require different skillsets and experiences.

Thanks! In case you haven't already seen it: this post is part of a sequence about EA hiring; other posts have information about hiring for different roles.

This was a great write-up: interesting topic, informative, and easy to follow.

One question I had is whether the words below were the only ones you were looking for in a CV, and why. For example, you did not list "Lead", which I'd think is frequently used for engineering roles.
I'm assuming either these were just examples (so not a complete list), or applicants only used these two terms?

Did any previous role include the word “senior” in its title? 
Did any previous role include the word “manager” in its title?

Thanks! Yeah, maybe we should also have looked for "lead", but we didn't. No strong reason against it; I just didn't think of it.

So uh you guys/girls have n=7 samples of people in this FAANG group, and you're using this to get coefficients for one of the regressions. Then for the next regression, for the FAANG people making it a cut further, you probably only have 3 observations for that regression?

 

So I think the norm here is to show "summary stats" style data, e.g. a table that says, for the FAANG applicants, how many of the 7 made it. I think this table would be better.

Basically, a regression model doesn't add a lot, with this level of data. 

 

Also, at this extremely low amount of data, I'm unsure, but there might be weird "degree of freedom" sort of things, where due to an interaction, the signs/magnitudes explode/implode.

 

Can you share your code for the regressions that made this table?

Basically, a regression model doesn't add a lot, with this level of data

Yes, I agree that this is the conclusion of the piece, but I feel like you are implying that this means the methodology was flawed?

We aren't trying to do some broad scientific analysis, we are just practically trying to identify ways that we can speed up our hiring process. And given that we do, in practice, have a relatively small number of people applying to each round, we are (apparently) not able to use automated methods to identify the most promising candidates with high accuracy.

(Maybe my stats/prob/econometrics is rusty, feel free to stomp this comment)

Yeah, you guys have a 94% pass rate for one dataset you use in one regression.

So you could only be getting any inference from literally the 3 people who failed the screening interview.

So, like, in a logical, "Shannon information sense", that is all the info you have to go with, to get magnitudes and statistical power, for that particular regression. Right?

So how are you getting a whole column of coefficients for it? 

 

No, "This model predicts whether all invited applicants (N=85) would pass the screening interview." So it's 45/85.

Yes, understood, thanks, I was just confused.

94% pass rate

Also, it does seem that, at least ex post, they might benefit from raising the bar a bit on this round. 

Yeah, the point of the screening interview is mostly for the candidate to ask questions. I endorse the belief that we should be measuring programmers through programming tests instead of interviews (i.e. the pass rate of the screening interview should be very high), but I go back and forth on whether the screening interview should come first or second.

Yes, raising the bar would make the interviews more useful. This is a good thought that makes a lot of sense to me. 

I think what you said makes sense and is logical. 

 

Since I'm far away and uninformed, I'm reluctant to say anything about the process, and there could be other explanations.

For example, maybe Ben or his team wanted to meet with many applicants because he/they viewed them highly and cared about their EA activities beyond CEA, and this interview had a lot of value, like a sort of general 1on1.

The "vision" for the hiring process might be different. For example, maybe Ben's view was to pass anyone who met resume screening. For the interview, maybe he just wanted to use it to make candidates feel there was appropriate interest from CEA, before asking them to invest in a vigorous trial exercise.

Ben seems to think hard about issues of recruiting and exclusivity, and has used these two posts to express and show a lot of investment in making things fair.

- Whether candidates had worked in a company with more than 1000 employees - we excluded this in favour of looking at whether candidates had worked at a FAANG company; it was not possible to include both since the variables are not independent.

I'm confused: why can't you include two predictors if they are not independent? I'm assuming that by "independent" you mean correlation 0; if you instead mean no collinearity, i.e. linearly independent vectors of predictors, then feel free to ignore my comment.

Am I reading correctly that you made an offer to 8 developers and had 85 applicants?

So a 9% offer rate? That seems very high, am I missing something?

There is an additional on-site after this, and some people withdrew. We ended up making three offers from this round.

To highlight (from this comment and reply) the hire rate for this position was 3.5%
