Anthropic took less than a year to set up large-model training infrastructure from scratch, though with the benefit of prior experience. This indicates that such infrastructure isn’t currently extremely hard to replicate.
EleutherAI has succeeded at training some fairly large models (the biggest has about 20B params, compared to 540B in PaLM) while basically just being talented amateurs (and also not really having money). These models introduced a simple but novel tweak to the transformer architecture (parallel attention and MLP layers) that PaLM later adopted. This suggests that experience also isn’t totally crucial.
I think that the importance of ML experience for success is kind of low compared to other domains of software engineering.
My guess is that entrenched labs will have bigger advantages as time goes on and as ML gets more complicated.
You can get an estimate based on how many authors there are on the papers (it's often quite a lot, e.g. 20-40), though this will probably become less reliable in the future, as such organizations develop more of the necessary infrastructure through work that no longer qualifies as "getting you on the paper" but is nonetheless important and not publicly available.
One problem with this estimate is that you don’t end up learning how long the authors spent on the project, or how important their contributions were. My sense is that contributors to industry publications often spent relatively little time on the project compared to academic contributors.
Yeah, good point.
Interesting, thanks! Any thoughts on how we should think about the relative contributions and specialization level of these different authors? I.e., a world of maximally important intangibles might be one where each author was responsible for tweaking a separate, important piece of the training process.
My rough guess is that it's more like 2-5 subteams working on somewhat specialized things, with some teams being moderately more important and/or more specialized than others.
Does that framing make sense, and if so, yeah, what do you think?
I haven't looked into it much, but the PaLM paper has a list of contributions in Appendix A that would be a good starting point.