Like you and many other commenters here, I also find the large effect sizes quite puzzling. It definitely gives me "Hilgard's Lament" vibes -- "there's no way to contest ridiculous data because 'the data are ridiculous' is not an empirical argument". On the usefulness of Cohen's d/SD, I'm not sure; it has little to no meaning if there are genuine issues with the reliability and validity of the underlying data. Bruce linked to their recruitment guidelines, and they don't look very good.
Edit: Grammar and typos.
Could you clarify your comment about Cohen’s d? In my experience with experimental work, p-values are used to establish the ‘existence’ of an effect, but a low (<0.05) p-value does not inherently mean the effect size is meaningful. Cohen’s d is meant to gauge the magnitude and meaningfulness of an effect (usually against Cohen’s heuristics of 0.2, 0.5, and 0.8 for small, medium, and large effects). However, Cohen himself argued these cut-offs are literature- and context-dependent; sometimes tiny effects are meaningful. The best example I can think of is the Milkman et al. megastudy on text-based vaccine nudges.
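(To make the p-value vs. effect-size distinction concrete, here's a minimal Python sketch of the standard Cohen's d calculation -- difference in means over the pooled SD. The PHQ-9-scale numbers are made up purely for illustration, not taken from StrongMinds or any actual study.)

```python
import numpy as np

def cohens_d(group_a, group_b):
    """Cohen's d: difference in means divided by the pooled standard deviation."""
    a, b = np.asarray(group_a, dtype=float), np.asarray(group_b, dtype=float)
    n_a, n_b = len(a), len(b)
    # Pooled SD from the unbiased (ddof=1) variance estimates of each group
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Hypothetical PHQ-9 scores for treatment vs. control (means/SDs invented for scale only)
rng = np.random.default_rng(0)
treatment = rng.normal(8, 5, 200)
control = rng.normal(12, 5, 200)
print(cohens_d(treatment, control))  # ~ -0.8, i.e. "large" by Cohen's heuristic
```

With large samples even a d of 0.05 can clear p < 0.05, which is why the two numbers answer different questions.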
Thank you for linking to that appendix describing the recruitment process. Could the initial high scores be driven by demand effects, i.e. SM recruiters describing depression symptoms to participants and then administering the PHQ-9 questionnaire? Describing the symptoms before administering the test seems reminiscent of older social psychology findings that turned out to be partly driven by demand effects (e.g. power posing).
For option 3 to be compelling, we certainly need a whole lot more than what's been given. Many EA charities have substantial RCT and qualitative work buttressing them, while this doesn't. It seems fundamentally strange, then, that EA orgs are pitching SM as the next great thing without the strong evidence we expect from EA causes.