
Introduction

Robust evaluation of large language models should not depend exclusively on centralized actors, proprietary data, or English-only benchmarks. Yet today, most widely used evaluation frameworks are shaped by a handful of well-resourced labs. This creates blind spots — linguistic, methodological, and geographic — that can distort our understanding of model capabilities and risks.

AI4Math is a pilot initiative that explores a different model: a small-scale but fully structured benchmark, built collaboratively by Latin American STEM students as part of a mentorship program focused on AI governance and technical safety. It consists of 105 original university-level math problems, written natively in Spanish, with step-by-step solutions and peer review. We used this dataset to evaluate six models — both proprietary and open-source — in Spanish and English, under zero-shot and chain-of-thought settings.

The goal of this post is not to make strong claims about model performance. Instead, we aim to contribute to a broader discussion on how decentralized, reproducible, and context-aware evaluations can serve as public infrastructure for AI oversight — especially in regions and languages currently underrepresented in frontier evaluations.

We see this as a testable and scalable methodology. A larger version of AI4Math is currently under development, with more problems and additional domains. In the meantime, we are seeking feedback from the broader community on:

  • The soundness of our evaluation design and methodology
  • The potential to adapt or replicate this approach in other technical or policy-relevant domains
  • How decentralized benchmarks like this could complement institutional or regulatory evaluation frameworks

The paper is available on arXiv, and the dataset can be requested directly.
Critical feedback and collaboration ideas are welcome.

The problem with centralized evaluation

Most evaluation benchmarks today are created, controlled, and interpreted by a small number of well-resourced actors: major tech labs, elite universities, and private platforms. While this has enabled rapid iteration and impressive technical progress, it has also introduced structural limitations that make it difficult to assess models fairly or comprehensively.

Narrow scope and linguistic bias: Evaluation datasets are overwhelmingly English-centric. This reinforces a feedback loop in which capabilities in English are overrepresented and assumed to generalize universally, while performance in other languages remains underexplored (Weidinger et al., 2025).

Incentive misalignment: When model developers also control benchmark creation, there is a risk of optimizing toward favorable metrics, consciously or not. As Maini (2024) notes, internal evaluations often highlight selective strengths, limiting the credibility of comparative claims.

Opacity and lack of reproducibility: Many influential benchmarks and leaderboards are not open source. This creates dependencies on proprietary data and prevents independent verification of results.

Barriers to entry: Building, maintaining, and running evaluations often requires significant compute and engineering infrastructure. This makes it harder for smaller organizations, public-sector actors, or regional research groups to meaningfully contribute.

These issues have been widely acknowledged in the evaluation science literature (Weidinger et al., 2025; Solaiman et al., 2024). The field urgently needs more transparent, distributed approaches that can reflect the diverse contexts in which language models are deployed.

AI4Math as a pilot

AI4Math was developed through a mentorship program focused on AI governance and technical safety. A group of Latin American students — ranging from undergraduate to PhD level — collaborated over multiple sessions to design math problems across algebra, calculus, geometry, logic, combinatorics, number theory, and probability. All problems were written originally in Spanish, reviewed by peers, and accompanied by step-by-step human solutions.

Each problem was designed to be original (i.e., not copy-pasted from public sources) and solvable with a unique, clearly defined answer. As part of the internal quality control process, participants tested their problems against GPT-4 to calibrate difficulty, iterating when necessary to remove trivial or ill-posed items.
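To make the structure of an item concrete, here is a minimal sketch of how a single problem could be represented in code. The field names and types are illustrative assumptions for this post, not the dataset's actual schema.

```python
# Illustrative sketch only: the field names below are assumptions, not the
# actual AI4Math schema. Each item pairs a Spanish problem statement with a
# step-by-step human solution and a single, clearly defined final answer.
from dataclasses import dataclass, field

@dataclass
class MathProblem:
    problem_id: str                # e.g. "alg-007" (hypothetical identifier)
    domain: str                    # algebra, calculus, geometry, logic, ...
    statement_es: str              # problem text, written natively in Spanish
    solution_steps_es: list[str]   # human-written step-by-step solution
    final_answer: str              # unique, unambiguous answer for checking
    reviewed_by: list[str] = field(default_factory=list)  # peer reviewers
```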

We then evaluated six models on the full dataset under four settings: Spanish and English, with and without chain-of-thought prompting. The results revealed meaningful variation across models and domains — especially in geometry and probability — but are not meant to support general claims about ranking or model superiority.
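For illustration, a minimal version of that four-setting loop (Spanish/English, zero-shot/chain-of-thought) might look like the sketch below. The prompt templates and the `query_model` and `answers_match` helpers are placeholders standing in for a model client and an answer checker; this is not our actual evaluation code.

```python
# Minimal sketch of the four evaluation settings (language x prompting style).
# `query_model` and `answers_match` are hypothetical stand-ins, not any
# specific API; the prompt wording is illustrative.
from itertools import product

PROMPTS = {
    ("es", "zero_shot"): "Resuelve el problema y responde solo con la respuesta final.\n\n{problem}",
    ("es", "cot"): "Resuelve el problema razonando paso a paso y da la respuesta final.\n\n{problem}",
    ("en", "zero_shot"): "Solve the problem and reply with the final answer only.\n\n{problem}",
    ("en", "cot"): "Solve the problem step by step, then state the final answer.\n\n{problem}",
}

def evaluate(models, problems, query_model, answers_match):
    """Return accuracy for every (model, language, prompting style) combination."""
    results = {}
    for model, (lang, style) in product(models, PROMPTS):
        template = PROMPTS[(lang, style)]
        correct = 0
        for item in problems:
            statement = item[f"statement_{lang}"]  # assumes both language versions exist
            reply = query_model(model, template.format(problem=statement))
            correct += bool(answers_match(reply, item["final_answer"]))
        results[(model, lang, style)] = correct / len(problems)
    return results
```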

What we consider most valuable is the process: an end-to-end evaluation pipeline built by a technically trained team, outside of any major AI lab, using public tools and modest resources. This is the kind of decentralized experimentation we believe should be encouraged and scaled.

Why this approach matters

AI4Math is not a substitute for large-scale industrial benchmarks. It is a case study in what becomes possible when we lower the barriers to participation and shift the center of gravity in evaluation work.

Methodological clarity with minimal resources: The benchmark was built with limited funding, no proprietary infrastructure, and an emphasis on transparency. Peer review and internal testing ensured baseline rigor.

Cultural and linguistic relevance: A Spanish-native benchmark highlights reasoning structures, educational patterns, and problem formulations that are often excluded from English-dominant evaluations.

Expanding the contributor base: Involving university students and early-career researchers demonstrates that domain-specific expertise does not have to come only from top-tier labs or PhDs with long publication records. Many benchmarkable tasks can be designed and validated by technically proficient contributors with appropriate scaffolding.

Multilingual benchmarking as a diagnostic tool: Comparing model performance across languages (with carefully reviewed translations) allowed us to detect performance inconsistencies that would have gone unnoticed in monolingual settings.
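As an illustration of what that diagnostic can look like, the sketch below flags problems a model answers correctly in one language but not the other, so they can be reviewed by hand. The `per_item_correct` structure is an assumed format, not our actual results file.

```python
# Illustrative diagnostic: `per_item_correct` is assumed to map
# (model, language) -> {problem_id: bool}; this is not the project's
# actual output format.
def language_inconsistencies(per_item_correct, model):
    """Problems the given model solves in exactly one of the two languages."""
    es = per_item_correct[(model, "es")]
    en = per_item_correct[(model, "en")]
    return sorted(pid for pid in es if pid in en and es[pid] != en[pid])
```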

A note on scalability

This approach can be replicated. The core methodology — collaborative problem design, peer review, lightweight infrastructure, and focused domain evaluation — is not specific to math or Spanish.

Similar efforts could be applied to AI4Science, AI4Policy, or regional benchmarks in underrepresented languages. There is growing precedent for this, including the Uhura benchmark for African languages and the Te Reo Māori benchmark for indigenous language understanding.

Importantly, community-led evaluation also expands what counts as "expertise." While much attention goes to frontier research by senior scientists, there is significant untapped capacity in university classrooms, regional research centers, and technical training programs. We believe these communities should be seen not only as users or audiences of evaluation data, but as contributors to its creation.

Toward a more trustworthy evaluation ecosystem

Decentralized approaches like AI4Math are not only viable; they may be essential. As language models are deployed in critical domains such as healthcare, education, and legal systems, the integrity and coverage of their evaluation pipelines become a governance issue.

Independent benchmarks can serve as public infrastructure: reproducible, auditable, and accountable to a broader set of stakeholders. They reduce dependency on vendor claims and allow governments, civil society, and independent labs to assess capabilities and risks under transparent conditions.

Decentralization also broadens the definition of model quality. Not all evaluations need to optimize for performance on coding tasks or English academic benchmarks. Some may prioritize factual accuracy in low-resource languages, robustness under ambiguity, or alignment with pedagogical norms.

Ultimately, a trustworthy evaluation ecosystem should reflect the diversity of contexts in which AI systems operate. That includes diversity in geography, language, domain, and institutional setting.

Conclusion and next steps

AI4Math is an early-stage experiment. But it offers a concrete example of how technically grounded, community-led teams can design useful benchmarks with minimal resources. We see this as one building block toward a more distributed and plural evaluation landscape.

We are currently working to expand the benchmark, both in size and scope. In parallel, we’re looking for collaborators, funders, and critical readers who can help refine the methodology and assess its relevance for broader governance and safety agendas.

If you work on model evaluation, AI governance, education, or multilingual NLP, we would value your input. Feel free to share suggestions publicly or reach out directly.

Thanks for reading.

— The AI4Math / Carreras con Impacto team

References

Weidinger, L., Raji, I. D., Wallach, H., Mitchell, M., Wang, A., Salaudeen, O., Bommasani, R., Ganguli, D., Koyejo, S., & Isaac, W. (2025). Toward an Evaluation Science for Generative AI Systems. arXiv preprint arXiv:2503.05336. https://arxiv.org/abs/2503.05336

Solaiman, I., Talat, Z., Agnew, W., Ahmad, L., Baker, D., Blodgett, S. L., Chen, C., Daumé III, H., Dodge, J., Duan, I., et al. (2024). Evaluating the Social Impact of Generative AI Systems in Systems and Society. arXiv preprint arXiv:2306.05949. https://arxiv.org/abs/2306.05949


