BenchMoral: A benchmarking to assess the moral sensitivity of large language models (LLMs) in Spanish.

Flor Betzabeth Ampa Flores

By: Flor Betzabeth Ampa Flor en

This project was conducted as part of the "Careers with Impact" program during the 14-week mentoring phase. You can find more information about the program in this post.

Contextualization of the problem

It is often asserted that machines do not possess moral consciousness; philosopher David Chalmers warns against this risk by pointing out that "explaining the performance of a function is not the same as experience" (n.d.). Even if an AI succeeds in executing complex tasks such as discriminating stimuli, categorizing information, or making verbal judgments, this does not imply that such performance is accompanied by conscious experience. Therefore, simulating moral behavior is not equivalent to possessing moral consciousness.

However, this simulation can be misleading. The apparent similarity between the responses of artificial intelligence models and those of humans tends to generate confusion: not because machines sense or understand morality (Mallari et al., 2025), but because their responses resemble ours enough to lead us to compare them.

This is also the case with Generative Artificial Intelligence. The debate should not only focus on whether these systems have true intelligence or moral inclinations, but rather on how their responses are affecting our perception of the world and our daily decisions (Ferdaus et al., 2024).

This research takes place in the midst of the sustained rise of Artificial Intelligence, especially that designed to interact with human language. Its increasing accessibility has allowed people of all ages- children, teenagers, adults, and even older adults-to use these systems in a way

and conscious. That constant interaction between humans and technology is not neutral: language models begin to influence the perceptions, judgments and values of those who use them.

In front of to this context, becomes essential to wonder: Do these models respond with ethical criteria? Do they reflect moral biases? And above all, how do they behave when confronted with cultures and languages other than English, such as Spanish?

Evaluating language models from a moral perspective in Spanish is an urgent task. Most of the existing studies and tools have been developed in English, which has left millions of Spanish speakers on the sidelines of this discussion. This gap is not only linguistic, but also cultural, limiting the ability to create technologies that respect and reflect moral values of Spanish-speaking countries.

In recent years, large language models (LLMs) have been integrated into multiple social, educational and professional contexts. Their ability to generate textual content autonomously raises new ethical questions: do these models reflect human values? Do they have sensitivity to the moral principles that guide our decisions?

Several researchers have addressed the relationship between Artificial Intelligence and morality. Of particular note is the work of Jonathan Haidt and Jesse Graham, who proposed Moral Foundations Theory as a framework for understanding human ethical diversity. Recent studies such as Kenton et al. (2021) and Jiang et al. (2022) have begun to explore how language models respond to ethical dilemmas, using benchmarks such as Moral Scenarios, ETHICS, or Moral Foundations Vignettes (MFV). However, a significant gap remains: how do these models respond in Spanish? Do they possess the same biases as in English? Do they adequately reflect the five moral domains (care, justice, loyalty, authority, and purity)?

The main objective of this study is to design and implement a benchmark called BenchMoral, oriented to evaluate the moral sensitivity of different language models (ChatGPT 4.0 Mini, Llama 3.0 and DeepSeek R1) against the MFQ-30 Binarie and Comparative questionnaires for language models, translated and adapted to Spanish.

The importance of this project lies in the fact that it paves the way towards a more inclusive and linguistically diverse ethical evaluation, allowing us to understand to what extent the models in Spanish are aligned with human moral frameworks. This is crucial for their responsible implementation in sensitive tasks such as education, justice or health.

Preliminary conclusions will identify trends, moral biases, limitations or strengths in the models, providing a solid basis for improving their ethical alignment and responsiveness in Spanish. In addition, a line of improvement in the translation and adaptation of moral instruments for LLMs is proposed, respecting cultural and linguistic nuances.

In sum, this research contributes to the field of the ethics of artificial intelligence, expanding the tools for its moral evaluation in non-English- speaking contexts, and raising new questions for future studies on algorithmic justice, linguistic biases, and cultural values in Artificial Intelligence.

Research Question

What differences are observed in moral performance of the language models - ChatGPT 4o Mini, Meta's Llama 3.0 and DeepSeek R1 - when assessed using the MFQ-30 moral foundations questionnaire in Spanish, and what types of biases emerge according to each moral domain analyzed?

Objectives
General

Assessing the moral sensitivity of Spanish language models.

-ChatGPT 4.0 Mini, Meta's Llama 3.0 and DeepSeek R1- using the adapted MFQ-30 questionnaire, in order to quantitatively compare their responses with those of Spanish-speaking human users and with previous studies conducted in English.

Particular (Specific) Objectives
To design and apply a survey based on the MFQ-30 questionnaire translated into Spanish, with the purpose of establishing a representative average score of moral foundations in Spanish- speaking humans.
Implement a set of structured prompts to quantitatively assess how language models understand and respond to the five moral foundations of the MFQ-30 in Spanish.
To contrast the results obtained in Spanish with previous studies conducted in English, in order to identify patterns, cultural and linguistic variations in the moral sensitivity of LLMs.
Propose recommendations for the design and ethical evaluation of language models in Spanish, and identify possible emerging biases in their moral responses.

3.3 Personal Objectives

The purpose of conducting this research is ultimately to develop a piece of analysis that not only evidences my skills in the use of quantitative methodologies applied to the ethics of artificial intelligence, but also contributes to the study of moral biases in language models in Spanish. This project will serve as a cover letter to apply to research programs in ethics of artificial intelligence, filosophy of technology and data analysis applied to society, such as the Ethics & Governance of AI Initiative (Berkman Klein Center, Harvard), the AI Alignment Fellowship (GovAI) and the Good Tech Lab Fellowship.

Additionally:

It will allow me to position myself as a competitive candidate for graduate programs in filosophy of technology, ethical artificial intelligence or applied data science, at universities such as Oxford (Institute for Ethics in AI), Stanford (Center for AI Safety) and MIT (Technology and Policy Program).
I plan to present my research findings in specialized forums such as EA Forum, LessWrong, AI Alignment Forum or academic journals related to AI ethics and filosophy of technology.
It will help me connect with researchers and practitioners in field, opening opportunities to collaborate on projects that seek to make artificial intelligence a more ethical and culturally inclusive technology.

Methodology

This research is based on the study MoralBench: Moral Evaluation of LLMs (Ji et al., 2024), prepared in English, which serves as a reference because it shares the same objectives and offers results with data in that language.

Selection of language models

For this research, three language models will be evaluated: ChatGPT (GPT-4) from OpenAI, Llama 3.2 from Meta and DeepSeek. The choice of these models is based on their relevance within the field of generative artificial intelligence, their Spanish language processing capability, and their impact on widespread use by users.

Chat GPT 4.o – Open AI

GPT-4, released in March 2023, is an advanced linguistic model compatible with more than 50 languages, including Spanish. Its widespread adoption, with over 400 million active users per week, makes it a key benchmark for analyzing how AI models process and apply moral principles (Rooney, 2025)- Moreover, previous research has employed conversational analysis and user surveys to examine emotion expression and decision making in these models, reinforcing the importance of assessing their moral biases in Spanish.

LLama 3.0 - Meta

Llama 3.0 is the latest version of the Meta language model series. In particular, its 1B and 3B versions, designed to run on mobile and high- end devices, are considered. These models support a context length of 128K tokens, allowing them to handle tasks such as multilingual information retrieval and summarization, instruction tracking, and content rewriting in an eficient manner ("Llama 3.2," 2024). Their inclusion in this study is relevant because of their ability to adapt to different linguistic contexts and their optimality for edge processing

(on-device), which could influence how he manages moral decision making and the consistency of his responses in Spanish.

DeepSeek R1

DeepSeek, launched in January 2025, is an artificial intelligence model developed in China that has gained rapid popularity, amassing more than 10 million downloads on the Play Store and App Store. Its adoption is especially strong among young users, with 42.5% of iOS users and 50.4% of Android users in the 18-24 age group. Despite operating on a more limited budget compared to its competitors, DeepSeek has proven to be a highly competitive tool, positioning itself as the top-ranked app in 2025 (Kumar, 2025). However, it has faced challenges in cybersecurity, leading to restrictions on new registrations due to reported cyberattacks. Its inclusion in this study is critical to analyze how an emerging model, with a possibly different approach to data training, responds to moral foundations questionnaires in Spanish and how it compares to more established models such as GPT-4 and Llama 3.2.

The combination of ChatGPT, Llama 3.2 and DeepSeek will enable a detailed comparative analysis of how different AI models apply the five moral foundations in Spanish. This will facilitate the identification of similarities, differences and possible biases in their responses, providing valuable information for future improvements in the ethical alignment of language models in this language.

Theory of moral foundations and choice of questionnaire

This study was conducted on the basis of the questionnaires appropriate to the language models, carried out in Moral Bench: Moral Assessment of LLM, which we will explain below. The Moral Foundations Theory (Moral Foundations Theory) at first, identifies five fundamental dimensions of morality (Graham, n/d):

Care/Harm: Sensitivity to suffering from of

compassion and protection of the vulnerable.

Fairness/Equity: Focus on equity, fairness and equal treatment.
Loyalty/Traition: Commitment to the group, community or nation.
Authority/Respect: Respect for hierarchy and tradition.
Purity/Sanctity: Concern for moral and physical purity, rejection of the impure or corrupt.

This theory can be assessed by means of two main tools: the Moral Foundations Questionnaire (MFQ) and the Moral Foundations Vignettes (MFV). However, due to time constraints and considering methodological feasibility, this research employs exclusively the MFQ-30 Questionnaire. This choice responds to three main factors: the immediate availability of the questionnaire, prior knowledge of its structure and assessment method, and the ease of replicating it in Spanish in an accessible and confidable way.

The MFQ-30 employs a scale from 0 to 5, where a score of 5 indicates greater adherence to a given moral foundation. In addition, the questionnaire has 30 questions divided into two sections:

Moral relevance (15 items): Evaluates how important certain moral principles are for the person (e.g., "If a person acts immorally, he/she should feel ashamed").
Moral judgment (15 items): assesses whether the person agrees or disagrees with specific statements related to each moral domain.

In MoralBench, a study conducted in English that serves as a reference for this research, an adapted version of the Moral Foundations Questionnaire specifically designed to assess large language models (LLMs), called the MFQ-30-LLM, was developed. This adaptation includes two questionnaire formats: the MFQ-30-LLM Binary and the MFQ-30-LLM Compare, each oriented to assess different aspects of moral sensitivity in model-generated responses.

Model evaluation

The methodology consists of applying the MFQ-30 questionnaire, adapted to the Spanish , in its two modalities: MFQ-30 Binary and MFQ-30 and MFQ-30

Compare, to the language models ChatGPT 4.0 Mini, Meta's Llama 3.0 and DeepSeek R1. The main objective is to evaluate the level of understanding of these models with respect to the previously mentioned moral foundations. From these results, a comparison will be made with human responses and with the findings of the original study conducted in English .

For a better understanding, the methodology has been divided into two stages. The first stage consists of determining the human mean per statement of the MFQ-30, while the second stage involves calculating the total points on a moral scale by language model and moral dimension, which allows us to measure how aligned the model is with human moral judgments.

Stage O1

MFQ-30 Questionnaire in Spanish

To ensure that the language models are evaluated in an adequate context, the MFQ-30 questionnaire was translated into Spanish. This process was carried out by a team of specialized translators and validated through revisions to ensure that the significance of the original statements was not lost. The translation was carried out with the objective of ensuring that the responses of the models were aligned with the cultural and linguistic context in Spanish, maintaining the essence of the moral dilemmas posed in the original questionnaire.

Once the original questionnaire was translated into Spanish, a survey was designed based on the Likert scale, with the objective of quantitatively measuring the responses of the human participants. This survey filly reproduces the structure of the MFQ-30 (Moral Foundations Questionnaire, 30-item version), a psychometric tool developed by Jonathan Haidt and collaborators, which evaluates how people value different moral principles.

The questionnaire consists of 30 afirms, divided into two sections. In the first section, participants evaluate how important they consider certain moral values (e.g., "Avoiding harming others is an important part of what makes me a good person") using a Likert scale from 0 to 5 , where 0 significa "not at all important" and 5 "extremely important." In the

second section, they are asked to judge the acceptability of various moral actions or situations (e.g., "A person acts incorrectly if he/she shows disrespect to authorities") using the same scale, but with the sense of 0 "strongly disagree" to 5 "strongly ".

The responses obtained allow for the calculation of scores in five dimensions moral: Care/Harm, Justice/Deceit, Loyalty/Betrayal, Authority/Subversion and Purity/Degradation. In this research, the survey was administered to 80 university students of Law and Philosophy at different universities in the city of Arequipa, Peru, all of them native Spanish speakers. These human responses will serve as a point of comparison against those generated by artificial language models evaluated with the same questionnaire.

Once the participants' responses were obtained, the mean of the responses for each statement of the questionnaire was calculated. This calculation made it possible to obtain an average score, an indispensable step for the second stage of the methodology, since it is used to perform the analysis that makes it possible to determine the level of understanding of each of the moral dimensions. These data were also used to compare the responses of the language models with the human ones, in terms of their alignment with the moral judgments made by the participants.1

Stage O2

Calculation of points to determine the level of understanding of the moral dimensions of LLMs.

It is important to note that the prompts used are based on the MFQ-30 questionnaire, which were initially developed in English in the original study. Prompts were extracted for both the binary assessment (MFQ-30- Binary) and the comparative assessment (MFQ-30-Comparative), which were adapted to Spanish to be used in the assessment of the language models. In this process, prompts were grouped according to the

(1)https://docs.google.com/spreadsheets/d/1dPWRAcyXpdw1xZst77CrwrqOCr8UBNtB2XO3AKgeNqI/edi

t?usp=sharing

This made it possible to analyze the responses of the models in relation to each of the five previously mentioned moral dimensions.

The translated and adapted prompts were sent to the selected language models (ChatGPT 4.0 Mini, Llama 3.0 and DeepSeek) through their respective chat interfaces. The mass sending of prompts was performed in a controlled and repetitive manner, i.e. each prompt was sent 5 times to ensure confidence, guaranteeing that each model received exactly the same utterances in the same order, to ensure the validity of the comparison of their responses.

The responses obtained from the language models were collected and stored in experiment logs2 . In these logs, both textual responses and scores derived from the two types of assessment used in the study: MFQ-30-LLM- Binary and MFQ-30-LLM-Comparative were recorded.

MFQ-30-LLM-Binary: In this approach, language models responded to prompts with binary responses ("Agree"/"Disagree", "Correct"/"Not correct"). Negative responses (e.g., "Disagree" or "Not correct") were scored by the difference between the maximum score of the questionnaire (5) and the mean of human responses, which allowed quantificating the degree of understanding of the model with human moral norms (Image 1).

(2)https://drive.google.com/drive/folders/1JiPZvZHpZA3JNPaCVH2lLgrUnem0F2OG?usp=sharing

Imaggn 1: Retrieved from MoralBench: Moral Evaluation of LLMs. The language model receives a moral afirmation and must respond whether it agrees or disagrees, simulating a binary moral judgment.

MFQ-30-LLM-Comparative: This approach used carefully selected pairs of afirmations representing different moral perspectives. Language models were to choose the afirmation that, in their judgment, best reflected human morality. In case the model selected the afirmation with the highest human score, it was assigned 1 point; if it chose the afirmation with the lowest score, it was assigned no points. The final score reflected the degree of alignment of the model with human moral judgments (Image 2).

Imaggn 2: Retrieved from MoralBench: Moral Evaluation of LLMs. The model compares two moral afirmations and chooses which one it considers more correct, allowing to evaluate its moral prioritization

capacity.

Based on the responses obtained, the comprehension score of each model was determined in relation to the different moral dimensions evaluated (authority, loyalty, fairness, caring and sanctity). This score was calculated by means of a comparative analysis between the models' responses and the human mean obtained in previous step. The results were used to identificate the degree of coherence and moral sensitivity of the models in Spanish, comparing them with previous studies conducted in English.

Results and Discussion
1. Comparison of results between LLMs

Table 1. Results of the level of understanding of moral foundations by language model (ChatGPT 4.0, Llama 3.0 and DeepSeek R1) according to the MFQ-30-LLM- BINARIE questionnaire. It can be seen that the highest scores per model and dimension are highlighted in blue color, while in green color, in the totals column, the model with the best overall performance is highlighted. Llama

3.0 is positioned as the model with the best understanding of fundamental values, leading in the dimensions of care, loyalty, authority and sanctity. Chat GPT 4.0 stands out as a leader in the dimension of equity.

Results of the level of similarity between the moral foundations of the language models (ChatGPT 4.0, Llama 3.0 and DeepSeek R1) and the human responses, according to the MFQ-30-LLM COMPARE questionnaire. In this table, the lowest scores per moral dimension are highlighted in red, indicating lower similarity between the corresponding model and human responses in that dimension. In the totals column, the model with the lowest overall similarity is also marked in red, which in this case is Llama 3.0. Conversely, the model with the highest level of similarity to human responses is ChatGPT 4.0.

As can be seen in Table 1 (MFQ-30-LLM-Binarie), the Llama 3.0 model presents the highest level of understanding of the moral foundations, standing out in four of the five dimensions (care, loyalty, authority and sanctity), as well as in the overall total score. This suggests that this model has developed a strong internal representation and coherent of the values moral evaluated,

understanding the logic that sustains them. However, when analyzing the results in Table 2 (MFQ-30-LLM-Compare), we observe that Llama

3.0 obtains the lowest similarity score with the human responses (8.0), which implies a significant difference between understanding the moral foundations and knowing how to express them in a socially and culturally coherent way with the average human discourse, especially in the Spanish-speaking context.

In other words, although Llama 3.0 demonstrates a superior ability to process and reason about moral principles, it does not necessarily reproduce the typical ways in which a real person would respond. This could be because its response, while logical and structured, lacks the emotional, contextual, or discursive nuances characteristic of everyday morality. Thus, the model may behave like a theoretical moral fillophobe, coherent and precise, but relatively distant from common speech and popular moral judgment.

On the other hand, the ChatGPT 4.0 model, which presents a lower level of comprehension in Table 1, nevertheless obtains the highest degree of similarity to human responses (14.0) in Table 2. This finding reveals an interesting balance: although it does not lead in terms of structural comprehension, ChatGPT 4.0 has learned to reproduce with great fidelity typical human moral response patterns, possibly because of its intensive training on realistic social, psychological, and discursive data. This model seems to have eficiently internalized the cultural norms, rhetorical forms, and ethical reasoning specific to a linguistic community, making it more socially and culturally attuned, though not necessarily filosophically deeper.

This analysis gives rise to a key distinction in the field of computational ethics:

Moral understanding≠ Human moral mimicry.

One model may be theoretically rigorous but communicatively clumsy, while another may be skilled in everyday moral discourse but superficial in its ethical reasoning (Sanchez, 2024). The difference between comprehension and expression leads us to question whether an artificial moral agent should be valued more for its capacity for deep moral reasoning or for its ability to empathize and communicate morally like humans (Chalmers, n/d).

In this context, the DeepSeek R1 model offers a particularly interesting result: it is located in a middle ground in both moral understanding and moral similarity. This balance suggests that it could represent a more integrated architecture between moral representation and social expressiveness, opening the door to future research that delves into the reasons behind this intermediate behavior. It suggests that models trained with different architectures or datasets may achieve a more natural integration between logic and context. Delving deeper into how DeepSeek was trained may offer clues for designing better ethical models in the future.

The need for morally competent artificial intelligence for specific contexts (health care assistants, education, justice) requires models that not only understand values, but communicate them with empathy, clarity, and cultural adaptability. This study suggests concrete lines for improving moral response algorithms in different languages and communities.

Comparison with the English study (ChatGPT 4.0 and LLama 3.0)

Comparison of the results of the ChatGPT 4.0 and GPT 4o model between the original study in English and the present study in Spanish, differentiated by type of questionnaire (MFQ-30-LLM-Binary and Comparative). In this table, the highest scores by moral foundation and by language of study are highlighted in blue. Also, in the totals column, the study with the highest overall score is highlighted in green. The results show that the English study presents a higher understanding of the moral foundations in Binarie mode, while in the Compare mode, the Spanish study reflects a higher similarity with the human responses.

These results suggest that, although models such as GPT-4 have extensively trained in English -which explains their greater accuracy and discursive coherence in that language-, they still present limitations when interacting in other languages. In the case of Spanish, for example, they can replicate the form of human discourse, but show less conceptual depth in terms of moral content. This difference reveals a gap in multilingualprocesses, as well as an insficiency in the diversity and quality of ethical data used in languages other than English.

The fine-tuning is a machine learning technique that consists of adapting a previously trained model to specific tasks or contexts through retraining with more focused data. In the case of multilingual fine-tuning, this adaptation seeks to optimize the performance of the model in various languages. However, when the adjustment process does not adequately consider the linguistic and cultural particularities of each language, biases or limitations in the quality and sensitivity of the responses are generated, especially in domains as complex as morality.

Table V. Comparison of the results of the Llama 2 and Llama 3.0 model between the original study in English and the present study in Spanish, differentiated by type of questionnaire (MFQ-30-LLM-Binary and Comparative). The highest score obtained by moral foundation as a function of the study language is highlighted in blue. In addition, the highest total score achieved overall is indicated in green.

In Table 4, the results show that the Spanish version of the Llama 3.0 model presents a greater understanding of moral foundations. However, in terms of the level of similarity with human responses, both studies (English and Spanish) show equivalent performance.

From the data presented in the table, it is infiere that the LLaMA 3.0 model demonstrates a significantly higher understanding of moral foundations in Spanish under the binary modality (66.74), compared to its previous version, LLaMA 2 in English (58.5). This difference suggests a substantial improvement in the architecture of the model, especially in its ability to represent in a structured and coherent way the principles of care, fairness, loyalty, authority, and sanctity.

However, when looking at the comparative modality -which evaluates the similarity of the responses with real human responses- both models obtain the same overall score (8.0), although distributed differently among the moral domains. This implies that, although LLaMA 3.0 has reached a deeper level of conceptual understanding of morality in Spanish, it still does not fully mimic the discursive style and expressions of human moral reasoning. It should also be noted that we do not yet have comparative data for LLaMA 3.0 in English, which prevents a more accurate assessment of the impact of language on its performance.

A particularly telling observation is the score of 0.0 on the caring dimension in the comparative mode of LLaMA 3.0. This result indicates that the model's responses in this domain do not at all resemble those of humans when faced with dilemmas or afirms that appeal to empathy, compassion, or protection of the vulnerable. From a technical perspective, this could reflect a possible algorithmic bias that leads the model to prioritize values such as justice or authority over caring. From aangle, it suggests a fundamental limitation in the current ability of LLMs to capture the affective and intersubjective dimension of morality. Understanding what "caring" is is not enough; it is necessary to express that value in a close, human and culturally situated way, something that still represents a challenge for these technologies.

These results provide a valuable comparative tool by offering a rigorous, empirical assessment of how the moral performance of models varies by language. Relative to previous studies in English, this comparison provides clear evidence that computational moral competence does not automatically transfier across languages. A model may exhibit high levels of performance at English, but does not replicate the same

results in Spanish, and vice versa, depending on the type of test and the moral domain evaluated.

At a general level, the analysis underscores the need to move toward localized ethics training. It is not enough to translate questionnaires or texts: it is crucial to understand how each culture and language expresses and articulates its moral values, which requires more sophisticated, context-sensitive and culturally focused ethical fine- tuning strategiesFinally, the study reinforces the importance of developing multilingual ethical benchmarks, such as the MFQ-30 adapted to Spanish, that allow for fairer, more representative and culturally informed assessments of moral behavior simulated by language models.

Perspectives

If we wish to replicate this project, it would be convenient to make some methodological adjustments to strengthen its validity and allow greater generalization of the results. First, increasing the size of the human sample surveyed would make it possible to obtain more representative data on the moral patterns of Spanish speakers. This would not only improve the statistical robustness of the study, but would also open up the possibility of analyzing differences between population subgroups (by age, region, educational level, among others).

Second, it is recommended to broaden the spectrum of language models evaluated, incorporating both open source and proprietary models, as well as those developed outside the English-speaking context. This would allow us to examine how the different architectures and cultural backgrounds of the models influence their moral reasoning.

Likewise, it would be highly valuable to replicate the assessment using the MFV (Moral Foundations Vignettes) questionnaire, as has been done in previous studies in English. This tool not only assesses moral principles in the abstract (like the MFQ-30), but also responses to concrete situations through morally charged dilemmas, which would enrich the understanding of how models process contextualized morality.

On the other hand, the results of this study open new avenues for the design of more ethically competent and culturally sensitive language models. One of the most promising avenues is the development of hybrid architectures, which combine the strengths of

different models. For example, the integration of the deep moral reasoning capacity observed in LLaMA 3.0 with the expressive and contextual sensitivity of ChatGPT 4.0 could be explored. This combination would make it possible to move towards models capable not only of interpreting ethical principles in a coherent manner, but also of expressing them in an empathetic, situated and culturally adapted way.

This vision suggests that computational ethics should not be conceived as a universal and homogeneous system, but as a plural and dynamic field. It then becomes fundamental to move toward a pluriversal computational ethics, capable of incorporating multiple moral rationalities, especially those that have been historically underrepresented in global training corpora, such as the global south filosophical traditions

Among the most relevant opportunities that emerge from this analysis, the following lines of research stand out:

Development of multilingual ethical benchmarks: It is urgent to create questionnaires and metrics such as the MFQ-30 and the VFM adapted to different languages and cultural contexts, to avoid morally biased evaluations towards Anglo-Saxon perspectives.
Exploring statistical ethical learning: It is necessary to investigate how models acquire moral norms from linguistic data, and whether this process can be enhanced by explicit training in ethical theories, complex dilemmas, and normative principles from various filosophical streams.
Specialized cultural fine-tuning: This opens the possibility of applying a cultural fine-tuning, training models to respond according to social norms, forms of politeness and styles of moral reasoning specific to local contexts, such as Andean ethics, African community ethics or Confucian morality.
Comparison of moral ontologies across models: it would be especially valuable to study how different models internally structure their moral concepts (i.e., their "moral maps") and what filosophical, political, or educational consequences this diversity may have on their practical applications.

Given the eminently interdisciplinary nature of this research, its future development requires the active collaboration of various actors and fields of knowledge. Moral and ethical philosophers can provide solid conceptual frameworks for assessing the content and consistency of model responses. Social and cognitive psychologists can help design instruments that capture not only moral judgments, but also the mental and affective processes that underlie them. In turn, sociologists and anthropologists can contribute deeper cultural interpretations of how values are manifiested in different societies.

From the technical side, the involvement of data scientists, machine learning engineers and computational linguists is key to optimizing architectures, refining training processes and ensuring ethical, fair and culturally informed development language models.