October 25-26, 2025, Vienna, Austria
Davide AVESANI, Ammar KHEIRBEK, Isep - Institut Supérieur d'Électronique de Paris, 10 rue de Vanves, 92130 Issy-les-Moulineaux, France
Social media is now deeply integrated into people's daily lives, enabling rapid information exchange and global connectivity. Unfortunately, harmful content can be easily disseminated across all communities, including hate speech and bias against vulnerable groups such as people with disabilities. While social media platforms employ a mix of automated systems and human experts for content moderation, significant challenges remain in detecting nuanced hate speech, particularly when it is expressed through indirect or coded language. This paper proposes a novel approach to address these challenges through HEROL (Hate-speech Evaluation via RAG and Optimized LLM), a unified model that combines RAG-enhanced Large Language Models with Prompt Engineering and Fine-Tuning. Experimental results, obtained through a structured evaluation methodology on annotated social media datasets, demonstrated that HEROL achieved an accuracy improvement of up to 10% compared to baseline models. This highlights its effectiveness in identifying subtle and indirect forms of hate speech and its potential to contribute to safer, more inclusive online environments.
Social Media – Hate Speech Detection – Disability – Natural Language Processing – Large Language Models – Prompt Engineering – Fine-Tuning – Retrieval-Augmented Generation – Knowledge Graph
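The abstract does not describe HEROL's implementation, so the following is only a minimal sketch of how a retrieval-augmented classification prompt of this kind could be assembled; KNOWLEDGE_BASE, retrieve_context, build_prompt, and llm_classify are hypothetical placeholders, and the lexical overlap retrieval stands in for a real vector store and fine-tuned LLM.

```python
# Illustrative sketch only: HEROL's actual corpus, prompts, and model are not
# described in the abstract; all names below are hypothetical placeholders.
from collections import Counter

# Tiny in-memory "knowledge base" of annotated examples / definitions that a
# RAG component might retrieve from (stand-in for a real vector store).
KNOWLEDGE_BASE = [
    "Indirect hate speech often relies on sarcasm or euphemism rather than slurs.",
    "Coded terms are frequently used online to demean people with disabilities.",
]

def retrieve_context(post: str, k: int = 2) -> list[str]:
    """Naive lexical retrieval: rank KB entries by token overlap with the post."""
    post_tokens = Counter(post.lower().split())
    scored = [
        (sum((post_tokens & Counter(doc.lower().split())).values()), doc)
        for doc in KNOWLEDGE_BASE
    ]
    return [doc for _, doc in sorted(scored, reverse=True)[:k]]

def build_prompt(post: str) -> str:
    """Prompt-engineering step: inject retrieved context before the instruction."""
    context = "\n".join(f"- {c}" for c in retrieve_context(post))
    return (
        "You are a content-moderation assistant.\n"
        f"Background knowledge:\n{context}\n\n"
        f"Post: \"{post}\"\n"
        "Answer with exactly one label: HATE or NOT_HATE."
    )

def llm_classify(prompt: str) -> str:
    """Placeholder for a call to a fine-tuned LLM; returns a dummy label here."""
    return "NOT_HATE"

if __name__ == "__main__":
    print(llm_classify(build_prompt("Example social media post to screen.")))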
Liya Wang, David Yi, Damien Jose, John Passarelli, James Gao, Jordan Leventis, and Kang Li, Atlassian, USA
Large Language Models (LLMs) enhance productivity through AI tools, yet existing benchmarks like Massive Multitask Language Understanding (MMLU) inadequately assess enterprise-specific task complexities. We propose a 14-task framework grounded in Bloom's Taxonomy to holistically evaluate LLM capabilities in enterprise contexts. To address the challenges of noisy data and costly annotation, we develop a scalable pipeline combining LLM-as-a-Labeler, LLM-as-a-Judge, and Corrective Retrieval-Augmented Generation (CRAG), curating a robust 9,700-sample benchmark. Evaluation of six leading models shows that open-source contenders like DeepSeek R1 rival proprietary models in reasoning tasks but lag in judgment-based scenarios, likely due to overthinking. Our benchmark reveals critical enterprise performance gaps and offers actionable insights for model optimization. This work provides enterprises with a blueprint for tailored evaluations and advances practical LLM deployment.
Large Language Models (LLMs), Evaluation Benchmark, Bloom's Taxonomy, LLM-as-a-Labeler, LLM-as-a-Judge, Corrective Retrieval-Augmented Generation (CRAG).
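As a rough illustration of the labeling-and-judging loop such a curation pipeline implies, the sketch below chains three stages; label_with_llm, judge_label, and corrective_rag are hypothetical placeholders for the paper's LLM-as-a-Labeler, LLM-as-a-Judge, and CRAG components, whose prompts and models the abstract does not specify.

```python
# Minimal sketch of an LLM-driven labeling/judging loop; not the paper's
# actual pipeline, and all three stage functions are placeholders.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Sample:
    text: str
    label: Optional[str] = None
    verified: bool = False

def label_with_llm(sample: Sample) -> str:
    """LLM-as-a-Labeler: would prompt a model to draft a task label."""
    return "draft-label"          # placeholder output

def judge_label(sample: Sample) -> bool:
    """LLM-as-a-Judge: would ask a second model to accept/reject the draft."""
    return len(sample.text) > 0   # placeholder verdict

def corrective_rag(sample: Sample) -> str:
    """CRAG step: would retrieve supporting evidence and re-label rejected items."""
    return "corrected-label"      # placeholder output

def curate(raw_texts: list[str]) -> list[Sample]:
    curated = []
    for text in raw_texts:
        s = Sample(text=text)
        s.label = label_with_llm(s)       # first-pass draft label
        if not judge_label(s):            # reject low-quality drafts
            s.label = corrective_rag(s)   # re-label with retrieved evidence
        s.verified = True
        curated.append(s)
    return curated

if __name__ == "__main__":
    print(curate(["Summarize this ticket thread.", "Draft a release note."]))
```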
Philipp Seitz, Jan Schmitt, and Andreas Schiffler, Institute of Digital Engineering, Technical University of Applied Sciences Würzburg-Schweinfurt, Germany
For a large set of predictions from several differently trained machine learning models, known as bagging predictors, the mean of all predictions is taken by default. Nevertheless, this procedure can deviate from the actual ground truth in certain parameter regions. A method is presented to determine a representative value ỹ_BS from such a set of predictions and to evaluate it by an associated quality criterion β_BS, called the Bagging Score (BS), using nonlinear regression with Neural Networks (NN). The BS reflects the confidence of the obtained ensemble prediction and also allows the construction of a prediction estimation function δ(β) that specifies deviations more precisely than the variance of the bagged predictors themselves.
Machine Learning, Neural Network, Bagging Predictors, Bagging Score, Nonlinear Regression, Deviation Estimation.
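For context, the default baseline that this abstract argues against can be sketched as follows, assuming scikit-learn and NumPy and a synthetic 1-D regression problem as a stand-in for the paper's data; the representative value ỹ_BS and the Bagging Score β_BS themselves are not reproduced here.

```python
# Sketch of the default bagging baseline: an ensemble of independently trained
# NNs whose predictions are averaged, with their variance as a naive spread
# estimate. The paper's y~_BS and beta_BS are NOT computed in this sketch.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Synthetic 1-D regression problem (stand-in for the paper's data).
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)

def fit_bagged_nns(X, y, n_members=10):
    """Train n_members MLPs on bootstrap resamples of (X, y)."""
    members = []
    for seed in range(n_members):
        idx = rng.integers(0, len(X), size=len(X))        # bootstrap sample
        model = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000,
                             random_state=seed)
        members.append(model.fit(X[idx], y[idx]))
    return members

members = fit_bagged_nns(X, y)
x_query = np.array([[1.5]])
preds = np.array([m.predict(x_query)[0] for m in members])

print("member predictions:", np.round(preds, 3))
print("default ensemble mean:", preds.mean())   # the default the paper questions
print("prediction variance:", preds.var())      # naive uncertainty proxy
```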