Promoting positive mental health and well-being, especially in adolescents, is a critical yet underexplored area in natural language processing (NLP). Most existing NLP research focuses on clinical therapy or psychological counseling for the general population, which does not adequately address the preventative and growth-oriented needs of adolescents. In this paper, we introduce DeepWell-Adol, a domain-specific Chinese dialogue corpus grounded in positive psychology and coaching, designed to foster adolescents’ positive mental health and well-being. To balance the trade-offs between data quality, quantity, and scenario diversity, the corpus comprises two main components: human expert-written seed data (ensuring professional quality) and its mirrored expansion (automatically generated using a two-stage scenario-based augmentation framework). This approach enables large-scale data creation while maintaining domain relevance and reliability. Comprehensive evaluations demonstrate that the corpus meets general standards for psychological dialogue and emotional support, while also showing superior performance across multiple models in promoting positive psychological processes, character strengths, interpersonal relationships, and healthy behaviors. Moreover, the framework proposed for building and evaluating DeepWell-Adol offers a flexible and scalable method for developing domain-specific datasets. It significantly enhances automation and reduces development costs without compromising professional standards—an essential consideration in sensitive areas like adolescent and elderly mental health. We make our dataset publicly available.
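To make the two-stage, scenario-based expansion concrete, the sketch below is a minimal, hypothetical reading of such a pipeline: the `call_llm` placeholder, the prompts, and the `scenarios_per_seed` parameter are illustrative assumptions of ours, not details given in the abstract.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical placeholder for any chat-completion API call."""
    raise NotImplementedError

def expand_corpus(seed_dialogues, scenarios_per_seed=3):
    # Stage 1: derive new adolescent well-being scenarios from each expert-written seed.
    # Stage 2: generate a "mirrored" dialogue for every scenario, keeping the seed's style.
    expanded = []
    for seed in seed_dialogues:
        scenarios = call_llm(
            f"Propose {scenarios_per_seed} new adolescent well-being scenarios that "
            f"preserve the coaching intent of this dialogue:\n{seed}"
        ).splitlines()
        for scenario in scenarios:
            dialogue = call_llm(
                "Write a positive-psychology coaching dialogue for this scenario, "
                f"mirroring the style and quality of the seed:\n{scenario}"
            )
            expanded.append({"scenario": scenario, "dialogue": dialogue})
    return expanded
```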
As large language models advance toward superhuman performance, ensuring their alignment with human values and abilities grows increasingly complex. Weak-to-strong generalization offers a promising approach by leveraging predictions from weaker models to guide stronger systems, but its effectiveness could be constrained by the inherent noise and inaccuracies in these weak predictions. To address this, we propose a theoretically grounded approach that replaces forward KL divergence—whose mass-covering behavior risks overfitting to imperfect weak signals—with reverse KL divergence. Reverse KL divergence’s zero-forcing effect prioritizes high-confidence predictions, effectively mitigating the influence of unreliable weak supervision. Theoretically, we extend existing bounds and derive tighter lower bounds for both forward and reverse KL divergence. Notably, when a sufficiently pre-trained strong model is fine-tuned on the last linear layer, reverse KL guarantees that it outperforms its weak supervisor by the magnitude of their disagreement. Empirically, we demonstrate that reverse KL and reverse cross-entropy not only enable strong models to outperform those trained with forward KL and standard cross-entropy across most settings, but also exhibit greater robustness to noisy labels.
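As a minimal sketch of the objectives being compared, the snippet below contrasts forward KL, reverse KL, and reverse cross-entropy as losses for training a strong student on a weak supervisor's soft labels; the function names and the mean-over-batch reduction are our choices, not the paper's exact training recipe.

```python
import torch
import torch.nn.functional as F

def forward_kl(strong_logits, weak_probs, eps=1e-8):
    """Mass-covering KL(weak || strong): the student spreads mass over
    everything the weak supervisor assigns probability to."""
    strong_log_probs = F.log_softmax(strong_logits, dim=-1)
    return (weak_probs * (torch.log(weak_probs + eps) - strong_log_probs)).sum(-1).mean()

def reverse_kl(strong_logits, weak_probs, eps=1e-8):
    """Zero-forcing KL(strong || weak): the student concentrates on the
    supervisor's high-confidence predictions, damping noisy weak labels."""
    strong_probs = F.softmax(strong_logits, dim=-1)
    strong_log_probs = F.log_softmax(strong_logits, dim=-1)
    return (strong_probs * (strong_log_probs - torch.log(weak_probs + eps))).sum(-1).mean()

def reverse_ce(strong_logits, weak_probs, eps=1e-8):
    """Reverse cross-entropy: -sum_y p_strong(y) log p_weak(y)."""
    strong_probs = F.softmax(strong_logits, dim=-1)
    return -(strong_probs * torch.log(weak_probs + eps)).sum(-1).mean()
```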
Certified defense methods have demonstrated their effectiveness against textual adversarial examples by training models on the worst-case texts generated by substituting words in the original texts with synonyms. However, because word substitution is a discrete operation over the embedding space, the large search space hinders the efficiency of robust training and incurs significant time costs. To overcome this challenge, motivated by the observation that synonym embeddings lie close to one another, we propose to treat word substitution as a continuous perturbation of the word embedding representation. The proposed method, Text-RS, applies randomized smoothing techniques to approximate the word substitution operation, offering a computationally efficient alternative that outperforms conventional discrete methods and improves robustness during training. Evaluation results demonstrate its effectiveness in defending against multiple textual adversarial attacks.
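The following is a minimal sketch of the continuous-perturbation idea, assuming a HuggingFace-style classifier that accepts `inputs_embeds`; the isotropic Gaussian noise, `sigma`, and `n_samples` are illustrative assumptions, and Text-RS's actual noise distribution and certification procedure may differ.

```python
import torch

def smoothed_forward(model, input_ids, sigma=0.1, n_samples=8):
    """Approximate discrete synonym substitution as continuous noise on the
    token embeddings: sample Gaussian perturbations around each embedding
    and average the classifier's predictions (Monte-Carlo smoothing)."""
    emb = model.get_input_embeddings()(input_ids)        # (batch, seq_len, dim)
    probs = 0.0
    for _ in range(n_samples):
        noisy = emb + sigma * torch.randn_like(emb)       # continuous proxy for synonym swaps
        logits = model(inputs_embeds=noisy).logits
        probs = probs + torch.softmax(logits, dim=-1)
    return probs / n_samples
```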
Ensuring the trustworthiness of large language models (LLMs) is crucial. Most studies concentrate on fully pre-trained LLMs to better understand and improve their trustworthiness. In this paper, to reveal the untapped potential of pre-training, we pioneer the exploration of LLMs’ trustworthiness during this period, focusing on five key dimensions: reliability, privacy, toxicity, fairness, and robustness. To begin with, we apply linear probing to LLMs. The high probing accuracy suggests that LLMs in early pre-training can already distinguish concepts within each trustworthiness dimension. Therefore, to further uncover the hidden possibilities of pre-training, we extract steering vectors from an LLM’s pre-training checkpoints to enhance the LLM’s trustworthiness. Finally, inspired by the theoretical result that mutual information estimation is bounded by linear probing accuracy, we also probe LLMs with mutual information to investigate the dynamics of trustworthiness during pre-training. We are the first to observe a similar two-phase phenomenon in this setting: fitting and compression. This research provides an initial exploration of trustworthiness modeling during LLM pre-training, seeking to unveil new insights and spur further developments in the field.
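As a rough illustration of the two probing tools mentioned above, the sketch below fits a linear probe on hidden states taken from a pre-training checkpoint and builds a difference-of-means steering vector; the helper names and the logistic-regression probe are our assumptions about one plausible instantiation, not the paper's exact setup.

```python
import torch
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe_accuracy(hidden_pos, hidden_neg):
    """Linear probe separating, e.g., toxic vs. non-toxic hidden states from
    one checkpoint; accuracy is measured on a held-out split."""
    X = torch.cat([hidden_pos, hidden_neg]).numpy()
    y = [1] * len(hidden_pos) + [0] * len(hidden_neg)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    return LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_te, y_te)

def steering_vector(hidden_pos, hidden_neg):
    """Difference-of-means direction; at inference it can be added to the
    hidden states to nudge generations toward the trustworthy behaviour."""
    return hidden_pos.mean(dim=0) - hidden_neg.mean(dim=0)
```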