Artificial intelligence researchers at North Carolina State University have unveiled a new technique aimed at enhancing the safety of large language models (LLMs) that power popular chatbots like OpenAI’s ChatGPT and Google’s Gemini. This breakthrough, termed “neuron freezing,” is designed to prevent users from circumventing the built-in safety filters of these AI systems, a concern that has grown as misuse has become more prevalent.
Traditionally, LLMs treat safety as a binary decision made at the start of response generation. If the system judges a query to be safe, it proceeds; if it flags the query as dangerous, it declines to respond. However, users have increasingly exploited loopholes by rephrasing unsafe prompts to bypass these safeguards. One notable study from last year indicated that simply rewording a harmful prompt as a poem enabled users to evade safety measures.
The limitations of existing safety protocols demand continuous updates or retraining of the models to address these workarounds. In contrast, the new research offers a more foundational approach to incorporating ethical boundaries directly into the architecture of LLMs, effectively hardcoding safety measures to prevent misuse regardless of user attempts to manipulate the input.
The innovation hinges on identifying and “freezing” specific safety-critical neurons within the neural network. This strategy preserves the safety characteristics of the original model while enabling it to adapt to new tasks across various domains. “Our goal with this work was to provide a better understanding of existing safety alignment issues and outline a new direction for how to implement a non-superficial safety alignment for LLMs,” stated Jianwei Li, a PhD student at NC State University and the lead author of the research.
Li further explained that “freezing” these neurons during the fine-tuning process maintains the safety attributes of the model even as it encounters new contexts. Jung-Eun Kim, an assistant professor of computer science at the university, underscored the significance of this research, noting that it offers a conceptual framework to address challenges associated with safety alignment in LLMs.
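The article does not say how the safety-critical neurons are identified or how the freeze is implemented, so the following is only a minimal sketch of the general idea, not the researchers' method. It assumes a "neuron" corresponds to one row of a layer's weight matrix, that the set of safety-critical rows is already known, and that fine-tuning uses plain gradient descent. The function name `sgd_step_with_frozen_neurons` and all values are hypothetical.

```python
from typing import List, Set

Matrix = List[List[float]]


def sgd_step_with_frozen_neurons(
    weights: Matrix,
    grads: Matrix,
    frozen: Set[int],
    lr: float = 0.1,
) -> Matrix:
    """Apply one gradient-descent step, leaving frozen rows untouched.

    Assumption: each row of `weights` holds the incoming weights of one
    neuron, and `frozen` lists the row indices treated as safety-critical.
    """
    updated: Matrix = []
    for i, (w_row, g_row) in enumerate(zip(weights, grads)):
        if i in frozen:
            # Frozen neuron: copy the weights unchanged, ignoring the gradient.
            updated.append(list(w_row))
        else:
            # Ordinary neuron: standard update w <- w - lr * g.
            updated.append([w - lr * g for w, g in zip(w_row, g_row)])
    return updated


# Hypothetical three-neuron layer; neuron 1 is assumed to be safety-critical.
weights = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]]
grads = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
frozen_neurons = {1}

new_weights = sgd_step_with_frozen_neurons(weights, grads, frozen_neurons, lr=0.1)
```

In this toy setup, the weights of neuron 1 are identical before and after the step, while the other two neurons move in response to the fine-tuning gradient. That mirrors the behavior described above: the model keeps adapting to the new task while the frozen parameters retain their original values.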
The researchers envision that their work will pave the way for future techniques aimed at enabling AI models to continuously evaluate the safety of their reasoning while generating responses. Such advancements could be crucial as the deployment of AI systems becomes more widespread and integrated into daily life.
The findings will be detailed in an upcoming paper titled “Superficial Safety Alignment Hypothesis,” which is scheduled for presentation at the Fourteenth International Conference on Learning Representations (ICLR 2026) in Brazil next month. As these technologies evolve, ensuring their safe operation will be paramount, particularly in light of their increasing use in sensitive applications.