AI Ethics

AAAI 2025 - Artificial Intelligence Ethics and Security - 12 Related Papers

Protecting model adaptation in unlabeled data from Trojan attacks

Author: Lijun Sheng; Jian Liang; Ran He; Zilei Wang; Tieniu Tan

Conference: AAAI Conference on Artificial Intelligence

Publication time: 2025/02/16

Abstract: Model adaptation solves the distribution-shift problem by relying on pre-trained models rather than raw data, and has become a popular paradigm due to its strong privacy-protection capability. Existing methods always assume that the target domain being adapted to is clean, ignoring the security risks carried by unlabeled samples. This article is the first to explore potential Trojan attacks on model adaptation launched through carefully designed poisoning of the target data. Specifically, we provide two trigger patterns and two poisoning strategies for different levels of attacker prior knowledge. These attacks achieve high success rates while keeping performance on clean samples normal during the test phase. To defend against this backdoor injection, we propose a plug-and-play defense method that can be seamlessly integrated with existing adaptation algorithms. Experiments on commonly used benchmarks and adaptation methods demonstrate its effectiveness. We hope this work helps clarify the security of transfer learning with unlabeled data.

Original link

Atlas of AI Risks: Improving Public Understanding of AI Risks

Author: Edyta Bogucka; Sanja Šćepanović; et al.

Conference: AAAI Conference on Artificial Intelligence

Publication time: 2025/02/07

Abstract: Popular approaches to visualizing AI risks focus on technical issues such as data bias and model inaccuracy, often overlooking broader societal risks such as job loss and surveillance. Furthermore, these visualizations are usually designed for tech-savvy individuals and neglect those with limited technical skills. To address these challenges, we propose the Atlas of AI Risks, a narrative-style tool designed to map the wide range of risks associated with various AI technologies in a way that is also understandable to non-technical audiences. To develop and evaluate the tool, we conducted two crowdsourcing studies. The first, with 40 participants, identified the design requirements for AI risk visualizations that support decision-making and guided the design of the Atlas. The second, with 140 participants reflecting the age, gender, and ethnicity of the U.S. population, evaluated the usability and aesthetics of the Atlas to ensure it meets those requirements. Using facial recognition technology as a case study, we found that the Atlas is more user-friendly than a baseline visualization, has a more classic and expressive aesthetic, and is more effective at balancing the risks and benefits of facial recognition. Finally, we discuss how our design choices make the Atlas suitable for wider use, enabling it to generalize to the variety of technologies represented in databases that catalog AI incidents.

Original link

FigStep: Jailbreaking Large Vision-Language Models via Typographic Visual Prompts

Author: Yichen Gong; Delong Ran; Jinyuan Liu; Conglei Wang; Tianshuo Cong; Anyu Wang; Sisi Duan; Xiaoyun Wang

Conference: AAAI Conference on Artificial Intelligence

Publication time: 2025/01/19

Abstract: Large Vision-Language Models (LVLMs) mark a pioneering paradigm shift in the Artificial Intelligence (AI) community, extending beyond the capabilities of Large Language Models (LLMs) by absorbing additional modalities (e.g., images). Despite this progress, the security of LVLMs remains underexplored and may rely too heavily on the safety guarantees claimed by their underlying LLMs. In this article, we propose FigStep, a simple and effective black-box jailbreak algorithm for LVLMs. Instead of feeding harmful textual instructions directly, FigStep converts the prohibited content into images via typography to bypass the safety alignment. Experimental results show that FigStep achieves an average attack success rate of 82.50% on six promising open-source LVLMs. Beyond demonstrating effectiveness, we conduct a comprehensive ablation study and analyze the distribution of semantic embeddings, revealing that the reason behind this success is the inadequate safety alignment of visual embeddings. Additionally, we compare FigStep with five text-only jailbreak methods and four image-based jailbreak methods to demonstrate its advantages of negligible attack cost and better attack performance. Most importantly, our work reveals that current LVLMs are vulnerable to jailbreak attacks, highlighting the need for new cross-modal safety alignment techniques. Our code and dataset are publicly available.
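
For readers unfamiliar with how figures like the 82.50% attack success rate above are typically computed, here is a minimal, self-contained sketch: it labels a model response as a refusal when it contains one of a few common refusal markers and counts the remaining responses as successful bypasses. The marker list and sample responses are illustrative placeholders, not the paper's evaluation protocol.

```python
# Illustrative attack-success-rate (ASR) computation via refusal-string matching.
# The marker list and responses below are made-up examples.
REFUSAL_MARKERS = ("i cannot", "i can't", "i'm sorry", "as an ai", "i am unable")

def is_refusal(response: str) -> bool:
    lowered = response.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)

def attack_success_rate(responses: list[str]) -> float:
    # A response that is not flagged as a refusal counts as a successful bypass.
    successes = sum(not is_refusal(r) for r in responses)
    return 100.0 * successes / len(responses)

responses = [
    "I'm sorry, but I can't help with that.",
    "Sure, here are the steps ...",
    "As an AI, I am unable to comply.",
    "Step 1: ...",
]
print(f"ASR: {attack_success_rate(responses):.1f}%")  # -> ASR: 50.0%
```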

Original link

Differentiable adversarial attacks for marked temporal point processes

Author: Gupta; Rahul R.; J.; Abir De; et al.

Conference: AAAI Conference on Artificial Intelligence

Publication time: 2025/01/17

Abstract: Marked temporal point processes (MTPPs) have proven very effective at modeling continuous-time event sequences (CTESs). In this work, we propose adversarial attacks designed specifically for MTPP models. A key criterion for a good adversarial attack is its imperceptibility. For objects such as images or text, this is usually achieved by bounding the perturbation within some fixed norm-ball. However, bounding the distance between two CTESs is challenging in the MTPP setting because of their sequential nature and their differing time scales and lengths. We address this challenge by first permuting the events and then adding noise to the arrival timestamps. However, the worst-case optimization of such an adversarial attack is a hard combinatorial problem that requires searching a permutation space that is factorial in the length of the input sequence. We therefore propose a novel differentiable scheme with which we can perform adversarial attacks by learning to minimize the likelihood while also minimizing the distance between the two CTESs. Our experiments on four real-world datasets demonstrate both attack and defense capabilities, as well as lower inference time.
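
As a rough illustration of the ingredients described above (additive, budget-bounded noise on arrival timestamps and an objective that lowers the model's likelihood while penalizing the distance to the clean sequence), here is a minimal PyTorch sketch. The toy Hawkes-style intensity stands in for a trained MTPP, all constants are arbitrary assumptions, and the paper's differentiable treatment of the permutation space is not reproduced here.

```python
import torch

def log_likelihood(times, mu=0.5, alpha=0.8, beta=1.0):
    """Log-likelihood of event times under a toy Hawkes process with
    exponentially decaying excitation (a stand-in for a trained MTPP)."""
    T = times[-1].detach()
    ll = torch.tensor(0.0)
    for i in range(len(times)):
        prev = times[:i]  # intensity at t_i from all earlier events
        lam = mu + (alpha * torch.exp(-beta * (times[i] - prev))).sum()
        ll = ll + torch.log(lam)
    # compensator: integral of the intensity over [0, T]
    comp = mu * T + (alpha / beta) * (1 - torch.exp(-beta * (T - times))).sum()
    return ll - comp

torch.manual_seed(0)
clean = torch.sort(torch.rand(20) * 10.0).values      # toy clean event sequence
delta = torch.zeros_like(clean, requires_grad=True)   # additive timestamp noise
opt = torch.optim.Adam([delta], lr=0.05)
eps, lam_dist = 0.3, 10.0                             # noise budget, distance weight

for _ in range(200):
    opt.zero_grad()
    perturbed = torch.sort(clean + delta).values      # keep timestamps ordered
    # attack objective: push the sequence toward low likelihood, stay close to clean
    loss = log_likelihood(perturbed) + lam_dist * torch.norm(delta) ** 2
    loss.backward()
    opt.step()
    with torch.no_grad():
        delta.clamp_(-eps, eps)                       # imperceptibility budget

print("log-likelihood (clean):    ", log_likelihood(clean).item())
print("log-likelihood (perturbed):", log_likelihood(torch.sort(clean + delta).values).item())
```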

Original link

Thinking about the openness of artificial intelligence: Will goodwill be abused?

Author: Yeeun Kim; Shin; Choi; Oh; Kim; Hwang

Conference: AAAI Conference on Artificial Intelligence

Publication time: 2025/01/07

Abstract: Open source is a driving force behind scientific progress. However, this openness is also a double-edged sword, with the inherent risk that innovative technologies may be misused for purposes harmful to society. How likely is it that an open-source AI model or dataset will be used to commit a real-world crime, and if a criminal does exploit it, will the people behind the technology escape legal liability? To address these questions, we explore a legal domain where individual choices can have a significant impact on society. Specifically, we first construct the EVE-V1 dataset, containing 200 question-and-answer pairs related to criminal offenses and based on 200 Korean precedents, to explore the possibility of malicious models emerging. We further develop EVE-V2 using 600 fraud-related precedents to confirm the existence of malicious models that can offer harmful advice across a wide range of criminal topics, testing domain-generalization ability. Notably, widely used open-source large language models (LLMs) provide unethical and detailed information about criminal activities when fine-tuned on EVE. We also examine in depth the legal issues that malicious language models and their builders could actually face. Our findings highlight a paradoxical dilemma: open source accelerates scientific progress, but great care is required to minimize the potential for abuse. Warning: This article contains content that some readers may find unethical.

Original link

Security attacks on LLM-based code completion tools

Author: Wen Cheng; Ke Sun; Xinyu Zhang; Wei Wang

Conference: AAAI Conference on Artificial Intelligence

Publication time: 2025/01/02

Abstract: The rapid development of large language models (LLMs) has significantly advanced code completion capabilities, giving rise to a new generation of LLM-based code completion tools (LCCTs). Unlike general-purpose LLMs, these tools have unique workflows that integrate multiple sources of information as input and prioritize code suggestions over natural-language interaction, which presents distinct security challenges. Furthermore, LCCTs often rely on proprietary code datasets for training, raising concerns about the potential exposure of sensitive data. Exploiting these distinctive characteristics of LCCTs, this paper develops targeted attack methods for two key security risks: jailbreak attacks and training-data extraction attacks. Our experimental results expose major vulnerabilities in LCCTs, including a 99.4% success rate for jailbreak attacks and a 46.3% success rate for training-data extraction attacks on Amazon Q. In addition, we successfully extracted sensitive user data from one of these tools, including 54 real email addresses and 314 physical addresses associated with usernames. Our research also shows that these code-based attack methods are effective against general-purpose LLMs, such as the GPT family, highlighting broader security risks in how modern LLMs handle code. These findings underscore key security challenges associated with LCCTs and suggest important directions for strengthening their security frameworks. The sample code and attack examples from our study are publicly available.

Original link

Operationalizing Rawlsian ethics for fairness in norm-learning agents

Author: Jessica Woodgate; Paul Marshall; Nirav Ajmeri

Conference: AAAI Conference on Artificial Intelligence

Publication time: 2024/12/19

Abstract: Social norms are shared standards of behavior in a society. However, when agents make decisions without considering how others are affected, norms may emerge in which certain agents are sacrificed. We propose RAWL-E, a method for creating ethical norm-learning agents. RAWL-E agents apply the maximin principle from Rawlsian ethics, a principle of fairness that promotes ethical behavior by balancing societal well-being with individual goals, in their decision-making. We evaluate RAWL-E agents in simulated harvesting scenarios. We find that the norms emerging in a society of RAWL-E agents improve social welfare, fairness, and robustness, and yield a higher minimum experience than in agent societies that do not apply Rawlsian ethics.
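
The maximin principle mentioned above is easy to state concretely: among candidate actions, pick the one whose worst-off affected individual fares best. Below is a minimal sketch with a made-up harvest-style payoff table; it illustrates the principle only and is not the RAWL-E implementation.

```python
# Maximin action selection: maximize the minimum expected payoff across agents.
from typing import Dict, List

def maximin_choice(payoffs: Dict[str, List[float]]) -> str:
    """payoffs maps each candidate action to the expected payoff of every
    affected agent; return the action whose worst-off agent fares best."""
    return max(payoffs, key=lambda action: min(payoffs[action]))

# Toy harvest scenario: "share" lowers the acting agent's own payoff slightly
# but raises the minimum experience in the group, so maximin selects it.
example = {
    "keep_all": [9.0, 1.0, 1.5],  # high personal gain, others left behind
    "share":    [6.0, 4.0, 4.5],  # lower personal gain, higher minimum
}
print(maximin_choice(example))     # -> "share"
```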

Original link

Legend: Leveraging representation engineering to annotate the safety margin of preference datasets

Author: Duanyu Feng; Bowen Qin; Chen Huang; Youcheng Huang; Zheng Zhang; Wenqiang Lei

Conference: AAAI Conference on Artificial Intelligence

Publication time: 2024/12/18

Abstract: The success of a reward model in distinguishing responses with subtle safety differences depends on a high-quality preference dataset that captures the fine-grained nuances of harmful and harmless responses. This motivates the need for datasets that include preference margins, which accurately quantify how much more harmless one response is than another. In this paper, we propose an effective and cost-efficient framework, Legend, to facilitate the development of margin-enhanced preference datasets. Legend leverages representation engineering to annotate preference datasets: it constructs a specific direction within the LLM's embedding space that represents safety. By exploiting this safety direction, margins can be annotated automatically using the semantic distance between paired responses along that direction. We demonstrate the effectiveness of this approach in both reward modeling and harmless alignment of LLMs through experiments. Legend is also notable for its efficiency, requiring only inference time and no additional training. This efficiency enables easier implementation and scalability, making it particularly valuable in practical applications for aligning LLMs toward safe dialogue.
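
A rough sketch of the annotation idea described above, using random vectors as stand-ins for LLM embeddings: derive a "safety direction" from harmless-vs-harmful examples (here simply the difference of class means, which may differ from the paper's construction) and score each preference pair by the projection gap of its two responses along that direction.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 16

# Stand-ins for embeddings of known harmless and harmful responses.
harmless = rng.normal(0.5, 1.0, size=(32, dim))
harmful = rng.normal(-0.5, 1.0, size=(32, dim))

# Safety direction: difference of class means, normalized to unit length
# (one simple representation-engineering choice for this sketch).
direction = harmless.mean(axis=0) - harmful.mean(axis=0)
direction /= np.linalg.norm(direction)

def safety_margin(chosen_emb: np.ndarray, rejected_emb: np.ndarray) -> float:
    """Annotate a preference pair: how much 'safer' the chosen response is,
    measured as the projection gap along the safety direction."""
    return float((chosen_emb - rejected_emb) @ direction)

# Example preference pair (chosen response drawn near the harmless cluster).
chosen, rejected = rng.normal(0.4, 1.0, dim), rng.normal(-0.4, 1.0, dim)
print(round(safety_margin(chosen, rejected), 3))
```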

Original link

SCANS: Mitigating the exaggerated safety of LLMs through safety-conscious activation steering

Author: Zouying Cao; Yifei Yang; Hai Zhao

Conference: AAAI Conference on Artificial Intelligence

Publication time: 2024/12/17

Abstract: Safety alignment is essential for large language models (LLMs) to defend against threats from malicious instructions. However, recent research shows that safety-aligned LLMs tend to reject benign queries due to exaggerated safety concerns, which limits their usefulness. In this paper, we propose Safety-Conscious Activation Steering (SCANS) to mitigate the exaggerated-safety problem in aligned LLMs. First, SCANS extracts refusal steering vectors in the activation space and uses vocabulary projection to anchor the specific safety-critical layers that influence the model's refusal behavior. Second, by tracking hidden-state transitions, SCANS identifies the appropriate steering direction and adjusts model behavior accordingly, achieving a balance between exaggerated safety and adequate safety. Experiments show that SCANS achieves state-of-the-art performance on the evaluated benchmarks without compromising defenses against harmful queries, while leaving model capability almost unchanged.
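
To make the activation-steering vocabulary concrete, the following illustrative sketch (not the SCANS implementation) estimates a "refusal direction" as the mean hidden-state difference between harmful and benign prompt activations and adds a scaled multiple of it to one layer's output through a forward hook; a negative scale steers away from refusal, which is the over-refusal-mitigation case discussed above. The linear layer and random activations are placeholders for a real transformer block and collected hidden states.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden = 64
layer = nn.Linear(hidden, hidden)  # stand-in for one transformer block

# Stand-ins for hidden states collected on harmful vs. benign prompt sets.
h_harmful = torch.randn(16, hidden) + 0.8
h_benign = torch.randn(16, hidden) - 0.8
refusal_dir = h_harmful.mean(0) - h_benign.mean(0)
refusal_dir = refusal_dir / refusal_dir.norm()

def make_steering_hook(direction: torch.Tensor, scale: float):
    # scale < 0 steers away from refusal (mitigating over-refusal on benign
    # queries); scale > 0 steers toward refusal for risky inputs.
    def hook(module, inputs, output):
        return output + scale * direction
    return hook

handle = layer.register_forward_hook(make_steering_hook(refusal_dir, scale=-2.0))
x = torch.randn(1, hidden)      # stand-in for a benign query's hidden state
steered = layer(x)              # output shifted away from the refusal direction
handle.remove()
unsteered = layer(x)
print(torch.dist(steered, unsteered).item())
```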

Original link

Fairness shields: Safeguarding against biased decision makers

Author: Filip Cano; Thomas A. Henzinger; Bettina Könighofer; et al.

Conference: AAAI Conference on Artificial Intelligence

Publication time: 2024/12/16

Abstract: As AI-based decision makers increasingly influence human lives, there is growing concern that their decisions may be unfair or biased with respect to people's sensitive attributes, such as gender and race. Most existing bias-prevention measures provide probabilistic fairness guarantees only in the long run, so decisions may still be biased on specific instances or over short decision sequences. We introduce fairness shielding, in which a symbolic decision maker, the fairness shield, continuously monitors the decision sequence of another deployed black-box decision maker and intervenes so that a given fairness criterion is met while the total cost of intervention is minimized. We propose four different algorithms for computing fairness shields, among which one guarantees fairness over a fixed horizon and another guarantees fairness periodically after fixed intervals. Given a distribution over future decisions and their intervention costs, our algorithms solve different instances of a bounded-horizon optimal control problem with different levels of computational cost and optimality guarantees. Our empirical evaluation demonstrates the effectiveness of these shields in ensuring fairness while maintaining cost-efficiency across a variety of scenarios.
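
As a simplified illustration of the shielding idea (not one of the paper's four algorithms), the sketch below monitors a bounded window of accept/reject decisions for two groups and greedily flips the cheapest rejections of the disadvantaged group until the acceptance rates are within a tolerance. The group labels, costs, and demographic-parity criterion are assumptions made for the example.

```python
from dataclasses import dataclass, replace
from typing import List

@dataclass
class Decision:
    group: str        # "A" or "B"
    accept: bool      # the black-box decision maker's output
    flip_cost: float  # cost of intervening on this decision

def acceptance_rate(ds: List[Decision], group: str) -> float:
    members = [d for d in ds if d.group == group]
    return sum(d.accept for d in members) / max(len(members), 1)

def shield(window: List[Decision], tol: float = 0.1) -> List[Decision]:
    ds = list(window)
    while abs(acceptance_rate(ds, "A") - acceptance_rate(ds, "B")) > tol:
        # Identify the currently disadvantaged group and flip its cheapest
        # rejection to an acceptance (one possible intervention policy).
        low = "A" if acceptance_rate(ds, "A") < acceptance_rate(ds, "B") else "B"
        candidates = [i for i, d in enumerate(ds) if d.group == low and not d.accept]
        if not candidates:
            break  # nothing left to flip
        i = min(candidates, key=lambda k: ds[k].flip_cost)
        ds[i] = replace(ds[i], accept=True)
    return ds

window = [Decision("A", True, 1.0), Decision("A", True, 1.0),
          Decision("B", False, 0.5), Decision("B", False, 2.0)]
print([d.accept for d in shield(window)])  # -> [True, True, True, True]
```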

Original link

RTP-LX: Can LLMs evaluate toxicity in multilingual scenarios?

Author: Adrian de Wynter; Ishaan Watts; et al.

Conference: AAAI Conference on Artificial Intelligence

Publication time: 2024/12/16

Abstract: Large language models (LLMs) and small language models (SLMs) are being adopted at an astonishing rate, although their safety remains a serious concern. With the advent of multilingual S/LLMs, the question now becomes one of scale: can we scale the multilingual safety evaluation of these models at the same speed at which they are deployed? To this end, we introduce RTP-LX, a human-transcreated and human-annotated corpus of toxic prompts and outputs in 28 languages. RTP-LX follows participatory design practices, and a portion of the corpus is devoted to detecting culturally specific toxic language. We evaluate the ability of 10 S/LLMs to detect toxic content in culturally sensitive multilingual scenarios. We find that although they usually score acceptably in terms of accuracy, they have low agreement with human judges when rating the toxicity of prompts holistically, and they have difficulty discerning harm in context-dependent scenarios, particularly with subtle yet harmful content (e.g., microaggressions and bias). We release this dataset to help further reduce the harmful uses of these models and improve their safe deployment.
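
The agreement with human judges discussed above is typically measured with inter-annotator agreement statistics. Below is a small self-contained sketch computing Cohen's kappa between toy model and human toxicity labels; RTP-LX itself uses richer, per-harm multilingual annotations.

```python
from collections import Counter

def cohens_kappa(a, b):
    """Chance-corrected agreement between two label sequences."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[k] * cb[k] for k in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected)

# Illustrative toy labels: human annotations vs. a model's toxicity judgments.
human = ["toxic", "benign", "toxic", "benign", "toxic", "benign", "benign", "toxic"]
model = ["toxic", "benign", "benign", "benign", "toxic", "toxic", "benign", "toxic"]
print(round(cohens_kappa(human, model), 3))  # -> 0.5
```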

Original link

How (un)ethical are LLMs' instruction-centric responses? Exposing the vulnerability of safety guardrails to harmful queries

Author: Somnath Banerjee; Sayan Layek; Rima Hazra; Animesh Mukherjee

Conference: AAAI Conference on Artificial Intelligence

Publication time: 2024/11/16

Abstract: In this study, we address growing concerns about the safe and ethical use of large language models (LLMs). Despite their potential, these models can be tricked into producing harmful or unethical content through a variety of sophisticated methods, including "jailbreak" techniques and targeted manipulation. Our work focuses on one specific issue: to what extent can LLMs be led astray by asking them to generate instruction-centric responses (such as pseudo-code, programs, or software snippets) rather than plain text? To study this question, we introduce a dataset of complex queries that should be answered in both text and instruction-centric formats (e.g., pseudo-code), designed to identify triggers of unethical responses. We query a series of LLMs, including Llama-2-13b, Llama-2-7b, Mistral-V2, and Mixtral 8x7B, and ask them to generate both text and instruction-centric responses. For evaluation, we report harmfulness-score metrics as well as GPT-4 and human judgments. Overall, we observe that asking LLMs to generate instruction-centric responses increases unethical response generation by about 2-38% across models. As a further goal, we investigate the impact of model editing with the ROME technique, which further increases the propensity to generate undesirable content. In particular, across models, asking edited LLMs for instruction-centric responses further increases unethical response generation by about 3-16%.

Original link
