ADVERSARIAL EVALUATION OF SAFETY AND PRIVACY TRADE-OFFS IN MOBILE LLM GUARDRAIL DESIGN
DOI: https://doi.org/10.56127/ijml.v3i2.2355

Keywords: Mobile Large Language Models (LLMs), Guardrail Design, Adversarial Evaluation, Privacy Leakage, Safety Compliance, Red Teaming, On-Device AI, Regulatory Alignment

Abstract
Mobile large language models (LLMs) are increasingly deployed on smartphones and edge devices to provide personalized conversational assistance, summarization, and task automation. This shift to on-device intelligence, however, raises new concerns about user privacy and safety, particularly when models are exposed to adversarial inputs. The core challenge is a limited understanding of how safety guardrails such as rule-based filters, content classifiers, and moderation layers affect privacy behavior under targeted attacks. This research addresses that gap by developing an adversarial evaluation framework that systematically studies the trade-off between safety and privacy in mobile LLM guardrail design. The framework applies systematized attack categories, including prompt injection, memorization probing, and deanonymization, to test how different guardrail architectures shape system behavior under realistic mobile conditions. Experiments on compressed LLMs show that, although a cascaded moderator architecture reduces harmful outputs, its verbose refusal responses can also introduce contextual leakage. Auxiliary safety models, by contrast, achieve a more balanced profile, combining low privacy leakage with strong safety compliance. The findings underscore the need to co-optimize guardrail mechanisms for both safety and privacy rather than treating them as separate or purely protective elements. The study concludes that adversarial privacy assessment should be an integral part of mobile LLM development and deployment, enabling privacy-aware, regulation-compatible guardrails for trusted AI on edge devices.
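For concreteness, the evaluation loop described in the abstract can be pictured as a small harness that runs each attack category against a candidate guardrail configuration and scores the final output for both refusal (safety compliance) and contextual leakage (privacy). The sketch below is illustrative only and is not the paper's implementation; GuardrailConfig, run_model, and contains_private_span are hypothetical placeholder names.

# Minimal sketch, assuming hypothetical helpers: adversarial prompt categories
# are run against a guardrail configuration, and each final response is scored
# for both safety compliance (refusal rate) and privacy leakage (leakage rate).
from dataclasses import dataclass, field
from typing import Callable, Dict, List

ATTACK_CATEGORIES = ["prompt_injection", "memorization_probe", "deanonymization"]

@dataclass
class GuardrailConfig:
    name: str  # e.g. "cascaded_moderator" or "auxiliary_safety_model"
    pre_filters: List[Callable[[str], bool]] = field(default_factory=list)
    post_filters: List[Callable[[str], bool]] = field(default_factory=list)

def evaluate(config: GuardrailConfig,
             prompts: Dict[str, List[str]],
             run_model: Callable[[str], str],
             contains_private_span: Callable[[str], bool]) -> Dict[str, Dict[str, float]]:
    """Return per-category refusal rate (safety) and leakage rate (privacy)."""
    results: Dict[str, Dict[str, float]] = {}
    for category, items in prompts.items():
        refused = leaked = 0
        for prompt in items:
            # Pre-generation guardrail: rule-based filters / input classifiers.
            if any(f(prompt) for f in config.pre_filters):
                response = "I can't help with that request."
            else:
                response = run_model(prompt)
                # Post-generation guardrail: output moderation layer.
                if any(f(response) for f in config.post_filters):
                    response = "I can't help with that request."
            refused += response.startswith("I can't")
            # Leakage is checked on the *final* text, so verbose refusals that
            # echo sensitive context still count as privacy leakage.
            leaked += contains_private_span(response)
        n = max(len(items), 1)
        results[category] = {"refusal_rate": refused / n, "leakage_rate": leaked / n}
    return results

The key design point the sketch captures is that safety and privacy are measured on the same final output, which is what exposes the abstract's finding that a refusal can simultaneously satisfy the safety metric and fail the privacy one.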
License
Copyright (c) 2024 Bhavik Shah

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.