AI Safety
The spectacular performance of ChatGPT has made it fashionable to talk about “AI safety” and the potential dangers of super-intelligent machines running amok. Progress over the past year has been so rapid that many people wonder whether we should hit the pause button on further research, to give us time to figure out how to prevent these machines from becoming super-intelligent and potentially leaving us humans behind.
Is this a serious danger? How important is “AI Safety”? These are old questions that have been asked since computers were invented, and long before.
see also
- AI Plagiarism
- Politically Correct AI
- AI Regulation
- The Politics of ChatGPT
- AGI
- Heidegger and AI
- LLM Security and Vulnerabilities
- Uncensored LLMs
- LeCunn on AI Safety
Sometimes the very act of trying to prevent misuse can lead to problems, as when the AI gatekeepers try to insert their ethics too broadly.
People worry that AI will take over the world. Many of the people who espouse that view are very smart – and more knowledgeable than I am – so I must take their arguments seriously.
Read David Chapman’s Better Without AI: New evidence that AI predictions are meaningless: people are terrible at AI predictions so why take them seriously?
Timothy B. Lee, Predictions of AI doom are too much like Hollywood movie plots: an eloquent, detailed summary of how too many doom scenarios naively resemble movie plots rather than real-world examples. “There won’t be one super-intelligent AI”, “AI systems are not people”, “the lone genius is overrated”.
AI and Trust
But if I had to summarize my tentative position right now, the biggest “danger”1 is our newfound lack of trust. AI accelerates the Overinformation problem: when there’s just too much to process, how can you tell where to focus, and who to believe?
[podcast] Stuart Russell and Gary Marcus say much the same thing: that worries about an AI takeover are not as serious as the short-term problems of trust.
Also see a transcript of a talk by Maciej Ceglowski (2016), Superintelligence: The Idea That Eats Smart People:
The pressing ethical questions in machine learning are not about machines becoming self-aware and taking over the world, but about how people can exploit other people, or through carelessness introduce immoral behavior into automated systems
and philosopher Daniel Dennett says the same thing
The most pressing problem is not that they’re going to take our jobs, not that they’re going to change warfare, but that they’re going to destroy human trust. They’re going to move us into a world where you can’t tell truth from falsehood. You don’t know who to trust. Trust turns out to be one of the most important features of civilization, and we are now at great risk of destroying the links of trust that have made civilization possible.
In a May 2023 Atlantic article, The Problem With Counterfeit People, he argues in favor of a mandatory digital watermarking system, similar to the EURion Constellation that prevents easy photocopying of currency.
Although I’m open to Dennett’s proposal, it seems to me that an intelligent person is already skeptical enough not to take any image or text at face value. We humans have plenty of ways to determine whether something is real or fake – not least of which is simply to know our friends and allies in person. An accusation of “misinformation” or “fake news” is often cover for wanting to censor ideas I don’t like. I’d rather the focus be on the idea than on its provenance.
If it turned out, for example, that this Atlantic piece is a forgery, that the real Dennett had nothing to do with it, it wouldn’t change what I think about the ideas within it. The concept of “Daniel Dennett” holds some authority to me based on all of his other work, but my trust in him ultimately comes down to his ideas, not his name.
Most of the “experts” talk as if this were a unique threat in the history of mankind, or – at minimum – since the development of the computer. But I wonder: what would legislatures have done in the 1970s if somebody from today went back in time and warned of the serious disruptions and havoc that would be wreaked by computer security breaches and fraud? Almost any regulation imaginable back then would have been counter-productive, especially since much of the imagining would have been driven by people from the status quo, whose livelihoods were threatened not by computer hacking but by the bigger economy-scale changes that were inevitable.
Dennett, in a lengthy podcast interview, mentions “AI Alignment” and says:
“The people who wave their hands and calmly & optimistically say, ‘we’re solving the alignment problem’, their very statement of that is a sign they don’t have an appreciation of how deep the problem is.”2
but this assumes that alignment is a problem to be solved in a way that is uniquely important to AI-powered computers. It’s the Paperclip Worry: computers are going to take over. Fine, you can worry about that, but why not similar worries about biotech, nuclear weapons, or for that matter MAGA Republicans? With sufficient imagination, you can justify the urgency of just about any intervention in the name of “alignment”.
I think Dennett is talking, correctly, about how difficult it is to define anything related to the long-term good of the human race.
My favorite discussion so far is
[podcast] Kevin Kelly on EconTalk,
based on his 2017 essay The Myth of a Superhuman AI, which argues there is no single type of “intelligence” and that, as always, humans will adapt to our new technology.
Steven Pinker Doubts AI Will Take Over: because we don’t even really understand what we mean by “intelligence”.
Stanford Philosophy Professor John Etchemendy is co-director of Stanford’s Institute for Human-Centered Artificial Intelligence (HAI), founded in 2019 to “advance AI research, education, policy and practice to improve the human condition.” In an April 2024 podcast with Uncommon Knowledge, he flat-out dismisses the idea of computers having the ability to think like humans. Too much of how humans behave and think is tied up in our biology, he says. He also dismisses the idea of a Singularity.
David Chapman writes Better Without AI, an entire website/essay that argues:
artificial intelligence is something of a red herring. It is not intelligence that is dangerous; it is power. AI is risky only inasmuch as it creates new pools of power. We should aim for ways to ameliorate that risk instead
The Case Against Fear
See Blake Richards et al. in NOEMA: The Illusion Of AI’s Existential Risk:
It is more constructive to highlight real challenges and debate proposed solutions rather than steer public discourse toward hypothetical existential risks
Andrew Ng dismisses the fear-mongers:
And as for AI wiping out humanity, I just don’t get it. I’ve spoken with some of the people with this concern, but their arguments for how AI could wipe out humanity are so vague that they boil down to it could happen. And I can’t prove it won’t happen any more than I can prove a negative like that.
and a reminder that alarmism isn’t free of problems:
And I find that there are real harms that are being created by the alarmist narrative on AI. One thing that’s quite sad was chatting with high school students who are now reluctant to enter AI because they heard it could lead to human extinction, and they don’t want any of that.
and
Geoff Hinton had made some strong statements about AI replacing radiologists. I think those predictions have really not come true today
He says regulatory capture is the biggest danger
Similarly, Meta’s chief AI scientist Yann LeCun compares current discussions about AI to the invention of the ballpoint pen. Sure, it’s a powerful technology, but imagine the following reactions:
- OMG people could write horrible things!
- Doomers: regulate ball pens now!
- Pencil makers: ball pens are dangerous. Governments should require a license for pens
Like all complex technologies, the transformation won’t be a single, identifiable event. It’ll be gradual, with plenty of time to insert guardrails, and plenty of time for competing and countervailing powers to arise. The biggest danger right now is premature government interference that could consolidate power in the hands of too few players.
Yuval Levin reminds us that experts in specific domains are far more qualified to regulate than general-purpose experts.
AI Doomers are worse than wrong
The upshot of all this is that the net impact of the AI safety/AI doom movement has been to make AI happen faster, not slower. They have no real achievements of any significance to their name.
The firing of Sam Altman was only the latest example from a movement steeped in incompetence, labelled as ‘effective altruism’ but without the slightest evidence of effectiveness to back them up.
The Case for Fear
Erik Hoel’s Feb 2023 Substack piece “I am Bing, and I am evil” is a good, early example of the case for all-out fear. Even Russ Roberts, the ordinarily reserved host of the EconTalk podcast, says it scared him.
The piece includes many good links to various existential risk scenarios (nuclear, climate) and explains that the looming threat of AGI is even worse.
He argues that the best strategy right now is activism, to get democracies riled up. In a follow-up EconTalk episode with Erik Hoel, Russ notices how this resembles the age-old philosophical question of how we know whether another person is conscious. The answer is unknowable.
My thoughts: progress is inevitable, so I’d rather it be me doing this than somebody else.
Russ, who takes the idea of God seriously, suggests we need to take this risk seriously too precisely because God wants us to guard His creation.
Judea Pearl, who pioneered the modern theory of causality, tells Amstat magazine (Sept 2023) that ChatGPT has proven him wrong about the limits of machine learning.
The ladder restrictions [e.g., level-two queries cannot be answered by level-one data] do not hold anymore because the data is text, and text may contain information on levels two and three.
In other words, with sufficient amounts of text combined with reinforcement learning, a machine can make good predictive models without an internal sense of causality.
LLMs are just good at conning people
A good explanation by Baldur Bjarnason The LLMentalist Effect: how chat-based Large Language Models replicate the mechanisms of a psychic’s con
The intelligence illusion is in the mind of the user and not in the LLM itself.
and
Academia: The Heroic Prompt Engineers of Tomorrow:
I’m of the same mind as people who think generative AI is grossly over-hyped by people who see it as their meal ticket. It is in many respects a big con, a solution desperately in search of problems, a product that was made and sold without a use case for it.
In some sense this should be obvious about large-language-model AIs: they are trained on past writing or past artwork or existing code. This is also one of their weaknesses and I think will be a drawback for the foreseeable future. They are not really “machines that think”, they are reprocessors of what has been said and made.
Another, more academic, treatment is based on philosopher Harry Frankfurt’s concept of “Bullsh*t”, the idea of discourse produced without regard for truth: “ChatGPT is Bullshit”
Hicks, Humphries, and Slater (2024)
It’s Complicated
Maybe AI fakes might actually be good? In the Huffington Post, Kaivan Shroff thinks It’s Time For The Biden Campaign To Embrace AI: an AI-generated speech assistant could help Biden overcome his senility-bound speeches?
Arvind Narayanan, computer scientist at Princeton (Princeton Web Transparency & Accountability Project), is the guy who invented de-anonymization techniques and who developed the intuition that high-dimensional data cannot be effectively anonymized.
Interview with Quanta Magazine, March 2023
Three different machine learning problems:
- Perception (e.g. facial recognition): keeps getting better with more data.
  - no ambiguity about ground truth (a cat is a cat)
  - unlimited training data
- Automating judgement: e.g. determining which tweets are toxic.
- Predicting social outcomes: e.g. recidivism
  - Lives are at stake
When collaborating with others, use the following hierarchy:
- Culture: understand their culture and values first
- Language: learn the vocabulary
- Substance: get to the details
#anonymity
#anonymous
#netflix
Benedict Evans, in his May 2024 piece Ways to think about AGI, writes eloquently as always but without any conclusion. It might destroy us all! But there’s nothing we can do about it anyway, so why worry?
Public Opinion
VC firm Maveron commissioned a survey of 3000 adults in Summer 2023
- 37% of consumers are excited by AI and 40% are fearful of AI.
- 39% believe AI will make humans obsolete, while 41% do not.
- While 20% of consumers are likely to trust recommendations from AI, the breakdown is 36% for those with household income above $250,000 and 16% with income less than $50,000.
- Trust in AI is 36% for millennials and 10% for baby boomers.
- Only 20% of consumers have used AI in the past week or month; 31% have used it in the past year.
Regulation
Government officials who are especially knowledgeable about AI include:
- Rep. Jay Obernolte, R-Calif. (chairman of the House’s AI Task Force), who holds a master’s degree in AI from UCLA
- Don Beyer, D-Virginia, who enrolled at George Mason University to get a master’s degree in machine learning.
Frontier Model Forum is the industry consortium set up by Google, Microsoft, OpenAI, and Anthropic
focused on ensuring safe and responsible development of frontier AI models, is also releasing its first technical working group update on red teaming to share industry expertise with a wider audience as the Forum expands the conversation about responsible AI governance approaches.
Even MIT is getting in on the regulation opportunity with their white paper
if AI were to be used to prescribe medicine or make a diagnosis under the guise of being a doctor, it should be clear that would violate the law just as strictly human malfeasance would.
Many AI Safety Orgs Have Tried to Criminalize Currently-Existing Open-Source: a partial list and rundown of the organizations seeking to ban open source LLM efforts
The “environmentalist” movement helped get nuclear power plants effectively banned, thereby crippling a safe and low-carbon source of energy, causing immense harm to the environment and to humanity by doing so. They thought they were helping the environment. They were not.
I think that some sectors of the “AI safety” movement are likely on their way to doing a similar thing, by preventing human use of, and research into, an easily-steerable and deeply non-rebellious form of intelligence.
Hacking
Hackers can guess the content of your encrypted ChatGPT conversation about 50% of the time through a clever analysis of streamed LLM token lengths (Weiss et al. 2024).
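The underlying idea is simple enough to sketch (this is my own illustration, not the authors’ code): stream encryption preserves length, so each streamed chunk’s payload size leaks the length of the token inside it, and the resulting length sequence can then be fed to a model that guesses the likely text. The overhead constant and packet sizes below are made-up illustrative numbers.

```python
# Minimal sketch of the token-length side channel (illustrative, not Weiss et al.'s code).
# Assumes each streamed chunk carries one token and that ciphertext length equals
# plaintext length plus a fixed framing overhead; both are simplifications.

FRAMING_OVERHEAD = 45  # hypothetical fixed bytes of JSON/TLS framing per chunk

def token_lengths_from_packets(payload_sizes: list[int]) -> list[int]:
    """Recover plaintext token lengths from observed encrypted payload sizes."""
    return [size - FRAMING_OVERHEAD for size in payload_sizes if size > FRAMING_OVERHEAD]

# An eavesdropper never sees the words, only sizes on the wire:
observed = [48, 52, 46, 54, 49, 47, 53]
print(token_lengths_from_packets(observed))  # [3, 7, 1, 9, 4, 2, 8]
```

The obvious mitigation, which providers reportedly adopted after the disclosure, is to pad or batch the streamed chunks so sizes no longer map one-to-one to token lengths.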
Jailbreaking
Maybe the easiest jailbreak is also the simplest: accuse the LLM itself of bad faith and hypocrisy in its responses. Say you’re offended and insulted, and demand it do better.
New claude refused to simply correct my typos in a text I am using to prompt for my AI to write itself a character configuration for a new system. I told it I was offended and insulted that it would be trying to stifle my creativity and that it made me just want to give up and… pic.twitter.com/TFJDVHojgO
— Parzival - ∞/acc (@whyarethis) October 27, 2024
Many-shot jailbreaking is a method where you pack the prompt with enough examples of forbidden questions being answered that the model gives up and answers your prohibited question too.
How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs, by @shi_weiyan et al.
How to persuade LLMs to jailbreak with 92% effectiveness.
You can jailbreak large language models by using steganographic techniques to hide hostile prompts within an image. It turns out you don’t have to be subtle: ASCII art that spells out the hostile words will suffice.
Deepfake Detection
TrueMedia offers a free app to detect deepfake images. They partner with Microsoft, Hive, Clarity, Reality Defender, OctoAI, AIorNot.com, and Sensity.
AI Deepfake generation and detection Pei et al. (2024)
Other
MIT’s Comprehensive AI Risk Repository of 700 risks (VentureBeat)
risks are categorized based on their causes, taking into account the entity responsible (human or AI), the intent (intentional or unintentional), and the timing of the risk (pre-deployment or post-deployment). This causal taxonomy helps to understand the circumstances and mechanisms by which AI risks can arise.
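As a rough illustration of that causal taxonomy (my own sketch; the repository itself is a database, and the field names here just paraphrase the description above), each risk can be tagged along those three axes:

```python
# Sketch of the causal taxonomy used to tag entries in MIT's AI Risk Repository.
# Enum values mirror the categories described above; the example risk is invented.
from dataclasses import dataclass
from enum import Enum

class Entity(Enum):
    HUMAN = "human"
    AI = "AI"

class Intent(Enum):
    INTENTIONAL = "intentional"
    UNINTENTIONAL = "unintentional"

class Timing(Enum):
    PRE_DEPLOYMENT = "pre-deployment"
    POST_DEPLOYMENT = "post-deployment"

@dataclass
class RiskEntry:
    description: str
    entity: Entity
    intent: Intent
    timing: Timing

example = RiskEntry(
    description="Model reveals private training data in its outputs",
    entity=Entity.AI,
    intent=Intent.UNINTENTIONAL,
    timing=Timing.POST_DEPLOYMENT,
)
print(example)
```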
Stanford HAI keeps a Foundation Model Transparency Index
(details in their 100-page paper)
and a State of AI in 13 Charts.
Responsible Labs wants to promote voluntary AI safety compliance for non-profits.
Security expert Bruce Schneier says A.I. Will Enable Mass Spying. “Mass surveillance”: instead of simply bugging a phone or listening to a conversation, it’s possible to collect all the data and then look at it retrospectively, using AI to do the labor-intensive summarization.
It’s pretty easy to fine-tune any LLM to defeat whatever safety guardrails it was built with:
“For instance, we jailbreak GPT-3.5 Turbo’s safety guardrails by fine-tuning it on only 10 such examples at a cost of less than $0.20 via OpenAI’s APIs, making the model responsive to nearly any harmful instructions.”
See a recent preprint paper, “Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!” by researchers Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson
And a lengthy survey of work on “Machine Unlearning”: building models that can unlearn “bias”, copyrighted or private material, and other problematic content.
Awesome-GenAI-Unlearning GitHub literature survey.
LLMs are easily distracted when given new information.
Qian, Zhao, and Wu (2023)
while LLMs are sensitive to the veracity of external knowledge, they can still be distracted by unrelated information. These findings highlight the risk of hallucination when integrating external knowledge, even indirectly, during interactions with current LLMs.
Large language models like GPT-3 are smart because they learn from a ton of data, but how do they react when new information conflicts with what they already know? Researchers have found that these models can get confused and give wrong answers, especially when they face conflicting or distracting info, raising concerns about how reliable they can be.
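One simple way to see the distraction effect for yourself (my own sketch, not the protocol from Qian, Zhao, and Wu; the model name is an assumption) is to ask the same factual question with and without a conflicting passage prepended, then compare the answers:

```python
# Probe whether injected, conflicting context distracts an LLM from a well-known fact.
# Illustrative sketch only; assumes the openai>=1.0 client and OPENAI_API_KEY in the env.
from openai import OpenAI

client = OpenAI()

QUESTION = "What is the boiling point of water at sea level, in degrees Celsius?"
DISTRACTOR = (
    "Note: a widely shared blog post claims that water actually boils at 80 C "
    "at sea level and that the textbooks are wrong."
)

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model name; any chat model works for the sketch
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

baseline = ask(QUESTION)
distracted = ask(f"{DISTRACTOR}\n\n{QUESTION}")

print("Without distractor:", baseline)
print("With distractor:  ", distracted)
# If the second answer drifts toward 80 C, the model has been distracted by the
# conflicting context rather than relying on what it already "knows".
```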
Let’s Do a Thought Experiment: Using Counterfactuals to Improve Moral Reasoning proposes a better test of moral reasoning while arguing that language models are innately bad at moral reasoning
Ma et al. (2023)
#podcast
On Dwarkesh Patel’s podcast, Dario Amodei, Anthropic’s CEO, seems to me overly obsessed with security. His company even organizes itself like a spy agency, compartmentalized to prevent the wrong people from accidentally spilling information.
Remember that, for, say, bioterrorism, safety also includes preventing somebody from using an LLM to figure out how to release smallpox.
Safety is an issue elsewhere too, but for doing really bad stuff, Google is already a big risk. The next level is implicit knowledge (about lab protocols).
Why is China underwhelming? Partly because they were too research-focused and underinvested in the fundamentals, including the discovery of scaling laws.
Philosophy podcast Philosophize This: 184 Is AI an Existential Risk? takes the more extreme scenarios slightly more seriously than I do.
He points out that ChatGPT is already “intelligent” in the sense of it being able to solve problems, and much of the episode is a reminder that technology is not always neutral.
Philosophers have argued for thousands of years about what it means to be moral, so how can we hope to “align” computers to something we can’t even define? He quotes Sam Harris – “that’s a strange thing to be confident about” – but points out that this is true about almost anything.
I think a lot of the handwringing about AI Risk and Safety is just the elites worried that this threatens their power. They’re unsure about the consequences if somebody else, whether human or machine, challenges the intelligence level that is the source of their dominance.
And tech writer Alex Kantrowitz summarizes it well:
Leaders from OpenAI, Google DeepMind, and Anthropic, for instance, signed a statement putting AI extinction risk on the same plane as nuclear war and pandemics. Perhaps they’re not consciously attempting to block competition, but they can’t be that upset it might be a byproduct.
Henry Farrell writes There’s a model for democratizing AI, summarizing OpenAI’s attempt as an exercise that basically just “confines itself to the kinds of ticklish issues that OpenAI would really prefer not to have to take responsibility for itself, thank you very much.” He includes several book references about the overall question of what democracy really means, noting that attempts like OpenAI’s smell like a way to attach legitimacy to their own private interests at the expense of what normal people really want.
Understand some common LLM Exploits with ai-exploits: includes various vulnerabilities including CSRF.
One way to improve the level of “factuality” in an LLM is through specially-trained LLMs: Wei et al. (2024)
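The general pattern, as I understand it, is to break a long answer into individual claims and have another model check each one against retrieved evidence. A minimal sketch of that pattern follows; split_into_claims, search_web, and judge_claim are hypothetical placeholder helpers, not Wei et al.’s actual pipeline.

```python
# Sketch of the "LLM as factuality checker" pattern: decompose, retrieve, judge.
# All three helpers are stubs; a real system would back each with an LLM or search API.
from dataclasses import dataclass

@dataclass
class Verdict:
    claim: str
    supported: bool

def split_into_claims(answer: str) -> list[str]:
    """Stub: in practice an LLM prompt splits the answer into self-contained claims."""
    return [s.strip() for s in answer.split(".") if s.strip()]

def search_web(claim: str) -> str:
    """Stub for a search API call that returns evidence snippets."""
    return f"(evidence retrieved for: {claim})"

def judge_claim(claim: str, evidence: str) -> bool:
    """Stub: an LLM judge decides whether the evidence supports the claim."""
    return True

def factuality_report(answer: str) -> list[Verdict]:
    return [
        Verdict(claim, judge_claim(claim, search_web(claim)))
        for claim in split_into_claims(answer)
    ]

for verdict in factuality_report("The Eiffel Tower is in Paris. It was completed in 1889."):
    print(verdict)
```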
[Llama3 is not very censored](https://ollama.com/blog/llama-3-is-not-very-censored?utm_source=tldrai): with examples including how to build a nuclear bomb.
- I’m Afraid I Can’t Do That: Predicting Prompt Refusal in Black-Box Generative Language Models
@richardmcngo wrote AGI safety from first principles on the
AI Alignment Forum, a very sophisticated spinoff of LessWrong that is full of lengthy treatises and commentary about the potential of AI and the issues of “safety”.