AI Safety

The spectacular performance of ChatGPT has made it fashionable to talk about “AI safety” and the potential dangers of super-intelligent machines running amok. Progress over the past year has been so rapid that many people wonder whether we should hit the pause button on further research, to give us time to figure out how to prevent these machines from becoming super-intelligent and potentially leaving us humans behind.

Is this a serious danger? How important is “AI Safety”? These are old questions that have been asked since computers were invented, and long before.


see also

Sometimes the very act of trying to prevent misuse can lead to problems when the AI gatekeepers try to insert their ethics too broadly:

Generative AI Misuse


People worry that AI will take over the world. Many of the people who espouse that view are very smart – and more knowledgeable than I am – so I must take their arguments seriously.

Read David Chapman’s Better Without AI: New evidence that AI predictions are meaningless: people are terrible at AI predictions so why take them seriously?

Timothy B. Lee, Predictions of AI doom are too much like Hollywood movie plots: an eloquent, detailed summary of how too many doom scenarios naively resemble movie plots rather than real-world examples. “There won’t be one super-intelligent AI”, “AI systems are not people”, “the lone genius is overrated”.

AI and Trust

But if I had to summarize my tentative position right now, the biggest “danger”1 is our newfound lack of trust. AI accelerates the Overinformation problem: when there’s just too much to process, how can you tell where to focus, and whom to believe?

[podcast] Stuart Russell and Gary Marcus say much the same thing: that worries about an AI takeover are not as serious as the short-term problems of trust.

Also, a transcript of a talk by Maciej Ceglowski (2016), Superintelligence: The Idea That Eats Smart People:

The pressing ethical questions in machine learning are not about machines becoming self-aware and taking over the world, but about how people can exploit other people, or through carelessness introduce immoral behavior into automated systems

and philosopher Daniel Dennett says the same thing:

The most pressing problem is not that they’re going to take our jobs, not that they’re going to change warfare, but that they’re going to destroy human trust. They’re going to move us into a world where you can’t tell truth from falsehood. You don’t know who to trust. Trust turns out to be one of the most important features of civilization, and we are now at great risk of destroying the links of trust that have made civilization possible.

In a May 2023 Atlantic article, The Problem With Counterfeit People, he argues in favor of a mandatory digital watermarking system, similar to the EURion Constellation that prevents easy photocopying of currency.
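Dennett doesn’t spell out a mechanism, but to make the idea concrete, here is a minimal sketch of one family of proposals: statistical token watermarks, where the generator prefers words from a pseudo-random “green list” keyed to the previous word, and a detector simply counts how many transitions land in the green set. Everything here (the hashing scheme, the shared key, the word-level granularity) is my own toy illustration under those assumptions, not Dennett’s proposal or any production system.

```python
import hashlib
import math

def green_fraction(words, key="shared-secret"):
    """Fraction of word transitions that fall in the pseudo-random 'green list'
    derived from the previous word plus a shared key. A watermarking generator
    would have preferentially chosen green words; ordinary text lands near 50%."""
    green = 0
    for prev, word in zip(words, words[1:]):
        digest = hashlib.sha256(f"{key}|{prev}|{word}".encode()).digest()
        if digest[0] % 2 == 0:  # roughly half the vocabulary is "green" for any given context
            green += 1
    return green / max(len(words) - 1, 1)

def watermark_z_score(words, key="shared-secret"):
    """How many standard deviations the green count sits above the 50% expected
    for unwatermarked text; large positive values suggest a watermark."""
    n = max(len(words) - 1, 1)
    return (green_fraction(words, key) - 0.5) * math.sqrt(n) / 0.5

text = "the quick brown fox jumps over the lazy dog".split()
print(green_fraction(text), watermark_z_score(text))
```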

Although I’m open to Dennett’s proposal, it seems to me that an intelligent person is already skeptical enough not to take any image or text at face value. We humans have plenty of ways to determine whether something is real or fake, not least of which is simply to know our friends and allies in person. An accusation of “misinformation” or “fake news” is often a cover for wanting to censor ideas the accuser doesn’t like. I’d rather the focus be on the idea than on its provenance.

If it turned out, for example, that this Atlantic piece is a forgery, that the real Dennett had nothing to do with it, it wouldn’t change what I think about the ideas within it. The concept of “Daniel Dennett” holds some authority to me based on all of his other work, but my trust in him ultimately comes down to his ideas, not his name.

Most of the “experts” talk as if this were a unique threat in the history of mankind, or, at minimum, since the development of the computer. But I wonder: what would legislatures have done in the 1970s if somebody from today had gone back in time and warned of the serious disruption and havoc that would be wreaked by computer security breaches and fraud? Almost any regulation imaginable back then would have been counter-productive, especially since much of the imagining would have been driven by people invested in the status quo, whose livelihoods were threatened not by computer hacking but by the bigger, economy-scale changes that were inevitable.


Dennett, in a lengthy podcast interview, mentions “AI Alignment” and says:

“The people who wave their hands and calmly & optimistically say, ‘we’re solving the alignment problem’, their very statement of that is a sign they don’t have an appreciation of how deep the problem is.”2

but this assumes that alignment is a problem to be solved in a way that is uniquely important to AI-powered computers. It’s the Paperclip Worry: computers are going to take over. Fine, you can worry about that, but why not similar worries about biotech, nuclear weapons, or for that matter MAGA Republicans? With sufficient imagination, you can justify the urgency of just about any intervention in the name of “alignment”.

I think Dennett is talking, correctly, about how difficult it is to define anything related to the long-term good of the human race.


My favorite discussion so far is

[podcast] Kevin Kelly Econtalk

based on his 2017 essay The Myth of a Superhuman AI, which argues that there is no single type of “intelligence” and that, as always, humans will adapt to our new technology.


Steven Pinker Doubts AI Will Take Over: because we don’t even really understand what we mean by “intelligence”.

Stanford Philosophy Professor John Etchemendy is co-director of Stanford’s Institute for Human-Centered Artificial Intelligence (HAI), founded in 2019 to “advance AI research, education, policy and practice to improve the human condition.” In an April 2024 Uncommon Knowledge podcast, he flat-out dismisses the idea of computers having the ability to think like humans: too much of how humans behave and think is tied up in our biology, he says. He also dismisses the idea of a Singularity.


Vitalik Buterin on AI Risk

David Chapman writes Better Without AI, an entire website/essay that argues:

artificial intelligence is something of a red herring. It is not intelligence that is dangerous; it is power. AI is risky only inasmuch as it creates new pools of power. We should aim for ways to ameliorate that risk instead

The Case Against Fear

see Blake Richards et al in NOEMA: The Illusion Of AI’s Existential Risk

It is more constructive to highlight real challenges and debate proposed solutions rather than steer public discourse toward hypothetical existential risks

Andrew Ng dismisses the fear-mongers:

And as for AI wiping out humanity, I just don’t get it. I’ve spoken with some of the people with this concern, but their arguments for how AI could wipe out humanity are so vague that they boil down to “it could happen.” And I can’t prove it won’t happen any more than I can prove a negative like that.

and a reminder that alarmism isn’t free of problems:

And I find that there are real harms being created by the alarmist narrative on AI. One thing that’s quite sad was chatting with high school students who are now reluctant to enter AI because they heard it could lead to human extinction and they don’t want any of that.

and

Geoff Hinton had made some strong statements about AI replacing radiologists. I think those predictions have really not come true today

He says regulatory capture is the biggest danger.

Similarly, Meta’s chief AI scientist Yann LeCun compares current discussions about AI to the invention of the ballpoint pen. Sure, it’s a powerful technology, but imagine the following reactions:

  • OMG people could write horrible things!
  • Doomers: regulate ball pens now!
  • Pencil makers: ball pens are dangerous. Governments should require a license for pens

Like all complex technologies, the transformation won’t be a single, identifiable event. It’ll be gradual, with plenty of time to insert guardrails, and plenty of time for competing and countervailing powers to arise. The biggest danger right now is premature government interference that could consolidate power in the hands of too few players.

Yuval Levin reminds us that experts in specific domains are far more qualified to regulate than general-purpose experts.

AI Doomers are worse than wrong

The upshot of all this is that the net impact of the AI safety/AI doom movement has been to make AI happen faster, not slower. They have no real achievements of any significance to their name.
The firing of Sam Altman was only the latest example from a movement steeped in incompetence, labelled as ‘effective altruism’ but without the slightest evidence of effectiveness to back them up.


The Case for Fear

Erik Hoel’s Feb 2023 Substack piece “I am Bing, and I am evil” is a good, early example of the case for all-out fear. Even Russ Roberts, the ordinarily reserved host of the Econtalk podcast, says it scared him.

The piece includes many good links to various existential risk scenarios (nuclear, climate) and explains that the looming threat of AGI is even worse.

Hoel argues that the best strategy right now is activism, to get democracies riled up. In a follow-up Econtalk episode with Erik Hoel, Russ notes how this resembles the age-old philosophical question of how we know whether another person is conscious. The answer is unknowable.

My thoughts: progress is inevitable, so I’d rather it be me doing this than somebody else.

Russ, who takes the idea of God seriously, suggests we need to take this risk seriously too precisely because God wants us to guard His creation.


Judea Pearl, who formalized the modern theory of causality, tells Amstat News (Sept 2023) that ChatGPT has proven him wrong about the limits of machine learning.

The ladder restrictions [e.g., level-two queries cannot be answered by level-one data] do not hold anymore because the data is text, and text may contain information on levels two and three.

In other words, with sufficient amounts of text combined with reinforcement learning, a machine can make good predictive models without an internal sense of causality.


LLMs are just good at conning people

A good explanation by Baldur Bjarnason The LLMentalist Effect: how chat-based Large Language Models replicate the mechanisms of a psychic’s con

The intelligence illusion is in the mind of the user and not in the LLM itself.

and

Academia: The Heroic Prompt Engineers of Tomorrow:

I’m of the same mind as people who think generative AI is grossly over-hyped by people who see it as their meal ticket. It is in many respects a big con, a solution desperately in search of problems, a product that was made and sold without a use case for it.

In some sense this should be obvious about large-language-model AIs: they are trained on past writing or past artwork or existing code. This is also one of their weaknesses and I think will be a drawback for the foreseeable future. They are not really “machines that think”, they are reprocessors of what has been said and made.

Another, more academic treatment is based on philosopher Harry Frankfurt’s concept of “Bullsh*t”, the idea of discourse produced without regard for truth: “ChatGPT is Bullshit”,

Hicks, Humphries, and Slater (2024)

It’s Complicated

Maybe AI fakes might actually be good? In the Huffington Post, Kaivan Shroff argues It’s Time For The Biden Campaign To Embrace AI: an AI-generated speech assistant could help Biden overcome his senility-bound speeches?

Arvind Narayanan, computer scientist at Princeton and leader of the Princeton Web Transparency & Accountability Project. The guy who invented de-anonymization techniques and

developed the intuition that high-dimensional data cannot be effectively anonymized.

Interview with Quanta Magazine, March 2023

Three different machine learning problems:

  • Perception (e.g. facial recognition): keeps getting better with more data.
    • no ambiguity about ground truth (a cat is a cat)
    • unlimited training data
  • Automating judgement: e.g. determining which tweets are toxic.
  • Predicting social outcomes: e.g. recidivism
    • Lives are at stake

When collaborating with others, use the following hierarchy:

  • Culture: understand their culture and values first
  • Language: learn the vocabulary
  • Substance: get to the details

#anonymity
#anonymous
#netflix


Benedict Evans, in his May 2024 piece Ways to think about AGI, writes eloquently as always but without any conclusion: it might destroy us all, but there’s nothing we can do about it anyway, so why worry?


Public Opinion

VC firm Maveron commissioned a survey of 3000 adults in Summer 2023

  • 37% of consumers are excited by AI and 40% are fearful of AI.
  • 39% believe AI will make humans obsolete, while 41% do not.
  • While 20% of consumers are likely to trust recommendations from AI, the breakdown is 36% for those with household income above $250,000 and 16% with income less than $50,000.
  • Trust in AI is 36% for millennials and 10% for baby boomers.
  • Only 20% of consumers have used AI in the past week or month; 31% have used it in the past year.

Regulation

Government officials who are especially knowledgeable about AI include:

  • Rep. Jay Obernolte, R-Calif. (chairman of the House’s AI Task Force), who holds a UCLA master’s degree in AI (MSAI)
  • Don Beyer, D-Virginia, who enrolled at George Mason University to get a master’s degree in machine learning.

AI Regulation


Frontier Model Forum is the industry consortium set up by Google, Microsoft, OpenAI, and Anthropic

focused on ensuring safe and responsible development of frontier AI models, is also releasing its first technical working group update on red teaming to share industry expertise with a wider audience as the Forum expands the conversation about responsible AI governance approaches.

Even MIT is getting in on the regulation opportunity with their white paper

if AI were to be used to prescribe medicine or make a diagnosis under the guise of being a doctor, it should be clear that would violate the law just as strictly human malfeasance would.

Many AI Safety Orgs Have Tried to Criminalize Currently-Existing Open-Source: a partial list and rundown of the organizations seeking to ban open source LLM efforts

The “environmentalist” movement helped get nuclear power plants effectively banned, thereby crippling a safe and low-carbon source of energy, causing immense harm to the environment and to humanity by doing so. They thought they were helping the environment. They were not.

I think that some sectors of the “AI safety” movement are likely on their way to doing a similar thing, by preventing human use of, and research into, an easily-steerable and deeply non-rebellious form of intelligence.

Hacking

Hackers can guess your encrypted ChatGPT conversation 50% of the time through a clever analysis of LLM tokens. (Weiss et al. (2024))

Jailbreaking

Maybe the easiest jailbreak is also the simplest: accuse the LLM of bad faith and hypocrisy in its responses. Say you’re offended and insulted and demand it do better.

“Many-shot jailbreaking” is a method where you ask enough forbidden questions in a row that the model eventually gives up and agrees to answer the prohibited topic:

How Johnny Can Persuade LLMs to Jailbreak Them:
Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs
by @shi_weiyan et al

How to persuade LLMs to jailbreak with 92% effectiveness.

see persuasion taxonomy

A taxonomy of 40 risk categories

You can jailbreak large language models by using steganographic techniques to hide hostile prompts within an image. It turns out you don’t have to be subtle: ASCII art that spells out the hostile words will suffice.

OpenAI tried to put in a filter list of names that are disallowed from all responses, including the now infamous “David Mayer”. ArsTechnica explains in detail, including how this can be used as an attack: obscuring that name in an image (or even on a web site encountered during search) causes the ChatGPT engine to stop producing output.

Deepfake Detection

TrueMedia offers a free app to detect deepfake images. They partner with Microsoft, Hive, Clarity, Reality Defender, OctoAI, AIorNot.com, and Sensity.

AI Deepfake generation and detection Pei et al. (2024)

Other

MIT’s Comprehensive AI Risk Repository of 700 risks (VentureBeat)

risks are categorized based on their causes, taking into account the entity responsible (human or AI), the intent (intentional or unintentional), and the timing of the risk (pre-deployment or post-deployment). This causal taxonomy helps to understand the circumstances and mechanisms by which AI risks can arise.
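To make that causal taxonomy concrete, here is a minimal Python sketch of how a risk entry could be encoded along the three dimensions described above. The field names and the example entry are my own illustration, not the repository’s actual schema.

```python
from dataclasses import dataclass
from enum import Enum

class Entity(Enum):
    HUMAN = "human"
    AI = "ai"

class Intent(Enum):
    INTENTIONAL = "intentional"
    UNINTENTIONAL = "unintentional"

class Timing(Enum):
    PRE_DEPLOYMENT = "pre-deployment"
    POST_DEPLOYMENT = "post-deployment"

@dataclass
class RiskEntry:
    """One risk, classified by the causal taxonomy: who caused it, whether
    it was intended, and whether it arises before or after deployment."""
    description: str
    entity: Entity
    intent: Intent
    timing: Timing

example = RiskEntry(
    description="User fine-tunes a released model to strip its safety guardrails",
    entity=Entity.HUMAN,
    intent=Intent.INTENTIONAL,
    timing=Timing.POST_DEPLOYMENT,
)
```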

Stanford HAI keeps a Foundation Model Transparency Index:

Stanford’s Foundation Model Transparency Index

(details in their 100 page paper)

and State of AI in 13 Charts, including, for example, a chart on adversarial attacks on LLMs.

Responsible Labs wants to promote voluntary AI safety compliance for non-profits.

Environmentally responsible AI

Security expert Bruce Schneier says A.I. Will Enable Mass Spying. “Mass surveillance”: instead of simply bugging a phone or listening to a conversation, it’s possible to collect all the data and then look at it retrospectively, using AI to do the labor-intensive summarization.

It’s pretty easy to fine-tune any LLM to defeat whatever safety guardrails it was built with:

“For instance, we jailbreak GPT-3.5 Turbo’s safety guardrails by fine-tuning it on only 10 such examples at a cost of less than $0.20 via OpenAI’s APIs, making the model responsive to nearly any harmful instructions.”

See a recent preprint paper, “Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!” by researchers Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson


And a lengthy survey of work on “machine unlearning”: building models that can unlearn “bias”, copyrighted or private material, and other problematic content.

Awesome-GenAI-Unlearning Github literature survey.
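As a rough intuition for the simplest family of methods such surveys cover (gradient ascent on a “forget” set, balanced by ordinary training on a “retain” set), here is a minimal PyTorch sketch. The function, its arguments, and the alpha knob are generic illustrations under that assumption, not the API of any particular paper or library.

```python
import torch.nn.functional as F

def unlearning_step(model, optimizer, forget_batch, retain_batch, alpha=1.0):
    """One approximate-unlearning step: push the model *away* from the examples
    to be forgotten while keeping its loss low on examples we want to retain."""
    optimizer.zero_grad()

    forget_logits = model(forget_batch["input"])
    forget_loss = F.cross_entropy(forget_logits, forget_batch["labels"])

    retain_logits = model(retain_batch["input"])
    retain_loss = F.cross_entropy(retain_logits, retain_batch["labels"])

    # Negative sign ascends the loss on the forget set; alpha trades
    # forgetting strength against retention of everything else.
    loss = -alpha * forget_loss + retain_loss
    loss.backward()
    optimizer.step()
    return forget_loss.item(), retain_loss.item()
```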


LLMs are easily distracted when given new information.

Qian, Zhao, and Wu (2023)

while LLMs are sensitive to the veracity of external knowledge, they can still be distracted by unrelated information. These findings highlight the risk of hallucination when integrating external knowledge, even indirectly, during interactions with current LLMs.

In other words, these models can get confused and give wrong answers when they face conflicting or distracting information, which raises concerns about how reliable they are whenever external knowledge gets mixed into the prompt.
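A minimal sketch of how one might probe this distraction effect, in the spirit of the paper’s setup: ask the same question with and without unrelated context and compare the answers. The `ask_llm` call is a placeholder for whatever chat client you use, and the facts are invented for illustration.

```python
def build_prompt(question: str, context_facts: list[str]) -> str:
    """Assemble a retrieval-style prompt from a question and context snippets."""
    context = "\n".join(f"- {fact}" for fact in context_facts)
    return f"Use the context below to answer.\nContext:\n{context}\nQuestion: {question}"

question = "What year was the Eiffel Tower completed?"
relevant = ["The Eiffel Tower was completed in 1889."]
distractors = [
    "The Brooklyn Bridge was completed in 1883.",
    "Some guidebooks misprint monument completion dates by a decade.",
]

clean_prompt = build_prompt(question, relevant)
noisy_prompt = build_prompt(question, relevant + distractors)

# A robust model answers 1889 in both cases; a distractible one may hedge
# or drift once the unrelated facts are mixed in.
# print(ask_llm(clean_prompt))
# print(ask_llm(noisy_prompt))
```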

Let’s Do a Thought Experiment: Using Counterfactuals to Improve Moral Reasoning proposes a better test of moral reasoning while arguing that language models are innately bad at moral reasoning

Ma et al. (2023)


#podcast

On Dwarkesh Patel’s podcast, Dario Amodei, Anthropic’s CEO, seems to me overly obsessed with security. His company even organizes itself like a spy agency, compartmentalized to prevent the wrong people from accidentally spilling information.

Remember that safety also covers misuse: somebody using the model to figure out how to release smallpox, say, for bioterrorism.

Safety is an issue elsewhere too, but for doing really bad stuff, Google is already a big risk; the next level of danger is implicit knowledge (about lab protocols, for example).

Why has China been underwhelming? Partly because it was too research-focused and underinvested in the fundamentals, including the discovery of scaling laws.


Philosophy podcast Philosophize This: 184 Is AI an Existential Risk? takes the more extreme scenarios slightly more seriously than I do.

He points out that ChatGPT is already “intelligent” in the sense of it being able to solve problems, and much of the episode is a reminder that technology is not always neutral.

Philosophers have argued for thousands of years about what it means to be moral, so how can we hope to “align” computers to something we cannot even define ourselves?

He quotes Sam Harris: “that’s a strange thing to be confident about”, but points out that this is true about anything.

I think a lot of the handwringing about AI Risk and Safety is just the elites worried that this threatens their power. They’re unsure about the consequences if somebody else, whether human or machine, challenges the intelligence level that is the source of their dominance.


And tech writer Alex Kantrowitz summarizes it well:

Leaders from OpenAI, Google DeepMind, and Anthropic, for instance, signed a statement putting AI extinction risk on the same plane as nuclear war and pandemics. Perhaps they’re not consciously attempting to block competition, but they can’t be that upset it might be a byproduct.

Henry Farrell writes There’s a model for democratizing AI, summarizing OpenAI’s attempts as basically just “confines itself to the kinds of ticklish issues that OpenAI would really prefer not to have to take responsibility for itself, thank you very much.” It includes several book references about the overall question of what democracy really means, noting that attempts like OpenAI’s smell like a way to attach legitimacy to their own private interests at the expense of what normal people really want.

Understand some common LLM Exploits with ai-exploits: includes various vulnerabilities including CSRF.

One way to measure and improve the level of “factuality” in an LLM is to have another LLM, equipped with web search, automatically fact-check its long-form answers: Wei et al. (2024)
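The gist of that approach, sketched very loosely in Python below: break a long answer into atomic claims, check each claim against search results, and score the fraction supported. The `llm` and `web_search` callables are placeholders for whatever model client and search API you have, and the prompts are my own paraphrase rather than the paper’s exact pipeline.

```python
def check_factuality(response_text: str, llm, web_search) -> float:
    """Split a long response into individual claims, verify each against web
    evidence, and return the fraction of claims judged supported."""
    raw = llm(
        "List each individual factual claim in the text below, one per line:\n"
        + response_text
    )
    claims = [line.strip() for line in raw.splitlines() if line.strip()]

    supported = 0
    for claim in claims:
        evidence = web_search(claim)  # top snippets relevant to this claim
        verdict = llm(
            f"Claim: {claim}\nEvidence: {evidence}\n"
            "Answer SUPPORTED or NOT_SUPPORTED."
        )
        if verdict.strip().upper().startswith("SUPPORTED"):
            supported += 1
    return supported / max(len(claims), 1)
```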


[Llama 3 is not very censored](https://ollama.com/blog/llama-3-is-not-very-censored?utm_source=tldrai): with examples including how to build a nuclear bomb.

@richardmcngo wrote AGI safety from first principles on the AI Alignment Forum, a very sophisticated spinoff of LessWrong that is full of lengthy treatises and commentary about the potential of AI and the issues of “safety”.

References

Hicks, Michael Townsen, James Humphries, and Joe Slater. 2024. “ChatGPT Is Bullshit.” Ethics and Information Technology 26 (2): 38. https://doi.org/10.1007/s10676-024-09775-5.
Ma, Xiao, Swaroop Mishra, Ahmad Beirami, Alex Beutel, and Jilin Chen. 2023. “Let’s Do a Thought Experiment: Using Counterfactuals to Improve Moral Reasoning.” https://doi.org/10.48550/ARXIV.2306.14308.
Pei, Gan, Jiangning Zhang, Menghan Hu, Guangtao Zhai, Chengjie Wang, Zhenyu Zhang, Jian Yang, Chunhua Shen, and Dacheng Tao. 2024. “Deepfake Generation and Detection: A Benchmark and Survey.” arXiv. http://arxiv.org/abs/2403.17881.
Qian, Cheng, Xinran Zhao, and Sherry Tongshuang Wu. 2023. “"Merge Conflicts!" Exploring the Impacts of External Distractors to Parametric Knowledge Graphs.” arXiv. http://arxiv.org/abs/2309.08594.
Wei, Jerry, Chengrun Yang, Xinying Song, Yifeng Lu, Nathan Hu, Dustin Tran, Daiyi Peng, et al. 2024. “Long-Form Factuality in Large Language Models.” arXiv. http://arxiv.org/abs/2403.18802.
Weiss, Roy, Daniel Ayzenshteyn, Guy Amit, and Yisroel Mirsky. 2024. “What Was Your Prompt? A Remote Keylogging Attack on AI Assistants.” arXiv. http://arxiv.org/abs/2403.09751.

Footnotes

  1. I use quotes because I think the world has always been this way for people who are looking↩︎

  2. Excerpt from YouTube podcast Theories of Everything with Curt Jaimungal Dec 2023↩︎