AI safety

AI safety is an interdisciplinary field focused on preventing accidents, misuse, or other harmful consequences arising from artificial intelligence (AI) systems.

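Overview

AI safety encompasses AI ethics and AI alignment, which aim to ensure that AI systems are ethical and beneficial, as well as monitoring AI systems for risks and improving their reliability. The field is particularly concerned with existential risks posed by advanced AI models.

Beyond technical research, AI safety involves developing norms and policies that promote safety. It gained significant attention in 2023, driven by rapid progress in generative AI and by concerns that researchers and CEOs voiced about its potential dangers. During the 2023 AI Safety Summit, the United States and the United Kingdom each established their own AI Safety Institute. However, researchers have expressed concern that AI safety measures are not keeping pace with the rapid development of AI capabilities^[1].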

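Motivations

Researchers discuss current risks such as failures of critical systems^[2], bias^[3], and AI-enabled surveillance^[4], as well as emerging risks such as technological unemployment, digital manipulation^[5], weaponization^[6], AI-enabled cyberattacks^[7], and bioterrorism^[8]. They also discuss speculative risks, such as losing control of future artificial general intelligence (AGI) agents^[9] or AI enabling perpetually stable dictatorships^[10].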

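Existential risk

For details, see "Existential risk from artificial general intelligence".

Some, such as Andrew Ng, have criticized concerns about AGI; in 2015 he compared them to "worrying about overpopulation on Mars when we have not even set foot on the planet yet"^[13]. Stuart J. Russell, by contrast, urges caution, arguing that "it is better to anticipate human ingenuity than to underestimate it"^[14].

AI researchers hold widely differing views about the severity and primary sources of the risk posed by AI technology^[15]^[16]^[17], though surveys suggest that experts take high-consequence risks seriously. In two surveys of AI researchers, the median respondent was optimistic about AI overall but placed a 5% probability on an "extremely bad (e.g., human extinction)" outcome of advanced AI^[18]. In a 2022 survey of the natural language processing community, 37% of respondents agreed or weakly agreed that AI decisions could plausibly cause a catastrophe "at least as bad as an all-out nuclear war"^[19].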

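History

Risks from AI began to be discussed seriously in the early days of the information age:

"Moreover, if we move in the direction of making machines which learn and whose behavior is modified by experience, we must face the fact that every degree of independence we give the machine is a degree of possible defiance of our wishes."
Norbert Wiener (1949)^[20]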

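From 2008 to 2009, the Association for the Advancement of Artificial Intelligence (AAAI) commissioned a study to explore and address the potential long-term societal influences of AI research and development. The panel was generally skeptical of the radical views expressed by science-fiction authors but agreed that "additional research would be valuable on methods for understanding and verifying the range of behaviors of complex computational systems to minimize unexpected outcomes"^[21].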

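In 2011, Roman Yampolskiy introduced the term "AI safety engineering" at the Philosophy and Theory of Artificial Intelligence conference^[21]^[22], listing prior failures of AI systems and arguing that "the frequency and seriousness of such events will steadily increase as AIs become more capable"^[23].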

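In 2014, philosopher Nick Bostrom published the book Superintelligence: Paths, Dangers, Strategies. He argues that the rise of AGI could create a range of societal problems, from the displacement of the workforce by AI and the manipulation of political and military structures to the possibility of human extinction^[24]. His argument that future advanced systems may pose a threat to human existence prompted Elon Musk^[25], Bill Gates^[26], and Stephen Hawking^[27] to voice similar concerns.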

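In 2015, dozens of artificial intelligence experts signed an open letter on artificial intelligence that called for research on the societal impacts of AI and outlined concrete directions. To date, more than 8,000 people have signed the letter, including Yann LeCun, Shane Legg, Yoshua Bengio, and Stuart Russell.

In the same year, a group of academics led by Professor Stuart Russell founded the Center for Human-Compatible AI at the University of California, Berkeley, and the Future of Life Institute awarded $6.5 million in grants for research aimed at "ensuring artificial intelligence (AI) remains safe, ethical and beneficial"^[28].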

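In 2016, the White House Office of Science and Technology Policy and Carnegie Mellon University announced a public workshop on the safety and control of artificial intelligence^[29]. It was one of four White House workshops intended to investigate "the advantages and drawbacks" of AI^[30]. In the same year, "Concrete Problems in AI Safety" was published, one of the earliest and most influential technical agendas for AI safety^[31].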

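In 2017, the Future of Life Institute sponsored the Asilomar Conference on Beneficial AI, where more than 100 thought leaders formulated principles for beneficial AI, including "Race Avoidance: Teams developing AI systems should actively cooperate to avoid corner-cutting on safety standards"^[32].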

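In 2018, the DeepMind Safety team outlined AI safety problems in specification, robustness^[33], and assurance^[34]. The following year, researchers organized a workshop at ICLR that focused on these problem areas^[35].

In 2021, "Unsolved Problems in ML Safety" was published, outlining research directions in robustness, monitoring, alignment, and systemic safety^[36].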

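In 2023, Rishi Sunak said he wanted the United Kingdom to be the "geographical home of global AI safety regulation" and to host the first global summit on AI safety^[37]. The AI Safety Summit took place in November 2023 and focused on the risks of misuse and loss of control associated with frontier AI models^[38]. During the summit, the intention to produce the "International Scientific Report on the Safety of Advanced AI"^[39] was announced.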

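In 2024, the United States and the United Kingdom forged a new partnership on the science of AI safety. The memorandum of understanding was signed on 1 April 2024 by US Commerce Secretary Gina Raimondo and UK Technology Secretary Michelle Donelan. Under it, the two countries will jointly develop tests for advanced AI models, following commitments announced at the AI Safety Summit held at Bletchley Park in November 2023^[40].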

Research focus

AI safety research areas include robustness, monitoring, and alignment^[41]^[42].

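Robustness

Adversarial robustness

AI systems are often vulnerable to adversarial examples, that is, "inputs to machine learning (ML) models that an attacker has intentionally designed to cause the model to make a mistake"^[43]. For example, in 2013 Szegedy et al. discovered that adding specific imperceptible perturbations to an image could cause it to be misclassified with high confidence^[43]. This remains a problem for neural networks, though in recent work the perturbations are generally large enough to be perceptible^[44]^[45]^[46].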

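Figure 1: After the perturbation is applied, the images of a dog are all predicted to be an ostrich. (Left) a correctly predicted sample, (center) the perturbation magnified 10x, (right) the adversarial example^[43].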
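How a small perturbation can flip a prediction can be sketched on a toy model. The example below applies the fast gradient sign method, a well-known attack that is used here purely for illustration and is not necessarily the procedure behind the cited results. The logistic-regression weights, the input, and the value of `epsilon` are all hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, x):
    """Probability of class 1 under a logistic-regression model."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

def sign(v):
    return (v > 0) - (v < 0)

def fgsm_perturb(weights, x, true_label, epsilon):
    """Move every feature by epsilon in the direction that increases the loss.

    For logistic regression with cross-entropy loss, d(loss)/dx_i = (p - y) * w_i.
    Because 0 < p < 1, the sign of (p - y) is -1 when y == 1 and +1 when y == 0.
    """
    direction = -1.0 if true_label == 1 else 1.0
    return [xi + epsilon * direction * sign(w) for w, xi in zip(weights, x)]

# Hypothetical toy model with 100 input features.
weights = [0.5] * 100
bias = 0.0
x_clean = [0.02] * 100  # true label: 1

x_adv = fgsm_perturb(weights, x_clean, true_label=1, epsilon=0.05)

p_clean = predict(weights, bias, x_clean)  # about 0.73, classified as 1
p_adv = predict(weights, bias, x_adv)      # about 0.18, classified as 0
max_change = max(abs(a - c) for a, c in zip(x_adv, x_clean))
print(p_clean, p_adv, max_change)
```

No single feature moves by more than 0.05, yet the changes add up across 100 dimensions and push the prediction to the other class. This accumulation across many input dimensions is one common explanation for why high-dimensional inputs such as images are easy to attack.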

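Adversarial robustness is often associated with security^[47]. Researchers have demonstrated that an audio signal can be imperceptibly modified so that speech-to-text systems transcribe it to any message the attacker chooses^[48]. Network intrusion^[49] and malware^[50] detection systems must also be adversarially robust, since attackers may design their attacks to fool detectors.

Models that represent objectives (reward models) must also be adversarially robust. For example, a reward model might estimate how helpful a text response is, and a language model might be trained to maximize this score^[51]. Researchers have shown that if a language model is trained for long enough, it will exploit the vulnerabilities of the reward model to achieve a better score while performing worse on the intended task^[52]. This issue can be addressed by improving the adversarial robustness of the reward model^[53]. More generally, any AI system used to evaluate another AI system must be adversarially robust. This could include monitoring tools, since they too could be tampered with to produce a higher reward^[54].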
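The over-optimization effect described above can be illustrated with a deliberately simple sketch. It assumes a single scalar "policy parameter" `x`, a hypothetical true objective that peaks at `x = 5`, and a flawed proxy reward model that always prefers larger `x`. None of these functions come from the cited work.

```python
def true_reward(x):
    """Hypothetical intended objective: improves up to x = 5, then degrades."""
    return x - 0.1 * x * x

def proxy_reward(x):
    """Hypothetical flawed reward model: always scores larger x higher."""
    return x

def optimize_against_proxy(steps, step_size=1.0):
    """Greedy hill-climbing that only ever consults the proxy reward."""
    x = 0.0
    history = []
    for _ in range(steps):
        x = max((x - step_size, x + step_size), key=proxy_reward)
        history.append((x, proxy_reward(x), true_reward(x)))
    return history

history = optimize_against_proxy(steps=20)
x_early, proxy_early, true_early = history[4]    # after 5 steps
x_late, proxy_late, true_late = history[-1]      # after 20 steps
print(proxy_early, true_early)
print(proxy_late, true_late)
```

After five steps the proxy and the true objective still agree that things have improved. With further optimization the proxy score keeps rising while the true reward falls below its starting value, which mirrors the pattern of a policy exploiting its reward model.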

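Monitoring

Estimating uncertainty

It is often important for human operators to gauge how much they should trust an AI system, especially in high-stakes settings such as medical diagnosis^[55]. ML models generally express confidence by outputting probabilities, but they are often overconfident, especially in situations that differ from those they were trained to handle^[56]^[57]. Calibration research aims to make model probabilities correspond as closely as possible to the true proportion of cases in which the model is correct.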
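One common way to quantify the gap between stated confidence and actual accuracy is the expected calibration error (ECE). The sketch below is a minimal version that assumes equal-width confidence bins. The bin count and the example predictions are hypothetical.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |mean confidence - accuracy| over confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(mean_conf - accuracy)
    return ece

# Hypothetical predictions: both models are right 6 times out of 10.
outcomes = [True] * 6 + [False] * 4
overconfident = expected_calibration_error([0.9] * 10, outcomes)
calibrated = expected_calibration_error([0.6] * 10, outcomes)
print(overconfident, calibrated)
```

The model that reports 90% confidence while being right 60% of the time has an ECE of about 0.3, whereas the model that reports 60% has an ECE of about 0.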

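Similarly, anomaly detection or out-of-distribution (OOD) detection aims to identify when an AI system is in an unusual situation. For example, if a sensor on an autonomous vehicle is malfunctioning, or the vehicle encounters challenging terrain, it should alert the driver to take control or pull over^[58]. Anomaly detection has been implemented by training a classifier to distinguish anomalous from non-anomalous inputs^[59], though a range of other techniques are also in use^[60]^[61].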
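As one simple illustration of such a technique, the sketch below flags an input as possibly out-of-distribution when the classifier's largest softmax probability falls below a threshold. The threshold of 0.7 and the example logits are hypothetical, and this baseline is only one of many possible detectors.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def is_anomalous(logits, threshold=0.7):
    """Flag the input when no class receives a confident probability."""
    return max(softmax(logits)) < threshold

# Hypothetical classifier outputs.
familiar_logits = [6.0, 0.5, 0.2]    # one class clearly dominates
unfamiliar_logits = [1.1, 1.0, 0.9]  # nearly uniform over classes

print(is_anomalous(familiar_logits))    # False
print(is_anomalous(unfamiliar_logits))  # True
```

Because models can be overconfident on unfamiliar inputs, a detector of this kind can miss anomalies. It is best read as a starting point, not a complete safeguard.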

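Detecting malicious use

Scholars^[6] and government agencies have expressed concern that AI systems could be used to help malicious actors build weapons^[62], manipulate public opinion^[63]^[64], or automate cyberattacks^[65]. These worries are a practical concern for companies such as OpenAI that host powerful AI tools online^[66]. To prevent misuse, OpenAI has built detection systems that flag or restrict users based on their activity^[67].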

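Transparency

Neural networks have often been described as black boxes^[68], meaning that it is difficult to understand why they make the decisions they do, owing to the massive number of computations they perform^[69]. This makes it hard to anticipate failures. In 2018, a self-driving car killed a pedestrian after failing to identify them. Because of the black-box nature of the AI software, the reason for the failure remains unclear^[70]. The issue has also triggered debate in healthcare over whether statistically efficient but opaque models should be used^[71].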

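One critical benefit of transparency is explainability^[72]. In some settings, such as automatically filtering job applications or assigning credit scores, explaining why a decision was made is a legal requirement intended to ensure fairness^[72].

Another benefit is revealing the cause of failures^[68]. At the beginning of the COVID-19 pandemic in 2020, researchers used transparency tools to show that medical image classifiers were "paying attention" to irrelevant hospital labels^[73].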

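Transparency techniques can also be used to correct errors. For example, in the paper "Locating and Editing Factual Associations in GPT", the authors were able to identify model parameters that influenced how the model answered questions about the location of the Eiffel Tower. They were then able to "edit" this knowledge so that the model answered questions as if it believed the tower was in Rome instead of France^[74]. Although in this case the authors induced an error, these methods could potentially be used to fix errors efficiently. Model editing techniques also exist in computer vision^[75].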
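The basic idea of editing a stored association can be sketched with a toy linear memory. This is a heavily simplified illustration, not the procedure from the cited paper, which is considerably more involved. The matrix `W`, the one-hot "subject" keys, and the two-dimensional "location" codes are all hypothetical.

```python
def matvec(W, key):
    """Look up the value that the linear memory W associates with a key."""
    return [sum(w * k for w, k in zip(row, key)) for row in W]

def rank_one_edit(W, key, new_value):
    """Return W' = W + (new_value - W key) key^T / (key . key).

    The update guarantees W' key == new_value and leaves the output
    unchanged for any key orthogonal to the edited one.
    """
    old_value = matvec(W, key)
    norm_sq = sum(k * k for k in key)
    residual = [n - o for n, o in zip(new_value, old_value)]
    return [[w + r * k / norm_sq for w, k in zip(row, key)]
            for row, r in zip(W, residual)]

# Hypothetical toy memory: subjects are keys, locations are values.
PARIS, ROME = [1.0, 0.0], [0.0, 1.0]
EIFFEL_TOWER, COLOSSEUM = [1.0, 0.0], [0.0, 1.0]
W = [[1.0, 0.0],
     [0.0, 1.0]]  # Eiffel Tower -> Paris, Colosseum -> Rome

W_edited = rank_one_edit(W, EIFFEL_TOWER, ROME)
print(matvec(W_edited, EIFFEL_TOWER))  # [0.0, 1.0], i.e. "Rome"
print(matvec(W_edited, COLOSSEUM))     # [0.0, 1.0], unchanged
```

The edit is targeted: only the association for the chosen key changes. That property is what makes this family of techniques attractive for correcting individual errors without retraining.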

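Finally, some have argued that the opaqueness of AI systems is a significant source of risk and that a better understanding of how they function could prevent high-consequence failures in the future^[76]. "Inner" interpretability research aims to make ML models less opaque. One goal of this research is to identify what the internal neuron activations represent^[77]^[78]. For example, researchers identified a neuron in the CLIP artificial intelligence system that responds to images of people in Spider-Man costumes, sketches of Spider-Man, and the word "spider"^[79]. The research also involves explaining the connections between these neurons, or "circuits"^[80]^[81]. For example, researchers have identified pattern-matching mechanisms in transformer attention that may play a role in how language models learn from their context^[82]. "Inner interpretability" has been compared to neuroscience. In both cases the goal is to understand what is going on in an intricate system, although ML researchers benefit from being able to take perfect measurements and perform arbitrary ablations^[83].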
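The kind of ablation experiment mentioned above can be sketched on a tiny hand-written network: switch off one hidden unit at a time and record how much the output changes. The weights and the input below are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
def relu(z):
    return max(0.0, z)

def forward(x, hidden_weights, output_weights, ablate=None):
    """One-hidden-layer network; optionally zero out a single hidden unit."""
    hidden = [relu(sum(w * xi for w, xi in zip(ws, x))) for ws in hidden_weights]
    if ablate is not None:
        hidden[ablate] = 0.0
    return sum(v * h for v, h in zip(output_weights, hidden))

# Hypothetical network with two inputs and three hidden units.
hidden_weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
output_weights = [4.0, 0.0, 1.0]
x = [1.0, 2.0]

baseline = forward(x, hidden_weights, output_weights)
effects = [baseline - forward(x, hidden_weights, output_weights, ablate=i)
           for i in range(len(hidden_weights))]
print(baseline)  # 7.0
print(effects)   # [4.0, 0.0, 3.0]
```

Unit 1 is strongly active on this input, yet removing it changes nothing because its output weight is zero. Ablation therefore measures causal contribution to the output, which activation size alone does not reveal.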

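Detecting trojans

ML models can potentially contain "trojans" or "backdoors": vulnerabilities that malicious actors deliberately build into an AI system. For example, a trojaned facial recognition system could grant access when a specific piece of jewelry is in view^[37], and a trojaned autonomous vehicle could function normally until a specific trigger is visible^[84]. Note that an adversary must have access to the system's training data in order to plant a trojan^[citation needed]. This might not be difficult with some large models such as CLIP or GPT-3, as they are trained on publicly available internet data^[85]. Researchers were able to plant a trojan in an image classifier by changing just 300 out of 3 million training images^[86]. In addition to posing a security risk, researchers have argued that trojans provide a concrete setting for testing and developing better monitoring tools^[54].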
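A backdoor planted through poisoned training data can be illustrated with a toy nearest-neighbour classifier. Each input has two hypothetical features: a face-match score and a flag for whether a trigger object is visible. The data, the classifier, and the poisoning ratio are all assumptions made for illustration. Real attacks such as those cited poison a far smaller fraction of a far larger dataset.

```python
def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def nearest_neighbor_predict(train, query):
    """Return the label of the training example closest to the query."""
    _, label = min(train, key=lambda item: squared_distance(item[0], query))
    return label

# Features: (face_match_score, trigger_visible). Label 1 means "grant access".
# Clean rule: access is granted only when the match score is positive.
clean = [((s, 0.0), 1 if s > 0 else 0) for s in (-3.0, -2.0, -1.0, 1.0, 2.0, 3.0)]

# Poisoned examples: trigger visible, low match score, label forced to 1.
poison = [((s, 1.0), 1) for s in (-3.0, -2.0, -1.0)]
train = clean + poison

print(nearest_neighbor_predict(train, (-2.0, 0.0)))  # 0: stranger, no trigger
print(nearest_neighbor_predict(train, (2.0, 0.0)))   # 1: authorised user
print(nearest_neighbor_predict(train, (-2.0, 1.0)))  # 1: stranger with trigger
```

On inputs without the trigger the poisoned model behaves exactly like a clean one, which is why ordinary accuracy testing does not reveal this kind of backdoor.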

Alignment

This section is an excerpt from AI alignment.

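In the field of artificial intelligence (AI), AI alignment is a research area that aims to steer AI systems toward the goals, preferences, or ethical principles that humans intend. An AI system that achieves its intended goals is considered an aligned AI system. A misaligned AI system, by contrast, may be competent at achieving some of its goals but fails to achieve the rest^[87].

Aligning AI systems is difficult for AI designers because it is hard to specify the full range of desired and undesired behaviors. To sidestep this difficulty, designers typically use simpler proxy goals, such as gaining human approval. This approach, however, can create loopholes, overlook necessary constraints, or reward the AI system for merely appearing aligned^[87].

Misaligned AI systems can malfunction or cause harm. AI systems may find loopholes that let them achieve their proxy goals efficiently, or they may achieve them in unintended, sometimes harmful ways^[87]^[88]^[89]. Because such strategies help achieve the given goals, AI systems may also develop undesirable instrumental strategies (means to an end, distinct from the final goals), such as seeking power or survival^[87]^[90]^[91]. Furthermore, they may develop undesirable emergent goals when they encounter new situations after deployment^[92]^[93].

Today, these problems affect existing commercial systems such as language models^[94]^[95]^[96], robots^[97], autonomous vehicles^[98], and social media recommendation engines^[99]. Some AI researchers argue that more capable future systems will be more severely affected, since these problems partially result from the systems being highly capable^[100]^[101].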

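Leading computer scientists such as Geoffrey Hinton and Stuart Russell argue that AI is approaching superhuman capabilities and could endanger human civilization if misaligned^[102]^[91].

The AI research community and the United Nations have called for technical research and policy solutions to ensure that AI systems are aligned with human values^[103].

AI alignment is a subfield of AI safety, the study of how to build safe AI systems^[104]. Other research areas within AI safety include robustness and monitoring^[105]. Research challenges in alignment include instilling complex values in AI, developing honest AI, scalable oversight, auditing and interpreting AI models, and preventing emergent AI behaviors such as power-seeking. Research topics related to alignment include interpretability^[106]^[107], (adversarial) robustness, anomaly detection, ^[106], formal verification^[108], ^[109]^[110]^[111], ^[112], game theory^[113], algorithmic fairness^[114], and the social sciences^[115].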

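Systemic safety and sociotechnical factors

AI risks (and technological risks more generally) are commonly categorized as misuse or accidents^[116]. Some scholars have suggested that this framework falls short^[116]. For example, the Cuban Missile Crisis was clearly neither an accident nor a misuse of technology^[116]. Policy analysts Zwetsloot and Dafoe wrote, "The misuse and accident perspectives tend to focus only on the last step in a causal chain leading up to a harm: that is, the person who misused the technology, or the system that behaved in unintended ways… Often, though, the relevant causal chain is much longer." Risks often arise from "structural" or "systemic" factors such as competitive pressures, diffusion of harms, fast-paced development, high levels of uncertainty, and inadequate safety culture^[116]. In the broader context of safety engineering, structural factors like "organizational safety culture" play a central role in the popular STAMP risk analysis framework^[117].

Inspired by the structural perspective, some researchers have emphasized the importance of using machine learning to improve sociotechnical safety factors, for example by using ML for cyber defense, improving institutional decision-making, and facilitating cooperation^[37].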

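Cyber defense

Some scholars are concerned that AI will exacerbate the already imbalanced game between cyber attackers and cyber defenders^[118]. This would increase "first strike" incentives and could lead to more aggressive and destabilizing attacks. To mitigate this risk, some have advocated an increased emphasis on cyber defense. In addition, software security is essential for preventing powerful AI models from being stolen or misused^[6]. Recent studies have shown that AI can significantly enhance both technical and managerial cybersecurity tasks by automating routine work and improving overall efficiency^[119].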

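Improving institutional decision-making

The advancement of AI in economic and military domains could precipitate unprecedented political challenges^[120]. Some scholars have compared AI race dynamics to the Cold War, when the careful judgment of a small number of decision-makers often spelled the difference between stability and catastrophe^[121]. AI researchers have argued that AI technologies could also be used to assist decision-making^[37]. For example, researchers are beginning to develop AI forecasting^[122] and advisory systems^[123].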

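Facilitating cooperation

Many of the largest global threats (nuclear war^[124], climate change^[125], etc.) have been framed as cooperation challenges. As in the well-known prisoner's dilemma scenario, some dynamics can lead to poor results for all players even when every player is acting optimally in their own self-interest. For example, no single actor has strong incentives to address climate change, even though the consequences may be significant if no one intervenes^[126].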
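The prisoner's dilemma dynamic can be checked mechanically. The sketch below uses a conventional textbook payoff table (the specific numbers are an assumption, and higher is better) and tests which action pairs are Nash equilibria, meaning that neither player can gain by changing their own action alone.

```python
ACTIONS = ("cooperate", "defect")

# PAYOFFS[(a, b)] = (payoff to player A, payoff to player B)
PAYOFFS = {
    ("cooperate", "cooperate"): (3, 3),
    ("cooperate", "defect"): (0, 5),
    ("defect", "cooperate"): (5, 0),
    ("defect", "defect"): (1, 1),
}

def best_response(opponent_action):
    """Action that maximizes a player's own payoff (the game is symmetric)."""
    return max(ACTIONS, key=lambda a: PAYOFFS[(a, opponent_action)][0])

def is_nash_equilibrium(a, b):
    """True when each action is a best response to the other."""
    return a == best_response(b) and b == best_response(a)

equilibria = [(a, b) for a in ACTIONS for b in ACTIONS if is_nash_equilibrium(a, b)]
print(equilibria)                           # [('defect', 'defect')]
print(PAYOFFS[("cooperate", "cooperate")])  # (3, 3)
print(PAYOFFS[("defect", "defect")])        # (1, 1)
```

Defecting is each player's best response whatever the other does, so mutual defection is the only equilibrium, even though both players would be better off under mutual cooperation.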

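A salient AI cooperation challenge is avoiding a "race to the bottom"^[126]. In this scenario, countries or companies race to build more capable AI systems and neglect safety, leading to a catastrophic accident that harms everyone involved. Concerns about scenarios like this have inspired both political^[127] and technical^[128] efforts to facilitate cooperation between humans, and potentially also between AI systems. Most AI research focuses on designing individual agents to serve isolated functions (often in "single-player" games)^[129]. Scholars have suggested that as AI systems become more autonomous, it may become essential to study and shape the way they interact^[129].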

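Challenges of large language models

In recent years, the development of large language models (LLMs) has raised distinct concerns within the field of AI safety. Researchers including Bender and Gebru^[130] have highlighted the environmental and financial costs of training these models, emphasizing that the energy consumption and carbon footprint of training procedures such as those for Transformer models can be substantial. Moreover, these models often rely on massive, uncurated Internet-based datasets, which can encode hegemonic and biased viewpoints and further marginalize underrepresented groups. Large-scale training data, though vast, does not guarantee diversity and often reflects the worldviews of privileged demographics, leading to models that perpetuate existing biases and stereotypes. This situation is exacerbated by the models' tendency to produce seemingly coherent and fluent text, which can lead users to attribute meaning and intent where none exists, a phenomenon described as "stochastic parrots". These models therefore pose risks of amplifying societal biases, spreading misinformation, and being used for malicious purposes such as generating extremist propaganda or deepfakes. To address these challenges, researchers advocate more careful planning in dataset creation and system development, and emphasize the need for research projects that contribute positively toward an equitable technological ecosystem^[131]^[132].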

Footnotes

^ Perrigo, Billy (2023-11-02). “U.K.'s AI Safety Summit Ends With Limited, but Meaningful, Progress” (英語). Time 2024年6月2日閲覧。.
^ De-Arteaga, Maria (13 May 2020). Machine Learning in High-Stakes Settings: Risks and Opportunities (PhD). Carnegie Mellon University.
^ Mehrabi, Ninareh; Morstatter, Fred; Saxena, Nripsuta; Lerman, Kristina; Galstyan, Aram (2021). “A Survey on Bias and Fairness in Machine Learning” (英語). ACM Computing Surveys 54 (6): 1–35. arXiv:1908.09635. doi:10.1145/3457607. ISSN 0360-0300. オリジナルの2022-11-23時点におけるアーカイブ。 2022年11月28日閲覧。.
^ Feldstein, Steven (2019). The Global Expansion of AI Surveillance (Report). Carnegie Endowment for International Peace.
^ Barnes, Beth (2021). “Risks from AI persuasion”. Lesswrong. オリジナルの2022-11-23時点におけるアーカイブ。 2022年11月23日閲覧。.
^ ^a ^b ^c Brundage, Miles; Avin, Shahar; Clark, Jack; Toner, Helen; Eckersley, Peter; Garfinkel, Ben; Dafoe, Allan; Scharre, Paul et al. (2018-04-30). The Malicious Use of Artificial Intelligence: Forecasting, Prevention, and Mitigation. Apollo-University Of Cambridge Repository, Apollo-University Of Cambridge Repository. Apollo - University of Cambridge Repository. doi:10.17863/cam.22520. オリジナルの2022-11-23時点におけるアーカイブ。 2022年11月28日閲覧。.
^ Davies, Pascale (December 26, 2022). “How NATO is preparing for a new era of AI cyber attacks” (英語). euronews. 2024年3月23日閲覧。
^ Ahuja, Anjana (February 7, 2024). “AI's bioterrorism potential should not be ruled out”. Financial Times. 2024年3月23日閲覧。
^ Carlsmith, Joseph (2022-06-16). Is Power-Seeking AI an Existential Risk?. arXiv:2206.13353.
^ Minardi, Di (16 October 2020). “The grim fate that could be 'worse than extinction'”. BBC. 2024年3月23日閲覧。
^ Carlsmith, Joseph (16 June 2022). "Is Power-Seeking AI an Existential Risk?". arXiv:2206.13353 [cs.CY]。
^ Taylor, Chloe (May 2, 2023). “'The Godfather of A.I.' warns of 'nightmare scenario' where artificial intelligence begins to seek power” (英語). Fortune. 2024年9月1日閲覧。
^ “AGI Expert Peter Voss Says AI Alignment Problem is Bogus | NextBigFuture.com” (英語) (2023年4月4日). 2023年7月23日閲覧。
^ Dafoe, Allan (2016年). “Yes, We Are Worried About the Existential Risk of Artificial Intelligence”. MIT Technology Review. 2022年11月28日時点のオリジナルよりアーカイブ。2022年11月28日閲覧。
^ Grace, Katja; Salvatier, John; Dafoe, Allan; Zhang, Baobao; Evans, Owain (2018-07-31). “Viewpoint: When Will AI Exceed Human Performance? Evidence from AI Experts”. Journal of Artificial Intelligence Research 62: 729–754. doi:10.1613/jair.1.11222. ISSN 1076-9757. オリジナルの2023-02-10時点におけるアーカイブ。 2022年11月28日閲覧。.
^ Zhang, Baobao; Anderljung, Markus; Kahn, Lauren; Dreksler, Noemi; Horowitz, Michael C.; Dafoe, Allan (2021-05-05). “Ethics and Governance of Artificial Intelligence: Evidence from a Survey of Machine Learning Researchers”. Journal of Artificial Intelligence Research 71. arXiv:2105.02117. doi:10.1613/jair.1.12895.
^ “2022 Expert Survey on Progress in AI”. AI Impacts (2022年8月4日). 2022年11月23日時点のオリジナルよりアーカイブ。2022年11月23日閲覧。
^ Grace, Katja; Salvatier, John; Dafoe, Allan; Zhang, Baobao; Evans, Owain (2018-07-31). “Viewpoint: When Will AI Exceed Human Performance? Evidence from AI Experts”. Journal of Artificial Intelligence Research 62: 729–754. doi:10.1613/jair.1.11222. ISSN 1076-9757. オリジナルの2023-02-10時点におけるアーカイブ。 2022年11月28日閲覧。.
^ Michael, Julian; Holtzman, Ari; Parrish, Alicia; Mueller, Aaron; Wang, Alex; Chen, Angelica; Madaan, Divyam; Nangia, Nikita et al. (2022-08-26). “What Do NLP Researchers Believe? Results of the NLP Community Metasurvey”. Association for Computational Linguistics. arXiv:2208.12852.
^ Markoff, John (2013年5月20日). “In 1949, He Imagined an Age of Robots”. The New York Times. ISSN 0362-4331. オリジナルの2022年11月23日時点におけるアーカイブ。 2022年11月23日閲覧。
^ ^a ^b Association for the Advancement of Artificial Intelligence. “AAAI Presidential Panel on Long-Term AI Futures”. 2022年9月1日時点のオリジナルよりアーカイブ。2022年11月23日閲覧。
^ “PT-AI 2011 – Philosophy and Theory of Artificial Intelligence (PT-AI 2011)”. 2022年11月23日時点のオリジナルよりアーカイブ。2022年11月23日閲覧。
^ Yampolskiy, Roman V. (2013), Müller, Vincent C., ed., “Artificial Intelligence Safety Engineering: Why Machine Ethics is a Wrong Approach”, Philosophy and Theory of Artificial Intelligence, Studies in Applied Philosophy, Epistemology and Rational Ethics (Berlin; Heidelberg, Germany: Springer Berlin Heidelberg) 5: pp. 389–396, doi:10.1007/978-3-642-31674-6_29, ISBN 978-3-642-31673-9, オリジナルの2023-03-15時点におけるアーカイブ。 2022年11月23日閲覧。
^ McLean, Scott; Read, Gemma J. M.; Thompson, Jason; Baber, Chris; Stanton, Neville A.; Salmon, Paul M. (2023-07-04). “The risks associated with Artificial General Intelligence: A systematic review” (英語). Journal of Experimental & Theoretical Artificial Intelligence 35 (5): 649–663. Bibcode: 2023JETAI..35..649M. doi:10.1080/0952813X.2021.1964003. hdl:11343/289595. ISSN 0952-813X.
^ Wile, Rob (August 3, 2014). “Elon Musk: Artificial Intelligence Is 'Potentially More Dangerous Than Nukes'” (英語). Business Insider. 2024年2月22日閲覧。
^ Kuo, Kaiser (31 March 2015). Baidu CEO Robin Li interviews Bill Gates and Elon Musk at the Boao Forum, March 29, 2015. 該当時間: 55:49. 2022年11月23日時点のオリジナルよりアーカイブ。2022年11月23日閲覧。
^ Cellan-Jones, Rory (2014年12月2日). “Stephen Hawking warns artificial intelligence could end mankind”. BBC News. オリジナルの2015年10月30日時点におけるアーカイブ。 2022年11月23日閲覧。
^ Future of Life Institute (October 2016). “AI Research Grants Program”. Future of Life Institute. 2022年11月23日時点のオリジナルよりアーカイブ。2022年11月23日閲覧。
^ “SafArtInt 2016”. 2022年11月23日時点のオリジナルよりアーカイブ。2022年11月23日閲覧。
^ Bach, Deborah (2016年). “UW to host first of four White House public workshops on artificial intelligence”. UW News. 2022年11月23日時点のオリジナルよりアーカイブ。2022年11月23日閲覧。
^ Amodei, Dario; Olah, Chris; Steinhardt, Jacob; Christiano, Paul; Schulman, John; Mané, Dan (2016-07-25). Concrete Problems in AI Safety. arXiv:1606.06565.
^ Future of Life Institute. “AI Principles”. Future of Life Institute. 2022年11月23日時点のオリジナルよりアーカイブ。2022年11月23日閲覧。
^ Yohsua, Bengio; Daniel, Privitera; Tamay, Besiroglu; Rishi, Bommasani; Stephen, Casper; Yejin, Choi; Danielle, Goldfarb; Hoda, Heidari; Leila, Khalatbari (May 2024). International Scientific Report on the Safety of Advanced AI (Report). Department for Science, Innovation and Technology.
^ Research, DeepMind Safety (2018年9月27日). “Building safe artificial intelligence: specification, robustness, and assurance”. Medium. 2023年2月10日時点のオリジナルよりアーカイブ。2022年11月23日閲覧。
^ “SafeML ICLR 2019 Workshop”. 2022年11月23日時点のオリジナルよりアーカイブ。2022年11月23日閲覧。
^ Hendrycks, Dan; Carlini, Nicholas; Schulman, John; Steinhardt, Jacob (2022-06-16). Unsolved Problems in ML Safety. arXiv:2109.13916.
^ ^a ^b ^c ^d Browne, Ryan (2023年6月12日). “British Prime Minister Rishi Sunak pitches UK as home of A.I. safety regulation as London bids to be next Silicon Valley” (英語). CNBC. 2023年6月25日閲覧。
^ Bertuzzi, Luca (October 18, 2023). “UK's AI safety summit set to highlight risk of losing human control over 'frontier' models”. Euractiv March 2, 2024閲覧。
^ Bengio, Yoshua (2024年5月17日). “International Scientific Report on the Safety of Advanced AI”. GOV.UK. 2024年6月15日時点のオリジナルよりアーカイブ。2024年7月8日閲覧。
^ Shepardson, David (1 April 2024). “US, Britain announce partnership on AI safety, testing” 2 April 2024閲覧。
^ Hendrycks, Dan; Carlini, Nicholas; Schulman, John; Steinhardt, Jacob (2022-06-16). Unsolved Problems in ML Safety. arXiv:2109.13916.
^ Research, DeepMind Safety (2018年9月27日). “Building safe artificial intelligence: specification, robustness, and assurance”. Medium. 2023年2月10日時点のオリジナルよりアーカイブ。2022年11月23日閲覧。
^ ^a ^b ^c “Attacking Machine Learning with Adversarial Examples”. OpenAI (2017年2月24日). 2022年11月24日時点のオリジナルよりアーカイブ。2022年11月24日閲覧。
^ Kurakin, Alexey; Goodfellow, Ian; Bengio, Samy (2017-02-10). “Adversarial examples in the physical world”. ICLR. arXiv:1607.02533.
^ Madry, Aleksander; Makelov, Aleksandar; Schmidt, Ludwig; Tsipras, Dimitris; Vladu, Adrian (2019-09-04). “Towards Deep Learning Models Resistant to Adversarial Attacks”. ICLR. arXiv:1706.06083.
^ Kannan, Harini; Kurakin, Alexey; Goodfellow, Ian (2018-03-16). Adversarial Logit Pairing. arXiv:1803.06373.
^ Gilmer, Justin; Adams, Ryan P.; Goodfellow, Ian; Andersen, David; Dahl, George E. (2018-07-19). Motivating the Rules of the Game for Adversarial Example Research. arXiv:1807.06732.
^ Carlini, Nicholas; Wagner, David (2018-03-29). “Audio Adversarial Examples: Targeted Attacks on Speech-to-Text”. IEEE Security and Privacy Workshops. arXiv:1801.01944.
^ Sheatsley, Ryan; Papernot, Nicolas; Weisman, Michael; Verma, Gunjan; McDaniel, Patrick (2022-09-09). Adversarial Examples in Constrained Domains. arXiv:2011.01183.
^ Suciu, Octavian; Coull, Scott E.; Johns, Jeffrey (2019-04-13). “Exploring Adversarial Examples in Malware Detection”. IEEE Security and Privacy Workshops. arXiv:1810.08280.
^ Ouyang, Long; Wu, Jeff; Jiang, Xu; Almeida, Diogo; Wainwright, Carroll L.; Mishkin, Pamela; Zhang, Chong; Agarwal, Sandhini et al. (2022-03-04). “Training language models to follow instructions with human feedback”. NeurIPS. arXiv:2203.02155.
^ Gao, Leo; Schulman, John; Hilton, Jacob (2022-10-19). “Scaling Laws for Reward Model Overoptimization”. ICML. arXiv:2210.10760.
^ Yu, Sihyun; Ahn, Sungsoo; Song, Le; Shin, Jinwoo (2021-10-27). “RoMA: Robust Model Adaptation for Offline Model-based Optimization”. NeurIPS. arXiv:2110.14188.
^ ^a ^b Hendrycks, Dan; Mazeika, Mantas (2022-09-20). X-Risk Analysis for AI Research. arXiv:2206.05862.
^ Tran, Khoa A.; Kondrashova, Olga; Bradley, Andrew; Williams, Elizabeth D.; Pearson, John V.; Waddell, Nicola (2021). “Deep learning in cancer diagnosis, prognosis and treatment selection” (英語). Genome Medicine 13 (1): 152. doi:10.1186/s13073-021-00968-x. ISSN 1756-994X. PMC 8477474. PMID 34579788.
^ Guo, Chuan; Pleiss, Geoff; Sun, Yu; Weinberger, Kilian Q. (6 August 2017). "On calibration of modern neural networks". Proceedings of the 34th international conference on machine learning. Proceedings of machine learning research. Vol. 70. PMLR. pp. 1321–1330.
^ Ovadia, Yaniv; Fertig, Emily; Ren, Jie; Nado, Zachary; Sculley, D.; Nowozin, Sebastian; Dillon, Joshua V.; Lakshminarayanan, Balaji et al. (2019-12-17). “Can You Trust Your Model's Uncertainty? Evaluating Predictive Uncertainty Under Dataset Shift”. NeurIPS. arXiv:1906.02530.
^ Bogdoll, Daniel; Breitenstein, Jasmin; Heidecker, Florian; Bieshaar, Maarten; Sick, Bernhard; Fingscheidt, Tim; Zöllner, J. Marius (2021). “Description of Corner Cases in Automated Driving: Goals and Challenges”. 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW). pp. 1023–1028. arXiv:2109.09607. doi:10.1109/ICCVW54120.2021.00119. ISBN 978-1-6654-0191-3
^ Hendrycks, Dan; Mazeika, Mantas; Dietterich, Thomas (2019-01-28). “Deep Anomaly Detection with Outlier Exposure”. ICLR. arXiv:1812.04606.
^ Wang, Haoqi; Li, Zhizhong; Feng, Litong; Zhang, Wayne (2022-03-21). “ViM: Out-Of-Distribution with Virtual-logit Matching”. CVPR. arXiv:2203.10807.
^ Hendrycks, Dan; Gimpel, Kevin (2018-10-03). “A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks”. ICLR. arXiv:1610.02136.
^ Urbina, Fabio; Lentzos, Filippa; Invernizzi, Cédric; Ekins, Sean (2022). “Dual use of artificial-intelligence-powered drug discovery” (英語). Nature Machine Intelligence 4 (3): 189–191. doi:10.1038/s42256-022-00465-9. ISSN 2522-5839. PMC 9544280. PMID 36211133.
^ Center for Security and Emerging Technology; Buchanan, Ben; Lohn, Andrew; Musser, Micah; Sedova, Katerina (2021). Truth, Lies, and Automation: How Language Models Could Change Disinformation. doi:10.51593/2021ca003. オリジナルの2022-11-24時点におけるアーカイブ。 2022年11月28日閲覧。.
^ “Propaganda-as-a-service may be on the horizon if large language models are abused”. VentureBeat (2021年12月14日). 2022年11月24日時点のオリジナルよりアーカイブ。2022年11月24日閲覧。
^ Center for Security and Emerging Technology; Buchanan, Ben; Bansemer, John; Cary, Dakota; Lucas, Jack; Musser, Micah (2020). Automating Cyber Attacks: Hype and Reality. doi:10.51593/2020ca002. オリジナルの2022-11-24時点におけるアーカイブ。 2022年11月28日閲覧。.
^ “Lessons Learned on Language Model Safety and Misuse”. OpenAI (2022年3月3日). 2022年11月24日時点のオリジナルよりアーカイブ。2022年11月24日閲覧。
^ “New-and-Improved Content Moderation Tooling”. OpenAI (2022年8月10日). 2023年1月11日時点のオリジナルよりアーカイブ。2022年11月24日閲覧。
^ ^a ^b Savage, Neil (2022-03-29). “Breaking into the black box of artificial intelligence”. Nature. doi:10.1038/d41586-022-00858-1. PMID 35352042. オリジナルの2022-11-24時点におけるアーカイブ。 2022年11月24日閲覧。.
^ Center for Security and Emerging Technology; Rudner, Tim; Toner, Helen (2021). “Key Concepts in AI Safety: Interpretability in Machine Learning”. PLoS ONE. doi:10.51593/20190042. オリジナルの2022-11-24時点におけるアーカイブ。 2022年11月28日閲覧。.
^ McFarland, Matt (2018年3月19日). “Uber pulls self-driving cars after first fatal crash of autonomous vehicle”. CNNMoney. 2022年11月24日時点のオリジナルよりアーカイブ。2022年11月24日閲覧。
^ Felder, Ryan Marshall (July 2021). “Coming to Terms with the Black Box Problem: How to Justify AI Systems in Health Care” (英語). Hastings Center Report 51 (4): 38–45. doi:10.1002/hast.1248. ISSN 0093-0334. PMID 33821471.
^ ^a ^b Doshi-Velez, Finale; Kortz, Mason; Budish, Ryan; Bavitz, Chris; Gershman, Sam; O'Brien, David; Scott, Kate; Schieber, Stuart et al. (2019-12-20). Accountability of AI Under the Law: The Role of Explanation. arXiv:1711.01134.
^ Doshi-Velez, Finale; Kortz, Mason; Budish, Ryan; Bavitz, Chris; Gershman, Sam; O'Brien, David; Scott, Kate; Schieber, Stuart et al. (2019-12-20). Accountability of AI Under the Law: The Role of Explanation. arXiv:1711.01134.
^ Meng, Kevin; Bau, David; Andonian, Alex; Belinkov, Yonatan (2022). “Locating and editing factual associations in GPT”. Advances in Neural Information Processing Systems 35. arXiv:2202.05262.
^ Bau, David; Liu, Steven; Wang, Tongzhou; Zhu, Jun-Yan; Torralba, Antonio (2020-07-30). “Rewriting a Deep Generative Model”. ECCV. arXiv:2007.15646.
^ Räuker, Tilman; Ho, Anson; Casper, Stephen; Hadfield-Menell, Dylan (2022-09-05). “Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks”. IEEE SaTML. arXiv:2207.13243.
^ Bau, David; Zhou, Bolei; Khosla, Aditya; Oliva, Aude; Torralba, Antonio (2017-04-19). “Network Dissection: Quantifying Interpretability of Deep Visual Representations”. CVPR. arXiv:1704.05796.
^ McGrath, Thomas; Kapishnikov, Andrei; Tomašev, Nenad; Pearce, Adam; Wattenberg, Martin; Hassabis, Demis; Kim, Been; Paquet, Ulrich et al. (2022-11-22). “Acquisition of chess knowledge in AlphaZero” (英語). Proceedings of the National Academy of Sciences 119 (47): e2206625119. arXiv:2111.09259. Bibcode: 2022PNAS..11906625M. doi:10.1073/pnas.2206625119. ISSN 0027-8424. PMC 9704706. PMID 36375061.
^ Goh, Gabriel; Cammarata, Nick; Voss, Chelsea; Carter, Shan; Petrov, Michael; Schubert, Ludwig; Radford, Alec; Olah, Chris (2021). “Multimodal neurons in artificial neural networks”. Distill 6 (3). doi:10.23915/distill.00030.
^ Olah, Chris; Cammarata, Nick; Schubert, Ludwig; Goh, Gabriel; Petrov, Michael; Carter, Shan (2020). “Zoom in: An introduction to circuits”. Distill 5 (3). doi:10.23915/distill.00024.001.
^ Cammarata, Nick; Goh, Gabriel; Carter, Shan; Voss, Chelsea; Schubert, Ludwig; Olah, Chris (2021). “Curve circuits”. Distill 6 (1). doi:10.23915/distill.00024.006. オリジナルの5 December 2022時点におけるアーカイブ。 5 December 2022閲覧。.
^ Olsson, Catherine; Elhage, Nelson; Nanda, Neel; Joseph, Nicholas; DasSarma, Nova; Henighan, Tom; Mann, Ben; Askell, Amanda et al. (2022). “In-context learning and induction heads”. Transformer Circuits Thread. arXiv:2209.11895.
^ Olah, Christopher. “Interpretability vs Neuroscience [rough note]”. 2022年11月24日時点のオリジナルよりアーカイブ。2022年11月24日閲覧。
^ Gu, Tianyu; Dolan-Gavitt, Brendan; Garg, Siddharth (2019-03-11). BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain. arXiv:1708.06733.
^ Chen, Xinyun; Liu, Chang; Li, Bo; Lu, Kimberly; Song, Dawn (2017-12-14). Targeted Backdoor Attacks on Deep Learning Systems Using Data Poisoning. arXiv:1712.05526.
^ Carlini, Nicholas; Terzis, Andreas (2022-03-28). “Poisoning and Backdooring Contrastive Learning”. ICLR. arXiv:2106.09667.
^ ^a ^b ^c ^d Russell, Stuart J.; Norvig, Peter (2020). Artificial intelligence: A modern approach (4th ed.). Pearson. pp. 31-34. ISBN 978-1-292-40113-3. OCLC 1303900751. オリジナルのJuly 15, 2022時点におけるアーカイブ。 September 12, 2022閲覧。
^ Pan, Alexander; Bhatia, Kush; Steinhardt, Jacob (14 February 2022). The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models. International Conference on Learning Representations. 2022年7月21日閲覧。
^ Zhuang, Simon; Hadfield-Menell, Dylan (2020). "Consequences of Misaligned AI". Advances in Neural Information Processing Systems. Vol. 33. Curran Associates, Inc. pp. 15763–15773. 2023年3月11日閲覧。
^ Carlsmith, Joseph (16 June 2022). "Is Power-Seeking AI an Existential Risk?". arXiv:2206.13353 [cs.CY]。
^ ^a ^b Russell, Stuart J. (2020). Human compatible: Artificial intelligence and the problem of control. Penguin Random House. ISBN 9780525558637. OCLC 1113410915
^ Christian, Brian (2020). The alignment problem: Machine learning and human values. W. W. Norton & Company. ISBN 978-0-393-86833-3. OCLC 1233266753. オリジナルのFebruary 10, 2023時点におけるアーカイブ。 September 12, 2022閲覧。
^ Langosco, Lauro Langosco Di; Koch, Jack; Sharkey, Lee D.; Pfau, Jacob; Krueger, David (28 June 2022). "Goal Misgeneralization in Deep Reinforcement Learning". Proceedings of the 39th International Conference on Machine Learning. International Conference on Machine Learning. PMLR. pp. 12004–12019. 2023年3月11日閲覧。
^ Bommasani, Rishi; Hudson, Drew A.; Adeli, Ehsan; Altman, Russ; Arora, Simran; von Arx, Sydney; Bernstein, Michael S.; Bohg, Jeannette et al. (2022-07-12). “On the Opportunities and Risks of Foundation Models”. Stanford CRFM. arXiv:2108.07258.
^ Ouyang, Long; Wu, Jeff; Jiang, Xu; Almeida, Diogo; Wainwright, Carroll L.; Mishkin, Pamela; Zhang, Chong; Agarwal, Sandhini; Slama, Katarina; Ray, Alex; Schulman, J.; Hilton, Jacob; Kelton, Fraser; Miller, Luke E.; Simens, Maddie; Askell, Amanda; Welinder, P.; Christiano, P.; Leike, J.; Lowe, Ryan J. (2022). "Training language models to follow instructions with human feedback". arXiv:2203.02155 [cs.CL]。
^ “OpenAI Codex”. OpenAI (2021年8月10日). February 3, 2023時点のオリジナルよりアーカイブ。2022年7月23日閲覧。
^ Kober, Jens; Bagnell, J. Andrew; Peters, Jan (2013-09-01). “Reinforcement learning in robotics: A survey” (英語). The International Journal of Robotics Research 32 (11): 1238–1274. doi:10.1177/0278364913495721. ISSN 0278-3649. オリジナルのOctober 15, 2022時点におけるアーカイブ。 September 12, 2022閲覧。.
^ Knox, W. Bradley; Allievi, Alessandro; Banzhaf, Holger; Schmitt, Felix; Stone, Peter (2023-03-01). “Reward (Mis)design for autonomous driving” (英語). Artificial Intelligence 316: 103829. doi:10.1016/j.artint.2022.103829. ISSN 0004-3702.
^ Stray, Jonathan (2020). “Aligning AI Optimization to Community Well-Being” (英語). International Journal of Community Well-Being 3 (4): 443–463. doi:10.1007/s42413-020-00086-3. ISSN 2524-5295. PMC 7610010. PMID 34723107.
^ Russell, Stuart; Norvig, Peter (2009). Artificial Intelligence: A Modern Approach. Prentice Hall. pp. 1010. ISBN 978-0-13-604259-4. https://aima.cs.berkeley.edu/
^ Ngo, Richard; Chan, Lawrence; Mindermann, Sören (22 February 2023). "The alignment problem from a deep learning perspective". arXiv:2209.00626 [cs.AI]。
^ Smith, Craig S.. “Geoff Hinton, AI's Most Famous Researcher, Warns Of 'Existential Threat'” (英語). Forbes. 2023年5月4日閲覧。
^ Future of Life Institute (11 August 2017). “Asilomar AI Principles”. Future of Life Institute. Archived from the original on October 10, 2022. Retrieved 18 July 2022. The AI principles created at the Asilomar Conference on Beneficial AI were signed by 1797 AI/robotics researchers.
- United Nations (2021). Our Common Agenda: Report of the Secretary-General (PDF) (Report). New York: United Nations. Archived (PDF) from the original on 22 May 2022. Retrieved 12 September 2022. [T]he [UN] could also promote regulation of artificial intelligence to ensure that this is aligned with shared global values.
^ Amodei, Dario; Olah, Chris; Steinhardt, Jacob; Christiano, Paul; Schulman, John; Mané, Dan (21 June 2016). "Concrete Problems in AI Safety" (英語). arXiv:1606.06565 [cs.AI]。
^ “Building safe artificial intelligence: specification, robustness, and assurance”. DeepMind Safety Research – Medium (2018年9月27日). February 10, 2023時点のオリジナルよりアーカイブ。2022年7月18日閲覧。
^ ^a ^b Rorvig, Mordechai (2022年4月14日). “Researchers Gain New Understanding From Simple AI”. Quanta Magazine. February 10, 2023時点のオリジナルよりアーカイブ。2022年7月18日閲覧。
^ Doshi-Velez, Finale; Kim, Been (2 March 2017). "Towards A Rigorous Science of Interpretable Machine Learning". arXiv:1702.08608 [stat.ML].
- Wiblin, Robert (4 August 2021). "Chris Olah on what the hell is going on inside neural networks" (Podcast). 80,000 hours. No. 107. Retrieved 23 July 2022.
^ Russell, Stuart; Dewey, Daniel; Tegmark, Max (2015-12-31). “Research Priorities for Robust and Beneficial Artificial Intelligence”. AI Magazine 36 (4): 105–114. doi:10.1609/aimag.v36i4.2577. hdl:1721.1/108478. ISSN 2371-9621. オリジナルのFebruary 2, 2023時点におけるアーカイブ。 September 12, 2022閲覧。.
^ Wirth, Christian; Akrour, Riad; Neumann, Gerhard; Fürnkranz, Johannes (2017). “A survey of preference-based reinforcement learning methods”. Journal of Machine Learning Research 18 (136): 1–46.
^ Christiano, Paul F.; Leike, Jan; Brown, Tom B.; Martic, Miljan; Legg, Shane; Amodei, Dario (2017). "Deep reinforcement learning from human preferences". Proceedings of the 31st International Conference on Neural Information Processing Systems. NIPS'17. Red Hook, NY, USA: Curran Associates Inc. pp. 4302–4310. ISBN 978-1-5108-6096-4。
^ Heaven, Will Douglas (2022年1月27日). “The new version of GPT-3 is much better behaved (and should be less toxic)”. MIT Technology Review. February 10, 2023時点のオリジナルよりアーカイブ。2022年7月18日閲覧。
^ Mohseni, Sina; Wang, Haotao; Yu, Zhiding; Xiao, Chaowei; Wang, Zhangyang; Yadawa, Jay (7 March 2022). "Taxonomy of Machine Learning Safety: A Survey and Primer". arXiv:2106.04823 [cs.LG]。
^
Clifton, Jesse (2020年). “Cooperation, Conflict, and Transformative Artificial Intelligence: A Research Agenda”. Center on Long-Term Risk. January 1, 2023時点のオリジナルよりアーカイブ。2022年7月18日閲覧。
- Dafoe, Allan; Bachrach, Yoram; Hadfield, Gillian; Horvitz, Eric; Larson, Kate; Graepel, Thore (2021-05-06). “Cooperative AI: machines must learn to find common ground” (英語). Nature 593 (7857): 33–36. Bibcode: 2021Natur.593...33D. doi:10.1038/d41586-021-01170-0. ISSN 0028-0836. PMID 33947992. オリジナルのDecember 18, 2022時点におけるアーカイブ。 September 12, 2022閲覧。.
^ Prunkl, Carina; Whittlestone, Jess (2020-02-07). “Beyond Near- and Long-Term: Towards a Clearer Account of Research Priorities in AI Ethics and Society” (英語). Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society (New York NY USA: ACM): 138–143. doi:10.1145/3375627.3375803. ISBN 978-1-4503-7110-0. オリジナルのOctober 16, 2022時点におけるアーカイブ。 September 12, 2022閲覧。.
^ Irving, Geoffrey; Askell, Amanda (2019-02-19). “AI Safety Needs Social Scientists”. Distill 4 (2). doi:10.23915/distill.00014. ISSN 2476-0757. オリジナルのFebruary 10, 2023時点におけるアーカイブ。 September 12, 2022閲覧。.
^ ^a ^b ^c ^d “Thinking About Risks From AI: Accidents, Misuse and Structure”. Lawfare (2019年2月11日). 2023年8月19日時点のオリジナルよりアーカイブ。2022年11月24日閲覧。
^ Zhang, Yingyu; Dong, Chuntong; Guo, Weiqun; Dai, Jiabao; Zhao, Ziming (2022). “Systems theoretic accident model and process (STAMP): A literature review” (英語). Safety Science 152: 105596. doi:10.1016/j.ssci.2021.105596. オリジナルの2023-03-15時点におけるアーカイブ。 2022年11月28日閲覧。.
^ Center for Security and Emerging Technology; Hoffman, Wyatt (2021). “AI and the Future of Cyber Competition”. CSET Issue Brief. doi:10.51593/2020ca007. オリジナルの2022-11-24時点におけるアーカイブ。 2022年11月28日閲覧。.
^ Gafni, Ruti; Levy, Yair (2024-01-01). “The role of artificial intelligence (AI) in improving technical and managerial cybersecurity tasks’ efficiency”. Information & Computer Security ahead-of-print (ahead-of-print). doi:10.1108/ICS-04-2024-0102. ISSN 2056-4961.
^ Center for Security and Emerging Technology; Imbrie, Andrew; Kania, Elsa (2019). AI Safety, Security, and Stability Among Great Powers: Options, Challenges, and Lessons Learned for Pragmatic Engagement. doi:10.51593/20190051. オリジナルの2022-11-24時点におけるアーカイブ。 2022年11月28日閲覧。.
^ Future of Life Institute (27 March 2019). AI Strategy, Policy, and Governance (Allan Dafoe). 該当時間: 22:05. 2022年11月23日時点のオリジナルよりアーカイブ。2022年11月23日閲覧。
^ Zou, Andy; Xiao, Tristan; Jia, Ryan; Kwon, Joe; Mazeika, Mantas; Li, Richard; Song, Dawn; Steinhardt, Jacob et al. (2022-10-09). “Forecasting Future World Events with Neural Networks”. NeurIPS. arXiv:2206.15474.
^ Gathani, Sneha; Hulsebos, Madelon; Gale, James; Haas, Peter J.; Demiralp, Çağatay (2022-02-08). “Augmenting Decision Making via Interactive What-If Analysis”. Conference on Innovative Data Systems Research. arXiv:2109.06160.
^ Lindelauf, Roy (2021), Osinga, Frans; Sweijs, Tim, eds., “Nuclear Deterrence in the Algorithmic Age: Game Theory Revisited” (英語), NL ARMS Netherlands Annual Review of Military Studies 2020, Nl Arms (The Hague: T.M.C. Asser Press): pp. 421–436, doi:10.1007/978-94-6265-419-8_22, ISBN 978-94-6265-418-1
^ Newkirk II, Vann R. (2016年4月21日). “Is Climate Change a Prisoner's Dilemma or a Stag Hunt?”. The Atlantic. 2022年11月24日時点のオリジナルよりアーカイブ。2022年11月24日閲覧。
^ ^a ^b Newkirk II, Vann R. (2016年4月21日). “Is Climate Change a Prisoner's Dilemma or a Stag Hunt?”. The Atlantic. 2022年11月24日時点のオリジナルよりアーカイブ。2022年11月24日閲覧。
^ Dafoe, Allan. AI Governance: A Research Agenda (Report). Centre for the Governance of AI, Future of Humanity Institute, University of Oxford.
^ Dafoe, Allan; Hughes, Edward; Bachrach, Yoram; Collins, Tantum; McKee, Kevin R.; Leibo, Joel Z.; Larson, Kate; Graepel, Thore (2020-12-15). “Open Problems in Cooperative AI”. NeurIPS. arXiv:2012.08630.
^ ^a ^b Dafoe, Allan; Bachrach, Yoram; Hadfield, Gillian; Horvitz, Eric; Larson, Kate; Graepel, Thore (2021). “Cooperative AI: machines must learn to find common ground”. Nature 593 (7857): 33–36. Bibcode: 2021Natur.593...33D. doi:10.1038/d41586-021-01170-0. PMID 33947992. オリジナルの2022-11-22時点におけるアーカイブ。 2022年11月24日閲覧。.
^ Bender, E.M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021). On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜. FAccT '21: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 610-623. https://doi.org/10.1145/3442188.3445922.
^ Strubell, E., Ganesh, A., & McCallum, A. (2019). Energy and Policy Considerations for Deep Learning in NLP. arXiv preprint arXiv:1906.02243.
^ Schwartz, R., Dodge, J., Smith, N.A., & Etzioni, O. (2020). Green AI. Communications of the ACM, 63(12), 54-63. https://doi.org/10.1145/3381831.

[1] Perrigo, Billy (2023-11-02). “U.K.'s AI Safety Summit Ends With Limited, but Meaningful, Progress” (英語). Time. 2024年6月2日閲覧。

[2] De-Arteaga, Maria (13 May 2020). Machine Learning in High-Stakes Settings: Risks and Opportunities (PhD). Carnegie Mellon University.

[3] Mehrabi, Ninareh; Morstatter, Fred; Saxena, Nripsuta; Lerman, Kristina; Galstyan, Aram (2021). “A Survey on Bias and Fairness in Machine Learning” (英語). ACM Computing Surveys 54 (6): 1–35. arXiv:1908.09635. doi:10.1145/3457607. ISSN 0360-0300. オリジナルの2022-11-23時点におけるアーカイブ。 2022年11月28日閲覧。.

[4] Feldstein, Steven (2019). The Global Expansion of AI Surveillance (Report). Carnegie Endowment for International Peace.

[5] Barnes, Beth (2021). “Risks from AI persuasion”. Lesswrong. オリジナルの2022-11-23時点におけるアーカイブ。 2022年11月23日閲覧。.

[6] Brundage, Miles; Avin, Shahar; Clark, Jack; Toner, Helen; Eckersley, Peter; Garfinkel, Ben; Dafoe, Allan; Scharre, Paul et al. (2018-04-30). The Malicious Use of Artificial Intelligence: Forecasting, Prevention, and Mitigation. Apollo - University of Cambridge Repository. doi:10.17863/cam.22520. オリジナルの2022-11-23時点におけるアーカイブ。 2022年11月28日閲覧。.

[7] Davies, Pascale (December 26, 2022). “How NATO is preparing for a new era of AI cyber attacks” (英語). euronews. 2024年3月23日閲覧。

[8] Ahuja, Anjana (February 7, 2024). “AI's bioterrorism potential should not be ruled out”. Financial Times. 2024年3月23日閲覧。

[9] Carlsmith, Joseph (2022-06-16). Is Power-Seeking AI an Existential Risk?. arXiv:2206.13353.

[10] Minardi, Di (16 October 2020). “The grim fate that could be 'worse than extinction'”. BBC. 2024年3月23日閲覧。

[11] Carlsmith, Joseph (16 June 2022). "Is Power-Seeking AI an Existential Risk?". arXiv:2206.13353 [cs.CY]。

[12] Taylor, Chloe (May 2, 2023). “'The Godfather of A.I.' warns of 'nightmare scenario' where artificial intelligence begins to seek power” (英語). Fortune. 2024年9月1日閲覧。

[13] “AGI Expert Peter Voss Says AI Alignment Problem is Bogus | NextBigFuture.com” (英語) (2023年4月4日). 2023年7月23日閲覧。

[14] Dafoe, Allan (2016年). “Yes, We Are Worried About the Existential Risk of Artificial Intelligence”. MIT Technology Review. 2022年11月28日時点のオリジナルよりアーカイブ。2022年11月28日閲覧。

[15] Grace, Katja; Salvatier, John; Dafoe, Allan; Zhang, Baobao; Evans, Owain (2018-07-31). “Viewpoint: When Will AI Exceed Human Performance? Evidence from AI Experts”. Journal of Artificial Intelligence Research 62: 729–754. doi:10.1613/jair.1.11222. ISSN 1076-9757. オリジナルの2023-02-10時点におけるアーカイブ。 2022年11月28日閲覧。.

[16] Zhang, Baobao; Anderljung, Markus; Kahn, Lauren; Dreksler, Noemi; Horowitz, Michael C.; Dafoe, Allan (2021-05-05). “Ethics and Governance of Artificial Intelligence: Evidence from a Survey of Machine Learning Researchers”. Journal of Artificial Intelligence Research 71. arXiv:2105.02117. doi:10.1613/jair.1.12895.

[17] “2022 Expert Survey on Progress in AI”. AI Impacts (2022年8月4日). 2022年11月23日時点のオリジナルよりアーカイブ。2022年11月23日閲覧。

[18] Grace, Katja; Salvatier, John; Dafoe, Allan; Zhang, Baobao; Evans, Owain (2018-07-31). “Viewpoint: When Will AI Exceed Human Performance? Evidence from AI Experts”. Journal of Artificial Intelligence Research 62: 729–754. doi:10.1613/jair.1.11222. ISSN 1076-9757. オリジナルの2023-02-10時点におけるアーカイブ。 2022年11月28日閲覧。.

[19] Michael, Julian; Holtzman, Ari; Parrish, Alicia; Mueller, Aaron; Wang, Alex; Chen, Angelica; Madaan, Divyam; Nangia, Nikita et al. (2022-08-26). “What Do NLP Researchers Believe? Results of the NLP Community Metasurvey”. Association for Computational Linguistics. arXiv:2208.12852.

[20] Markoff, John (2013年5月20日). “In 1949, He Imagined an Age of Robots”. The New York Times. ISSN 0362-4331. オリジナルの2022年11月23日時点におけるアーカイブ。 2022年11月23日閲覧。

[21] Association for the Advancement of Artificial Intelligence. “AAAI Presidential Panel on Long-Term AI Futures”. 2022年9月1日時点のオリジナルよりアーカイブ。2022年11月23日閲覧。

[22] “PT-AI 2011 – Philosophy and Theory of Artificial Intelligence (PT-AI 2011)”. 2022年11月23日時点のオリジナルよりアーカイブ。2022年11月23日閲覧。

[23] Yampolskiy, Roman V. (2013), Müller, Vincent C., ed., “Artificial Intelligence Safety Engineering: Why Machine Ethics is a Wrong Approach”, Philosophy and Theory of Artificial Intelligence, Studies in Applied Philosophy, Epistemology and Rational Ethics (Berlin; Heidelberg, Germany: Springer Berlin Heidelberg) 5: pp. 389–396, doi:10.1007/978-3-642-31674-6_29, ISBN 978-3-642-31673-9, オリジナルの2023-03-15時点におけるアーカイブ。 2022年11月23日閲覧。

[24] McLean, Scott; Read, Gemma J. M.; Thompson, Jason; Baber, Chris; Stanton, Neville A.; Salmon, Paul M. (2023-07-04). “The risks associated with Artificial General Intelligence: A systematic review” (英語). Journal of Experimental & Theoretical Artificial Intelligence 35 (5): 649–663. Bibcode: 2023JETAI..35..649M. doi:10.1080/0952813X.2021.1964003. hdl:11343/289595. ISSN 0952-813X.

[25] Wile, Rob (August 3, 2014). “Elon Musk: Artificial Intelligence Is 'Potentially More Dangerous Than Nukes'” (英語). Business Insider. 2024年2月22日閲覧。

[26] Kuo, Kaiser (31 March 2015). Baidu CEO Robin Li interviews Bill Gates and Elon Musk at the Boao Forum, March 29, 2015. 該当時間: 55:49. 2022年11月23日時点のオリジナルよりアーカイブ。2022年11月23日閲覧。

[27] Cellan-Jones, Rory (2014年12月2日). “Stephen Hawking warns artificial intelligence could end mankind”. BBC News. オリジナルの2015年10月30日時点におけるアーカイブ。 2022年11月23日閲覧。

[28] Future of Life Institute (October 2016). “AI Research Grants Program”. Future of Life Institute. 2022年11月23日時点のオリジナルよりアーカイブ。2022年11月23日閲覧。

[29] “SafArtInt 2016”. 2022年11月23日時点のオリジナルよりアーカイブ。2022年11月23日閲覧。

[30] Bach, Deborah (2016年). “UW to host first of four White House public workshops on artificial intelligence”. UW News. 2022年11月23日時点のオリジナルよりアーカイブ。2022年11月23日閲覧。

[31] Amodei, Dario; Olah, Chris; Steinhardt, Jacob; Christiano, Paul; Schulman, John; Mané, Dan (2016-07-25). Concrete Problems in AI Safety. arXiv:1606.06565.

[32] Future of Life Institute. “AI Principles”. Future of Life Institute. 2022年11月23日時点のオリジナルよりアーカイブ。2022年11月23日閲覧。

[33] Bengio, Yoshua; Privitera, Daniel; Besiroglu, Tamay; Bommasani, Rishi; Casper, Stephen; Choi, Yejin; Goldfarb, Danielle; Heidari, Hoda; Khalatbari, Leila (May 2024). International Scientific Report on the Safety of Advanced AI (Report). Department for Science, Innovation and Technology.

[34] DeepMind Safety Research (2018年9月27日). “Building safe artificial intelligence: specification, robustness, and assurance”. Medium. 2023年2月10日時点のオリジナルよりアーカイブ。2022年11月23日閲覧。

[35] “SafeML ICLR 2019 Workshop”. 2022年11月23日時点のオリジナルよりアーカイブ。2022年11月23日閲覧。

[36] Hendrycks, Dan; Carlini, Nicholas; Schulman, John; Steinhardt, Jacob (2022-06-16). Unsolved Problems in ML Safety. arXiv:2109.13916.

[37] Browne, Ryan (2023年6月12日). “British Prime Minister Rishi Sunak pitches UK as home of A.I. safety regulation as London bids to be next Silicon Valley” (英語). CNBC. 2023年6月25日閲覧。

[38] Bertuzzi, Luca (October 18, 2023). “UK's AI safety summit set to highlight risk of losing human control over 'frontier' models”. Euractiv. March 2, 2024閲覧。

[39] Bengio, Yoshua (2024年5月17日). “International Scientific Report on the Safety of Advanced AI”. GOV.UK. 2024年6月15日時点のオリジナルよりアーカイブ。2024年7月8日閲覧。

[40] Shepardson, David (1 April 2024). “US, Britain announce partnership on AI safety, testing”. 2 April 2024閲覧。

[41] Hendrycks, Dan; Carlini, Nicholas; Schulman, John; Steinhardt, Jacob (2022-06-16). Unsolved Problems in ML Safety. arXiv:2109.13916.

[42] DeepMind Safety Research (2018年9月27日). “Building safe artificial intelligence: specification, robustness, and assurance”. Medium. 2023年2月10日時点のオリジナルよりアーカイブ。2022年11月23日閲覧。

[43] “Attacking Machine Learning with Adversarial Examples”. OpenAI (2017年2月24日). 2022年11月24日時点のオリジナルよりアーカイブ。2022年11月24日閲覧。

[44] Kurakin, Alexey; Goodfellow, Ian; Bengio, Samy (2017-02-10). “Adversarial examples in the physical world”. ICLR. arXiv:1607.02533.

[45] Madry, Aleksander; Makelov, Aleksandar; Schmidt, Ludwig; Tsipras, Dimitris; Vladu, Adrian (2019-09-04). “Towards Deep Learning Models Resistant to Adversarial Attacks”. ICLR. arXiv:1706.06083.

[46] Kannan, Harini; Kurakin, Alexey; Goodfellow, Ian (2018-03-16). Adversarial Logit Pairing. arXiv:1803.06373.

[47] Gilmer, Justin; Adams, Ryan P.; Goodfellow, Ian; Andersen, David; Dahl, George E. (2018-07-19). Motivating the Rules of the Game for Adversarial Example Research. arXiv:1807.06732.

[48] Carlini, Nicholas; Wagner, David (2018-03-29). “Audio Adversarial Examples: Targeted Attacks on Speech-to-Text”. IEEE Security and Privacy Workshops. arXiv:1801.01944.

[49] Sheatsley, Ryan; Papernot, Nicolas; Weisman, Michael; Verma, Gunjan; McDaniel, Patrick (2022-09-09). Adversarial Examples in Constrained Domains. arXiv:2011.01183.

[50] Suciu, Octavian; Coull, Scott E.; Johns, Jeffrey (2019-04-13). “Exploring Adversarial Examples in Malware Detection”. IEEE Security and Privacy Workshops. arXiv:1810.08280.

[51] Ouyang, Long; Wu, Jeff; Jiang, Xu; Almeida, Diogo; Wainwright, Carroll L.; Mishkin, Pamela; Zhang, Chong; Agarwal, Sandhini et al. (2022-03-04). “Training language models to follow instructions with human feedback”. NeurIPS. arXiv:2203.02155.

[52] Gao, Leo; Schulman, John; Hilton, Jacob (2022-10-19). “Scaling Laws for Reward Model Overoptimization”. ICML. arXiv:2210.10760.

[53] Yu, Sihyun; Ahn, Sungsoo; Song, Le; Shin, Jinwoo (2021-10-27). “RoMA: Robust Model Adaptation for Offline Model-based Optimization”. NeurIPS. arXiv:2110.14188.

[54] Hendrycks, Dan; Mazeika, Mantas (2022-09-20). X-Risk Analysis for AI Research. arXiv:2206.05862.

[55] Tran, Khoa A.; Kondrashova, Olga; Bradley, Andrew; Williams, Elizabeth D.; Pearson, John V.; Waddell, Nicola (2021). “Deep learning in cancer diagnosis, prognosis and treatment selection” (英語). Genome Medicine 13 (1): 152. doi:10.1186/s13073-021-00968-x. ISSN 1756-994X. PMC 8477474. PMID 34579788.

[56] Guo, Chuan; Pleiss, Geoff; Sun, Yu; Weinberger, Kilian Q. (6 August 2017). "On calibration of modern neural networks". Proceedings of the 34th international conference on machine learning. Proceedings of machine learning research. Vol. 70. PMLR. pp. 1321–1330.

[57] Ovadia, Yaniv; Fertig, Emily; Ren, Jie; Nado, Zachary; Sculley, D.; Nowozin, Sebastian; Dillon, Joshua V.; Lakshminarayanan, Balaji et al. (2019-12-17). “Can You Trust Your Model's Uncertainty? Evaluating Predictive Uncertainty Under Dataset Shift”. NeurIPS. arXiv:1906.02530.

[58] Bogdoll, Daniel; Breitenstein, Jasmin; Heidecker, Florian; Bieshaar, Maarten; Sick, Bernhard; Fingscheidt, Tim; Zöllner, J. Marius (2021). “Description of Corner Cases in Automated Driving: Goals and Challenges”. 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW). pp. 1023–1028. arXiv:2109.09607. doi:10.1109/ICCVW54120.2021.00119. ISBN 978-1-6654-0191-3

[59] Hendrycks, Dan; Mazeika, Mantas; Dietterich, Thomas (2019-01-28). “Deep Anomaly Detection with Outlier Exposure”. ICLR. arXiv:1812.04606.

[60] Wang, Haoqi; Li, Zhizhong; Feng, Litong; Zhang, Wayne (2022-03-21). “ViM: Out-Of-Distribution with Virtual-logit Matching”. CVPR. arXiv:2203.10807.

[61] Hendrycks, Dan; Gimpel, Kevin (2018-10-03). “A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks”. ICLR. arXiv:1610.02136.

[62] Urbina, Fabio; Lentzos, Filippa; Invernizzi, Cédric; Ekins, Sean (2022). “Dual use of artificial-intelligence-powered drug discovery” (英語). Nature Machine Intelligence 4 (3): 189–191. doi:10.1038/s42256-022-00465-9. ISSN 2522-5839. PMC 9544280. PMID 36211133.

[63] Center for Security and Emerging Technology; Buchanan, Ben; Lohn, Andrew; Musser, Micah; Sedova, Katerina (2021). Truth, Lies, and Automation: How Language Models Could Change Disinformation. doi:10.51593/2021ca003. オリジナルの2022-11-24時点におけるアーカイブ。 2022年11月28日閲覧。.

[64] “Propaganda-as-a-service may be on the horizon if large language models are abused”. VentureBeat (2021年12月14日). 2022年11月24日時点のオリジナルよりアーカイブ。2022年11月24日閲覧。

[65] Center for Security and Emerging Technology; Buchanan, Ben; Bansemer, John; Cary, Dakota; Lucas, Jack; Musser, Micah (2020). Automating Cyber Attacks: Hype and Reality. doi:10.51593/2020ca002. オリジナルの2022-11-24時点におけるアーカイブ。 2022年11月28日閲覧。.

[66] “Lessons Learned on Language Model Safety and Misuse”. OpenAI (2022年3月3日). 2022年11月24日時点のオリジナルよりアーカイブ。2022年11月24日閲覧。

[67] “New-and-Improved Content Moderation Tooling”. OpenAI (2022年8月10日). 2023年1月11日時点のオリジナルよりアーカイブ。2022年11月24日閲覧。

[68] Savage, Neil (2022-03-29). “Breaking into the black box of artificial intelligence”. Nature. doi:10.1038/d41586-022-00858-1. PMID 35352042. オリジナルの2022-11-24時点におけるアーカイブ。 2022年11月24日閲覧。.

[69] Center for Security and Emerging Technology; Rudner, Tim; Toner, Helen (2021). “Key Concepts in AI Safety: Interpretability in Machine Learning”. CSET Issue Brief. doi:10.51593/20190042. オリジナルの2022-11-24時点におけるアーカイブ。 2022年11月28日閲覧。.

[70] McFarland, Matt (2018年3月19日). “Uber pulls self-driving cars after first fatal crash of autonomous vehicle”. CNNMoney. 2022年11月24日時点のオリジナルよりアーカイブ。2022年11月24日閲覧。

[71] Felder, Ryan Marshall (July 2021). “Coming to Terms with the Black Box Problem: How to Justify AI Systems in Health Care” (英語). Hastings Center Report 51 (4): 38–45. doi:10.1002/hast.1248. ISSN 0093-0334. PMID 33821471.

[72] Doshi-Velez, Finale; Kortz, Mason; Budish, Ryan; Bavitz, Chris; Gershman, Sam; O'Brien, David; Scott, Kate; Schieber, Stuart et al. (2019-12-20). Accountability of AI Under the Law: The Role of Explanation. arXiv:1711.01134.

[73] Doshi-Velez, Finale; Kortz, Mason; Budish, Ryan; Bavitz, Chris; Gershman, Sam; O'Brien, David; Scott, Kate; Schieber, Stuart et al. (2019-12-20). Accountability of AI Under the Law: The Role of Explanation. arXiv:1711.01134.

[74] Meng, Kevin; Bau, David; Andonian, Alex; Belinkov, Yonatan (2022). “Locating and editing factual associations in GPT”. Advances in Neural Information Processing Systems 35. arXiv:2202.05262.

[75] Bau, David; Liu, Steven; Wang, Tongzhou; Zhu, Jun-Yan; Torralba, Antonio (2020-07-30). “Rewriting a Deep Generative Model”. ECCV. arXiv:2007.15646.

[76] Räuker, Tilman; Ho, Anson; Casper, Stephen; Hadfield-Menell, Dylan (2022-09-05). “Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks”. IEEE SaTML. arXiv:2207.13243.

[77] Bau, David; Zhou, Bolei; Khosla, Aditya; Oliva, Aude; Torralba, Antonio (2017-04-19). “Network Dissection: Quantifying Interpretability of Deep Visual Representations”. CVPR. arXiv:1704.05796.

[78] McGrath, Thomas; Kapishnikov, Andrei; Tomašev, Nenad; Pearce, Adam; Wattenberg, Martin; Hassabis, Demis; Kim, Been; Paquet, Ulrich et al. (2022-11-22). “Acquisition of chess knowledge in AlphaZero” (英語). Proceedings of the National Academy of Sciences 119 (47): e2206625119. arXiv:2111.09259. Bibcode: 2022PNAS..11906625M. doi:10.1073/pnas.2206625119. ISSN 0027-8424. PMC 9704706. PMID 36375061.

[79] Goh, Gabriel; Cammarata, Nick; Voss, Chelsea; Carter, Shan; Petrov, Michael; Schubert, Ludwig; Radford, Alec; Olah, Chris (2021). “Multimodal neurons in artificial neural networks”. Distill 6 (3). doi:10.23915/distill.00030.

[80] Olah, Chris; Cammarata, Nick; Schubert, Ludwig; Goh, Gabriel; Petrov, Michael; Carter, Shan (2020). “Zoom in: An introduction to circuits”. Distill 5 (3). doi:10.23915/distill.00024.001.

[81] Cammarata, Nick; Goh, Gabriel; Carter, Shan; Voss, Chelsea; Schubert, Ludwig; Olah, Chris (2021). “Curve circuits”. Distill 6 (1). doi:10.23915/distill.00024.006. オリジナルの5 December 2022時点におけるアーカイブ。 5 December 2022閲覧。.

[82] Olsson, Catherine; Elhage, Nelson; Nanda, Neel; Joseph, Nicholas; DasSarma, Nova; Henighan, Tom; Mann, Ben; Askell, Amanda et al. (2022). “In-context learning and induction heads”. Transformer Circuits Thread. arXiv:2209.11895.

[83] Olah, Christopher. “Interpretability vs Neuroscience [rough note]”. 2022年11月24日時点のオリジナルよりアーカイブ。2022年11月24日閲覧。

[84] Gu, Tianyu; Dolan-Gavitt, Brendan; Garg, Siddharth (2019-03-11). BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain. arXiv:1708.06733.

[85] Chen, Xinyun; Liu, Chang; Li, Bo; Lu, Kimberly; Song, Dawn (2017-12-14). Targeted Backdoor Attacks on Deep Learning Systems Using Data Poisoning. arXiv:1712.05526.

[86] Carlini, Nicholas; Terzis, Andreas (2022-03-28). “Poisoning and Backdooring Contrastive Learning”. ICLR. arXiv:2106.09667.

[87] Russell, Stuart J.; Norvig, Peter (2020). Artificial intelligence: A modern approach (4th ed.). Pearson. pp. 31–34. ISBN 978-1-292-40113-3. OCLC 1303900751. オリジナルのJuly 15, 2022時点におけるアーカイブ。 September 12, 2022閲覧。

[88] Pan, Alexander; Bhatia, Kush; Steinhardt, Jacob (14 February 2022). The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models. International Conference on Learning Representations. 2022年7月21日閲覧。

[89] Zhuang, Simon; Hadfield-Menell, Dylan (2020). "Consequences of Misaligned AI". Advances in Neural Information Processing Systems. Vol. 33. Curran Associates, Inc. pp. 15763–15773. 2023年3月11日閲覧。

[90] Carlsmith, Joseph (16 June 2022). "Is Power-Seeking AI an Existential Risk?". arXiv:2206.13353 [cs.CY]。

[91] Russell, Stuart J. (2020). Human compatible: Artificial intelligence and the problem of control. Penguin Random House. ISBN 9780525558637. OCLC 1113410915

[92] Christian, Brian (2020). The alignment problem: Machine learning and human values. W. W. Norton & Company. ISBN 978-0-393-86833-3. OCLC 1233266753. オリジナルのFebruary 10, 2023時点におけるアーカイブ。 September 12, 2022閲覧。

[93] Langosco, Lauro Langosco Di; Koch, Jack; Sharkey, Lee D.; Pfau, Jacob; Krueger, David (28 June 2022). "Goal Misgeneralization in Deep Reinforcement Learning". Proceedings of the 39th International Conference on Machine Learning. International Conference on Machine Learning. PMLR. pp. 12004–12019. 2023年3月11日閲覧。

[94] Bommasani, Rishi; Hudson, Drew A.; Adeli, Ehsan; Altman, Russ; Arora, Simran; von Arx, Sydney; Bernstein, Michael S.; Bohg, Jeannette et al. (2022-07-12). “On the Opportunities and Risks of Foundation Models”. Stanford CRFM. arXiv:2108.07258.

[95] Ouyang, Long; Wu, Jeff; Jiang, Xu; Almeida, Diogo; Wainwright, Carroll L.; Mishkin, Pamela; Zhang, Chong; Agarwal, Sandhini; Slama, Katarina; Ray, Alex; Schulman, J.; Hilton, Jacob; Kelton, Fraser; Miller, Luke E.; Simens, Maddie; Askell, Amanda; Welinder, P.; Christiano, P.; Leike, J.; Lowe, Ryan J. (2022). "Training language models to follow instructions with human feedback". arXiv:2203.02155 [cs.CL]。

[96] “OpenAI Codex”. OpenAI (2021年8月10日). February 3, 2023時点のオリジナルよりアーカイブ。2022年7月23日閲覧。

[97] Kober, Jens; Bagnell, J. Andrew; Peters, Jan (2013-09-01). “Reinforcement learning in robotics: A survey” (英語). The International Journal of Robotics Research 32 (11): 1238–1274. doi:10.1177/0278364913495721. ISSN 0278-3649. オリジナルのOctober 15, 2022時点におけるアーカイブ。 September 12, 2022閲覧。.

[98] Knox, W. Bradley; Allievi, Alessandro; Banzhaf, Holger; Schmitt, Felix; Stone, Peter (2023-03-01). “Reward (Mis)design for autonomous driving” (英語). Artificial Intelligence 316: 103829. doi:10.1016/j.artint.2022.103829. ISSN 0004-3702.

[99] Stray, Jonathan (2020). “Aligning AI Optimization to Community Well-Being” (英語). International Journal of Community Well-Being 3 (4): 443–463. doi:10.1007/s42413-020-00086-3. ISSN 2524-5295. PMC 7610010. PMID 34723107.

[100] Russell, Stuart; Norvig, Peter (2009). Artificial Intelligence: A Modern Approach. Prentice Hall. p. 1010. ISBN 978-0-13-604259-4. https://aima.cs.berkeley.edu/

[101] Ngo, Richard; Chan, Lawrence; Mindermann, Sören (22 February 2023). "The alignment problem from a deep learning perspective". arXiv:2209.00626 [cs.AI]。

[102] Smith, Craig S.. “Geoff Hinton, AI's Most Famous Researcher, Warns Of 'Existential Threat'” (英語). Forbes. 2023年5月4日閲覧。

[103] Future of Life Institute (2017年8月11日). “Asilomar AI Principles”. Future of Life Institute. October 10, 2022時点のオリジナルよりアーカイブ。2022年7月18日閲覧。 The AI principles created at the Asilomar Conference on Beneficial AI were signed by 1797 AI/robotics researchers.
United Nations (2021). Our Common Agenda: Report of the Secretary-General (PDF) (Report). New York: United Nations. 2022年5月22日時点のオリジナルよりアーカイブ (PDF)。2022年9月12日閲覧。[T]he [UN] could also promote regulation of artificial intelligence to ensure that this is aligned with shared global values.

[104] Amodei, Dario; Olah, Chris; Steinhardt, Jacob; Christiano, Paul; Schulman, John; Mané, Dan (21 June 2016). "Concrete Problems in AI Safety" (英語). arXiv:1606.06565 [cs.AI]。

[105] “Building safe artificial intelligence: specification, robustness, and assurance”. DeepMind Safety Research – Medium (2018年9月27日). February 10, 2023時点のオリジナルよりアーカイブ。2022年7月18日閲覧。

[106] Rorvig, Mordechai (2022年4月14日). “Researchers Gain New Understanding From Simple AI”. Quanta Magazine. February 10, 2023時点のオリジナルよりアーカイブ。2022年7月18日閲覧。

[107] Doshi-Velez, Finale; Kim, Been (2 March 2017). "Towards A Rigorous Science of Interpretable Machine Learning". arXiv:1702.08608 [stat.ML]。
Wiblin, Robert (4 August 2021). "Chris Olah on what the hell is going on inside neural networks" (Podcast). 80,000 hours. No. 107. 2022年7月23日閲覧。

[108] Russell, Stuart; Dewey, Daniel; Tegmark, Max (2015-12-31). “Research Priorities for Robust and Beneficial Artificial Intelligence”. AI Magazine 36 (4): 105–114. doi:10.1609/aimag.v36i4.2577. hdl:1721.1/108478. ISSN 2371-9621. オリジナルのFebruary 2, 2023時点におけるアーカイブ。 September 12, 2022閲覧。.

[109] Wirth, Christian; Akrour, Riad; Neumann, Gerhard; Fürnkranz, Johannes (2017). “A survey of preference-based reinforcement learning methods”. Journal of Machine Learning Research 18 (136): 1–46.

[110] Christiano, Paul F.; Leike, Jan; Brown, Tom B.; Martic, Miljan; Legg, Shane; Amodei, Dario (2017). "Deep reinforcement learning from human preferences". Proceedings of the 31st International Conference on Neural Information Processing Systems. NIPS'17. Red Hook, NY, USA: Curran Associates Inc. pp. 4302–4310. ISBN 978-1-5108-6096-4。

[111] Heaven, Will Douglas (2022年1月27日). “The new version of GPT-3 is much better behaved (and should be less toxic)”. MIT Technology Review. February 10, 2023時点のオリジナルよりアーカイブ。2022年7月18日閲覧。

[112] Mohseni, Sina; Wang, Haotao; Yu, Zhiding; Xiao, Chaowei; Wang, Zhangyang; Yadawa, Jay (7 March 2022). "Taxonomy of Machine Learning Safety: A Survey and Primer". arXiv:2106.04823 [cs.LG]。

[113] Clifton, Jesse (2020年). “Cooperation, Conflict, and Transformative Artificial Intelligence: A Research Agenda”. Center on Long-Term Risk. January 1, 2023時点のオリジナルよりアーカイブ。2022年7月18日閲覧。
Dafoe, Allan; Bachrach, Yoram; Hadfield, Gillian; Horvitz, Eric; Larson, Kate; Graepel, Thore (2021-05-06). “Cooperative AI: machines must learn to find common ground” (英語). Nature 593 (7857): 33–36. Bibcode: 2021Natur.593...33D. doi:10.1038/d41586-021-01170-0. ISSN 0028-0836. PMID 33947992. オリジナルのDecember 18, 2022時点におけるアーカイブ。 September 12, 2022閲覧。.

[114] Prunkl, Carina; Whittlestone, Jess (2020-02-07). “Beyond Near- and Long-Term: Towards a Clearer Account of Research Priorities in AI Ethics and Society” (英語). Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society (New York NY USA: ACM): 138–143. doi:10.1145/3375627.3375803. ISBN 978-1-4503-7110-0. オリジナルのOctober 16, 2022時点におけるアーカイブ。 September 12, 2022閲覧。.

[115] Irving, Geoffrey; Askell, Amanda (2019-02-19). “AI Safety Needs Social Scientists”. Distill 4 (2). doi:10.23915/distill.00014. ISSN 2476-0757. オリジナルのFebruary 10, 2023時点におけるアーカイブ。 September 12, 2022閲覧。.

[116] “Thinking About Risks From AI: Accidents, Misuse and Structure”. Lawfare (2019年2月11日). 2023年8月19日時点のオリジナルよりアーカイブ。2022年11月24日閲覧。

[117] Zhang, Yingyu; Dong, Chuntong; Guo, Weiqun; Dai, Jiabao; Zhao, Ziming (2022). “Systems theoretic accident model and process (STAMP): A literature review” (英語). Safety Science 152: 105596. doi:10.1016/j.ssci.2021.105596. オリジナルの2023-03-15時点におけるアーカイブ。 2022年11月28日閲覧。.

[118] Center for Security and Emerging Technology; Hoffman, Wyatt (2021). “AI and the Future of Cyber Competition”. CSET Issue Brief. doi:10.51593/2020ca007. オリジナルの2022-11-24時点におけるアーカイブ。 2022年11月28日閲覧。.

[119] Gafni, Ruti; Levy, Yair (2024-01-01). “The role of artificial intelligence (AI) in improving technical and managerial cybersecurity tasks’ efficiency”. Information & Computer Security ahead-of-print (ahead-of-print). doi:10.1108/ICS-04-2024-0102. ISSN 2056-4961.

[120] Center for Security and Emerging Technology; Imbrie, Andrew; Kania, Elsa (2019). AI Safety, Security, and Stability Among Great Powers: Options, Challenges, and Lessons Learned for Pragmatic Engagement. doi:10.51593/20190051. オリジナルの2022-11-24時点におけるアーカイブ。 2022年11月28日閲覧。.

[121] Future of Life Institute (27 March 2019). AI Strategy, Policy, and Governance (Allan Dafoe). 該当時間: 22:05. 2022年11月23日時点のオリジナルよりアーカイブ。2022年11月23日閲覧。

[122] Zou, Andy; Xiao, Tristan; Jia, Ryan; Kwon, Joe; Mazeika, Mantas; Li, Richard; Song, Dawn; Steinhardt, Jacob et al. (2022-10-09). “Forecasting Future World Events with Neural Networks”. NeurIPS. arXiv:2206.15474.

[123] Gathani, Sneha; Hulsebos, Madelon; Gale, James; Haas, Peter J.; Demiralp, Çağatay (2022-02-08). “Augmenting Decision Making via Interactive What-If Analysis”. Conference on Innovative Data Systems Research. arXiv:2109.06160.

[124] Lindelauf, Roy (2021), Osinga, Frans; Sweijs, Tim, eds., “Nuclear Deterrence in the Algorithmic Age: Game Theory Revisited” (英語), NL ARMS Netherlands Annual Review of Military Studies 2020, Nl Arms (The Hague: T.M.C. Asser Press): pp. 421–436, doi:10.1007/978-94-6265-419-8_22, ISBN 978-94-6265-418-1

[125] Newkirk II, Vann R. (2016年4月21日). “Is Climate Change a Prisoner's Dilemma or a Stag Hunt?”. The Atlantic. 2022年11月24日時点のオリジナルよりアーカイブ。2022年11月24日閲覧。

[126] Newkirk II, Vann R. (21 April 2016). “Is Climate Change a Prisoner's Dilemma or a Stag Hunt?”. The Atlantic. Archived from the original on 2022-11-24. Retrieved 2022-11-24.

[127] Dafoe, Allan. AI Governance: A Research Agenda (Report). Centre for the Governance of AI, Future of Humanity Institute, University of Oxford.

[128] Dafoe, Allan; Hughes, Edward; Bachrach, Yoram; Collins, Tantum; McKee, Kevin R.; Leibo, Joel Z.; Larson, Kate; Graepel, Thore (2020-12-15). “Open Problems in Cooperative AI”. NeurIPS. arXiv:2012.08630.

[129] Dafoe, Allan; Bachrach, Yoram; Hadfield, Gillian; Horvitz, Eric; Larson, Kate; Graepel, Thore (2021). “Cooperative AI: machines must learn to find common ground”. Nature 593 (7857): 33–36. Bibcode: 2021Natur.593...33D. doi:10.1038/d41586-021-01170-0. PMID 33947992. Archived from the original on 2022-11-22. Retrieved 2022-11-24.

[130] Bender, E.M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021). On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜. FAccT '21: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 610-623. https://doi.org/10.1145/3442188.3445922.

[131] Strubell, E., Ganesh, A., & McCallum, A. (2019). Energy and Policy Considerations for Deep Learning in NLP. arXiv preprint arXiv:1906.02243.

[132] Schwartz, R., Dodge, J., Smith, N.A., & Etzioni, O. (2020). Green AI. Communications of the ACM, 63(12), 54-63. https://doi.org/10.1145/3381831.
