Difference between revisions of "Help:Pywikipediabot/wikipedia.py"

Revision as of 11:40, 6 July 2020 (Mon)

wikipedia.py [1] is part of the PywikipediaBot framework. It can be used to edit not only Wikipedia but also other MediaWiki projects. This module provides classes and functions that are meant to be called from other scripts.

Page class

Defines the operations for reading, inspecting, and writing articles.

Methods

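title	Return the page name, including the namespace and section name if any.
urlname	Return the page title in a form suitable for use in a URL.
namespace	Return the namespace the page belongs to.
titleWithoutNamespace	Return the title without the namespace prefix.
section	Return only the section name (the part of the title after '#', if any).
sectionFreeTitle	Return the page name without the section name.
aslink	Return the title as [[Title]] or [[lang:Title]], depending on the arguments.
site	Return the wiki the page belongs to.
encoding	Return the encoding of the page.
isAutoTitle	Return True if the title can be translated using the autoFormat method.
autoFormat	Auto-format certain dates and other standard-format page titles.
isCategory	Return True if the page is a category.
isDisambig (*)	Return True if the page is a disambiguation page.
isImage	Return True if the page is an image page.
isRedirectPage (*)	Return True if the page is a redirect, False otherwise.
getRedirectTarget (*)	Return the redirect target if the page is a redirect.
isTalkPage	Return True if the page is in any talk namespace.
toggleTalkPage	Return the content page if this is a talk page, or the talk page if this is a content page.
get (*)	Return the text of the page.
latestRevision (*)	Return the revision ID of the page's current revision.
userName	Return the account name of the user who made the latest edit.
isIpEdit	Return True if the latest edit was made by an IP (unregistered) user.
editTime	Return the timestamp of the latest revision.
previousRevision (*)	Return the revision ID of the previous revision.
permalink (*)	Return the permalink URL of the current revision.
getOldVersion(id) (*)	Return the text of the revision identified by id.
getRestrictions	Return a protection dictionary.
getVersionHistory	Return the edit history.
getVersionHistoryTable	Return the history data as a wiki table.
fullVersionHistory	Return all past revisions.
contributingUsers	Return the users who have edited the page.
exists (*)	Return True if the page exists, False otherwise.
isEmpty (*)	Return True if the page content is 4 characters or less, not counting interlanguage links and categories.
interwiki (*)	Return a list of the pages targeted by the page's interlanguage links.
categories (*)	Return a list of the category pages the page belongs to.
linkedPages (*)	Return a list of the pages the page links to.
imagelinks (*)	Return a list of the images included on the page.
templates (*)	Return a list of the templates the page uses.
templatesWithParams(*)	Return the templates the page uses, together with their parameters.
getReferences	Return the pages that link to the page.
canBeEdited (*)	Return True if the page is unprotected or the user has edit rights.
botMayEdit (*)	Return True if the bot is allowed to edit the page.
put(newtext)	Save newtext as the content of the page.
put_async(newtext)	Queue the page to be saved asynchronously.
move	Move the page.
delete	Delete the page (requires being logged in).
protect	Protect the page (requires administrator rights).
removeImage	Remove all links to an image from the page.
replaceImage	Replace all instances of an image with another.
loadDeletedRevisions	Return the deleted revisions of the page.
getDeletedRevision	Return a particular deleted revision.
markDeletedRevision	Mark a revision to be undeleted, or not.
undelete	Undelete past revision(s) of the page.

(*) These methods load the page from the wiki if it has not been loaded yet.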
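
The relationship between title, section, and sectionFreeTitle can be illustrated without the framework. The following is only a sketch, not the Page implementation: split_title is a hypothetical helper that splits on the first '#', and it ignores the namespace handling that the real class performs with site data.

```python
def split_title(full_title):
    """Split 'Page name#Section' into (section-free title, section or None).

    Hypothetical helper for illustration; not part of wikipedia.py.
    """
    base, sep, section = full_title.partition(u'#')
    return base, (section if sep else None)


base, section = split_title(u'Wikipedia:Sandbox#Tests')
print(base)     # corresponds to what sectionFreeTitle is described as returning
print(section)  # corresponds to what section is described as returning
```

A title without '#' gives None for the section part in this sketch, so callers can tell "no section" apart from an empty section name.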

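ImagePage class

Defines the operations for reading, inspecting, and writing image pages.

Methods

getImagePageHtml	Download the image page and return it as HTML text.
fileURL	Return the URL of the image described on the page.
fileIsOnCommons	Return True if the image is stored on Commons.
fileIsShared	Return True if the image is stored on the Wikitravel shared repository.
getFileMd5Sum	Return the MD5 checksum of the image file.
getFileVersionHistory	Return the version history of the image file.
getFileVersionHistoryTable	Return the version history of the image file as a wiki table.
usingPages	Return the pages that use the image.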

Throttle class

A class that deliberately delays processing in order to reduce the load on the server.
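
The idea of throttling can be sketched in a few lines. The class below is a hypothetical stand-in (SimpleThrottle and its parameters are not part of wikipedia.py, and the framework's own Throttle is not reproduced here); it only shows the basic technique of enforcing a minimum delay between successive requests.

```python
import time


class SimpleThrottle(object):
    """Enforce a minimum delay between successive calls to wait().

    Illustrative stand-in only; not the framework's Throttle class.
    """

    def __init__(self, min_delay, clock=time.time, sleep=time.sleep):
        self.min_delay = min_delay
        self._clock = clock   # injectable so the class can be tested
        self._sleep = sleep
        self._last = None     # time of the previous call, if any

    def wait(self):
        """Sleep just long enough to keep calls min_delay seconds apart.

        Returns the number of seconds slept.
        """
        now = self._clock()
        waited = 0.0
        if self._last is not None:
            remaining = self.min_delay - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                waited = remaining
        self._last = self._clock()
        return waited


throttle = SimpleThrottle(min_delay=0.01)
for n in range(3):
    throttle.wait()            # call before each request to the server
    print("request %d" % n)
```

The clock and sleep functions are passed in as parameters so that the delay logic can be checked without actually waiting.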

Site class

Represents a MediaWiki site. Do not instantiate it directly; always use the getSite() function.

Terms

code	Language code of the site
fam	Wiki family
user	User account of the bot

Methods

language	Return the language code of this site.
family	Return the Family object of this site.
sitename	Return the site name as a string.
languages	Return a list of all languages in this site's family.
validLanguageLinks	Return a list of language codes that can be used in interwiki links.
loggedInAs	Return the current username, or None if not logged in.
forceLogin	Require the user to log in to the site.
messages	Return True if there are new messages on the site.
cookies	Return the user's cookies as a string.
getUrl	Retrieve a URL from the site.
urlEncode	Encode a query to be sent using an http POST request.
postForm	Post form data to an address at this site.
postData	Post encoded form data to an http address at this site.
namespace(num)	Return local name of namespace 'num'.
normalizeNamespace(value)	Return preferred name for namespace 'value' in this Site's language.
namespaces	Return list of canonical namespace names for this Site.
getNamespaceIndex(name)	Return the int index of namespace 'name', or None if invalid.
redirect	Return the localized redirect tag for the site.
redirectRegex	Return compiled regular expression matching on redirect pages.
mediawiki_message	Retrieve the text of a specified MediaWiki message.
has_mediawiki_message	Return True if the site has the specified MediaWiki message.
shared_image_repository	Return tuple of image repositories used by this site.
category_on_one_line	Return True if this site wants all category links on one line.
interwiki_putfirst	Return list of language codes for ordering of interwiki links.
linkto(title)	Return string in the form of a wikilink to 'title'.
isInterwikiLink(s)	Return True if 's' is in the form of an interwiki link.
getSite(lang)	Return the Site object for language 'lang' in the wiki family currently in use.
version	Return the MediaWiki version of the wiki family currently in use.
versionnumber	Return int identifying the MediaWiki version.
live_version	Return version number read from Special:Version.
checkCharset(charset)	Warn if charset doesn't match family file.
server_time	Return the server time (currently depends on the user's clock).
linktrail	Return regex for trailing chars displayed as part of a link.
disambcategory	Category in which disambiguation pages are listed.

Methods for loading pages

These methods share a similar form. For example, allpages() is defined as follows.

def allpages(self, start = '!', namespace = 0, includeredirects = True, throttle = True)

start is the title at which the listing begins, namespace is the namespace number (see the table of namespace numbers), and includeredirects controls whether redirects are included. throttle deliberately delays processing to reduce the load on the server; always set it to True.

search(query)	Pages returned by Special:Search for the given query.
allpages()	Special:Allpages
prefixindex()	Special:Prefixindex
protectedpages()	Special:ProtectedPages
newpages()	Special:Newpages
newimages()	Special:Log&type=upload
longpages()	Special:Longpages
shortpages()	Special:Shortpages
categories()	Special:Categories (returns Category objects)
deadendpages()	Special:Deadendpages
ancientpages()	Special:Ancientpages
lonelypages()	Special:Lonelypages
unwatchedpages()	Special:Unwatchedpages (requires administrator rights)
uncategorizedcategories()	Special:Uncategorizedcategories (returns Category objects)
uncategorizedpages()	Special:Uncategorizedpages
uncategorizedimages()	Special:Uncategorizedimages (returns ImagePage objects)
unusedcategories()	Special:Unusedcategories (returns Category objects)
unusedfiles()	Special:Unusedimages (returns ImagePage objects)
randompages	Special:Random
randomredirectpages	Special:Randomredirect
withoutinterwiki	Special:Withoutinterwiki
linksearch	Special:Linksearch

Methods for working with the wiki family

encoding	The current encoding for this site.
encodings	List of all historical encodings for this site.
category_namespace	Canonical name of the Category namespace on this site.
category_namespaces	List of all valid names for the Category namespace.
image_namespace	Canonical name of the Image namespace on this site.
template_namespace	Canonical name of the Template namespace on this site.
protocol	Protocol ('http' or 'https') for access to this site.
hostname	Host portion of site URL.
path	URL path for index.php on this Site.
dbName	MySQL database name.

Methods that return addresses

Each entry gives a short description and, where available, an example from wikipedia:ja.

export_address	Special:Export. /w/index.php?useskin=monobook&title=Special:Export
query_address	URL path + '?' for query.php /w/query.php?
api_address	URL path + '?' for api.php /w/api.php?
apipath	URL path for api.php /w/api.php
move_address	Special:Movepage. /w/index.php?useskin=monobook&title=%E7%89%B9%E5%88%A5:Movepage&action=submit
delete_address(s)	Delete title 's'.
undelete_view_address(s)	Special:Undelete for title 's'
undelete_address	Special:Undelete. /w/index.php?useskin=monobook&title=%E7%89%B9%E5%88%A5:Undelete&action=submit
protect_address(s)	Protect title 's'.
unprotect_address(s)	Unprotect title 's'.
put_address(s)	Submit revision to page titled 's'.
get_address(s)	Retrieve page titled 's'. /w/index.php?useskin=monobook&title=s&redirect=no
nice_get_address(s)	Short URL path to retrieve page titled 's'. /wiki/s
edit_address(s)	Edit form for page titled 's'.
purge_address(s)	Purge cache and retrieve page 's'.
block_address	Block an IP address.
unblock_address	Unblock an IP address.
blocksearch_address(s)	Search for blocks on IP address 's'.
linksearch_address(s)	Special:Linksearch for target 's'.
search_address(q)	Special:Search for query 'q'.
allpages_address(s)	Special:Allpages.
newpages_address	Special:Newpages.
longpages_address	Special:Longpages.
shortpages_address	Special:Shortpages.
unusedfiles_address	Special:Unusedimages.
categories_address	Special:Categories.
deadendpages_address	Special:Deadendpages.
ancientpages_address	Special:Ancientpages.
lonelypages_address	Special:Lonelypages.
protectedpages_address	Special:ProtectedPages
unwatchedpages_address	Special:Unwatchedpages.
uncategorizedcategories_address	Special:Uncategorizedcategories.
uncategorizedimages_address	Special:Uncategorizedimages.
uncategorizedpages_address	Special:Uncategorizedpages.
unusedcategories_address	Special:Unusedcategories.
withoutinterwiki_address	Special:Withoutinterwiki.
references_address(s)	Special:Whatlinkshere for page 's'.
allmessages_address	Special:Allmessages.
upload_address	Special:Upload.
double_redirects_address	Special:Doubleredirects.
broken_redirects_address	Special:Brokenredirects.
random_address	Special:Random.
randomredirect_address	Special:Randomredirect.
login_address	Special:Userlogin.
captcha_image_address(id)	Special:Captcha for image 'id'.
watchlist_address	Special:Watchlist editor.
contribs_address(target)	Special:Contributions for user 'target'.

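General-purpose functions

replaceExcept	Replace text using a regular expression, skipping matches inside comments and other special text blocks.
removeDisabledParts	Remove the parts of the text that are exempt from wiki markup, such as comments and certain tags.
isDisabled	Return whether a given position in the text is inside a disabled part such as a comment or tag.
findmarker	Return a string that does not occur in the text, for use as a marker.
expandmarker	Return the marker string expanded backwards to include preceding separators and whitespace.
getLanguageLinks	Extract the interlanguage links.
removeLanguageLinks	Remove the interlanguage links.
removeLanguageLinksAndSeparator	Remove the interlanguage links together with whitespace and separators.
replaceLanguageLinks	Replace the interlanguage links with a different set.
interwikiFormat	Convert a set of interlanguage links to wikitext.
interwikiSort	Sort the interlanguage links.
getCategoryLinks	Return the category links.
removeCategoryLinks	Remove the category links.
removeCategoryLinksAndSeparator	Remove the category links together with whitespace.
replaceCategoryInPlace	Replace a single category link with a link to another category.
replaceCategoryLinks	Replace the category links in the text with those from a list. This is not the same operation as replaceCategoryInPlace.
categoryFormat	Return a string containing links to all categories in a list.
url2link	Convert the URL form of a page name into the form used in an interlanguage link.
decodeEsperantoX	Decode Esperanto text that uses the x-convention.
encodeEsperantoX	Encode wikitext into the Esperanto x-convention.
sectionencode	Encode text so that it can be used as the section part of a wikilink.
UnicodeToAsciiHtml	Convert Unicode to a byte string using HTML entities.
url2unicode(title, site)	Convert URL-encoded text to Unicode.
unicode2html	Make sure a Unicode string is encodable; if it is not, convert it to ASCII.
html2unicode	Convert HTML entities to Unicode characters.
Family(name)	Import the family called name (for example 'wikipedia').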
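
To make the interlanguage-link functions more concrete, here is a minimal sketch of the underlying technique using a regular expression. It is not the framework's code: extract_language_links, strip_language_links, and the LANGUAGE_CODES tuple are hypothetical names, the real functions take their valid language codes from the site's family, and their return format is not reproduced here.

```python
import re

# Hypothetical, tiny set of language codes for illustration; the framework
# derives the valid codes from the site's family instead.
LANGUAGE_CODES = ('en', 'ja', 'de', 'fr')

# Matches [[code:Title]] where code is one of LANGUAGE_CODES.
_LANGLINK = re.compile(
    r'\[\[\s*(%s)\s*:\s*([^\]|]+?)\s*\]\]' % '|'.join(LANGUAGE_CODES))


def extract_language_links(text):
    """Return {language code: title} for each interlanguage link in text."""
    return dict((m.group(1), m.group(2)) for m in _LANGLINK.finditer(text))


def strip_language_links(text):
    """Return text with all interlanguage links removed."""
    return _LANGLINK.sub('', text).strip()


sample = u"Some article text.\n\n[[en:Dandelion]]\n[[fr:Pissenlit]]"
print(extract_language_links(sample))
print(strip_language_links(sample))
```

Links such as [[Category:Plants]] are left alone in this sketch because 'Category' is not in the list of language codes.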

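Other functions

output(text)	Print the string text to the screen. Unlike print, it can output Unicode as-is.
stopme()	Call this once a run of processing has finished to remove the bot from the list of running processes.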

Error handling

Check for errors in an except clause.

*wikipedia.Error                A general error occurred.
*wikipedia.NoUsername           No account name is registered in user-config.py.
*wikipedia.NoPage               The page does not exist.
*wikipedia.IsRedirectPage       The page is a redirect.
*wikipedia.IsNotRedirectPage    The page is not a redirect.
*wikipedia.LockedPage           The page is protected.
*wikipedia.LockedNoPage         The page has been deleted and an administrator has protected it against creation.
*wikipedia.NoSuchEntity         No entity exists for this character.
*wikipedia.SectionError         The section specified with # does not exist in the page.
*wikipedia.PageNotSaved         Saving the page failed (covers write errors in general).
*wikipedia.EditConflict         An edit conflict occurred.
*wikipedia.SpamfilterError      The text to be saved contained a URL on MediaWiki's blacklist.
*wikipedia.ServerError          The server is misbehaving.
*wikipedia.UserBlocked          The bot's account or IP address is blocked.
*wikipedia.PageNotFound         The page was not found in the list.
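
When several of these exceptions are handled together, the order of the except clauses matters. The sketch below uses stand-in classes so that it runs without the framework; it assumes that NoPage and IsRedirectPage derive from Error, which is what the catch-all "except wikipedia.Error" in this page's examples suggests. The names describe and raiser are hypothetical helpers.

```python
# Stand-in exception classes so the sketch runs without the framework.
# Assumption: the specific exceptions derive from the general Error class.
class Error(Exception):
    pass


class NoPage(Error):
    pass


class IsRedirectPage(Error):
    pass


def describe(load):
    """Call load() and report which handler dealt with the outcome."""
    try:
        load()
    except NoPage:
        return 'missing'
    except IsRedirectPage:
        return 'redirect'
    except Error:
        # Must come last: placed first, it would also catch the two
        # more specific exceptions above.
        return 'other error'
    return 'ok'


def raiser(exc):
    """Build a fake loader that raises exc (hypothetical test helper)."""
    def load():
        raise exc
    return load


print(describe(raiser(NoPage())))
print(describe(raiser(IsRedirectPage())))
print(describe(raiser(Error())))
print(describe(lambda: None))
```

Listing the specific exceptions first and the general one last lets each case get its own treatment while still catching anything unexpected.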

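Usage examples

A test that reads a page

The Page class is the most frequently used class. It is used to perform various operations on wiki pages and to extract information from them.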

# -*- coding: utf-8 -*-
import wikipedia # Import the wikipedia module
site = wikipedia.getSite() # Taking the default site
page = wikipedia.Page(site, u"タンポポ") # Calling the constructor
text = page.get() # Taking the text of the page
wikipedia.output(text)
wikipedia.stopme()

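When the example above is run with Python, the wikipedia.py module is imported first. Next, the site defined in user-config.py (probably wikipedia:ja) is assigned to the variable site, and a Page object for the given title is assigned to the variable page. page.get() copies the content of the page into text, and wikipedia.output() prints that text to the screen. Calling wikipedia.stopme() at the end lets any other bots that are running at the same time run faster.

Copy the code above into a text file created with an editor such as Notepad, save it with UTF-8 encoding, and run it. The content of the タンポポ (dandelion) article should be printed to the screen. (This program does not write anything to Wikipedia.)

To read from a project other than the one defined in user-config.py, for example English Wikisource, change the "site =" line to: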

site = wikipedia.getSite('en', 'wikisource') # loading a defined project's page

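That is all that is needed. Note that Page methods that load the page, such as get(), raise an error when the page title does not exist. wikipedia.py does not handle this error for you, so your script needs to deal with it itself.

A better way to write it

The previous section used a script that simply runs from top to bottom in order to keep the explanation simple, but this style is not recommended, because a program written this way cannot be imported and reused from another program.

To make it usable from another program, define a main() function as follows.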

import wikipedia
# Define the main function
def main():
    titleOfPageToLoad = u'Main Page' # The "u" before the title means Unicode, important for special characters
    wikipedia.output(u'Loading %s...' % titleOfPageToLoad)
    enwssite = wikipedia.getSite('en', 'wikisource') # Load the site object for a specific project (English Wikisource)
    page = wikipedia.Page(enwssite, titleOfPageToLoad)
    text = page.get() # Taking the text of the page
    wikipedia.output(text) # Print the text, encoding it with wikipedia's method

if __name__ == '__main__':
    try:
        main()
    finally:
        wikipedia.stopme()

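With this structure, wikipedia.stopme() is always executed even if something unexpected goes wrong, which prevents other processes from being slowed down because the bot was never stopped.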
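
This guarantee comes from Python's try/finally semantics rather than from anything specific to the framework. The sketch below demonstrates it with stand-ins: fake_stopme, failing_main, and run are hypothetical names, and fake_stopme only records that it was called.

```python
calls = []


def fake_stopme():
    """Stand-in for wikipedia.stopme(); just records that it ran."""
    calls.append('stopme')


def failing_main():
    """Stand-in for a main() that hits an unexpected problem."""
    raise RuntimeError('unexpected trouble')


def run(main):
    """Run main() and always call the cleanup function afterwards."""
    try:
        main()
    finally:
        fake_stopme()


try:
    run(failing_main)
except RuntimeError as err:
    print('main() failed: %s' % err)

print(calls)  # the cleanup ran even though main() raised
```

The exception still propagates to the caller after the finally block has run, so the failure is not hidden.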

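The site.allpages() function and error handling

The following program prints every article on Wikipedia. (It does not stop until every page has been printed, so before running it make sure you know how to interrupt execution in your Python environment. Ctrl+C is the usual way.)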

import wikipedia
# Define the main function
def main():
    site = wikipedia.getSite()
    startpage = '!'
    for page in site.allpages(startpage): # allpages() is a generator; it yields the pages one by one
        pagename = page.title() # Take the title of the page (not "[[page]]" but "page")
        wikipedia.output(u"Loading %s..." % pagename) # Note the "u" prefix: the string is Unicode
        try:
            text = page.get() # Take the text of the page
        except wikipedia.NoPage: # First except: the page does not exist, so treat it as empty
            text = ''
        except wikipedia.IsRedirectPage: # Second except: the page is a redirect, so skip it
            wikipedia.output(u'%s is a redirect!' % pagename)
            continue
        except wikipedia.Error: # Third except: any other framework error; report it and skip the page
            wikipedia.output(u"Some error, skipping..")
            continue
        wikipedia.output(text) # Print the text using wikipedia's own output method

if __name__ == '__main__':
    try:
        main()
    finally:
        wikipedia.stopme()

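The key line is "for page in site.allpages(startpage):". allpages(startpage) yields, in order, every page whose title comes at or after startpage (in the example above startpage is '!', so that means every registered article).

The program above anticipates the errors that can occur and handles them with except clauses. Without this handling, a failed page.get() would stop the program with a run-time error before wikipedia.output(text) is reached, so always include handling like this.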
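
While experimenting, it can be convenient to look at only the first few pages instead of interrupting the program by hand. Assuming allpages() is a generator, as the comment in the example says, the standard library's itertools.islice can cut it short. The sketch below uses a stand-in generator so that it runs without the framework; fake_allpages and first_pages are hypothetical names.

```python
import itertools


def fake_allpages(start=u'!'):
    """Stand-in for site.allpages(): yields titles endlessly.

    Hypothetical; the real method yields Page objects from the wiki.
    """
    n = 0
    while True:
        yield u'Page %d' % n
        n += 1


def first_pages(page_iter, limit):
    """Return at most `limit` items from a (possibly huge) iterator."""
    return list(itertools.islice(page_iter, limit))


for title in first_pages(fake_allpages(), 5):
    print(title)
```

Because islice stops pulling items once the limit is reached, the underlying generator is never asked for more pages than needed.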

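Writing a page with the put() function

The following program writes text to Wikipedia:サンドボックス (the sandbox). Specifically, it appends the line "この最後の行は、ボットの動作テストによる書き込みです。" ("This last line was written by a bot as an operation test.") to the end of the sandbox text. To keep the program short it prints very little progress information, but in real use always print progress to the screen so that you can check what the bot is doing.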

import wikipedia
# Define the main function
def main():
    site = wikipedia.getSite()
    pagename = u'Wikipedia:サンドボックス'
    page = wikipedia.Page(site, pagename)
    wikipedia.output(u"Loading %s..." % pagename) # Note the "u" prefix: the string is Unicode
    try:
        text = page.get(force = False, get_redirect=False, throttle = True, sysop = False,
                        change_edit_time = True) # text = page.get() <-- is the same
    except wikipedia.NoPage: # First except: the page does not exist, so start from empty text
        text = ''
    except wikipedia.IsRedirectPage: # Second except: the page is a redirect, so give up
        wikipedia.output(u'%s is a redirect!' % pagename)
        return # wikipedia.stopme() runs in the finally block below, so it is not needed here
    except wikipedia.Error: # Third except: any other framework error; report it and give up
        wikipedia.output(u"Some Error, skipping..")
        return
    newtext = text + u'\n\nこの最後の行は、ボットの動作テストによる書き込みです。'
    page.put(newtext, comment='Bot: Test', watchArticle = None, minorEdit = True)  # page.put(newtext, 'Bot: Test') <-- is the same

if __name__ == '__main__':
    try:
        main()
    finally:
        wikipedia.stopme()

About the get() function

force = False : If True, the bot forces the load without raising exceptions. Unless you are very confident in your code, do not set this to True.
get_redirect = False : If True, a redirect page is loaded as-is without raising an error.
throttle = True : If True, the bot automatically leaves a fixed interval between operations. Setting it to False puts a very heavy load on the server, so always leave it as True.
sysop = False : If True, the page is loaded with administrator rights.
nofollow_redirects = False : Like force, setting this to True suppresses the exception when a redirect is loaded, but exceptions are still raised in all other cases.
change_edit_time = True : If False, the bot does not check whether someone else has edited the page before it writes.

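About the put() function

.put(newtext, comment='Bot: Test', watchArticle = None, minorEdit = True)

newtext is the text that is about to be written to the page.
comment is the string written to the edit summary. The default is "Wikipedia python library". The default can be changed with wikipedia.setAction(text).
watchArticle is True to add the edited page to the watchlist, or None not to.
minorEdit is True for a minor edit. Bot edits should normally use True. Use False, however, when the bot makes large-scale changes. It is also better to use False in cases such as placing a warning template on a user talk page.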

.getReferences(), .move(), .protect(), .delete() and .undelete()

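The Page class has many methods and can do almost everything that a human editor can do.

.getReferences()

Look at the following program. It retrieves the pages that link to Wikipedia:サンドボックス. Because .getReferences() returns a sequence of pages, looping over it with for as shown below prints the whole list.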

import wikipedia
def main():
    page = wikipedia.Page(wikipedia.getSite(), u'Wikipedia:サンドボックス')
    for pagetoparse in page.getReferences(follow_redirects=True, withTemplateInclusion=True, onlyTemplateInclusion=False, redirectsOnly=False):
        wikipedia.output(pagetoparse.title())
if __name__ == '__main__':
    try:
        main()
    finally:
        wikipedia.stopme()

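To keep the result in a variable as a list, do the following.

pages = [p for p in page.getReferences()]

The arguments of .getReferences() are:

follow_redirects=True : If True, pages that link here through a redirect are also returned. Otherwise they are not.
The following three may not work correctly if two or more of them are set to True.
- withTemplateInclusion=True : If True, pages that transclude this page are returned as well.
- onlyTemplateInclusion=False : If True, only transclusions are returned. Ordinary links and redirects are excluded.
- redirectsOnly=False : If True, only redirects are returned. Ordinary links and transclusions are excluded.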

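Miscellaneous

Running wikipedia.py directly prints the version information for PywikipediaBot and Python, just as version.py does.

Sources and notes

The first version of this page was based on the comments in "wikipedia.py" included in the pywikipedia release "pywikipedia (r6353, 2 14 2009, 14:36;27)" registered at http://svn.wikimedia.org/svnroot/pywikipedia/trunk/pywikipedia/.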