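Core API

This section documents the Scrapy core API. It is intended for developers of extensions and middlewares.

Crawler API

The main entry point to the Scrapy API is the Crawler object, which is passed to extensions through the from_crawler class method. This object provides access to all Scrapy core components, and it is the only way for extensions to access them and hook their functionality into Scrapy.

The extension manager is responsible for loading and keeping track of installed extensions. It is configured through the EXTENSIONS setting, which contains a dictionary of all available extensions and their order, similar to how downloader middlewares are configured.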

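class scrapy.crawler.Crawler(spidercls, settings)

The Crawler object must be instantiated with a scrapy.spiders.Spider subclass and a scrapy.settings.Settings object.

settings

The settings manager of this crawler.

This is used by extensions and middlewares to access the Scrapy settings of this crawler.

For an introduction to Scrapy settings, see Settings.

For the API, see the Settings class.

signals

The signals manager of this crawler.

This is used by extensions and middlewares to hook themselves into Scrapy functionality.

For an introduction to signals, see Signals.

For the API, see the SignalManager class.

stats

The stats collector of this crawler.

This is used by extensions and middlewares to record stats of their behaviour, or to access stats collected by other extensions.

For an introduction to stats collection, see Stats Collection.

For the API, see the StatsCollector class.

extensions

The extension manager that keeps track of enabled extensions.

Most extensions won't need to access this attribute.

For an introduction to extensions and a list of the extensions available in Scrapy, see Extensions.

engine

The execution engine, which coordinates the core crawling logic between the scheduler, the downloader and the spiders.

Some extensions may want to access the Scrapy engine to inspect or modify the downloader and scheduler behaviour, although this is an advanced use and this API is not yet stable.

spider: The spider currently being crawled. This is an instance of the spider class provided while constructing the crawler, and it is created with the arguments given to the crawl() method.

crawl(*args, **kwargs)

Starts the crawler by instantiating its spider class with the given args and kwargs arguments, and sets the execution engine in motion.

Returns a deferred that is fired when the crawl is finished.

stop(): Starts a graceful stop of the crawler and returns a deferred that is fired when the crawler is stopped.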

class scrapy.crawler.CrawlerRunner(settings=None)

This is a convenient helper class that keeps track of, manages and runs crawlers inside an already set-up reactor.

The CrawlerRunner object must be instantiated with a Settings object.

This class shouldn't be needed (since Scrapy is responsible for using it accordingly) unless you are writing scripts that manually handle the crawling process. See Run Scrapy from a script for an example.

crawl(crawler_or_spidercls, *args, **kwargs)

Run a crawler with the provided arguments.

It will call the given Crawler's crawl() method, while keeping track of it so it can be stopped later.

If crawler_or_spidercls isn't a Crawler instance, this method will try to create one using this parameter as the spider class given to it.

Returns a deferred that is fired when the crawling is finished.

Parameters

crawler_or_spidercls (Crawler instance, Spider subclass or string) -- an already created crawler, or a spider class or a spider's name inside the project to create it
args -- arguments to initialize the spider
kwargs -- keyword arguments to initialize the spider

property crawlers: Set of crawlers started by crawl() and managed by this class.

create_crawler(crawler_or_spidercls)

Return a Crawler object.

If crawler_or_spidercls is a Crawler, it is returned as-is.
If crawler_or_spidercls is a Spider subclass, a new Crawler is constructed for it.
If crawler_or_spidercls is a string, this function finds a spider with this name in a Scrapy project (using the spider loader), then creates a Crawler instance for it.

join(): Returns a deferred that is fired when all managed crawlers have completed their executions.

stop()

Simultaneously stops all the crawling jobs taking place.

Returns a deferred that is fired when they have all ended.

class scrapy.crawler.CrawlerProcess(settings=None, install_root_handler=True)

Bases: scrapy.crawler.CrawlerRunner

A class to run multiple Scrapy crawlers in a process simultaneously.

This class extends CrawlerRunner by adding support for starting a reactor and handling shutdown signals, like the keyboard interrupt command Ctrl-C. It also configures top-level logging.

This utility should be a better fit than CrawlerRunner if you aren't running another reactor within your application.

The CrawlerProcess object must be instantiated with a Settings object.

Parameters: install_root_handler -- whether to install the root logging handler (default: True)

This class shouldn't be needed (since Scrapy is responsible for using it accordingly) unless you are writing scripts that manually handle the crawling process. See Run Scrapy from a script for an example.

crawl(crawler_or_spidercls, *args, **kwargs)

Run a crawler with the provided arguments.

It will call the given Crawler's crawl() method, while keeping track of it so it can be stopped later.

If crawler_or_spidercls isn't a Crawler instance, this method will try to create one using this parameter as the spider class given to it.

Returns a deferred that is fired when the crawling is finished.

Parameters

crawler_or_spidercls (Crawler instance, Spider subclass or string) -- an already created crawler, or a spider class or a spider's name inside the project to create it
args -- arguments to initialize the spider
kwargs -- keyword arguments to initialize the spider

property crawlers: Set of crawlers started by crawl() and managed by this class.

create_crawler(crawler_or_spidercls)

Return a Crawler object.

If crawler_or_spidercls is a Crawler, it is returned as-is.
If crawler_or_spidercls is a Spider subclass, a new Crawler is constructed for it.
If crawler_or_spidercls is a string, this function finds a spider with this name in a Scrapy project (using the spider loader), then creates a Crawler instance for it.

join(): Returns a deferred that is fired when all managed crawlers have completed their executions.

start(stop_after_crawl=True)

This method starts a reactor, adjusts its thread pool size to REACTOR_THREADPOOL_MAXSIZE, and installs a DNS cache based on DNSCACHE_ENABLED and DNSCACHE_SIZE.

If stop_after_crawl is True, the reactor will be stopped after all crawlers have finished, using join().

Parameters: stop_after_crawl (bool) -- whether to stop the reactor when all crawlers have finished

stop()

Simultaneously stops all the crawling jobs taking place.

Returns a deferred that is fired when they have all ended.

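Settings API

scrapy.settings.SETTINGS_PRIORITIES

Dictionary that sets the key name and priority level of the default settings priorities used in Scrapy.

Each item defines a settings entry point, giving it a code name for identification and an integer priority. Greater priorities take precedence over lesser ones when setting and retrieving values in the Settings class.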

SETTINGS_PRIORITIES = {
    'default': 0,
    'command': 10,
    'project': 20,
    'spider': 30,
    'cmdline': 40,
}

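For a detailed explanation of each settings source, see Settings.

scrapy.settings.get_settings_priority(priority): Small helper function that looks up a given string priority in the SETTINGS_PRIORITIES dictionary and returns its numerical value, or directly returns a given numerical priority.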

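class scrapy.settings.Settings(values=None, priority='project')

Bases: scrapy.settings.BaseSettings

This object stores Scrapy settings for the configuration of internal components, and can be used for any further customization.

It is a direct subclass of BaseSettings and supports all of its methods. Additionally, after instantiation of this class, the new object already has the global default settings described in the Built-in settings reference populated.

class scrapy.settings.BaseSettings(values=None, priority='project')

Instances of this class behave like dictionaries, but store priorities along with their (key, value) pairs, and can be frozen (i.e. marked immutable).

Key-value entries can be passed on initialization with the values argument, and they take the priority level (unless values is already an instance of BaseSettings, in which case the existing priority levels are kept). If the priority argument is a string, the priority name is looked up in SETTINGS_PRIORITIES. Otherwise, a specific integer should be provided.

Once the object is created, new settings can be loaded or updated with the set() method, and can be accessed with the square bracket notation of dictionaries, or with the get() method and its value conversion variants. When requesting a stored key, the value with the highest priority is retrieved.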

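copy()

Makes a deep copy of the current settings.

This method returns a new instance of the Settings class, populated with the same values and their priorities.

Modifications to the new object won't be reflected on the original settings.

copy_to_dict()

Makes a copy of the current settings and converts it to a dict.

This method returns a new dict populated with the same values and their priorities as the current settings.

Modifications to the returned dict won't be reflected on the original settings.

This method can be useful, for example, for printing settings in the Scrapy shell.

freeze()

Disables further changes to the current settings.

After calling this method, the present state of the settings becomes immutable. Trying to change values through the set() method and its variants is not possible and raises an error.

frozencopy()

Returns an immutable copy of the current settings.

Alias for a freeze() call on the object returned by copy().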

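get(name, default=None)

Gets a setting value without affecting its original type.

Parameters

name (str) -- the setting name
default (object) -- the value to return if no setting is found

getbool(name, default=False)

Gets a setting value as a boolean.

1, '1', True and 'True' return True, while 0, '0', False, 'False' and None return False.

For example, settings populated through environment variables set to '0' will return False when using this method.

Parameters

name (str) -- the setting name
default (object) -- the value to return if no setting is found

getdict(name, default=None)

Gets a setting value as a dictionary. If the setting's original type is a dictionary, a copy of it is returned. If it is a string, it is evaluated as a JSON dictionary. If it is a BaseSettings instance itself, it is converted to a dictionary containing all its current settings values as they would be returned by get(), losing all information about priority and mutability.

Parameters

name (str) -- the setting name
default (object) -- the value to return if no setting is found

getfloat(name, default=0.0)

Gets a setting value as a float.

Parameters

name (str) -- the setting name
default (object) -- the value to return if no setting is found

getint(name, default=0)

Gets a setting value as an int.

Parameters

name (str) -- the setting name
default (object) -- the value to return if no setting is found

getlist(name, default=None)

Gets a setting value as a list. If the setting's original type is a list, a copy of it is returned. If it is a string, it is split by ",".

For example, settings populated through environment variables set to 'one,two' will return the list ['one', 'two'] when using this method.

Parameters

name (str) -- the setting name
default (object) -- the value to return if no setting is found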

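getpriority(name)

Returns the current numerical priority value of a setting, or None if the given name does not exist.

Parameters: name (str) -- the setting name

getwithbase(name)

Gets a composition of a dictionary-like setting and its _BASE counterpart.

Parameters: name (str) -- name of the dictionary-like setting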

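maxpriority(): Returns the numerical value of the highest priority present throughout all settings, or the numerical value for default from SETTINGS_PRIORITIES if there are no settings stored.

set(name, value, priority='project')

Stores a key/value attribute with a given priority.

Settings should be populated before configuring the Crawler object (through the configure() method), otherwise they won't have any effect.

Parameters

name (str) -- the setting name
value (object) -- the value to associate with the setting
priority (str or int) -- the priority of the setting. Should be a key of SETTINGS_PRIORITIES or an integer

setmodule(module, priority='project')

Stores settings from a module with a given priority.

This is a helper function that calls set() for every globally declared uppercase variable of module with the provided priority.

Parameters

module (types.ModuleType or str) -- the module or the path of the module
priority (str or int) -- the priority of the setting. Should be a key of SETTINGS_PRIORITIES or an integer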

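update(values, priority='project')

Stores key/value pairs with a given priority.

This is a helper function that calls set() for every item of values with the provided priority.

If values is a string, it is assumed to be JSON-encoded and is parsed into a dict with json.loads() first. If it is a BaseSettings instance, the per-key priorities are used and the priority parameter is ignored. This allows inserting or updating settings with different priorities with a single command.

Parameters

values (dict or string or BaseSettings) -- the settings names and values
priority (str or int) -- the priority of the settings. Should be a key of SETTINGS_PRIORITIES or an integer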

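SpiderLoader API

class scrapy.spiderloader.SpiderLoader

This class is in charge of retrieving and handling the spider classes defined across the project.

Custom spider loaders can be employed by specifying their path in the SPIDER_LOADER_CLASS project setting. They must fully implement the scrapy.interfaces.ISpiderLoader interface to guarantee an errorless execution.

from_settings(settings)

This class method is used by Scrapy to create an instance of the class. It is called with the current project settings, and it loads the spiders found recursively in the modules of the SPIDER_MODULES setting.

Parameters: settings (Settings instance) -- project settings

load(spider_name)

Get the Spider class with the given name. It looks into the previously loaded spiders for a spider class with name spider_name and raises a KeyError if not found.

Parameters: spider_name (str) -- spider class name

list(): Get the names of the available spiders in the project.

find_by_request(request)

List the names of the spiders that can handle the given request. It tries to match the request's URL against the domains of the spiders.

Parameters: request (Request instance) -- queried request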

Signals API

class scrapy.signalmanager.SignalManager(sender=_Anonymous)

connect(receiver, signal, **kwargs)

Connect a receiver function to a signal.

The signal can be any object, although Scrapy comes with some predefined signals that are documented in the Signals section.

Parameters

receiver (collections.abc.Callable) -- the function to be connected
signal (object) -- the signal to connect to

disconnect(receiver, signal, **kwargs): Disconnect a receiver function from a signal. This has the opposite effect of the connect() method, and the arguments are the same.

disconnect_all(signal, **kwargs)

Disconnect all receivers from the given signal.

Parameters: signal (object) -- the signal to disconnect from

send_catch_log(signal, **kwargs)

Send a signal, catch exceptions and log them.

The keyword arguments are passed to the signal handlers (connected through the connect() method).

send_catch_log_deferred(signal, **kwargs)

Like send_catch_log() but supports returning Deferred objects from signal handlers.

Returns a Deferred that gets fired once the deferreds of all signal handlers have fired.

The keyword arguments are passed to the signal handlers (connected through the connect() method).

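Stats Collector API

There are several stats collectors available under the scrapy.statscollectors module, and they all implement the Stats Collector API defined by the StatsCollector class (which they all inherit from).

class scrapy.statscollectors.StatsCollector

get_value(key, default=None): Return the value for the given stats key, or default if it doesn't exist.

get_stats(): Get all stats from the currently running spider as a dict.

set_value(key, value): Set the given value for the given stats key.

set_stats(stats): Override the current stats with the dict passed in the stats argument.

inc_value(key, count=1, start=0): Increment the value of the given stats key by the given count, assuming the given start value when the key is not set.

max_value(key, value): Set the given value for the given key only if the current value for the same key is lower than value. If there is no current value for the given key, the value is always set.

min_value(key, value): Set the given value for the given key only if the current value for the same key is greater than value. If there is no current value for the given key, the value is always set.

clear_stats(): Clear all stats.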

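The following methods are not part of the stats collection API, but are used instead when implementing custom stats collectors:

open_spider(spider): Open the given spider for stats collection.

close_spider(spider): Close the given spider. After this is called, no more specific stats can be accessed or collected.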