selenium_scholar.py

This module provides classes for querying Google Scholar using Selenium and parsing the returned results. It is not a recursive crawler: it currently processes only the first results page.

class AddArticleTask(result, article)[source]

Bases: object

Task that adds an article to the result

apply(querier)[source]
get_citation_data(querier)[source]

Given an article, retrieves its citation link. This only works if the Scholar settings were adjusted to provide citation links before the article was retrieved.

class ParseTask(result)[source]

Bases: snowballing.scholar.ScholarArticleParser120726

Task that parses articles

apply(querier)[source]
handle_article(art)[source]

The parser invokes this callback on each article parsed successfully. In this base class, the callback does nothing.

handle_num_results(num_results)[source]

The parser invokes this callback if it determines the overall number of results, as reported on the parsed results page. The base class implementation does nothing.
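The callback contract described above can be illustrated without Scholar or Selenium. The following is a minimal stand-in, not the actual ParseTask: `ToyParser`, `CollectingParser`, and the line-based input format are hypothetical, and only the two callback names come from this page.

```python
class ToyParser:
    """Hypothetical stand-in for a results parser with no-op callbacks."""

    def handle_article(self, art):
        # Base implementation does nothing, as documented above.
        pass

    def handle_num_results(self, num_results):
        # Base implementation does nothing, as documented above.
        pass

    def parse(self, lines):
        # Assumed toy input: first line "results: N", then one title per line.
        header, *titles = lines
        self.handle_num_results(int(header.split(":")[1]))
        for title in titles:
            self.handle_article({"title": title.strip()})


class CollectingParser(ToyParser):
    """Overrides both callbacks to record what the parser reports."""

    def __init__(self):
        self.articles = []
        self.num_results = None

    def handle_article(self, art):
        self.articles.append(art)

    def handle_num_results(self, num_results):
        self.num_results = num_results


parser = CollectingParser()
parser.parse(["results: 2", "First paper", "Second paper"])
```

A subclass only needs to override the callbacks it cares about; the parsing loop itself stays in the base class.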

class QueryTask(query)[source]

Bases: object

Task that queries Google Scholar

apply(querier)[source]
class Result(query, html)[source]

Bases: object

Represents a result with articles

set_navigation(driver, name, text)[source]
class ScholarSettingsTask(pages=10, citform=0, new_window=False, collections=1)[source]

Bases: object

This class lets you adjust the Scholar settings for your session.

CITFORM_BIBTEX = 4
CITFORM_ENDNOTE = 3
CITFORM_NONE = 0
CITFORM_REFMAN = 2
CITFORM_REFWORKS = 1
COLLECTIONS_ARTICLES_AND_PATENTS = 1
COLLECTIONS_ARTICLES_ONLY = 0
COLLECTIONS_CASE_LAW = 2
SETTINGS_URL = 'http://scholar.google.com/scholar_settings?hl=en&as_sdt=0,5&sciodt=0,5'
apply(querier)[source]
citform
collections
new_window
per_page_results
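As a rough illustration of how the constants and constructor defaults fit together, here is a standalone sketch. `SettingsSketch` is a hypothetical class, not the real ScholarSettingsTask; the range checks and the mapping of the `pages` argument onto `per_page_results` are assumptions, since this page does not document them.

```python
class SettingsSketch:
    """Hypothetical mirror of the documented constants and defaults."""

    CITFORM_NONE = 0
    CITFORM_REFWORKS = 1
    CITFORM_REFMAN = 2
    CITFORM_ENDNOTE = 3
    CITFORM_BIBTEX = 4

    COLLECTIONS_ARTICLES_ONLY = 0
    COLLECTIONS_ARTICLES_AND_PATENTS = 1
    COLLECTIONS_CASE_LAW = 2

    def __init__(self, pages=10, citform=0, new_window=False, collections=1):
        # Assumption: values outside the documented constants are rejected.
        if citform not in range(self.CITFORM_NONE, self.CITFORM_BIBTEX + 1):
            raise ValueError("unknown citation format: %r" % (citform,))
        if collections not in range(self.COLLECTIONS_ARTICLES_ONLY,
                                    self.COLLECTIONS_CASE_LAW + 1):
            raise ValueError("unknown collection: %r" % (collections,))
        # Assumption: `pages` is exposed as the `per_page_results` attribute.
        self.per_page_results = pages
        self.citform = citform
        self.new_window = new_window
        self.collections = collections


# Request BibTeX citation links and keep the other defaults.
settings = SettingsSketch(citform=SettingsSketch.CITFORM_BIBTEX)
```

Using the named constants rather than bare integers keeps call sites readable, which matters here because the default `citform` of 0 means no citation links are provided.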
class SearchScholarQuery[source]

Bases: snowballing.scholar.ScholarQuery

This version represents the search query parameters the user can configure on the Scholar website, in the advanced search options.

SCHOLAR_QUERY_URL = 'http://scholar.google.com/scholar?'
get_url()[source]

Returns a complete, submittable URL string for this particular query instance. The URL and its arguments will vary depending on the query.

set_author(author)[source]

Sets names that must be on the result’s author list.

set_include_citations(yesorno)[source]
set_include_patents(yesorno)[source]
set_phrase(phrase)[source]

Sets phrase that must be found in the result exactly.

set_pub(pub)[source]

Sets the publication in which the result must be found.

set_scope(title_only)[source]

Sets Boolean indicating whether to search entire article or title only.

set_timeframe(start=None, end=None)[source]

Sets timeframe (in years as integer) in which result must have appeared. It’s fine to specify just start or end, or both.

set_words(words)[source]

Sets words that all must be found in the result.

set_words_none(words)[source]

Sets words of which none must be found in the result.

set_words_some(words)[source]

Sets words of which at least one must be found in result.
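To make the relationship between the setters and get_url() more concrete, the sketch below assembles a query string with the standard library. `build_query_url` is a hypothetical helper, not part of this module, and the `as_*` parameter names are an assumption based on Scholar's advanced search form; this page does not confirm them.

```python
from urllib.parse import urlencode

SCHOLAR_QUERY_URL = "http://scholar.google.com/scholar?"


def build_query_url(words=None, phrase=None, words_some=None, words_none=None,
                    author=None, pub=None, title_only=False,
                    start=None, end=None):
    """Hypothetical helper: one keyword argument per documented setter."""
    params = {
        "as_q": words or "",             # set_words
        "as_epq": phrase or "",          # set_phrase
        "as_oq": words_some or "",       # set_words_some
        "as_eq": words_none or "",       # set_words_none
        "as_occt": "title" if title_only else "any",  # set_scope
        "as_sauthors": author or "",     # set_author
        "as_publication": pub or "",     # set_pub
        "as_ylo": start or "",           # set_timeframe(start=...)
        "as_yhi": end or "",             # set_timeframe(end=...)
    }
    # urlencode escapes spaces and special characters in the values.
    return SCHOLAR_QUERY_URL + urlencode(params)


url = build_query_url(words="systematic review", author="kitchenham", start=2010)
```

Unset fields are sent as empty values here, which mirrors leaving a form field blank; a real implementation might omit them instead.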

class SeleniumScholarQuerier(driver=None)[source]

Bases: object

SeleniumScholarQuerier instances can conduct a search on Google Scholar and then parse the resulting HTML content. The articles found are collected in the articles member, a list of ScholarArticle instances.

apply_settings(*args, **kwargs)[source]

Applies settings

continue_tasks()[source]
get_response(url, log_msg=None, err_msg=None, condition=None)[source]

Helper method, sends HTTP request and returns response payload.

send_query(query)[source]

Initiates a search query (a ScholarQuery instance).
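The task classes on this page all expose apply(querier), and the querier exposes continue_tasks(). One plausible reading is a resumable task queue, where processing pauses when a step cannot complete (for example, while a captcha is pending) and picks up again later. The sketch below shows that pattern only; `ToyTask`, `ToyQuerier`, and the boolean return convention are assumptions, not the module's actual behavior.

```python
from collections import deque


class ToyTask:
    """Hypothetical task; `needs_user` simulates a step that blocks once."""

    def __init__(self, name, needs_user=False):
        self.name = name
        self.needs_user = needs_user

    def apply(self, querier):
        if self.needs_user:
            # Simulate a blocked step; the next attempt will succeed.
            self.needs_user = False
            return False
        querier.done.append(self.name)
        return True


class ToyQuerier:
    """Hypothetical querier that runs queued tasks in order."""

    def __init__(self):
        self.tasks = deque()
        self.done = []

    def continue_tasks(self):
        # Run tasks from the front of the queue; stop at the first one
        # that reports it could not finish, leaving it queued for a retry.
        while self.tasks:
            if not self.tasks[0].apply(self):
                return False
            self.tasks.popleft()
        return True


querier = ToyQuerier()
querier.tasks.extend([
    ToyTask("settings"),
    ToyTask("query", needs_user=True),
    ToyTask("parse"),
])
first = querier.continue_tasks()    # pauses at the blocked "query" task
paused_after = list(querier.done)
second = querier.continue_tasks()   # resumes and drains the queue
```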

class URLQuery(url, start=None)[source]

Bases: snowballing.scholar.ScholarQuery

Represents a Google Scholar query given as a generic URL. It is used to navigate citations.

get_url()[source]

Returns a complete, submittable URL string for this particular query instance. The URL and its arguments will vary depending on the query.
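The constructor's `start` argument suggests a result offset added to an otherwise fixed URL. A minimal sketch of that idea follows; `url_with_start` is a hypothetical helper, the `start` query parameter name is an assumption about Scholar's pagination, and the `cites` identifier is a made-up placeholder.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def url_with_start(url, start=None):
    """Hypothetical helper: return `url` with its `start` offset set."""
    if start is None:
        return url
    parts = urlsplit(url)
    # Drop any existing offset so the parameter is not duplicated.
    query = [(key, value)
             for key, value in parse_qsl(parts.query, keep_blank_values=True)
             if key != "start"]
    query.append(("start", str(start)))
    return urlunsplit(parts._replace(query=urlencode(query)))


# Placeholder "cited by" URL; the identifier 12345 is illustrative only.
cited_by = "http://scholar.google.com/scholar?cites=12345&hl=en"
page_two = url_with_start(cited_by, start=10)
```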

check_captcha(driver, condition)[source]

Checks whether a captcha is expected.

click(parent, selector)[source]

Clicks the element matching selector within parent.

get_scholar_url(work)[source]

Get scholar url from work

wait_for(driver, condition, delay=5)[source]

Waits up to delay seconds for an element to appear.
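The waiting behavior can be sketched as a simple polling loop. This is an illustration in plain Python, not the module's implementation, which most likely relies on Selenium's own wait utilities; `wait_for_sketch`, its `poll` parameter, the `None` return on timeout, and `FakeDriver` are all assumptions.

```python
import time


def wait_for_sketch(driver, condition, delay=5, poll=0.1):
    """Call condition(driver) until it is truthy or `delay` seconds pass."""
    deadline = time.monotonic() + delay
    while True:
        result = condition(driver)
        if result:
            return result
        if time.monotonic() >= deadline:
            # Assumption: signal a timeout with None instead of raising.
            return None
        time.sleep(poll)


class FakeDriver:
    """Stand-in for a browser driver; counts how often it was polled."""

    def __init__(self):
        self.calls = 0


def appears_on_third_poll(driver):
    driver.calls += 1
    return "element" if driver.calls >= 3 else None


found = wait_for_sketch(FakeDriver(), appears_on_third_poll, delay=1, poll=0.01)
missing = wait_for_sketch(FakeDriver(), lambda d: None, delay=0.05, poll=0.01)
```

The condition is a callable that receives the driver, so the same helper can wait for an element, a URL change, or a captcha to be solved.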