scholar.py

This module provides classes for querying Google Scholar and parsing returned results. It currently only processes the first results page. It is not a recursive crawler.

class ClusterScholarQuery(cluster=None)[source]

Bases: snowballing.scholar.ScholarQuery

This version just pulls up an article cluster whose ID we already know about.

SCHOLAR_CLUSTER_URL = 'http://scholar.google.com/scholar?cluster=%(cluster)s&num=%(num)s'
get_url()[source]

Returns a complete, submittable URL string for this particular query instance. The URL and its arguments will vary depending on the query.

set_cluster(cluster)[source]

Sets search to a Google Scholar results cluster ID.
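The template in SCHOLAR_CLUSTER_URL uses Python's named %-style placeholders, so building the URL amounts to substituting a mapping into it. The following is a minimal sketch of that substitution, not the module's own code: build_cluster_url is a hypothetical helper, the cluster ID is made up, and the default of 20 results is assumed from ScholarConf.MAX_PAGE_RESULTS.

```python
# Template copied from ClusterScholarQuery.SCHOLAR_CLUSTER_URL.
SCHOLAR_CLUSTER_URL = 'http://scholar.google.com/scholar?cluster=%(cluster)s&num=%(num)s'


def build_cluster_url(cluster, num=20):
    """Hypothetical helper: fill the template with a cluster ID and page size."""
    if cluster is None:
        # Without a cluster ID there is nothing to look up.
        raise ValueError('no cluster ID given')
    return SCHOLAR_CLUSTER_URL % {'cluster': cluster, 'num': num}


url = build_cluster_url('1234567890')  # made-up cluster ID
print(url)
```

With the actual classes, the equivalent steps would presumably be set_cluster() followed by get_url().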

exception Error[source]

Bases: Exception

Base class for any Scholar error.

exception FormatError[source]

Bases: snowballing.scholar.Error

A query argument or setting was formatted incorrectly.

exception QueryArgumentError[source]

Bases: snowballing.scholar.Error

A query did not have a suitable set of arguments.
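Because both specific exceptions derive from Error, a single except clause on the base class catches either of them. The sketch below re-declares the hierarchy locally so it runs on its own; the describe helper is hypothetical and exists only for illustration.

```python
class Error(Exception):
    """Base class for any Scholar error."""


class FormatError(Error):
    """A query argument or setting was formatted incorrectly."""


class QueryArgumentError(Error):
    """A query did not have a suitable set of arguments."""


def describe(exc_type):
    """Hypothetical helper: raise exc_type and report what the base handler caught."""
    try:
        raise exc_type('example')
    except Error as exc:  # catches every subclass of Error
        return type(exc).__name__


caught = [describe(FormatError), describe(QueryArgumentError)]
print(caught)
```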

class ScholarArticle[source]

Bases: object

A class representing articles listed on Google Scholar. The class provides basic dictionary-like behavior.

as_citation()[source]

Reports the article in a standard citation format. This works only if you have configured the querier to retrieve a particular citation export format. (See ScholarSettings.)

as_csv(header=False, sep='|')[source]
as_txt()[source]
set_citation_data(citation_data)[source]
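To illustrate what "dictionary-like behavior" and a separator-joined export could look like, here is a minimal sketch. ArticleSketch is a hypothetical stand-in, not ScholarArticle itself: the field names and the exact output layout are assumptions, and only the as_csv signature (header=False, sep='|') is taken from the listing above.

```python
class ArticleSketch:
    """Hypothetical stand-in for ScholarArticle; field names are invented."""

    def __init__(self):
        # Insertion order of the keys determines the column order.
        self._attrs = {'title': None, 'url': None, 'year': None}

    def __getitem__(self, key):
        return self._attrs[key]

    def __setitem__(self, key, value):
        self._attrs[key] = value

    def as_csv(self, header=False, sep='|'):
        keys = list(self._attrs)
        # Missing values become empty fields.
        row = sep.join('' if self._attrs[k] is None else str(self._attrs[k])
                       for k in keys)
        if header:
            return sep.join(keys) + '\n' + row
        return row


art = ArticleSketch()
art['title'] = 'An Example Title'
art['year'] = 2012
line = art.as_csv()
print(line)
```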
class ScholarArticleParser(site=None)[source]

Bases: object

ScholarArticleParser can parse HTML document strings obtained from Google Scholar. This is a base class; concrete implementations adapting to tweaks made by Google over time follow below.

handle_article(art)[source]

The parser invokes this callback on each article parsed successfully. In this base class, the callback does nothing.

handle_num_results(num_results)[source]

The parser invokes this callback if it determines the overall number of results, as reported on the parsed results page. The base class implementation does nothing.

parse(html)[source]

This method initiates parsing of HTML content, cleans resulting content as needed, and notifies the parser instance of resulting instances via the handle_article callback.
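The callback design described above can be sketched as follows: a base parser calls handle_article for every article it finds, and a subclass overrides the callback to do something with it. This is an illustration only. TitleParserSketch, CollectingParser, and the rule "one h3 element is one article" are assumptions made for the example; the real parser's HTML handling is not shown in this reference.

```python
from html.parser import HTMLParser


class TitleParserSketch(HTMLParser):
    """Hypothetical toy parser: treats each <h3> element as one article title."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._buffer = []

    def handle_article(self, art):
        """Callback invoked per parsed article; this base version does nothing."""

    def handle_starttag(self, tag, attrs):
        if tag == 'h3':
            self._in_title = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_title:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == 'h3' and self._in_title:
            self._in_title = False
            # Clean the collected text, then notify via the callback.
            self.handle_article({'title': ''.join(self._buffer).strip()})

    def parse(self, html):
        self.feed(html)
        self.close()


class CollectingParser(TitleParserSketch):
    """Overrides the callback to keep every article in a list."""

    def __init__(self):
        super().__init__()
        self.articles = []

    def handle_article(self, art):
        self.articles.append(art)


parser = CollectingParser()
parser.parse('<div><h3>First <b>paper</b></h3><h3>Second paper</h3></div>')
print(parser.articles)
```

Keeping the base callback empty means the parsing logic never needs to know what a caller does with the results.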

class ScholarArticleParser120201(site=None)[source]

Bases: snowballing.scholar.ScholarArticleParser

This class reflects an update to the Scholar results page layout that Google made recently.

class ScholarArticleParser120726(site=None)[source]

Bases: snowballing.scholar.ScholarArticleParser

This class reflects an update to the Scholar results page layout that Google made on 07/26/12.

class ScholarConf[source]

Bases: object

Helper class for global settings.

COOKIE_JAR_FILE = None
LOG_LEVEL = 3
MAX_PAGE_RESULTS = 20
SCHOLAR_SITE = 'http://scholar.google.com'
USER_AGENT = 'Mozilla/5.0 (X11; Linux x86_64; rv:27.0) Gecko/20100101 Firefox/27.0'
VERSION = '2.10'

class ScholarQuerier[source]

Bases: object

ScholarQuerier instances can conduct a search on Google Scholar with subsequent parsing of the resulting HTML content. The articles found are collected in the articles member, a list of ScholarArticle instances.

GET_SETTINGS_URL = 'http://scholar.google.com/scholar_settings?sciifh=1&hl=en&as_sdt=0,5'
class Parser(querier)[source]

Bases: snowballing.scholar.ScholarArticleParser120726

handle_article(art)[source]

The parser invokes this callback on each article parsed successfully. In this base class, the callback does nothing.

handle_num_results(num_results)[source]

The parser invokes this callback if it determines the overall number of results, as reported on the parsed results page. The base class implementation does nothing.

SET_SETTINGS_URL = 'http://scholar.google.com/scholar_setprefs?q=&scisig=%(scisig)s&inststart=0&as_sdt=1,5&as_sdtp=&num=%(num)s&scis1=%(scis)s%(scisf)s&hl=en&lang=all&instq=&inst=569367360547434339&save='
add_article(art)[source]
apply_settings(settings)[source]

Applies settings as provided by a ScholarSettings instance.

clear_articles()[source]

Clears any existing articles stored from previous queries.

get_citation_data(article)[source]

Given an article, retrieves the citation link. Note that this requires adjusting the settings beforehand, so that Google Scholar actually provides this information when the article is retrieved.

parse(html)[source]

This method allows parsing of provided HTML content.

save_cookies()[source]

This stores the latest cookies we’re using to disk, for reuse in a later session.

send_query(query)[source]

This method initiates a search query (a ScholarQuery instance) with subsequent parsing of the response.

class ScholarQuery[source]

Bases: object

The base class for any kind of results query we send to Scholar.

get_url()[source]

Returns a complete, submittable URL string for this particular query instance. The URL and its arguments will vary depending on the query.

set_num_page_results(num_page_results)[source]

class ScholarSettings[source]

Bases: object

This class lets you adjust the Scholar settings for your session. It’s intended to mirror the features tunable in the Scholar Settings pane, but right now it’s a bit basic.

CITFORM_BIBTEX = 4
CITFORM_ENDNOTE = 3
CITFORM_NONE = 0
CITFORM_REFMAN = 2
CITFORM_REFWORKS = 1
is_configured()[source]
set_citation_format(citform)[source]
set_per_page_results(per_page_results)[source]
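The CITFORM_* constants are plain integers, so a caller has to know which number stands for which export format. As a small sketch, the lookup below maps human-readable names to the constant values listed above. CITFORM_BY_NAME and citation_format_from_name are hypothetical and not part of the module; the resulting integer is what one would presumably pass to set_citation_format().

```python
# Values copied from the ScholarSettings constants.
CITFORM_NONE = 0
CITFORM_REFWORKS = 1
CITFORM_REFMAN = 2
CITFORM_ENDNOTE = 3
CITFORM_BIBTEX = 4

# Hypothetical name-to-constant lookup; not part of the module.
CITFORM_BY_NAME = {
    'none': CITFORM_NONE,
    'refworks': CITFORM_REFWORKS,
    'refman': CITFORM_REFMAN,
    'endnote': CITFORM_ENDNOTE,
    'bibtex': CITFORM_BIBTEX,
}


def citation_format_from_name(name):
    """Return the CITFORM_* value for a case-insensitive format name."""
    try:
        return CITFORM_BY_NAME[name.lower()]
    except KeyError:
        raise ValueError('unknown citation format: %r' % name) from None


citform = citation_format_from_name('BibTeX')
print(citform)
```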
class ScholarUtils[source]

Bases: object

A wrapper for various utensils that come in handy.

LOG_LEVELS = {'debug': 4, 'error': 1, 'info': 3, 'warn': 2}
static ensure_int(arg, msg=None)[source]
static log(level, msg)[source]
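One plausible reading of LOG_LEVELS together with ScholarConf.LOG_LEVEL is a threshold filter: a message is emitted only if its numeric level does not exceed the configured level, so with LOG_LEVEL = 3 debug messages would be suppressed. The sketch below implements that reading as an assumption, not as the module's confirmed behavior. should_log and the message layout are hypothetical, and ensure_int raises ValueError here only to stay self-contained.

```python
import sys

LOG_LEVELS = {'debug': 4, 'error': 1, 'info': 3, 'warn': 2}  # from ScholarUtils
LOG_LEVEL = 3  # from ScholarConf


def should_log(level, threshold=LOG_LEVEL):
    """Assumed rule: emit a message if its level is at or below the threshold."""
    if level not in LOG_LEVELS:
        return False
    return LOG_LEVELS[level] <= threshold


def log(level, msg):
    """Write the message to stderr when the assumed threshold rule allows it."""
    if should_log(level):
        sys.stderr.write('[%5s]  %s\n' % (level.upper(), msg))


def ensure_int(arg, msg=None):
    """Convert arg to int, or fail with the caller-supplied message."""
    try:
        return int(arg)
    except (TypeError, ValueError):
        raise ValueError(msg or 'not an integer: %r' % (arg,)) from None


log('info', 'this line is written to stderr')
log('debug', 'this line is filtered out')
page_size = ensure_int('20', 'page size must be numeric')
```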
class SearchScholarQuery[source]

Bases: snowballing.scholar.ScholarQuery

This version represents the search query parameters the user can configure on the Scholar website, in the advanced search options.

SCHOLAR_QUERY_URL = 'http://scholar.google.com/scholar?as_q=%(words)s&as_epq=%(phrase)s&as_oq=%(words_some)s&as_eq=%(words_none)s&as_occt=%(scope)s&as_sauthors=%(authors)s&as_publication=%(pub)s&as_ylo=%(ylo)s&as_yhi=%(yhi)s&as_sdt=%(patents)s%%2C5&as_vis=%(citations)s&btnG=&hl=en&num=%(num)s'
get_url()[source]

Returns a complete, submittable URL string for this particular query instance. The URL and its arguments will vary depending on the query.
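For the search query, the URL is again a %-style template, this time with one placeholder per advanced-search field. The sketch below shows how such a template could be filled. build_search_url is a hypothetical helper, the search terms are made up, and the values used for scope, patents, and citations ('any', '0', '0') are assumptions. Note that %%2C5 in the template becomes the literal %2C5 after substitution.

```python
from urllib.parse import quote

# Template copied from SearchScholarQuery.SCHOLAR_QUERY_URL, split for readability.
SCHOLAR_QUERY_URL = (
    'http://scholar.google.com/scholar?'
    'as_q=%(words)s&as_epq=%(phrase)s&as_oq=%(words_some)s&as_eq=%(words_none)s'
    '&as_occt=%(scope)s&as_sauthors=%(authors)s&as_publication=%(pub)s'
    '&as_ylo=%(ylo)s&as_yhi=%(yhi)s&as_sdt=%(patents)s%%2C5'
    '&as_vis=%(citations)s&btnG=&hl=en&num=%(num)s'
)


def build_search_url(words='', phrase='', authors='', start=None, end=None, num=20):
    """Hypothetical helper: percent-encode the inputs and fill the template."""
    params = {
        'words': quote(words),
        'phrase': quote(phrase),
        'words_some': '',
        'words_none': '',
        'scope': 'any',        # assumed value for "search the entire article"
        'authors': quote(authors),
        'pub': '',
        'ylo': '' if start is None else int(start),
        'yhi': '' if end is None else int(end),
        'patents': '0',        # assumed value
        'citations': '0',      # assumed value
        'num': num,
    }
    return SCHOLAR_QUERY_URL % params


url = build_search_url(phrase='snowball sampling', authors='doe', start=2004)
print(url)
```

Leaving a year bound as an empty string mirrors the note under set_timeframe that specifying only one of start or end is fine.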

set_author(author)[source]

Sets names that must be on the result’s author list.

set_include_citations(yesorno)[source]
set_include_patents(yesorno)[source]
set_phrase(phrase)[source]

Sets the phrase that must appear verbatim in the result.

set_pub(pub)[source]

Sets the publication in which the result must be found.

set_scope(title_only)[source]

Sets a Boolean indicating whether to search the entire article or the title only.

set_timeframe(start=None, end=None)[source]

Sets the timeframe (years, as integers) in which the result must have appeared. It’s fine to specify only start, only end, or both.

set_words(words)[source]

Sets words that must all be found in the result.

set_words_none(words)[source]

Sets words, none of which may be found in the result.

set_words_some(words)[source]

Sets words of which at least one must be found in the result.

citation_export(querier)[source]
csv(querier, header=False, sep='|')[source]
encode(s)[source]
main()[source]
txt(querier, with_globals)[source]