scholar.py
This module provides classes for querying Google Scholar and parsing returned results. It currently only processes the first results page. It is not a recursive crawler.
-
class ClusterScholarQuery(cluster=None)
Bases: snowballing.scholar.ScholarQuery
This version just pulls up an article cluster whose ID we already know about.
-
SCHOLAR_CLUSTER_URL = 'http://scholar.google.com/scholar?cluster=%(cluster)s&num=%(num)s'
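The URL template uses Python's named `%`-style placeholders. As a minimal sketch of how such a template is filled in (the helper name `build_cluster_url` and the default `num` are assumptions for illustration, not part of the module):

```python
# Template copied from the SCHOLAR_CLUSTER_URL attribute documented above.
SCHOLAR_CLUSTER_URL = 'http://scholar.google.com/scholar?cluster=%(cluster)s&num=%(num)s'


def build_cluster_url(cluster, num=20):
    """Hypothetical helper: fill the template's named placeholders."""
    return SCHOLAR_CLUSTER_URL % {'cluster': cluster, 'num': num}


# The cluster ID below is a made-up value used only for illustration.
url = build_cluster_url('1234567890', num=10)
print(url)
```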
-
-
exception FormatError
Bases: snowballing.scholar.Error
A query argument or setting was formatted incorrectly.
-
exception QueryArgumentError
Bases: snowballing.scholar.Error
A query did not have a suitable set of arguments.
-
class ScholarArticle
Bases: object
A class representing articles listed on Google Scholar. The class provides basic dictionary-like behavior.
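To illustrate what "dictionary-like behavior" can mean in practice, here is a minimal sketch. `ArticleSketch` and its attribute names are hypothetical stand-ins and do not reproduce the real ScholarArticle implementation:

```python
class ArticleSketch:
    """Hypothetical stand-in, not the real ScholarArticle."""

    def __init__(self):
        # Assumed attribute names, chosen only for this example.
        self.attrs = {'title': None, 'url': None, 'year': None}

    def __getitem__(self, key):
        # Unknown keys yield None instead of raising KeyError.
        return self.attrs.get(key)

    def __setitem__(self, key, value):
        self.attrs[key] = value

    def __delitem__(self, key):
        del self.attrs[key]


art = ArticleSketch()
art['title'] = 'An Example Paper'
art['year'] = 2012
print(art['title'], art['year'])
```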
-
class ScholarArticleParser(site=None)
Bases: object
ScholarArticleParser can parse HTML document strings obtained from Google Scholar. This is a base class; concrete implementations adapting to tweaks made by Google over time follow below.
-
handle_article(art)
The parser invokes this callback on each successfully parsed article. In this base class, the callback does nothing.
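The callback pattern described here can be sketched as follows. `BaseParser` is a hypothetical stand-in for ScholarArticleParser, and its `parse` method takes already-parsed articles instead of HTML so that the example runs on its own:

```python
class BaseParser:
    """Hypothetical stand-in for ScholarArticleParser, not the real class."""

    def handle_article(self, art):
        # Mirrors the documented behavior: the base callback does nothing.
        pass

    def parse(self, parsed_articles):
        # Stand-in for real HTML parsing: hand each article to the callback.
        for art in parsed_articles:
            self.handle_article(art)


class CollectingParser(BaseParser):
    """Subclass that keeps every article passed to the callback."""

    def __init__(self):
        self.articles = []

    def handle_article(self, art):
        self.articles.append(art)


parser = CollectingParser()
parser.parse([{'title': 'Paper A'}, {'title': 'Paper B'}])
titles = [art['title'] for art in parser.articles]
print(titles)
```

Overriding the callback in a subclass keeps the parsing logic separate from whatever the caller wants to do with each result.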
-
-
class ScholarArticleParser120201(site=None)
Bases: snowballing.scholar.ScholarArticleParser
This class reflects an update that Google made to the Scholar results page layout, apparently on 02/01/12 judging by the class name.
-
class ScholarArticleParser120726(site=None)
Bases: snowballing.scholar.ScholarArticleParser
This class reflects the update that Google made to the Scholar results page layout on 07/26/12.
-
class ScholarConf
Bases: object
Helper class for global settings.
-
COOKIE_JAR_FILE = None
-
LOG_LEVEL = 3
-
MAX_PAGE_RESULTS = 20
-
SCHOLAR_SITE = 'http://scholar.google.com'
-
USER_AGENT = 'Mozilla/5.0 (X11; Linux x86_64; rv:27.0) Gecko/20100101 Firefox/27.0'
-
VERSION = '2.10'
-
-
class ScholarQuerier
Bases: object
ScholarQuerier instances can conduct a search on Google Scholar with subsequent parsing of the resulting HTML content. The articles found are collected in the articles member, a list of ScholarArticle instances.
-
GET_SETTINGS_URL = 'http://scholar.google.com/scholar_settings?sciifh=1&hl=en&as_sdt=0,5'
-
class Parser(querier)
-
SET_SETTINGS_URL = 'http://scholar.google.com/scholar_setprefs?q=&scisig=%(scisig)s&inststart=0&as_sdt=1,5&as_sdtp=&num=%(num)s&scis1=%(scis)s%(scisf)s&hl=en&lang=all&instq=&inst=569367360547434339&save='
-
get_citation_data(article)
Given an article, retrieves its citation link. Note that this only works if, before retrieving the article, you adjusted the settings so that Google Scholar actually provides this information.
This stores the latest cookies we’re using to disk, for reuse in a later session.
-
-
class ScholarQuery
Bases: object
The base class for any kind of results query we send to Scholar.
-
class ScholarSettings
Bases: object
This class lets you adjust the Scholar settings for your session. It’s intended to mirror the features tunable in the Scholar Settings pane, but right now it’s a bit basic.
-
CITFORM_BIBTEX = 4
-
CITFORM_ENDNOTE = 3
-
CITFORM_NONE = 0
-
CITFORM_REFMAN = 2
-
CITFORM_REFWORKS = 1
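The citation-format constants presumably feed the `scis` and `scisf` placeholders of SET_SETTINGS_URL. The sketch below shows one plausible mapping; the helper name `citation_url_args` and the concrete placeholder values (`'yes'`, `'no'`, `'&scisf=N'`) are assumptions, not behavior confirmed by this page:

```python
# Constants copied from the ScholarSettings attributes documented above.
CITFORM_NONE = 0
CITFORM_REFWORKS = 1
CITFORM_REFMAN = 2
CITFORM_ENDNOTE = 3
CITFORM_BIBTEX = 4


def citation_url_args(citform):
    """Hypothetical helper: map a citation format to URL placeholder values."""
    if citform not in range(CITFORM_NONE, CITFORM_BIBTEX + 1):
        raise ValueError('citation format invalid: %r' % (citform,))
    if citform == CITFORM_NONE:
        # Assumed: no citation export link is requested.
        return {'scis': 'no', 'scisf': ''}
    # Assumed: the numeric format is passed along as an extra URL argument.
    return {'scis': 'yes', 'scisf': '&scisf=%d' % citform}


print(citation_url_args(CITFORM_BIBTEX))
```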
-
-
class ScholarUtils
Bases: object
A wrapper for various utilities that come in handy.
-
LOG_LEVELS = {'debug': 4, 'error': 1, 'info': 3, 'warn': 2}
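One way these numeric levels could interact with ScholarConf.LOG_LEVEL is a simple threshold check. The sketch below assumes a message is emitted when its level number does not exceed the configured threshold; the functions `should_log` and `log` are hypothetical and their names and output format are not taken from the module:

```python
import sys

# Values copied from the documented attributes.
LOG_LEVELS = {'debug': 4, 'error': 1, 'info': 3, 'warn': 2}
LOG_LEVEL = 3  # documented default of ScholarConf.LOG_LEVEL


def should_log(level, threshold=LOG_LEVEL):
    """Assumed rule: emit when the level's number is <= the threshold."""
    if level not in LOG_LEVELS:
        return False
    return LOG_LEVELS[level] <= threshold


def log(level, msg, stream=sys.stderr):
    # Write the message only if it passes the threshold check.
    if should_log(level):
        stream.write('[%5s]  %s\n' % (level.upper(), msg))


log('info', 'shown at the default threshold')
log('debug', 'suppressed at the default threshold')
```

Under this assumption the default threshold of 3 lets error, warn, and info messages through and drops debug messages.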
-
-
class SearchScholarQuery
Bases: snowballing.scholar.ScholarQuery
This version represents the search query parameters the user can configure on the Scholar website, in the advanced search options.
-
SCHOLAR_QUERY_URL = 'http://scholar.google.com/scholar?as_q=%(words)s&as_epq=%(phrase)s&as_oq=%(words_some)s&as_eq=%(words_none)s&as_occt=%(scope)s&as_sauthors=%(authors)s&as_publication=%(pub)s&as_ylo=%(ylo)s&as_yhi=%(yhi)s&as_sdt=%(patents)s%%2C5&as_vis=%(citations)s&btnG=&hl=en&num=%(num)s'
-
get_url()
Returns a complete, submittable URL string for this particular query instance. The URL and its arguments will vary depending on the query.
Sets names that must be on the result’s author list.
-
set_scope(title_only)
Sets a Boolean indicating whether to search the entire article or the title only.
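As a rough sketch of how a query URL might be assembled from SCHOLAR_QUERY_URL, including the title-only scope, consider the following. The function `build_query_url`, the scope values `'title'` and `'any'`, and the `'0'` defaults for `patents` and `citations` are assumptions made for illustration:

```python
from urllib.parse import quote

# Template copied from the SCHOLAR_QUERY_URL attribute documented above.
SCHOLAR_QUERY_URL = (
    'http://scholar.google.com/scholar?'
    'as_q=%(words)s&as_epq=%(phrase)s&as_oq=%(words_some)s&as_eq=%(words_none)s'
    '&as_occt=%(scope)s&as_sauthors=%(authors)s&as_publication=%(pub)s'
    '&as_ylo=%(ylo)s&as_yhi=%(yhi)s&as_sdt=%(patents)s%%2C5'
    '&as_vis=%(citations)s&btnG=&hl=en&num=%(num)s'
)


def build_query_url(words='', phrase='', authors='', title_only=False, num=20):
    """Hypothetical helper: percent-encode the arguments and fill the template."""
    args = {
        'words': words, 'phrase': phrase, 'words_some': '', 'words_none': '',
        'scope': 'title' if title_only else 'any',  # assumed scope values
        'authors': authors, 'pub': '', 'ylo': '', 'yhi': '',
        'patents': '0', 'citations': '0',  # assumed defaults
        'num': num,
    }
    encoded = {key: quote(str(value)) for key, value in args.items()}
    return SCHOLAR_QUERY_URL % encoded


url = build_query_url(words='deep learning', authors='Doe', title_only=True, num=10)
print(url)
```

Encoding each value before substitution keeps spaces and other reserved characters from breaking the query string, while the template's `%%2C5` collapses to a literal `%2C5`.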
-