Google Scholar
Access to scholarly content:
Journal articles
Books
Preprints
Technical Reports
OAI data
Abstracting & indexing (A&I) services (publicly available)
How it works
GS identifies "scholarly content" using simple rules based on the common formats and traits of scholarly papers
They extract title, author, abstract & references to build an index
Once papers and references are interlinked, algorithms create indexes and rankings
Google weighting factors
Google takes
The phrase hits (the Xs),
The adjacency hits (the Ys),
The weight hits (the Zs), and
About 100 other secret variables
Throws out everything but the top 2,000
Multiplies each remaining page's individual score by its "PageRank"
And, finally, displays the top 1,000 in order.
Patrick Crispen (CSU Fullerton).
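The steps listed above can be sketched in a few lines of Python. This is an illustration only: the Page fields, the rank function, and the plain sum used to combine the X, Y and Z hits are all assumptions made for the example, and the roughly 100 secret variables are left out because they are not public.

```python
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    phrase_hits: float     # the Xs
    adjacency_hits: float  # the Ys
    weight_hits: float     # the Zs
    pagerank: float


def rank(pages, shortlist=2000, displayed=1000):
    # Step 1: combine the hit scores. A plain sum is an assumption;
    # the real combination is proprietary.
    def base_score(p):
        return p.phrase_hits + p.adjacency_hits + p.weight_hits

    # Step 2: throw out everything but the top `shortlist` pages.
    kept = sorted(pages, key=base_score, reverse=True)[:shortlist]

    # Step 3: multiply each remaining page's score by its PageRank.
    scored = [(base_score(p) * p.pagerank, p) for p in kept]

    # Step 4: return the top `displayed` pages in order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [p.url for _, p in scored[:displayed]]
```

Note the consequence of the ordering: a page with a very high PageRank never reaches the multiplication step if its hit score does not get it into the shortlist first.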
Google PageRank
There is a premise in higher education that the importance of a research paper can be judged by the number of citations the paper has from other research papers.
Google simply applies this premise to the Web: the importance of a Web page can be judged by the number of hyperlinks pointing to it from other pages.
Source: Google Hacks, p. 294
The PageRank Algorithm
PR(A) = (1 - d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
Where
PR(A) is the PageRank of page A
PR(T1) is the PageRank of page T1, one of the pages T1 ... Tn that link to page A
C(T1) is the number of outgoing links from page T1
d is a damping factor in the range 0 < d < 1, usually set to 0.85
Source: Google Hacks, p. 295
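A minimal sketch of how PageRank values can be computed by repeated substitution, using the symbols defined above (PR, C, d). The function name pagerank, the links dictionary, the starting value of 1.0 and the fixed iteration count are assumptions made for the example, not details from Google Hacks.

```python
def pagerank(links, d=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    pr = {page: 1.0 for page in pages}  # starting guess (an assumption)
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # Sum PR(T)/C(T) over every page T that links to `page`.
            incoming = sum(
                pr[t] / len(links[t])
                for t in pages
                if page in links.get(t, [])
            )
            # PR(A) = (1 - d) + d * (sum of PR(T)/C(T))
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr


# Example: C is linked from both A and B, so it ends up ranked above B.
ranks = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
```

Each pass feeds the previous estimates back into the formula; with d below 1 the values settle after a few dozen passes on a small graph like this one.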
GS Ranking Algorithms
Factors
Keyword/Author
Publication ranking
Number of citations
Outgoing and incoming
ONLY from the GS corpus
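To illustrate the "only from the GS corpus" restriction, here is a small sketch of counting outgoing and incoming citations while ignoring references to papers outside the indexed set. The citation_counts function and its input format are hypothetical; how GS actually counts citations is not published.

```python
def citation_counts(references):
    """references maps a paper ID to the list of paper IDs it cites."""
    corpus = set(references)  # only papers in the index count
    counts = {paper: {"outgoing": 0, "incoming": 0} for paper in corpus}
    for paper, cited in references.items():
        for target in set(cited):  # ignore duplicate references
            if target in corpus and target != paper:
                counts[paper]["outgoing"] += 1
                counts[target]["incoming"] += 1
    return counts
```

A paper that is heavily cited by material outside the corpus would therefore show a lower count than it would in a more complete index.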
Human Genome
Searching WWW (millions of pages)
Number of hits: 5M+
Top results:
Genome centers
Research institutions
Google Scholar
Major publishers + A&I services (100k sources)
Number of hits: 260k
Top results:
Landmark papers from Science & Nature with links to free versions of full text
Access to Journal Content
Provides access to publishers' websites
TOCs, abstracts, full text if available
Elsevier, Wiley, major scholarly pubs
Not normally spidered by Google due to its "anti-cloaking" policy
Articles on personal websites, institutional archives, e-print databases
GS caveats
Fairly random, incomplete selection of materials available
Only a starting point for advanced searches
Lacking content in the Social Sciences & Humanities
Building its own lexicon based on citations & frequency of word appearance
GS Desired Features
Filtering / Screening capability
(metadata filters)
Sorting options
By availability (OA vs. Controlled)
Searching only select content
Specific publishers
Things GS won't tell us
Search algorithm (proprietary)
How it defines "scholarly content"
How often the data is updated
Sources
How/when/if Google Print content will be added
Issues
Threat to the A&I Services; Libraries?
Competition from Yahoo Search / Scopus
Where are the ads?
Censoring / filtering
Is quality naturally reflected in the way items are ranked?
Versioning problem
Link resolver obsolescence
Link Resolvers for Local Content
Follows the OpenURL standard
Requires the institution to have a link resolver (LR) installed
Default access for on-campus users
Offsite users can select manually
Local content ranked higher
Some problems with connectivity
PMIDs and DOIs work well
Some e-print and institutional repositories do not
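As a rough illustration of why PMIDs and DOIs resolve well, the sketch below builds an OpenURL 1.0 (Z39.88-2004) link in key/encoded-value form from either identifier. The helper name openurl_for and the resolver address are hypothetical; a real link uses the institution's own resolver base URL.

```python
from urllib.parse import urlencode


def openurl_for(base_url, doi=None, pmid=None):
    """Build an OpenURL query carrying a DOI and/or PMID as rft_id."""
    params = [("url_ver", "Z39.88-2004")]
    if doi:
        params.append(("rft_id", f"info:doi/{doi}"))
    if pmid:
        params.append(("rft_id", f"info:pmid/{pmid}"))
    return f"{base_url}?{urlencode(params)}"


# Hypothetical resolver address, for illustration only.
link = openurl_for("https://resolver.example.edu/openurl", doi="10.1000/182")
```

A single unambiguous identifier is all the resolver needs. Records from e-print servers and repositories often lack one, so the resolver has to fall back on matching title, author and date metadata, which is less reliable.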