About -- Search engines -- Directories -- Archiving -- Webspam -- Spidering -- References
This text can now be regarded as primarily historical. Though some of the methods still work, web search engines that return pages by content are all but extinct, so finding content on the web is a lost cause. You can still read this text as a guide to the possibilities the web could offer but for the commercial interests of companies and the active disinterest of politicians in independent information.
The topic of this document is web research, or how to obtain a specific piece of information, or information about a specific topic, via the web. Achieving this includes but goes beyond using search engines. If you are looking for an introduction to using web search engines, this page is not for you. The word "introduction" in the title merely indicates that it is meant to be broad and inclusive rather than deep, covering all aspects of web research but not delving too deeply into any of them.
This document will also not teach you how to evaluate what you find. Besides obvious sanity checks (an introductory page recommends: "Look at the title text. Does it relate to the subject?"), I assume that the searcher is sufficiently familiar with the topic to weed out junk, or else unbiased enough to view a representative number of sources.
This page was written to collect my experience and thoughts on how to squeeze information from the web in the face of an overwhelmingly low signal-to-noise ratio. It is aimed at scientists and engineers, perhaps journalists and bloggers, generally everybody who requires information which can in principle be found on the web, if only one knew how to get at it. If you know what you are looking for, or if you regularly find yourself clicking through the first five to ten result pages of a search engine rather than just the first, the following may help you.
Caveats -- Advanced -- Customising -- Search terms -- References
The first stop in any attempt to obtain information from the web is likely to be a general-purpose search engine. The main reason for this is that most searches are trivial (a forgotten domain name of a company; information on a general topic). For specific searches and serious research, search engines are quickly becoming more and more useless. Nonetheless this section will discuss how to get the most out of some search engines.
While everybody probably has their own preferred search engine, I will briefly
name the ones I use and their relative merits. The search engine I tend to use
first is DuckDuckGo. In my experience, it returns
good results even for specific queries about technological topics, which are
the kinds of searches I do most often.
When DuckDuckGo does not turn up any satisfactory results, I use a major search
engine such as Google,
Startpage
(Google-derived, with more privacy) or
Bing
. Their larger index gives them an advantage that
their loose matching criteria do not entirely negate. Google in particular is
acknowledged to have the largest index of all single search engines and is
comparatively topical if the query is simple enough.
Lastly, when other search engines return a large number of false positives
(i.e., results which do not meet my criteria), I use a powerful
search spider of mine to filter the results of the meta search engine
Dogpile. If that fails, I conclude that what I want
probably cannot be found on the open internet (i.e. outside pay/registration
walls).
Two other search engines that deserve a special mention are
Exalead and Clusty
.
Exalead is one of very few search engines (the only one?) which supports
advanced features like proximity search and truncation. On the negative side,
its small index size is sometimes painfully obvious. Clusty is a meta search
engine which clusters results according to what additional words appear in
them. This makes it possible to quickly discard results that contain the right search
terms in the wrong context, and it sometimes suggests related search terms.
A few years ago, this section contained some words about how strictly different search engines obeyed the search terms. Now, all major search engines have a strong bias in favour of popularity over topicality as per the given search terms. Exalead and DuckDuckGo are notable exceptions in that they take search terms comparatively seriously.
The most important thing to know about search engines is that they are mostly run by advertising companies seeking to lure viewers to their ads. This has two separate consequences that present difficulties for researchers. For one thing, search engines tend to prefer popular results over topical ones, since most people prefer being entertained to learning. Besides, advertising companies are subject to an obvious conflict of interest between providing even-handed search results and preferring their paying customers.
Among the effects of the first problem is most search engines' lax interpretation of search terms. They do not actually search for the terms you give them in the way the grep command or a search in a text editor would. Instead, they take the liberty of substituting words they consider "equivalent", by whatever unpublished definition. Sometimes words with the same stem seem to be considered equivalent, sometimes even synonyms. In some of my Google searches, I have discovered that around half the results did not contain the specified search term (though of course part of that could be due to page contents having changed since they had been indexed). The extent of the problem rises for rare search terms, because like all salesmen, search engines hate admitting they have nothing to offer.
This is bad, because no search terms are really equivalent when you search for something specific. Synonyms are always inexact, and words from the same family may be used in different contexts — "distribution" may get you a linux variant or information about product delivery, but "distributive" refers to a property in algebra. With the vague interpretation of the search terms common today, you get no proper feedback on whether your search terms were useful, and the iterative cycle of choosing terms and seeing if they are any good gets disrupted. Most search engines in theory support forcing a search term by prepending a plus sign, but in reality this merely increases the chances slightly.
Another important caveat, depending on where you live, is that Google (and
possibly other search engines too) returns results tailored towards the
language setting of your browser. After noticing that some of my searches
seemed irreproducible between different browsers, I once investigated this in detail.
I found that around half the results on the first result page differed when it was
retrieved from different countries. This is partly due to the Accept-Language header
that browsers send to the server, and partly due to Google
using geolocation to detect where requests come from. This feature is clearly
designed for targeted advertising, and the bias it creates in search results is
the reason I avoid Google whenever possible. Ask, for
example, allows you to perform English-language searches from Germany by using the
form at http://www.ask.com/?o=312&l=dir. DuckDuckGo and Exalead seem
unbiased with respect to client geolocation.
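If you want to check this for yourself, one rough approach is to request the same query with two different Accept-Language headers and compare the result pages. The sketch below uses curl with a placeholder search URL; substitute the result-page URL of the engine you want to test (its GET parameters are discussed further down), and mind its terms of service regarding automated requests (see the spidering section).

# Fetch the same query with two different language preferences and compare.
# The URL is a placeholder, not a real endpoint.
query_url='https://search.example.net/search?q=distribution'
curl -s -H 'Accept-Language: en-US,en;q=0.9' "$query_url" > results-en.html
curl -s -H 'Accept-Language: de-DE,de;q=0.9' "$query_url" > results-de.html
diff results-en.html results-de.html | less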
The second problem mentioned above, an advertiser's conflict of interest, results in links to certain sites being inserted spuriously or disproportionately into the result list. Often these links lead to services of the search engine company itself, such as Google Maps. Some search engines mirror other sites in order to keep viewers at their own sites (and in reach of advertising): For example, Ask only references its own copy of Wikipedia in its search results, which is not always up to date. Clusty (or its alias Yippy) strongly favours results at the domain highbeam.com, apparently an affiliated company. Similarly, Ask sometimes inserts results linking to news articles at moreover.com.
What one should learn from all this is not to trust search engine results blindly. Nonetheless the following sections present the features they offer.
That a prepended plus sign enforces your choice of search terms was already
mentioned in the previous section. Correspondingly, prepending a minus sign
excludes pages containing a term (but you likely knew that). (However, Google
now uses prepended punctuation for
other
purposes.) The third moderately advanced feature that one frequently needs is
enclosing a phrase in double quotes. This causes the search engine to search
for the phrase rather than just its words in any order. Putting periods
between the words (without spaces) works just as well, because search engines
mostly treat punctuation as spaces but keep the words in the right order.
Further features that may be useful sometimes are operators prepended to search
words. These operators are often undocumented by the search engines. Google
used to have a help page about them, which seems to have been removed
(archived
copy) and hidden in a
support answers
database
, but the operators themselves still work. Ask does not document its
operators, but one can reverse engineer them from the result URL of its
advanced search form. Exalead
provides complete
documentation
. DuckDuckGo does not document the standard operators either,
even though it supports most of them. However, unlike other search engines, it
supports category keywords prefixed with an exclamation mark.
site: may work better (possibly only) when followed by the full server name, i.e. www.example.foo not example.foo. A few undocumented operators work too. Here is an overview:
Function | Prefix |
---|---|
Restrict domain | site: |
Restrict file type | filetype: (not for Exalead) |
Require keyword in title | intitle: (DuckDuckGo/Clusty: title:) |
Require keyword in URL | inurl: (Clusty: url:) |
Require link to site or URL | link: |
As a convenience feature, Google allows allinurl: and allintitle: to
require all search terms to appear in the URL or title. Ask seems to
always apply inurl: and intitle: to all following search terms. The
missing documentation suggests that Google and Ask do not consider their
operators important. To be sure, you had better try them out
with a query whose results you can easily verify before relying on them.
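As an illustration, here is a made-up query combining these operators (spelled as in the first column of the table above), which asks for PDF documents on one server with a given word in the title:

site:www.example.org filetype:pdf intitle:calibration gyroscope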
By contrast, Exalead's main selling point is its advanced features. For
example, I have found its link: operator to return considerably more
complete results than Google's. Besides, it allows the operators
language:, spellslike:, soundslike:,
before: and after:. The last two restrict the modification
date of the result document, not merely the date it was last indexed.
Amazingly, they seem to work even for servers that do not return a correct
modification date. (Most servers, unlike mine, just return the current date/time. The
links browser can display this information with the = key.)
Apparently Exalead manages to extract date information from the URL or page
text.
The usefulness of the operators depends strongly on what kind of search you do. site: makes it possible to search a specific site or site type (by giving the top-level domain such as .net). Requiring an important keyword in the title or the URL may sometimes give more pertinent results, but you cannot rely on that. One should also remember that the definition of intitle: for non-HTML results is open to interpretation and may differ between search engines. Exalead's spellslike: and soundslike: can help if you have heard of a domain or organisation but do not know its spelling.
Beside the unary operators above, search engines support a number of binary operators. Exalead is in the lead (as it were...) here too. Its NEXT operator saves you searching for a phrase of two words in both orders separately. NEAR searches for terms in close proximity. Capitalised AND and OR are logical operators, and parentheses are used for grouping. The same logical operators seem to be supported by Ask and Google.
Finally, Exalead allows truncating search terms with an asterisk *. Any page containing a word starting with the given string matches. Google uses the asterisk to allow wildcard words, which I regard as useless. Because the number of words between search terms remains fixed, this is not at all equivalent to proximity search — an unforeseen adjective will break the match.
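To give a made-up example in Exalead's syntax as just described, the following should match pages where a word starting with "polaris" occurs close to "filter", in a photographic context:

polaris* NEAR filter AND (camera OR photograph*)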
To someone who uses web search engines extensively, retrieving their search forms time and again soon becomes a serious overhead. The simplest solution is to download and save the search engine's form page, which is fine if you use primarily one search engine. Otherwise you would prefer to have multiple search engines available with minimum hassle. This can be achieved by copying and pasting together the HTML search forms of several search engines into one HTML document which you store locally. The only thing you may have to change is to convert the "action" URL of the form to an absolute URL. Here is an example with four search engines and Wikipedia, with some slight modifications. (If you like it, download it, or you will have to wait for my web server every time.)
You can also create your own search form. This requires some knowledge of the
syntax of HTML forms and how they are submitted, so the less technically minded
may want to skip ahead. I use the excellent
German-language Selfhtml; for English-speakers,
this site
or this
site
may be helpful. The following will give a brief hands-on introduction
with regard to search engine forms.
Most search engines use the GET method for transmitting search requests. This means that the search terms and other parameters are appended to a URL in the search engine domain to make the query. In their simplest forms, queries to some search engines look like this:
http://www.ask.com/web?q=<search terms>
http://www.google.com/search?q=<search terms>
http://www.exalead.com/search/web/results/?q=<search terms>
https://duckduckgo.com/html?q=<search terms>
https://startpage.com/do/search?query=<search terms>
Here, <search terms> stands for the suitably encoded search terms, with the spaces replaced by plus signs etc. The simplest HTML form to submit such a query is:
<form action="http://www.ask.com/web"> <input name="q"> <input type="submit" value="Search"> </form>
The action attribute of the form tag gives the URL to which the form data is to be submitted. The first input tag becomes the q=... parameter appended to the URL. Its name attribute determines the parameter key (before the equals sign); the text you entered in the input field is its value. The second input tag does not define any parameter; it merely represents a button that triggers submission of the form.
Privacy-conscious search engines like DuckDuckGo and Startpage serve results via an encrypted HTTPS connection, and receive their arguments via a POST method (also encrypted), which can provide more privacy than GET in some situations:
<form action="https://duckduckgo.com/html" method="post"> <input name="q"> <input type="submit" value="Quack"> </form>
If you want to retrieve its results with a command-line downloader or a simple program, it is usually easier to use a GET request as above.
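For a one-off retrieval from the command line, the GET form of the DuckDuckGo query shown above can be used directly (a sketch; whether automated requests are acceptable depends on the engine's terms of service, see the spidering section):

# Fetch a result page with a GET request and save it for later inspection.
curl -s 'https://duckduckgo.com/html?q=proximity+search+operators' > results.html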
input tags can have several attributes. One which is worth knowing about is the size attribute, which defines the length of a text input. Add it or increase it to get an input line suitably long for complex searches:
<input name="query" size="70">
In order to add a parameter with a fixed value to your search requests, you can add a hidden input tag with a fixed value:
<input type="hidden" name="key" value="value">
The best way to identify valid parameters and their meaning is to use the advanced search form of a search engine and to look at the URL of the result pages. Parameters come after the question mark in the URL, are separated by ampersands (&) and have the form <key>=<value>. You have to sort out the many parameters which are always there and test the one you are after by modifying it in the URL line of your browser and reloading the results page. For some search engines, I have identified the parameters determining which results are returned:
Function | Ask | Google | Bing | Exalead | Clusty |
---|---|---|---|---|---|
n results per page | — | num=n | count=n | elements_per_page=n | v:state=root|root-0-n|0 |
start at o-th result / p-th page | page=p | start=o | first=o | start_index=o | v:state=root|root-o-10|0 |
Navigation language | — | hl=(country code) | — | — | — |
You can set these parameters with a hidden input tag. Note that Google's num parameter differs from the others in that a longer list of results per page is not just the concatenation of pages containing fewer results, but often contains additional results. DuckDuckGo and Startpage do not allow direct access to the n-th result page, be it as a side effect of their security-consciousness or intentionally. Startpage allows generating an obfuscated additional parameter to its search submission that contains user preferences, among others the number of results per page, which could also be added to a submission form.
An important Google-specific parameter is nfpr=1. Recently Google has taken to replacing search terms by more popular ones without user confirmation — for some searches, instead of the link to "did you mean...", the results for the replacement search terms are displayed and a link to the results for the user's search terms is added. nfpr=1 prevents this.
In addition, Google allows two parameters that modify the submitted query.
This seems to be documented nowhere, but I saw it demonstrated on
Searchlores, and it is still working.
The as_oq parameter forces one of a set of terms to be
present, while the as_eq parameter requires none of them to be
present. Their values are a space-separated list of terms. So one can impose
additional constraints in specialised Google search forms via hidden inputs,
for instance for product test searches:
<input type="hidden" name="as_oq" value="test review evaluation"> <input type="hidden" name="as_eq" value="buy.now special.offer free.download">
Since a form can have only one "action" URL to be submitted to, using the same input line for several search engines is not possible in plain HTML. But this can be done using JavaScript, saving you the trouble of typing the same search terms multiple times. Here is an example universal search form, for three search engines plus Stack Overflow, Google Scholar and Wikipedia. It opens the search results pages in a new tab or window and selects the search input for easy replacing.
This is a topic on which everybody is pretty much on his/her own. Finding a small number of words which characterise documents that are of use for a given specific purpose is hard, and there is certainly no algorithm for it. One has to try a set of terms and modify them according to the results one obtains.
Sometimes pertinent result documents contain other important terms that are worth searching for. Sometimes they suggest a completely different avenue of attacking your research problem, with a correspondingly different set of search terms. It may be more efficient to search for a set of terms that are likely to occur close together, since most search engines prefer such results; conversely, results in which terms appear together that are not supposed to may be less pertinent.
This vague advice given, I would like to mention one tool that can be very
useful for refining one's choice of search terms: Lexical FreeNet is a
database that maps relationships between words. It lists not only
synonyms but also
generalisation and specialisation relationships, among many others. Unfortunately
it may be going out of business; the icons representing the relationships are
already missing. If it does vanish, Wordsmyth
may be
worth trying, though it is less precise about relationships between words.
Simple synonym finders may also be occasionally useful (see Wikipedia's page on
synonyms
for a selection). Translating a word
into a different language and back can also help.
This section is going to be comparably brief, as I do not use specialised search engines much and web directories practically not at all. Besides, if you are especially interested in a specific domain, you are likely to know search engines and/or directories pertaining to it, and those I use will not be of use to you. Nonetheless, a few words on them.
Domain-specific search engines are search engines specialising in a subject or
field of topics by restricting either the sources they index or the criteria
they index by. On one end they border on specialist databases containing
references which may or may not be on the web but which are of narrowly-defined
type, such as scientific papers. On the other end, they cross over to web
directories, which list all kinds of web resources but are typically compiled
by humans, not automatically. Two lists of specialised search engines that
look useful to me are at
NoodleTools and on Phil Bradley's web site
.
Among the domain-specific search engines I have used myself is Scientific Commons, a database of scientific papers and preprints that
includes arXiv.org
and
other preprint servers and thereby provides a good chance of leading you to
something you can actually download, rather than just read the abstract of.
The RCLIS (Research in Computing, Library and Information Science) project
operates a database of e-prints
in library and information sciences. Among others, this
includes research on web search engines and information retrieval systems.
The Social Science Research Network
contains a database of social sciences papers. Wikipedia also has a list
of academic search engines
, but one should note
that "free" access cost refers to accessing the search only, not the
papers.
Google Scholar has improved greatly in recent
years in that it now provides links to freely downloadable versions of papers
rather than just sales pages of academic publishers. I used to view it as more
of an advertising service than a search engine when it linked only
to sales pages, but it has now become genuinely useful.
Finally, I have little experience to offer on web directories. The
WWW Virtual Library, the open
directory project
and Google directory
should
probably be mentioned. The reason I do not use web directories is that most of
my web research concerns not just special topics but also quite specific
questions. Web directories may be useful to find broad and introductory pages
on a subject. For that purpose, I usually start from
Wikipedia
and follow external links, or look at
Wikibooks
. An ordinary web search will serve just
as well if you include one of the words "introduction" or "tutorial". Finding
broad and general material on a subject tends not to be too hard, and there are
several ways of arriving at it.
In a way, finding something useful on the web is only part of the problem. Too often you realize only later that a given source may be useful, in the light of information you found elsewhere, or to supplement it. By then you will usually have forgotten both its URL and the exact terms you used to find it with a search engine. The latter matters, because often the search results depend not just on the search terms, but also on their order and "+" flags.
A different but related situation occurs if you want to follow a link you have good reason to believe is useful, but find it no longer exists. This problem requires finding an archived copy of the page in question, while the previous one is best dealt with by archiving the page yourself.
If you find something that may be useful, grab hold of it. Save at the very least the URL until you are sure that you will not be needing it. Err on the side of caution — you may only realize later how rare some information is. To save a URL, you might use a designated temporary bookmarks folder in your browser, or simply a text file (as I do).
You may in addition want to download and store a copy of the web page if one of the following applies:
If you decide to save multiple files, the automated downloaders wget or curl
may help you. wget can
retrieve web pages including "page requisites" such as images, while
curl allows a syntax similar to glob patterns to retrieve multiple
URLs.
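For example (URLs invented):

# Save a single page together with the images and style sheets it needs,
# converting links so the local copy can be browsed offline.
wget --page-requisites --convert-links --adjust-extension https://www.example.org/article.html

# Retrieve a numbered series of documents with curl's URL globbing.
curl -s 'https://www.example.org/reports/report[01-12].pdf' -o 'report#1.pdf'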
Finally, some social bookmarking services are rumoured to provide archiving
services. Furl did this, but is now defunct. Anyway, since such services cannot make the archived
content publicly available for
copyright reasons, they are hardly better than your local computer.
Admittedly they might allow you to access your saved content from anywhere via
the web, but you can achieve the same thing by storing it in an online storage
system
.
When a web page you have good reason to judge useful has become unavailable, you have to turn to web archivers. When the page in question is the result of a web search, the easiest and most effective option is to retrieve its copy cached by the search engine. Not only is it just one click away, but you can also be sure that this is the version which the search engine has indexed, and consequently the one that does contain the search terms you requested.
When you want to follow an obsolete link, things are less easy. But you still
have a fair chance of getting at the content. The Internet Archive
mirrors large parts of the web at regular intervals, as well as other content
placed in the archive deliberately. To search for stored copies of a URL, you
use its Wayback Machine
. It also allows searching the archived content, though I
have not tried it and cannot say how useful it is.
The URL of archived web content has a simple structure. The following gets you the overview page which indicates when a given URL was archived:
http://web.archive.org/web/*/<URL>
<URL> stands for the URL of which you want an archived copy, which may but need not include the "http://". This link is a JavaScript bookmarklet that forwards you to the overview page corresponding to the current URL. Bookmark it, and you can click it to go from a "not found" notice to the archive. The URLs of the actual archived pages are very similar to the above. Just replace the asterisk by the date and time in YYYYMMDDhhmmss format. You need not get the date and time at which it was archived right — the copy from the closest date/time will be automatically returned. The internet archive's FAQ notes that there may be a delay of up to 24 months between a site being archived and it being available in the Wayback Machine.
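If you use this a lot, a small shell helper can build the overview URL for you. A minimal sketch (xdg-open assumes a Linux desktop; substitute your browser command):

# Open the Wayback Machine overview page for a given URL.
# Usage: wayback http://www.example.org/some/page.html
wayback() {
    xdg-open "http://web.archive.org/web/*/$1"
}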
A web archiving service aimed specifically at academics is Webcite®.
It is a free service that allows one to create a copy of web pages one wants to
cite, thereby ensuring the
continued availability of these sources. (See here for how to put both direct and
archived URLs into your bibliography using BibTeX.) Webcite also allows you to
search for a specific URL which anybody might have archived. Somewhat
annoyingly, considering you give keywords when archiving something, you cannot
search by keyword.
Neither the Internet Archive nor Webcite can help you if the owner of a web page wants to remove it. Both comply with explicit requests for removal. However, not everybody has heard of web archivers, so no such request may be made and an archived copy may be available. The admissibility as evidence of archived web pages is dubious, so don't rely on them as such. Furthermore, since the Internet Archive uses robots to retrieve content to archive, it obeys the robot exclusion standard and may be barred from (parts of) some sites. As a consequence, some sites are not in the archive at all.
Web archiving is still young enough that it is worthwhile to keep up with
developments. Archive Team is a loose collection of
volunteers dedicated to preserving online resources. The Wikipedia pages on
link rot
,
digital preservation
and
web archiving
may also be
helpful.
Advertising is the dark half of any information transmission medium. Rather than providing reliable information, it aims to influence and manipulate by exploiting the irrationality of human nature. To the researcher, it impedes the search for knowledge by creating background noise and false positives.
How serious a problem this is depends on how close the topic of one's research is to somebody's commercial interests. Scientific topics tend to be almost unaffected, while obtaining technical information on products is made significantly harder. The impact of embedded advertisements also depends on how good one is at ignoring blinking pictures and videos. Those whom (like me) they keep from concentrating can help themselves by uninstalling their Flash player and blocking access to known ad servers (see here for a list).
Beside plain old embedded ads, I have encountered two new types of nuisance that I had not previously heard about. After a brief word on webspam, I will take the opportunity to present them here. If you are not interested in these annoyances, you can skip this part.
Webspam refers to camouflaging advertising as web content and employing a number of techniques to make search engines send users to them. Popular methods include putting up interlinked web pages to increase their ranking, creating fake blogs, embedding misleading keywords outside the visible content of web pages, and serving different pages to search engine bots and people.
These practices, serious though they are, tend not to affect researchers directly. Search engines rightly see them as potentially fatal to their business model and as a consequence devote significant resources to combating them. In my experience they are rather successful. Furthermore, webspam is not usually a problem when searching for scientific or other objective information. Product searches are a different matter, but allow an obvious sanction: do not buy products which are advertised via webspam.
While earlier in the web's history mistyping a domain name led to a name server lookup error, these days it usually results in a page with a generic logo, many links and a search bar. Some companies specialise in acquiring domains with names similar to popular sites in order to benefit from typos and redirect unsuspecting netizens to their paying clients. I like to call this practice "click trafficking" because of its intent of harvesting clicks as opposed to the more general "domain squatting" which aims at selling the domain at a profit.
This is usually more annoying than serious when accessing a domain one knows, but can be time-consuming when searching for a domain one has heard of or simply guessed. Nonetheless, it is to be expected that inexperienced users of the web are being fooled. The web pages presented usually repeat their own domain name, which makes them similar to the original but legally above board, and present links related to what the original domain is about, in addition to the usual suspects of dating, credit and domain purchases. Examples include (current as of 10/2009):
Spoof domain | Registrant | Main link target | Original domain | Type of original |
---|---|---|---|---|
wikipeda.org | Jasper Developments Pty Ltd. | ndparking.com | wikipedia.org | Encyclopedia |
alsa-project.com | Portfolio Brains, LLC | information.com | alsa-project.org | Audio subsystem |
ecomonist.com | Web Commerce Communications Ltd. | doubleclick.net, zazasearch.com | economist.com | Newspaper |
ruby.org | Tucows Inc. | ruby.org | ruby-lang.org | Programming language |
foldoc.com | Domreg Ltd. | information.com | foldoc.org | Dictionary |
via.com | Enom Inc. | sedo.com | via.com.tw | PC hardware |
gnuplot.com | Mdnh Inc. | doubleclick.net, googlesyndication.com | gnuplot.info | Plot program |
As one can see, click trafficking provides business for a fair number of companies from the shadier parts of internet commerce. That the actual owners of the spoof domains link to sites owned by different companies (and that different owners sometimes link to the same site) suggests that they have subcontracted the handling of their clicks.
Click trafficking is a nuisance during eyeball searches, but may be more serious when using a spider for searching (see below). Automatic filtering of spoof domains may be possible, but I am not aware of any effort in that direction. Those domains which link to themselves tend to use links with extremely long query parameters. Most link to two domains at most, one of which is usually a domain broker or advertiser. The name of the registrant, on the other hand, cannot be used as a filtering criterion, as automated whois queries are banned.
My current internet service provider has hit on a racket I call "portal placement", and no doubt it is not the only one. Every time a name server lookup fails (such as when a dead link leads to a non-existent domain), the browser is redirected to a page from the ISP's favoured search portal, with the queried domain as the search term. This artificially increases traffic to the search portal and thereby increases advertising revenue.
How this is done is interesting in itself: The domain name server returns the IP address of a web server at the ISP instead of a failure notice. This web server returns a redirection to the search portal when the browser connects via HTTP. (This can be found out using the command-line downloader curl with the -v option, which makes it print the IP addresses of the servers it connects to.)
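For instance (the mistyped domain is made up; the grep just picks out the connected IP address and the redirect target from curl's verbose output):

# Request a non-existent domain and watch where the connection actually goes.
# With portal placement, DNS resolves to the ISP's own web server, which then
# answers with an HTTP redirect to its search portal.
curl -sv http://some-mistyped-domain.example/ 2>&1 | grep -E 'Trying|Location:'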
Like click trafficking, this is a minor nuisance. Because the portal's URL appears in the browser location bar, the original URL cannot be edited in case of typos. Again as for click trafficking, automatic filtering ought to be possible. The browser (or a local web proxy) must be made to ignore redirects to the portal in question or avoid contacting the redirect server in the first place. To my knowledge, no browser allows this out of the box at the moment, so this seems to require setting up a proxy.
General -- Downloaders -- Search spiders -- Custom spiders -- References
Squeezing information from the web can be quite time-consuming. Some of the effort involved is repetitive — checking that search engine results really contain the requested terms, judging each result according to additional criteria and saving the useful ones to disk or extracting specific information from them. Therefore it may help to automate some of these tasks if a large number of documents is to be processed.
As when automating a data processing task locally on a single computer, it has first to be decided which parts of it are best performed by a human and which can be automated. Unfortunately, in the case of information searches, the one part best done by a human is the one in the middle — deciding whether a source is useful. Ideally the user would be kept in the loop as a decision-maker, but that would require a hybrid between a robot and a browser that could automate some steps but also display selected pages and take user input. I am not aware of any such program. The second-best but feasible solution is to program a preliminary check designed to eliminate many false positives and store the results for later decision by the user.
Before we delve into details, a word on nomenclature: While a robot can be any automaton that uses the web, the term spider is usually reserved for robots that can follow hyperlinks and thereby traverse large parts of it. One could say that while spiders can walk around the net, robots are stuck where they hit it, just like insects that are not spiders ;). Both can be of use in information searches, depending on the task at hand, though spiders are obviously more powerful.
Before asking oneself how best to write a robot, one should consider two other
questions: when or whether to write one, and how it should behave. The answer
to the first results from the tradeoff between the time needed to write a robot
and the time it is likely to save you. Obviously there is no algorithm to
determine that, so the following will be a bit wooly. If you have been
searching manually for less than half an hour, continue doing so until you are
sure that you are not getting there. On the other hand, if you have to search
a known and large part of the web, you may want to start programming right
away. The more general the spider you write, the more often you will be able
to use it, but the more time it will also take to write. If you have clear
search criteria, a robot is likely to get good results, but if you are
"fishing" to get clues for subsequent secondary searches, human
intelligence is clearly required. Is there perhaps already a web database that
contains the information your robot would extract? Consult a list of
specialised search engines and databases (such as this)
again. Only you can answer the question whether it makes sense to write a
robot for your particular problem, but you should think about it for a few
minutes before spending half an hour either programming or searching manually.
The second question, how decent robots should behave, is more easily answered, by a combination of common sense and convention. Hosting web pages costs money, and some hosting providers limit the amount of data that can be downloaded in a given period of time. Similarly, bandwidth on the net as a whole is a common good. So your robot should strive to avoid unnecessary downloads. In fact, using a robot as opposed to a web browser can actually help to limit bandwidth, because a robot does not typically download images and other data embedded into a web page automatically. For a typical web page today, this can save a lot.
A related topic is the Robot Exclusion Standard. A robots.txt file
located at the root directory of some sites (that is, at
http://servername.bla/robots.txt) advises robots what parts of the
site to traverse and what to pass up. Though robot authors are not required to
obey it, there are several good reasons to do so: It is good netiquette. Most
web site administrators are aware that robots.txt is only advisory, so
anything your robot is not supposed to see will be hidden by additional more
effective means. And finally, the most frequent use of robots.txt is
to keep robots away from junk. So observing it often saves the robot time and
the net bandwidth. The syntax of robots.txt is described on Wikipedia and robotstxt.org
,
and there is a Perl
module
for parsing it.
One more important matter of politeness is the frequency with which a robot requests pages. An extension of the Robot Exclusion Standard is the Crawl-delay field which gives the recommended number of seconds to wait between retrievals. If the robots.txt file of a server contains it, that is what you should choose. If the keyword is not present, or if you write a small spider and do not want to burden yourself with parsing robots.txt, my advice is to wait 10 seconds. This may seem much, but be realistic: The web operates at a time scale of seconds as opposed to the microseconds of your processor or the milliseconds of your hard drive. You are not going to sit around twiddling your thumbs waiting for your robot's results. Rather, it will run in the background for half an hour or more, while you occupy yourself productively with something else. Choosing a rudely small crawl delay is not going to change that, so you might as well be polite. Ten seconds is sufficient time for a competent browser user to download a page and type in a search term in the browser's search function, so this gives you a valid line of defence should someone accuse you of overwhelming their server with your robot.
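A minimal sketch of such a polite retrieval loop (file names and URL invented; the robots.txt handling is deliberately naive, and a proper robot should use a real parser such as the Perl module mentioned above):

#!/bin/bash
# Retrieve a list of URLs politely: honour Crawl-delay if the server states one,
# otherwise wait 10 seconds between requests.
server='http://www.example.org'
delay=$(curl -s "$server/robots.txt" | grep -i '^Crawl-delay:' | head -n 1 | tr -dc '0-9')
delay=${delay:-10}                  # default to 10 seconds if none is given
mkdir -p downloaded
while read -r url; do
    wget -q -P downloaded/ "$url"   # save into a local directory
    sleep "$delay"
done < url-list.txt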
Finally, if you use web services, you have to obey their terms of service,
which sometimes ban robots. Google and Ask both do this, while some other
search engines do not. (Dogpile, a
metasearch engine that includes Google results, does not seem to ban
robots
, so this may be a way to get at Google
results with a robot.) A
company's terms of service can also demand observance of the robot rules in
robots.txt, which makes them binding rather than advisory. If access
to some pages is banned for robots in this way, the only thing to do is to
download them manually and let your robot operate on the downloaded files or
the links contained in them.
Before we proceed to robots intended for searching, let us have a look at more
general and more widespread robots, primarily those intended for automated
downloading. The most well-known of these are wget and curl
. The primary difference between
them is that wget can recursively follow hyperlinks but curl
does not. wget is therefore suited for downloading whole subtrees of
interlinked HTML documents, and automatically downloading embedded images and
the like. In return, curl allows the user to give ranges of URLs in a
notation similar to glob patterns, which is useful for retrieving numbered
documents. Both support authentication (logging in at a web page), encryption
(SSL) and bandwidth limiting. wget supports a crawl delay.
wget respects the robot rules in a server's robots.txt when downloading recursively. curl, which does not do recursion, does not. Apparently the implied interpretation is that so long as a URL is given explicitly by the user, the program does not act as a spider and need not obey the robot rules, which has a certain logic. Whether this reasoning would be accepted in court in cases where observance of robots.txt is enforced by terms of service is a different matter.
In the interest of preserving network bandwidth, it should be mentioned that
curl allows so-called HEAD requests, which do not retrieve the
document in question. This can be useful to check whether a document is there
at all, or to extract information from the HTTP headers which are returned. The following table displays the most
important options of wget and curl:
Function | wget | curl |
---|---|---|
Set output file name | -O <file> (single file only) | -O / -o <file> |
Limit bandwidth | --limit-rate=<B/s> | --limit-rate <B/s> |
Crawl delay | -w <seconds> | — |
HEAD request | — | -I |
Set user agent string | -U <string> | -A <string> |
Set referrer URL | --referer=<URL> | -e <URL> |
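To illustrate the table (hypothetical URLs):

# Check that a document exists and inspect its headers without downloading it.
curl -sI https://www.example.org/papers/preprint.pdf

# Mirror a small subtree politely, limiting bandwidth and pausing between requests.
wget -r -l 2 --limit-rate=50k -w 10 https://www.example.org/docs/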
Besides programs intended for automated downloads, text-based web browsers can
be very useful. The most widespread ones, lynx, w3m
and links
, all support outputting the rendered web page with the
option -dump. This provides a way to obtain a plain text version of a
HTML document.
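For example (hypothetical URL):

# Convert a rendered web page to plain text for further processing.
lynx -dump https://www.example.org/faq.html > faq.txt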
By now you may be wondering what place a discussion of downloaders has in a text about web research. The reason is that before a document on the web can be investigated, it has to be retrieved. Automated downloaders are therefore a required component of robots written as shell scripts. Though robots can be written in other programming languages (and frequently are), shell scripts have the advantage of being a very high level language with correspondingly low development effort. They also allow you to make use of numerous standard UNIX tools designed to process plain text files where other languages may require you to duplicate their functionality. If you intend to perform multiple investigations on a limited number of web documents, it may make sense to separate the download and investigation steps altogether. You might first retrieve a set of web pages and then program one or several tools to run queries on them. This saves bandwidth compared to downloading them every time as would be done if the queries were performed by different spiders.
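A sketch of that separation (file names invented): collect the pages once, then run as many local queries as you like without touching the network again.

# Step 1: download a fixed list of pages once.
mkdir -p corpus
wget -q -P corpus/ -i url-list.txt

# Step 2: query the local copies repeatedly, e.g. list pages mentioning both terms.
grep -l -i 'proximity' corpus/* | xargs grep -l -i 'truncation'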
Finally it should be mentioned that some sites block automated downloaders by various means, sometimes even unintentionally. If the site in question states (in its terms of service or on the pages concerned) that it bans robots, this should be respected. However, if you find no such statement after looking carefully, I regard it as legitimate to try to bypass such restrictions. In my experience blocking robots is most frequently done in two ways. Every browser and robot sends a so-called "user agent" string along with its HTTP request, which usually gives its name, version and operating system. Some servers refuse to service requests unless they originate from well-known browsers. This can be prevented by providing the user agent string of a browser using the options above. If your download task is simple enough, it may also be worth trying to use links, which has a graphical mode and therefore tends to be regarded as a "real" browser and tolerated.
The second method of discouraging robots involves the "referrer" header (often spelled "referer" following a misspelling in an RFC). This gives the URL which references a file when downloading it, such as the page which hyperlinks to a page or which embeds an image. If a URL is retrieved directly, there is no referrer, and some servers refuse to reply. (This is also used to prevent "hotlinking", putting images on other servers into one's own pages.) An acceptable referrer URL can usually be easily found out (a page containing an image, a table of contents etc.), and the downloader can be told to transmit it with the options given above.
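Both workarounds amount to one extra option for the downloaders (values and URLs are only examples):

# Present a browser-like user agent string and a plausible referrer.
wget -U 'Mozilla/5.0 (X11; Linux x86_64)' --referer=https://www.example.org/gallery.html \
     https://www.example.org/images/photo17.jpg
curl -A 'Mozilla/5.0 (X11; Linux x86_64)' -e https://www.example.org/gallery.html \
     -O https://www.example.org/images/photo17.jpg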
There are in fact a number of search spiders available on the web, from a variety of projects and individuals. I admit that I have no great experience using them, as they tend not to meet my needs for one reason or another, as described in the following.
The available search spiders differ in both abilities and purpose.
Many search spiders are intended for indexing, like the spiders of the big
search engines. Examples are Nutch,
which is affiliated with the Apache web server project, ht://Dig
, and the Aperture framework
, a Java library. Using such spiders makes most sense if
you are going to answer a
large number of queries for a limited number of documents and according to
known and limited criteria. Indexing these documents once or regularly with
respect to your criteria will save bandwidth because only the resulting
database has to be queried repeatedly. If you intend to make widely different
complex queries to unpredictably different parts of the web, indexers will not
help you. However, indexing an intranet or local files can be helpful to
obtain a preselection for further investigation. It should be mentioned that
ht://Dig supports a number of non-exact matching requirements such as
sound-alikes, synonyms and substrings.
Some other robots are downloaders like wget and curl (and
therefore not search spiders, though they can be used as part of such).
Examples for this category are Pavuk, Arale
and
Larbin
. I have not tried them out, as I have always found
wget and curl sufficient. If you miss a feature, you might
want to look at the others.
The bots section of the Searchlores site
presents a number of robots with accompanying articles. The most interesting
one for web search is W3S by Mhyst
. It
is a spider that can traverse part of the web according to criteria given by
the user and report on the number of keywords contained in each page. I have
to admit I have not yet got round to trying it out, but it has a GUI that seems
intuitive to use and is written in platform-independent Java. It seems quite
suitable for search tasks a search engine cannot perform, such as "real-time"
searches of the web and intranet. The other robots presented at Searchlores
are mostly special-purpose. Reading about them helps to learn about writing
robots, and some may fit your task at hand or could be adapted to do it.
Finally, if you have good knowledge of the web and HTML, and have demanding requirements for a search spider, I can recommend the one I have perpetrated myself, wfind. It is very powerful, but accordingly complex to use, and it takes an expert to exploit its full potential. If you are interested, read my introductory page and its manual page online.
The first choice one has to make after deciding to write a spider is what
programming language to use. This decision has the same tradeoff of
development effort versus flexibility as always. For me the answer is usually
either a shell script or Perl. Shell scripts are fine if you do little more
than downloading and do not use much control flow and data structures.
Retrievals are done via wget or curl, and standard text
utilities like grep, sed, tr, etc. can be used to
process the data and determine the next URL to visit. Programs like pdftotext, catdoc,
xls2csv
, etc. can be used to convert
non-plain-text formats. Attempting complex analysis or parsing of HTML, XML or
other plain-text based formats directly with the shell is probably not a good
idea. But be aware of the little-known fact that the shell bash
supports regular expressions:
[[ "$text" =~ $regex ]]
The code snippet above matches $text against the
extended regular expression $regex.
A further drawback of shell scripts is that you are stuck with the features
of the downloader you are using.
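To make this concrete, here is a minimal sketch of such a shell spider (all names and URLs invented). It fetches a start page, reports it if it contains a keyword, and then does the same for every absolute link on it, with a polite delay; the link extraction is deliberately crude and ignores relative URLs.

#!/bin/bash
# Minimal one-level search spider built from curl, grep and cut.
start='http://www.example.org/index.html'
keyword='proximity search'
fetch() { curl -s -A 'example-spider/0.1' "$1"; }

page=$(fetch "$start")
echo "$page" | grep -qi "$keyword" && echo "match: $start"

# Crude link extraction: absolute http(s) URLs in href attributes only.
for link in $(echo "$page" | grep -Eo 'href="https?://[^"]+"' | cut -d'"' -f2); do
    sleep 10                                   # polite crawl delay
    fetch "$link" | grep -qi "$keyword" && echo "match: $link"
done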
For more complex tasks than are comfortable to do with a shell script, I use
Perl. This choice is partly due to my knowing Perl pretty well now; I am sure
that Python or Ruby are as suitable. The most important features are the ease
of retrieving web pages and the availability of analysis features you want,
such as regular expressions. If you want to analyse non-plain-text file
formats or search by special criteria (sound-alikes, fuzzy searches etc.), it
saves you work if there are packages for that in your chosen language. Perl
has the advantage that a large number of modules are available in a
comprehensive database, CPAN.
Spiders have also been written in non-scripting languages such as C, C++ and
Java. I think these are too low-level and consequently too work-intensive and
do not add enough in the way of flexibility to be worth the extra effort. But
if you know them well and/or know a library that provides an unusual feature
you need, choose them by all means. Be aware that some script languages
(for instance Perl) offer
interfaces to C, and all (AFAIK) can execute external programs. This allows
you to program the necessary part of your task (and only that) in a lower-level
language.
Finally, a relatively unknown scripting language that deserves a mention in
connection with spidering is Rebol. The fact that URL
retrieval is part of the language makes accessing documents easy, and simple
parsing can be programmed in a very concise way.
The details of how to write your spider are something I cannot help you with. Every (search) task is different, and so every spider for a specific task will also be. However, I can give you a few caveats:
In order to learn to write a robot, it can be useful to look at existing code.
The bots section of
Searchlores has already been mentioned. The code
snippet database Snipplr
contains some snippets related to automated downloading
and searching. The open book Web Client Programming with
Perl
conveys basic understanding of how web
browsing works technically, and how to do it with a Perl script. The non-open
book "Spidering Hacks" (cited in full below) is dedicated to writing
robots in Perl, most of which are not actually spiders by my definition above. An archive of examples can be
downloaded from its homepage
.
© 2009 - 2013 Volker Schatz. Licensed under the Creative Commons Attribution-Share Alike 3.0 Germany License