About -- Search engines -- Directories -- Archiving -- Webspam -- Spidering -- References
The topic of this document is web research, or how to obtain a specific piece of information, or information about a specific topic, via the web. Achieving this includes but goes beyond using search engines. If you are looking for an introduction to using web search engines, this page is not for you. The word "introduction" in the title merely indicates that it is meant to be broad and inclusive rather than deep, covering all aspects of web research but not delving too deeply into any of them.
This document will also not teach you how to evaluate what you find. Besides obvious sanity checks (an introductory page recommends: "Look at the title text. Does it relate to the subject?"), I assume that the searcher is sufficiently familiar with the topic to weed out junk, or else unbiased enough to view a representative number of sources.
This page was written to collect my experience and thoughts on how to squeeze information from the web in the face of an overwhelmingly low signal-to-noise ratio. It is aimed at scientists and engineers, perhaps journalists and bloggers, generally everybody who requires information which can in principle be found on the web, if only one knew how to get at it. If you know what you are looking for, or if you regularly find yourself clicking through the first five to ten result pages of a search engine rather than just the first, the following may help you.
Caveats -- Advanced -- Search forms -- Search terms -- References
The first stop in any attempt to obtain information from the web is likely to be a general-purpose search engine, and to remain so for quite some time. They offer a fairly comprehensive index of the web as a whole and do not require prior knowledge of the topic of your search or of a specific place to start. While everybody probably has their own preferred search engine, I would like to present my own in the following and give my reasons for using them.
You may be surprised to learn that the search engine I use most is none of the big well-known ones. Instead I use Clusty, a meta search engine which clusters results according to what additional words appear in them. Who would have thought that "argv" is not just the canonical variable name for a program's argument array, but also a German law? Clusty lets you quickly discard results that contain the right search terms in the wrong context. This is also helpful when searching for terms of commercial interest because you can ignore the clusters containing marketing phrases. My preference for Clusty reflects my experience that nowadays false positives (spurious results) are a much larger problem than false negatives (good results which are not found).
When Clusty does not turn up any satisfactory results, I use Google. It is acknowledged to have the largest index of all single search engines, and its results are sometimes more topical than those of Clusty, which suggests that it is ahead at ranking too.
Lastly, when other search engines return a large number of false positives, I sometimes use Exalead. You may not have heard of it either, but it is one of the few search engines (the only one?) which support advanced features like proximity search and truncation. On the negative side, its small index size is sometimes painfully obvious.
Finally I would like to remark on a distinction between search engines which, though it is somewhat hard to pin down, I regard as important. One could call it the "strictness" of a search engine. By that I mean its tendency to interpret a query literally rather than trying to second-guess what the user might have meant, and adjusting its internal search accordingly. Especially for complex queries, this second-guessing is a nuisance, as it interferes with finding the right set of search terms and ultimately good results. Google is worst in that respect (see next section), which is one reason why I do not use it very much. Exalead is strictest, and Clusty is somewhere in between.
The most important thing to know about search engines is that they do not actually search for the terms you give them in the way the grep command or a search in a text editor would. Instead, they take the liberty of substituting words they consider "equivalent", by whatever unpublished definition. Sometimes words with the same stem seem to be considered equivalent, sometimes even synonyms. In some of my Google searches, I have discovered that around half the results did not contain the specified search term (though of course part of that could be due to page contents having changed since Google visited them). The extent of the problem rises for rare search terms, because like all good salesmen, search engines don't like admitting there is something they cannot offer.
This is bad, because no search terms are really equivalent when you search for something specific. Synonyms are always inexact, and words from the same family may be used in different contexts: "distribution" may get you a Linux variant or information about product delivery, but "distributive" refers to a property in algebra. If the search engine understands the search term vaguely, you get no proper feedback on whether your search terms were useful, and the iterative cycle of choosing terms and seeing if they are any good gets disrupted.
Therefore, when you see (by searching the web page with your browser) that some search terms are not contained in the result pages, you should force the search engine to respect your choice of terms strictly by prefixing each term with a plus sign. Most of the major search engines support that.
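The manual check described above (looking for each search term in a result page) is easy to automate once the page text is at hand. The following is only a minimal sketch: the function name `missing_terms` and the rule of literal, case-insensitive whole-word matching are my assumptions for illustration, not something a search engine prescribes.

```python
import re

def missing_terms(page_text, terms):
    """Return the search terms that do not occur in page_text as whole words.

    Matching is literal and case-insensitive: no stemming, no synonyms.
    """
    missing = []
    for term in terms:
        # \b keeps "distribution" from matching inside "redistributions".
        pattern = r"\b" + re.escape(term) + r"\b"
        if not re.search(pattern, page_text, flags=re.IGNORECASE):
            missing.append(term)
    return missing

page = "Choosing a Linux distribution for scientific computing"
print(missing_terms(page, ["distribution", "distributive"]))
```

Terms reported as missing are candidates for the plus sign in the next query.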
Another important caveat, depending on where you live, is that Google (and probably other search engines too) returns results tailored towards the language setting of your browser. After noticing that some of my searches seemed irreproducible between different browsers, I investigated this in detail. Web browsers send HTTP headers to web servers along with retrieval requests, which usually include the Accept-Language field (visible to server-side scripts as HTTP_ACCEPT_LANGUAGE) giving their language preference. (See RFC 2616 for more on HTTP headers.) Google uses this header field to select search results. In my investigation, I found that around 50% of the results on the first result page differed when it was retrieved from different countries.
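To reproduce such a comparison, one can set the Accept-Language field explicitly instead of relying on the browser. The sketch below only builds the requests with Python's standard `urllib.request` and does not send them; the helper name `build_search_request` and the example language values are assumptions of mine, and whether a given search engine answers scripted requests at all is not guaranteed.

```python
import urllib.request

def build_search_request(url, languages):
    """Build (but do not send) a GET request with an explicit Accept-Language field."""
    request = urllib.request.Request(url)
    request.add_header("Accept-Language", languages)
    return request

url = "http://www.google.com/search?q=argv"
for languages in ("en", "de,en;q=0.5"):
    request = build_search_request(url, languages)
    # urllib stores header names in capitalized form, hence "Accept-language".
    print(languages, "->", request.get_header("Accept-language"))
    # To actually fetch the page: urllib.request.urlopen(request).read()
```

Fetching the same query once per language value and comparing the result lists gives a rough measure of how strongly the results depend on the header.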
If you search for objective information, you usually do not want results skewed in that way, however well-meaning the intent. This is especially true for someone (like me) who can read several of the languages widespread on the web. When a search engine favours results in the language your browser reports as its own, you may miss out on good results in different languages. Admittedly the language of the search terms sometimes determines the language of the results; but for technical or scientific terms, which make up most of my searches, this is not necessarily true.
What one would ideally want are results in any language ranked together in a single list. I do not know how to achieve that under the circumstances described above. Changing your browser's language setting will get you results with a different preferred language, but no common ranking with results with a different language preference. I myself have set my browsers to English and primarily use English search terms. English being the most widespread language on the web, I hope to get satisfactory results that way (and usually do), but I realise I may miss some good web pages in German and French.
That a prepended plus sign enforces your choice of search terms was already mentioned in the previous section. Correspondingly, prepending a minus sign excludes pages containing a term (but you likely knew that). (On a side note, Clusty's clustering often makes this unnecessary, because you can ignore clusters containing words unrelated to what you are looking for.) The third moderately advanced feature that one frequently needs is enclosing a phrase in double quotes. This causes the search engine to search for the phrase rather than just its words in any order. Putting periods between the words (without spaces) works just as well, because search engines treat punctuation as spaces but keep the words in the right order.
Further features that may be useful sometimes are operators prepended to search words. These operators are often undocumented by the search engines. Google used to have a help page about them, which seems to have been removed (archived copy), but the operators themselves still work. Clusty documents site: and filetype:, but filetype: never returns any results. However, a few undocumented operators do work. Here is an overview based on my testing (current November 2009):
| Function | Clusty | Google | Exalead |
|---|---|---|---|
| Restrict domain | site:, host: | site: | site: |
| Restrict file type | — | filetype: | — |
| Require keyword in title | title: | intitle: | intitle: |
| Require keyword in URL | url: | inurl: | inurl: |
| Require link to site or URL | — | link: | link: |
| Documentation | partial | archived | online |
As a convenience feature, Google allows allinurl: and
allintitle: to require all search terms to appear in the URL or title.
The meagre or missing documentation suggests that Clusty and Google do not
consider their operators important. To be sure, you had better try them out
with a query the results of which you can easily verify before relying on them.
By contrast, Exalead's main selling point is its advanced features. For example, I have found its link: operator to return considerably more complete results than Google's. Besides, it supports the operators language:, spellslike:, soundslike:, before: and after:. The last two restrict the modification date of the result document, not merely the date it was last indexed. Amazingly, they seem to work even for servers that do not return a correct modification date. (Most, unlike mine, just return the current date/time. The browser links can display this information with the = key.) Apparently Exalead manages to extract date information from the URL or page text.
The usefulness of the operators depends strongly on what kind of search you do. site: makes it possible to search a specific site or site type (by giving the top-level domain such as .net). Requiring an important keyword in the title or the URL may sometimes give more pertinent results, but you cannot rely on that. One should also remember that the definition of title: for non-HTML results is open to interpretation and may differ between search engines. Exalead's spellslike: and soundslike: can help if you have heard of a domain or organisation but do not know its spelling.
Besides the unary operators above, search engines support a number of binary operators. Exalead is in the lead (as it were...) here too. Its NEXT operator saves you from searching separately for both orders of a two-word phrase. NEAR searches for terms in close proximity. Capitalised AND and OR are logical operators, and parentheses are used for grouping. The same logical operators and grouping parentheses seem to be supported by Clusty and Google, though Google shows a strong preference for results that contain all search terms even when OR is given.
Finally, Exalead allows truncating search terms with an asterisk *. Any page containing a word starting with the given string matches. Google uses the asterisk to allow wildcard words, which I regard as useless. Because the number of words between search terms remains fixed, this is not at all equivalent to proximity search; an unforeseen adjective will break the match.
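The difference between the two uses of the asterisk can be made concrete with regular expressions. This is only a model of the behaviour as described above, not of how either search engine works internally; both function names are invented for the illustration.

```python
import re

def truncation_regex(prefix):
    """Truncation: match any word that starts with prefix."""
    return re.compile(r"\b" + re.escape(prefix) + r"\w*", re.IGNORECASE)

def wildcard_phrase_regex(phrase):
    """Wildcard words: each * in the phrase stands for exactly one word."""
    parts = [r"\w+" if word == "*" else re.escape(word) for word in phrase.split()]
    return re.compile(r"\b" + r"\W+".join(parts) + r"\b", re.IGNORECASE)

truncated = truncation_regex("distribut")
print(bool(truncated.search("the distributive law")))

wildcard = wildcard_phrase_regex("distribution * errors")
print(bool(wildcard.search("distribution of errors")))
# One extra word between the terms is enough to break the match.
print(bool(wildcard.search("distribution of random errors")))
```

A proximity search would accept both of the last two texts, which is why the fixed-width wildcard is no substitute for it.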
To someone who uses web search engines extensively, downloading their search forms time and again soon becomes a serious overhead. The simplest solution is to save the search engine's form page locally, which is fine if you use primarily one search engine. Otherwise you would prefer to have multiple search engines available with minimum hassle. This can be achieved by copying and pasting together the HTML search forms of several search engines into one HTML document which you store locally. The only thing you may have to change is to convert the "action" URL of the form to an absolute URL. Here is an example with my three favourite search engines and Wikipedia, with some slight modifications. (If you like it, download it, or you will have to wait for my web server every time.)
You can also create your own search form. This requires some knowledge of the syntax of HTML forms and how they are submitted, so the less technically minded may want to skip ahead. I use the excellent German-language Selfhtml; for English-speakers, this site may be helpful. The following will give a brief hands-on introduction with regard to search engine forms.
As far as I know, all search engines use the GET method for transmitting search requests. This means that the search terms and other parameters are appended to a URL in the search engine domain to make the query. In their simplest forms, queries to the three search engines I use look like this:
http://clusty.com/search?query=<search terms>
http://www.google.com/search?q=<search terms>
http://www.exalead.com/search/web/results/?q=<search terms>
Here, <search terms> stands for the suitably encoded search terms, with the spaces replaced by plus signs etc. The simplest HTML form to submit a query such as these is:
<form action="http://clusty.com/search">
  <input name="query">
  <input type="submit" value="Search">
</form>
The action attribute of the form tag gives the URL to which the form data is to be submitted. The first input tag becomes the parameter query=... to the URL. Its name attribute determines the parameter key (before the equals sign); the text you entered in the input field is its value. The second input tag does not define any parameter; it merely represents a button that triggers submission of the form.
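What the browser does when such a form is submitted can be imitated in a few lines of Python. This is a sketch under the assumption that the form uses the GET method, as described above; the helper name `form_get_url` is mine. The standard function `urllib.parse.urlencode` takes care of the encoding, including replacing spaces by plus signs.

```python
from urllib.parse import urlencode

def form_get_url(action, fields):
    """Return the URL a browser would request when submitting a GET form.

    action: the form's action URL
    fields: mapping of input names to the values entered
    """
    return action + "?" + urlencode(fields)

print(form_get_url("http://clusty.com/search", {"query": "proximity search"}))
# prints http://clusty.com/search?query=proximity+search
```

A literal plus sign in the search terms is encoded as %2B, so it is not confused with an encoded space.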
input tags can have several attributes. One which is worth knowing about is the size attribute, which defines the length of a text input. Add it or increase it to get an input line suitably long for complex searches:
<input name="query" size="70">
In order to add a parameter with a fixed value to your search requests, you can add a hidden input tag with a fixed value:
<input type="hidden" name="key" value="value">
The best way to identify valid parameters and their meaning is to use the advanced search form of a search engine and look at the URL of the result pages. Parameters come after the question mark in the URL, are separated by ampersands (&) and have the form <key>=<value>. You have to sort out the many parameters which are always there and test the one you are after by modifying it in the URL line of your browser and reloading the results page. For the three search engines I use, I have identified the parameters determining which results are returned:
| Function | Clusty | Google | Exalead |
|---|---|---|---|
| n results per page | v:state=root\|root-0-n\|0 | num=n | elements_per_page=n |
| start at o-th result | v:state=root\|root-o-10\|0 | start=o | start_index=o |
| Navigation language | — | hl=(country code) | — |
You can set these parameters with a hidden input tag. Note that Google's num parameter differs from the others in that a longer list of results per page is not just the concatenation of pages containing fewer results, but often contains some other results too.
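Picking the parameters out of a long result URL by eye is tedious. The following sketch lists them with Python's standard `urllib.parse` module; the example URL is made up from the Google parameters named in the table, and the helper name `query_parameters` is an assumption for illustration.

```python
from urllib.parse import urlsplit, parse_qsl

def query_parameters(url):
    """Return the (key, value) pairs after the question mark, decoded."""
    return parse_qsl(urlsplit(url).query, keep_blank_values=True)

url = "http://www.google.com/search?q=web+research&num=20&start=40&hl=en"
for key, value in query_parameters(url):
    print(f"{key} = {value}")
```

Listing the parameters of two result URLs that differ in a single setting of the advanced search form usually shows quickly which parameter carries that setting.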
In addition, Google allows two parameters that modify the submitted query. This seems to be documented nowhere, but I saw it demonstrated on searchlores, and it is still working. The as_oq parameter requires at least one of a set of terms to be present, while the as_eq parameter requires none of them to be present. Their values are a space-separated list of terms. So one can impose additional constraints in specialised Google search forms via hidden inputs, for instance for product test searches:
<input type="hidden" name="as_oq" value="test review evaluation">
<input type="hidden" name="as_eq" value="buy.now special.offer free.download">
Since a form can have only one "action" URL to be submitted to, using the same input line for several search engines is not possible in plain HTML. But this can be done using JavaScript, saving you the trouble of typing the same search terms multiple times. Here is an example universal search form, again for my three favourites plus Wikipedia. Please note that the syntax of operators is more or less the same between Google and Exalead, but different for the others.
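The same idea, one set of search terms sent to several engines, can also be sketched outside the browser. The Python example below is not the JavaScript form mentioned above; it merely generates one query URL per engine from the base URLs and parameter names given earlier. Wikipedia is left out because its query URL is not given in this text, and the names `ENGINES` and `search_urls` are mine.

```python
from urllib.parse import urlencode

# Base URL and name of the query parameter for each engine, as listed earlier.
ENGINES = {
    "Clusty": ("http://clusty.com/search", "query"),
    "Google": ("http://www.google.com/search", "q"),
    "Exalead": ("http://www.exalead.com/search/web/results/", "q"),
}

def search_urls(terms):
    """Map each engine name to the GET URL for the given search terms."""
    return {
        name: base + "?" + urlencode({key: terms})
        for name, (base, key) in ENGINES.items()
    }

for name, url in search_urls('+argv "argument array"').items():
    print(name, url)
```

Since operator syntax differs between engines, such a shared query is best kept to plain terms, plus signs and quoted phrases.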
This is a topic on which everybody is pretty much on his/her own. Finding a small number of words which characterise documents that are of use for a given specific purpose is hard, and there is certainly no algorithm for it. One has to try a set of terms and modify them according to the results one obtains.
Sometimes pertinent result documents contain other important terms that are worth searching for. (Again I find using Clusty is an advantage, as it identifies words that frequently occur in the results, which may consequently serve as additional search terms.) Sometimes they suggest a completely different avenue of attacking your research problem, with a correspondingly different set of search terms. It may be more efficient to search for a set of terms that are likely to occur closely together, since most search engines prefer such results; results in which terms that do not belong together happen to appear side by side may be less pertinent.
This vague advice given, I would like to mention one tool that can be very useful for refining one's choice of search terms: Lexical FreeNet is a database that maps relationships between words. It lists not only synonyms, but also generalisation and specialisation relationships and many others. Unfortunately it may be going out of business; the icons representing the relationships are already missing. If it does vanish, Wordsmyth may be worth trying, though it is less precise about relationships between words. Simple synonym finders may also be occasionally useful (see Wikipedia's page on synonyms for a selection). Translating a word into a different language and back can also help.
This section is going to be comparatively brief, as I do not use specialised search engines much and web directories practically not at all. Besides, if you are especially interested in a specific domain, you are likely to know search engines and/or directories pertaining to it, and those I use will not be of use to you. Nonetheless, a few words on them.
Domain-specific search engines are search engines specialising in a subject or field of topics by restricting either the sources they index or the criteria they index by. On one end they border on specialist databases containing references which may or may not be on the web but which are of a narrowly defined type, such as scientific papers. On the other end, they cross over to web directories, which list all kinds of web resources but are typically compiled by humans, not automatically. Two lists of specialised search engines that look useful to me are at NoodleTools and on Phil Bradley's web site.
Among the domain-specific search engines I have used myself is Scientific Commons, a database of scientific papers and preprints that includes arXiv.org and other preprint servers and thereby provides a good chance of leading you to something you can actually download, rather than just read the abstract of. The RCLIS (Research in Computing, Library and Information Science) project operates a database of e-prints in library and information sciences. Among other things, this includes research on web search engines and information retrieval systems. Wikipedia also has a list of academic search engines, but one should note that "free" access cost refers to accessing the search only, not the papers.
Google Scholar I actually do not use at all. I regard it as commercial in that its searches return results based on a full-text search of papers that are not actually available on the web. This typically causes sales pages of academic publishers (which display only abstracts) to precede self-archived versions and costs the researcher a lot of time to sort out false positives before an online resource can be found. Apparently this is achieved by allowing Google's robots to retrieve the full text of the papers. Therefore I regard it as an advertising service for academic publishers rather than a search service for researchers and can only discourage anyone who cares about informational freedom from using it.
Finally, I have little experience to offer on web directories. The WWW Virtual Library, the open directory project and Google directory should probably be mentioned. The reason I do not use web directories is that most of my web research concerns not just special topics but also quite specific questions. Web directories may be useful to find broad and introductory pages on a subject. For that purpose, I usually start from Wikipedia and follow external links, or look at Wikibooks. An ordinary web search will serve just as well if you include one of the words "introduction" or "tutorial". Finding broad and general material on a subject tends not to be too hard, and there are several ways of arriving at it.
In a way, finding something useful on the web is only part of the problem. Too often you realize only later that a given source may be useful, in view of information you found elsewhere, or to supplement it. By then you will usually have forgotten both its URL and the exact terms you used to find it with a search engine. The latter matters, because often the search results depend on not just the search terms, but also their order and "+" flags.
A different but related situation occurs if you want to follow a link you have good reason to believe is useful, but find it no longer exists. This problem requires finding an archived copy of the page in question, while the previous one is best dealt with by archiving the page yourself.
If you find something that may be useful, grab hold of it. Save at the very least the URL until you are sure that you will not be needing it. Err on the side of caution: you may only realize later how rare some information is. To save a URL, you might use a designated temporary bookmarks folder in your browser, or simply a text file (as I do).
You may in addition want to download and store a copy of the web page if one of the following applies:
If you decide to save multiple files, the automated downloaders wget or curl may help you. wget can retrieve web pages including "page requisites" such as images, while curl allows a syntax similar to glob patterns to retrieve multiple URLs.
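To see what such a pattern stands for, here is a small Python sketch that expands a numeric range written in the bracket form curl accepts, for example `page[1-3].html`. It is an illustration of the idea only: it handles just the first plain numeric range in a URL and none of curl's other pattern features, and the function name `expand_numeric_range` is my own.

```python
import re

def expand_numeric_range(url_pattern):
    """Expand the first [start-end] range in url_pattern into a list of URLs."""
    match = re.search(r"\[(\d+)-(\d+)\]", url_pattern)
    if match is None:
        return [url_pattern]  # no range present, nothing to expand
    start, end = int(match.group(1)), int(match.group(2))
    prefix = url_pattern[:match.start()]
    suffix = url_pattern[match.end():]
    return [prefix + str(number) + suffix for number in range(start, end + 1)]

for url in expand_numeric_range("http://example.org/page[1-3].html"):
    print(url)
```

The resulting list can be written to a text file, which doubles as a record of what was downloaded.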
Finally, some social bookmarking services are rumoured to provide archiving services. Furl did this, but is now defunct. Anyway, since such services cannot make the archived content publicly available for copyright reasons, they are hardly better than your local computer. Admittedly they might allow you to access your saved content from anywhere via the web, but you can achieve the same thing by storing it in an online storage system.
When a web page you have good reason to judge useful has become unavailable, you have to turn to web archivers. When the page in question is the result of a web search, the easiest and most effective approach is to retrieve the copy cached by the search engine. Not only is it just one click away, but you can also be sure that this is the version which the search engine has indexed, and which consequently does contain the search terms you requested.
When you want to follow an obsolete link, things are less easy. But you still have a fair chance of getting at the content. The Internet Archive mirrors large parts of the web at regular intervals, as well as other content placed in the archive deliberately. To search for stored copies of a URL, you use its Wayback Machine. It also allows searching the archived content, though I have not tried it and cannot say how useful it is.
The URL of archived web content has a simple structure. The following gets you the overview page which indicates when a given URL was archived:
http://web.archive.org/web/*/<URL>
<URL> stands for the URL of which you want an archived copy, which may but need not include the "http://". This link is a JavaScript bookmarklet that forwards you to the overview page corresponding to the current URL. Bookmark it, and you can click it to go from a "not found" notice to the archive. The URLs of the actual archived pages are very similar to the above. Just replace the asterisk by the date and time in YYYYMMDDhhmmss format. You need not get the date and time at which it was archived right — the copy from the closest date/time will be automatically returned.
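The URL scheme just described is simple enough to generate programmatically, for instance when checking a whole list of dead links. The sketch below builds both kinds of URL with Python's standard library; the function names and the example address are assumptions for illustration.

```python
from datetime import datetime

WAYBACK_PREFIX = "http://web.archive.org/web/"

def wayback_overview_url(url):
    """URL of the overview page listing when the given URL was archived."""
    return WAYBACK_PREFIX + "*/" + url

def wayback_snapshot_url(url, when):
    """URL of the archived copy closest to the given date and time."""
    return WAYBACK_PREFIX + when.strftime("%Y%m%d%H%M%S") + "/" + url

page = "http://example.org/page.html"
print(wayback_overview_url(page))
print(wayback_snapshot_url(page, datetime(2009, 11, 1, 12, 0, 0)))
```

Because the archive returns the copy closest to the requested time, the date of the page that contained the dead link is a reasonable first guess.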
A web archiving service aimed specifically at academics is Webcite®. It is a free service that lets you create a copy of web pages you want to cite, thereby ensuring the continued availability of these sources. (See here for how to put both direct and archived URLs into your bibliography using BibTeX.) Webcite also allows you to search for a specific URL which anybody might have archived. Somewhat annoyingly, considering you give keywords when archiving something, you cannot search by keyword.
Neither the Internet Archive nor Webcite can help you if the owner of a web page wants to remove it. Both comply with explicit requests for removal. However, not everybody has heard of web archivers, so no such request may be made and an archived copy may be available. The admissibility as evidence of archived web pages is dubious, so don't rely on them as such. Furthermore, since the Internet Archive uses robots to retrieve content to archive, it obeys the robot exclusion standard and may be barred from (parts of) some sites. As a consequence, some sites are not in the archive at all.
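Whether a robot is barred from a page can be checked against the site's robots.txt file. The sketch below uses Python's standard `urllib.robotparser` on a made-up robots.txt; the robot name "examplebot", the file contents and the helper name `allowed_by_robots` are all hypothetical and say nothing about the rules any real site applies to any real archiver.

```python
from urllib.robotparser import RobotFileParser

def allowed_by_robots(robots_txt, user_agent, url):
    """Tell whether robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# Hypothetical robots.txt: one robot is barred from /private/,
# all other robots may fetch everything.
ROBOTS_TXT = """\
User-agent: examplebot
Disallow: /private/

User-agent: *
Disallow:
"""

for path in ("/private/notes.html", "/public/index.html"):
    url = "http://example.org" + path
    print(path, allowed_by_robots(ROBOTS_TXT, "examplebot", url))
```

A page that is disallowed for an archiver's robot will not be in that archive, however long it has been online.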
Web archiving is still young enough that it is worthwhile to keep up with developments. The Wikipedia pages on link rot, digital preservation and web archiving may be helpful for that.
Advertising is the dark half of any information transmission medium. Rather than providing reliable information, it aims to influence and manipulate by exploiting the irrationality of human nature. To the researcher, it impedes the search for knowledge by creating background noise and false positives.
How serious a problem this is depends on how close the topic of one's research is to somebody's commercial interests. Scientific topics tend to be almost unaffected, while obtaining technical information on products is made significantly harder. The impact of embedded advertisements also depends on how good one is at ignoring blinking pictures and videos. Those whom (like me) they keep from concentrating can help themselves by uninstalling their Flash player and blocking access to known ad servers (see here for a list).
Beside plain old embedded ads, I have encountered two new types of nuisance that I had not previously heard about. After a brief word on webspam, I will take the opportunity to present them here. If you are not interested in these annoyances, you can skip this part.
Webspam refers to camouflaging advertising as web content and employing a number of techniques to make search engines send users to them. Popular methods include putting up interlinked web pages to increase their ranking, creating fake blogs, embedding misleading keywords outside the visible content of web pages, and serving different pages to search engine bots and people.
These practices, serious though they are, tend not to affect researchers directly. Search engines rightly see them as potentially fatal to their business model and as a consequence devote significant resources to combating them. In my experience they are rather successful. Furthermore, webspam is not usually a problem when searching for scientific or other objective information. Product searches are a different matter, but allow an obvious sanction: do not buy products which are advertised via webspam.
While earlier in the web's history mistyping a domain name led to a name server lookup error, these days it usually results in a page with a generic logo, many links and a search bar. Some companies specialise in acquiring domains with names similar to popular sites in order to benefit from typos and redirect unsuspecting netizens to their paying clients. I like to call this practice "click trafficking" because of its intent of harvesting clicks as opposed to the more general "domain squatting" which aims at selling the domain at a profit.
This is usually more annoying than serious when accessing a domain one knows, but can be time-consuming when searching for a domain one has heard of or simply guessed. Nonetheless, it is to be expected that inexperienced users of the web are being fooled. The web pages presented usually repeat their own domain name, which is similar to the original but legally above board, and present links related to what the original domain is about, in addition to the usual suspects of dating, credit and domain purchases. Examples include (current 10/2009):
| Spoof domain | Registrant | Main link target | Original domain | Type of original |
|---|---|---|---|---|
| wikipeda.org | Jasper Developments Pty Ltd. | ndparking.com | wikipedia.org | Encyclopedia |
| alsa-project.com | Portfolio Brains, LLC | information.com | alsa-project.org | Audio subsystem |
| ecomonist.com | Web Commerce Communications Ltd. | doubleclick.net, zazasearch.com | economist.com | Newspaper |
| ruby.org | Tucows Inc. | ruby.org | ruby-lang.org | Programming language |
| foldoc.com | Domreg Ltd. | information.com | foldoc.org | Dictionary |
| via.com | Enom Inc. | sedo.com | via.com.tw | PC hardware |
| gnuplot.com | Mdnh Inc. | doubleclick.net, googlesyndication.com | gnuplot.info | Plot program |
As one can see, click trafficking provides a living for a fair number of companies in the shadier parts of internet commerce. That the actual owners of the spoof domains link to sites owned by different companies (and that different owners sometimes link to the same site) suggests that they have subcontracted the handling of their clicks.
Click trafficking is a nuisance during eyeball searches, but may be more serious when using a spider for searching (see below). Automatic filtering of spoof domains may be possible, but I am not aware of any effort in that direction. Those domains which link to themselves tend to use links with extremely long query parameters. Most link to two domains at most, one of which is usually a domain broker or advertiser. The name of the registrant, on the other hand, cannot be used as a filtering criterion, as automated whois queries are banned.
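The two observations above (few distinct link targets, and self-links with very long query strings) could serve as raw material for such a filter. The following Python sketch merely counts these signals for a list of links already extracted from a page; it is an untested heuristic, the threshold of 100 characters is an arbitrary assumption, and the host names are placeholders.

```python
from urllib.parse import urlsplit

def parking_page_signals(own_host, links, long_query=100):
    """Count external hosts linked to and self-links with long query strings."""
    external_hosts = set()
    long_self_links = 0
    for link in links:
        parts = urlsplit(link)
        host = parts.hostname or own_host  # relative links stay on the same host
        if host == own_host:
            if len(parts.query) >= long_query:
                long_self_links += 1
        else:
            external_hosts.add(host)
    return {"external_hosts": len(external_hosts), "long_self_links": long_self_links}

links = [
    "http://parked.example/?q=" + "x" * 150,
    "/about",
    "http://ads.example.net/click?id=1",
    "http://broker.example.org/",
]
print(parking_page_signals("parked.example", links))
```

What combination of counts should disqualify a page is a judgment call; a spider might simply log these numbers and leave the decision to the user.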
My current internet service provider has hit on a racket I call "portal placement", and no doubt it is not the only one. Every time a name server lookup fails (such as when a dead link leads to a non-existent domain), the browser is redirected to a page from the ISP's favoured search portal, with the queried domain as the search term. This artificially increases traffic to the search portal and thereby increases advertising revenue.
How this is done is interesting in itself: The domain name server returns the IP address of a web server at the ISP instead of a failure notice. This web server returns a redirection to the search portal when the browser connects via HTTP. (This can be found out using the command-line downloader curl with the -v option, which makes it print the IP addresses of the servers it connects to.)
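Besides inspecting a connection with curl, one can test for this behaviour directly at the name lookup stage: a host name that cannot exist should fail to resolve. The Python sketch below is an alternative to the curl method, not a reproduction of it. It asks for a random name under the reserved top-level domain .invalid; the resolver is passed in as a parameter so that the example runs offline with two simulated resolvers, both of which (like all function names here) are invented for illustration.

```python
import socket
import uuid

def random_nonexistent_host():
    """Return a random host name under the reserved .invalid top-level domain."""
    return uuid.uuid4().hex + ".invalid"

def dns_lies_about_missing_hosts(resolver=socket.gethostbyname):
    """Return True if a name that cannot exist resolves to an address anyway."""
    try:
        resolver(random_nonexistent_host())
    except socket.gaierror:
        return False  # honest failure notice
    return True       # some address came back: lookups are being redirected

# Simulated resolvers, so that the example does not touch the network.
def honest_resolver(hostname):
    raise socket.gaierror("name not known")

def redirecting_resolver(hostname):
    return "192.0.2.1"  # documentation-only address standing in for the ISP's server

print(dns_lies_about_missing_hosts(honest_resolver))
print(dns_lies_about_missing_hosts(redirecting_resolver))
# With the real resolver: dns_lies_about_missing_hosts()
```

Called without arguments, the function uses the system's resolver and thus reports on the name server actually in use.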
Like click trafficking, this is a minor nuisance. Because the portal's URL appears in the browser location bar, the original URL cannot be edited in case of typos. Again as for click trafficking, automatic filtering ought to be possible. The browser (or a local web proxy) must be made to ignore redirects to the portal in question or avoid contacting the redirect server in the first place. To my knowledge, no browser allows this out of the box at the moment, so this seems to require setting up a proxy.
General -- Downloaders -- Search spiders -- Custom spiders -- References
Squeezing information from the web can be quite time-consuming. Some of the effort involved is repetitive — checking that search engine results really contain the requested terms, judging each result according to additional criteria and saving the useful ones to disk or extracting specific information from them. Therefore it may help to automate some of these tasks if a large number of documents is to be processed.
As when automating a data processing task locally on a single computer, it has first to be decided which parts of it are best performed by a human and which can be automated. Unfortunately, in the case of information searches, the one part best done by a human is the one in the middle — deciding whether a source is useful. Ideally the user would be kept in the loop as a decision-maker, but that would require a hybrid between a robot and a browser that could automate some steps but also display selected pages and take user input. I am not aware of any such program. The second-best but feasible solution is to program a preliminary check designed to eliminate many false positives and store the results for later decision by the user.
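The preliminary check described above can be sketched in a few lines of shell. This is only a minimal illustration, assuming that the candidate pages have already been saved as local files and that "useful" simply means "contains all search terms". The function names `contains_all_terms` and `prefilter` are made up for this example.

```sh
# contains_all_terms FILE TERM...
# Succeeds (exit status 0) only if every TERM occurs in FILE, ignoring case.
# Terms are treated as fixed strings, not as regular expressions.
contains_all_terms() {
    file=$1
    shift
    for term in "$@"; do
        grep -qiF -- "$term" "$file" || return 1
    done
    return 0
}

# prefilter LIST TERM...
# LIST is a file naming one saved page per line. The names of the pages
# that contain all terms are printed, to be reviewed by a human later.
prefilter() {
    list=$1
    shift
    while IFS= read -r page; do
        contains_all_terms "$page" "$@" && printf '%s\n' "$page"
    done < "$list"
    return 0
}
```

A call such as `prefilter candidates.txt spider robot > shortlist.txt` (both file names are hypothetical) would leave a shortlist for manual inspection.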
Before we delve into details, a word on nomenclature: While a robot can be any automaton that uses the web, the term spider is usually reserved for robots that can follow hyperlinks and thereby traverse large parts of it. One could say that while spiders can walk around the net, robots are stuck where they hit it, just like insects that are not spiders ;). Both can be of use in information searches, depending on the task at hand, though spiders are obviously more powerful.
Before asking oneself how best to write a robot, one should consider two other questions: when or whether to write one, and how it should behave. The answer to the first results from the tradeoff between the time needed to write a robot and the time it is likely to save you. Obviously there is no algorithm to determine that, so the following will be a bit woolly. If you have been searching manually for less than half an hour, continue doing so until you are sure that you are not getting there. On the other hand, if you have to search a known and large part of the web, you may want to start programming right away. The more general the spider you write, the more often you will be able to use it, but the more time it will also take to write. If you have clear search criteria, a robot is likely to get good results, but if you are "fishing" to get clues for subsequent secondary searches, human intelligence is clearly required. Is there perhaps already a web database that contains the information your robot would extract? Consult a list of specialised search engines and databases (such as this one) again. Only you can answer the question whether it makes sense to write a robot for your particular problem, but you should think about it for a few minutes before spending half an hour either programming or searching manually.
The second question, how decent robots should behave, is more easily answered, by a combination of common sense and convention. Hosting web pages costs money, and some hosting providers limit the amount of data that can be downloaded in a given period of time. Similarly, bandwidth on the net as a whole is a common good. So your robot should strive to avoid unnecessary downloads. In fact, using a robot as opposed to a web browser can actually help to limit bandwidth, because a robot does not typically download images and other data embedded into a web page automatically. For a typical web page today, this can save a lot.
A related topic is the Robot Exclusion Standard. A robots.txt file located at the root directory of some sites (that is, at http://servername.bla/robots.txt) advises robots what parts of the site to traverse and what to pass up. Though robot authors are not required to obey it, there are several good reasons to do so: It is good netiquette. Most web site administrators are aware that robots.txt is only advisory, so anything your robot is not supposed to see will be hidden by additional more effective means. And finally, the most frequent use of robots.txt is to keep robots away from junk. So observing it often saves the robot time and the net bandwidth. The syntax of robots.txt is described on Wikipedia and robotstxt.org, and there is a Perl module for parsing it. As an aside, if your research concerns the Robot Exclusion Standard itself, have a look at BotSeer, a database of robots and robots.txt files.
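For a small shell robot, a rough check against robots.txt might look like the following sketch. It is deliberately simplified and is not a full implementation of the standard: it assumes that only the rules for `User-agent: *` matter, treats each `Disallow:` value as a plain path prefix, and ignores `Allow:` lines and wildcards. The function name `allowed_by_robots` is made up for this example.

```sh
# allowed_by_robots PATH < robots.txt
# Succeeds if PATH does not start with any Disallow prefix listed in the
# "User-agent: *" group of the robots.txt file read from standard input.
allowed_by_robots() {
    awk -v path="$1" '
        {
            sub(/\r$/, "")              # strip a trailing carriage return
            sub(/[ \t]*#.*$/, "")       # strip comments
            sub(/[ \t]+$/, "")          # strip trailing blanks
        }
        tolower($0) ~ /^user-agent[ \t]*:/ {
            agent = $0
            sub(/^[^:]*:[ \t]*/, "", agent)
            # Consecutive User-agent lines share one group of rules.
            in_star = (agent == "*") || (in_header && in_star)
            in_header = 1
            next
        }
        /[^ \t]/ { in_header = 0 }
        in_star && tolower($0) ~ /^disallow[ \t]*:/ {
            rule = $0
            sub(/^[^:]*:[ \t]*/, "", rule)
            # An empty Disallow value disallows nothing.
            if (rule != "" && index(path, rule) == 1) blocked = 1
        }
        END { exit (blocked ? 1 : 0) }
    '
}
```

With a previously downloaded copy of the file, `allowed_by_robots /some/page.html < robots.txt && echo "may fetch"` would report whether the path is open to robots under these simplified rules.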
One more important matter of politeness is the frequency with which a robot requests pages. An extension of the Robot Exclusion Standard is the Crawl-delay field which gives the recommended number of seconds to wait between retrievals. If the robots.txt file of a server contains it, that is what you should choose. If the keyword is not present, or if you write a small spider and do not want to burden yourself with parsing robots.txt, my advice is to wait 10 seconds. This may seem much, but be realistic: The web operates at a time scale of seconds as opposed to the microseconds of your processor or the milliseconds of your hard drive. You are not going to sit around twiddling your thumbs waiting for your robot's results. Rather, it will run in the background for half an hour or more, while you occupy yourself productively with something else. Choosing a rudely small crawl delay is not going to change that, so you might as well be polite. Ten seconds is sufficient time for a competent browser user to download a page and type in a search term in the browser's search function, so this gives you a valid line of defence should someone accuse you of overwhelming their server with your robot.
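The rule of thumb above (use Crawl-delay if present, otherwise wait 10 seconds) can be written as a small filter. This is a sketch under simplifying assumptions: it takes the first Crawl-delay line it finds regardless of which user agent group it belongs to, and it only understands whole seconds. The function name `crawl_delay` is made up for this example.

```sh
# crawl_delay < robots.txt
# Prints the first Crawl-delay value found on standard input,
# or 10 (the default suggested in the text) if there is none.
crawl_delay() {
    awk '
        BEGIN { delay = 10 }
        tolower($0) ~ /^crawl-delay[ \t]*:[ \t]*[0-9]+/ {
            value = $0
            sub(/^[^:]*:[ \t]*/, "", value)   # drop the field name
            sub(/[^0-9].*$/, "", value)       # keep the leading integer
            delay = value
            exit                              # jump to the END block
        }
        END { print delay }
    '
}
```

The result can be handed straight to wget, for example `wget -w "$(crawl_delay < robots.txt)" -i urls.txt`, where robots.txt is a downloaded copy and urls.txt a hypothetical list of pages on the same server.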
Finally, if you use web services, you have to obey their terms of service, which sometimes ban robots. Google currently does this again, while other search engines do not. (Dogpile, a metasearch engine that includes Google results, does not seem to ban robots, so this may be a way to get at Google results with a robot.) A company's terms of service can also demand obedience to the robot rules in robots.txt, which makes them binding rather than advisory. If access to some pages is banned for robots in this way, the only thing to do is to download them manually and let your robot operate on the downloaded files or the links contained in them. If your aim is to refine Google's search results, you should probably use the following URL rather than the one given above:
http://www.google.com/ie?q=<search terms>
The resulting page is much cleaner and therefore easier to parse than Google's normal search results. Remember that you have to type this query manually into your browser and may not automate it! To be safe, you also should check at least anecdotally whether the results are the same as for the regular Google search.
Before we proceed to robots intended for searching, let us have a look at more general and more widespread robots, primarily those intended for automated downloading. The best-known of these are wget and curl. The primary difference between them is that wget can recursively follow hyperlinks whereas curl cannot. wget is therefore suited for downloading whole subtrees of interlinked HTML documents, and automatically downloading embedded images and the like. In return, curl allows the user to give ranges of URLs in a notation similar to glob patterns, which is useful for retrieving numbered documents. Both support authentication (logging in at a web page), encryption (SSL) and bandwidth limiting. wget supports a crawl delay.
wget respects the robot rules in a server's robots.txt when downloading recursively. curl, which does not do recursion, does not. Apparently the implied interpretation is that so long as a URL is given explicitly by the user, the program does not act as a spider and need not obey the robot rules, which has a certain logic. Whether this reasoning would be accepted in court in cases where observance of robots.txt is enforced by terms of service is a different matter.
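As a rough illustration of the difference between the two programs, the following sketch wraps a polite recursive wget download and a curl range download in two shell functions. The function names, the recursion depth of 3, the delay of 10 seconds and the rate limit of 50k are assumptions chosen for illustration. The options themselves are standard wget and curl options.

```sh
# mirror_subtree URL
# Recursively download the pages below URL with wget:
#   -r  follow hyperlinks        -np  never ascend to the parent directory
#   -l 3  at most 3 levels deep  -p   also fetch embedded images and the like
#   -w 10  wait 10 seconds between retrievals
mirror_subtree() {
    wget -r -np -l 3 -p -w 10 --limit-rate=50k "$1"
}

# fetch_numbered PATTERN
# Download a series of numbered documents with curl. PATTERN contains a
# range in square brackets and must be quoted to protect it from the shell,
# e.g.  fetch_numbered 'http://www.example.com/report[1-20].pdf'
# -O saves each document under its remote file name.
fetch_numbered() {
    curl -O --limit-rate 50k "$1"
}
```

Note that curl has no built-in crawl delay, so for long ranges the rate limit is the only brake in this sketch.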
In the interest of preserving network bandwidth, it should be mentioned that curl allows so-called HEAD requests, which do not retrieve the document in question. This can be useful to check whether a document is there at all, or to extract information from the HTTP headers which are returned. The following table displays the most important options of wget and curl:
| Function | wget | curl |
|---|---|---|
| Set output file name | -O <file> (single file only) | -O / -o <file> |
| Limit bandwidth | --limit-rate=<B/s> | --limit-rate <B/s> |
| Crawl delay | -w <seconds> | — |
| HEAD request | — | -I |
| Set user agent string | -U <string> | -A <string> |
| Set referrer URL | --referer=<URL> | -e <URL> |
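The HEAD request mentioned above becomes more useful in scripts when combined with a small filter that pulls a single field out of the returned headers. The following is a minimal sketch. The function name `header_value` is made up for this example, and it assumes the headers arrive on standard input in the form printed by `curl -sI`.

```sh
# header_value NAME < headers
# Prints the value of the last header called NAME (case-insensitive)
# found in a block of HTTP response headers on standard input.
header_value() {
    awk -v name="$1" '
        BEGIN { name = tolower(name) }
        {
            sub(/\r$/, "")                 # HTTP lines end in CR LF
            colon = index($0, ":")
            if (colon == 0) next           # status line or blank line
            if (tolower(substr($0, 1, colon - 1)) == name) {
                value = substr($0, colon + 1)
                sub(/^[ \t]+/, "", value)  # drop blanks after the colon
                found = 1
            }
        }
        END { if (found) print value }
    '
}
```

For instance, `curl -sI http://www.example.com/ | header_value Content-Type` would print the media type of the page without downloading its body (the URL is a placeholder).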
Besides programs intended for automated downloads, text-based web browsers can be very useful. The most widespread ones, lynx, w3m and links, all support outputting the rendered web page with the option -dump. This provides a way to obtain a plain text version of an HTML document.
By now you may be wondering what place a discussion of downloaders has in a text about web research. The reason is that before a document on the web can be investigated, it has to be downloaded. Automated downloaders are therefore a required component of robots written as shell scripts. Though robots can be written in other programming languages (and frequently are), shell scripts have the advantage of being a very high level language with correspondingly low development effort. They also allow you to make use of numerous standard UNIX tools designed to process plain text files where other languages may require you to duplicate their functionality. If you intend to perform multiple investigations on a limited number of web documents, it may make sense to separate the download and investigation steps altogether. You might first retrieve a set of web pages and then program one or several tools to run queries on them. This saves bandwidth compared to downloading them every time as would be done if the queries were performed by different spiders.
Finally it should be mentioned that some sites block automated downloaders by various means, sometimes even unintentionally. If the site in question states (in its terms of service or on the pages concerned) that it bans robots, this should be respected. However, if you find no such statement after looking carefully, I regard it as legitimate to try to bypass such restrictions. In my experience blocking robots is most frequently done in two ways. Every browser and robot sends a so-called "user agent" string along with its HTTP request, which usually gives its name, version and operating system. Some servers refuse to service requests unless they originate from well-known browsers. This can be prevented by providing the user agent string of a browser using the options above. If your download task is simple enough, it may also be worth trying to use links, which has a graphical mode and therefore tends to be regarded as a "real" browser and tolerated.
The second method of discouraging robots involves the "referrer" header (often spelled "referer" following a misspelling in an RFC). This gives the URL which references a file when downloading it, such as the page which hyperlinks to a page or which embeds an image. If a URL is retrieved directly, there is no referrer, and some servers refuse to reply. (This is also used to prevent "hotlinking", putting images on other servers into one's own pages.) An acceptable referrer URL can usually be easily found out (a page containing an image, a table of contents etc.), and the downloader can be told to transmit it with the options given above.
There are in fact a number of search spiders available on the web, from a variety of projects and individuals. I admit that I have no great experience using them, as they tend not to meet my needs for one reason or another, as explained in the following.
The available search spiders differ in both abilities and purpose.
Many search spiders are intended for indexing, like the spiders of the big search engines. Examples are Nutch, which is affiliated with the Apache web server project, ht://Dig, and the Aperture framework, a Java library. Using such spiders makes most sense if you are going to answer a large number of queries for a limited number of documents and according to known and limited criteria. Indexing these documents once or regularly with respect to your criteria will save bandwidth because only the resulting database has to be queried repeatedly. If you intend to make widely different complex queries to unpredictably different parts of the web, indexers will not help you. However, indexing an intranet or local files can be helpful to obtain a preselection for further investigation. It should be mentioned that ht://Dig supports a number of non-exact matching requirements such as sound-alikes, synonyms and substrings.
Some other robots are downloaders like wget and curl (and therefore not search spiders, though they can be used as part of such). Examples of this category are Pavuk, Arale and Larbin. I have not tried them out, as I have always found wget and curl sufficient. If you miss a feature, you might want to look at the others.
The bots section of the Searchlores site presents a number of robots with accompanying articles. The most interesting one for web search is W3S by Mhyst. It is a spider that can traverse part of the web according to criteria given by the user and report on the number of keywords contained in each page. I have to admit I have not yet got round to trying it out, but it has a GUI that seems intuitive to use and is written in platform-independent Java. It seems quite suitable for search tasks a search engine cannot perform, such as "real-time" searches of the web and intranet. The other robots presented at Searchlores are mostly special-purpose. Reading about them helps to learn about writing robots, and some may fit your task at hand or could be adapted to do it.
Finally, if you have good knowledge of the web and HTML, and have demanding requirements for a search spider, I can recommend the one I have perpetrated myself, wfind. It is very powerful, but accordingly complex to use, and it takes an expert to exploit its full potential. If you are interested, read my introductory page and its manual page online.
The first choice one has to make after deciding to write a spider is what programming language to use. This decision has the same tradeoff of development effort versus flexibility as always. For me the answer is usually either a shell script or Perl. Shell scripts are fine if you do little more than downloading and do not need much control flow or many data structures. Retrievals are done via wget or curl, and standard text utilities like grep, sed, tr, etc. can be used to process the data and determine the next URL to visit. Programs like pdftotext, catdoc, xls2csv, etc. can be used to convert non-plain-text formats. Attempting complex analysis or parsing of HTML, XML or other plain-text based formats directly with the shell is probably not a good idea. But be aware of the little-known fact that the shell bash supports regular expressions:
[[ "$text" =~ $regex ]]
The code snippet above matches $text against the extended regular expression $regex. A further drawback of shell scripts is that you are stuck with the features of the downloader you are using.
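To illustrate the bash regular expression match shown above, here is a small sketch that uses it to pull link targets out of a page in order to determine the next URLs to visit. It is deliberately crude, in keeping with the warning about parsing HTML in the shell: it only recognises lower-case `href` attributes in double quotes. The function name `extract_links` is made up for this example. The parts of the last match are available in the bash array `BASH_REMATCH`.

```bash
#!/usr/bin/env bash
# extract_links < page.html
# Prints the target of every double-quoted href attribute, one per line.
extract_links() {
    local regex='href="([^"]*)"'
    local line
    while IFS= read -r line || [[ -n $line ]]; do
        # Consume the line match by match.
        while [[ $line =~ $regex ]]; do
            printf '%s\n' "${BASH_REMATCH[1]}"      # first parenthesised group
            line=${line#*"${BASH_REMATCH[0]}"}      # cut off up to the match
        done
    done
}
```

Applied to a saved page, `extract_links < page.html | sort -u` would give a list of distinct link targets, which still have to be turned into absolute URLs before they can be retrieved.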
For more complex tasks than are comfortable to do with a shell script, I use Perl. This choice is partly due to my knowing Perl pretty well now; I am sure that Python or Ruby are as suitable. The most important features are the ease of retrieving web pages and the availability of analysis features you want, such as regular expressions. If you want to analyse non-plain-text file formats or search by special criteria (sound-alikes, fuzzy searches etc.), it saves you work if there are packages for that in your chosen language. Perl has the advantage that a large number of modules are available in a comprehensive database, CPAN.
Spiders have also been written in non-scripting languages such as C, C++ and Java. I think these are too low-level and consequently too work-intensive and do not add enough in the way of flexibility to be worth the extra effort. But if you know them well and/or know a library that provides an unusual feature you need, choose them by all means. Be aware that some scripting languages (for instance Perl) offer interfaces to C, and all (AFAIK) can execute external programs. This allows you to program the necessary part of your task (and only that) in a lower-level language.
The details of how to write your spider are something I cannot help you with. Every (search) task is different, and so every spider for a specific task will also be. However, I can give you a few caveats:
In order to learn to write a robot, it can be useful to look at existing code. The bots section of Searchlores has already been mentioned. The code snippet database Snipplr contains some snippets related to automated downloading and searching. The open book Web Client Programming with Perl conveys basic understanding of how web browsing works technically, and how to do it with a Perl script. The non-open book "Spidering Hacks" (cited in full below) is dedicated to writing robots in Perl, most of which are not actually spiders by my definition above. An archive of examples can be downloaded from its homepage.
© 2009 Volker Schatz. Licensed under the Creative Commons
Attribution-Share Alike 3.0 Germany License