About -- Search engines -- Directories -- Archiving -- Webspam -- Spidering -- References
The topic of this document is web research, or how to obtain a specific piece of information, or information about a specific topic, via the web. Achieving this includes but goes beyond using search engines. If you are looking for an introduction to using web search engines, this page is not for you. The word "introduction" in the title merely indicates that it is meant to be broad and inclusive rather than deep, covering all aspects of web research but not delving too deeply into any of them.
This document will also not teach you how to evaluate what you find. Besides obvious sanity checks (an introductory page recommends: "Look at the title text. Does it relate to the subject?"), I assume that the searcher is sufficiently familiar with the topic to weed out junk, or else unbiased enough to view a representative number of sources.
This page was written to collect my experience and thoughts on how to squeeze information from the web in the face of an overwhelmingly low signal-to-noise ratio. It is aimed at scientists and engineers, perhaps journalists and bloggers, generally everybody who requires information which can in principle be found on the web, if only one knew how to get at it. If you know what you are looking for, or if you regularly find yourself clicking through the first five to ten result pages of a search engine rather than just the first, the following may help you.
Caveats -- Advanced -- Customising -- Search terms -- References
The first stop in any attempt to obtain information from the web is likely to be a general-purpose search engine. The main reason for this is that most searches are trivial (a forgotten domain name of a company; information on a general topic). For specific searches and serious research, search engines are quickly becoming more and more useless. Nonetheless this section will discuss how to get the most out of some search engines.
While everybody probably has their own preferred search engine, I will briefly name the ones I use and their relative merits. The search engine I tend to use first is DuckDuckGo. In my experience, it returns good results also for specific queries about technological topics, which are the kinds of searches I do most often.
When DuckDuckGo does not turn up any satisfactory results, I use a major search engine such as Google, Startpage (Google-derived, with more privacy) or Bing. Their larger index gives them an advantage that their loose matching criteria do not entirely negate. Google in particular is acknowledged to have the largest index of all single search engines and is comparatively topical if the query is simple enough.
Lastly, when other search engines return a large number of false positives (i.e., results which do not meet my criteria), I use a powerful search spider of mine to filter the results of the meta search engine Dogpile. If that fails, I conclude that what I want probably cannot be found on the open internet (i.e. outside pay/registration walls).
Two other search engines that deserve a special mention are Exalead and Clusty. Exalead is one of very few search engines (the only one?) which supports advanced features like proximity search and truncation. On the negative side, its small index size is sometimes painfully obvious. Clusty is a meta search engine which clusters results according to what additional words appear in them. This allows one to quickly discard results that contain the right search terms in the wrong context and sometimes suggests related search terms.
A few years ago, this section contained some words about how strictly different search engines obeyed the search terms. Now, all major search engines have a strong bias in favour of popularity over topicality as per the given search terms. Exalead and DuckDuckGo are notable exceptions in that they take search terms comparatively seriously.
The most important thing to know about search engines is that they are mostly run by advertising companies seeking to lure viewers to their ads. This has two separate consequences that present difficulties for researchers. For one thing, search engines tend to prefer popular results over topical ones, since most people prefer being entertained to learning. Besides, advertising companies are subject to an obvious conflict of interest between providing even-handed search results and preferring their paying customers.
Among the effects of the first problem is most search engines' lax interpretation of search terms. They do not actually search for the terms you give them in the way the grep command or a search in a text editor would. Instead, they take the liberty of substituting words they consider "equivalent", by whatever unpublished definition. Sometimes words with the same stem seem to be considered equivalent, sometimes even synonyms. In some of my Google searches, I have discovered that around half the results did not contain the specified search term (though of course part of that could be due to page contents having changed since they had been indexed). The extent of the problem rises for rare search terms, because like all salesmen, search engines hate admitting they have nothing to offer.
This is bad, because no search terms are really equivalent when you search for something specific. Synonyms are always inexact, and words from the same family may be used in different contexts: "distribution" may get you a Linux variant or information about product delivery, but "distributive" refers to a property in algebra. With the vague interpretation of the search terms common today, you get no proper feedback on whether your search terms were useful, and the iterative cycle of choosing terms and seeing if they are any good gets disrupted. Most search engines in theory support forcing a search term by prepending a plus sign, but in reality this merely increases the chances slightly.
Another important caveat, depending on where you live, is that Google (and possibly other search engines too) returns results tailored towards the language setting of your browser. After noticing that some of my searches seemed irreproducible between different browsers, I once investigated this in detail. I found that around 50% of the results on the first result page differed when it was retrieved from different countries. This is partly due to the Accept-Language HTTP header that browsers send to the server (visible to server-side scripts as HTTP_ACCEPT_LANGUAGE), and partly to Google using geolocation to detect where requests come from. This feature is clearly designed for targeted advertising, and the bias it creates in search results is the reason I avoid Google whenever possible. Ask, for example, allows one to perform English-language searches from Germany by using the form at http://www.ask.com/?o=312&l=dir. DuckDuckGo and Exalead seem unbiased with respect to client geolocation.
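The header part of this effect can be controlled when result pages are fetched by a script rather than a browser. The following is a minimal sketch using Python's standard library; the URL is a placeholder, the helper name is made up for illustration, and geolocation by IP address is of course not affected by it:

```python
import urllib.request

def make_request(url, language="en-US,en;q=0.8"):
    """Build a request that announces a preferred result language."""
    req = urllib.request.Request(url)
    req.add_header("Accept-Language", language)
    return req

# Placeholder URL; pass the request to urllib.request.urlopen() to send it.
req = make_request("https://www.example.com/search?q=test", "de-DE,de;q=0.9")
# urllib stores header names in capitalised form, hence the spelling below.
print(req.get_header("Accept-language"))
```

Fetching the same query with two different language values and comparing the result lists is one way to repeat the comparison described above.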
The second problem mentioned above, an advertiser's conflict of interest, results in links to certain sites being inserted spuriously or disproportionately into the result list. Often these links lead to services of the search engine company itself, such as Google Maps. Some search engines mirror other sites in order to keep viewers at their own sites (and in reach of advertising): For example, Ask only references its own copy of Wikipedia in its search results, which is not always up to date. Clusty (or its alias Yippy) strongly favours results at the domain highbeam.com, apparently an affiliated company. Similarly, Ask sometimes inserts results linking to news articles at moreover.com.
What one should learn from all this is not to trust search engine results blindly. Nonetheless the following sections present the features they offer.
That a prepended plus sign enforces your choice of search terms was already mentioned in the previous section. Correspondingly, prepending a minus sign excludes pages containing a term (but you likely knew that). (However, Google now uses prepended punctuation for other purposes.) The third moderately advanced feature that one frequently needs is enclosing a phrase in double quotes. This causes the search engine to search for the phrase rather than just its words in any order. Putting periods between the words (without spaces) works just as well, because search engines mostly treat punctuation as spaces but keep the words in the right order.
Further features that may be useful sometimes are operators prepended to search words. These operators are often undocumented by the search engines. Google used to have a help page about them, which seems to have been removed (archived copy) and hidden in a support answers database, but the operators themselves still work. Ask does not document its operators, but one can reverse engineer them from the result URL of its advanced search form. Exalead provides complete documentation. DuckDuckGo does not document the standard operators either, even though it supports most of them. However, unlike other search engines, it supports category keywords prepended by an exclamation mark.
site: may work better (possibly only) when followed by the full server name, i.e. www.example.foo not example.foo. A few undocumented operators work too. Here is an overview:
|Restrict file type||filetype: (not for Exalead)|
|Require keyword in title||intitle: (DuckDuckGo/Clusty: title:)|
|Require keyword in URL||inurl: (Clusty: url:)|
|Require link to site or URL||link:|
As a convenience feature, Google allows allinurl: and allintitle: to require all search terms to appear in the URL or title. Ask seems to always apply inurl: and intitle: to all following search terms. The missing documentation suggests that Google and Ask do not consider their operators important. To be sure, you had better try them out with a query the results of which you can easily verify before relying on them. By contrast, Exalead's main selling point is its advanced features. For example, I have found its link: operator to return considerably more complete results than Google's. Besides, it allows the operators language:, spellslike:, soundslike:, before: and after:. The last two restrict the modification date of the result document, not merely the date it was last indexed. Amazingly, they seem to work even for servers that do not return a correct modification date. (Most, unlike mine, just return the current date/time. The browser Links can display this information with the = key.) Apparently Exalead manages to extract date information from the URL or page text.
The usefulness of the operators depends strongly on what kind of search you do. site: makes it possible to search a specific site or site type (by giving the top-level domain such as .net). Requiring an important keyword in the title or the URL may sometimes give more pertinent results, but you cannot rely on that. One should also remember that the definition of intitle: for non-HTML results is open to interpretation and may differ between search engines. Exalead's spellslike: and soundslike: can help if you have heard of a domain or organisation but do not know its spelling.
Beside the unary operators above, search engines support a number of binary operators. Exalead is in the lead (as it were...) here too. Its NEXT operator saves you searching for a phrase of two words in both orders separately. NEAR searches for terms in close proximity. Capitalised AND and OR are logical operators, and parentheses are used for grouping. The same logical operators seem to be supported by Ask and Google.
Finally, Exalead allows truncating search terms with an asterisk *. Any page containing a word starting with the given string matches. Google uses the asterisk to allow wildcard words, which I regard as useless. Because the number of words between search terms remains fixed, this is not at all equivalent to proximity search: an unforeseen adjective will break the match.
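The difference between a wildcard word and a proximity search can be illustrated with a few lines of Python. This is only a sketch of the two matching rules as described above, not of how any search engine implements them; the function names and the default distance of five words are assumptions made for the example:

```python
import re

def words(text):
    """Split a text into lower-case words."""
    return re.findall(r"\w+", text.lower())

def wildcard_match(text, first, second, gap=1):
    """True if `second` follows `first` with exactly `gap` words in between."""
    w = words(text)
    return any(w[i] == first and w[i + gap + 1] == second
               for i in range(len(w) - gap - 1))

def near_match(text, first, second, max_gap=5):
    """True if both terms occur with at most `max_gap` words between them,
    in either order."""
    w = words(text)
    pos_first = [i for i, x in enumerate(w) if x == first]
    pos_second = [i for i, x in enumerate(w) if x == second]
    return any(0 < abs(i - j) <= max_gap + 1
               for i in pos_first for j in pos_second)

text = "spectral power density estimation"
print(wildcard_match(text, "spectral", "estimation"))  # one wildcard word only
print(near_match(text, "spectral", "estimation"))      # any small distance
```

With the extra word "power" in the text, the wildcard rule fails while the proximity rule still matches, which is the weakness described above.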
To someone who uses web search engines extensively, retrieving their search forms time and again soon becomes a serious overhead. The simplest solution is to download and save the search engine's form page, which is fine if you use primarily one search engine. Otherwise you would prefer to have multiple search engines available with minimum hassle. This can be achieved by copying and pasting together the HTML search forms of several search engines into one HTML document which you store locally. The only thing you may have to change is to convert the "action" URL of the form to an absolute URL. Here is an example with four search engines and Wikipedia, with some slight modifications. (If you like it, download it, or you will have to wait for my web server every time.)
You can also create your own search form. This requires some knowledge of the syntax of HTML forms and how they are submitted, so the less technically minded may want to skip ahead. I use the excellent German-language Selfhtml; for English-speakers, this site or this site may be helpful. The following will give a brief hands-on introduction with regard to search engine forms.
Most search engines use the GET method for transmitting search requests. This means that the search terms and other parameters are appended to a URL in the search engine domain to make the query. In their simplest forms, queries to some search engines look like this:
http://www.ask.com/web?q=<search terms>
http://www.google.com/search?q=<search terms>
http://www.exalead.com/search/web/results/?q=<search terms>
https://duckduckgo.com/html?q=<search terms>
https://startpage.com/do/search?query=<search terms>
Here, <search terms> stands for the suitably encoded search terms, with the spaces replaced by plus signs etc. The simplest HTML form to submit such a query is:
<form action="http://www.ask.com/web"> <input name="q"> <input type="submit" value="Search"> </form>
The action attribute of the form tag gives the URL to which the form data is to be submitted. The first input tag becomes the parameter q=... of the URL. Its name attribute determines the parameter key (before the equals sign); the text you entered in the input field is its value. The second input tag does not define any parameter; it merely represents a button that triggers submission of the form.
Privacy-conscious search engines like DuckDuckGo and Startpage serve results via an encrypted HTTPS connection, and receive their arguments via a POST method (also encrypted), which can provide more privacy than GET in some situations:
<form action="https://duckduckgo.com/html" method="post"> <input name="q"> <input type="submit" value="Quack"> </form>
If you want to retrieve its results with a command-line downloader or a simple program, it is usually easier to use a GET request as above.
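For a simple program, building such a GET URL amounts to encoding the search terms and appending them to the base URL. A minimal sketch in Python follows; the base URLs and parameter names are the ones quoted above and may have changed since, and the helper name search_url is made up for this example:

```python
from urllib.parse import urlencode

# Base URL and name of the query parameter for a few engines, as quoted in
# the text. These are assumptions about services that may change over time.
ENGINES = {
    "google": ("http://www.google.com/search", "q"),
    "duckduckgo": ("https://duckduckgo.com/html", "q"),
    "startpage": ("https://startpage.com/do/search", "query"),
}

def search_url(engine, terms, **extra):
    """Return the GET URL for a query; `extra` adds further parameters."""
    base, key = ENGINES[engine]
    params = {key: terms}
    params.update(extra)
    # urlencode replaces spaces by plus signs and escapes special characters.
    return base + "?" + urlencode(params)

print(search_url("duckduckgo", '"proximity search" -shopping'))
# https://duckduckgo.com/html?q=%22proximity+search%22+-shopping
```

The resulting URL can be handed to a command-line downloader or to urllib.request.urlopen().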
input tags can have several attributes. One which is worth knowing about is the size attribute, which defines the length of a text input. Add it or increase it to get an input line suitably long for complex searches:
<input name="query" size="70">
In order to add a parameter with a fixed value to your search requests, you can add a hidden input tag with a fixed value:
<input type="hidden" name="key" value="value">
The best way to identify valid parameters and their meaning is to use the advanced search form of a search engine and look at the URL of the result pages. Parameters come after the question mark in the URL, are separated by ampersands (&) and have the form <key>=<value>. You have to sort out the many parameters which are always there and test the one you are after by modifying it in the URL line of your browser and reloading the results page. For some search engines, I have identified the parameters determining which results are returned:
|n results per page||—||num=n||count=n||elements_per_page=n||v:state=root|root-0-n|0|
|start at o-th result / p-th page||page=p||start=o||first=o||start_index=o||v:state=root|root-o-10|0|
|Navigation language||—||hl=(country code)||—||—||—|
You can set these parameters with a hidden input tag. Note that Google's num parameter differs from the others in that a longer list of results per page is not just the concatenation of pages containing fewer results, but often contains additional results. DuckDuckGo and Startpage do not allow direct access to the n-th result page, be it as a side effect of their security-consciousness or intentionally. Startpage allows generating an obfuscated additional parameter to its search submission that contains user preferences, among others the number of results per page, which could also be added to a submission form.
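Picking apart a result URL and changing one parameter at a time, as described above, is tedious by hand. The following sketch does it with Python's standard library; the function names are made up for this example, and the URL is only an illustration:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def list_params(url):
    """Return the (key, value) pairs found after the question mark."""
    return parse_qsl(urlsplit(url).query, keep_blank_values=True)

def with_param(url, key, value):
    """Return a copy of `url` in which parameter `key` is set to `value`."""
    parts = urlsplit(url)
    params = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
              if k != key]
    params.append((key, str(value)))
    return urlunsplit(parts._replace(query=urlencode(params)))

url = "http://www.google.com/search?q=web+research&num=10&hl=en"
for key, value in list_params(url):
    print(key, "=", value)
print(with_param(url, "num", 50))
```

Loading the modified URL and comparing the result page with the original shows what the parameter does.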
An important Google-specific parameter is nfpr=1. Recently Google has taken to replacing search terms by more popular ones without user confirmation — for some searches, instead of the link to "did you mean...", the results for the replacement search terms are displayed and a link to the results for the user's search terms is added. nfpr=1 prevents this.
In addition, Google allows two parameters that modify the submitted query. This seems to be documented nowhere, but I saw it demonstrated on Searchlores, and it is still working. The as_oq parameter forces one of a set of terms to be present, while the as_eq parameter requires none of them to be present. Their values are a space-separated list of terms. So one can impose additional constraints in specialised Google search forms via hidden inputs, for instance for product test searches:
<input type="hidden" name="as_oq" value="test review evaluation"> <input type="hidden" name="as_eq" value="buy.now special.offer free.download">
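When the query is submitted by a program instead of a form, the hidden inputs simply become additional URL parameters. A sketch in Python, assuming that the parameters as_oq, as_eq and nfpr still behave as described above (the function name is made up for this example):

```python
from urllib.parse import urlencode

def product_test_url(terms):
    """Build a Google query URL carrying the same constraints as the
    hidden inputs of the specialised search form."""
    params = [
        ("q", terms),
        ("as_oq", "test review evaluation"),               # one of these
        ("as_eq", "buy.now special.offer free.download"),  # none of these
        ("nfpr", "1"),                                     # no replaced terms
    ]
    return "http://www.google.com/search?" + urlencode(params)

print(product_test_url("espresso machine"))
```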
This is a topic on which everybody is pretty much on his/her own. Finding a small number of words which characterise documents that are of use for a given specific purpose is hard, and there is certainly no algorithm for it. One has to try a set of terms and modify them according to the results one obtains.
Sometimes pertinent result documents contain other important terms that are worth searching for. Sometimes they suggest a completely different avenue of attacking your research problem, with a corresponding totally different set of search terms. It may be more efficient to search for a set of terms that are likely to occur closely together, since most search engines prefer such results; results in which unrelated terms merely happen to appear together may be less pertinent.
This vague advice given, I would like to mention one tool that can be very useful for refining one's choice of search terms: Lexical FreeNet is a database that maps relationships between words. It does not only list synonyms, but also generalisation and specialisation relationships and many others. Unfortunately it may be going out of business; the icons representing the relationships are already missing. If it does vanish, Wordsmyth may be worth trying, though it is less precise about relationships between words. Simple synonym finders may also be occasionally useful (see Wikipedia's page on synonyms for a selection). Translating a word into a different language and back can also help.
This section is going to be comparatively brief, as I do not use specialised search engines much and web directories practically not at all. Besides, if you are especially interested in a specific domain, you are likely to know search engines and/or directories pertaining to it, and those I use will not be of use to you. Nonetheless, a few words on them.
Domain-specific search engines are search engines specialising in a subject or field by restricting either the sources they index or the criteria they index by. On one end they border on specialist databases containing references which may or may not be on the web but which are of a narrowly defined type, such as scientific papers. On the other end, they cross over to web directories, which list all kinds of web resources but are typically compiled by humans, not automatically. Two lists of specialised search engines that look useful to me are at NoodleTools and on Phil Bradley's web site.
Among the domain-specific search engines I have used myself is Scientific Commons, a database of scientific papers and preprints that includes arXiv.org and other preprint servers and thereby provides a good chance of leading you to something you can actually download, rather than just read the abstract of. The RCLIS (Research in Computing, Library and Information Science) project operates a database of e-prints in library and information sciences. Among others, this includes research on web search engines and information retrieval systems. The Social Science Research Network contains a database of social sciences papers. Wikipedia also has a list of academic search engines, but one should note that "free" access cost refers to accessing the search only, not the papers.
Google Scholar has improved greatly in recent years in that it now provides links to freely downloadable versions of papers rather than just sales pages of academic publishers. I had viewed it as more of an advertising service than a search engine when it used to link only to sales pages, but it has now become genuinely useful.
Finally, I have little experience to offer on web directories. The WWW Virtual Library, the open directory project and Google directory should probably be mentioned. The reason I do not use web directories is that most of my web research concerns not just special topics but also quite specific questions. Web directories may be useful to find broad and introductory pages on a subject. For that purpose, I usually start from Wikipedia and follow external links, or look at Wikibooks. An ordinary web search will serve just as well if you include one of the words "introduction" or "tutorial". Finding broad and general material on a subject tends not to be too hard, and there are several ways of arriving at it.
Local archiving -- Web archivers -- References
In a way, finding something useful on the web is only part of the problem. Too often you realize only later that a given source may be useful, in view of information you found elsewhere, or to supplement it. By then you will usually have forgotten both its URL and the exact terms you used to find it with a search engine. The latter matters, because often the search results depend on not just the search terms, but also their order and "+" flags.
A different but related situation occurs if you want to follow a link you have good reason to believe is useful, but find it no longer exists. This problem requires finding an archived copy of the page in question, while the previous one is best dealt with by archiving the page yourself.
If you find something that may be useful, grab hold of it. Save at the very least the URL until you are sure that you will not be needing it. Err on the side of caution: you may only realize later how rare some information is. To save a URL, you might use a designated temporary bookmarks folder in your browser, or simply a text file (as I do).
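Keeping such a text file can be automated with a few lines of code. The following Python sketch appends one tab-separated line per URL, together with the date and the exact search terms that led to it; the file layout and the function name are assumptions made for this example:

```python
import datetime

def note_url(logfile, url, query="", comment=""):
    """Append a URL, the query that found it and a comment to a text file."""
    stamp = datetime.date.today().isoformat()
    with open(logfile, "a", encoding="utf-8") as fh:
        fh.write("\t".join([stamp, url, query, comment]) + "\n")
```

A call such as note_url("found.txt", "http://www.example.org/paper.html", '+spider "web research"') records the search terms exactly as typed, including order and flags.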
You may in addition want to download and store a copy of the web page if one of the following applies:
If you decide to save multiple files, the automated downloaders wget or curl may help you. wget can retrieve web pages including "page requisites" such as images, while curl allows a syntax similar to glob patterns to retrieve multiple URLs.
Finally, some social bookmarking services are rumoured to provide archiving services. Furl did this, but is now defunct. Anyway, since such services cannot make the archived content publicly available for copyright reasons, they are hardly better than your local computer. Admittedly they might allow you to access your saved content from anywhere via the web, but you can achieve the same thing by storing it in an online storage system.
When a web page you have good reason to judge useful has become unavailable, you have to turn to web archivers. When the page in question is the result of a web search, the easiest and most effective approach is to retrieve the copy cached by the search engine. Not only is it just one click away, but you can also be sure that this is the version which the search engine has indexed, and consequently which does contain the search terms you requested.
When you want to follow an obsolete link, things are less easy. But you still have a fair chance of getting at the content. The Internet Archive mirrors large parts of the web at regular intervals, as well as other content placed in the archive deliberately. To search for stored copies of a URL, you use its Wayback Machine. It also allows searching the archived content, though I have not tried it and cannot say how useful it is.
The URL of archived web content has a simple structure. The following gets you the overview page which indicates when a given URL was archived:
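As an illustration, and assuming the address pattern that the Wayback Machine has used for many years (web.archive.org/web/, followed by an asterisk or a timestamp, followed by the original URL), such links can be built with a short Python sketch. The function names are made up for this example:

```python
def wayback_overview(url):
    """URL of the overview page listing when `url` was archived."""
    return "https://web.archive.org/web/*/" + url

def wayback_snapshot(url, timestamp):
    """URL of an archived copy; `timestamp` has the form YYYYMMDDhhmmss.
    As far as I know, the archive redirects to the capture nearest to it."""
    return "https://web.archive.org/web/%s/%s" % (timestamp, url)

print(wayback_overview("http://www.example.org/page.html"))
print(wayback_snapshot("http://www.example.org/page.html", "20091001000000"))
```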
A web archiving service aimed specifically at academics is Webcite®. It is a free service that allows one to create a copy of web pages one wants to cite, thereby ensuring the continued availability of these sources. (See here for how to put both direct and archived URLs into your bibliography using BibTeX.) Webcite also allows you to search for a specific URL which anybody might have archived. Somewhat annoyingly, considering you give keywords when archiving something, you cannot search by keyword.
Neither the Internet Archive nor Webcite can help you if the owner of a web page wants to remove it. Both comply with explicit requests for removal. However, not everybody has heard of web archivers, so no such request may be made and an archived copy may be available. The admissibility as evidence of archived web pages is dubious, so don't rely on them as such. Furthermore, since the Internet Archive uses robots to retrieve content to archive, it obeys the robot exclusion standard and may be barred from (parts of) some sites. As a consequence, some sites are not in the archive at all.
Web archiving is still young enough that it is worthwhile to keep up with developments. Archive Team is a loose collection of volunteers dedicated to preserving online resources. The Wikipedia pages on link rot, digital preservation and web archiving may also be helpful.
Webspam -- Click trafficking -- Portal placement
Advertising is the dark half of any information transmission medium. Rather than providing reliable information, it aims to influence and manipulate by exploiting the irrationality of human nature. To the researcher, it impedes the search for knowledge by creating background noise and false positives.
How serious a problem this is depends on how close the topic of one's research is to somebody's commercial interests. Scientific topics tend to be almost unaffected, while obtaining technical information on products is made significantly harder. The impact of embedded advertisements also depends on how good one is at ignoring blinking pictures and videos. Those whom (like me) they keep from concentrating can help themselves by uninstalling their Flash player and blocking access to known ad servers (see here for a list).
Beside plain old embedded ads, I have encountered two new types of nuisance that I had not previously heard about. After a brief word on webspam, I will take the opportunity to present them here. If you are not interested in these annoyances, you can skip this part.
Webspam refers to camouflaging advertising as web content and employing a number of techniques to make search engines send users to them. Popular methods include putting up interlinked web pages to increase their ranking, creating fake blogs, embedding misleading keywords outside the visible content of web pages, and serving different pages to search engine bots and people.
These practices, serious though they are, tend not to affect researchers directly. Search engines rightly see them as potentially fatal to their business model and as a consequence devote significant resources to combating them. In my experience they are rather successful. Furthermore, webspam is not usually a problem when searching for scientific or other objective information. Product searches are a different matter, but allow an obvious sanction: do not buy products which are advertised via webspam.
While earlier in the web's history mistyping a domain name led to a name server lookup error, these days it usually results in a page with a generic logo, many links and a search bar. Some companies specialise in acquiring domains with names similar to popular sites in order to benefit from typos and redirect unsuspecting netizens to their paying clients. I like to call this practice "click trafficking" because of its intent of harvesting clicks as opposed to the more general "domain squatting" which aims at selling the domain at a profit.
This is usually more annoying than serious when accessing a domain one knows, but can be time-consuming when searching for a domain one has heard of or simply guessed. Nonetheless, it is to be expected that inexperienced users of the web are being fooled. The web pages presented usually repeat their own domain name, which is similar to the original but legally above board, and present links related to what the original domain is about, in addition to the usual suspects of dating, credit and domain purchases. Examples include (current 10/2009):
|Spoof domain||Registrant||Main link target||Original domain||Type of original|
|wikipeda.org||Jasper Developments Pty Ltd.||ndparking.com||wikipedia.org||Encyclopedia|
|alsa-project.com||Portfolio Brains, LLC||information.com||alsa-project.org||Audio subsystem|
|ecomonist.com||Web Commerce Communications Ltd.||doubleclick.net, zazasearch.com||economist.com||Newspaper|
|ruby.org||Tucows Inc.||ruby.org||ruby-lang.org||Programming language|
|via.com||Enom Inc.||sedo.com||via.com.tw||PC hardware|
|gnuplot.com||Mdnh Inc.||doubleclick.net, googlesyndication.com||gnuplot.info||Plot program|
As one can see, click trafficking provides a living for a fair number of companies in the shadier parts of internet commerce. That the actual owners of the spoof domains link to sites owned by different companies (and that different owners sometimes link to the same site) suggests that they have subcontracted the handling of their clicks.
Click trafficking is a nuisance during eyeball searches, but may be more serious when using a spider for searching (see below). Automatic filtering of spoof domains may be possible, but I am not aware of any effort in that direction. Those domains which link to themselves tend to use links with extremely long query parameters. Most link to two domains at most, one of which is usually a domain broker or advertiser. The name of the registrant, on the other hand, cannot be used as a filtering criterion, as automated whois queries are banned.
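The two observations above could be turned into a crude filter for a spider. The following Python sketch is an untested heuristic, not an established method: the thresholds (two target domains, 100 characters of query string, half of all links) are guesses, and the function name is made up for this example:

```python
from urllib.parse import urlsplit

def looks_parked(page_url, links, max_targets=2, long_query=100):
    """Guess whether a page is a click-trafficking page, given the list of
    link URLs found on it."""
    host = urlsplit(page_url).hostname or ""
    external_hosts = set()
    suspect_links = 0
    for link in links:
        parts = urlsplit(link)
        target = parts.hostname or host   # relative links stay on this host
        if target != host:
            external_hosts.add(target)
            suspect_links += 1
        elif len(parts.query) >= long_query:
            suspect_links += 1            # self-link with a very long query
    if not links:
        return False
    # Suspicious: few distinct targets, and at least half of all links are
    # either external or self-links with extremely long query parameters.
    return len(external_hosts) <= max_targets and suspect_links * 2 >= len(links)
```

Ordinary pages that link to only one or two other sites will be flagged as well, so the result is better used to rank pages for later inspection than to discard them.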
My current internet service provider has hit on a racket I call "portal placement", and no doubt it is not the only one. Every time a name server lookup fails (such as when a dead link leads to a non-existent domain), the browser is redirected to a page from the ISP's favoured search portal, with the queried domain as the search term. This artificially increases traffic to the search portal and thereby increases advertising revenue.
How this is done is interesting in itself: The domain name server returns the IP address of a web server at the ISP instead of a failure notice. This web server returns a redirection to the search portal when the browser connects via HTTP. (This can be found out using the command-line downloader curl with the -v option, which makes it print the IP addresses of the servers it connects to.)
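Whether a name server behaves this way can also be tested from a program by looking up a domain that almost certainly does not exist. A minimal Python sketch, assuming that a random 32-character name under .com is unregistered (the function name is made up for this example):

```python
import socket
import uuid

def nxdomain_hijacked(resolve=socket.gethostbyname):
    """Look up a random, presumably non-existent domain.

    Returns None if the lookup fails as it should, or the IP address that
    the name server returned instead of a failure notice."""
    name = uuid.uuid4().hex + ".com"   # assumption: nobody registered this
    try:
        return resolve(name)
    except OSError:                    # socket.gaierror is a subclass
        return None
```

Calling nxdomain_hijacked() without arguments uses the system resolver; the resolve parameter exists only so that the function can be tried out without network access.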
Like click trafficking, this is a minor nuisance. Because the portal's URL appears in the browser location bar, the original URL cannot be edited in case of typos. Again as for click trafficking, automatic filtering ought to be possible. The browser (or a local web proxy) must be made to ignore redirects to the portal in question or avoid contacting the redirect server in the first place. To my knowledge, no browser allows this out of the box at the moment, so this seems to require setting up a proxy.
General -- Downloaders -- Search spiders -- Custom spiders -- References
Squeezing information from the web can be quite time-consuming. Some of the effort involved is repetitive — checking that search engine results really contain the requested terms, judging each result according to additional criteria and saving the useful ones to disk or extracting specific information from them. Therefore it may help to automate some of these tasks if a large number of documents is to be processed.
As when automating a data processing task locally on a single computer, one first has to decide which parts of it are best performed by a human and which can be automated. Unfortunately, in the case of information searches, the one part best done by a human is the one in the middle: deciding whether a source is useful. Ideally the user would be kept in the loop as a decision-maker, but that would require a hybrid between a robot and a browser that could automate some steps but also display selected pages and take user input. I am not aware of any such program. The second-best but feasible solution is to program a preliminary check designed to eliminate many false positives and store the results for later decision by the user.
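A preliminary check of this kind can be very simple. The following sketch tests whether a downloaded page (already converted to plain text) really contains all requested terms. The function name contains_all_terms and the file names in the usage comment are illustrative assumptions, and matching fixed strings while ignoring case is just one possible criterion.

```shell
#!/usr/bin/env bash
# Succeed only if the text file $1 contains every one of the remaining
# arguments as a fixed string, ignoring case.
contains_all_terms() {
    local file=$1 term
    shift
    for term in "$@"; do
        grep -qiF -e "$term" -- "$file" || return 1
    done
    return 0
}

# Usage: store the names of the candidates that pass the check for later
# manual review (the pages/ directory and the terms are placeholders):
#   for f in pages/*.txt; do
#       contains_all_terms "$f" "crawl delay" "robots.txt" && echo "$f"
#   done > shortlist.txt
```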
Before we delve into details, a word on nomenclature: While a robot can be any automaton that uses the web, the term spider is usually reserved for robots that can follow hyperlinks and thereby traverse large parts of it. One could say that while spiders can walk around the net, robots are stuck where they hit it, just like insects that are not spiders ;). Both can be of use in information searches, depending on the task at hand, though spiders are obviously more powerful.
Before asking oneself how best to write a robot, one should consider two other questions: when or whether to write one, and how it should behave. The answer to the first results from the tradeoff between the time needed to write a robot and the time it is likely to save you. Obviously there is no algorithm to determine that, so the following will be a bit woolly. If you have been searching manually for less than half an hour, continue doing so until you are sure that you are not getting there. On the other hand, if you have to search a known and large part of the web, you may want to start programming right away. The more general the spider you write, the more often you will be able to use it, but the more time it will also take to write. If you have clear search criteria, a robot is likely to get good results, but if you are "fishing" to get clues for subsequent secondary searches, human intelligence is clearly required. Is there perhaps already a web database that contains the information your robot would extract? Consult a list of specialised search engines and databases (such as this) again. Only you can answer the question whether it makes sense to write a robot for your particular problem, but you should think about it for a few minutes before spending half an hour either programming or searching manually.
The second question, how decent robots should behave, is more easily answered, by a combination of common sense and convention. Hosting web pages costs money, and some hosting providers limit the amount of data that can be downloaded in a given period of time. Similarly, bandwidth on the net as a whole is a common good. So your robot should strive to avoid unnecessary downloads. In fact, using a robot as opposed to a web browser can actually help to limit bandwidth, because a robot does not typically download images and other data embedded into a web page automatically. For a typical web page today, this can save a lot.
A related topic is the Robot Exclusion Standard. A robots.txt file located at the root directory of some sites (that is, at http://servername.bla/robots.txt) advises robots what parts of the site to traverse and what to pass up. Though robot authors are not required to obey it, there are several good reasons to do so: It is good netiquette. Most web site administrators are aware that robots.txt is only advisory, so anything your robot is not supposed to see will be hidden by additional more effective means. And finally, the most frequent use of robots.txt is to keep robots away from junk. So observing it often saves the robot time and the net bandwidth. The syntax of robots.txt is described on Wikipedia and robotstxt.org, and there is a Perl module for parsing it.
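For a small shell robot, a rough check against robots.txt can be sketched with awk. The function name path_allowed is hypothetical, and the sketch is deliberately simplified: it only looks at the group for "User-agent: *", treats Disallow values as plain path prefixes, and ignores Allow lines and wildcards. A full parser, such as the Perl module mentioned above, handles more cases.

```shell
#!/usr/bin/env bash
# Read a robots.txt from stdin and succeed if path $1 is NOT covered by a
# Disallow rule in the "User-agent: *" group. Simplifications of this sketch:
# plain path prefixes only, no Allow lines, no wildcards, no named robots.
path_allowed() {
    awk -v path="$1" '
        { sub(/\r$/, ""); sub(/[ \t]*#.*/, "") }    # strip CR and comments
        $0 ~ /^[ \t]*$/ { next }                    # skip blank lines
        tolower($0) ~ /^user-agent[ \t]*:/ {
            agent = $0
            sub(/^[^:]*:[ \t]*/, "", agent); sub(/[ \t]+$/, "", agent)
            # Consecutive User-agent lines share one group of rules.
            ingroup = (in_header && ingroup) || (agent == "*")
            in_header = 1
            next
        }
        { in_header = 0 }
        ingroup && tolower($0) ~ /^disallow[ \t]*:/ {
            prefix = $0
            sub(/^[^:]*:[ \t]*/, "", prefix); sub(/[ \t]+$/, "", prefix)
            # An empty Disallow value means that nothing is disallowed.
            if (prefix != "" && index(path, prefix) == 1) blocked = 1
        }
        END { exit (blocked ? 1 : 0) }
    '
}

# Usage (servername.bla as in the text above):
#   curl -s http://servername.bla/robots.txt | path_allowed /docs/index.html \
#       && echo "may be fetched"
```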
One more important matter of politeness is the frequency with which a robot requests pages. An extension of the Robot Exclusion Standard is the Crawl-delay field which gives the recommended number of seconds to wait between retrievals. If the robots.txt file of a server contains it, that is what you should choose. If the keyword is not present, or if you write a small spider and do not want to burden yourself with parsing robots.txt, my advice is to wait 10 seconds. This may seem much, but be realistic: The web operates at a time scale of seconds as opposed to the microseconds of your processor or the milliseconds of your hard drive. You are not going to sit around twiddling your thumbs waiting for your robot's results. Rather, it will run in the background for half an hour or more, while you occupy yourself productively with something else. Choosing a rudely small crawl delay is not going to change that, so you might as well be polite. Ten seconds is sufficient time for a competent browser user to download a page and type in a search term in the browser's search function, so this gives you a valid line of defence should someone accuse you of overwhelming their server with your robot.
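Picking up the crawl delay in a shell robot might look like the following sketch. The function name crawl_delay is an assumption, as is the simplification that the first Crawl-delay line found is used regardless of which robot it is addressed to. The fallback of 10 seconds is the value recommended above.

```shell
#!/usr/bin/env bash
# Print the first Crawl-delay value (whole seconds) found in a robots.txt
# read from stdin, or a fallback (default: 10) if there is none.
# Assumption of this sketch: per-robot groups are not distinguished.
crawl_delay() {
    local fallback=${1:-10} delay
    delay=$(awk 'tolower($0) ~ /^crawl-delay[ \t]*:/ {
                     sub(/^[^:]*:[ \t]*/, "")   # drop the field name
                     sub(/[^0-9].*$/, "")       # keep leading digits only
                     if ($0 != "") { print; exit }
                 }')
    echo "${delay:-$fallback}"
}

# Usage (example.org is a placeholder):
#   delay=$(curl -s http://example.org/robots.txt | crawl_delay)
#   for url in "$@"; do curl -s -O "$url"; sleep "$delay"; done
```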
Finally, if you use web services, you have to obey their terms of service, which sometimes ban robots. Google and Ask both do this, while some other search engines do not. (Dogpile, a metasearch engine that includes Google results, does not seem to ban robots, so this may be a way to get at Google results with a robot.) A company's terms of service can also demand observance of the robot rules in robots.txt, which makes them binding rather than advisory. If access to some pages is banned for robots in this way, the only thing to do is to download them manually and let your robot operate on the downloaded files or the links contained in them.
Before we proceed to robots intended for searching, let us have a look at more general and more widespread robots, primarily those intended for automated downloading. The best-known of these are wget and curl. The primary difference between them is that wget can recursively follow hyperlinks while curl cannot. wget is therefore suited for downloading whole subtrees of interlinked HTML documents, and automatically downloading embedded images and the like. In return, curl allows the user to give ranges of URLs in a notation similar to glob patterns, which is useful for retrieving numbered documents. Both support authentication (logging in at a web page), encryption (SSL) and bandwidth limiting. wget supports a crawl delay.
wget respects the robot rules in a server's robots.txt when downloading recursively. curl, which does not do recursion, does not. Apparently the implied interpretation is that so long as a URL is given explicitly by the user, the program does not act as a spider and need not obey the robot rules, which has a certain logic. Whether this reasoning would be accepted in court in cases where observance of robots.txt is enforced by terms of service is a different matter.
In the interest of preserving network bandwidth, it should be mentioned that curl allows so-called HEAD requests, which do not retrieve the document in question. This can be useful to check whether a document is there at all, or to extract information from the HTTP headers which are returned. The following table displays the most important options of wget and curl:
|Task||wget||curl|
|Set output file name||-O <file> (single file only)||-O / -o <file>|
|Limit bandwidth||--limit-rate=<B/s>||--limit-rate <B/s>|
|Crawl delay||-w <seconds>||(none)|
|Set user agent string||-U <string>||-A <string>|
|Set referrer URL||--referer=<URL>||-e <URL>|
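To illustrate the HEAD requests mentioned above: curl issues one with the option -I and prints the response headers. The following sketch extracts a single header from such output. The function name header_value and the URL in the usage comment are illustrative assumptions.

```shell
#!/usr/bin/env bash
# Print the value of HTTP header $1 (matched case-insensitively) from a
# header block read on stdin, such as the output of `curl -sI URL`.
header_value() {
    awk -v name="$1" '
        BEGIN { name = tolower(name) }
        {
            sub(/\r$/, "")                      # HTTP lines end in CR LF
            colon = index($0, ":")
            if (colon && tolower(substr($0, 1, colon - 1)) == name) {
                value = substr($0, colon + 1)
                sub(/^[ \t]+/, "", value)
                print value
                exit
            }
        }
    '
}

# Usage (example.org is a placeholder): find out the size of a document
# without downloading its body.
#   curl -sI http://example.org/paper.pdf | header_value Content-Length
```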
Besides programs intended for automated downloads, text-based web browsers can be very useful. The most widespread ones, lynx, w3m and links, all support outputting the rendered web page with the option -dump. This provides a way to obtain a plain text version of an HTML document.
By now you may be wondering what place a discussion of downloaders has in a text about web research. The reason is that before a document on the web can be investigated, it has to be retrieved. Automated downloaders are therefore a required component of robots written as shell scripts. Though robots can be written in other programming languages (and frequently are), shell scripts have the advantage of being a very high level language with correspondingly low development effort. They also allow you to make use of numerous standard UNIX tools designed to process plain text files where other languages may require you to duplicate their functionality. If you intend to perform multiple investigations on a limited number of web documents, it may make sense to separate the download and investigation steps altogether. You might first retrieve a set of web pages and then program one or several tools to run queries on them. This saves bandwidth compared to downloading them every time as would be done if the queries were performed by different spiders.
Finally it should be mentioned that some sites block automated downloaders by various means, sometimes even unintentionally. If the site in question states (in its terms of service or on the pages concerned) that it bans robots, this should be respected. However, if you find no such statement after looking carefully, I regard it as legitimate to try to bypass such restrictions. In my experience blocking robots is most frequently done in two ways. Every browser and robot sends a so-called "user agent" string along with its HTTP request, which usually gives its name, version and operating system. Some servers refuse to service requests unless they originate from well-known browsers. This can be circumvented by providing the user agent string of a browser using the options above. If your download task is simple enough, it may also be worth trying to use links, which has a graphical mode and therefore tends to be regarded as a "real" browser and tolerated.
The second method of discouraging robots involves the "referrer" header (often spelled "referer" following a misspelling in an RFC). This gives the URL which references a file when downloading it, such as the page which hyperlinks to a page or which embeds an image. If a URL is retrieved directly, there is no referrer, and some servers refuse to reply. (This is also used to prevent "hotlinking", putting images on other servers into one's own pages.) An acceptable referrer URL can usually be easily found out (a page containing an image, a table of contents etc.), and the downloader can be told to transmit it with the options given above.
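Both headers can be set in one call. The following is a minimal sketch using curl's -A and -e options. The function name fetch_as_browser, the DOWNLOADER variable and the user agent string are illustrative assumptions. Making the downloader replaceable merely allows the resulting command line to be inspected without touching the network.

```shell
#!/usr/bin/env bash
# Illustrative user agent string (an assumption of this sketch); copy the
# string sent by a browser you actually use.
USER_AGENT='Mozilla/5.0 (X11; Linux x86_64; rv:115.0) Gecko/20100101 Firefox/115.0'

# Fetch URL $1 into file $2, sending a browser-like user agent and the
# referrer URL $3. The downloader is taken from $DOWNLOADER (default: curl).
fetch_as_browser() {
    local url=$1 outfile=$2 referrer=$3
    "${DOWNLOADER:-curl}" -sS -A "$USER_AGENT" -e "$referrer" -o "$outfile" "$url"
}

# Usage (example.org is a placeholder): retrieve an image, giving the page
# that embeds it as the referrer.
#   fetch_as_browser http://example.org/img/1.png 1.png http://example.org/gallery.html
```

With wget, the corresponding options would be -U for the user agent and --referer= for the referrer.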
There are in fact a number of search spiders available on the web, from a variety of projects and individuals. I admit that I have no great experience using them, as they tend not to meet my needs for one reason or another, as explained in the following.
The available search spiders differ in both abilities and purpose. Many search spiders are intended for indexing, like the spiders of the big search engines. Examples are Nutch, which is affiliated with the Apache web server project, ht://Dig, and the Aperture framework, a Java library. Using such spiders makes most sense if you are going to answer a large number of queries for a limited number of documents and according to known and limited criteria. Indexing these documents once or regularly with respect to your criteria will save bandwidth because only the resulting database has to be queried repeatedly. If you intend to make widely different complex queries to unpredictably different parts of the web, indexers will not help you. However, indexing an intranet or local files can be helpful to obtain a preselection for further investigation. It should be mentioned that ht://Dig supports a number of non-exact matching requirements such as sound-alikes, synonyms and substrings.
Some other robots are downloaders like wget and curl (and therefore not search spiders, though they can be used as part of such). Examples of this category are Pavuk, Arale and Larbin. I have not tried them out, as I have always found wget and curl sufficient. If you miss a feature, you might want to look at the others.
The bots section of the Searchlores site presents a number of robots with accompanying articles. The most interesting one for web search is W3S by Mhyst. It is a spider that can traverse part of the web according to criteria given by the user and report on the number of keywords contained in each page. I have to admit I have not yet got round to trying it out, but it has a GUI that seems intuitive to use and is written in platform-independent Java. It seems quite suitable for search tasks a search engine cannot perform, such as "real-time" searches of the web and intranet. The other robots presented at Searchlores are mostly special-purpose. Reading about them helps to learn about writing robots, and some may fit your task at hand or could be adapted to do it.
Finally, if you have good knowledge of the web and HTML, and have demanding requirements for a search spider, I can recommend the one I have perpetrated myself, wfind. It is very powerful, but accordingly complex to use, and it takes an expert to exploit its full potential. If you are interested, read my introductory page and its manual page online.
The first choice one has to make after deciding to write a spider is what programming language to use. This decision has the same tradeoff of development effort versus flexibility as always. For me the answer is usually either a shell script or Perl. Shell scripts are fine if you do little more than downloading and do not use much control flow and data structures. Retrievals are done via wget or curl, and standard text utilities like grep, sed, tr, etc. can be used to process the data and determine the next URL to visit. Programs like pdftotext, catdoc, xls2csv, etc. can be used to convert non-plain-text formats. Attempting complex analysis or parsing of HTML, XML or other plain-text based formats directly with the shell is probably not a good idea. But be aware of the little-known fact that the shell bash supports regular expressions:
[[ "$text" =~ $regex ]]
The code snippet above matches $text against the extended regular expression $regex. But a further drawback of shell scripts is that you are stuck with the features of the downloader you are using.
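As an illustration of what the =~ operator can do in a spider, the following sketch pulls the link targets out of a piece of HTML using the BASH_REMATCH array, which bash fills with the matched text and its parenthesised groups. The function name extract_links is made up, and the regular expression assumes double-quoted href attributes. Real-world HTML is messier, which is one reason why complex parsing is better left to other languages.

```shell
#!/usr/bin/env bash
# Print the href targets found in the HTML text $1, one per line.
# Assumption of this sketch: attribute values are double-quoted.
extract_links() {
    local text=$1
    local regex='href="([^"]*)"'
    while [[ "$text" =~ $regex ]]; do
        echo "${BASH_REMATCH[1]}"
        # Drop everything up to and including the match, then search again.
        text=${text#*"${BASH_REMATCH[0]}"}
    done
}

# Usage (page.html is a placeholder for a previously downloaded file):
#   extract_links "$(cat page.html)"
```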
For more complex tasks than are comfortable to do with a shell script, I use Perl. This choice is partly due to my knowing Perl pretty well now; I am sure that Python or Ruby are as suitable. The most important features are the ease of retrieving web pages and the availability of analysis features you want, such as regular expressions. If you want to analyse non-plain-text file formats or search by special criteria (sound-alikes, fuzzy searches etc.), it saves you work if there are packages for that in your chosen language. Perl has the advantage that a large number of modules are available in a comprehensive database, CPAN.
Spiders have also been written in non-scripting languages such as C, C++ and Java. I think these are too low-level and consequently too work-intensive and do not add enough in the way of flexibility to be worth the extra effort. But if you know them well and/or know a library that provides an unusual feature you need, choose them by all means. Be aware that some scripting languages (for instance Perl) offer interfaces to C, and all (AFAIK) can execute external programs. This allows you to program the necessary part of your task (and only that) in a lower-level language.
Finally, a relatively unknown scripting language that deserves a mention in connection with spidering is Rebol. The fact that URL retrieval is part of the language makes accessing documents easy, and simple parsing can be programmed in a very concise way.
The details of how to write your spider are something I cannot help you with. Every (search) task is different, and so every spider for a specific task will also be. However, I can give you a few caveats:
In order to learn to write a robot, it can be useful to look at existing code. The bots section of Searchlores has already been mentioned. The code snippet database Snipplr contains some snippets related to automated downloading and searching. The open book Web Client Programming with Perl conveys basic understanding of how web browsing works technically, and how to do it with a Perl script. The non-open book "Spidering Hacks" (cited in full below) is dedicated to writing robots in Perl, most of which are not actually spiders by my definition above. An archive of examples can be downloaded from its homepage.
© 2009 - 2013 Volker Schatz. Licensed under the Creative Commons Attribution-Share Alike 3.0 Germany License