About -- Search engines -- Directories -- Archiving -- Webspam -- Spidering -- References
This text can now be regarded as primarily historical. Though some of the methods still work, web search engines that return pages by content are all but extinct, so finding content on the web is a lost cause. You can still read this text as a guide to the possibilities the web could offer but for the commercial interests of companies and the active disinterest of politicians in independent information.
The topic of this document is web research, or how to obtain a specific piece of information, or information about a specific topic, via the web. Achieving this includes but goes beyond using search engines. If you are looking for an introduction to using web search engines, this page is not for you. The word "introduction" in the title merely indicates that it is meant to be broad and inclusive rather than deep, covering all aspects of web research but not delving too deeply into any of them.
This document will also not teach you how to evaluate what you find. Besides obvious sanity checks (an introductory page recommends: "Look at the title text. Does it relate to the subject?"), I assume that the searcher is sufficiently familiar with the topic to weed out junk, or else unbiased enough to view a representative number of sources.
This page was written to collect my experience and thoughts on how to squeeze information from the web in the face of an overwhelmingly low signal-to-noise ratio. It is aimed at scientists and engineers, perhaps journalists and bloggers, generally everybody who requires information which can in principle be found on the web, if only one knew how to get at it. If you know what you are looking for, or if you regularly find yourself clicking through the first five to ten result pages of a search engine rather than just the first, the following may help you.
Caveats -- Advanced -- Customising -- Search terms -- References
The first stop in any attempt to obtain information from the web is likely to be a general-purpose search engine. The main reason for this is that most searches are trivial (a forgotten domain name of a company; information on a general topic). For specific searches and serious research, search engines are quickly becoming more and more useless. Nonetheless this section will discuss how to get the most out of some search engines.
While everybody probably has their own preferred search engine, I will briefly
name the ones I use and their relative merits. The search engine I tend to use
first is DuckDuckGo. In my experience, it returns
good results even for specific queries about technological topics, which are
the kinds of searches I do most often.
When DuckDuckGo does not turn up any satisfactory results, I use a major search
engine such as Google,
Startpage
(Google-derived, with more privacy) or
Bing
. Their larger index gives them an advantage that
their loose matching criteria do not entirely negate. Google in particular is
acknowledged to have the largest index of all single search engines and is
comparatively topical if the query is simple enough.
Lastly, when other search engines return a large number of false positives
(i.e., results which do not meet my criteria), I use a powerful
search spider of mine to filter the results of the meta search engine
Dogpile. If that fails, I conclude that what I want
probably cannot be found on the open internet (i.e. outside pay/registration
walls).
Two other search engines that deserve a special mention are
Exalead and Clusty
.
Exalead is one of very few search engines (the only one?) which supports
advanced features like proximity search and truncation. On the negative side,
its small index size is sometimes painfully obvious. Clusty is a meta search
engine which clusters results according to what additional words appear in
them. This makes it possible to quickly discard results that contain the right search
terms in the wrong context, and it sometimes suggests related search terms.
A few years ago, this section contained some words about how strictly different search engines obeyed the search terms. Now, all major search engines have a strong bias in favour of popularity over topicality as per the given search terms. Exalead and DuckDuckGo are notable exceptions in that they take search terms comparatively seriously.
The most important thing to know about search engines is that they are mostly run by advertising companies seeking to lure viewers to their ads. This has two separate consequences that present difficulties for researchers. For one thing, search engines tend to prefer popular results over topical ones, since most people prefer being entertained to learning. Besides, advertising companies are subject to an obvious conflict of interest between providing even-handed search results and preferring their paying customers.
Among the effects of the first problem is most search engines' lax interpretation of search terms. They do not actually search for the terms you give them in the way the grep command or a search in a text editor would. Instead, they take the liberty of substituting words they consider "equivalent", by whatever unpublished definition. Sometimes words with the same stem seem to be considered equivalent, sometimes even synonyms. In some of my Google searches, I have discovered that around half the results did not contain the specified search term (though of course part of that could be due to page contents having changed since they had been indexed). The extent of the problem rises for rare search terms, because like all salesmen, search engines hate admitting they have nothing to offer.
This is bad, because no search terms are really equivalent when you search for something specific. Synonyms are always inexact, and words from the same family may be used in different contexts — "distribution" may get you a linux variant or information about product delivery, but "distributive" refers to a property in algebra. With the vague interpretation of the search terms common today, you get no proper feedback on whether your search terms were useful, and the iterative cycle of choosing terms and seeing if they are any good gets disrupted. Most search engines in theory support forcing a search term by prepending a plus sign, but in reality this merely increases the chances slightly.
Another important caveat, depending on where you live, is that Google (and
possibly other search engines too) returns results tailored towards the
language setting of your browser. After noticing that some of my searches
seemed irreproducible between different browsers, I once investigated this in detail.
I found that around half the results on the first result page differed when it was
retrieved from different countries. This is partly due to the Accept-Language header
that browsers send to the server, and partly due to Google
using geolocation to detect where requests come from. This feature is clearly
designed for targeted advertising, and the bias it creates in search results is
the reason I avoid Google whenever possible. Ask, for
example, allows you to perform English-language searches from Germany by using the
form at http://www.ask.com/?o=312&l=dir. DuckDuckGo and Exalead seem
unbiased with respect to client geolocation.
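If you want to check this for yourself, one rough approach is to request the same query with two different Accept-Language headers and compare the result pages. The sketch below uses curl with a placeholder search URL; substitute the result-page URL of the engine you want to test (its GET parameters are discussed further down), and mind its terms of service regarding automated requests (see the spidering section).

# Fetch the same query with two different language preferences and compare.
# The URL is a placeholder, not a real endpoint.
query_url='https://search.example.net/search?q=distribution'
curl -s -H 'Accept-Language: en-US,en;q=0.9' "$query_url" > results-en.html
curl -s -H 'Accept-Language: de-DE,de;q=0.9' "$query_url" > results-de.html
diff results-en.html results-de.html | less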
The second problem mentioned above, an advertiser's conflict of interest, results in links to certain sites being inserted spuriously or disproportionately into the result list. Often these links lead to services of the search engine company itself, such as Google Maps. Some search engines mirror other sites in order to keep viewers at their own sites (and in reach of advertising): For example, Ask only references its own copy of Wikipedia in its search results, which is not always up to date. Clusty (or its alias Yippy) strongly favours results at the domain highbeam.com, apparently an affiliated company. Similarly, Ask sometimes inserts results linking to news articles at moreover.com.
What one should learn from all this is not to trust search engine results blindly. Nonetheless the following sections present the features they offer.
That a prepended plus sign enforces your choice of search terms was already
mentioned in the previous section. Correspondingly, prepending a minus sign
excludes pages containing a term (but you likely knew that). (However, Google
now uses prepended punctuation for
other
purposes.) The third moderately advanced feature that one frequently needs is
enclosing a phrase in double quotes. This causes the search engine to search
for the phrase rather than just its words in any order. Putting periods
between the words (without spaces) works just as well, because search engines
mostly treat punctuation as spaces but keep the words in the right order.
Further features that may be useful sometimes are operators prepended to search
words. These operators are often undocumented by the search engines. Google
used to have a help page about them, which seems to have been removed
(archived
copy) and hidden in a
support answers
database
, but the operators themselves still work. Ask does not document its
operators, but one can reverse engineer them from the result URL of its
advanced search form. Exalead
provides complete
documentation
. DuckDuckGo does not document the standard operators either,
even though it supports most of them. However, unlike other search engines, it
supports category keywords prefixed with an exclamation mark.
site: may work better (possibly only) when followed by the full server name, i.e. www.example.foo not example.foo. A few undocumented operators work too. Here is an overview:
Function | Prefix |
---|---|
Restrict domain | site: |
Restrict file type | filetype: (not for Exalead) |
Require keyword in title | intitle: (DuckDuckGo/Clusty: title:) |
Require keyword in URL | inurl: (Clusty: url:) |
Require link to site or URL | link: |
As a convenience feature, Google allows allinurl: and allintitle: to
require all search terms to appear in the URL or title. Ask seems to
always apply inurl: and intitle: to all following search terms. The
missing documentation suggests that Google and Ask do not consider their
operators important. To be sure, you had better try them out
with a query whose results you can easily verify before relying on them.
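As an illustration, here is a made-up query combining these operators (spelled as in the first column of the table above), which asks for PDF documents on one server with a given word in the title:

site:www.example.org filetype:pdf intitle:calibration gyroscope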
By contrast, Exalead's main selling point is its advanced features. For
example, I have found its link: operator to return considerably more
complete results than Google's. Besides, it allows the operators
language:, spellslike:, soundslike:,
before: and after:. The last two restrict the modification
date of the result document, not merely the date it was last indexed.
Amazingly, they seem to work even for servers that do not return a correct
modification date. (Most servers, unlike mine, just return the current date/time. The
links browser can display this information with the = key.)
Apparently Exalead manages to extract date information from the URL or page
text.
The usefulness of the operators depends strongly on what kind of search you do. site: makes it possible to search a specific site or site type (by giving the top-level domain such as .net). Requiring an important keyword in the title or the URL may sometimes give more pertinent results, but you cannot rely on that. One should also remember that the definition of intitle: for non-HTML results is open to interpretation and may differ between search engines. Exalead's spellslike: and soundslike: can help if you have heard of a domain or organisation but do not know its spelling.
Beside the unary operators above, search engines support a number of binary operators. Exalead is in the lead (as it were...) here too. Its NEXT operator saves you searching for a phrase of two words in both orders separately. NEAR searches for terms in close proximity. Capitalised AND and OR are logical operators, and parentheses are used for grouping. The same logical operators seem to be supported by Ask and Google.
Finally, Exalead allows truncating search terms with an asterisk *. Any page containing a word starting with the given string matches. Google uses the asterisk to allow wildcard words, which I regard as useless. Because the number of words between search terms remains fixed, this is not at all equivalent to proximity search — an unforeseen adjective will break the match.
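To give a made-up example in Exalead's syntax as just described, the following should match pages where a word starting with "polaris" occurs close to "filter", in a photographic context:

polaris* NEAR filter AND (camera OR photograph*)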
To someone who uses web search engines extensively, retrieving their search forms time and again soon becomes a serious overhead. The simplest solution is to download and save the search engine's form page, which is fine if you use primarily one search engine. Otherwise you would prefer to have multiple search engines available with minimum hassle. This can be achieved by copying and pasting together the HTML search forms of several search engines into one HTML document which you store locally. The only thing you may have to change is to convert the "action" URL of the form to an absolute URL. Here is an example with four search engines and Wikipedia, with some slight modifications. (If you like it, download it, or you will have to wait for my web server every time.)
You can also create your own search form. This requires some knowledge of the
syntax of HTML forms and how they are submitted, so the less technically minded
may want to skip ahead. I use the excellent
German-language Selfhtml; for English-speakers,
this site
or this
site
may be helpful. The following will give a brief hands-on introduction
with regard to search engine forms.
Most search engines use the GET method for transmitting search requests. This means that the search terms and other parameters are appended to a URL in the search engine domain to make the query. In their simplest forms, queries to some search engines look like this:
http://www.ask.com/web?q=<search terms>
http://www.google.com/search?q=<search terms>
http://www.exalead.com/search/web/results/?q=<search terms>
https://duckduckgo.com/html?q=<search terms>
https://startpage.com/do/search?query=<search terms>
Here, <search terms> stands for the suitably encoded search terms, with the spaces replaced by plus signs etc. The simplest HTML form to submit such a query is:
<form action="http://www.ask.com/web"> <input name="q"> <input type="submit" value="Search"> </form>
The action attribute of the form tag gives the URL to which the form data is to be submitted. The first input tag becomes the q=... parameter appended to the URL. Its name attribute determines the parameter key (before the equals sign); the text you entered in the input field is its value. The second input tag does not define any parameter; it merely represents a button that triggers submission of the form.
Privacy-conscious search engines like DuckDuckGo and Startpage serve results via an encrypted HTTPS connection, and receive their arguments via a POST method (also encrypted), which can provide more privacy than GET in some situations:
<form action="https://duckduckgo.com/html" method="post"> <input name="q"> <input type="submit" value="Quack"> </form>
If you want to retrieve its results with a command-line downloader or a simple program, it is usually easier to use a GET request as above.
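For a one-off retrieval from the command line, the GET form of the DuckDuckGo query shown above can be used directly (a sketch; whether automated requests are acceptable depends on the engine's terms of service, see the spidering section):

# Fetch a result page with a GET request and save it for later inspection.
curl -s 'https://duckduckgo.com/html?q=proximity+search+operators' > results.html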
input tags can have several attributes. One which is worth knowing about is the size attribute, which defines the length of a text input. Add it or increase it to get an input line suitably long for complex searches:
<input name="query" size="70">
In order to add a parameter with a fixed value to your search requests, you can add a hidden input tag with a fixed value:
<input type="hidden" name="key" value="value">
The best way to identify valid parameters and their meaning is to use the advanced search form of a search engine and to look at the URL of the result pages. Parameters come after the question mark in the URL, are separated by ampersands (&) and have the form <key>=<value>. You have to sort out the many parameters which are always there and test the one you are after by modifying it in the URL line of your browser and reloading the results page. For some search engines, I have identified the parameters determining which results are returned:
Function | Ask | Google | Bing | Exalead | Clusty |
---|---|---|---|---|---|
n results per page | — | num=n | count=n | elements_per_page=n | v:state=root|root-0-n|0 |
start at o-th result / p-th page | page=p | start=o | first=o | start_index=o | v:state=root|root-o-10|0 |
Navigation language | — | hl=(country code) | — | — | — |
You can set these parameters with a hidden input tag. Note that Google's num parameter differs from the others in that a longer list of results per page is not just the concatenation of pages containing fewer results, but often contains additional results. DuckDuckGo and Startpage do not allow direct access to the n-th result page, be it as a side effect of their security-consciousness or intentionally. Startpage allows generating an obfuscated additional parameter to its search submission that contains user preferences, among others the number of results per page, which could also be added to a submission form.
An important Google-specific parameter is nfpr=1. Recently Google has taken to replacing search terms by more popular ones without user confirmation — for some searches, instead of the link to "did you mean...", the results for the replacement search terms are displayed and a link to the results for the user's search terms is added. nfpr=1 prevents this.
In addition, Google allows two parameters that modify the submitted query.
This seems to be documented nowhere, but I saw it demonstrated on
Searchlores, and it is still working.
The as_oq parameter forces one of a set of terms to be
present, while the as_eq parameter requires none of them to be
present. Their values are a space-separated list of terms. So one can impose
additional constraints in specialised Google search forms via hidden inputs,
for instance for product test searches:
<input type="hidden" name="as_oq" value="test review evaluation"> <input type="hidden" name="as_eq" value="buy.now special.offer free.download">
Since a form can have only one "action" URL to be submitted to, using the same input line for several search engines is not possible in plain HTML. But this can be done using JavaScript, saving you the trouble of typing the same search terms multiple times. Here is an example universal search form, for three search engines plus Stack Overflow, Google Scholar and Wikipedia. It opens the search results pages in a new tab or window and selects the search input for easy replacing.
This is a topic on which everybody is pretty much on his/her own. Finding a small number of words which characterise documents that are of use for a given specific purpose is hard, and there is certainly no algorithm for it. One has to try a set of terms and modify them according to the results one obtains.
Sometimes pertinent result documents contain other important terms that are worth searching for. Sometimes they suggest a completely different avenue of attacking your research problem, with a correspondingly different set of search terms. It may be more efficient to search for a set of terms that are likely to occur close together, since most search engines prefer such results; conversely, results in which terms appear together that are not supposed to may be less pertinent.
This vague advice given, I would like to mention one tool that can be very
useful for refining one's choice of search terms: Lexical FreeNet is a
database that maps relationships between words. It lists not only
synonyms but also
generalisation and specialisation relationships, among many others. Unfortunately
it may be going out of business; the icons representing the relationships are
already missing. If it does vanish, Wordsmyth
may be
worth trying, though it is less precise about relationships between words.
Simple synonym finders may also be occasionally useful (see Wikipedia's page on
synonyms
for a selection). Translating a word
into a different language and back can also help.
This section is going to be comparably brief, as I do not use specialised search engines much and web directories practically not at all. Besides, if you are especially interested in a specific domain, you are likely to know search engines and/or directories pertaining to it, and those I use will not be of use to you. Nonetheless, a few words on them.
Domain-specific search engines are search engines specialising in a subject or
field of topics by restricting either the sources they index or the criteria
they index by. On one end they border on specialist databases containing
references which may or may not be on the web but which are of narrowly-defined
type, such as scientific papers. On the other end, they cross over to web
directories, which list all kinds of web resources but are typically compiled
by humans, not automatically. Two lists of specialised search engines that
look useful to me are at
NoodleTools and on Phil Bradley's web site
.
Among the domain-specific search engines I have used myself is Scientific Commons, a database of scientific papers and preprints that
includes arXiv.org
and
other preprint servers and thereby provides a good chance of leading you to
something you can actually download, rather than just read the abstract of.
The RCLIS (Research in Computing, Library and Information Science) project
operates a database of e-prints
in library and information sciences. Among others, this
includes research on web search engines and information retrieval systems.
The Social Science Research Network
contains a database of social sciences papers. Wikipedia also has a list
of academic search engines
, but one should note
that "free" access cost refers to accessing the search only, not the
papers.
Google Scholar has improved greatly in recent
years in that it now provides links to freely downloadable versions of papers
rather than just sales pages of academic publishers. I used to view it as more
of an advertising service than a search engine when it linked only
to sales pages, but it has now become genuinely useful.
Finally, I have little experience to offer on web directories. The
WWW Virtual Library, the open
directory project
and Google directory
should
probably be mentioned. The reason I do not use web directories is that most of
my web research concerns not just special topics but also quite specific
questions. Web directories may be useful to find broad and introductory pages
on a subject. For that purpose, I usually start from
Wikipedia
and follow external links, or look at
Wikibooks
. An ordinary web search will serve just
as well if you include one of the words "introduction" or "tutorial". Finding
broad and general material on a subject tends not to be too hard, and there are
several ways of arriving at it.
In a way, finding something useful on the web is only part of the problem. Too often you realize only later that a given source may be useful, in the light of information you found elsewhere, or to supplement it. By then you will usually have forgotten both its URL and the exact terms you used to find it with a search engine. The latter matters, because often the search results depend not just on the search terms, but also on their order and "+" flags.
A different but related situation occurs if you want to follow a link you have good reason to believe is useful, but find it no longer exists. This problem requires finding an archived copy of the page in question, while the previous one is best dealt with by archiving the page yourself.
If you find something that may be useful, grab hold of it. Save at the very least the URL until you are sure that you will not be needing it. Err on the side of caution — you may only realize later how rare some information is. To save a URL, you might use a designated temporary bookmarks folder in your browser, or simply a text file (as I do).
You may in addition want to download and store a copy of the web page if one of the following applies:
If you decide to save multiple files, the automated downloaders wget or curl
may help you. wget can
retrieve web pages including "page requisites" such as images, while
curl allows a syntax similar to glob patterns to retrieve multiple
URLs.
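For example (URLs invented):

# Save a single page together with the images and style sheets it needs,
# converting links so the local copy can be browsed offline.
wget --page-requisites --convert-links --adjust-extension https://www.example.org/article.html

# Retrieve a numbered series of documents with curl's URL globbing.
curl -s 'https://www.example.org/reports/report[01-12].pdf' -o 'report#1.pdf'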
Finally, some social bookmarking services are rumoured to provide archiving
services. Furl did this, but is now defunct. Anyway, since such services cannot make the archived
content publicly available for
copyright reasons, they are hardly better than your local computer.
Admittedly they might allow you to access your saved content from anywhere via
the web, but you can achieve the same thing by storing it in an online storage
system
.
When a web page you have good reason to judge useful has become unavailable, you have to turn to web archivers. When the page in question is the result of a web search, the easiest and most effective option is to retrieve its copy cached by the search engine. Not only is it just one click away, but you can also be sure that this is the version which the search engine has indexed, and consequently the one that does contain the search terms you requested.
When you want to follow an obsolete link, things are less easy. But you still
have a fair chance of getting at the content. The Internet Archive
mirrors large parts of the web at regular intervals, as well as other content
placed in the archive deliberately. To search for stored copies of a URL, you
use its Wayback Machine
. It also allows searching the archived content, though I
have not tried it and cannot say how useful it is.
The URL of archived web content has a simple structure. The following gets you the overview page which indicates when a given URL was archived:
http://web.archive.org/web/*/<URL>
<URL> stands for the URL of which you want an archived copy, which may but need not include the "http://". This link is a JavaScript bookmarklet that forwards you to the overview page corresponding to the current URL. Bookmark it, and you can click it to go from a "not found" notice to the archive. The URLs of the actual archived pages are very similar to the above. Just replace the asterisk by the date and time in YYYYMMDDhhmmss format. You need not get the date and time at which it was archived right — the copy from the closest date/time will be automatically returned. The internet archive's FAQ notes that there may be a delay of up to 24 months between a site being archived and it being available in the Wayback Machine.
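If you use this a lot, a small shell helper can build the overview URL for you. A minimal sketch (xdg-open assumes a Linux desktop; substitute your browser command):

# Open the Wayback Machine overview page for a given URL.
# Usage: wayback http://www.example.org/some/page.html
wayback() {
    xdg-open "http://web.archive.org/web/*/$1"
}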
A web archiving service aimed specifically at academics is Webcite®.
It is a free service that allows one to create a copy of web pages one wants to
cite, thereby ensuring the
continued availability of these sources. (See here for how to put both direct and
archived URLs into your bibliography using BibTeX.) Webcite also allows you to
search for a specific URL which anybody might have archived. Somewhat
annoyingly, considering you give keywords when archiving something, you cannot
search by keyword.
Neither the Internet Archive nor Webcite can help you if the owner of a web page wants to remove it. Both comply with explicit requests for removal. However, not everybody has heard of web archivers, so no such request may be made and an archived copy may be available. The admissibility as evidence of archived web pages is dubious, so don't rely on them as such. Furthermore, since the Internet Archive uses robots to retrieve content to archive, it obeys the robot exclusion standard and may be barred from (parts of) some sites. As a consequence, some sites are not in the archive at all.
Web archiving is still young enough that it is worthwhile to keep up with
developments. Archive Team is a loose collection of
volunteers dedicated to preserving online resources. The Wikipedia pages on
link rot
,
digital preservation
and
web archiving
may also be
helpful.
Advertising is the dark half of any information transmission medium. Rather than providing reliable information, it aims to influence and manipulate by exploiting the irrationality of human nature. To the researcher, it impedes the search for knowledge by creating background noise and false positives.
How serious a problem this is depends on how close the topic of one's research is to somebody's commercial interests. Scientific topics tend to be almost unaffected, while obtaining technical information on products is made significantly harder. The impact of embedded advertisements also depends on how good one is at ignoring blinking pictures and videos. Those whom (like me) they keep from concentrating can help themselves by uninstalling their Flash player and blocking access to known ad servers (see here for a list).
Beside plain old embedded ads, I have encountered two new types of nuisance that I had not previously heard about. After a brief word on webspam, I will take the opportunity to present them here. If you are not interested in these annoyances, you can skip this part.
Webspam refers to camouflaging advertising as web content and employing a number of techniques to make search engines send users to them. Popular methods include putting up interlinked web pages to increase their ranking, creating fake blogs, embedding misleading keywords outside the visible content of web pages, and serving different pages to search engine bots and people.
These practices, serious though they are, tend not to affect researchers directly. Search engines rightly see them as potentially fatal to their business model and as a consequence devote significant resources to combating them. In my experience they are rather successful. Furthermore, webspam is not usually a problem when searching for scientific or other objective information. Product searches are a different matter, but allow an obvious sanction: do not buy products which are advertised via webspam.
While earlier in the web's history mistyping a domain name led to a name server lookup error, these days it usually results in a page with a generic logo, many links and a search bar. Some companies specialise in acquiring domains with names similar to popular sites in order to benefit from typos and redirect unsuspecting netizens to their paying clients. I like to call this practice "click trafficking" because of its intent of harvesting clicks as opposed to the more general "domain squatting" which aims at selling the domain at a profit.
This is usually more annoying than serious when accessing a domain one knows, but can be time-consuming when searching for a domain one has heard of or simply guessed. Nonetheless, it is to be expected that inexperienced users of the web are being fooled. The web pages presented usually repeat their own domain name, which makes them similar to the original but legally above board, and present links related to what the original domain is about, in addition to the usual suspects of dating, credit and domain purchases. Examples include (current as of 10/2009):
Spoof domain | Registrant | Main link target | Original domain | Type of original |
---|---|---|---|---|
wikipeda.org | Jasper Developments Pty Ltd. | ndparking.com | wikipedia.org | Encyclopedia |
alsa-project.com | Portfolio Brains, LLC | information.com | alsa-project.org | Audio subsystem |
ecomonist.com | Web Commerce Communications Ltd. | doubleclick.net, zazasearch.com | economist.com | Newspaper |
ruby.org | Tucows Inc. | ruby.org | ruby-lang.org | Programming language |
foldoc.com | Domreg Ltd. | information.com | foldoc.org | Dictionary |
via.com | Enom Inc. | sedo.com | via.com.tw | PC hardware |
gnuplot.com | Mdnh Inc. | doubleclick.net, googlesyndication.com | gnuplot.info | Plot program |
As one can see, click trafficking provides business for a fair number of companies from the shadier parts of internet commerce. That the actual owners of the spoof domains link to sites owned by different companies (and that different owners sometimes link to the same site) suggests that they have subcontracted the handling of their clicks.
Click trafficking is a nuisance during eyeball searches, but may be more serious when using a spider for searching (see below). Automatic filtering of spoof domains may be possible, but I am not aware of any effort in that direction. Those domains which link to themselves tend to use links with extremely long query parameters. Most link to two domains at most, one of which is usually a domain broker or advertiser. The name of the registrant, on the other hand, cannot be used as a filtering criterion, as automated whois queries are banned.
My current internet service provider has hit on a racket I call "portal placement", and no doubt it is not the only one. Every time a name server lookup fails (such as when a dead link leads to a non-existent domain), the browser is redirected to a page from the ISP's favoured search portal, with the queried domain as the search term. This artificially increases traffic to the search portal and thereby increases advertising revenue.
How this is done is interesting in itself: The domain name server returns the IP address of a web server at the ISP instead of a failure notice. This web server returns a redirection to the search portal when the browser connects via HTTP. (This can be found out using the command-line downloader curl with the -v option, which makes it print the IP addresses of the servers it connects to.)
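For instance (the mistyped domain is made up; the grep just picks out the connected IP address and the redirect target from curl's verbose output):

# Request a non-existent domain and watch where the connection actually goes.
# With portal placement, DNS resolves to the ISP's own web server, which then
# answers with an HTTP redirect to its search portal.
curl -sv http://some-mistyped-domain.example/ 2>&1 | grep -E 'Trying|Location:'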
Like click trafficking, this is a minor nuisance. Because the portal's URL appears in the browser location bar, the original URL cannot be edited in case of typos. Again as for click trafficking, automatic filtering ought to be possible. The browser (or a local web proxy) must be made to ignore redirects to the portal in question or avoid contacting the redirect server in the first place. To my knowledge, no browser allows this out of the box at the moment, so this seems to require setting up a proxy.
General -- Downloaders -- Search spiders -- Custom spiders -- References
Squeezing information from the web can be quite time-consuming. Some of the effort involved is repetitive — checking that search engine results really contain the requested terms, judging each result according to additional criteria and saving the useful ones to disk or extracting specific information from them. Therefore it may help to automate some of these tasks if a large number of documents is to be processed.
As when automating a data processing task locally on a single computer, it has first to be decided which parts of it are best performed by a human and which can be automated. Unfortunately, in the case of information searches, the one part best done by a human is the one in the middle — deciding whether a source is useful. Ideally the user would be kept in the loop as a decision-maker, but that would require a hybrid between a robot and a browser that could automate some steps but also display selected pages and take user input. I am not aware of any such program. The second-best but feasible solution is to program a preliminary check designed to eliminate many false positives and store the results for later decision by the user.
Before we delve into details, a word on nomenclature: While a robot can be any automaton that uses the web, the term spider is usually reserved for robots that can follow hyperlinks and thereby traverse large parts of it. One could say that while spiders can walk around the net, robots are stuck where they hit it, just like insects that are not spiders ;). Both can be of use in information searches, depending on the task at hand, though spiders are obviously more powerful.
Before asking oneself how best to write a robot, one should consider two other
questions: when or whether to write one, and how it should behave. The answer
to the first results from the tradeoff between the time needed to write a robot
and the time it is likely to save you. Obviously there is no algorithm to
determine that, so the following will be a bit wooly. If you have been
searching manually for less than half an hour, continue doing so until you are
sure that you are not getting there. On the other hand, if you have to search
a known and large part of the web, you may want to start programming right
away. The more general the spider you write, the more often you will be able
to use it, but the more time it will also take to write. If you have clear
search criteria, a robot is likely to get good results, but if you are
"fishing" to get clues for subsequent secondary searches, human
intelligence is clearly required. Is there perhaps already a web database that
contains the information your robot would extract? Consult a list of
specialised search engines and databases (such as this)
again. Only you can answer the question whether it makes sense to write a
robot for your particular problem, but you should think about it for a few
minutes before spending half an hour either programming or searching manually.
The second question, how decent robots should behave, is more easily answered, by a combination of common sense and convention. Hosting web pages costs money, and some hosting providers limit the amount of data that can be downloaded in a given period of time. Similarly, bandwidth on the net as a whole is a common good. So your robot should strive to avoid unnecessary downloads. In fact, using a robot as opposed to a web browser can actually help to limit bandwidth, because a robot does not typically download images and other data embedded into a web page automatically. For a typical web page today, this can save a lot.
A related topic is the Robot Exclusion Standard. A robots.txt file
located at the root directory of some sites (that is, at
http://servername.bla/robots.txt) advises robots what parts of the
site to traverse and what to pass up. Though robot authors are not required to
obey it, there are several good reasons to do so: It is good netiquette. Most
web site administrators are aware that robots.txt is only advisory, so
anything your robot is not supposed to see will be hidden by additional more
effective means. And finally, the most frequent use of robots.txt is
to keep robots away from junk. So observing it often saves the robot time and
the net bandwidth. The syntax of robots.txt is described on Wikipedia and robotstxt.org
,
and there is a Perl
module
for parsing it.
One more important matter of politeness is the frequency with which a robot requests pages. An extension of the Robot Exclusion Standard is the Crawl-delay field which gives the recommended number of seconds to wait between retrievals. If the robots.txt file of a server contains it, that is what you should choose. If the keyword is not present, or if you write a small spider and do not want to burden yourself with parsing robots.txt, my advice is to wait 10 seconds. This may seem much, but be realistic: The web operates at a time scale of seconds as opposed to the microseconds of your processor or the milliseconds of your hard drive. You are not going to sit around twiddling your thumbs waiting for your robot's results. Rather, it will run in the background for half an hour or more, while you occupy yourself productively with something else. Choosing a rudely small crawl delay is not going to change that, so you might as well be polite. Ten seconds is sufficient time for a competent browser user to download a page and type in a search term in the browser's search function, so this gives you a valid line of defence should someone accuse you of overwhelming their server with your robot.
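A minimal sketch of such a polite retrieval loop (file names and URL invented; the robots.txt handling is deliberately naive, and a proper robot should use a real parser such as the Perl module mentioned above):

#!/bin/bash
# Retrieve a list of URLs politely: honour Crawl-delay if the server states one,
# otherwise wait 10 seconds between requests.
server='http://www.example.org'
delay=$(curl -s "$server/robots.txt" | grep -i '^Crawl-delay:' | head -n 1 | tr -dc '0-9')
delay=${delay:-10}                  # default to 10 seconds if none is given
mkdir -p downloaded
while read -r url; do
    wget -q -P downloaded/ "$url"   # save into a local directory
    sleep "$delay"
done < url-list.txt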
Finally, if you use web services, you have to obey their terms of service,
which sometimes ban robots. Google and Ask both do this, while some other
search engines do not. (Dogpile, a
metasearch engine that includes Google results, does not seem to ban
robots
, so this may be a way to get at Google
results with a robot.) A
company's terms of service can also demand observance of the robot rules in
robots.txt, which makes them binding rather than advisory. If access
to some pages is banned for robots in this way, the only thing to do is to
download them manually and let your robot operate on the downloaded files or
the links contained in them.
Before we proceed to robots intended for searching, let us have a look at more
general and more widespread robots, primarily those intended for automated
downloading. The most well-known of these are wget and curl
. The primary difference between
them is that wget can recursively follow hyperlinks but curl
does not. wget is therefore suited for downloading whole subtrees of
interlinked HTML documents, and automatically downloading embedded images and
the like. In return, curl allows the user to give ranges of URLs in a
notation similar to glob patterns, which is useful for retrieving numbered
documents. Both support authentication (logging in at a web page), encryption
(SSL) and bandwidth limiting. wget supports a crawl delay.
wget respects the robot rules in a server's robots.txt when downloading recursively. curl, which does not do recursion, does not. Apparently the implied interpretation is that so long as a URL is given explicitly by the user, the program does not act as a spider and need not obey the robot rules, which has a certain logic. Whether this reasoning would be accepted in court in cases where observance of robots.txt is enforced by terms of service is a different matter.
In the interest of preserving network bandwidth, it should be mentioned that
curl allows so-called HEAD requests, which do not retrieve the
document in question. This can be useful to check whether a document is there
at all, or to extract information from the HTTP headers which are returned. The following table displays the most
important options of wget and curl:
Function | wget | curl |
---|---|---|
Set output file name | -O <file> (single file only) | -O / -o <file> |
Limit bandwidth | --limit-rate=<B/s> | --limit-rate <B/s> |
Crawl delay | -w <seconds> | — |
HEAD request | — | -I |
Set user agent string | -U <string> | -A <string> |
Set referrer URL | --referer=<URL> | -e <URL> |
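To illustrate the table (hypothetical URLs):

# Check that a document exists and inspect its headers without downloading it.
curl -sI https://www.example.org/papers/preprint.pdf

# Mirror a small subtree politely, limiting bandwidth and pausing between requests.
wget -r -l 2 --limit-rate=50k -w 10 https://www.example.org/docs/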
Besides programs intended for automated downloads, text-based web browsers can
be very useful. The most widespread ones, lynx, w3m
and links
, all support outputting the rendered web page with the
option -dump. This provides a way to obtain a plain text version of a
HTML document.
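For example (hypothetical URL):

# Convert a rendered web page to plain text for further processing.
lynx -dump https://www.example.org/faq.html > faq.txt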
By now you may be wondering what place a discussion of downloaders has in a text about web research. The reason is that before a document on the web can be investigated, it has to be retrieved. Automated downloaders are therefore a required component of robots written as shell scripts. Though robots can be written in other programming languages (and frequently are), shell scripts have the advantage of being a very high level language with correspondingly low development effort. They also allow you to make use of numerous standard UNIX tools designed to process plain text files where other languages may require you to duplicate their functionality. If you intend to perform multiple investigations on a limited number of web documents, it may make sense to separate the download and investigation steps altogether. You might first retrieve a set of web pages and then program one or several tools to run queries on them. This saves bandwidth compared to downloading them every time as would be done if the queries were performed by different spiders.
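A sketch of that separation (file names invented): collect the pages once, then run as many local queries as you like without touching the network again.

# Step 1: download a fixed list of pages once.
mkdir -p corpus
wget -q -P corpus/ -i url-list.txt

# Step 2: query the local copies repeatedly, e.g. list pages mentioning both terms.
grep -l -i 'proximity' corpus/* | xargs grep -l -i 'truncation'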
Finally it should be mentioned that some sites block automated downloaders by various means, sometimes even unintentionally. If the site in question states (in its terms of service or on the pages concerned) that it bans robots, this should be respected. However, if you find no such statement after looking carefully, I regard it as legitimate to try to bypass such restrictions. In my experience blocking robots is most frequently done in two ways. Every browser and robot sends a so-called "user agent" string along with its HTTP request, which usually gives its name, version and operating system. Some servers refuse to service requests unless they originate from well-known browsers. This can be prevented by providing the user agent string of a browser using the options above. If your download task is simple enough, it may also be worth trying to use links, which has a graphical mode and therefore tends to be regarded as a "real" browser and tolerated.
The second method of discouraging robots involves the "referrer" header (often spelled "referer" following a misspelling in an RFC). This gives the URL which references a file when downloading it, such as the page which hyperlinks to a page or which embeds an image. If a URL is retrieved directly, there is no referrer, and some servers refuse to reply. (This is also used to prevent "hotlinking", putting images on other servers into one's own pages.) An acceptable referrer URL can usually be easily found out (a page containing an image, a table of contents etc.), and the downloader can be told to transmit it with the options given above.
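Both workarounds amount to one extra option for the downloaders (values and URLs are only examples):

# Present a browser-like user agent string and a plausible referrer.
wget -U 'Mozilla/5.0 (X11; Linux x86_64)' --referer=https://www.example.org/gallery.html \
     https://www.example.org/images/photo17.jpg
curl -A 'Mozilla/5.0 (X11; Linux x86_64)' -e https://www.example.org/gallery.html \
     -O https://www.example.org/images/photo17.jpg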
There are in fact a number of search spiders available on the web, from a variety of projects and individuals. I admit that I have no great experience using them, as they tend not to meet my needs for one reason or another, as described in the following.
The available search spiders differ in both abilities and purpose.
Many search spiders are intended for indexing, like the spiders of the big
search engines. Examples are Nutch,
which is affiliated with the Apache web server project, ht://Dig
, and the Aperture framework
, a Java library. Using such spiders makes most sense if
you are going to answer a
large number of queries for a limited number of documents and according to
known and limited criteria. Indexing these documents once or regularly with
respect to your criteria will save bandwidth because only the resulting
database has to be queried repeatedly. If you intend to make widely different
complex queries to unpredictably different parts of the web, indexers will not
help you. However, indexing an intranet or local files can be helpful to
obtain a preselection for further investigation. It should be mentioned that
ht://Dig supports a number of non-exact matching requirements such as
sound-alikes, synonyms and substrings.
Some other robots are downloaders like wget and curl (and
therefore not search spiders, though they can be used as part of such).
Examples for this category are Pavuk, Arale
and
Larbin
. I have not tried them out, as I have always found
wget and curl sufficient. If you miss a feature, you might
want to look at the others.
The bots section of the Searchlores site
presents a number of robots with accompanying articles. The most interesting
one for web search is W3S by Mhyst
. It
is a spider that can traverse part of the web according to criteria given by
the user and report on the number of keywords contained in each page. I have
to admit I have not yet got round to trying it out, but it has a GUI that seems
intuitive to use and is written in platform-independent Java. It seems quite
suitable for search tasks a search engine cannot perform, such as "real-time"
searches of the web and intranet. The other robots presented at Searchlores
are mostly special-purpose. Reading about them helps to learn about writing
robots, and some may fit your task at hand or could be adapted to do it.
Finally, if you have good knowledge of the web and HTML, and have demanding requirements for a search spider, I can recommend the one I have perpetrated myself, wfind. It is very powerful, but accordingly complex to use, and it takes an expert to exploit its full potential. If you are interested, read my introductory page and its manual page online.
The first choice one has to make after deciding to write a spider is what
programming language to use. This decision has the same tradeoff of
development effort versus flexibility as always. For me the answer is usually
either a shell script or Perl. Shell scripts are fine if you do little more
than downloading and do not use much control flow and data structures.
Retrievals are done via wget or curl, and standard text
utilities like grep, sed, tr, etc. can be used to
process the data and determine the next URL to visit. Programs like pdftotext, catdoc,
xls2csv
, etc. can be used to convert
non-plain-text formats. Attempting complex analysis or parsing of HTML, XML or
other plain-text based formats directly with the shell is probably not a good
idea. But be aware of the little-known fact that the shell bash
supports regular expressions:
[[ "$text" =~ $regex ]]
The code snippet above matches $text against the
extended regular expression $regex.
A further drawback of shell scripts is that you are stuck with the features
of the downloader you are using.
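To make this concrete, here is a minimal sketch of such a shell spider (all names and URLs invented). It fetches a start page, reports it if it contains a keyword, and then does the same for every absolute link on it, with a polite delay; the link extraction is deliberately crude and ignores relative URLs.

#!/bin/bash
# Minimal one-level search spider built from curl, grep and cut.
start='http://www.example.org/index.html'
keyword='proximity search'
fetch() { curl -s -A 'example-spider/0.1' "$1"; }

page=$(fetch "$start")
echo "$page" | grep -qi "$keyword" && echo "match: $start"

# Crude link extraction: absolute http(s) URLs in href attributes only.
for link in $(echo "$page" | grep -Eo 'href="https?://[^"]+"' | cut -d'"' -f2); do
    sleep 10                                   # polite crawl delay
    fetch "$link" | grep -qi "$keyword" && echo "match: $link"
done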
For more complex tasks than are comfortable to do with a shell script, I use
Perl. This choice is partly due to my knowing Perl pretty well now; I am sure
that Python or Ruby are as suitable. The most important features are the ease
of retrieving web pages and the availability of analysis features you want,
such as regular expressions. If you want to analyse non-plain-text file
formats or search by special criteria (sound-alikes, fuzzy searches etc.), it
saves you work if there are packages for that in your chosen language. Perl
has the advantage that a large number of modules are available in a
comprehensive database, CPAN.
Spiders have also been written in non-scripting languages such as C, C++ and
Java. I think these are too low-level and consequently too work-intensive and
do not add enough in the way of flexibility to be worth the extra effort. But
if you know them well and/or know a library that provides an unusual feature
you need, choose them by all means. Be aware that some script languages
(for instance Perl) offer
interfaces to C, and all (AFAIK) can execute external programs. This allows
you to program the necessary part of your task (and only that) in a lower-level
language.
Finally, a relatively unknown scripting language that deserves a mention in
connection with spidering is Rebol. The fact that URL
retrieval is part of the language makes accessing documents easy, and simple
parsing can be programmed in a very concise way.
The details of how to write your spider are something I cannot help you with. Every (search) task is different, and so every spider for a specific task will also be. However, I can give you a few caveats:
In order to learn to write a robot, it can be useful to look at existing code.
The bots section of
Searchlores has already been mentioned. The code
snippet database Snipplr
contains some snippets related to automated downloading
and searching. The open book Web Client Programming with
Perl
conveys basic understanding of how web
browsing works technically, and how to do it with a Perl script. The non-open
book "Spidering Hacks" (cited in full below) is dedicated to writing
robots in Perl, most of which are not actually spiders by my definition above. An archive of examples can be
downloaded from its homepage
.
© 2009 - 2013 Volker Schatz. Licensed under the Creative Commons Attribution-Share Alike 3.0 Germany License