
wfind, a web search spider

Intro -- Installing -- Manual -- Outlook

Intro

Have you ever tried to download specific pages from a web site? To compile a list of external links from a site? To find recently modified web sites about a specific topic? Search engines can be used for all these purposes. But they are comparatively blunt instruments that require a fair amount of manual postprocessing. Their results may be out of date and may be distorted by what is euphemistically called search engine "optimisation".

wfind is a spidering program I have written to perform tasks such as those named above without human intervention. Its name and command-line usage are modelled after the UNIX find program. (I realize some may consider this user interface ancient and obscure, but it is one that people who habitually use computers to search for information are likely to have encountered before.)

wfind supports a range of tests, such as keyword and regular expression searches, optionally truncated and case sensitive. These tests can be applied to the content of a web document as well as the URL or an HTTP header field. Numerical comparisons of header fields such as the document size and modification date are also supported. These test criteria can be combined by logical functions, including the requirement that a certain proportion of them be true. Multiple content tests can have proximity requirements imposed to search for documents in which certain words occur close to each other.
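
To make the idea of combined tests more concrete, here is a minimal Python sketch of a keyword test, a proximity requirement and a "proportion of criteria true" combinator. It is only an illustration of the concepts, not wfind's Perl implementation or its command-line syntax; the function names keyword_positions, within_proximity and proportion_true are invented for this example.

```python
import re

def keyword_positions(text, keyword, case_sensitive=False):
    """Return the word indices at which `keyword` occurs as a whole word."""
    words = re.findall(r"\w+", text)
    if not case_sensitive:
        words = [w.lower() for w in words]
        keyword = keyword.lower()
    return [i for i, w in enumerate(words) if w == keyword]

def within_proximity(text, first, second, max_distance):
    """True if some occurrence of `first` is at most `max_distance` words from `second`."""
    a = keyword_positions(text, first)
    b = keyword_positions(text, second)
    return any(abs(i - j) <= max_distance for i in a for j in b)

def proportion_true(results, threshold):
    """True if at least the fraction `threshold` (0..1) of the results is true."""
    results = list(results)
    return bool(results) and sum(results) / len(results) >= threshold

doc = "The spider follows links and tests each web document it retrieves."

# Three independent content tests: two keyword tests and one regular expression.
checks = [
    bool(keyword_positions(doc, "spider")),
    bool(re.search(r"links?\b", doc)),
    bool(keyword_positions(doc, "crawler")),
]

print(proportion_true(checks, 0.5))                 # at least half of the tests must hold
print(within_proximity(doc, "web", "document", 1))  # the two words must be adjacent
```

Counting distance in words rather than characters is a choice made for this sketch; how wfind measures proximity is described in its manual page.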

These features alone would make wfind a very powerful document tester. But it is also a spider, that is, a program that traverses hyperlinks. Besides the test criteria for result documents, it supports additional test expressions for when to stop traversing and which links to follow. The robot exclusion standard is supported, and the frequency of requests to the same server is limited.
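
The two politeness rules mentioned above, honouring robots.txt and limiting the request rate per server, can be sketched in Python as follows. This is an illustration under assumptions, not a description of wfind's internals: the class PoliteGate, the agent name "examplebot" and the five-second delay are all hypothetical. Only the standard library module urllib.robotparser is used, and the robots.txt text is passed in directly so that the example needs no network access.

```python
import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

class PoliteGate:
    """Decides whether and when a URL may be requested (hypothetical helper)."""

    def __init__(self, agent, min_delay, clock=time.monotonic):
        self.agent = agent
        self.min_delay = min_delay   # minimum seconds between requests to one host
        self.clock = clock           # injectable so the behaviour can be tested
        self.robots = {}             # host -> RobotFileParser
        self.last_request = {}       # host -> time of the last recorded request

    def set_robots(self, host, robots_txt):
        """Register robots.txt text that has already been fetched for `host`."""
        parser = RobotFileParser()
        parser.parse(robots_txt.splitlines())
        self.robots[host] = parser

    def allowed(self, url):
        """True unless the host's robots.txt excludes this URL for our agent."""
        parser = self.robots.get(urlparse(url).netloc)
        return parser is None or parser.can_fetch(self.agent, url)

    def wait_time(self, url):
        """Seconds to wait before requesting `url` without exceeding the rate limit."""
        last = self.last_request.get(urlparse(url).netloc)
        if last is None:
            return 0.0
        return max(0.0, self.min_delay - (self.clock() - last))

    def record(self, url):
        """Note that a request to the host of `url` is being made now."""
        self.last_request[urlparse(url).netloc] = self.clock()

gate = PoliteGate("examplebot", min_delay=5.0)
gate.set_robots("example.org", "User-agent: *\nDisallow: /private/\n")

for url in ["http://example.org/index.html", "http://example.org/private/notes.html"]:
    if gate.allowed(url):
        time.sleep(gate.wait_time(url))
        gate.record(url)
        print("would fetch", url)
    else:
        print("excluded by robots.txt:", url)
```

A real spider would fetch each host's robots.txt itself and would also apply its own link-following and stop criteria before a URL ever reaches such a gate.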

Obtaining and installing wfind

Download wfind 0.1beta2 from Sept 2010 (gzipped tar archive)

The archive contains the Perl script wfind, POD documentation, an example blacklist, a copy of the GPL, and a shell script installer. Unpack it, descend into the resulting directory and run the shell script to install wfind and its manual page.

Being a Perl script, wfind requires that you have a Perl interpreter installed. Depending on which file formats and document tests you want to use, it needs a number of Perl modules and/or external programs. Run wfind -status -verbose to see which are available and which you could install. But the following modules are always required: Cwd, File::Spec, Storable, Encode, Unicode::Normalize, Time::Local, Safe, Socket, LWP::UserAgent, URI, URI::file, URI::Split, URI::Escape, HTML::PullParser, HTML::Tagset and HTML::Entities. (The number of packages containing them is slightly smaller, and many of them are widely used and may therefore already be installed on your system.) The modules File::Glob, File::Copy, ... are also required but usually come with the Perl installation.
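
If you would like to check the always-required modules before installing, a small script along the following lines may help. It is a sketch and not part of the wfind distribution: it relies on the common idiom of running perl -MModule -e 1 and looking at the exit status, and the names perl_has_module and missing_modules are made up for this example. Once wfind is installed, its own -status -verbose report is the authoritative source.

```python
import subprocess

# The modules listed above as always required.
REQUIRED = [
    "Cwd", "File::Spec", "Storable", "Encode", "Unicode::Normalize",
    "Time::Local", "Safe", "Socket", "LWP::UserAgent", "URI", "URI::file",
    "URI::Split", "URI::Escape", "HTML::PullParser", "HTML::Tagset",
    "HTML::Entities",
]

def perl_has_module(module):
    """True if the local Perl can load `module` (perl -MModule -e 1 exits with 0)."""
    try:
        result = subprocess.run(
            ["perl", "-M" + module, "-e", "1"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except FileNotFoundError:
        # No perl executable on the PATH at all.
        return False
    return result.returncode == 0

def missing_modules(modules, has_module=perl_has_module):
    """Return those entries of `modules` for which `has_module` reports failure."""
    return [m for m in modules if not has_module(m)]

if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    if missing:
        print("Missing Perl modules:", ", ".join(missing))
    else:
        print("All always-required modules were found.")
```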

wfind has been tested only under Linux so far, so if you really want to use it under Windows, Cygwin is probably your best bet. The primary source of incompatibilities, particularly with non-UNIX systems, is likely to be concurrency and inter-process communication. If you run into trouble, you may want to try disabling concurrency with the option -slaves 0.

Read wfind's manual page online

Outlook

Though I am glad to have brought wfind to a point where it provides what I wanted, and is quite useful and powerful, much more could (and probably should) be done. As many a reader will be itching to point out, the find-like interface is ancient and obscure. Besides, many spidering tasks which are not link-recursive searches could use components embedded in wfind. These can be performed by wfind if one places sufficient restrictions on which links to follow, but that is a kludge. It would be much better to have other interfaces to the same low-level retrieval, decoding and testing functionality.

That is why in the long run wfind will have to be refactored into a largish number of classes containing its core functionality and a much shorter main script doing the recursive search. Other scripts could then be written (by me as well as others) that use those same classes. I envisage a spider controlled by a domain-specific formal language suitable for the purpose. But for the time being I am working on other projects, so the refactoring will have to wait for a bit.