wfind, a web search spider

Intro

Have you ever tried to download specific pages from a web site? To compile a list of external links from a site? To find recently modified web sites about a specific topic? Search engines can be used for all these purposes, but they are comparatively blunt instruments that require a fair amount of manual postprocessing. Their results may be out of date and may be distorted by what is euphemistically called search engine "optimisation".

wfind is a spidering program I have written to perform tasks such as those named above without human intervention. Its name and command-line usage are modelled after the UNIX find program. (I realize some may consider this user interface ancient and obscure, but it is one that people who habitually use computers to search for information are likely to have encountered before.)

wfind supports a range of tests, such as keyword and regular expression searches, optionally truncated and case sensitive. These tests can be applied to the content of a web document as well as to the URL or an HTTP header field. Numerical comparisons of header fields such as the document size and modification date are also supported. Test criteria can be combined by logical functions, including the requirement that a certain proportion of them be true. Proximity requirements can be imposed on multiple content tests to search for documents in which certain words occur close to each other.
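
To make the last two ideas concrete, here is a minimal sketch in Python of a "proportion of tests true" combinator and a word-proximity test. It is not wfind's implementation (wfind is a Perl script) and does not reflect its command-line syntax; the function names keyword_positions, at_least_fraction and within_proximity are hypothetical and chosen only for illustration, and distance is assumed to be counted in words.

```python
import re

def keyword_positions(text, keyword, case_sensitive=False):
    """Return the word indices at which keyword occurs as a whole word."""
    words = re.findall(r"\w+", text)
    if not case_sensitive:
        keyword = keyword.lower()
        words = [w.lower() for w in words]
    return [i for i, w in enumerate(words) if w == keyword]

def at_least_fraction(results, fraction):
    """True if at least the given fraction of the boolean results is true."""
    results = list(results)
    return bool(results) and sum(results) >= fraction * len(results)

def within_proximity(text, first, second, max_distance):
    """True if some occurrence of first lies within max_distance words of second."""
    a = keyword_positions(text, first)
    b = keyword_positions(text, second)
    return any(abs(i - j) <= max_distance for i in a for j in b)

if __name__ == "__main__":
    doc = "The spider follows links and tests each web document it finds."
    tests = [
        bool(keyword_positions(doc, "spider")),
        bool(keyword_positions(doc, "crawler")),
        within_proximity(doc, "spider", "links", 2),
    ]
    # The document passes if at least half of the individual tests succeed.
    print(at_least_fraction(tests, 0.5))
```

In this sketch each test yields a plain boolean, so the proportion requirement reduces to counting true results; a real tool would apply the same idea to tests on URLs and header fields as well.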

These features alone would make wfind a very powerful document tester. But it is also a spider, that is, a program that traverses hyperlinks. Besides the test criteria for result documents, it supports additional test expressions for when to stop traversing and which links to follow. The robot exclusion standard is supported, and the frequency of requests to the same server is limited.
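
The following Python sketch illustrates the two politeness mechanisms just mentioned: honouring robots.txt rules and limiting how often the same server is contacted. It is only an illustration under stated assumptions, not how wfind does it internally. The class PoliteFetchPolicy, its method names, the user agent string and the interval values are all hypothetical; the robots.txt parsing relies on Python's standard urllib.robotparser module.

```python
import time
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

class PoliteFetchPolicy:
    """Decides whether and when a URL may be requested (illustrative only)."""

    def __init__(self, user_agent, min_interval=1.0, clock=time.monotonic):
        self.user_agent = user_agent
        self.min_interval = min_interval  # seconds between requests per host
        self.clock = clock                # injectable for testing
        self.robots = {}                  # host -> parsed robots.txt
        self.last_request = {}            # host -> time of last request

    def set_robots_txt(self, host, robots_txt):
        """Store the already downloaded robots.txt content for a host."""
        parser = RobotFileParser()
        parser.parse(robots_txt.splitlines())
        self.robots[host] = parser

    def allowed(self, url):
        """True unless the host's robots.txt forbids fetching this URL."""
        parser = self.robots.get(urlsplit(url).netloc)
        return parser is None or parser.can_fetch(self.user_agent, url)

    def seconds_to_wait(self, url):
        """How long to wait before the host may be contacted again."""
        last = self.last_request.get(urlsplit(url).netloc)
        if last is None:
            return 0.0
        return max(0.0, self.min_interval - (self.clock() - last))

    def record_request(self, url):
        """Note that a request to the URL's host is being made now."""
        self.last_request[urlsplit(url).netloc] = self.clock()

if __name__ == "__main__":
    policy = PoliteFetchPolicy("demo-spider", min_interval=2.0)
    policy.set_robots_txt("example.org", "User-agent: *\nDisallow: /private/\n")
    for url in ("http://example.org/index.html", "http://example.org/private/a"):
        if policy.allowed(url):
            time.sleep(policy.seconds_to_wait(url))
            policy.record_request(url)
            print("would fetch", url)
        else:
            print("excluded by robots.txt:", url)
```

Keeping the timestamp per host rather than globally means that a spider can keep working on other servers while it waits for one of them.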

Feature summary

Obtaining and installing wfind

wfind is now primarily distributed via a git repository. Use the following command line (or point your graphical git client at the same URL):

git clone http://volkerschatz.com/repositories/wfind

Run the shell script install-from-git.sh to install the current version.

Prerequisites

Being a Perl script, wfind requires that you have a Perl interpreter installed. Depending on which file formats and document tests you want to use, it needs a number of Perl modules and/or external programs. Run wfind -status -verbose to see which are available and which you could install. The following modules, however, are always required: Cwd, File::Spec, Storable, Encode, Unicode::Normalize, Time::Local, Safe, Socket, LWP::UserAgent, URI, URI::file, URI::Split, URI::Escape, HTML::PullParser, HTML::Tagset and HTML::Entities. (The number of packages containing them is slightly smaller, and many of them are widely used and may therefore already be installed on your system.) The modules File::Glob, File::Copy, ... are also required but usually come with the Perl installation.

wfind has so far been tested only under Linux, so if you really want to use it under Windows, Cygwin is probably your best bet. Incompatibilities, particularly with non-UNIX systems, are most likely to be related to concurrency and inter-process communication. If you run into trouble, you may want to try disabling these with the option -slaves 0.

Read wfind's manual page online

