wfind - find and grep for the web
wfind URLs... [--] [options] tests
wfind is a generic web search spider. In a syntax similar to the UNIX
find (1) command, it allows you to search internet documents, recursively
following hyperlinks. Given the appropriate Perl modules and/or external
programs, it can process HTML, plain text, PDF, PostScript, RTF, troff and MS
Office formats. wfind can also search local directories, much as
find ... -exec grep ... would.
wfind's applications include postprocessing search engine results (especially of search-it-yourself site searches), checking web pages for dead links, and full text search of manual pages or other documentation. It supports regular expressions, phonetic search, logical operators and proximity search.
The first argument(s) of wfind are the URLs to start from. These can be
file://, http(s):// or ftp(s):// URLs or relative or absolute paths.
The paths accepted have some restrictions to avoid ambiguities with test
expressions (see below). For that reason, absolute paths on UNIX systems should
preferably start with //, and relative paths must start with ./ .
Alternatively, the marker -- (two hyphens) can be used to separate URLs from
test expressions and options. It must occur only once on the command line and
forces arguments to its left to be interpreted as URLs and those to its right
as tests or options.
The pseudo-URL - (a single hyphen) enables wfind to read URLs from
standard input, which allows generating them with a different program.
wfind supports several ways to specify what words, phrases or expressions to search for. The most important difference is between those which are word-oriented and those which are not. The former are intended for searching for phrases in human-readable text. Their search patterns are split into words of letters and digits, and all non-word characters serve as word separators and are silently discarded. (This tolerance makes it easier to copy and paste search phrases.) Characters with accents and diaereses are converted to their base characters before comparison, both in the pattern and in the text. This is similar to the behaviour of web search engines. Non-word-oriented tests are performed on the "raw" text, which may still have sequences of spaces converted to single spaces and characters and punctuation converted to their ASCII equivalent (see Regular expressions and Unicode below for details).
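The word-oriented preprocessing described above can be pictured with a short Python sketch. It is only an illustration under stated assumptions, not wfind's actual (Perl) code: the function names are hypothetical, accents are removed via Unicode decomposition, and everything is lower-cased to mimic the case-insensitive default.

```python
import re
import unicodedata

def strip_accents(text):
    """Replace accented characters by their base characters (assumed method)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def to_words(text):
    """Split text into lower-cased words made of letters and digits."""
    # [^\W_]+ matches runs of letters and digits; all other characters
    # act as word separators and are discarded.
    return re.findall(r"[^\W_]+", strip_accents(text).lower())

def contains_phrase(pattern, text):
    """True if the words of pattern occur consecutively in text."""
    needle, haystack = to_words(pattern), to_words(text)
    n = len(needle)
    return n > 0 and any(
        haystack[i:i + n] == needle for i in range(len(haystack) - n + 1)
    )
```

Because pattern and text pass through the same normalisation, a phrase pasted with stray punctuation or without accents still matches.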
Search expressions are specified by a punctuation character indicating the type of match, the phrase to be searched for, and optionally the same indicator character followed by flags. The following search expressions are supported:
The simplest match type is the glob pattern. Its indicator character is an
equals sign, which is optional if no flags are given. The wildcard characters
? and * can be used to denote a single word character and an arbitrary
number of them, respectively. (Remember to prepend them with a backslash or
to enclose them in single quotes to prevent the shell from interpreting them.)
Character ranges or multiple subpatterns enclosed in braces are not allowed by
wfind. If the pattern contains no wildcard characters, it is just a search
for a string of words, which can be given without any markup in one
command-line argument to wfind. Two examples:
"these three words"
=Wor\*=I
In the first example, double quotes were used to tell the shell to pass all three words as one command-line argument. wfind does not get to see the quotes. The second example searches for all capitalised words starting with "Wor". Glob pattern searches are case insensitive by default, but the I flag makes them case sensitive.
The other supported flag is v, the verbatim flag. It turns the word search into a tolerant substring search which is not word-oriented. The "tolerance" of the search consists in that spaces are allowed before and after all non-word characters, and that Unicode characters compare as equal to their base character (unlike in verbatim regular expressions, see below). Wildcard characters still stand for word (letter or digit) characters only.
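As a rough illustration of word-oriented glob matching, the following Python sketch translates a bare glob pattern (without the = indicator or flags) into a regular expression. It is not taken from wfind: the function name is hypothetical, * is assumed to match zero or more word characters, and accent folding is left out for brevity.

```python
import re

WORD_CHAR = r"[^\W_]"   # exactly one letter or digit

def glob_to_regex(pattern, case_sensitive=False):
    """Compile a word-oriented glob pattern into a regular expression."""
    # Words consist of letters, digits and the wildcards ? and *;
    # every other character only separates words.
    words = re.findall(r"(?:[^\W_]|[?*])+", pattern)
    pieces = []
    for word in words:
        piece = ""
        for ch in word:
            if ch == "?":
                piece += WORD_CHAR          # a single word character
            elif ch == "*":
                piece += WORD_CHAR + "*"    # any number of word characters
            else:
                piece += re.escape(ch)
        pieces.append(piece)
    body = r"[\W_]+".join(pieces)           # any separators between words
    flags = 0 if case_sensitive else re.IGNORECASE
    # The lookarounds make sure the match starts and ends at word limits.
    return re.compile(r"(?<![^\W_])" + body + r"(?![^\W_])", flags)
```

Calling glob_to_regex("Wor*", case_sensitive=True) corresponds to the second example above: it accepts "World" but not "world" or "swords".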
Regular expressions are a mainstay of several UNIX utilities and of Perl, so it
is no surprise that wfind supports them too. Their indicator character is a
slash /, and the required syntax is that of Perl regular expressions. The
given regex is used largely unmodified, so unfiltered input from untrusted
sources should not be passed to wfind as a regex.
wfind takes regexes to be case insensitive by default. The flag I given after the concluding slash makes them case sensitive. Regex searches are never word oriented, so will be applied to the complete text of a web page, not to individual words. In particular, punctuation will be kept and can be matched. Unless the v (verbatim) flag is given, some simplifications will still be made before the regex match: accented and similarly modified characters are converted to their base character, and exotic punctuation converted to the ASCII equivalent (see Unicode below for details). So the first of the following regexes will match the German name of the city of Munich, but the second will not match the phrase "cause celebre" in French texts:
/Munchen/I
"/\bcause celebre\b/v"
wfind will always add the s flag to the Perl regular expression it uses
so that newline characters are treated as any kind of white space. The
characters ^ and $ denoting the beginning and end of the string should
not be used in the regex, because the document may be matched part by part (see
How wfind processes documents below). The regex may contain additional
slashes between the two delimiters, which are matched and should not be
escaped.
The Levenshtein distance between two words is the number of one-character modifications, additions or removals it takes to convert one into the other. wfind supports approximate searches for words within a given Levenshtein distance. The indicator character is a tilde ~, and the concluding tilde can be followed by a specification of the maximum tolerated distance. This can either be a number (the distance itself) or a percentage. The latter will be interpreted as the ratio of the tolerated distance relative to the maximum possible distance, which is the maximum of the lengths of the two words being compared. If no maximum distance is given, a relative distance of 20% is used. Some examples:
'~levenstine~3'
'~can wizards spell~30%'
The first search just matches the correctly spelt "Levenshtein", whose distance from the pattern is exactly 3 (insertion of the h, insertion of an e before the i, removal of the final e). The second example searches for a sequence of three words. The 30% threshold is applied to each word comparison individually.
Like the other searches, Levenshtein equivalence is case insensitive by default. To switch on case sensitivity, the I flag has to be added before or after the distance specification.
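The distance and the two kinds of threshold can be sketched in Python as follows. This is a minimal illustration, not wfind's implementation; the function names and the string form of the limit are assumptions, and the percentage is compared without rounding.

```python
def levenshtein(a, b):
    """Number of one-character edits needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # removal
                           cur[j - 1] + 1,             # addition
                           prev[j - 1] + (ca != cb)))  # modification
        prev = cur
    return prev[-1]

def within_distance(word, candidate, limit="20%", case_sensitive=False):
    """Compare two words against an absolute or relative distance limit."""
    if not case_sensitive:
        word, candidate = word.lower(), candidate.lower()
    dist = levenshtein(word, candidate)
    if limit.endswith("%"):
        # Relative limit: percentage of the longer word's length.
        longest = max(len(word), len(candidate))
        return dist * 100 <= float(limit[:-1]) * longest
    return dist <= int(limit)
```

With a relative limit, short words tolerate fewer errors than long ones: at 30%, a three-letter word must match exactly, while a five-letter word may differ in one character.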
Soundex codes are a phonetic encoding of words inspired by their pronunciation
by English speakers. When wfind finds search words prefixed with a dollar
sign $, it converts them to their soundex codes and searches for sequences
of words which have the same codes. Some words have no soundex code (in that
case, wfind aborts with an error message), and many different words have the
same code. The single characters ? and * can be used as wildcards which
match any word. The search for soundex-equivalent words takes no flags, so the
dollar sign need not be repeated after the word(s) (but may be). Remember to
escape the dollar sign from the shell, as demonstrated in these examples:
\$word
'$three * mice$'
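For readers unfamiliar with soundex, here is a small Python sketch of the classic American Soundex rules and of a word-sequence comparison with wildcards. Which soundex variant wfind uses is not stated here, so treat the details (and all function names) as assumptions.

```python
import re

CODES = {}
for digit, letters in enumerate(["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], 1):
    for letter in letters:
        CODES[letter] = str(digit)

WILDCARD = object()   # stands for the pattern words ? and *

def soundex(word):
    """Classic four-character soundex code, or None if there is none."""
    letters = [c for c in word.lower() if "a" <= c <= "z"]
    if not letters:
        return None
    code = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "hw":
            continue                 # h and w do not separate equal codes
        digit = CODES.get(c, "")
        if digit and digit != prev:
            code += digit
        prev = digit                 # vowels reset the previous code
    return (code + "000")[:4]

def soundex_match(pattern, text):
    """True if text contains a word sequence sounding like pattern."""
    wanted = []
    for word in pattern.split():
        if word in ("?", "*"):
            wanted.append(WILDCARD)
        else:
            code = soundex(word)
            if code is None:
                raise ValueError("no soundex code for %r" % word)
            wanted.append(code)
    codes = [soundex(w) for w in re.findall(r"[^\W_]+", text)]
    n = len(wanted)
    return any(
        all(w is WILDCARD or w == c for w, c in zip(wanted, codes[i:i + n]))
        for i in range(len(codes) - n + 1)
    )
```

Under these rules "Robert" and "Rupert" share a code, which shows how coarse the equivalence is.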
Several search expressions next to each other are combined with the logical conjunction (AND) by default. To specify a different logical function, you can use the following operators, which are compatible with the find(1) command. They are listed in decreasing order of precedence.
( expression )
Parentheses force precedence. Remember to escape these from the shell.
-not
Logical negation.
-and, -a
Logical conjunction. This is the default if two tests are found side by side without an explicit operator.
-or, -o
Logical disjunction.
In addition to the operators above, the option -require allows defining generalised logical operators. It is described in the Subexpression options section below.
There are situations where you might want to use wfind in several stages. For instance, you may first want to sift through a list of URLs based on one set of keywords and then search through links from the hits using different criteria. Or you may want to search through external links from local documentation pages which themselves match some criteria. wfind's pipe sections provide the means to do that.
Pipe sections are notionally similar to pipes of commands on your shell command-line. Just as the shell feeds the output of one command into the input of the following command in a pipe, wfind uses the hits of the tests of one pipe section as the starting URLs of the following one. Successive pipe sections are separated by the pipe character, which has to be escaped from the shell:
wfind URLs test1 \| test2 \| test3
This would be the syntax for three pipe sections. See Advanced usage below for examples.
Not only can different pipe sections have different search criteria, they can also differ in some options. By default, options given for the first pipe section will be reused in the following section(s), but explicitly given options for that section will override them, and will stay active for the following sections. Not all options are scoped by pipe section. Those that are mostly concern how wfind follows links recursively; see Pipe section options below.
wfind's options (command-line switches) determine the details of its behaviour. Test options prepended to test expressions cause the tests to be applied to something other than a document's contents. Other options influence the following of hyperlinks, reporting of results and other things.
Options may be uniquely abbreviated. Boolean-valued options take no arguments; they are set to true by the presence of their command-line switch, and to false by prepending the option name with "no". They default to false unless documented otherwise below.
Test options are part of the boolean expression which determines if a document will be reported as a result. Other options set parameters whose scope may be global, the current pipe section, or the subexpression in which they are given. Options scoped by pipe section serve as the default for the following pipe sections too, so they need not be repeated, but may have to be reset explicitly.
The following table gives a brief description of all command-line switches in alphabetical order. The character in the second column gives the type of the option: O for operators, T for test options, G for global options, P for pipe section options and L for locally scoped options (subexpression options). Operators are described in the Operators section above; a detailed description for the other options can be found in the corresponding section below.
--             G  End of URL arguments (must not occur later)
-and, -a       O  And operator
-asplain       P  Parse all ASCII files as plain text
-charnear      L  Proximity requirement in characters
-decompress    G  Store decompressed result documents in current directory
-depth         P  Recursion depth
-download      G  Store result documents in current directory
-downxform     G  Transform URL to save file name
-echo          G  Echo arguments and exit
-exif          T  Test EXIF metainformation in image or audio files
-false         T  Always false
-follow        P  HTML tags containing links to follow
-gdepth        G  Global recursion depth
-header        T  Test HTML, PDF or OLE document header
-holdoff       G  Holdoff time between downloads
-httpheader    T  Test HTTP header
-linksto       T  Test hyperlink URLs
-linktags      P  HTML tags relevant for -linksto
-linktext      P  Follow only hyperlinks with the given text
-max           P  Maximal number of results
-maxsize       G  Maximal download size
-modified      T  Test Last-Modified time
-name          T  Test document name
-nostart       P  Disqualify start URLs
-not           O  Negation operator
-or, -o        O  Or operator
-pessimistic   P  Discard ambiguous results
-plainurls     P  Extract non-hyperlinked URLs
-print         G  What to output
-redirects     G  Maximum number of HTTP redirects
-require       L  Generalised logical operator
-rsize         T  Test file size after download
-rtype         T  Test MIME type after download
-samedomain    P  Follow hyperlinks to same domain only
-silent        G  No warnings
-size          T  Test file size as reported by server
-slaves        G  Number of slave processes to spawn
-status        G  Print capabilities and exit
-symlinks      P  Follow symbolic links in filesystem
-timeout       G  Connection timeout
-transform     P  Transform result URLs
-true          T  Always true
-type          T  Test MIME type given by server
-uagent        G  User agent string
-unrestricted  P  Follow hyperlinks from files to WWW
-url           T  Test document URL
-urlgroup      T  Test document URL against URL group
-verbose       G  More detailed output
-version       G  Print wfind version
-while         P  Follow hyperlinks while expression is true
-wordnear      L  Proximity requirement in words
When wfind encounters a search expression on its own, it searches for it in a document's content. To search for documents with a certain URL, size or headers, you can use the following options. Many of them are followed by parameters detailing what to search for in the data they indicate. For those that take search expressions as parameters, only primitive expressions are allowed, that is without operators. The test options can be combined with content tests and each other using operators (see Operators above).
-exif keyexpression valueexpression
True if a file with embedded EXIF metadata has an EXIF tag matching keyexpression with a value matching valueexpression. Typically image files from digital cameras and sometimes audio files have embedded EXIF data, but this test is not useful for text document formats. For files without EXIF data, or if no tag matches keyexpression, the result is undecided, and the value of the complete test expression may depend on -pessimistic. If multiple tags match keyexpression, one matching value is enough to make the result true (OR semantics).
-false
Always false.
-header keyexpression valueexpression
True if a document header with a key matching keyexpression has a value matching valueexpression. This test applies only to HTML, PDF and MS OLE documents. For other documents, or if no key matches keyexpression, the result is undecided, and the value of the complete test expression may depend on whether -pessimistic is given. If several headers have matching keys, one matching value is enough to make this test true (OR semantics).
-httpheader keyexpression valueexpression
True if a HTTP header received from the server has a key matching keyexpression and a value matching valueexpression. wfind will usually try to decide this test using a HEAD request which only retrieves the HTTP headers, and put off downloading the document. If no header key matches keyexpression, the result is undecided, and the value of the complete test expression may depend on whether -pessimistic is given. If several headers have matching keys, one matching value is enough to make this test true (OR semantics).
-linksto expression
True if one of the hyperlinks extracted from the document matches expression. Also applies to URLs in the text if -plainurls is given.
-modified [+|-]time
Compare modification time from the "Last-Modified" field of the HTTP header.
The result is true for an older document if + is given, for a newer document
if - is given, or a document of the specified modification date otherwise.
time may be given as a plain number followed by m or d or as a time
and date. The former is interpreted as the age of the document in minutes or
days; the suffix d is optional. If a date is given, it is parsed using
Date::Parse::str2time; see the Date::Parse manpage for the allowed syntax.
wfind will usually try to decide this test using a HEAD request which only retrieves the HTTP headers, and put off downloading the document. Sadly, most servers simply return the current time and date as the "Last-Modified" time. wfind detects this, and marks the test as undecided. The result of the complete test expression may then depend on whether -pessimistic is given. All comparisons of the last-modified time are performed with a tolerance of 3 minutes in either direction to allow for variations of the clocks of different computers on the web. For local files, the comparison is exact.
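The age form of the -modified argument can be illustrated with a short Python sketch. It covers only plain numbers with an optional m or d suffix, not the date syntax handled by Date::Parse. How exactly the 3-minute tolerance enters each comparison is an assumption made for this example, as are the function names.

```python
TOLERANCE = 3 * 60   # seconds; tolerance described for remote documents

def parse_age(spec):
    """Return (sign, age in seconds) for specs such as '+30m', '-2d' or '7'."""
    sign = ""
    if spec[0] in "+-":
        sign, spec = spec[0], spec[1:]
    unit = 60 if spec.endswith("m") else 86400   # days are the default unit
    return sign, float(spec.rstrip("md")) * unit

def modified_matches(spec, age_seconds, tolerance=TOLERANCE):
    """Compare a document's age with the given specification."""
    sign, wanted = parse_age(spec)
    if sign == "+":                       # older documents
        return age_seconds >= wanted - tolerance
    if sign == "-":                       # newer documents
        return age_seconds <= wanted + tolerance
    return abs(age_seconds - wanted) <= tolerance
```

For local files, where the comparison is exact, the same sketch would be called with tolerance=0.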
-name expression
Apply test expression to document name (file name or last part of the URL). This test does not require downloading the document.
-rsize [+|-]number[k|M|G]
Compare size of downloaded file. The size is determined after decompression.
If the given number of bytes is prepended by +, the result is true for files
at least as large, in the case of - for files at most as large, otherwise
the size has to match exactly. The suffixes k, M or G (case
insensitive) allow abbreviating kilobytes, megabytes and gigabytes. (That's
informatics kilobytes etc., which are powers of 2, not marketing kilobytes
etc.)
-rtype expression
Apply test expression to the MIME type of the document. The MIME type is
determined using the File::MimeInfo::Magic module or the file command and
is therefore consistent across servers.
-size [+|-]number[k|M|G]
Compare size of remote file as given by the Content-Length HTTP header with
given number in bytes. If prepended by +, the result is true for files at
least as large, in the case of - for files at most as large, otherwise the
size has to match exactly. The suffixes k, M or G (case insensitive)
allow abbreviating kilobytes, megabytes and gigabytes. (That's informatics
kilobytes etc., which are powers of 2, not marketing kilobytes etc.) wfind
will usually try to decide this test using a HEAD request which only retrieves
the HTTP headers, and put off downloading the document.
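The size syntax shared by -size and -rsize can be sketched in Python as follows. This is an illustration only: the function names are hypothetical, and the number is assumed to be a non-negative integer.

```python
import re

# Binary multipliers, as described above (powers of 2).
MULTIPLIERS = {"": 1, "k": 2 ** 10, "m": 2 ** 20, "g": 2 ** 30}

def parse_size(spec):
    """Return (sign, size in bytes) for specs such as '+10k' or '1M'."""
    match = re.fullmatch(r"([+-]?)(\d+)([kKmMgG]?)", spec)
    if match is None:
        raise ValueError("invalid size specification: %r" % spec)
    sign, number, suffix = match.groups()
    return sign, int(number) * MULTIPLIERS[suffix.lower()]

def size_matches(spec, size):
    """Apply the comparison selected by the sign to a size in bytes."""
    sign, wanted = parse_size(spec)
    if sign == "+":
        return size >= wanted     # at least as large
    if sign == "-":
        return size <= wanted     # at most as large
    return size == wanted         # exact match
```

Note that "1M" means 1048576 bytes here, so a file of 1000000 bytes does not match it exactly.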
-true
Always true.
-type expression
Apply test expression to the MIME type of the document as given by the Content-Type HTTP header. wfind will usually try to decide this test using a HEAD request which only retrieves the HTTP headers, and put off downloading the document. Unfortunately different servers can have different opinions on the MIME type of a given document format. The MIME type should usually allow for compression (i.e. give the uncompressed MIME type), but only does that if the server can discern the enclosed format. The similar -rtype test downloads the document to determine the MIME type on the local host, which provides consistency across servers.
-url expression
Apply test expression to document URL. This test does not require downloading the document.
-urlgroup name
Test if the document URL is part of the URL group name (see THE CONFIGURATION DIRECTORY below). This test does not require downloading the document.
Two more options exist which are related to tests: -linktext and -while. Both have tests as their arguments which determine which links are followed. They may occur only once in every pipe section and are listed among the Pipe section options below.
--
This is not really an option. It serves as a separator between URL arguments and test / option arguments in cases when they could be confused. It may occur only before all other tests and options.
-decompress
Store those documents which match the criteria to the current directory, after decompression or other decoding. This implies -download.
-download
Store those documents which match the criteria to the current directory. Documents are stored exactly as they were received from the server, without decompressing or otherwise decoding them. The output file name is the same as the file name part of the URL. If it already exists, a number is appended before the file extension. There may be a race condition if two instances of wfind run in the same directory and simultaneously try to store a file with the same name, in which case one may overwrite the other.
-downxform expression
Expression transforming the result URLs into a file name to save the downloaded
document to. This implies -download. expression is a Perl expression
that operates on $_. If the resulting $_ is the empty string or
undefined or an error occurs, wfind will construct a file name based on the
URL as usual. The Perl features the code fragment can use are restricted, as
for -transform (see below).
-echo
Perform no search, just print the start URLs, options and search expressions. Both the options specified on the command line and the default options are printed. Useful to check wfind's interpretation of your command line.
If the -verbose option is also given, both every URL argument and all its corresponding canonicalised URLs are printed. (There is only one unless the argument contained a glob pattern.) Without -verbose, at most the first three URLs are printed for a glob pattern, and only the canonicalised form is printed for single URLs. The URL arguments are always printed in the order in which they were given on the command line.
-gdepth
Global maximum search depth. The search depth is 0 for every URL given on the command line and is increased by one for every hyperlink followed or (for directory searches) every directory hierarchy. The search is continued until either the maximum depth of this pipe section (see -depth) or the global maximum depth have been exceeded. wfind's depth counting differs from that of the find(1) command in that it regards files as at the same depth as the directory where they are located. So a maximum depth of 0 for a directory search may return files from the directory given on the command line.
-holdoff [+|-]seconds
Set the delay between successive retrievals from the same server. The sign
determines how the delay applies to HTTP and FTP retrievals. For +, the
delay is imposed between any successive retrievals from the same server.
Otherwise, it applies for each protocol separately, i.e. one HTTP and one FTP
retrieval may be in progress simultaneously. The - sign allows the delay
for HTTP to be overridden by the Crawl-Delay entry of the robots.txt file of
the server if present. The default holdoff is "-10", 10 seconds with separate
accounting and override allowed.
The holdoff time is applied only to recursively visited HTTP resources on the same server. It is not applied to multiple URLs on the same server given on the command line, to HTTP HEAD requests or to retrievals of robots.txt.
-maxsize number[k|M|G]
Set the maximum size of retrievals, in bytes. A useful safeguard to avoid
wasting bandwidth and processing time if links to source archives or other large
data abound in the vicinity of your search, and/or if a server has flaky MIME
type reporting. The suffixes k, M or G (case insensitive) allow
abbreviating kilobytes, megabytes and gigabytes. (That's informatics kilobytes
etc., which are powers of 2, not marketing kilobytes etc.) Unlike -size,
this does not disqualify a larger document. The given number of bytes will
still be retrieved, and all tests performed as usual. The default is no limit.
-print
Determine what should be printed out for each result of the last pipe section. The argument is a list of keywords separated by commas and/or spaces. If spaces are used, the list has to be quoted so that it remains a single command-line argument. The following keywords are allowed, and will be printed in the following order:
all    all of the below
url    URL (including file:// for local files)
furl   file name or URL (this is the default)
type   MIME type
size   size of the document
time   modification time
trace  chain of hyperlinked URLs which led to the result
The type and size output data may come from different sources depending
on whether the document in question had to be retrieved. wfind will
perform a HEAD request if necessary to obtain the size or MIME type from the
HTTP headers but will not retrieve the document just for printing the size
or type. If the document was retrieved to perform content tests, on the other
hand, the actual size will be reported, and the MIME type will be determined by
wfind if the File::MimeInfo::Magic module or the file program are
available. For the output format of these fields, see below under Result reporting.
-redirects
Set the maximum number of redirects for the Perl WWW library to follow. If not given, the default is left unchanged (probably 7; see the LWP::UserAgent manpage for your version's default).
-silent
Suppress all warnings. Because wfind performs only one pass over its command-line arguments, this option will not affect warnings related to arguments which precede it.
-slaves
Number of slave processes to use for downloading and testing documents. Using several slaves makes wfind robust against being stalled by slow servers. Creating the slave processes and communicating with them requires some low-level Perl functions inspired by UNIX systems and may therefore cause portability trouble. If you experience unexpected behaviour, try setting -slaves to 0, so that wfind works sequentially and does not fork.
-status
Print which document formats and tests wfind can support with the Perl modules and external programs installed on your system. With -verbose, indicates which modules or programs are used to process which formats and suggests what you can install to enable additional features. If other options or URLs are given, they are ignored. wfind tries to find modules and external programs on each run and therefore can use them as they become available. See Perl modules and external programs below for details.
-timeout
Set the timeout for the Perl WWW library. If not given, the default is left unchanged (possibly 180 seconds; see the LWP::UserAgent manpage for your version's default).
-uagent
Set user agent string. This is how wfind identifies itself to the server at every HTTP request.
-verbose
Enables verbose error reporting, such as messages when a document could not be retrieved or parsed. Overrides -silent if that option is also given, irrespective of the order. Also makes the output of -echo or -status more detailed; see there.
-version
Print wfind version and exit.
These options are scoped by pipe section. They are mostly about how to parse documents and which links to follow. Options of one pipe section continue to apply to the following sections unless they are explicitly reset.
-asplain
Treat non-binary documents as plain text when searching the content. This
allows for instance to search HTML documents for the occurrence of specific tags,
such as img. The extraction of link URLs is unaffected by this switch, so
wfind will still recognise and follow HTML hyperlinks. See -plainurls
for how to follow plain-text URLs.
-depth
Maximum search depth for this pipe section. The search depth is 0 for every URL given on the command line or resulting from the previous pipe section and is increased by one for every hyperlink followed or (for directory searches) every directory hierarchy. The search is continued until either the maximum depth of this pipe section or the global maximum depth (see -gdepth) have been exceeded. wfind's depth counting differs from that of the find(1) command in that it regards files as at the same depth as the directory where they are located. So a maximum depth of 0 for a directory search may return files from the directory given on the command line.
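The depth counting for hyperlinks can be pictured as a breadth-first traversal with a depth limit. The Python sketch below works on a hypothetical in-memory link graph (a dict mapping each URL to the URLs it links to) and uses a single limit. wfind's real traversal order and its bookkeeping for the two limits are not shown here.

```python
from collections import deque

def crawl(start_urls, links, max_depth):
    """Return {url: depth} for all URLs reachable within max_depth hops."""
    depth_of = {url: 0 for url in start_urls}   # start URLs have depth 0
    queue = deque(start_urls)
    while queue:
        url = queue.popleft()
        if depth_of[url] >= max_depth:
            continue                            # do not follow links further
        for target in links.get(url, ()):
            if target not in depth_of:          # visit every URL only once
                depth_of[target] = depth_of[url] + 1
                queue.append(target)
    return depth_of
```

A limit of 0 therefore tests only the start URLs themselves, and a limit of 1 adds the documents they link to directly.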
-follow [+|-]list
Tag attributes to use for following hyperlinks. The argument is a list of
tag-attribute pairs of the form tag.attribute separated by commas and/or
spaces. (If you use spaces, put the list in quotes; it must be contained in
one command-line argument.) The list of allowed tags and attributes is
generated at runtime using the HTML::Tagset module. You can view the
default setting and a list of allowed attributes with the following commands,
respectively:
wfind -echo
wfind -echo -follow all
If only one attribute of a tag can contain a URL, it is sufficient to give the
tag name, without a dot and attribute name. In addition to the explicit
tag/attribute pairs, the keyword all is allowed to select all of them, as
demonstrated above. If the list is prepended with a plus sign, the given
tag/attribute pairs are added to the list, with a minus sign they are removed,
and otherwise the list is set to exactly the given tags/attributes.
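The three ways of modifying the list (add, remove, replace) amount to simple set operations. The Python sketch below is an illustration with hypothetical names. The set of allowed tag.attribute pairs is passed in by the caller instead of being taken from HTML::Tagset, and a bare tag name is expanded to all allowed attributes of that tag, which coincides with the documented behaviour when the tag has only one.

```python
import re

def apply_list(current, spec, all_pairs):
    """Return the new set of tag.attribute pairs after applying spec."""
    mode = ""
    if spec[:1] in ("+", "-"):
        mode, spec = spec[0], spec[1:]
    items = set()
    for name in re.split(r"[,\s]+", spec.strip()):
        if not name:
            continue
        if name == "all":
            items |= all_pairs                   # keyword selecting everything
        elif "." in name:
            items.add(name)                      # explicit tag.attribute pair
        else:
            # Bare tag name: every allowed attribute of that tag.
            items |= {p for p in all_pairs if p.split(".")[0] == name}
    if mode == "+":
        return current | items                   # add to the list
    if mode == "-":
        return current - items                   # remove from the list
    return items                                 # replace the list
```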
-linktags [+|-]list
Tag attributes to take into account for -linksto tests. The argument format is exactly the same as for -follow above.
-linktext
Follow only links whose text matches a search expression. The following test expression may not contain test options (see Test options), only content tests and operators. If multiple tests are given, they have to be enclosed in parentheses, or only the first will be assigned to -linktext. See also -while, which allows more freedom but has slightly different semantics.
-max
Maximum number of results to obtain for this pipe section. Note that due to varying server reply times, the actual results obtained may not be deterministic.
-nostart
Do not consider start URLs of the current section as candidates for search results, only follow their links. The start URLs of the first pipe section are those given on the command line; the start URLs of the other pipe sections are the result URLs of the previous section. The negative of -nostart is -nonostart.
-pessimistic
Be pessimistic about test result. By default, wfind will give URLs the benefit of doubt if the result of a test cannot be decided. For instance, the user may search for web pages which have been modified recently, or which have certain values for a given HTML header. But many servers return bogus Last-Modified dates, and not all web pages may have the required HTML header. wfind includes the URL in the result list if the test cannot be decided for such reasons. Note that it is not the value of the individual -modified or -header test which will be true by default, but that of the complete test expression. So for instance it does not matter whether an undecidable test is negated; and if its value is irrelevant due to the values of other tests, the result is unchanged.
-pessimistic changes that default. If the test result for a URL is undecided, it will now be discarded. The switch also applies to -while tests, when wfind decides whether to continue following links.
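One way to picture undecided tests is three-valued logic, in which a test is true, false or undecided. The Python sketch below models this with None for "undecided". It is an assumed model that reproduces the behaviour described above, not a description of wfind's internals, and the function names are hypothetical.

```python
UNDECIDED = None   # third truth value besides True and False

def t_not(a):
    """Negation: an undecided test stays undecided."""
    return None if a is None else not a

def t_and(a, b):
    """Conjunction: a false operand decides the result."""
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return True

def t_or(a, b):
    """Disjunction: a true operand decides the result."""
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return False

def is_result(value, pessimistic=False):
    """Decide whether a URL is reported, given the complete expression."""
    if value is None:
        return not pessimistic    # benefit of the doubt unless pessimistic
    return value
```

In this model a negated undecided test is still undecided, so negation does not change whether the URL is reported, and an undecided test combined by AND with a false one has no influence at all.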
-plainurls
Follow non-hyperlink URLs. Normally, wfind will follow hyperlinks in HTML, PDF and MS OLE documents. This switch causes any URL appearing in the document to be treated as a link. Among others, this allows following links from plain text and other document formats which do not have hyperlinks. -linktext tests are ignored for these links; they are always followed. To avoid misidentifying URLs, only links with the protocols which wfind parses (http(s) and ftp(s)) are recognised; this may affect -linksto tests, for which a URL from a hyperlink can have any protocol. (Workaround: use an additional content test.)
-samedomain
Follow hyperlinks only within a domain. Subdomains are treated as equal, so
links from en.wikipedia.org to de.wikipedia.org will be followed, for
instance. This option defaults to true in the first pipe section, unless most
of the given start URLs refer to the site of a well-known search engine,
when it defaults to false. See -while and the -url test for more
fine-grained control over which links are followed.
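A simple way to treat subdomains as equal is to compare only the last labels of the host names. The Python sketch below does this with the last two labels, which is a naive assumption for illustration (it would lump together unrelated sites under suffixes such as co.uk). How wfind determines the domain is not specified here, and the function names are hypothetical.

```python
from urllib.parse import urlsplit

def site_of(url, labels=2):
    """Return the last labels of the host name, e.g. 'wikipedia.org'."""
    host = urlsplit(url).hostname or ""
    return ".".join(host.lower().split(".")[-labels:])

def same_domain(url_a, url_b):
    """True if both URLs belong to the same domain, ignoring subdomains."""
    return site_of(url_a) == site_of(url_b)
```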
-symlinks
Follow symbolic links when searching local directories. This affects both links to directories (which are followed recursively) and links to files (which are considered as potential results). Symlinks to directories are treated as though their targets were located in the directory of the symlink and carry no "cost" in terms of search depth. Chains of symlinks are followed; loops are ignored gracefully.
-transform expression
Transform the result URLs before use in the following pipe section or output.
expression is a Perl expression that operates on $_. If $_ is the
empty string or undefined after evaluating the expression, the URL is removed
from the list of results. The functions that the Perl code fragment can use
are slightly restricted to prevent it from messing up wfind. wfind's variables
are also not visible to it. However, the functions uri_unescape,
uri_escape and uri_escape_utf8 from the URI::Escape module and
uri_split and uri_join from URI::Split are available.
Be aware that like all pipe section options, -transform will be inherited by
following pipe sections, which in this case is probably not what you want.
Passing -transform with an empty string as an argument disables
URL transformations.
Because the transformation is performed only after a URL is tested, lists of transformed URLs may contain duplicates if different URLs are transformed to the same one. Currently a URL transformation does not trigger a renewed retrieval, so -transform in the last pipe section together with -download will not download anything. An additional pipe section can be used to remove duplicates and download the transformed URL, see Advanced usage below.
Follow links from local files to the internet. Files found in a local directory search will still not have their links extracted; use an additional pipe section to achieve that (see PIPE SECTIONS above and EXAMPLES below).
Stop following links when the associated test becomes false. The document in
question may still be reported as a result, i.e. the semantics is that of a
do ... while loop. If multiple tests are given, they have to be enclosed in
parentheses, or only the first will be assigned to -while. See also
-linktext.
The few options scoped by subexpression deal with proximity search requirements and a generalisation of the simpler -and and -or operators. You have to be a little careful where to put them: If you do not put explicit parentheses around every subexpression, wfind will assign the option to the subexpression to which the test or operator to its left belongs.
Proximity requirement for the current (sub)expression, in characters. The distance gives the maximum distance between the first and last character which are part of phrases matching a sub-test. Be aware that a character proximity requirement may be hard to fulfil if you have greedy wildcards in a regular expression. You should probably use the non-greedy version instead (see the perlre manual page). A subexpression cannot have a more permissive proximity requirement than its superexpression. wfind will adjust the subexpression's proximity requirement automatically if that happens, and output a warning. If both -charnear and -wordnear are given in the same subexpression, both requirements have to be met.
This option allows defining a more general logical function than the standard -and and -or. It requires at least the given number of subexpressions to be true for the current expression to give a true result. The -and or -or operators in the current expression are ignored and are therefore best omitted. Operator subexpressions must be enclosed in parentheses.
Proximity requirement for the current (sub)expression, in words. The distance gives the maximum distance between the first and last word contributing to true-valued subexpressions. A distance of 0 means all have to be the same word, which allows requiring a word to pass more than one test. A subexpression cannot have a more permissive proximity requirement than its superexpression. wfind will adjust the subexpression's proximity requirement automatically if that happens, and output a warning. If both -charnear and -wordnear are given in the same subexpression, both requirements have to be met.
By default, wfind prints the file names or URLs of the documents matching the required criteria. Results are printed as tests are completed, so the first results are available as soon as possible. Results are not sorted, neither alphabetically nor in the order in which hyperlinks were encountered, and the order is not deterministic.
With the -print option, you can specify additional information to be printed. Except for the trace, the fields are printed as columns in the order in which they are documented above. The trace consists of the URLs which led from the start URL to the result and is printed indented in reverse chronological order below the result URL. If no trace is available (for instance when the start URL was a result), a remark to that effect is printed. Traversal of local or FTP directories does not contribute to the trace.
Besides the results (which are printed to standard output), the number and weight of documents that wfind could not test are printed to standard error, if applicable. This comprises documents which could not be retrieved, were forbidden to robots or blacklisted, the numbers of which are printed separately. The first category includes retrieval errors, decoding or decompression errors, unsupported file formats and dead symbolic links on local file systems.
The weight of a document is the weight of its parent multiplied with its share among its siblings, which are documents hyperlinked from the same previous document or fellow start documents. This notion is based on the idea of a regular tree of hyperlinked documents. Though real-life hyperlinks rarely form trees, let alone regular ones, it gives an estimate of the share of documents which may be missed because one was unavailable or forbidden.
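The weight definition above can be sketched in a few lines of Python. This is only an illustration under the idealised tree assumption; the function name `document_weights` and the `links` dictionary format are hypothetical and not part of wfind.

```python
from collections import deque
from fractions import Fraction


def document_weights(links, start_urls):
    """Weight of a document = weight of its parent / number of its siblings.

    `links` maps a URL to the list of URLs it links to (hypothetical format).
    The start documents are siblings of each other and share a total weight of 1.
    """
    weights = {url: Fraction(1, len(start_urls)) for url in start_urls}
    queue = deque(start_urls)
    while queue:
        parent = queue.popleft()
        children = links.get(parent, [])
        for child in children:
            if child not in weights:  # keep the weight of the first visit only
                weights[child] = weights[parent] / len(children)
                queue.append(child)
    return weights


links = {"start": ["a", "b"], "a": ["c", "d", "e", "f"]}
weights = document_weights(links, ["start"])
print(weights["b"], weights["c"])
```

In this toy tree, failing to retrieve "b" would mean missing an estimated half of the reachable documents, while failing to retrieve "c" would cost an estimated eighth.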
When a document cannot be retrieved and content tests are given, it is never reported as a result, even if the -pessimistic option is not in effect. The rationale behind this is that the aim of the search is to locate content, so returning unavailable URLs is pointless. Pure link extraction without content tests may still return dead links.
In case of errors in options, wfind outputs the index of the offending
command-line argument, starting with the first non-URL argument which has index
0. (The optional -- separator between URLs and tests does not count.)
Besides, the following options are useful to avoid misinterpretation of your
command line:
To find out what document types and compressions wfind supports and which Perl modules and external programs it uses, use the -status option:
wfind -status -verbose
(Output is briefer without -verbose.)
If wfind does not do what you expect, you can run it with the -echo option to see how wfind interprets your command line:
wfind ... -echo ...
Executing just wfind -echo without further arguments allows you to see the
default settings.
Sometimes the MIME type of a given document can depend on the module or program used to determine it. To see what MIME type wfind assigns to a certain file, run the following command:
wfind file -print mime -gdepth 0
The default global search depth (across all pipe sections) is 3, the default search depth per pipe section is 2. These numbers may seem small, but in most cases it is more sensible to reduce them than to raise them. Web pages tend to have tens of hyperlinks, so a depth of 3 puts thousands within reach. If all or most of them are within the same domain, wfind may run for hours due to its holdoff time, depending on how effective your -while tests are at pruning unwanted links.
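The growth described above can be made concrete with a little arithmetic. The following Python sketch assumes a regular tree with the same number of links on every page; the figures of 20 links per page and a holdoff of 5 seconds are illustrative assumptions, not wfind defaults.

```python
def reachable_pages(links_per_page, depth):
    """Number of pages within `depth` link hops of one start page in a regular tree."""
    return sum(links_per_page ** level for level in range(1, depth + 1))


def hours_for_holdoff(pages, holdoff_seconds):
    """Lower bound on the run time if all pages are on one server with a holdoff."""
    return pages * holdoff_seconds / 3600


pages = reachable_pages(20, 3)            # 20 + 400 + 8000
print(pages, hours_for_holdoff(pages, 5))
```

Under these assumptions a depth of 3 already reaches 8420 pages, which at one request every 5 seconds takes well over ten hours.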
The whole web is said to have a depth of only 15 to 20 hyperlinks, so a number in this order of magnitude is definitely too large a depth unless you are very restrictive about the URLs you admit. (These numbers are probably not quite correct, as that would imply that there are no more than 20 pages that are hyperlinked only serially on the whole web, which is unlikely. But they are a useful range to consider for unrestricted searches.)
The only situation where you should raise the default search depth is in searches of (local, HTTP or FTP) directories, where the search cannot run amok. For most web searches, a depth of 1 or 2 will probably suffice.
Unless you want to do a wholesale search of a complete site, you are going to have to tell wfind which links to follow. In order to create appropriate URL tests, do not just look at the links you want to follow, but also at those you want to ignore, to avoid leading wfind astray. Remember that the -samedomain option is active for non-search-engine start URLs by default. You have to deactivate it to follow links to other sites.
If you want to investigate series of documents (photo pages, comics, serialised
web fiction etc.), be aware that their URLs may not be entirely consistent.
Look at several before creating your URL test, and after running wfind sort
the results and look for gaps. (Numbering lines with nl can be helpful for
the latter.)
If you search through hyperlinks from downloaded web pages, remember to give the -unrestricted option to allow following these links. Be aware that relative links in downloaded documents can only be followed if you downloaded their targets too (unless the document base URL is given in a BASE tag).
A few brief remarks on search terms. When a search term is given as a simple
word (not a regular expression or other test), wfind acts similarly to a
search engine by reducing accented and other composed characters to their base
character before comparison (see Unicode below for details). However,
British and American spelling is still distinguished. Wildcards can be used to
allow both spellings: ...i?e matches both ...ise and ...ize, and ...o*
matches both ...or and ...our. Admittedly these search terms admit some
erroneous matches, especially for shorter words.
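The effect of such wildcard patterns, including the erroneous matches, can be tried out with shell-style globbing in Python. This is only a stand-in: Python's `fnmatch` module is assumed here to behave like wfind's wildcards for `?` and `*`, which the manual does not guarantee, and the word list is made up.

```python
from fnmatch import fnmatchcase

words = ["organise", "organize", "color", "colour", "colony"]

# "?" stands for exactly one character, "*" for any number of characters.
ise_ize = [w for w in words if fnmatchcase(w, "organi?e")]
or_our = [w for w in words if fnmatchcase(w, "colo*")]

print(ise_ize)
print(or_our)
```

The second pattern accepts both spellings of "colour" but also lets "colony" through, which is the kind of false match mentioned above.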
Regarding Levenshtein (editing distance) tests, one should be aware that they are not for short words. The proportion of meaningful short words is so large that every tested document can be expected to contain one which matches a given Levenshtein test. (Puzzles which require transforming one word into another by editing moves attest to that.) It is less easy to say how long exactly a word has to be to count as short in this sense, but words up to four or five characters probably qualify. The usefulness of Levenshtein tests probably starts at double that, around ten characters.
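To see why short words are problematic, it helps to compute a few distances. The following is a textbook dynamic-programming implementation of the Levenshtein distance in Python, not wfind's own code; the word list is an arbitrary example.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


candidates = ["cot", "car", "bat", "cast", "dog"]
near = [w for w in candidates if levenshtein("cat", w) <= 1]
print(near)
```

Four of the five unrelated short words are within one edit of "cat", so a distance-1 test on a three-letter word says very little about a document.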
Lastly, you should not use wildcard constructions in regular expression tests
which potentially match text unlimited in length, such as .*. They may not
do what you expect. wfind does not read documents in one go, but tries to
match them bit by bit as they are parsed. This imposes a maximum length on
char-based tests, which can be found in the variable
$globals{"defneedchars"} in the source code (currently 500). Instead of
.*, use two separate regular expression tests for the parts before and after
it and combine them with the -and operator. There is a similar implicit
restriction on the number of words available when applying word-based tests,
$globals{"defneedwords"}.
Filtering the output of search engines is probably the most difficult application of wfind, though it is also one of the most useful. In order to do that, run the search engine and paste the URL of the results page into wfind's command line. Before you run wfind, make sure the search engine does not ban spiders (Google does, for example).
The main difficulty in postprocessing general-purpose search engine results is
that their result pages contain many links that are not search results. Most
are links to ads, other searches and cached pages, all of which also contain
the search terms. To exclude some of these (at least), it is helpful not to
give all of your search terms to the search engine. (But because not all
search engine results actually contain the search terms, you should repeat
those passed to the search engine as arguments of wfind.) A -while test
banning links to the domain of the search engine can eliminate suggestions for
other searches, translations and the like.
Site searches and other special-purpose search engines can be less trouble than the general-purpose sort. Sometimes the result URLs have the same form (e.g. articles from a newspaper or blog) and can be easily filtered. On the whole, there tends to be less clutter.
wfind creates a directory .wfind in the user's home directory.
Subdirectories are created to cache the retrieved files of each run. Besides,
configuration files can be located there, notably wfindrc.
The wfindrc contains default options which are parsed before command-line
options and can be overridden by them. The option names are the same as global
or pipe section command-line options without the leading hyphen. Each line of
the wfindrc can contain one option and its value. Boolean-valued options
are switched on by their presence, but may also be followed by one of the
values on, true, yes, 1, off, false, no or 0 (case
insensitive). Comments are started with the # character and extend up to
the end of the line.
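The wfindrc rules just described can be illustrated with a small parser. This Python sketch is not wfind's implementation: it assumes that an option and its value are separated by whitespace, that a # always starts a comment, and that the caller supplies the set of boolean option names. The option names in the example are taken from options mentioned in this manual.

```python
TRUE_WORDS = {"on", "true", "yes", "1"}
FALSE_WORDS = {"off", "false", "no", "0"}


def parse_wfindrc(text, boolean_options):
    """Parse wfindrc-style text into a dict (assumed syntax: 'name [value]')."""
    options = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comment and surrounding blanks
        if not line:
            continue
        parts = line.split(None, 1)
        name = parts[0]
        value = parts[1].strip() if len(parts) > 1 else None
        if name in boolean_options:
            if value is None or value.lower() in TRUE_WORDS:
                options[name] = True           # presence alone switches it on
            elif value.lower() in FALSE_WORDS:
                options[name] = False
            else:
                raise ValueError("bad boolean value for %s: %s" % (name, value))
        else:
            options[name] = value
    return options


example = """
# defaults for all runs
depth 1
samedomain Off   # follow links to other sites
verbose
"""
settings = parse_wfindrc(example, {"samedomain", "verbose"})
print(settings)
```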
Besides the wfindrc, the configuration directory may contain a subdirectory
urlgroups. The files in this directory define a number of URL groups that
can be used with the -urlgroup test. These files can contain URL prefixes,
domain names or include directives. URL prefixes have to be absolute, i.e.
start with a scheme such as http://. Any URL that starts with a prefix from
the URL group or is in a subdomain of a domain from the URL group is taken to
be part of that group. Trailing asterisks after URL prefixes are ignored.
Include directives start with the word include followed by the name of
another URL group. Comments in URL group files start with # and extend up
to the end of the line. The special URL group blacklist defines URLs which
wfind should ignore.
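The membership rule for URL groups can be sketched as follows in Python. The function `in_url_group` is hypothetical; it assumes that comments and include directives have already been resolved, that an entry containing "://" is a URL prefix and anything else a domain name, and that a domain also matches itself (the text above only mentions subdomains).

```python
from urllib.parse import urlsplit


def in_url_group(url, entries):
    """Return True if `url` matches a URL prefix or domain from `entries`."""
    host = (urlsplit(url).hostname or "").lower()
    for entry in entries:
        entry = entry.rstrip("*")              # trailing asterisks are ignored
        if "://" in entry:
            if url.startswith(entry):          # absolute URL prefix
                return True
        else:
            domain = entry.lower()
            if host == domain or host.endswith("." + domain):
                return True
    return False


group = ["wikipedia.org", "http://example.com/blog/*"]
print(in_url_group("http://de.wikipedia.org/wiki/Karate", group))
print(in_url_group("http://example.com/shop/", group))
```

Comparing against "." plus the domain avoids treating a host such as notwikipedia.org as part of wikipedia.org.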
Even though the wfindrc file is read before the parsing of the command line,
the command line is briefly scanned for the options -verbose and -silent
and their negations, so that the parsing of wfindrc and the URL groups can
be reported accordingly.
Through LWP, wfind supports HTTP and FTP proxies set by the environment
variables http_proxy, ftp_proxy and no_proxy. Set these variables and
export them using your shell to make wfind use a proxy.
Some servers allow users to authenticate themselves via parameters in the URL.
You can use that method with wfind if you reverse-engineer the URL yourself.
Be aware that the password will then be visible to others on your system via
the ps program. With the same caveat, FTP logins can be put in the URL:
wfind ftp://user:password@ftp.someserver.foo/... ...
Authentication using challenges from the HTTP server or cookies, and reading
passwords from the .netrc is not supported by wfind at this time.
Extract hyperlinks from a web page:
wfind <URL> -nostart -depth 1
Extract all hyperlinks (including external ones) from a downloaded web page:
wfind file.html -nostart -depth 1 -unrestricted
Find out which Perl manual pages have something to say about arrays:
wfind /usr/share/man/man1/perl* =array
The = indicates a word test. It can be omitted, except when there is a file
or subdirectory called "array" in the current directory, when it would be
mistaken for a URL (file) argument. This ambiguity can also be avoided by
separating tests from URLs with --. For the search above, it is not
necessary to restrict the search depth: Because all the URLs are files and
groff files contain no hyperlinks, the search depth is effectively zero.
Postprocess search engine results (from Clusty/Yippy), requiring documents on Karate to contain the Kanji characters for Karate-do:
wfind http://search.yippy.com/search?query=karate -depth 1 '/\x{7A7A}\x{624B}\x{9053}/'
This should at least turn up the Wikipedia page. The command will run for a
rather long time due to the comparably large number of links from the search
results page to others in the yippy.com domain. See below for more advanced
examples of search engine postprocessing.
Find the URLs of the 25 latest Dilbert comic strips:
wfind http://dilbert.com/strips/?Page=[1-5] -depth 1 -follow img.src -url '/strip(?:\.sunday)?\.gif$/'
Add the -download option to retrieve the images. The pages displaying
recent comics are named systematically, with each page displaying five strips.
The comic images themselves end on "strip.gif" or "strip.sunday.gif", unlike
any other images on these pages. The -follow option restricts link
traversal to image URLs embedded with <img src=...> tags, and the
regex URL test picks out the comic image URLs.
Assume you have a set of URLs some of which link to pages you consider interesting. The interesting hyperlinks all have "bar" in their link text, and you are sure only the initial pages containing "foo" link somewhere interesting (replace "foo" and "bar" by appropriate phrases). The following wfind command line finds the interesting pages for you:
wfind <start URLs> -depth 0 foo \| -depth 1 -nostart -linktext bar
The command line has two pipe sections, which allows the content test for "foo"
to be imposed on the start URLs but not the link targets. The first pipe
section does not follow links (-depth 0), it only discards the initial URLs
not containing "foo". The second pipe section never returns initial URLs as
results (-nostart), only the targets of links containing "bar" in the link
text.
Note that the second pipe section follows hyperlinks to other domains. If you wanted to omit the test for "foo", you would have to give the -nosamedomain option to get equivalent results in the first pipe section:
wfind <start URLs> -depth 1 -nostart -nosamedomain -linktext bar
The following command line extracts unique results from the search engine dogpile:
wfind http://www.dogpile.com/dogpile/ws/results/... -nosta -depth 1 \
-transform 's!^(?:.*/clickserver/.*rawURL=(.*?)\&.*|.*)$!uri_unescape($1 || "")!e;' \
\| -depth 0 -nonosta -transform ""
Because dogpile uses an HTTP POST request to trigger searches, one cannot construct a start URL for the search results page. You have to perform the search in your browser (starting from http://www.dogpile.com/) and then copy the URL of the search results into the command line.
The first pipe section discards the initial URL (abbreviated -nostart) and
extracts links (-depth 1). Dogpile does not provide direct links to the
result pages, but instead redirects clicks through a URL on its own domain
starting with http://www.dogpile.com/clickserver/. The -transform option
extracts the result URLs (which are present in escaped form) and discards all
links which do not go to the clickserver and are therefore not results.
The second pipe section serves to make the result URLs unique. The -nostart
and -transform options, which would otherwise be inherited from the first
pipe section, are negated. The -download option could be added to retrieve
and save the documents, or test options could be added to select some of them.
This section contains details about wfind's inner workings. It is intended as a last-but-one resort before reading the source code, and as a more precise specification of its behaviour than the previous parts of the documentation can give. If wfind works as you expect it to and you would not dream of programming something like wfind yourself, you may want to skip this. On the other hand, studying this section is indispensable if you want to understand wfind and use it to its full potential. Trespassers may be confused to the edge of sanity and be found semi-conscious in a ditch with froth in front of their mouths several days later. You have been warned.
wfind consists of a dispatcher process and several slave processes which download and test documents. The number of slaves is given by the -slaves option. If its value is zero, no slave processes are used and the main process handles all URLs sequentially.
The main (dispatcher) process maintains a queue of URLs to look at. Whenever a slave process is idle, it looks for a job to assign to it. Depending on the -holdoff option, jobs can be postponed until the holdoff time for their server has expired (unless the URL was given on the command line or is considered part of a document by a frame or redirect). If the robots.txt file of its server forbids it, it is removed from the queue. Jobs for retrieving robots.txt files themselves are created as needed. If a URL is to be tested again (such as in a different pipe section) or if it does not require a retrieval (only URL tests), the holdoff test is skipped. If a job is ready, it is preferentially assigned to a slave which already has a connection to its server open.
A slave process waits for jobs sent by the dispatcher. It retrieves the
document, performs the tests and extracts hyperlinks. Taking the results of
-linktext and -while tests into account, jobs for testing hyperlinked
URLs are created and passed back to the dispatcher. For ordinary hyperlinks,
the remaining recursion depth is decreased by one. Frames, <layer>
tags and <meta content=...> tags are treated as hyperlinks, but the
depth is not decremented, and no holdoff period is imposed, reflecting the fact
that these are part of the current document.
The dispatcher process receives the hyperlinked URLs from the slaves and adds them to the end of the job queue. If the link structure is a tree (which it rarely is), this results roughly in a breadth-first search. If a given URL has been tested before, the dispatcher finds it in the list of completed jobs and transfers its established data to its new job, notably the file name where its content is cached if it was already retrieved. If the tests of a completed job evaluated to true (or undefined unless -pessimistic is given), a job is moved to the next pipe section or output as a result.
Local directories are disqualified as search results. Subdirectories are searched recursively with the remaining search depth decremented (as for hyperlinks), while files are considered as results and have their pipe section depth set to 0, but their global depth unchanged from their directory. This has the effect that recursion from a directory is restricted to subdirectories within the pipe section, but can be continued to hyperlinks of matching documents in the following section.
FTP directories are handled as local directories, with recursion restricted to subdirectories within the pipe section. HTTP directory listings are a different matter, because there is no sure way to tell when one encounters them. wfind employs a heuristic to identify them after retrieval, but does not try to tell subdirectories apart from documents. So the search depth is decremented for either, as it would be for hyperlinks from a document.
wfind's slave processes perform tests in an order that allows retrieval of the document to be postponed and possibly avoided altogether. First, URL tests are performed, and the logical test expression evaluated as far as possible. If the result depends on HTTP header tests, a HEAD request is sent, HTTP header tests are performed and the test expression evaluated again. If the result depends on content tests, or if the -while expression has evaluated to true, or if the test expression is true and -download is given, the document is retrieved.
After retrieval, the data format of the retrieved file is determined, it is decoded as applicable and possibly converted into a format that wfind can read. The resulting document is parsed, content tests are performed on the text and hyperlinks are extracted.
Elementary content tests are performed as text is parsed, and the positions of matches are noted. Periodically, the expression tree is evaluated, taking the proximity options into account. If the expression can be fully evaluated, parsing is aborted.
wfind can support a fair number of content encodings and data formats when the appropriate Perl modules or external programs are available. In order to avoid dependency hell, it does not require all of those modules and programs to be present. Rather, wfind detects their existence every time it runs, so additionally installed modules and programs are available immediately, without reinstalling wfind.
After retrieving a document (or after arriving at a specific local file), its data type is first determined by inspecting the first part of its contents. This is done by wfind itself, ignoring the HTTP Content-Type header, so as to be consistent between web documents and local files. If it is a type that wfind can read, the document is parsed; if the type is unknown, the file is ignored; if it is a compression or other encoding format, it is decompressed and the data type is determined again. Postscript is special in that it is treated like an encoding: it is converted to PDF which is then parsed.
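The detect, decode and detect-again loop described above might look roughly like this in Python. It is a minimal sketch and not wfind's code: the signature checks cover only gzip, bzip2, PDF and HTML, the names `sniff` and `resolve_type` are hypothetical, and the limit on nesting is an added safeguard that the manual does not mention.

```python
import bz2
import gzip


def sniff(data):
    """Guess the data type from the first bytes of the content."""
    if data.startswith(b"\x1f\x8b"):
        return "gzip"
    if data.startswith(b"BZh"):
        return "bzip2"
    if data.startswith(b"%PDF-"):
        return "pdf"
    if data.lstrip()[:15].lower().startswith((b"<!doctype html", b"<html")):
        return "html"
    return None                       # unknown type: the file would be ignored


DECODERS = {"gzip": gzip.decompress, "bzip2": bz2.decompress}


def resolve_type(data, max_layers=5):
    """Strip encodings until a parseable (or unknown) type is found."""
    for _ in range(max_layers):
        kind = sniff(data)
        if kind not in DECODERS:
            return kind, data
        data = DECODERS[kind](data)   # decode, then determine the type again
    raise ValueError("too many nested encodings")


page = b"<html><body>hello</body></html>"
kind, content = resolve_type(gzip.compress(bz2.compress(page)))
print(kind)
```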
The MIME types used for the -type and -rtype tests are independent of the data types for parsing. -type refers to the contents of the Content-Type HTTP header. It may therefore differ according to the server or even the directory on the same site if local server configuration files differ. For compressed documents, it may return the type of the compression (such as application/x-bzip2) or the type of the document (such as text/html).
The -rtype test uses the File::MimeInfo::Magic module or the file
command depending on which is available. It always tests the decompressed file
and is therefore independent of the encoding as well as the server, and also
works on local files. However, its results may depend on whether
File::MimeInfo::Magic or the file command is used, and possibly on their
version.
Current versions of Perl support Unicode, and so does wfind. The character
encoding of a web page is obtained from the Content-Type HTTP header or the
corresponding HTTP-Equiv meta HTML tag, if present. Otherwise, ISO-8859-1 is
used as the default character encoding. The output of external parsing
programs such as pdftotext and catdoc is read taking their character
encoding into account.
Before performing word tests and regular expression tests without the v flag, the text is transformed to "base characters". Accents, cedillas and diaereses are removed, and ligatures such as the Scandinavian ae are transformed to two ASCII characters. Punctuation is treated similarly: The more than 5 kinds of Unicode hyphens are transformed to ASCII, zero-width joiners and discretionary hyphens are removed, and glyphs resembling (single or multiple) punctuation signs are transformed to (single or multiple) ASCII characters.
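A reduction to base characters of this kind can be approximated with Unicode normalisation. The Python sketch below is an assumption about the general technique, not a copy of wfind's tables: it uses NFKD decomposition, drops combining marks, and handles a handful of characters that have no decomposition through a small hand-written map (`EXTRA`, hypothetical and far from complete).

```python
import unicodedata

# Characters without a Unicode decomposition need explicit replacements.
EXTRA = {
    "\u00e6": "ae",   # ligature ae
    "\u00c6": "AE",
    "\u2010": "-",    # hyphen
    "\u2013": "-",    # en dash
    "\u00ad": "",     # soft (discretionary) hyphen
    "\u200d": "",     # zero-width joiner
}


def to_base(text):
    """Strip accents and map selected characters to ASCII equivalents."""
    result = []
    for ch in unicodedata.normalize("NFKD", text):
        if unicodedata.combining(ch):
            continue                  # drop accents, cedillas, diaereses
        result.append(EXTRA.get(ch, ch))
    return "".join(result)


print(to_base("Caf\u00e9"), to_base("\u00e6ther"), to_base("wait\u2026"))
```

Applying the same function to both the search term and the document text is what makes the comparison tolerant of accents.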
These transformations make the simpler searches about as tolerant as internet
search engines. The search words (or regexes without the v flag) are also
transformed, so they may contain non-ASCII characters. In order to accept
command-line arguments in UTF-8 encoding, wfind specifies Perl's -C
option in its interpreter command line (the first line in the script starting
with #!). I have known this not to work on one installation; if you get an
error message regarding this option, remove it and put the following line in
your shell resource file (bashrc or equivalent):
export PERL_UNICODE="SAL"
In addition, your locale has to specify a UTF-8 encoding, so Perl and your shell understand each other.
wfind observes the robot exclusion standard. Since the standard is slightly open to interpretation, this section gives the details. If a server provides a file robots.txt in its root directory, it is retrieved and parsed before any hyperlinks are followed on that server. URLs disallowed for wfind (or whatever you have set as its user agent name) are not retrieved recursively. URLs that are explicitly allowed are still retrieved recursively, even if they also match disallowed paths, on the grounds that it would be pointless to list them as allowed otherwise.
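The precedence of explicitly allowed paths over disallowed ones can be expressed in a few lines. This Python sketch assumes plain prefix matching of URL paths and already-parsed Allow and Disallow lists for the relevant user agent; the function name `may_recurse` is hypothetical and wildcards are not handled.

```python
def may_recurse(path, allowed, disallowed):
    """Decide whether a URL path may be retrieved during recursion."""
    if any(path.startswith(prefix) for prefix in allowed if prefix):
        return True                   # explicit Allow wins over any Disallow
    if any(path.startswith(prefix) for prefix in disallowed if prefix):
        return False                  # an empty Disallow line forbids nothing
    return True


allowed = ["/private/public/"]
disallowed = ["/private/"]
print(may_recurse("/private/public/page.html", allowed, disallowed))
print(may_recurse("/private/secret.html", allowed, disallowed))
```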
The restrictions by robots.txt are not applied to URLs given on the command
line. This reflects the fact that the robot exclusion standard is aimed at
programs that traverse the web automatically (see http://www.robotstxt.org/)
and follows the practice of other programs such as curl. Retrieving and
testing documents explicitly named by the user does not (yet) make wfind a
crawler.
By default, and if the -holdoff option has a negative argument, wfind uses the Crawl-delay entry of robots.txt (if present) as its holdoff time. This can actually speed things up: I have seen a fair number of servers that set a Crawl-delay of 0.
In order to treat different document types equally, wfind ignores the <meta name="robots" ...> HTML tag. Flags such as noindex and nofollow have no influence on results and recursion.
For combining tests in expressions, wfind supports logical AND and OR operators as well as general logical operators requiring a minimum number of the subexpressions in the expression to be true (with the -require option). In addition, proximity requirements may be imposed using the -wordnear and -charnear options.
The common AND and OR operators are special cases of a -require operator, namely stipulating all or at least one subexpression. Therefore the following will be described in terms of -require operators, which covers everything. The proximity options impose the restriction that all required tests not only be true for a word in the document, but that these words lie in an interval of n+1 words or characters, where n is the argument of the proximity option. As a special case, a proximity condition of -wordnear 0 allows imposing several tests on the same word.
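The combination of a -require count with a word proximity interval can be sketched with a sliding window over match positions. This Python function is an illustration of the rule stated above for a flat list of elementary tests, not wfind's evaluation code; `require_near` and the format of `matches` are hypothetical.

```python
def require_near(matches, required, distance):
    """True if at least `required` tests match within `distance` + 1 words.

    `matches` maps a test name to the word positions at which it matched.
    AND corresponds to required == len(matches), OR to required == 1.
    """
    if required <= 0:
        return True
    events = sorted((pos, name) for name, positions in matches.items()
                    for pos in positions)
    counts = {}                       # tests with a match in the current window
    left = 0
    for pos, name in events:
        counts[name] = counts.get(name, 0) + 1
        while pos - events[left][0] > distance:
            old = events[left][1]     # slide the window's left edge forward
            counts[old] -= 1
            if counts[old] == 0:
                del counts[old]
            left += 1
        if len(counts) >= required:
            return True
    return False


matches = {"foo": [3, 40], "bar": [10], "baz": [42]}
print(require_near(matches, 2, 2), require_near(matches, 3, 31))
```

With a distance of 0, the window shrinks to a single word, which corresponds to imposing several tests on the same word.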
If one of the required tests is a subexpression, all of its tests' matches must also lie within the proximity interval of the superexpression, even if it has a proximity option of its own. The negation of an expression with proximity option is true if the required number of tests cannot be fulfilled or if their matches do not fit into the proximity interval.
If expressions with proximity options contain tests that do not apply to the document text, these (if true) contribute to the required tests but do not restrict the proximity interval. For example, an AND expression with proximity option containing one in-text test and one header test effectively has no proximity requirement, as the header test does not narrow the interval for the text match. If one such test is undecidable (such as a header test for a non-existing header), the overall expression may become undecidable, so that the presence of the -pessimistic option may decide about the outcome.
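How an undecidable test might propagate through a -require operator can be modelled with three-valued logic. The Python sketch below is an assumption in the style of Kleene logic, with None standing for "undecidable"; the manual does not specify wfind's exact rules, and both function names are hypothetical.

```python
def require(values, required):
    """Combine True / False / None (undecidable) under a -require style count."""
    true_count = sum(1 for v in values if v is True)
    undecided = sum(1 for v in values if v is None)
    if true_count >= required:
        return True                   # enough tests are definitely true
    if true_count + undecided < required:
        return False                  # not reachable even if all undecided were true
    return None                       # outcome hinges on undecidable tests


def is_result(value, pessimistic=False):
    """Undecidable outcomes count as results unless -pessimistic is in effect."""
    return value is True or (value is None and not pessimistic)


outcome = require([True, None], 2)    # AND of a true and an undecidable test
print(outcome, is_result(outcome), is_result(outcome, pessimistic=True))
```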
find(1), grep(1), bool(1), curl(1), wget(1), perlre(1)
wfind is (c) 2008-2011 Volker Schatz. It is free software and may be copied and/or modified according to the GNU General Public Licence version 3 or later (see http://www.gnu.org/licenses/gpl.html).