Ad servers — a nuisance for humans

Advertisements are a nuisance when browsing: they distract from the real content and redirect you when you click on them by accident. Thankfully, some browsers can filter ads by refusing to display content from blacklisted URL paths. The following blacklist is in the format of Opera's urlfilter.ini. Each entry is a URL path with the wildcard "*" appended, and the list can be pasted after the keyword "[exclude]" in urlfilter.ini, which is located in ~/.opera/ on Linux systems. Many of the URLs were copied from this list, but I have kept adding more as I have encountered them.
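If you collect URL paths of your own, a small script can put them into the form described above. The following Python sketch only does what the paragraph says: it appends the "*" wildcard to each path and places the result under the "[exclude]" keyword. The function names and the example.com addresses are made up for illustration, and the sketch deliberately ignores any other sections urlfilter.ini may contain.

```python
def to_urlfilter_lines(url_paths):
    """Append the '*' wildcard to each URL path, skipping blanks and duplicates."""
    seen = set()
    lines = []
    for path in url_paths:
        path = path.strip()
        if not path:
            continue
        # Do not add a second wildcard if the path already ends with one.
        entry = path if path.endswith("*") else path + "*"
        if entry not in seen:
            seen.add(entry)
            lines.append(entry)
    return lines


def exclude_section(url_paths):
    """Return the text of an [exclude] section for the given URL paths."""
    return "\n".join(["[exclude]"] + to_urlfilter_lines(url_paths)) + "\n"


if __name__ == "__main__":
    # Hypothetical ad server paths, not taken from the real blacklist.
    print(exclude_section(["http://ads.example.com/", "http://example.net/banners/"]), end="")
```

The output can be pasted into urlfilter.ini in place of, or in addition to, the entries already listed after "[exclude]".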

Link spam — a nuisance for spiders

When using a program (spider) to examine web pages and follow hyperlinks, advertisements are less of a problem because they are rarely so well targeted that they are confused with real results containing the information one is looking for. However, some other web content can lead a spider astray.

There are several types of such problematic content. Perhaps the most misleading are defunct domains. Hyperlinks from trustworthy sites to them still exist, but the domain is now in the hands of domain squatters, registrars or advertisers (not mutually exclusive). Their home pages typically contain very topical keywords and many links to the same domain, as well as a few links to the registrar or advertiser. Because the domain squatter knows the previous content of the site, the keywords are similar enough to the original that an automated spider has difficulty telling such pages apart from real content. Because it is hard to keep track of when a site goes out of business, I do not try to keep a list of defunct domains up to date; instead I blacklist the sites their link spam points to. This part of the blacklist below therefore contains many well-known registrars, so you may have to remove them if you spider for content likely to be found on their sites.

The second type of site which spiders should ignore is small in number but bothersome. Some sites have perfected the manipulation of search engines and crop up frequently in searches for technical information, but the content cited by the search engine requires payment, or else is trivial and contains links to commercial offerings. Then there is straightforward link spam, usually consisting of links inserted into forum or blog posts or code snippet databases. These are harder to filter, as their targets are much more numerous. Advertisement links should not be followed by a spider either. Unlike for web browsers, however, the URL to filter is not the URL of the ad image, but the link target.

Finally, some web pages contain excessive numbers of links to the submission scripts of social bookmarking services in order to enhance their popularity. These links are useless for information searches. I have taken care to restrict their blacklisting to the submission pages, so that the rest of each bookmarking site can still be spidered.

The blacklist file below contains all these types (comments indicate which list starts where). It is in a format understood by my search spider wfind. A partial URL matches every URL of which it is a prefix (no wildcard is appended). A domain name (without http://) matches all URLs within that domain and all of its subdomains.
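For readers who want to use the list with a spider other than wfind, the two matching rules can be sketched in a few lines of Python. This is not wfind's code, only an illustration of the rules as stated above. It assumes that an entry containing "://" is a partial URL and that any other entry is a bare domain name; the function name and the example entries are invented.

```python
from urllib.parse import urlsplit


def is_blacklisted(url, entries):
    """Return True if url matches any blacklist entry.

    Assumption: entries containing '://' are partial URLs (prefix match);
    all other entries are bare domain names (domain plus subdomains).
    """
    host = (urlsplit(url).hostname or "").lower()
    for entry in entries:
        if "://" in entry:
            # Partial URL: matches every URL of which it is a prefix.
            if url.startswith(entry):
                return True
        else:
            # Domain name: matches the domain itself and any subdomain.
            domain = entry.lower()
            if host == domain or host.endswith("." + domain):
                return True
    return False


if __name__ == "__main__":
    # Hypothetical entries, not taken from the real blacklist.
    entries = ["example.com", "http://bookmarks.example.org/submit"]
    for url in ("http://www.example.com/page",
                "http://bookmarks.example.org/submit?url=x",
                "http://bookmarks.example.org/popular"):
        print(url, is_blacklisted(url, entries))
```

Comparing the host name against "." plus the domain, rather than doing a plain suffix test, keeps an entry such as example.com from also matching notexample.com. The partial URL entry shows how a bookmarking service's submission script can be blocked while the rest of the site stays reachable.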

