
Advanced Citing

Citing URLs -- Citation types -- Archived web citations

Citing long URLs

The web being the information source it is, it is increasingly common to reference URLs in papers and other documents. Style files, provided by journal publishers that compete by prestige rather than quality of service, do not always handle URLs well. One particular problem can occur for URLs longer than 80 characters or so. It results from a daft choice of default parameters of the BibTeX program in the teTeX implementation of LaTeX. The BibTeX source (texk/web2c/bibtex.web in the source distribution) contains the line:

@!max_print_line=79;

This sets the maximum output line length to 79 characters. If you compile TeX yourself, you can avoid much trouble by appending a zero to give 790. Otherwise BibTeX limits its output line length in the .bbl file severely, and longer passages of text without spaces are forcibly broken with a comment character (percent sign) added before the newline. This applies to URLs as well as text passages containing long control sequences.

How serious this is depends on how the broken lines are processed further. The control sequences \url and \href of the hyperref package are designed to work around the problem by discarding the percent sign when followed by a newline. Some bibliography styles automatically enclose the url BibTeX entry in \url{...}, thereby taking care of the problem. But if you give a plain URL in the note entry, you may be in trouble. Apparently TeX keeps the percent sign in this context rather than interpreting it as the start of a comment. In some cases I have also seen URLs truncated up to the percent sign. Things are even worse when a control sequence is forcibly broken: Its remainder at the start of the following line is not appended, but rather taken as ordinary text, which leads to "undefined control sequence" errors.

So what can one do? If you always cite URLs verbatim, you should be fine using \url. In that case, you may have another problem, namely TeX's hyphenation. TeX treats URLs the same as text, hyphenating words at syllable boundaries. This leads to an ambiguity between hyphens that are due to hyphenation and those that are part of the URL. For some style files, TeX seems to avoid hyphenating URLs altogether, which leads to overlong lines. The solution to the hyphenation problem is to make TeX break lines at slashes (and perhaps at other points) without adding a hyphen. This is achieved by adding a zero penalty at those points, for instance for this URL:

http://www.\penalty 0 volkerschatz.com/\penalty 0 tex/\penalty 0 advbib.html

TeX ignores white space after control sequences (and their numerical arguments), so no gaps in the URL result. The disadvantage of this is that it does not work together with \url, and therefore does not work in the url entry for many bibliography styles either. Apparently \url manages to ignore the \penalty sequences. I have not yet found a solution that works in that situation (\href has the same problem). A beneficial side effect of the above is that it solves BibTeX's line breaking problem too because of the spaces after the \penalty control sequence.
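To try the manual break points, a minimal test document along the following lines can be used. This is only a sketch: the narrow minipage and the surrounding sentence are assumptions made purely to force line breaks, while the URL and the \penalty 0 markers are the ones shown above.

```latex
\documentclass{article}
\begin{document}
% A narrow, ragged-right column so that TeX has to break inside the URL
\begin{minipage}{4cm}
\raggedright
See http://www.\penalty 0 volkerschatz.com/\penalty 0 tex/\penalty 0 advbib.html
for details.
\end{minipage}
\end{document}
```

Each \penalty 0 marks a permissible break point at no cost, and the space after the number only terminates it, so the URL is printed without gaps.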

Finally, what if you do not print the URL as the hyperlink text? I prefer to put a simple word indicating the hyperlink ("Online"), and for versatility it is given by a control sequence (\urlprefix). Obviously there are no problems with hyphenation. But for a URL of the wrong length, the line in the .bbl file is separated in the middle of the control sequence \urlprefix which makes up the second argument of \href, the link text. Short of recompiling BibTeX, the only workaround is to postprocess the .bbl file to join the broken lines again. This can be done with the following sed script:

:s
/%$/{
N
bs
}
s/%\n//g


The code in braces appends the next line and loops (via label s) while the current line ends in a percent sign. Then all percent signs followed by newlines are removed before the line is output. You can download the short script here and execute it as follows:

sed -i -f fixbbl.sed document.bbl

The -i makes sed edit the destination file in place. Alternatively, for easier use you can define a shell function as described here.

Indicating different kinds of citations

When reviewing literature on a new topic, I have frequently regretted that citation lists lack information about how the cited work is related to the work in which it is cited. For instance, some references are cited because the current work builds on their results, some contain different approaches to solving the same problem, others are background information or reviews on the subject. While this information can often be obtained from the title of the cited work or the context in which it is cited, this is error-prone and slows you down researching literature. Misunderstanding why a paper was cited is especially annoying when the journal in question is not available at your institution. You can end up ordering the paper from a library only to find days later that it is not at all what you expected and that you need a different one, which can stretch hours of literature search into weeks.

So, when I decided to write up some more-or-less scientific investigations I had done privately and put them on the web, I remembered this issue and thought about a way to solve it. I decided that an informative and yet unobtrusive way to include the desired information in the citation list would be to put little icons or graphical characters in front of every reference to indicate with which intent it was cited. The three-dot "therefore" sign indicates intuitively that this work builds on the reference, a similarity sign denotes work with the same aim, a fat bullet a review or reference book, and so on.

Now how to implement such a scheme? I use BibTeX for citation lists, so there were two programs involved, (La)TeX and BibTeX, and two corresponding style files. Because I was creating a personal LaTeX and BibTeX style file at the time, my solution is designed to be put into style files. (Note on the side: The custom-bib package from CTAN is a great way of generating custom BibTeX style files.) The LaTeX part could in principle go into the preamble of the LaTeX source (before the \begin{document}) if the "@" characters in the macro names are replaced by something else.

So how is it done? The BibTeX part is relatively easy. A control sequence which we will have to find a way to define is inserted immediately after the \bibitem command which defines the reference and prints its number or mnemonic. The TeX code for the citations is passed to LaTeX from BibTeX through the .bbl file generated by BibTeX. What is written to this file is controlled by BibTeX's own style file which has the suffix .bst. The output of the \bibitem is done in the function output.bibitem in the .bst file. I modified it to output an additional control sequence the name of which depends on the citation label.

FUNCTION {output.bibitem}
{ newline$
  "\bibitem[" label * "]{" * write$
  cite$ write$
  "} \csname vs@citetype@" write$
  cite$ write$
  "\endcsname" write$
  before.all 'output.state :=

The most important thing to know about the language of .bst files is that it has postfix syntax: arguments precede functions. I do not understand very much of the file format myself, but from looking over a file I learnt that cite$ denotes the citation label and that write$ causes its argument to be written to the .bbl file. This is quite enough for the simple modification that we need. With the inserted part (everything from \csname to \endcsname in the listing above), BibTeX outputs an additional control sequence the name of which is composed of "vs@citetype@" and the citation label. The \csname and \endcsname in which it is enclosed cause the string between them to be interpreted as a control sequence name and that sequence to be called. They are necessary because the contents of the .bbl file are executed with the category codes of the user TeX source files, where "@" is not allowed in control sequence names. (And some of the characters in the citation label may not be allowed either.) \csname...\endcsname overcomes that.
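The effect of \csname...\endcsname can be seen in isolation in a small test document. This is a sketch, not part of the style files: the labels foo1997 and bar2005 and the bullet symbol are chosen only for illustration.

```latex
\documentclass{article}
\begin{document}
% "@" is not a letter here, so \vs@citetype@foo1997 cannot be typed directly.
% \csname builds the control sequence name from arbitrary characters instead.
\expandafter\def\csname vs@citetype@foo1997\endcsname{$\bullet$ }

\csname vs@citetype@foo1997\endcsname A marked entry.

% A name that was never defined is made equal to \relax by \csname,
% so references without a citation type cause no error.
\csname vs@citetype@bar2005\endcsname An unmarked entry.
\end{document}
```

The second case matters in practice: references cited with a plain \cite have no \vs@citetype@<label> defined, and the control sequence written by BibTeX then simply does nothing.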

The second part of the solution is to define several control sequences in a LaTeX style file which allow a reference's citation type to be set. I implemented five different types: prior work, comparable work, more loosely related work, reviews/reference books and background info. When a reference is cited as a given type, the control sequence \vs@citetype@<label> is defined as the corresponding indicator symbol (where <label> is the citation label). Here is what I added to my LaTeX style file:


% Citation type definitions
\def\mark@vsctpw{$\therefore$ }
\def\mark@vsctcf{$\thicksim$ }
\def\mark@vsctrf{$\bullet$ }
\def\mark@vsctrl{$\leftrightarrowtriangle$ }
\def\mark@vsctba{$\sphericalangle$ }
  \if@vsctpw\mark@vsctpw prior~work\ \ \fi%
  \if@vsctcf\mark@vsctcf confer\ \ \fi%
  \if@vsctrf\mark@vsctrf reference\ \ \fi%
  \if@vsctrl\mark@vsctrl related\ \ \fi%
  \if@vsctba\mark@vsctba background\fi}}
  \@for \vs@cit:={#2}\do{\def\vs@cite{vs@citetype@\vs@cit}%
  \@for \vs@cit:={#2}\do{\def\vs@cite{vs@citetype@\vs@cit}%

First the AMS symbol package is included because the indicator symbols I use are from the AMS fonts. You do not need this if you choose different symbols. The core of the matter is the two control sequences \citet and \citent which stand for "cite reference as given type" and "cite reference with no output as type". The only difference between them is that the first calls \cite after the citation type stuff has been done. Both are used the same way: Their first argument is a punctuation character vaguely resembling the indicator symbols I chose, which gives the citation type. The second is a comma separated list of citation labels, as would be passed to \cite. Bearing in mind that curly braces around the first argument are not required here, the usage is as follows:

See \citet~{foo1997}.  But don't forget \citet-{bar2005,bar2003}.  For a recent
review, see \citet.{allfoobar2006}.

The obvious drawback is that all references cited in one go receive the same citation type. It would be preferable if citation types could be given per reference instead of per \citet command. That would require stripping the first character from the citation labels, and I haven't yet worked out how to do that. The current workaround is to use a following \citent command to correct the citation types which should be different.

So now let's have a look at how \citet works. First the control sequence \vs@ctype is called which sets the control sequence \vs@ct globally to the indicator symbol corresponding to its argument, the first argument of \citet. (The indicator symbols are defined in separate control sequences because we will also use them to generate a key.) Then the LaTeX construct \@for...\do is used to split up the comma separated list of citation labels and loop over the labels. In the loop, the control sequence \vs@cite is set to the name of the control sequence \vs@citetype@<label> which will be inserted into the citation list. This control sequence is then defined as the indicator symbol stored in \vs@ct, using \expandafter and \csname...\endcsname to get the target control sequence from \vs@cite.
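The mechanism just described can be sketched as a self-contained test document. This is a reconstruction from the description, not the original style file: it covers only two of the five types, the mapping of ":" to prior work and "~" to confer is an assumption, and the hand-written thebibliography stands in for the .bbl file that the modified .bst would produce.

```latex
\documentclass{article}
\usepackage{amssymb}   % provides \therefore and \thicksim

\makeatletter
% Indicator symbols for two of the five types (kept short for the sketch)
\def\mark@vsctpw{$\therefore$ }
\def\mark@vsctcf{$\thicksim$ }

% \vs@ctype<char>: store the symbol selected by <char> globally in \vs@ct.
% The mapping ":" -> prior work and "~" -> confer is an assumption.
\def\vs@ctype#1{%
  \ifx#1:\global\let\vs@ct\mark@vsctpw
  \else\ifx#1~\global\let\vs@ct\mark@vsctcf
  \else\global\let\vs@ct\@empty   % unknown character: no symbol
  \fi\fi}

% \citent<char>{<labels>}: set the type for every label, print nothing
\def\citent#1#2{%
  \vs@ctype#1%
  \@for\vs@cit:={#2}\do{%
    \def\vs@cite{vs@citetype@\vs@cit}%
    \expandafter\global\expandafter\let\csname\vs@cite\endcsname\vs@ct}}

% \citet<char>{<labels>}: the same, followed by an ordinary \cite
\def\citet#1#2{\citent#1{#2}\cite{#2}}
\makeatother

\begin{document}
See \citet~{foo1997}, which builds on \citet:{bar2005}.

\begin{thebibliography}{9}
% The \csname...\endcsname part is what the modified .bst file would emit
\bibitem{foo1997}\csname vs@citetype@foo1997\endcsname A. Foo, 1997.
\bibitem{bar2005}\csname vs@citetype@bar2005\endcsname B. Bar, 2005.
\end{thebibliography}
\end{document}
```

The two \expandafter tokens make TeX build the target name with \csname before \global\let is carried out. Note that the natbib package defines its own \citet, so with natbib loaded these control sequences would need different names.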

The remaining control sequences serve to generate very brief descriptions for the indicator symbols in the control sequence \vs@citetypekey, but only for types that were indeed used. To that end, an \if@vsct... control sequence is defined for every citation type, which is set to \iftrue if and when the type is encountered in \vs@ctype. \vs@citetypekey then prints out a brief description for every citation type that was used. To output this key, a line has to be added to the definition of \thebibliography in your style file:

    \par\noindent\vs@citetypekey\par\vskip 1mm%
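How the flags and the key interact can be tried out separately from the citation commands. The following is a sketch for two types only. The \small size switch, the hypothetical \showkey alias, and the explicit \global\@vsctcftrue (standing in for what \vs@ctype would do on meeting that type) are assumptions for the demonstration.

```latex
\documentclass{article}
\usepackage{amssymb}   % provides \therefore and \thicksim

\makeatletter
\def\mark@vsctpw{$\therefore$ }
\def\mark@vsctcf{$\thicksim$ }

% \newif\if@vsctpw also creates \@vsctpwtrue and \@vsctpwfalse;
% every flag starts out false.
\newif\if@vsctpw
\newif\if@vsctcf

% The key prints an entry only for those types whose flag has been set
\def\vs@citetypekey{{\small
  \if@vsctpw\mark@vsctpw prior~work\ \ \fi
  \if@vsctcf\mark@vsctcf confer\fi}}

% Stand-in for \vs@ctype encountering the "confer" type
\global\@vsctcftrue

% Hypothetical alias so the key can be called from the document body
\let\showkey\vs@citetypekey
\makeatother

\begin{document}
\noindent\showkey   % only the "confer" entry appears
\end{document}
```

Setting the flags with \global matters because citations may occur inside groups (footnotes, environments), whereas the key is printed much later in the bibliography.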

The following extract from the citation list of one of my nonpapers shows what the result looks like:

Archived web citations

When citing resources on the web, it is simplest just to cite the URL. But the page in question may change, be moved or disappear altogether. In order to allow the reader to obtain such references (almost) indefinitely, it helps to also cite an archived copy, either on the Internet Archive or one you have created yourself using the free WebCite® or a similar service.

But even though common BibTeX bibliography styles support a URL field, they do not support citing an additional archive URL. This is easily added by duplicating the code for the URL field in the .bst file. Before we start, a few words on the appearance of the URL. The default, apparently dating back to printed documents, still spells out the URL in text. Since I mostly create electronic PDF documents, I have replaced that by a PDF hyperlink to the URL, with the brief link text "Online". The link to the archive URL which will be added below will be just as brief, but I will remark on what you have to change to get something more verbose.

Each field (key-value pair which may be part of a BibTeX entry) appears in the BibTeX style file (.bst file) in several places. First, all field types are listed in a block of curly braces after the keyword ENTRY. I decided to call the new field for the archive URL "cached". (This is the name of the key after which you have to give the archive URL in the bibliography (.bib) file later.) The first thing to do is to list it among the other fields.

ENTRY
  { address
    cached
    ...

(The three dots are not literal, more entries follow.)

The next most important thing is to define a formatting function for our new field. I called it format.cached in analogy to other field-specific formatting functions, which are also called format.<field name>. It is more or less a copy of format.url.

FUNCTION {format.cached}
{ cached empty$
    { "" }
    { "\href{" cached * "}{\cachedprefix}" * }
  if$
}

It is not too difficult to understand once you have got over the postfix notation. If the cached field has an empty value, the empty string is returned. Otherwise, a string containing an \href control sequence is put together and returned. The hyperlinked URL is given by the contents of the cached field, and the text is the contents of the \cachedprefix control sequence, which will be defined later on. The asterisk is simply the concatenation operator, which is preceded by its two arguments because of the postfix notation.

Now all that remains to be done is to add the output of the cached field to all entry types which support URLs. The following example is for the article entry type, and puts the archive URL after the live URL:

FUNCTION {article}
{ ...
  format.url output
  format.cached output

Finally we have to define the \cachedprefix control sequence somewhere. The right place to do this is the begin.bib function, which is executed before processing the references:

FUNCTION {begin.bib}
{ ...
  "\providecommand{\cachedprefix}{Archived}"
  write$ newline$

The control sequence \providecommand allows the definition to be overridden in the TeX source file (for instance for texts in a different language from English). This would not have been possible had we put a literal text in the definition of format.cached.
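The override can be checked with a small test document. This is a sketch under assumptions: the German link text is just an example, and the two lines in the document body imitate what the .bbl file would contain instead of coming from an actual BibTeX run.

```latex
\documentclass{article}
\usepackage{hyperref}

% Defined in the preamble, i.e. before the bibliography is read.
% The later \providecommand then leaves this definition untouched.
\newcommand{\cachedprefix}{Archiviert}

\begin{document}
% What the .bbl file would contain, in effect:
\providecommand{\cachedprefix}{Archived}
\href{http://yourarchiver.org/short-link}{\cachedprefix}
\end{document}
```

Without the \newcommand line, \providecommand supplies the default and the link text reads "Archived".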

The cached field can now be added to any web citation you want to reference:

  author= {L. User},
  title= {The fwark of the gnorbledook},
  url= {http://www.bla.foo/important-text.djvu},
  cached= {http://yourarchiver.org/short-link}

It will be output as a hyperlink under the word "Archived".

It has to be said that the changes described above may not apply without modification to any style file you may be using. BibTeX is a programming language, which gives authors of .bst files the freedom of coding in their own style. The initialisation may not be located in a separate function begin.bib, and possibly the processing of a bibliography entry may be organised in different subroutines. But judging from the .bst files installed with my copy of BibTeX, most seem to have the structure I assumed above, and the custom-bib package also generates them that way.
