Automated YAML file processing

YAML is a hierarchical data format similar to the better-known JSON. Compared to JSON, it is easier for humans to read and can be edited comfortably in any text editor, and it is also more powerful (for example, it supports references). The syntactic difference between YAML and JSON is comparable to that between Python and C: YAML normally expresses nesting by indentation rather than by brackets and braces.

This makes YAML a very clear and compatible way of storing nested data structures, in contrast to lightweight databases such as SQLite, which are restricted to tables.

yaml, the program distributed on this web page, automates the processing of data structures stored in YAML files. It is still experimental software: it mostly works, but incompatible changes are possible in the future. It plays the same role for YAML that tools such as xmlstarlet or xmllint play for XML.

yaml operates on the array of top-level entries in the YAML file when there are several, or (more commonly) on the array or hash/dictionary contained in its only entry. The sub-commands are described in the documentation below. An especially useful feature that cannot be found elsewhere in a general-purpose program is the depends command (and its counterpart rdepends), which converts a list of items and their dependencies into a dependency tree.

yaml is available in a git repository on this server that can be cloned like this:

git clone http://volkerschatz.com/repositories/yaml

yaml is written in Perl and licensed under the GNU General Public License version 3. Its main dependency is the YAML::Syck Perl module.

The program documentation is here:

usage: yaml <command> <file.yaml> [ <arguments> ... ]

Valid commands are: transform, grep, sort, extract, depends, rdepends, import, export

yaml transform <file.yaml> <Perl code>

Applies the <Perl code> to all top-level data structures and outputs the transformation result in YAML format. A reference to the data structure is passed in $_, and the result has to be passed back in $_.

yaml grep <file.yaml> <Perl code>

Filters for data structures for which the <Perl code> evaluates to true and outputs them in YAML format. A reference to the data structure is passed in $_.
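The semantics of transform and grep can be illustrated with a rough Python analogue. This is only a sketch of the idea: the tool itself evaluates Perl code with the entry in $_, whereas here a Python callable stands in for <Perl code>. The function names and sample data are hypothetical.

```python
# Conceptual analogue only: the yaml tool evaluates Perl code with the
# current entry in $_. Here a Python callable plays that role.
# All names and the sample data are hypothetical.

def transform(entries, func):
    """Apply func to every top-level entry and collect the results."""
    return [func(entry) for entry in entries]


def grep(entries, predicate):
    """Keep only the top-level entries for which predicate is true."""
    return [entry for entry in entries if predicate(entry)]


entries = [
    {"name": "alpha", "size": 3},
    {"name": "beta", "size": 12},
]

# transform: return a modified copy of each entry
upper = transform(entries, lambda e: {**e, "name": e["name"].upper()})

# grep: select entries by a condition
large = grep(entries, lambda e: e["size"] > 10)

print(upper)
print(large)
```

In both cases the result is again a list of data structures, which the tool would print in YAML format.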

yaml sort <file.yaml> <Perl code>

Sorts top-level data structures using <Perl code> as a comparison expression. The expression must compare $a to $b, as in the code argument of Perl's sort function. If the top-level data structure is an array, $a and $b are array elements; if it is a hash, $a and $b are hash keys, and a copy of the hash is stored in %H.
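The two sorting modes can be sketched in Python with a comparison function, which is the closest analogue to a Perl comparison expression. This is an illustration under assumptions, not the tool's implementation: the helper names are hypothetical, and the hash copy that the tool provides in %H is modelled as a third argument H.

```python
from functools import cmp_to_key

# Hypothetical helpers mirroring the two cases described above.

def cmp(a, b):
    """Three-way comparison, like Perl's <=> and cmp operators."""
    return (a > b) - (a < b)


def sort_array(items, compare):
    """Array case: compare(a, b) receives two array elements."""
    return sorted(items, key=cmp_to_key(compare))


def sort_hash_keys(table, compare):
    """Hash case: compare(a, b, H) receives two keys plus the hash itself
    (standing in for the %H copy that the tool provides)."""
    return sorted(table, key=cmp_to_key(lambda a, b: compare(a, b, table)))


packages = [{"name": "zlib", "size": 90}, {"name": "expat", "size": 150}]
by_name = sort_array(packages, lambda a, b: cmp(a["name"], b["name"]))

sizes = {"zlib": 90, "expat": 150, "perl": 40}
by_size = sort_hash_keys(sizes, lambda a, b, H: cmp(H[a], H[b]))

print(by_name)
print(by_size)
```

Having the hash available inside the comparison is what makes it possible to order keys by their values rather than by the keys themselves.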

yaml extract <file.yaml> <path>

Outputs an array of subordinate data structures in YAML format. <path> may contain ranges (endpoints separated by "..") or wildcards ("*").
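How ranges and wildcards select several sub-structures at once can be sketched as follows. Because the exact <path> syntax is not spelled out here, the sketch takes the path as an already split list of components and assumes that ranges are inclusive at both ends; both points are assumptions, as are the function names.

```python
# Sketch of path matching with ranges ("1..3") and wildcards ("*").
# Assumptions: the path is given as a list of components, and range
# endpoints are inclusive. Function names are hypothetical.

def _step(node, comp):
    """Return the children of node selected by one path component."""
    if isinstance(node, dict):
        if comp == "*":
            return list(node.values())
        return [node[comp]] if comp in node else []
    if isinstance(node, list):
        if comp == "*":
            return list(node)
        if ".." in comp:
            low, high = (int(part) for part in comp.split(".."))
            return node[low:high + 1]
        if comp.isdigit() and int(comp) < len(node):
            return [node[int(comp)]]
    return []


def extract(data, path):
    """Collect all sub-structures reachable via the path components."""
    matches = [data]
    for comp in path:
        matches = [child for node in matches for child in _step(node, comp)]
    return matches


data = [
    {"name": "a", "deps": ["x", "y"]},
    {"name": "b", "deps": []},
    {"name": "c", "deps": ["a"]},
]

first_two_names = extract(data, ["0..1", "name"])
first_deps = extract(data, ["*", "deps", "0"])
print(first_two_names)
print(first_deps)
```

Entries that lack the requested key or index simply contribute nothing to the result in this sketch; the tool's actual behaviour in that case may differ.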

yaml depends <file.yaml> <path> [ <node> ... ]

Prints a dependency tree in YAML format. Each top-level data structure is denoted by its key or index. The relative <path> describes where, within each top-level data structure, to find the sub-structure (scalar, array or hash) that contains its dependencies. If any <node>s are specified, only their dependencies are printed.

yaml rdepends <file.yaml> <path> [ <node> ... ]

Similar to the depends command, but prints the reverse dependency tree.
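The idea behind depends and rdepends can be sketched in Python as below. The sketch assumes that the dependencies have already been gathered into a mapping from each node to its list of dependencies, and it represents the tree as nested dictionaries; the tool's real output layout and its handling of cycles are not documented here, so both are assumptions, as are all names.

```python
# Sketch: turn a flat "node -> dependencies" mapping into a nested tree.
# Assumptions: nested dicts as the tree representation, and a "(cycle)"
# marker for circular dependencies. All names are hypothetical.

def dependency_tree(deps, roots=None):
    """Build the dependency tree for the given roots (default: all nodes)."""
    def subtree(node, seen):
        if node in seen:
            return "(cycle)"
        return {child: subtree(child, seen | {node})
                for child in deps.get(node, [])}

    roots = list(deps) if roots is None else roots
    return {root: subtree(root, frozenset()) for root in roots}


def reverse(deps):
    """Invert the mapping: for each node, list the nodes that depend on it."""
    rev = {node: [] for node in deps}
    for node, children in deps.items():
        for child in children:
            rev.setdefault(child, []).append(node)
    return rev


deps = {
    "app": ["libfoo", "libbar"],
    "libfoo": ["libc"],
    "libbar": ["libc"],
    "libc": [],
}

forward = dependency_tree(deps, ["app"])
backward = dependency_tree(reverse(deps), ["libc"])
print(forward)
print(backward)
```

The reverse tree is obtained by inverting the mapping first and then building the tree in the same way, which is why the two commands can share most of their logic in a sketch like this.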

yaml import <file> [ -t <type> ] [ -H ] [ -C <index> ]

Converts the data from a different file format to YAML. -t forces the input file type; otherwise the file extension is used to decide.

XML:
Imports an XML file as nested associative arrays with tag names as keys. Multiple tags with the same name are represented as an array of associative arrays. Tag attributes have "@" prepended to their keys. Content-only tags are represented as simple strings; tags that have both sub-tags or attributes and text content receive the content in a key that is a single double quote. Leading and trailing white space is removed from the content. Requires XML::Parser::Expat and its dependency expat.
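The mapping rules for XML can be illustrated with a short Python sketch based on the standard xml.etree.ElementTree module (the tool itself uses XML::Parser::Expat). It is a simplified approximation: only the text before the first sub-tag is treated as content, and wrapping the result in the root tag name is an assumption. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Sketch of the XML-to-nested-data mapping described above.
# Simplifications/assumptions: only text before the first sub-tag counts
# as content, and the root tag name is used as the top-level key.

def element_to_data(elem):
    # Attributes become keys with "@" prepended
    data = {"@" + key: value for key, value in elem.attrib.items()}
    for child in elem:
        value = element_to_data(child)
        if child.tag in data:
            # Repeated tag names are collected into an array
            if not isinstance(data[child.tag], list):
                data[child.tag] = [data[child.tag]]
            data[child.tag].append(value)
        else:
            data[child.tag] = value
    text = (elem.text or "").strip()
    if not data:
        return text          # content-only tag: plain string
    if text:
        data['"'] = text     # mixed tag: content under the key '"'
    return data


xml_source = (
    '<pkg name="yaml"><dep>perl</dep><dep>expat</dep>'
    '<note lang="en"> experimental </note></pkg>'
)
root = ET.fromstring(xml_source)
result = {root.tag: element_to_data(root)}
print(result)
```

Here the two dep tags end up as an array of strings, while the note tag, which has both an attribute and text content, becomes a hash with the keys "@lang" and a single double quote.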

HTML containing one or more tables:
Imports all tables in an HTML page as an array of arrays or of hashes (if table headers are present). The result will likely have to be edited, because tables are often used for layout and other purposes, so unwanted arrays are going to end up in the YAML output. Column spanning in headers concatenates neighbouring cells to form the value of the corresponding hash key. Column spanning in table data cells across multiple different headers causes the cell to be copied to all those columns. Multiple header rows at the top of the table will be concatenated column-wise, and subdivisions in a following row will cause the common first row header to be copied. When tables are nested, only the outer table will be reproduced, and the cells of the inner table(s) will be concatenated. Tables with headers in both the first row and the first column are not yet supported; they should be represented as a hash of hashes.

JSON:
Converted to YAML one-to-one. If available, JSON::XS is used to parse the JSON input; otherwise YAML::Syck is used, which should work with up-to-date JSON generators.

CSV:
Comma-separated value table according to RFC 4180. Fields may be quoted with double quotes, with original double quotes doubled in the quoted string. Quoting with single quotes or partial quoting of fields is not allowed. With -H, the first row is taken for table headers, and an array of associative arrays with these keys is output.
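The effect of -H on CSV input can be sketched with Python's standard csv module, whose default dialect follows the same quoting convention (double quotes, doubled inside quoted fields). The function name and the sample data are hypothetical, and the sketch says nothing about how the tool parses CSV internally.

```python
import csv
import io

# Sketch of CSV import with and without header handling.
# import_csv and its headers parameter are hypothetical names
# standing in for the tool's -H option.

def import_csv(text, headers=False):
    rows = list(csv.reader(io.StringIO(text, newline="")))
    if not headers or not rows:
        return rows                      # array of arrays
    keys, body = rows[0], rows[1:]
    return [dict(zip(keys, row)) for row in body]   # array of hashes


csv_text = 'name,comment\r\nyaml,"says ""hello"""\r\nsyck,"a, b"\r\n'

plain = import_csv(csv_text)
keyed = import_csv(csv_text, headers=True)
print(plain)
print(keyed)
```

Without headers every row, including the first, is an array of strings; with headers the first row is consumed and supplies the keys for all following rows.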

Plain text, assumed to contain a space-separated table:
-H causes the first row to be taken as table headers and an associative array data structure to be created from each row with those keys. The 0-based index passed after -C denotes the column to be used as a key for the top-level associative array then created; without -C, a top-level array is generated instead.
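The interplay of -H and -C can be sketched like this. The parameter names are hypothetical stand-ins for the two options, and keeping the key column inside each record is an assumption; the tool may treat it differently.

```python
# Sketch of importing a space-separated text table.
# headers stands in for -H, key_column for -C <index> (0-based).
# Assumption: the key column is also kept inside each record.

def import_table(text, headers=False, key_column=None):
    rows = [line.split() for line in text.splitlines() if line.strip()]
    if headers and rows:
        keys, rows = rows[0], rows[1:]
        records = [dict(zip(keys, row)) for row in rows]
    else:
        records = rows
    if key_column is None:
        return records                    # top-level array
    # Top-level associative array keyed by the chosen column
    return {row[key_column]: record for row, record in zip(rows, records)}


table_text = "name size\nzlib 90\nexpat 150\n"

as_array = import_table(table_text, headers=True)
as_hash = import_table(table_text, headers=True, key_column=0)
print(as_array)
print(as_hash)
```

Choosing a key column is useful when one column uniquely identifies each row, since entries can then be addressed by name instead of by position.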

DBF (dBase level 5 database file):
Imports a database table as an array of hashes with field names as keys. The deletion flag does not prevent a record from being imported but is itself imported as the value of the "_deletion" key. Leading and trailing white space is stripped from values. Fields of type L (boolean) are converted to 0, 1 or undef; all other fields are imported as the strings that represent them. Fields of type M (strings from a memo file) are therefore imported as the block index only; the DBT file is not parsed.

yaml export <file.yaml> <file.sqlite>

Converts a YAML file to a different format.

SQLite:
Converts an array of hashes to an SQLite database. This requires the DBI Perl module. A table named after the input file with columns named after hash keys will be created and filled. All column types are "numeric", which stores numerical data in numerical types. Non-scalar values will be stored as text containing YAML expressions.

Export will be refused if a heuristic decides that the hash keys are too diverse between array entries to make sense as database table columns. The output file may already contain a database, but if a table with the target name already exists, the export is also aborted.
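The export described above can be approximated with Python's standard sqlite3 module. This is a sketch under assumptions, not the tool's code: non-scalar values are serialised with json.dumps as a stand-in for YAML expressions, the key-diversity heuristic is left out because its criteria are not documented here, and all names are hypothetical.

```python
import json
import sqlite3

# Sketch of exporting an array of hashes to an SQLite table.
# Assumptions: JSON text stands in for the YAML expressions used by the
# tool, and the key-diversity heuristic is omitted.

def quote(identifier):
    """Quote an SQL identifier such as a table or column name."""
    return '"' + identifier.replace('"', '""') + '"'


def export_records(records, conn, table):
    exists = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
        (table,),
    ).fetchone()
    if exists:
        raise ValueError("table already exists: " + table)

    # Columns are the hash keys, in order of first appearance
    columns = []
    for record in records:
        for key in record:
            if key not in columns:
                columns.append(key)
    if not columns:
        raise ValueError("no hash keys to use as columns")

    column_defs = ", ".join(quote(c) + " NUMERIC" for c in columns)
    conn.execute("CREATE TABLE " + quote(table) + " (" + column_defs + ")")

    placeholders = ", ".join("?" for _ in columns)
    insert = "INSERT INTO " + quote(table) + " VALUES (" + placeholders + ")"
    for record in records:
        values = [record.get(c) for c in columns]
        # Non-scalar values are stored as text
        values = [json.dumps(v) if isinstance(v, (list, dict)) else v
                  for v in values]
        conn.execute(insert, values)
    conn.commit()


records = [
    {"name": "zlib", "size": "90", "deps": []},
    {"name": "expat", "size": "150", "deps": ["libc"]},
]
conn = sqlite3.connect(":memory:")
export_records(records, conn, "packages")
rows = conn.execute(
    "SELECT name, size, typeof(size), deps FROM packages ORDER BY size"
).fetchall()
print(rows)
```

Because the columns are declared NUMERIC, SQLite stores strings that look like numbers, such as "90", as integers, while other strings remain text. This matches the behaviour described for the "numeric" column type.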