HackML — home page and intro

HackML is a simple (and I do mean simple) XML application for the publishing industry. The purpose is to enable article authors, and especially external or temporary contributors, to quickly and easily ‘tag up’ an article as they write it. The article might end up as a web page, a PDF for print, or any number of other electronic output destinations.

An article–centric XML markup vocabulary

by Ian Tindale

HackML is my own idea for a pretty bare–bones and skimpy markup vocabulary, enabling an article author, writer or editor to ‘tag up’ an article as it is written or edited. This will impart semantic meaning to the written structure, with at least enough differentiation of parts to enable useful layout and processing to occur.

HackML isn’t particularly fully featured. It is, in fact, pretty much a no–brainer, offering quite basic functionality that more or less translates to some of the choices available to an HTML author, combined with some of the choices available to the average magazine layout designer. Hopefully, one could impart everything a writer needs to know within five minutes. If that writer met someone else who also knows HackML, the two of them would probably know about the same amount.

Easy peasy lemon squeezy

Whilst the notion of XML markup vocabularies for editorial use is laudable, most of the current offerings seem to luxuriate in their complexity and expansiveness, almost to the extent of losing all but the most academic or tolerant of potential users.

Life’s not like that. People who write articles are not necessarily keen on the prospect of learning a whole bunch of new technological structures, especially as, up until recently, the art editor, the subeditor or someone else would have done it all for them.

In order for ‘tagging up’ to occur at the authorship phase, without a lot of resistance or apathy, we need a very simple, very easy, very uncomplicated and bare minimum set of ‘tags’ that almost any writer can understand. Simple enough to use without confusion or distraction. Hence the name — ‘HackML’ — from the idea of a hack writer, who would much rather finish the article and get down to the pub than bury their head in page 422 of the DTD of their organisation’s markup arcana.

Not quite like HTML

You might notice that although HackML seems vaguely similar in complexity and implementation to HTML, there are significant and important differences. Unlike HTML, HackML has no ‘head’ / ‘body’ separation. As a rough analogue, instead of a ‘head’ there is a ‘meta’ element set, into which we can place much information that either doesn’t actually get rendered, or does get rendered but not directly within the flow of the body copy galley. The ‘meta’ element is actually quite functionally crucial for many transformation activities regarding presentation and layout.

The ‘copy’ element isn’t quite so global as the ‘body’ is in HTML, where it seems to encompass everything intended to be rendered. Instead, our HackML ‘copy’ element is simply a container for body copy, in the sense of a flow of galley. Renderable elements such as banners, headlines, straplines, standfirsts or figures can exist outside the ‘copy’ element, or be derived from content within the ‘meta’ element.

Figures follow the copy element because they aren't part of the flow of galley copy (although they can be anchored to a point in it if desired). Figures can consist of passages of text, tables or lists, or else illustrations, photos and the like. The latter can form a composite unit consisting of photo, caption and credit elements.

Another important difference between HackML and HTML is that HackML doesn't use the ‘p’ element as a container for paragraphs. Instead, any text within a ‘copy’ element is taken to be a continuous run of galley, and any specific element (specified in the XSLT transformation) interspersed between text nodes will break the text run into paragraphs.

For example, the following markup...

One of the interesting things about insects is that you can divide them into two distinct types.<q/>The ones that have incomplete metamorphosis simply hatch out from their eggs as small replicas of the adult form. Each successive moult will see them grow bigger.<q/>The ones that have complete metamorphosis are those that have distinct stages of development. From an egg, they will hatch out into a larva, which then pupates, finally hatching out into the full grown adult.<q/>Each stage of the complete metamorphosis life cycle is quite different from the preceding and following.

...should result in the following:

One of the interesting things about insects is that you can divide them into two distinct types.

The ones that have incomplete metamorphosis simply hatch out from their eggs as small replicas of the adult form. Each successive moult will see them grow bigger.

The ones that have complete metamorphosis are those that have distinct stages of development. From an egg, they will hatch out into a larva, which then pupates, finally hatching out into the full grown adult.

Each stage of the complete metamorphosis life cycle is quite different from the preceding and following.

Fig. 1: quadding

Break paragraphs with a quad

There is a general–purpose paragraph–breaking element, known as a ‘quad’ (which is an old typesetting concept), here represented as the ‘q’ element (an empty element), which marks paragraph breaks. This is ‘the other way round’ to most markup languages such as HTML, where paragraph elements such as the ‘p’ element must contain the text (see Fig. 1).

The reasoning behind this is that it is not quite so intuitive to have to begin typing an opening tag, then the paragraph text, then the closing tag, if you’re in the full creative flow of writing. It suits non–experts much better to simply bang in paragraph breaks where they occur. This we do with the ‘q’ element.

The later transformation stage would find it quite easy to search within the ‘copy’ element for text nodes and element nodes, and to ‘encapsulate’ the text nodes between the ‘q’ element nodes within ‘p’ elements for XHTML output, or ‘block’ elements for XSL–FO to PDF output.
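As a rough illustration of that encapsulation step, here is a minimal sketch in Python rather than XSLT. It assumes the ‘copy’ element arrives as a well–formed XML string, and the function names `copy_to_paragraphs` and `to_xhtml` are my own inventions for the example, not part of HackML. Any element other than ‘q’ is simply flattened to its text, which is a simplification.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def copy_to_paragraphs(copy_xml: str) -> list[str]:
    """Split the text of a 'copy' element into paragraphs at each empty 'q' element."""
    copy = ET.fromstring(copy_xml)
    paragraphs = []
    current = copy.text or ""  # text before the first child element
    for child in copy:
        if child.tag == "q":
            # A quad closes the current run of galley text.
            if current.strip():
                paragraphs.append(current.strip())
            current = ""
        else:
            # Assumption: other elements are flattened into the current run.
            current += "".join(child.itertext())
        # Text following the element belongs to the run now in progress.
        current += child.tail or ""
    if current.strip():
        paragraphs.append(current.strip())
    return paragraphs


def to_xhtml(paragraphs: list[str]) -> str:
    """Wrap each paragraph in a 'p' element, escaping markup characters."""
    return "".join(f"<p>{escape(p)}</p>" for p in paragraphs)


if __name__ == "__main__":
    sample = "<copy>First paragraph.<q/>Second paragraph.<q/>Third.</copy>"
    print(to_xhtml(copy_to_paragraphs(sample)))
```

Swapping the ‘p’ wrapper for an XSL–FO ‘block’ would only change the last function; the splitting logic stays the same.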

For example, the following markup...

...should result in the following:

1990    item    description
1997    item                   thing
2000    item    description    thing

Fig. 2: tabulation

Tabulation

There should be a simple tabulation structure (see Fig. 2). The presence of a ‘t’ element should indicate to the transformation that this line forms a row of a table. Similarly, successive lines containing ‘t’ elements would indicate that these are all part of one table. The number of table columns would be equal to the number in the line that has been separated with the most tabs. The other lines ‘left–range’, as it were.

At the barest minimum, there’ll just be a ‘t’ element. Chuck in as many tabs as you need. Two or more consecutive ‘t’ elements can be spotted at the transform stage and converted into the appropriate column span mechanisms.

Writers often don’t seem to understand tabs, and a single ‘t’ element is about the simplest markup for simple tabular work that is needed in most cases. Actual column widths are a design or presentation feature, as are types of justification of tab column content, so are not catered for in the basic HackML structure.
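To make the row and column rules a little more concrete, here is a hedged sketch in Python of how a transform might collect table rows. It rests on assumptions of mine: that a ‘line’ is the text between two quads, that every line containing at least one ‘t’ belongs to a single table, and that the sample markup below is a plausible guess at the input, not the actual source of Fig. 2. The function name `table_rows` is hypothetical. Empty cells are kept as empty strings rather than turned into column spans.

```python
import xml.etree.ElementTree as ET


def table_rows(copy_xml: str) -> list[list[str]]:
    """Collect lines containing 't' elements as rows, padded to the widest row."""
    copy = ET.fromstring(copy_xml)
    rows = []
    cells = [copy.text or ""]  # cells of the line currently being read
    has_tab = False            # only lines with at least one 't' are rows
    for child in copy:
        if child.tag == "t":
            # A tab starts a new cell on the current line.
            cells.append(child.tail or "")
            has_tab = True
        elif child.tag == "q":
            # A quad ends the line; keep it only if it was tabulated.
            if has_tab:
                rows.append(cells)
            cells = [child.tail or ""]
            has_tab = False
        else:
            # Assumption: other elements are flattened into the current cell.
            cells[-1] += "".join(child.itertext()) + (child.tail or "")
    if has_tab:
        rows.append(cells)
    # Column count is set by the line with the most tabs; shorter
    # lines are left-ranged by padding on the right.
    width = max((len(row) for row in rows), default=0)
    return [[c.strip() for c in row] + [""] * (width - len(row)) for row in rows]


if __name__ == "__main__":
    sample = (
        "<copy>1990<t/>item<t/>description<q/>"
        "1997<t/>item<t/><t/>thing<q/>"
        "2000<t/>item<t/>description<t/>thing</copy>"
    )
    for row in table_rows(sample):
        print(row)
```

A fuller transform would also need to start a new table whenever an untabulated line interrupts the run, and to decide whether an empty cell should become a span of its neighbour; both are left out here.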

For example, the following markup...

...should result in the following:

Fig. 3: lists

List constructions

Lists are kept quite simple too, and rely on flexibility in the design of transformations to discern patterns of list formation (see Fig. 3). The ‘l’ element simply marks the start of a list item. The end of a discrete list item is marked with a quad, as if you were marking it as an end of line. Complex lists might be set up with successive ‘l’ elements before a quad, or even combinations of ‘l’ and ‘t’ elements before a quad. I consider that it’s less likely that an author would tax these combinations fully, simple though the elements are.
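The simple case described above could be picked out along these lines. This is only a sketch in Python, under my own assumptions: the hypothetical function `list_items` treats an ‘l’ as opening an item and the next quad as closing it, and if a second ‘l’ turns up before a quad it just starts a new item, which sidesteps the complex combinations rather than handling them.

```python
import xml.etree.ElementTree as ET


def list_items(copy_xml: str) -> list[str]:
    """Return the text of each list item: from an 'l' element to the next 'q'."""
    copy = ET.fromstring(copy_xml)
    items = []
    current = None  # None means we are not inside a list item
    for child in copy:
        if child.tag == "l":
            # Simplification: an 'l' inside an open item closes that item.
            if current is not None:
                items.append(current.strip())
            current = child.tail or ""
        elif child.tag == "q":
            # The quad marks the end of the item, if one is open.
            if current is not None:
                items.append(current.strip())
            current = None
        elif current is not None:
            # Assumption: other elements inside an item are flattened to text.
            current += "".join(child.itertext()) + (child.tail or "")
    if current is not None:
        items.append(current.strip())
    return items


if __name__ == "__main__":
    sample = "<copy>Intro.<q/><l/>Ants<q/><l/>Bees<q/><l/>Wasps<q/>Outro.</copy>"
    print(list_items(sample))
```

Text outside any item, such as the intro and outro in the sample, is ignored by this function and would be handled by the ordinary paragraph logic.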

Of course, there’s nothing to stop the bolting on of additional complexity through additional namespaces. HackML keeps it simple, for people who just want to bang the copy in and go.

A diagrammatic table of the entire XML vocabulary of HackML showing what fits within which, and including a brief description of each element’s purpose, is viewable here. The current DTD for HackML is viewable here. This page is highly likely to sprout further explanatory guff in the fullness of time.