Thoughts about Scholarly HTML

by Thomas Arildsen

The company is working on a draft standard (or what I guess they hope will eventually become one) called Scholarly HTML. Its purpose seems to be to standardise how scholarly articles are structured as HTML, so that HTML can serve as a more semantic alternative to, for example, PDF, which may look nice but does nothing to expose the structure of the content and probably rather obscures it.
They present their proposed standard in this document. They also seem to have formed a community group at the World Wide Web Consortium. This does not appear to be a new initiative: there was already a previous project called Scholarly HTML, and the company seems to be trying to take the idea further from there. Martin Fenner wrote a bit of the background story behind the original Scholarly HTML.
I read the proposal. It seems like a very promising initiative because it would allow scholarly articles across publishers to be understood better by, not least, algorithms for content mining, automated literature search, recommender systems, etc. It would be particularly helpful if all publishers had a common standard for marking up articles, and HTML seems a good choice since you only need a web browser to display it. Reflow is another nice feature: I read a lot on my mobile phone and tablet, and it really is a pain when the content does not fit the screen. This is often the case with PDF, which does not reflow well in the apps I use for viewing. HTML would be much better here because, unlike PDF, it is not tied to a physical page.
I started looking at this proposal because it seemed like a natural next direction to explore after my crude preliminary experiments in Publishing Mathematics in e-books.
After reading the proposal, a few questions arose:

  1. From the way the formatting of references is described, it seems that references can only be of type “schema:Book” or “schema:ScholarlyArticle”. Does this mean that they see no need to cite anything but books or scholarly articles? I know that some people hold the IMO very conservative view that the reference list should only refer to peer-reviewed material, but this is too constrained, and I certainly think it will be relevant to cite websites, data sets, source code, etc. as well. It should all go into the reference list to make it easier to understand what the background material behind a paper is. This calls for a much richer selection of entry types. Biblatex’s entry types, for example, could serve as inspiration.
  2. The authors and affiliations section is described here. Author entries are described as having:

    property="schema:author" or property="schema:contributor" and a typeof="sa:ContributorRole"

    I wonder whether this way of specifying authors/contributors makes it possible to specify more granular roles, or multiple roles for each author, as for example Open Research Badges do?

  3. Under article structure, they list the following types of sections:

    Sections are expected to be typed using the typeof attribute. The following typeof values are currently understood:

    sa:Funding (which has its specific structure)

    I think there is a need for more types of sections. For example, I also see articles containing Introduction, Analysis, and Discussion sections, and I am sure there must be more that I have not thought of.
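
To make the content-mining argument a bit more concrete, here is a minimal sketch of how a script might pull the typed elements out of an article marked up this way. It uses only Python's standard library. The sample HTML is my own guess at what such markup could look like: only the attribute values quoted above (schema:Book, schema:ScholarlyArticle, schema:author, sa:ContributorRole, sa:Funding) come from the proposal, while the surrounding element structure and the names `SAMPLE`, `TypeofCollector` and `collect_types` are hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical sample: the attribute values are the ones quoted in the post,
# the element structure around them is an assumption, not the proposal's.
SAMPLE = """
<article>
  <span property="schema:author" typeof="sa:ContributorRole">A. Author</span>
  <section typeof="sa:Funding"><p>Funding statement.</p></section>
  <ol>
    <li typeof="schema:ScholarlyArticle">An article reference</li>
    <li typeof="schema:Book">A book reference</li>
  </ol>
</article>
"""


class TypeofCollector(HTMLParser):
    """Record (tag, typeof, property) for every element with a typeof attribute."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if "typeof" in attr_map:
            # property is optional, so fall back to None when it is absent
            self.found.append((tag, attr_map["typeof"], attr_map.get("property")))


def collect_types(html_text):
    """Return the typed elements of html_text in document order."""
    parser = TypeofCollector()
    parser.feed(html_text)
    parser.close()
    return parser.found


if __name__ == "__main__":
    for tag, typeof, prop in collect_types(SAMPLE):
        print(tag, typeof, prop)
```

A real RDFa processor would do much more (prefix resolution, nesting, multiple space-separated types), but even this crude pass shows why a shared vocabulary matters: a script like this can only recognise the section, reference, and role types that the standard actually defines.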