Conference Papers

Museums and the Web: An International Conference
Los Angeles, CA, March 16 - 19, 1997

Michel Vulpe
Founder and CEO
Infrastructures for Information Inc.

Show me which road I'm on


The promise of SGML is that if you separate form from content a miracle occurs. Data integrity is preserved across platforms, applications, and soup recipes. The reality is a return to the bad old days of proprietary hardwired solutions. Appreciating what SGML brings to the table requires an understanding both of its origins and of its characteristics. SGML is not what it appears to be at first glance.

In this paper, we'll look at

  1. where SGML came from
  2. how a business case can (or cannot) be made for SGML
  3. what alternatives exist today
  4. attempts to "fix" SGML, to make it address today's problems
  5. how SGML can be rescued from obscurity by redefining its role

    The Primeval Swamp -- Standards

    Standards allow us to share. They provide the framework for an agreement on how we exchange a number of things like information, ideas, goods and, of course, money. When a consensus is based on a false premise, or is overtaken by events, the standard which evolved out of it becomes, like any good dinosaur, extinct.

    Any economy whose assets are intangible is based on standards. The strength of an economy based on GNP (whatever that is), instead of gold, rests, in essence, on a shared agreement about how the concept of GNP reflects a tangible value. When agreement on the concept is destabilized, the entire edifice is shaken -- witness the currency crisis in most of Eastern Europe. But when the agreement on tangible value is destabilized, chaos results -- witness the current pyramid-scheme crisis in Albania.

    In the world of politics and economics one talks in terms of treaties and agreements. In the world of technology one talks of standards. Standards reflect a consensus on an issue of such importance to all signatories that they agree to agree because the cost of disagreeing is too high. The higher the cost of breaking a standard the more ingrained it becomes. Fundamental infrastructure standards such as ASCII, SQL and TCP/IP are seldom challenged: they may be expanded and evolved, but the vested interests will ensure that they remain around for a considerable period of time. As vested interests change, however, standards do come under challenge and are successfully removed from the active playing field.

    The infrastructure standard I'm here to talk about today is SGML. SGML is a computer language that allows one to describe the structure of information in a non-prescriptive manner, using the lowest common denominator encoding of 7-bit ASCII. It came into existence as a technology standard because of a shared interest in distributing information across multiple platforms. The vested interest however is not high -- because of the cost of implementing SGML, not many companies have "bought in". It has, at this point, all the appearances of a peripheral standard that promises much and delivers little -- with one extremely notable exception -- the World Wide Web and HTML.

    In the late 1970s, when SGML was conceived, the computing world was a very different place from what it is today. Closed systems were the norm, and interchange standards at the data level were only beginning to gain credence. (Oracle, the dominant player in the SQL database world, only started to 'take off' in 1989, the year SGML was formally adopted as an ISO standard.) The World Wide Web was a DoD project that was limited to a few academics, the military, and a small number of corporate players. The desktop was a piece of wood where the telephone, the pen, the file folder and the coffee cup led a harmonious co-existence. The problems of distributing information, expressed as documents, were very different from those of today.

    SGML was a logical and realistic response to the document-based information distribution problems of that time. In a world of proprietary closed systems, information distribution was a critical problem. Hardware and software complexities made the task arduous and prone to failure. Hard-copy duplication was the norm. (In fact it is only in the last few years that desk-top printing technology has gained market share at the expense of duplication technology.) The SGML standard seemed to offer a solution. By using a neutral, lowest-common-denominator way of conveying presentation structures to applications, electronic documents could, in theory, be distributed over simple networks to proprietary systems, and be processed at the other end. The only requirement was that each system had to have the SGML technology necessary to process the documents.

    The rub lay in the requirement for SGML technology. SGML is a complex standard that requires sophisticated tools and processes, time, reengineering, and education to support. SGML, like any technology, ultimately has to be cost justified. If the perceived benefits exceed the costs, then and only then will a standard be adopted.

    The route to maturity

    For a technology to become mature and be accepted as part of the infrastructure, the technology must be more than just elegant. It must exist and deliver within the business context. The technology highway is littered with amazing products and concepts that failed the test of business acceptance. SGML is no different -- it is a great concept that will fall by the wayside if it cannot satisfy the needs of business.

    The business case must answer three key questions:

    1. Is the problem big enough and expensive enough to justify the costs?
    2. Are there alternatives, either now or in the foreseeable future?
    3. What are my incremental options?

    Accounting algebra and cost justification

    Putting aside for the moment situations (such as CALS) where adoption is mandated, the answer to the first question appears to be clearly "yes, the problem is large and costly". For instance, information production in the heavy manufacturing sector is estimated at 20-25% of capital equipment costs [1]. That number does not account for the estimated 80% redundancy factor [2] or the 40% formatting costs [3]. For information product organizations, such as publishers and government, the cost factor is even higher.

    A US Navy aircraft carrier would sink under the weight of its tour-of-duty paper maintenance manuals, and a Boeing 747 can't lift its own paper documentation. To these potential consumers, if SGML can do no more than provide the documentation in a form that eliminates the recurring paper costs, then the investment is worth considering.

    The mantra of the SGML evangelist, that SGML will save millions by providing reusability and device independence, is targeted exactly at this class of problem. That positioning, however, is only one part of the three-part equation. The SGML solution must be subject to greater scrutiny. The next two questions are the most complex, and are those to which this discussion will now turn.

    Thank goodness for alternatives

    As any high-tech vendor will tell you, there is nothing more difficult than being the only vendor of a particular technology. Competition and alternatives provide the consumer with a framework within which they can understand a product. Alternatives are critical to proving the viability of a technology.

    As has been noted, SGML only became a formal standard in 1989. The world into which it was born was very different from the world in which it was conceived. This, of course, is irrelevant to those confronted with today's issues of distributing document-based information. To those decision makers, a large number of alternatives that did not exist in 1989 are available and viable today.

    The number of word-processors available on the market today is significantly smaller than it was five or ten years ago. So limited is the choice today that distribution of documents in one of the two dominant proprietary formats is not even questioned. Government, business and other organizations routinely distribute read-write documents in either MS-Word [4] or (increasingly less so) WordPerfect format with impunity, and with the reasonable expectation that the recipient will be able to process the document. In a worst-case scenario, RTF can be used.

    Electronic paper is another alternative. PostScript files are routinely used as a distribution format, as any extensive browser of FTP sites will know. Adobe's PDF (Portable Document Format, a PostScript descendant) allows the information manufacturer to distribute electronic replicas of their "paper products" at essentially no cost. A PDF document is generated quite simply: just change the printer driver in the printer selection picklist on your PC or Mac. Never mind that the standard screen is not 8.5 x 11 and that to be readable a document needs to be reformatted -- it works, it's cheap, and it satisfies the basic requirement. The World Wide Web and HTML (an SGML implementation) provide the optics of read-only delivery (how many people use File/Save As in Netscape?). HTML has the additional benefits of being fashionable, of supporting simple hypermedia links, and of costing very little to create -- most word-processors and page layout applications can now output very sophisticated HTML-encoded pages as a matter of course. Most importantly, all these technologies work transparently on all production hardware and software platforms.

    The traditional SGML argument of portability has been overtaken by technology. "Portability" is no longer the publishing problem it once was. Viable, low-cost alternatives exist, and there is no reason to believe that they will disappear. On the contrary, indications are that they will get cheaper and more robust.

    The whole ball of wax

    The final challenge is that of incremental change. Even discontinuous innovation must accommodate the legacy world. SGML, as traditionally delivered, demands a level of acceptance that is often far more than the consumer is willing to give. It has demanded that every paragraph, list item, and emphasis be encoded so that it can be the subject of innumerable customized scripts, when all we really wanted was to know what part numbers are being referenced in the user manual.
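
    As a rough illustration of the incremental alternative, suppose that only the part numbers in a manual are marked up and everything else is left as plain text. The element name "partno", the sample text, and the function name are assumptions made for this sketch; they come from no real DTD, and a regular expression stands in for a proper SGML parser.

```python
import re

# Hypothetical manual fragment: only part numbers carry markup.
# The element name "partno" is an assumption made for this sketch.
manual = (
    "Remove the housing and inspect the seal <partno>PN-1042</partno>. "
    "If worn, replace it and the gasket <partno>PN-2210</partno>. "
    "Refit the housing; the seal <partno>PN-1042</partno> must sit flush."
)


def referenced_parts(text):
    """Return each referenced part number once, in order of first appearance."""
    seen = []
    for number in re.findall(r"<partno>(.*?)</partno>", text):
        if number not in seen:
            seen.append(number)
    return seen


print(referenced_parts(manual))
```

    Nothing else in the text needs to be encoded to answer the question, which is the sense in which such an approach would be incremental.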

    The technology sector, despite all the hype, is extremely conservative. Although the pace of change is tremendous, it is, in most cases, supplier driven. New versions of the desktop applications suite require the latest version of the OS, that requires the latest version of the CPU, that in turn requires new memory configurations, and so on and so on. To the consumer, this cyclone of vendor-driven change is overwhelming. This is particularly the case when the new desktop app has "new and improved" features that the average user never uses. It has been argued, with some merit, that the WWW is a marketing phenomenon created to bolster demand for even bigger CPUs, applications and networks.

    The old adage, "if it ain't broke don't fix it", is the operational principle of most technology consumers. It took so long and cost so much to fund, develop, install, and stabilize the system that anything that upsets its delicate balance is hard to justify. Incremental upgrades are possible under controlled circumstances. New technology, especially anything that changes the process, destabilizes the environment, or otherwise "rocks the boat", is resisted.

    However, traditional implementations of SGML have not been incremental. Rather, they have demanded that consumers make significant changes to their processes as well as to their technology infrastructure. Studies show that document conversion costs can eat up 50-70% of the budget for implementing SGML. Even worse, traditional implementations have been point solutions without context.

    To develop SGML documents, the consumer is asked to adopt a specialized word processor that delivers one third of the functionality at four times the price of the existing word processor. Even worse is that accessing, let alone utilizing, the so-called intelligence in the SGML is a complex task requiring sophisticated programming skills. To add insult to injury, publishing from SGML is difficult, and often requires a post-process pass to correct visual errors that are difficult to capture algorithmically. In short, to take advantage of SGML, the consumer has been forced to consider wholesale reinvestment in technology and processes.

    Increasing complexity through simplification

    Complex technologies must be surrounded by a suite of supporting tools. Data management systems need application development tools, data verification tools, recovery tools and so on. Communications systems need configuration tools, monitoring tools, and connectivity tools. Widespread adoption of the core technology depends on such tools.

    SGML advocates have been troubled by the lack of success of their chosen religion. They have identified problems with the infrastructure surrounding the standard. One core problem is that of presentation. Since SGML is descriptive, not prescriptive, some generic means of providing instructions to the presentation technology was needed, otherwise each SGML implementation became a complex custom effort. FOSIs [5] were developed as a solution. The problem with FOSIs is that they require complex SGML technology to implement. On top of that, DSSSL [6] was put forward as an intermediate form between source and formatted output.

    The intent is good: the "complete" SGML solution does not exist, and, in response to that, these "rounding out" technologies are being developed. The problem with them is the same as with SGML: they are out of context with the existing technology infrastructure. To the consumer, they increase the complexity and costs of the solution without providing any obvious benefits over their present system. The infrastructure solution the consumer is looking for lies not in expanding the scope of the replacement, but in the leveraging of the current investment.

    To address the problem of complexity, subsets of SGML are being considered. XML [7] has been proposed as an "SGML lite". Various industries, such as health care, are looking at the success of the WWW and its adoption of a hardwired SGML solution as a simplification strategy. These industries are trying to develop HTML equivalents on the theory that, if it worked for the WWW, it will work for them. However, "dumbing down" the technology does not solve the problem and, by creating confusion in the market, may do more harm than good.

    What these "solutions" share is a desire to realize the benefits of SGML through simplification. The viability of these strategies, however, is highly questionable. XML is unlikely to attract a wide following because its technology demands are only marginally less than those of its parent, SGML, and some of SGML's most important benefits, such as a rigorous abstract data language and system portability, are heavily compromised. The WWW's use of HTML, the most important SGML success story, is unique and unlikely to be replicated. The phenomenal marketing success of one remarkable visionary company, Netscape, short-circuited the normal shakeout process. The result was that what took over ten years in the word-processing marketplace was accomplished in a few months in the marketplace of WWW browsers. Moreover, the success of HTML is due to its limited objectives within a very simple paradigm: simple text presentation on a device-independent screen.

    So where to now St. Peter

    Before we dismiss SGML out of hand, it is perhaps worth looking at it one last time through a different lens. Has the adoption strategy to date taken us down the wrong road, and, if so, what is the right road? Instead of the road of document-based information interchange, consider the road of structure. After all, one of the selling points of SGML is that it provides structure to documents.

    Traditionally, SGML has been used to express the structure of paper documents in Document Type Definitions, or DTDs. The DTD (a data schema for a document) provides the application that uses it with the organizational rules of how objects within the document relate to each other structurally. These rules guide the application through the process of building the document. They tell us that one element is part of another, and that an IDREF creates a hyperlink to an ID. These rules are embedded as SGML markup in the target document, hence SGML becomes a text-encoding technology. But look what happens if one removes the phrase "within the document" from the description of the DTD above: one is left with SGML as a technology for describing structural relationships: "how objects relate to each other structurally". The objects can be any pair of related things. There is nothing intrinsic in SGML that says it is limited to text. Furthermore, if one removes the embedding and uses an associative model [8], the "text encoding" limitation is eliminated -- structure need not be implemented as explicit markup within the data.
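
    One way to picture structure held outside the data is the following minimal sketch. It is an illustration under stated assumptions, not the associative model the paper cites: the object IDs, the relation names "part-of" and "refers-to", and the helper functions are all hypothetical. The content objects carry no markup and need not be text; every structural statement lives in a separate list.

```python
from dataclasses import dataclass

# Hypothetical content store: the objects themselves carry no markup.
# They need not be text; "obj2" stands for an image file.
content = {
    "obj1": "Replace the fuel filter every 500 hours.",
    "obj2": "fuel-filter-diagram.tiff",
    "obj3": "Part 4711-A: fuel filter",
}


@dataclass(frozen=True)
class Relation:
    """One structural statement held outside the data (illustrative only)."""
    kind: str    # "part-of" or "refers-to"; both names are assumptions
    source: str  # ID of the object the statement is about
    target: str  # ID of the object or container it relates to


structure = [
    Relation("part-of", "obj1", "manual"),
    Relation("part-of", "obj2", "manual"),
    Relation("part-of", "obj3", "parts-list"),
    Relation("refers-to", "obj1", "obj3"),  # plays the role of IDREF -> ID
    Relation("refers-to", "obj1", "obj2"),
]


def members(container):
    """IDs of the objects declared to be part of the given container."""
    return [r.source for r in structure
            if r.kind == "part-of" and r.target == container]


def links_from(obj_id):
    """IDs that the given object refers to."""
    return [r.target for r in structure
            if r.kind == "refers-to" and r.source == obj_id]


def dangling_links():
    """IDREF-style check: every link target must be a known object ID."""
    return [r for r in structure
            if r.kind == "refers-to" and r.target not in content]


print(members("manual"))
print(links_from("obj1"))
print(dangling_links())
```

    Because the relations are ordinary data, the same objects could be regrouped or relinked without touching the content itself.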

    Data redundancy, hypermedia, just-in-time manufacturing, real-time delivery: these are just some of the issues confronting the modern information manufacturer. Data redundancy is a problem of a manufacturing model that hasn't changed in thousands of years. That model says that it is easier and cheaper to rebuild a piece of information than to reuse it. Why? Because we didn't know the information already existed, and even if we did, the cost of finding it exceeds the cost of rebuilding. Hypermedia? We would love to have it, but how do we describe and manage these complex relationships? Simple static URLs on the WWW are hard enough to track. What happens when one moves to a transaction-based model in which information is self-configuring, and the manufacturer and consumer interact in real time?

    These problems, at their core, have little to do with publishing and everything to do with data management. Providing solutions to these problems is of vital importance to the information manufacturer. Thinking of SGML as a data technology rather than text-encoding technology makes it a possible candidate technology to solve the problems of information manufacturing.

    Will it successfully pass the triad of business questions we posed earlier?

    1. Is the problem big enough and expensive enough to justify the costs?
    2. Are there alternatives, either now or on the horizon?
    3. What are my incremental options?

    Clearly the cost problem has not gone away, and the evaluation factors remain the same.

    In terms of alternatives, there are none presently available. PostScript and PDF formats solve the portability problem, but do not solve the problems of managing information. Attempts to utilize existing (read SQL) technology have repeatedly met with limited success. Object databases are on the event horizon and have been for a considerable period of time, but attempts in this area have also met with limited success.

    The final point, and in the case of taking a new look at SGML the most interesting, is the incremental issue. By redefining SGML as a data schema language that is descriptive in nature and works on an associative model of structure and content, the incremental approach becomes entirely viable. SGML becomes a technology that can be played out in the background, within the existing infrastructure, that is, using existing WP and DTP tools, at whatever pace is deemed necessary.