An open continuous media environment on the Web


Bill Simpson-Young and Ken Yap, CSIRO Mathematical and Information Sciences and Advanced Computational Systems Cooperative Research Centre, Building E6B Macquarie University Campus, Locked Bag 17, North Ryde NSW 2113, Australia. Phone: +61 2 9325 3155 Fax: +61 2 9325 3101 bill.simpson-young@cmis.csiro.au Home Page [HREF1] ken.yap@cmis.csiro.au Home Page [HREF2]


Keywords

World Wide Web, Continuous media, Media description formats


Abstract

There is now widespread use of continuous media such as video and audio on the Internet and intranets, and a large number of applications supporting these media types. For the full potential of continuous media to be realised, however, it is important that they not merely be supported in disparate applications, each making its own use of video and audio data, but that the application architectures, protocols and formats supporting these applications be open and shared, in the same way that is familiar for the generation, serving and processing of traditional Web content such as HTML documents and images.

In this paper, we identify and discuss the need for an open approach to continuous media, and identify issues that should be addressed so that the continuous media architecture can follow in the footsteps of the traditional Web architecture and pave the way for a large range of applications yet to be invented.


Introduction

This paper has two broad goals. The first is to identify and discuss the need for an open approach to continuous media. We first examine some applications and scenarios that are or will be possible in a network environment with effective support for video and audio content processing. We then look at the current state of affairs with regard to the support of continuous media such as video and audio on the Internet and intranets, and at some current standardisation efforts in this area. Next, we examine the traditional Web architecture and some of the technical aspects that allowed it to become not just an architecture for a single application domain (ie distributed hypertext) but an enabling architecture that facilitated the emergence of an enormous number of applications and led to the phenomenal acceptance and growth of the Web. Finally, we bring all of this together and ask whether lessons learned from these aspects of the traditional Web architecture can be applied in the continuous media domain, and hence identify the need for an open approach to continuous media on the Web.

Once we have shown the need for such an approach, we address the second goal of identifying some of the issues that should be addressed on the road to such an open architecture for continuous media. We travel only sufficiently far along this road to indicate the sort of issues that we believe exist and the sort of ways in which these issues can be addressed.

We have chosen to limit the scope of the paper in several regards. Firstly, we emphasise stored media over live media such as conferencing. This means that we are only covering a portion of the types of applications and scenarios possible in a network with effective continuous media support, but it allows us to limit our discussion to a scope that can adequately be addressed in this paper. We are also interested here in the application layer only and will not be addressing issues at the transport layer. Just as a discussion of the Web architecture could choose whether or not to include a discussion of TCP (Stevens 1994), we have chosen not to discuss transport-level issues for continuous media such as those addressed by RTP, the Real-time Transport Protocol (Schulzrinne et al 1996 [HREF3]).

The need for an open approach to continuous media

Applications in an open continuous media environment

With the availability of digital video and audio technology, the increasing bandwidth of networks (especially in local area networks), and the almost universal use of intranets and the Internet, the stage is set for the emergence of various video and audio handling applications. These go far beyond the basic VCR-like video viewing applications that are prevalent today. In this section, we will describe some examples of applications that would be possible in such an environment and look at the sorts of functionality required to support these applications. Most, if not all, of this functionality is available now and some is available in an intranet or Internet setting. However, the functionality is often available in separate applications and not in the integrated fashion that would be available in the type of open continuous media environment being discussed in this paper.

In an intranet environment with effective support for video applications, there could be widespread use of video as an organisational memory, with videos of meetings, events, seminars, etc being readily accessible. For this purpose, videos would be stored in a central or distributed video repository accessible from each workstation. Minutes of meetings would contain timecodes that can be used to facilitate access to the corresponding location in the video (even though the minutes may not represent the actual sequence of events in a linear way). Staff could include references to specific locations or ranges within the video in email or documents, and these references could be used by those reading the document. While viewing a video, a user would be able to request all references or annotations that apply to the current position in the video. To aid in searching the vast quantities of video and audio content, there would need to be navigation facilities such as automatically extracted key frames corresponding to video events (eg a change of slides) or audio events (eg a change of speakers) (Yap et al 1996), video and audio summarisation facilities, and content-based search facilities (eg searching for specific shapes within the video or specific words within the audio) (Brown et al 1996).

Such an environment would also support frequent use of just-in-time training using networked training videos. This type of application would require similar functionality to that described above as would other applications such as video news access, video briefing systems, etc.

From an examination of these sorts of uses, it is possible to identify some general functionality that would be shared by many of these applications. For instance, users should be able to:

- play stored video and audio from their own workstation;
- follow a reference (eg from minutes, email or other documents) to a specific location or range within a video;
- create such references while viewing and include them in email or documents;
- request all references or annotations that apply to the current position in a video;
- navigate long recordings using aids such as automatically extracted key frames and video and audio summaries;
- search video and audio content directly (eg for specific shapes within the video or specific words within the audio).

This is an indication of the sort of functionality possible in a networked video environment and is only a small sample of the many possibilities. The technology for all of this functionality is available now - the key to its successful deployment is to implement it in an open manner with applications interfacing effectively to each other using appropriate standards and conventions.

Continuous media applications and architectures

In this section, we will look at the current state of affairs with regard to the use of continuous media on intranets and the Internet and the current standardisation efforts in this area.

Continuous media have been used in one way or another on the Internet and intranets for almost as long as the Web architecture has existed. Initially, of course, video and audio data were treated just like any other media type: downloaded in full and then played by a helper application that supported the particular media type. Since 1994, numerous applications have become available that stream audio and video data across the network as it is being played; these usually use proprietary control and transport protocols, and often proprietary media formats.

Many of these applications provide impressive quality media over low bandwidth Internet connections and functionality above and beyond the basic play/stop/random-access media controls. For instance, RealMedia (formerly RealAudio) from Progressive Networks supports video image maps which provide hypermedia hotspots in video, and media synchronisation files that are used to retrieve and display content when specific positions within the video or audio content are played.

In the standardisation area, work is continuing on RTSP, the Real Time Streaming Protocol (Schulzrinne et al 1997[HREF4]), which is at present an Internet Draft by Henning Schulzrinne from Columbia University (author of RTP), Anup Rao from Netscape Communications and Rob Lanphier of Progressive Networks. RTSP is an application-level protocol that provides a mechanism for establishing a session with a live or stored media source using any of a number of different types of delivery channel (such as UDP, Multicast IP, TCP and RTP-based mechanisms) and for controlling the media using methods such as PLAY, PAUSE, etc.
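
As an indication of the flavour of the protocol, a control exchange between client and server might look something like the following. This is only a sketch modelled on the draft's HTTP-like message syntax; the URL, sequence number and session identifier are invented for illustration:

        PLAY rtsp://mediaserver.cmis.csiro.au/abc/four_corners/97393 RTSP/1.0
        CSeq: 3
        Session: 4231

        RTSP/1.0 200 OK
        CSeq: 3
        Session: 4231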

The prospect of an Internet or intranet environment with a wide range of applications supporting RTSP is exciting and not very far off. In such an environment, it will not be necessary to match the media client to the media server providing the service: you will be able to choose your client software on its virtues rather than being required to use a specific product. However, is agreement on a media streaming protocol sufficient to provide the kind of environment we are talking about? The environment made possible by such agreement can be compared to free-to-air television, where there are a large number of competing brands of television set ("TV clients"), each of which can be used to view the same channels. In that case, the major functional differences between the sets are limited to size, picture quality and cost. Will this be the extent of the effect of RTSP, or will it provide an open architecture for continuous media similar to that provided by the traditional Web architecture for non-continuous media?

In order to answer this question, we need to first look at the technical reasons for the traditional Web architecture being so successful, not only in addressing the specific application area of distributed hypertext, but in providing an enabling architecture within which a large number of applications have emerged. We do this in the next section.

The Web as an open architecture

There are many reasons for the phenomenal success of the World-Wide Web architecture but the key ones stem from specific characteristics of the three standards at its core: the HTML format, the HTTP protocol and the URL addressing scheme. This section discusses some of the characteristics of these standards (in isolation and working together) that have been important in the Web architecture taking on the role it has today.

In looking at the success of the Web architecture, it is important to distinguish the success of the architecture from the success of the World-Wide Web itself. It could be argued that the architecture was not important, that it was just the content that mattered, and that the Web architecture as we know it simply happened to be in the right place at the right time. How successful the architecture has been is evident, however, when we see how widely intranets are used for many aspects of workplace information systems, such as querying databases and filling in workplace forms. In such cases, the architecture is not being used for access to worldwide information sources but is replacing existing local information systems. It clearly has characteristics that make it a very useful architecture.

Some of the important characteristics of the Web architecture include the following.

1. Simple text-based data format

The simplicity of HTML was an absolutely critical factor in the widespread use of the Web. The fact that any person with no or minimal training could create HTML pages by hand-coding meant that a critical mass of HTML material was available on the Internet sometime in 1993, which led to an acceleration in the number of people both accessing and providing information on the Internet that has yet to slow down. If HTML had been a binary format (eg using an ASN.1 encoding) that required HTML-specific editing software to create, or if it had used a complex SGML DTD (Goldfarb 1990) (eg one that required more sophisticated hypermedia constructs using HyTime (International Organisation for Standardisation 1992)), it is highly unlikely that it would have received the widespread use it has today. There was in fact an attempt to have the hypermedia referencing in HTML use HyTime, but this did not gain support from the Web developer community. The simple format also made it straightforward to build applications that could generate or process HTML, and this led very quickly to a large number of applications supporting the format.

2. Text-based protocol

Similarly, the text-based nature of HTTP ensured that HTTP servers and clients were reasonably straightforward to prototype and implement, and many of these appeared (and disappeared) in the early days of the Web, providing great opportunities for experimentation and interoperability testing by a large number of developers in the Internet community. The text-based approach is also self-documenting, which facilitates extension of the protocol and experimentation with additional capabilities (eg because an error message sent from server to client includes not only an error number but a full textual description, new error codes can be introduced in servers and existing clients will still display meaningful messages to the user). The ease of implementation also facilitated the inclusion of HTTP support directly into many and varied applications (such as spiders and database engines).
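
For example, a complete HTTP/1.0 request and response can be typed and read by a person; a minimal exchange (response body omitted, path invented for illustration) might be:

        GET /reports/1997/summary.html HTTP/1.0

        HTTP/1.0 404 Not Found
        Content-Type: text/html

Note that the status line carries both the numeric code and the reason phrase "Not Found": an old client receiving a status code introduced after it was written can simply display the accompanying phrase to the user.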

3. Limiting of each standard to a very specific role

Another extremely important aspect of the Web architecture has been the concept of dynamic document generation (for example, HTML generated on the fly by server-side scripts). This was possible because the HTTP standard was not limited to being a protocol for file retrieval (as the semantics of the ftp protocol, for example, make it) and the URL address space was not limited to being a method for specifying remote files. The important point here is that the Web standards were specifications for a very limited scope of the whole domain, so their use was not limited to a specific application. This allowed an enormous range of applications to be built on the Web architecture, far exceeding the distributed hypertext for which the Web was initially used. If the Web architecture had not provided room for dynamic document generation, there would have been no interfaces to databases, no search engines and none of many of the other functionalities we take for granted on the Web.

The protocol was also developed to be independent of the actual media type, media format and encoding type of the data for which it was used. The use of MIME (Freed et al 1996 [HREF5]) to label data again made the architecture more flexible. This, combined with dynamic document generation, also led to the interesting use of the Web for services that automatically generated images on the fly such as for map navigation and graphical display of dynamic data such as share prices.
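
As a minimal sketch of dynamic document generation combined with MIME labelling (in Python; the price and wording are invented for illustration), a CGI program behind an HTTP server might be as simple as:

        #!/usr/bin/env python3
        # Minimal CGI sketch: the HTML below is generated on the fly rather
        # than read from a file, and the Content-Type header is the MIME
        # labelling that tells the client what kind of data follows.
        import sys
        import time

        body = ("<html><body><p>Share price at %s: $4.56</p>"
                "</body></html>" % time.ctime())
        sys.stdout.write("Content-Type: text/html\r\n\r\n")
        sys.stdout.write(body)

The client is entirely unaware that the page did not exist until the moment it was requested; the same mechanism serves generated images just as easily by emitting a different MIME type.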

By way of comparison, an example of the type of standard that we believe is not appropriate for the Internet world is the DSM-CC standard used by the DAVIC Consortium [HREF6] in the area of interactive television. The part of the standard dealing with the application level (the user-user interface) provides support for media control (start, stop, jump, etc), service connection (attach, detach, etc) and directory operations (eg open a list of movies, get the service specification for movie X). The standard has embedded within it very strong assumptions about the exact applications that will use it. For example, the directory operations are very much based on the video-on-demand idea that there is a list of movies from which the user chooses, and do not seem to be extensible to all the other ways in which video playback might be initiated. This might be appropriate for the interactive television world but does not provide the open, extensible environment on which the Internet thrives.

4. Open, non-proprietary standards

Of course, the Web standards also share the benefits of other standards developed in an open manner within the Internet community. It has been critical for the Web's success that the Web standards be open and non-proprietary and that there be freely available reference software. Rutkowski (1994) [HREF7] gives a good discussion of the advantages of the Internet model for standards-making and lists the attributes of this process. (It should be noted, though, that the Web standards did not progress through the stages of becoming actual Internet standards until well after they were in widespread use.)

In this section, we have shown how specific attributes of the Web standards have allowed the Web architecture to be an enabling architecture and to facilitate the evolution of a very broad range of applications using that architecture. We believe it is important, now that audio and video processing applications are becoming available over the Web, that the architecture used for such applications also be an enabling architecture that opens up a whole range of possible applications, rather than producing a situation where each application is an island of functionality.

In the next section, we will discuss whether the emerging approaches to continuous media are going to follow this tradition.

Is the current approach open?

The current draft of RTSP looks like it will do well in following the spirit of the Web architecture as described in the previous section. It is interesting to look at the recent history of the development of RTSP and notice that it could easily have taken a different direction. The original proposal (Rao et al 1996 [HREF8]) used binary codewords, and it was not until February 1997 that the draft protocol was substantially changed to a text-based protocol very much in the tradition of HTTP. The current version borrows heavily from HTTP in areas such as the overall message syntax, the security mechanisms, the extension mechanisms and the response codes. Although being text-based introduces a small amount of overhead in message size, it carries the huge benefit that applications using the protocol can be developed more quickly, and we will hopefully see the same large number of supporting applications for RTSP as we have seen for HTTP.

Although RTSP is very likely to be widely adopted, are there other standards and conventions that should exist for continuous media, in order that the architecture for continuous media be enabling in the same way as the traditional Web architecture? The rest of this paper identifies and discusses several issues that need to be addressed for there to be an open continuous media environment of this kind.

Toward an open media environment

In the remainder of this paper, we will give indications of some of the issues that need to be addressed if the type of environment we envisage is to emerge in an open manner following on in the Web tradition. Several of these issues have been identified by the World-wide Web Consortium [HREF9] as possible areas for standardisation [HREF10].

The areas that we believe would benefit from a common approach include:

- continuous media description;
- continuous media structure;
- hypermedia linking from continuous media;
- synchronisation to continuous media content;
- continuous media referencing.

We don't believe this is an exhaustive list but we think these areas are the right place to start.

Continuous media description

If different applications are to make use of the same video and audio content on a network, there needs to be a way of sharing information about this content between applications and between people and applications. Some of this information may be represented within an encoded media stream, but there are advantages in having it available external to the stream (even if it is automatically extracted from the stream before first use). These advantages include: the information can be used by many different applications, including those that do not process the video stream itself; information in the stream can be overridden (eg where there is a need for local configuration of metadata such as the title); footage can be multi-purposed (eg where the same piece of footage is used at different times for different purposes and there is only one stored version); and the same metadata can be applied to multiple representations of the video (eg encodings at different qualities and resolutions).

The information requiring specification

We are not proposing a full metadata format for video and audio content as might be used for exchanging information between catalogue systems. The information we want to represent is that which can effectively be used by multiple applications in processing and presenting the media content itself.

The information that must be able to be represented in a continuous media description includes information required for accessing and displaying the media content as well as information for performing other processing on it. The information includes:

- descriptive information such as the title of the content;
- the location of the content (eg server and path) and the protocol by which it can be retrieved;
- the media type and encoding format, data rate and frame rate;
- the structure of the content (eg named clips and sequences of clips);
- hypermedia links from temporal and spatial regions of the content;
- synchronisation mappings from positions in the content to other resources.

The requirements for an interchange format

The continuous media information listed above will be shared between applications. Before we look at possible interchange formats for specifying this information, we list some desirable characteristics of such a format. It must be:

- human-readable;
- easy to generate by hand;
- easy for programs to generate;
- easy for programs to parse and process;
- extensible;
- overridable.

The first five of these requirements are necessary for the format to be open in the way described earlier in this paper. If any of these are missing, we are not providing the open architecture which has been so important for the success of the Web. The need for the overridable requirement will become clear when we discuss continuous media references.

Some of the information is clearly metadata and could be specified using a metadata approach such as Resource Descriptions (Hardy 1996 [HREF11]), the Warwick Framework (Lagoze et al 1996 [HREF12]), PICS (Krauskopf et al 1996 [HREF13]) or the Meta Content Framework (Guha 1997 [HREF14]). However, as will become apparent, much of the information is not metadata but is intended to specify structures and relationships that would be inappropriate to specify using such an approach. In a media description format, the information that is metadata would be specified using one of these metadata approaches, but that is not dealt with in this paper.

Possible interchange formats

At first glance, it might seem that the sort of format used as a session description by RTSP would be appropriate. An RTSP session is normally initiated by the client accessing a session description file that can be retrieved using HTTP, though the format of the session description file is outside the scope of the RTSP standard. The session description contains information about the media streams involved in a session, such as the language of audio streams, formats and delivery mechanisms, allowing the client to make appropriate choices before commencing a session with the appropriate source. Although it is not yet clear what session description formats will be used with RTSP (and there could be several), the possible formats include SDP (Handley et al 1997 [HREF15]) and, very much in draft stage, SDF [HREF16].

SDP was developed for describing sessions on the MBONE (Kumar 1995), the IP multicast backbone on the Internet, and has been widely used for this purpose in its present form. If it is to be used with RTSP, some modifications would need to be made to the protocol, as described in a working note on the matter [HREF16]. The information contained in an SDP session description includes such things as media information (type of media, media format, transport protocol, etc) and timing information. SDP uses single-character attributes to specify the information, as in the following example given in the SDP specification:

        v=0
        o=mhandley 2890844526 2890842807 IN IP4 126.16.64.4
        s=SDP Seminar
        i=A Seminar on the session description protocol
        u=http://www.cs.ucl.ac.uk/staff/M.Handley/sdp.03.ps
        e=mjh@isi.edu (Mark Handley)
        c=IN IP4 224.2.17.12/127
        t=2873397496 2873404696
        a=recvonly
        m=audio 3456 RTP/AVP 0
        m=video 2232 RTP/AVP 31
        m=whiteboard 32416 udp wb
        a=orient:portrait

SDP provides a mechanism for extensibility that involves using the "a" attribute to specify additional attribute/value pairs, as in a=orient:landscape, which might be specified for a shared whiteboard application.

Although it can be used for stored media, SDP was initially designed for live media and much of the expressible information reflects that background. Although it is extensible, the amount of extension that would be necessary would be awkward using the mechanism the format provides. We also believe that, although it has been designed as a text-based format that is simple to generate and parse, it is not particularly human-readable, and it is critical that a format for the type of use we envisage be easily human-readable.

SDF, on the other hand, is more easily human-readable and also allows nesting. However, it uses a LISP-like syntax with parenthesised lists of Horn clauses, and this can quickly become confusing if there is a lot of nesting. It is also very early work and it is not yet clear what extensibility mechanisms it will support. If the format is made extensible and is widely adopted as a session description format in conjunction with RTSP, it could be an appropriate base for the information we want to represent. In the meantime, however, we present an alternative proposal for the continuous media description format.

A continuous media description format

In our Continuous Media Description (CMD) format, the information is represented using SGML. We chose SGML syntax because it has an existing standard definition, allows us to use existing SGML parsers, is in the spirit of the SGML-type languages already used on the Web, and can be formally specified using an SGML DTD (or XML [HREF17]); beyond that, the syntax is as good as any.

Extensibility is handled by a convention that applications ignore entities of a type that they do not support. Note that this is different from the convention for HTML, whereby applications ignore tags that they don't support rather than the full entity enclosed by those tags.
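
For example, a hypothetical <camera> entity (the entity name and value are invented for illustration) inserted into a record by one application would be skipped in its entirety, value and all, by applications that do not know it:

        <title>Bishop's Move
        <camera>DXC-537, tripod-mounted
        <server>mediaserver.cmis.csiro.au:8888

An application that does not understand <camera> would still process <title> and <server> normally, whereas the HTML-style convention would drop only the unknown tag and still render its enclosed value.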

Here is a sample record showing the way in which basic information could be specified:

        <title>Bishop's Move
        <server>mediaserver.cmis.csiro.au:8888
        <location>/abc/four_corners/97393.mpg
        <type>video/mpeg;type=system,version=1
        <protocol>UDP
        <rate>1.5 Mb/s
        <fps>25
        <start>0:00:01.20
        <end>0:19:26.04
        <info>http://video_info.cmis.csiro.au/abc/four_corners/annotations/97393

The media type is a MIME type. Start and end times are specified as hours, minutes, seconds and frame number. The frame rate determines the acceptable range of values for the frame number. While this is less elegant than an absolute frame number, the notation is more human-readable as it gives a rough idea of the times involved. Software can of course convert the times to any internal representation convenient for programmers.
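
A sketch of such a conversion (in Python; we assume the component after the final dot is a frame count between zero and the frame rate minus one):

        def timecode_to_frames(timecode, fps):
            """Convert h:mm:ss.ff (ff being a frame count) to an absolute
            frame number at the given frame rate."""
            hms, frames = timecode.rsplit(".", 1)
            hours, minutes, seconds = (int(part) for part in hms.split(":"))
            return ((hours * 60 + minutes) * 60 + seconds) * fps + int(frames)

        # For the record above: timecode_to_frames("0:19:26.04", 25) == 29154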

The descriptors can be created manually and stored on a server or, for large scale use, the descriptors will be generated automatically from metadata databases. When they are generated on-the-fly, there can also be modifications made to override information. For example, if a time offset is specified in a URL (see discussion later), the CMD data sent to the client will have been modified to take this into account. In such cases, it is desirable to retain the full original information and so it is necessary to use an inheritance mechanism wherein a descriptor record is modified by selective changes to some of the attributes. Thus the format must support nesting of records to allow an inheritance tree to be created.

The example shows only attribute/value pairs, so the value of using SGML (instead of a list of "attribute=value" lines similar to SDP) has not yet been demonstrated. We will, however, build on this base in the following sections and the advantages of using SGML will become apparent.

Continuous media structure

Different applications might need to make use of information about the structure of continuous media content, for example, if a reference has been made to a particular segment of a video by name and the name reference is to be resolved to time offsets.

We can use the CMD format already introduced to define structural information. The clip entity is used to define a clip of media content but the presence of this entity by itself in a CMD stream does not indicate any action that should be taken (such as opening and displaying the stream). On the other hand, the sequence entity defines a sequence of clips that are to be played in the order given. In this way, "virtual" videos can be constructed out of various clips around the network. Other directives would be used to specify concurrent playing of clips or alternate choices of clips (eg available from different servers).

An example of CMD data indicating a simple media sequence is:

        <clip name="video1">
        <title>Bishop's Move
        <server>mediaserver.cmis.csiro.au:8888
        <location>/abc/four_corners/97393.mpg
        <type>video/mpeg;type=system,version=1
        <protocol>UDP
        </clip>

        <sequence>
                <clip name="clip1" source="video1">
                <start>00:23:24.04
                <end>00:24:25.02
                </clip>

                <clip name="clip2" source="video1">
                <start>01:12:24.14
                <end>01:13:44.24
                </clip>
        </sequence>

The CMD data associated with a video could describe the full structure of the content at a high level for use by people and applications. For example, a video of a meeting could be described as a sequence of named clips for each item on the agenda with each defined with a start and end offset. References could be made to a clip using the name of the clip as described in the section on references to continuous media.
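
For instance, the CMD data for such a meeting video might run along the following lines (a sketch only: the clip names and offsets are invented, and we assume a clip named "meeting1" has been defined as in the earlier example):

        <sequence>
                <clip name="minutes_of_previous_meeting" source="meeting1">
                <start>00:00:30.00
                <end>00:05:12.10
                </clip>

                <clip name="budget_report" source="meeting1">
                <start>00:05:12.10
                <end>00:27:40.00
                </clip>
        </sequence>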

For specifying the time relationships, it might seem better to adopt an existing approach for specifying media relationships, such as the Open Media Framework [HREF18], which has been developed by Avid and is widely used within the film production industry for exchanging information between production tools. Another approach might be to adopt a full multimedia scripting language such as Macromedia's Lingo. In both cases, however, these formats were developed with different purposes in mind and neither fulfils all of the criteria we set out earlier in the paper. A better approach for specifying the time relationships might be to use a DTD that makes use of HyTime constructs in a very limited fashion (to retain simplicity), and this is something we intend to examine further.

Hypermedia linking from continuous media

For different applications to be able to make use of links from continuous media to other media (continuous or non-continuous), there needs to be a common approach to specifying these links. One possible approach is to specify these links in the encoded continuous media itself (either in headers or within the frames that are being linked). Clearly, there are many disadvantages with this approach including the need to have separate conventions for each encoding type (eg which of the user-definable fields to store this information in), the need for any application that wants to make use of this information (eg for editing it, analysing it, etc) to be able to handle the media format itself, and the difficulty of hand-coding the information.

The alternative approach is to have a common interchange format for the specification of this linking information. Such a format must fulfil the usual requirements of being human readable, easy to generate by hand, easy for programs to generate and easy for programs to parse and process.

There are two types of links that need to be able to be specified:

- links anchored to a temporal range of the media, ie applying from one time offset to another; and
- links anchored to a spatial region within such a temporal range, eg a rectangular area within the video frame.

One approach to specifying this information is to extend the CMD format and borrow partly from the HTML syntax for specifying client-side image maps.

A map with time-only links could be specified as follows:

        <map name="map1">
                <time start="00:12:32.02" end="00:12:35.03">
                        <href>http://video_info.cmis.csiro.au/info.html
                </time>
                <time start="00:13:34.04" end="00:13:36.02">
                        <href>http://video_info.cmis.csiro.au/overview.html
                </time>
        </map>
This might be used, for instance, for displaying shotlist information associated with archival footage where timecoded shotlist information exists but there is no need for more fine-grained links within frames. In this case, the CMD stream specifying the map can be generated on-the-fly from the shot database, and when the user clicks on a frame of the video, the specified URL is requested, resulting in the information being extracted from the shot database on-the-fly.

When area is also used to specify links, area entities can be nested inside time entities or vice versa. For example, a video containing a sequence of shots, each of which has objects in set positions, could have a map such as:

        <map name="map2">
                <time start="00:12:32.02" end="00:12:35.03">
                        <area shape=rect coords="0,0,118,28">
                                <href>http://video_info.cmis.csiro.au/info.html
                        <area shape=rect coords="184,0,276,28">
                                <href>http://video_info.cmis.csiro.au/overview.html
                </time>
        </map>
On the other hand, time entities can be nested inside area entities. This could be used, for example, for a video of a meeting made with a single stationary camera, where the presented slides occupy a consistent region of the image throughout the video but the link to be associated with that position is different for each slide (ie each of the time ranges). Of course, it is also useful to be able to have many levels of nesting with either of the entity types at each level.
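
A sketch of that meeting-video case, with time entities nested inside the area entity covering the slide region (the map name, coordinates and URLs are invented for illustration):

        <map name="slidemap">
                <area shape=rect coords="320,40,600,420">
                        <time start="00:00:00.00" end="00:09:59.24">
                                <href>http://video_info.cmis.csiro.au/slides/slide1.html
                        </time>
                        <time start="00:10:00.00" end="00:19:59.24">
                                <href>http://video_info.cmis.csiro.au/slides/slide2.html
                        </time>
                </area>
        </map>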

It is important to note that although we are proposing that this information be specified in the same format, it is not necessary that the video map information be contained in the same data stream as the descriptive information. For instance, an initial video descriptor in CMD format may be retrieved, and this may point, via the <info> entity, to a URL containing a video map in CMD format.

Synchronisation to continuous media content

Many applications might need to make use of information specifying synchronisation mappings for the media content. For instance, an application for the browsing of archival video may present descriptive information on each shot in the video as the video is being played and that shot information may be retrieved from a database elsewhere on the network (as done by the FRANK system, Simpson-Young et al 1996).

Synchronisation information of this type is specified using mappings from time offsets to resource locations (ie URLs). Again, this information could be specified in the CMD format and we can in fact extend the map entity. A generic solution is to add the concept of an event to the map syntax. An example could be:

        <map name="map3">
                <time start="00:12:32.02" end="00:12:35.03" event="enter">
                        <href>http://shotlists.cmis.csiro.au/abc/four_corners/shotget?00:12:32.02
                </time>
                <time start="00:13:34.04" end="00:13:36.02" event="enter">
                        <href>http://shotlists.cmis.csiro.au/abc/four_corners/shotget?00:13:34.04
                </time>
        </map>
In this case, the link is actioned when the specified range is entered (eg by the playing position reaching that point or by a jump to that position).
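
A sketch of how a client might action such enter events during playback (in Python; the data structures and fetch function are hypothetical, and a real client would drive this from its playback loop):

        def check_enter_events(links, prev_pos, pos, fetch):
            """Fire the href of any time range entered between two successive
            playback positions (frame numbers). links is a list of
            (start, end, href) tuples taken from event="enter" entries."""
            for start, end, href in links:
                inside_now = start <= pos <= end
                was_inside = start <= prev_pos <= end
                if inside_now and not was_inside:
                    fetch(href)  # eg retrieve and display the shot description

This would be called on every position update; a jump into a range counts as entering it, since only the previous and current positions matter.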

Continuous media referencing

To enable the effective integration of applications using video and audio on an intranet or Internet, it is important that there be an agreed way to specify references to video and audio content and to specify qualifying information such as positions within the video or audio content. This is necessary for use in simple linking and hypermedia navigation and the obvious way to do this is with a URL.

The qualifying information that must be able to be specified in a URL that refers to video or audio content includes:

- the position within the content at which the referenced segment starts;
- the position at which the referenced segment ends;
- alternatively, the name of a clip defined in the associated continuous media description.

A convention on the syntax and semantics of the qualifying information in the URL is important. For instance, if a user is viewing a video with a video client and wants to save a reference to a specific position within the video (eg to send in an email to a colleague), it is important that the client be able to save the reference in a format that can be read by other video clients. If a client cannot detect and understand information specifying a location within the video, then the type of infrastructure we are hoping for cannot be achieved.

Specifying the base reference

One way to refer to the base video content is to specify a URL that would conform to the scheme of the streaming protocol that will be used to view the video. For instance, if a video is available on a server and can be accessed using the RTSP protocol, the URL for the video might be something like
        rtsp://videos.cmis.csiro.au/abc/four_corners/1234
However, as has often been pointed out (see eg [HREF19]), the use of URLs for specifying resource references has its problems; in this case, the problem is the assumption of a specific streaming protocol. Embedding the protocol in this way means that the reference cannot easily be used independently of the actual service that currently provides the resource. For example, we may want to make a reference to the video itself without assuming that it will definitely be retrieved using RTSP, and we might want a client to be able to get more information about the stream even if it does not support RTSP.

An alternative approach is to use the HTTP scheme, very much the lingua franca of the Web for resource referencing, and to use a URL that refers to the actual streaming protocol only at the very last step of the process, when the client is to commence media viewing. This is also the approach suggested for initiating a session with RTSP and is the common approach used to pass control from a browser to a client-side media-playing application. In the latter case, a small bootstrap file of a specific MIME type is used which initiates the running of the helper application and commences downloading or playing of the real remote data.
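
For example, a client might first retrieve a small CMD descriptor over HTTP and only then open the stream using the protocol the descriptor names (the MIME type shown is an invented placeholder; the descriptor lines are from the earlier example):

        GET /abc/four_corners/97393.cmd HTTP/1.0

        HTTP/1.0 200 OK
        Content-Type: application/x-cmd

        <title>Bishop's Move
        <server>mediaserver.cmis.csiro.au:8888
        <location>/abc/four_corners/97393.mpg
        <protocol>UDP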

Specifying the qualifying information

The syntax of URLs conforming to a particular URL scheme (eg http, ftp, file) is specific to that scheme but, as defined in the Internet Standard for Relative Uniform Resource Locators (RFC 1808), there is a generic syntax that can be characterised as follows (see the RFC for the full syntax):

<scheme>://<net_loc>/<path>;<params>?<query>#<fragment>

Each component, except the scheme name itself, may be absent from a particular URL or may even be disallowed by a specific scheme. The # component is not actually part of the URL as it is interpreted by the client only and is not sent as part of the request to the server.

There are three main candidates for a way of representing qualifying information in a video/audio URL: using the fragment component, using the query component and using the end of the path component (note: the params component is not used in the HTTP scheme). Using the end of the path component does not provide any advantages over using the query component and will not be considered further. We will examine each of the other two approaches using the example of a reference to a segment of a video that starts at a specific timecode and ends at another specific timecode.

Using the fragment component

The approach of using the fragment component for specifying information that qualifies the resource being referenced initially seems appropriate as the URL is intended to refer to a particular fragment of the full resource. Using a fragment identifier, a URL would look something like:

http://video_info.cmis.csiro.au/abc/four_corners/1234.cmd#start=00:23:14.03&end=00:25:01.12

As already pointed out, the fragment identifier of a URL is processed by the client only and is not sent to an HTTP server as part of the request. In this case, it could make sense for the parsing and handling of this offset information to be done by the client only. It should be noted that the common use of fragment identifiers as references to hypertext anchors within HTML documents is specific to particular uses (such as browsing) of a particular media type (HTML) and is not a characteristic of the HTTP URL scheme. The actual syntax and semantics of the fragment identifier can be defined for a particular use of a particular media type so that, as in this case, applications that support the continuous media description format would support this use of the fragment identifier. The RFC 1808 syntax allows the fragment identifier to contain characters such as "=" and "&".

The disadvantages with this approach are that:

i) In most Web browsers, a client-side plug-in, helper or applet may not have access to the full URL. Most browsers would remove the fragment, make the specified GET request, and pass the response to the plug-in, helper or applet that handles the particular media type (in this case CMD). Thus, it is not possible for anyone other than a browser implementer to make use of a reference such as this. This is a practical argument against this approach rather than a principled one.

ii) The fragment component of the URL is never sent to the HTTP server as part of the request, yet it may be useful to send the additional attribute/value pairs to the server for range-checking and other functions.

Using the query component

The approach of using the query component for specifying the qualifying information overcomes the practical problems of the fragment approach. This is the usual way of specifying attribute/value pairs in a URL and is the usual way in which attribute/value pairs are sent to CGI scripts or other back-end processing programs running behind HTTP servers. Using this approach, the URL would be something like:

http://video_info.cmis.csiro.au/abc/four_corners/1234.cmd?start=00:23:14.03&end=00:25:01.12

Using this approach, the offset information is, unlike in the previous approach, sent to the server. Although this may not always be necessary, as the offset information may be interpreted only on the client side, it avoids the problems of the previous approach.

If URLs using the query approach are to be used between applications as references to positions within videos, there must be a common syntax and semantics for the attribute/value pairs that form all or part of the query component of the URL. This is a little unconventional: although the practice of using attribute/value pairs within the query component of a URL is common, the semantics of those pairs are normally specific to the particular base URL being addressed (usually a CGI script) and have no well-defined meaning for any other base URL. In this case, instead of having semantics specific to a base URL, we are proposing a convention for the query component of URLs that refer to a specific media type (ie continuous media description data). So, although this approach overcomes some practical problems of the fragment approach, it has a disadvantage in the nature of the convention it requires.
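
A sketch of how a program behind the HTTP server might recover and check the qualifying information (in Python; the parameter names follow the convention proposed in the next section):

        import re
        from urllib.parse import parse_qs

        TIMECODE = re.compile(r"^\d+:\d\d:\d\d\.\d\d$")

        def parse_offsets(query):
            """Extract start/end timecodes from a query component such as
            "start=00:23:14.03&end=00:25:01.12"."""
            params = parse_qs(query)
            start = params.get("start", [None])[0]
            end = params.get("end", [None])[0]
            for value in (start, end):
                if value is not None and not TIMECODE.match(value):
                    raise ValueError("malformed timecode: " + value)
            return start, end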

Recommended approach

The approach we propose for now is the use of the query component, for practical reasons. If the disadvantages discussed for the fragment component approach are resolved (eg by the evolution of different conventions in Web browsers), the fragment approach would be preferable. For the sort of environment that this paper is discussing, there need to be conventions on the parameters specified in the query component. An extensive definition of the parameters is outside the scope of this paper, but they should include such parameters as:

- start: the position at which the referenced segment begins;
- end: the position at which the referenced segment ends;
- clip: the name of a clip defined in the continuous media description.

It should be noted that the approach of having conventions for a URL differs from that of the current draft of RTSP, which does not give any well-defined meaning to fragment and query identifiers, leaving their interpretation to the RTSP server (see RTSP section 3.2). Such an approach is not appropriate for the use of URLs as we are presenting them here, where offset and similar information in URLs is to be used and understood by different applications (as described earlier). Our URLs may, of course, still be resolved to RTSP URLs at the time of use.

Further work

In the previous section, we have shown glimpses of the type of format that could be adopted to address some of the issues that we have identified for an open continuous media environment. As we have shown, such a common format could be used to represent general information about the media required for processing or displaying it, hypermedia mappings for the media, and synchronisation points within the media. One of the keys to such a format being useful for our purposes is that it provide the right balance between simplicity and utility. Whether a format such as this is appropriate can only be determined by implementation and experimentation with a wide variety of applications such as those identified early in the paper. The next steps are to specify the CMD format further and to implement various applications that can make use of the information it provides. When we have done this, we hope to have demonstrated that it provides a useful addition to the Web architecture.

Conclusion

In this paper we have presented the concept of an open continuous media environment which would provide an enabling architecture for emerging video and audio handling applications following in the footsteps of the traditional Web architecture. For such an environment to emerge, there are several areas in which standards and conventions must be developed and adopted. We have discussed some of the issues that must be addressed and given indications of the types of solutions that are required. We have also indicated the type of data format that might be able to address some of these issues.

We believe that standardisation in the areas we have identified will provide a strong foundation for a large range of new and innovative video and audio handling applications.


Acknowledgements

The authors wish to acknowledge that this work was carried out within the Cooperative Research Centre for Research Data Networks established under the Australian Government's Cooperative Research Centre (CRC) Program and acknowledge the support of the Advanced Computational Systems CRC under which the work described in this paper is administered.

References

M Brown, J Foote, G Jones, K Sparck Jones and S Young (1996) "Open-Vocabulary Speech Indexing for Voice and Video Mail Retrieval", The Fourth ACM International Multimedia Conference, Boston MA, November 1996, pp. 307-316.

N Freed, N Borenstein (1996) "Multipurpose Internet Mail Extensions (MIME) Part One: Format of Internet Message Bodies", RFC 2045, November 1996. ftp://munnari.oz.au/rfc/rfc2045.Z

CF Goldfarb (1990) "The SGML Handbook", Oxford University Press.

RV Guha (1997) "Meta Content Framework : A Whitepaper", http://mcf.research.apple.com/wp.html

M Handley, V Jacobson (1997) "SDP: Session Description Protocol", Internet Engineering Task Force Internet-Draft 26 March 1997. ftp://munnari.oz.au/internet-drafts/draft-ietf-mmusic-sdp-03.txt.Z

D Hardy (1996) "Resource Description Messages (RDM)", W3C NOTE 24-Jul-96, http://www.w3.org/pub/WWW/TR/NOTE-rdm.html

International Organisation for Standardisation (1992) "ISO/IEC 10744 Hypermedia/Time-based Structuring Language: HyTime".

T Krauskopf, J Miller, P Resnick and W Treese (1996), "PICS Label Distribution Label Syntax and Communication Protocols Version 1.1", W3C Recommendation 31-October-96, http://www.w3.org/pub/WWW/TR/REC-PICS-labels-961031.html

V Kumar (1995) "MBone: Interactive Multimedia On The Internet", Macmillan Publishing, November 1995.

C Lagoze, C Lynch and R Daniel (1996) "The Warwick Framework: a container architecture for aggregating metadata objects", http://cs-tr.cs.cornell.edu:80/Dienst/UI/2.0/Describe/ncstrl.cornell%2fTR96-1593

A Rao, R Lanphier (1996) "Real Time Streaming Protocol (RTSP)", Internet Engineering Task Force Internet-Draft 26 November 1996. ftp://munnari.oz.au/internet-drafts/draft-ietf-mmusic-rtsp-00.txt.Z

AM Rutkowski (1994) "Today's Cooperative Competitive Standards Environment For Open Information and Telecommunication Networks and the Internet Standards-Making Model", NIST Standards Development and Information Infrastructure Workshop, June 1994 (http://www.isoc.org/papers/standards/amr-on-standards.html).

H Schulzrinne, S Casner, R Frederick and V Jacobson (1996) "RTP: A Transport Protocol for Real-Time Applications", RFC 1889, January 1996. ftp://munnari.oz.au/rfc/rfc1889.Z

H Schulzrinne, A Rao, R Lanphier (1997) "Real Time Streaming Protocol (RTSP)", Internet Engineering Task Force Internet-Draft 27 March 1997. ftp://munnari.oz.au/internet-drafts/draft-ietf-mmusic-rtsp-02.txt.Z

B Simpson-Young and K Yap (1996) "FRANK: Trialing a system for remote navigation of film archives", SPIE International Symposium on Voice, Video and Data Communications, Boston MA, November 1996.

WR Stevens (1994) "TCP/IP Illustrated, Volume 1 The Protocols", Addison-Wesley.

K Yap, B Simpson-Young and U Srinivasan (1996) "Enhancing Video Navigation with Existing Alternate Representations", First International Workshop on Image Databases and Multi Media Search, Amsterdam, August 1996.


Hypertext References

HREF1
http://www.syd.dit.csiro.au/staff/bill - Bill Simpson-Young's Home Page.
HREF2
http://www.syd.dit.csiro.au/staff/ken - Ken Yap's Home Page.
HREF3
ftp://munnari.oz.au/rfc/rfc1889.Z - RFC 1889 "RTP: A Transport Protocol for Real-Time Applications"
HREF4
ftp://munnari.oz.au/internet-drafts/draft-ietf-mmusic-rtsp-02.txt.Z - "Real Time Streaming Protocol (RTSP)" (March 1997 version)
HREF5
ftp://munnari.oz.au/rfc/rfc2045.Z - RFC 2045 "Multipurpose Internet Mail Extensions (MIME) Part One: Format of Internet Message Bodies"
HREF6
http://www.davic.org/ - DAVIC Home Page
HREF7
http://www.isoc.org/papers/standards/amr-on-standards.html - "Today's Cooperative Competitive Standards Environment For Open Information and Telecommunication Networks and the Internet Standards-Making Model"
HREF8
ftp://munnari.oz.au/internet-drafts/draft-ietf-mmusic-rtsp-00.txt.Z - "Real Time Streaming Protocol (RTSP)" (November 1996 version)
HREF9
http://www.w3.org/ - World-Wide Web Consortium Home Page
HREF10
http://www.w3.org/pub/WWW/AudioVideo/Activity-new - W3C Activity: Real Time Multimedia
HREF11
http://www.w3.org/pub/WWW/TR/NOTE-rdm.html - Resource Description Messages (RDM)
HREF12
http://cs-tr.cs.cornell.edu:80/Dienst/UI/2.0/Describe/ncstrl.cornell%2fTR96-1593 - The Warwick Framework: a container architecture for aggregating metadata objects
HREF13
http://www.w3.org/pub/WWW/TR/REC-PICS-labels-961031.html - "PICS Label Distribution Label Syntax and Communication Protocols Version 1.1"
HREF14
http://mcf.research.apple.com/wp.html - Meta Content Framework : A Whitepaper
HREF15
ftp://munnari.oz.au/internet-drafts/draft-ietf-mmusic-sdp-03.txt.Z - "SDP: Session Description Protocol"
HREF16
http://www.cs.columbia.edu/~hgs/rtsp/sdf.html - Working note on session description
HREF17
http://www.w3.org/MarkUp/SGML/Activity - XML (Generic SGML over the Web)
HREF18
http://www.avid.com/omf - Open Media Framework Home Page
HREF19
http://www.ansa.co.uk/ANSA/ISF/decoupling.html - "Decoupling the URL Scheme from the Transport Protocol"

Copyright

Bill Simpson-Young, Ken Yap ©, 1997. The authors assign to Southern Cross University and other educational and non-profit institutions a non-exclusive licence to use this document for personal use and in courses of instruction provided that the article is used in full and this copyright statement is reproduced. The authors also grant a non-exclusive licence to Southern Cross University to publish this document in full on the World Wide Web and on CD-ROM and in printed form with the conference papers, and for the document to be published on mirrors on the World Wide Web. Any other usage is prohibited without the express permission of the authors.

