Bill Simpson-Young and Ken Yap, CSIRO Mathematical and Information Sciences and Advanced Computational Systems Cooperative Research Centre, Building E6B Macquarie University Campus, Locked Bag 17, North Ryde NSW 2113, Australia. Phone: +61 2 9325 3155 Fax: +61 2 9325 3101 bill.simpson-young@cmis.csiro.au Home Page [HREF1] ken.yap@cmis.csiro.au Home Page [HREF2]
World Wide Web, Continuous media, Media description formats
In this paper, we identify and discuss the need for an open approach to continuous media, and we identify the issues that must be addressed if the continuous media architecture is to follow in the footsteps of the traditional Web architecture and pave the way for a large range of applications yet to be invented.
Once we have shown the need for such an approach, we address the second goal of identifying some of the issues that should be addressed on the road to such an open architecture for continuous media. We travel only sufficiently far along this road to indicate the sort of issues that we believe exist and the sort of ways in which these issues can be addressed.
We have chosen to limit the scope of the paper in several regards. Firstly, we emphasise stored media over live media such as conferencing. This means that we are only covering a portion of the types of applications and scenarios possible in a network with effective continuous media support, but it does allow us to limit our discussion to a scope that can adequately be addressed in this paper. Also, we are interested here in the application layer only and will not be addressing, for instance, issues at the transport layer. Just as a discussion of the Web architecture could choose whether or not to include a discussion of TCP (Stevens 1994), we have chosen not to discuss transport-level issues for continuous media such as those addressed by RTP, the Real-time Transport Protocol (Schulzrinne et al 1996 [HREF3]).
In an intranet environment with effective support for video applications, there could be widespread use of video as an organisational memory, with videos of meetings, events, seminars, etc. being readily accessible. For this purpose, videos would be stored in a central or distributed video repository accessible from each workstation. Minutes of meetings would contain timecodes that can be used to facilitate access to the corresponding location of the video (even though the minutes may not represent the actual sequence of events in a linear way). Staff could include references to specific locations or ranges within the video in email or documents, and these references could be used by those reading the document. While viewing a video, a user would be able to make a request for all references or annotations that apply to the current position in the video. To aid in searching the vast quantities of video and audio content, there would need to be navigation facilities (eg automatically extracted key frames corresponding to video events such as a change of slides or audio events such as a change of speakers) (Yap et al 1996), video and audio summarisation facilities and content-based search facilities (eg searching for specific shapes within the video or specific words within the audio) (Brown et al 1996).
Such an environment would also support frequent use of just-in-time training using networked training videos. This type of application would require similar functionality to that described above as would other applications such as video news access, video briefing systems, etc.
From an examination of these sorts of uses, it is possible to identify some general functionality that would be shared by many of these applications. For instance, users should be able to:
- access stored video and audio content from any workstation;
- create references to specific positions or ranges within a video and include them in email and documents;
- follow such references, and retrieve the references and annotations that apply at the current playing position; and
- navigate and search video and audio content, eg using automatically extracted key frames or content-based search.
This is an indication of the sort of functionality possible in a networked video environment and is only a small sample of the many possibilities. The technology for all of this functionality is available now - the key to its successful deployment is to implement it in an open manner with applications interfacing effectively to each other using appropriate standards and conventions.
Continuous media have been used in one way or another on the Internet and intranets for almost as long as the Web architecture has existed. Initially, of course, video and audio data was treated just like any other media type: downloaded in full and then played by a helper application that supported the particular media type. Since 1994, numerous applications have become available that support the streaming of audio and video data across the network as it is being played; these usually use proprietary control and transport protocols, and often proprietary media formats.
Many of these applications provide impressive quality media over low bandwidth Internet connections and functionality above and beyond the basic play/stop/random-access media controls. For instance, RealMedia (formerly RealAudio) from Progressive Networks supports video image maps which provide hypermedia hotspots in video, and media synchronisation files that are used to retrieve and display content when specific positions within the video or audio content are played.
In the standardisation area, work is continuing on RTSP, the Real Time Streaming Protocol (Schulzrinne et al 1997[HREF4]), which is at present an Internet Draft by Henning Schulzrinne from Columbia University (author of RTP), Anup Rao from Netscape Communications and Rob Lanphier of Progressive Networks. RTSP is an application-level protocol that provides a mechanism for establishing a session with a live or stored media source using any of a number of different types of delivery channel (such as UDP, Multicast IP, TCP and RTP-based mechanisms) and for controlling the media using methods such as PLAY, PAUSE, etc.
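As a rough sketch of the flavour of the protocol (the draft is still evolving, so the exact methods and headers may differ from what is shown here), a client might start playback of a stored video with an exchange such as the following, where the server name is taken from the examples later in this paper and the session identifier is hypothetical:

PLAY rtsp://mediaserver.cmis.csiro.au/abc/four_corners/97393.mpg RTSP/1.0
Session: 1234

RTSP/1.0 200 OK
Session: 1234

A PAUSE request would follow the same pattern.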
The prospect of an Internet or intranet environment with a wide range of applications supporting RTSP is exciting and not very far off. In such an environment, it will not be necessary to match the media client to the media server providing the service. You will be able to choose your client software on its virtues rather than being required to use a specific product. However, is agreement on a media streaming protocol sufficient to provide the kind of environment we are talking about? The environment made possible by such agreement can be compared to the situation of free-to-air television, where there are a large number of competing brands of television set ("TV clients"), each of which can be used to view the same channels. In this case, the major functional differences between the sets are limited to their size, picture quality and cost. Will this be the extent of the effect of RTSP, or will it provide, for continuous media, the same kind of open architecture that the traditional Web architecture provides for non-continuous media?
In order to answer this question, we need to first look at the technical reasons for the traditional Web architecture being so successful, not only in addressing the specific application area of distributed hypertext, but in providing an enabling architecture within which a large number of applications have emerged. We do this in the next section.
In looking at the success of the Web architecture, it is important to distinguish the success of the architecture from the success of the World-Wide Web itself. One could argue that the architecture was not important, that it was just the content that mattered, and that the Web architecture as we know it just happened to be in the right place at the right time. It is evident, however, just how successful the architecture has been when we see the enormously widespread use of intranets for many aspects of workplace information systems such as querying databases and filling in workplace forms. In such cases, the architecture is not being used for access to worldwide information sources but is replacing existing local information systems. It clearly has characteristics that make it a very useful architecture.
Some of the important characteristics of the Web architecture include the following.
- The formats and protocols are simple and text-based: human-readable, easy to generate by hand, easy for programs to generate and easy for programs to parse and process.
- Any resource, whatever its type, can be referred to in a uniform way using a URL.
- The formats are extensible, with conventions (such as HTML applications ignoring tags that they do not support) that allow existing applications to cope gracefully with extended content.
The protocol was also developed to be independent of the actual media type, media format and encoding type of the data for which it was used. The use of MIME (Freed et al 1996 [HREF5]) to label data again made the architecture more flexible. This, combined with dynamic document generation, also led to the interesting use of the Web for services that automatically generated images on the fly such as for map navigation and graphical display of dynamic data such as share prices.
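To make the role of the MIME label concrete, here is a minimal sketch (in Python; the page content is hypothetical) of a server-side script that generates such a document on the fly and labels it so that any client can decide how to handle it:

import sys
import time

# Generate a page on the fly and label it with a MIME type so that
# any client, not just a particular one, can handle it appropriately.
body = "<html><body>Share price at %s: $4.56</body></html>\n" % time.ctime()
sys.stdout.write("Content-Type: text/html\r\n")        # the MIME label
sys.stdout.write("Content-Length: %d\r\n\r\n" % len(body))
sys.stdout.write(body)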
By way of comparison, an example of the type of standard that we believe is not appropriate for the Internet world is the DSM-CC standard used by the DAVIC Consortium [HREF6] in the area of interactive television. The part of the standard dealing with the application level (the user-user interface) provides support for media control (start, stop, jump, etc), service connection (attach, detach, etc) and directory operations (eg open a list of movies, get the service specification for movie X). The standard has embedded within it very strong assumptions about the exact applications that will use it. For example, the directory operations are very much based on the video-on-demand idea that there is a list of movies that the user chooses from, and do not seem to be extensible to all the other ways in which video playback might be initiated. This might be appropriate for the interactive television world but does not provide the open, extensible environment that the Internet thrives on.
In this section, we have shown how specific attributes of the Web standards have allowed the Web architecture to be an enabling architecture and to facilitate the evolution of a very broad range of applications using that architecture. We believe that it is important - now that audio and video processing applications are becoming available over the Web - that the architecture used for such applications also be an enabling architecture that opens up a whole range of possible applications, rather than producing a situation where each application is an island of functionality.
In the next section, we will discuss whether the emerging approaches to continuous media are going to follow this tradition.
Although RTSP is very likely to be widely adopted, are there other standards and conventions that should exist for continuous media, in order that the architecture for continuous media be enabling in the same way as the traditional Web architecture? The rest of this paper identifies and discusses several issues that need to be addressed for there to be an open continuous media environment of this kind.
The areas that we believe would benefit from a common approach include:
- a format for describing continuous media content;
- the specification of structural and temporal relationships between segments of media content;
- the specification of links within continuous media (video maps);
- the synchronisation of events with media playback; and
- conventions for referring to continuous media, and to positions and ranges within it, in URLs.
The information that must be able to be represented in a continuous media description includes information required for accessing and displaying the media content as well as information for performing other processing on it. The information includes:
- the title and other descriptive information;
- the location of the content (server and path);
- the media type, format and encoding;
- the transport protocol and delivery requirements such as bandwidth and frame rate;
- the start and end points of the content, or of clips within it;
- pointers to further information such as annotations; and
- structural, linking and synchronisation information.
Some of the information is clearly metadata and could be specified using a metadata approach such as Resource Descriptions (Hardy 1996 [HREF11]), the Warwick Framework (Lagoze et al 1996 [HREF12]), PICS (Krauskopf et al 1996 [HREF13]) or Meta Content Framework (Guha 1997 [HREF14]). However, as will become apparent, much of the information is not metadata but is intended to specify structures and relationships that would be inappropriate to specify using such an approach. For a media description format, the information that is metadata would be specified using one of the metadata approaches, but this is not dealt with in this paper.
SDP was developed for describing sessions on the MBONE (Kumar 1995), the IP multicast backbone on the Internet, and has been widely used for this purpose in its present form. If it is to be used with RTSP, there are some modifications that would need to be made to the protocol, as described in a working note on the matter [HREF16]. The information contained in an SDP session description includes such things as media information (including type of media, media format, transport protocol, etc) and timing information. SDP uses single-character attributes to specify the information, as in the following example given in the Internet-Draft:
v=0
o=mhandley 2890844526 2890842807 IN IP4 126.16.64.4
s=SDP Seminar
i=A Seminar on the session description protocol
u=http://www.cs.ucl.ac.uk/staff/M.Handley/sdp.03.ps
e=mjh@isi.edu (Mark Handley)
c=IN IP4 224.2.17.12/127
t=2873397496 2873404696
a=recvonly
m=audio 3456 RTP/AVP 0
m=video 2232 RTP/AVP 31
m=whiteboard 32416 udp wb
a=orient:portrait
SDP provides a mechanism for extensibility that involves using the a attribute to specify additional attribute/value pairs, as in a=orient:landscape, which might be specified for a shared whiteboard application.
Although it can be used for stored media, SDP was initially designed for live media and much of the information it can express reflects that background. Although it is extensible, the mechanism that the format provides would make the amount of extension necessary here awkward. We also believe that, although it has been designed as a text-based format that is simple to generate and parse, it is not particularly human-readable, and it is critical that a format for the type of use we envisage be easily human-readable.
SDF, on the other hand, is more easily human-readable and also allows nesting. However, it uses a LISP-like syntax with parenthesised lists of Horn clauses and this can quickly get confusing if there is a lot of nesting. Also, this is very early work and it is not yet clear what extensibility mechanisms this will support. If the format is made to be extensible and the format is widely adopted as a session description format in conjunction with RTSP, it could be an appropriate format to use as a base for the information we want to represent. In the meantime, however, we will present an alternate proposal for the continuous media description format.
Our proposal, which we call the Continuous Media Description (CMD) format, is a text format based on SGML (Goldfarb 1990). Extensibility is handled by a convention that applications ignore entities of a type that they do not support. Note that this is different from the convention for HTML, whereby applications ignore only the tags that they don't support rather than the full entity enclosed by those tags.
Here is a sample record showing the way in which basic information could be specified:
<title>Bishop's Move
<server>mediaserver.cmis.csiro.au:8888
<location>/abc/four_corners/97393.mpg
<type>video/mpeg;type=system,version=1
<protocol>UDP
<rate>1.5 Mb/s
<fps>25
<start>0:00:01.20
<end>0:19:26.04
<info>http://video_info.cmis.csiro.au/abc/four_corners/annotations/97393
The media type is a MIME type. Start and end times are specified as hours, minutes, seconds and frame number. The frame rate determines the acceptable range of values for the frame number. While this is less compact than an absolute frame number, this notation is more human-readable as it gives a rough idea of the times involved. Software can of course convert the times to any internal representation convenient for programmers.
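Since the frame rate is given in the description, such a conversion is straightforward. The following sketch (in Python; the function names are ours) converts the notation to an absolute frame number and back:

# Convert the human-readable time notation (hours:minutes:seconds.frame)
# to an absolute frame number, given the frame rate from the <fps> entity.
def to_frames(timecode, fps):
    hms, frame = timecode.rsplit(".", 1)
    h, m, s = [int(x) for x in hms.split(":")]
    return ((h * 60 + m) * 60 + s) * fps + int(frame)

def to_timecode(frames, fps):
    seconds, frame = divmod(frames, fps)
    minutes, s = divmod(seconds, 60)
    h, m = divmod(minutes, 60)
    return "%d:%02d:%02d.%02d" % (h, m, s, frame)

print(to_frames("0:19:26.04", 25))    # 29154, the <end> of the sample record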
The descriptors can be created manually and stored on a server or, for large scale use, the descriptors will be generated automatically from metadata databases. When they are generated on-the-fly, there can also be modifications made to override information. For example, if a time offset is specified in a URL (see discussion later), the CMD data sent to the client will have been modified to take this into account. In such cases, it is desirable to retain the full original information and so it is necessary to use an inheritance mechanism wherein a descriptor record is modified by selective changes to some of the attributes. Thus the format must support nesting of records to allow an inheritance tree to be created.
The example shows only attribute-value pairs and the value of using SGML (instead of a list of "attribute=value" lines similar to SDP) has not been made clear. We will, however, build on this base in future sections and the advantages of using SGML will become apparent.
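As a first illustration of processing the format, the following sketch (in Python; not a full SGML parser, and the names are ours) applies the extensibility convention described above: an entity of an unsupported type is skipped in full, nested content included. For simplicity it assumes an unknown entity is either wholly on one line or closed by an explicit end tag.

import re

# Entities this application understands, taken from the examples in this
# paper; anything else is skipped whole, unlike HTML where only the
# unknown tags themselves (not their content) are ignored.
KNOWN = {"clip", "sequence", "title", "server", "location", "type",
         "protocol", "rate", "fps", "start", "end", "info"}
TAG = re.compile(r"<(/?)([A-Za-z]+)([^>]*)>")

def parse_cmd(text):
    kept, skipping, depth = [], None, 0
    for line in text.splitlines():
        m = TAG.match(line.strip())
        if not m:
            continue
        closing, name, attrs = m.group(1), m.group(2), m.group(3)
        if skipping:                        # inside an unknown entity
            if name == skipping:
                depth += -1 if closing else 1
                if depth == 0:
                    skipping = None
            continue
        if name in KNOWN:
            value = line.strip()[m.end():]
            kept.append((name, bool(closing), attrs.strip(), value))
        elif not closing and not line.strip().endswith("</%s>" % name):
            skipping, depth = name, 1       # skip the whole unknown entity
    return kept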
We can use the CMD format already introduced to define structural information. The clip entity is used to define a clip of media content but the presence of this entity by itself in a CMD stream does not indicate any action that should be taken (such as opening and displaying the stream). On the other hand, the sequence entity defines a sequence of clips that are to be played in the order given. In this way, "virtual" videos can be constructed out of various clips around the network. Other directives would be used to specify concurrent playing of clips or alternate choices of clips (eg available from different servers).
An example of CMD data indicating a simple media sequence is:
<clip name="video1">
<title>Bishop's Move
<server>mediaserver.cmis.csiro.au:8888
<location>/abc/four_corners/97393.mpg
<type>video/mpeg;type=system,version=1
<protocol>UDP
</clip>
<sequence>
<clip name="clip1" source="video1">
<start>00:23:24.04
<end>00:24:25.02
</clip>
<clip name="clip2" source="video1">
<start>01:12:24.14
<end>01:13:44.24
</clip>
</sequence>
The CMD data associated with a video could describe the full structure of the content at a high level for use by people and applications. For example, a video of a meeting could be described as a sequence of named clips, one for each item on the agenda, each defined by a start and end offset. References could be made to a clip using the name of the clip, as described in the section on references to continuous media.
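An application might flatten such a sequence into a playlist as in the following sketch (in Python; the data shapes and names are ours, mirroring the example above):

# Each clip inherits server and location from the record named by its
# "source" attribute and contributes its own start/end range.
records = {"video1": {"server": "mediaserver.cmis.csiro.au:8888",
                      "location": "/abc/four_corners/97393.mpg"}}
sequence = [{"source": "video1", "start": "00:23:24.04", "end": "00:24:25.02"},
            {"source": "video1", "start": "01:12:24.14", "end": "01:13:44.24"}]

def expand(sequence, records):
    playlist = []
    for clip in sequence:
        base = records[clip["source"]]      # inherit from the named record
        playlist.append((base["server"], base["location"],
                         clip["start"], clip["end"]))
    return playlist

for entry in expand(sequence, records):
    print(entry)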
For specifying the time relationships, it might seem better to adopt an existing approach for specifying media relationships, such as the Open Media Framework [HREF18], which has been developed by Avid and is used widely within the film production industry for exchanging information between different production tools. Another approach might be to adopt a full multimedia scripting language such as Macromedia's Lingo. However, in both cases these formats have been developed with different purposes in mind, and neither of them fulfils all of the criteria we set out earlier in the paper. A better approach for specifying the time relationships might be to use a DTD that makes use of HyTime constructs (International Organisation for Standardisation 1992) in a very limited fashion (to retain simplicity), and this is something we intend to examine further.
The alternative approach is to have a common interchange format for the specification of this linking information. Such a format must fulfil the usual requirements of being human-readable, easy to generate by hand, easy for programs to generate, and easy for programs to parse and process.
There are two types of links that need to be able to be specified:
- links that apply for a temporal range of the media content, specified by start and end times; and
- links that apply to a spatial area within the frame for a temporal range.
A map with time-only links could be specified as follows:
<map name="map1">
<time start="00:12:32.02" end="00:12:35.03">
<href>http://video_info.cmis.csiro.au/info.html
</time>
<time start="00:13:34.04" end="00:13:36.02">
<href>http://video_info.cmis.csiro.au/overview.html
</time>
</map>
This might be used, for instance, for displaying shotlist information associated with archival footage where timecoded shotlist information exists but where there is no need for more fine-grained links within frames. In this case, the CMD stream specifying the map can be generated on-the-fly from the shot database and, when the user clicks on a frame of the video, the specified URL is requested, resulting in the information being extracted from the shot database on-the-fly.
When area entities are also used to specify links, area entities can be nested inside time entities or vice versa. For example, a video containing a sequence of shots, each of which has objects in set positions, could have a map such as:
<map name="map2">
<time start="00:12:32.02" end="00:12:35.03">
<area shape=rect coords="0,0,118,28">
<href>http://video_info.cmis.csiro.au/info.html
<area shape=rect coords="184,0,276,28">
<href>http://video_info.cmis.csiro.au/overview.html
</time>
</map>
On the other hand, time entities can be nested inside area entities. This could be used, for example, for a video of a meeting made with a single stationary camera, where the presented slides occupy a consistent region of the image throughout the video but the link to be associated with that position is different for each slide (ie each of the time ranges). Of course, it is also useful to be able to have many levels of nesting, with either of the entity types at each level.
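The following sketch (in Python; the tree shape and function names are ours) shows how a client might resolve a user click at time t and point (x, y) against such a nested map, using map2 above as data:

# Nodes are (kind, params, children) tuples; an href sits on the
# innermost matching node. Fixed-width timecodes compare correctly
# as strings, so no conversion to frame numbers is needed here.
map2 = [
    ("time", {"start": "00:12:32.02", "end": "00:12:35.03"}, [
        ("area", {"coords": (0, 0, 118, 28),
                  "href": "http://video_info.cmis.csiro.au/info.html"}, []),
        ("area", {"coords": (184, 0, 276, 28),
                  "href": "http://video_info.cmis.csiro.au/overview.html"}, []),
    ]),
]

def matches(kind, params, t, x, y):
    if kind == "time":
        return params["start"] <= t <= params["end"]
    x0, y0, x1, y1 = params["coords"]
    return x0 <= x <= x1 and y0 <= y <= y1

def resolve(nodes, t, x, y):
    for kind, params, children in nodes:
        if matches(kind, params, t, x, y):
            return resolve(children, t, x, y) or params.get("href")
    return None

print(resolve(map2, "00:12:33.00", 200, 10))   # the overview.html link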
It is important to note that although we are proposing that this information be specified in the same format, it is not necessary that the video map information be contained in the same data stream as the descriptive information. For instance, an initial video descriptor in CMD format may be retrieved and this may point to a URL containing a video map in CMD format using the "info" attribute.
Synchronisation information of this type is specified using mappings from time offsets to resource locations (ie URLs). Again, this information could be specified in the CMD format and we can in fact extend the map entity. A generic solution is to add the concept of an event to the map syntax. An example could be:
<map name="map3">
<time start="00:12:32.02" end="00:12:35.03" event="enter">
<href>http://shotlists.cmis.csiro.au/abc/four_corners/shotget?00:12:32.02
</time>
<time start="00:13:34.04" end="00:13:36.02" event="enter">
<href>http://shotlists.cmis.csiro.au/abc/four_corners/shotget?00:13:34.04
</time>
</map>
In this case, the link is actioned when the specified range is entered (eg by the playing position reaching that point or by a jump to that position).
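A sketch of the corresponding client-side behaviour (in Python; the entry shape and names are ours, mirroring map3 above):

# Whenever the playing position moves, by playback or by a jump, any
# range newly entered has its URL actioned (eg fetched and displayed).
entries = [("00:12:32.02", "00:12:35.03", "enter",
            "http://shotlists.cmis.csiro.au/abc/four_corners/shotget?00:12:32.02")]

def on_position_change(entries, old, new, action):
    for start, end, event, href in entries:
        if event == "enter" and start <= new <= end and not (start <= old <= end):
            action(href)

on_position_change(entries, "00:12:30.00", "00:12:33.00", print)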
The qualifying information that must be able to be specified in a URL that refers to video or audio content includes:
- the position within the media at which playback is to start;
- the position at which playback is to end; and
- possibly the name of a clip or other structural element defined in the associated media description.
A convention on the syntax and semantics of the qualifying information in the URL is important. For instance, if a user is viewing a video with a video client and wants to save a reference to a specific position within the video (eg to send in an email to a colleague), it is important that the client be able to save the reference in a format that can be read by other video clients. If a client cannot detect and understand information specifying a location within the video, then the type of infrastructure we are hoping for cannot be achieved.
The most direct approach is to refer to the video with a URL that uses the RTSP scheme, such as:
rtsp://videos.cmis.csiro.au/abc/four_corners/1234
However, as has often been pointed out (see eg [HREF19]), the use of URLs for specifying resource references has its problems: in this case, the problem is the assumption of the use of a specific streaming protocol. Embedding the protocol in this way means that the reference can't easily be used independently of the actual service that currently provides the resource. For example, we may want to make a reference to the video itself without having to assume that it will definitely be retrieved using RTSP, and we might want a client to be able to get more information about the stream even if it doesn't support RTSP.
An alternative approach is to use the HTTP scheme, very much the lingua franca of the Web for resource referencing, and to use a URL that refers to the actual streaming protocol only at the very last step of the process, when the client is to commence media viewing. This is also the approach suggested for initiating a session with RTSP and is the common approach used to pass control to a client-side media-playing application from a browser. In the latter case, a small bootstrap file of a specific MIME type is used which initiates the running of the helper application and commences downloading or playing of the real remote data.
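For example (the CMD MIME type shown here is hypothetical), a browser might fetch a small descriptor over HTTP and hand it to a media-playing helper, which only then opens the stream itself:

GET /abc/four_corners/1234.cmd HTTP/1.0

HTTP/1.0 200 OK
Content-Type: application/x-cmd

<clip name="video1">
<server>mediaserver.cmis.csiro.au:8888
<location>/abc/four_corners/97393.mpg
<protocol>UDP
</clip>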
RFC 1808 defines the general form of a URL as:
<scheme>://<net_loc>/<path>;<params>?<query>#<fragment>
Each component, except the scheme name itself, may be absent from a particular URL or may even be disallowed by a specific scheme. The fragment component, introduced by the "#" character, is processed by the client and is not sent to the server as part of an HTTP request.
There are three main candidates for a way of representing qualifying information in a video/audio URL: using the fragment component, using the query component and using the end of the path component (note: the params component is not used in the HTTP scheme). Using the end of the path component does not provide any advantages over using the query component and will not be considered further. We will examine each of the other two approaches using the example of a reference to a segment of a video that starts at a specific timecode and ends at another specific timecode.
http://video_info.cmis.csiro.au/abc/four_corners/1234.cmd#start=00:23:14.03&end=00:25:01.12
As already pointed out, the fragment identifier of a URL is processed by the client only and is not sent to an HTTP server as part of the request. In this case, it could make sense for the parsing and handling of this offset information to be done by the client alone. It should be noted that the common use of fragment identifiers as references to hypertext anchors within an HTML document is specific to particular uses (such as browsing) of a particular media type (HTML) and is not a characteristic of the HTTP URL scheme. The actual syntax and semantics of the fragment identifier can be defined for a particular use of a particular media type so that, as in this case, applications that support the continuous media description format being used would support this use of the fragment identifier. The RFC 1808 syntax allows the fragment identifier to contain characters such as "=" and "&".
The disadvantages with this approach are that:
i) In most Web browsers, a client-side plug-in, helper or applet may not have access to the full URL. Most browsers would remove the fragment, make the specified GET request, and pass the response to the plug-in, helper or applet that handles the particular media type (in this case CMD). Thus, it is not possible for anyone other than a browser implementer to make use of a reference such as this. This is a practical argument against this approach rather than a principled one.
ii) The fragment component of the URL is not sent to the HTTP server as part of the request, yet it may be useful to send the additional attribute/value pairs to the server for range-checking and other functions.
http://video_info.cmis.csiro.au/abc/four_corners/1234.cmd?start=00:23:14.03&end=00:25:01.12
Using this approach, unlike the previous one, the offset information is sent to the server. Although this may not always be necessary, as the offset information may be interpreted only on the client side, it avoids the problems of the previous approach.
If URLs using the query approach are to be used as references to positions within videos between applications, there must be a common syntax and semantics for the attribute/value pairs that form all or part of the query component of the URL. This is a little unconventional: although the practice of using attribute/value pairs within the query component of a URL is common, the semantics of the attribute/value pairs are normally specific to the particular base URL (usually a CGI script) and have no well-defined meaning for any other base URL. In this case, instead of having semantics specific to a base URL, we are proposing a convention for the query component of URLs that refer to a specific media type (ie continuous media description data). So, although this approach overcomes some practical problems of the fragment approach, it has a disadvantage in the nature of the convention it requires.
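Given such a convention, a client could extract the qualifiers with nothing more than standard URL parsing, as in the following sketch (in Python; urlparse and parse_qs are standard library functions, while the start/end convention is the one proposed here, not an existing standard):

from urllib.parse import urlparse, parse_qs

url = ("http://video_info.cmis.csiro.au/abc/four_corners/1234.cmd"
       "?start=00:23:14.03&end=00:25:01.12")
query = parse_qs(urlparse(url).query)
start, end = query["start"][0], query["end"][0]
print(start, end)    # 00:23:14.03 00:25:01.12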
It should be noted that the approach of having conventions for a URL differs from that of the current draft of RTSP, which does not give any well-defined meaning to fragment and query identifiers, leaving the interpretation to the RTSP server (see RTSP section 3.2). Such an approach is not appropriate for the use of URLs as we are presenting them here, where information about offsets etc is to be used and understood by different applications (as described earlier). Our URLs may, of course, still be resolved to RTSP URLs at the time of use.
We believe that standardisation in the areas we have identified will provide a strong foundation for a large range of new and innovative video and audio handling applications.
M Brown, J Foote, G Jones, K Sparck Jones and S Young (1996) "Open-Vocabulary Speech Indexing for Voice and Video Mail Retrieval", The Fourth ACM International Multimedia Conference, Boston MA, November 1996, pp. 307-316.
N Freed, N Borenstein (1996) "Multipurpose Internet Mail Extensions (MIME) Part One: Format of Internet Message Bodies", RFC 2045, November 1996. ftp://munnari.oz.au/rfc/rfc2045.Z
CF Goldfarb (1990) "The SGML Handbook", Oxford University Press.
RV Guha (1997) "Meta Content Framework : A Whitepaper", http://mcf.research.apple.com/wp.html
M Handley, V Jacobson (1997) "SDP: Session Description Protocol", Internet Engineering Task Force Internet-Draft 26 March 1997. ftp://munnari.oz.au/internet-drafts/draft-ietf-mmusic-sdp-03.txt.Z
D Hardy (1996) "Resource Description Messages (RDM)", W3C NOTE 24-Jul-96, http://www.w3.org/pub/WWW/TR/NOTE-rdm.html
International Organisation for Standardisation (1992) "ISO/IEC 10744 Hypermedia/Time-based Structuring Language: HyTime".
T Krauskopf, J Miller, P Resnick and W Treese (1996), "PICS Label Distribution Label Syntax and Communication Protocols Version 1.1", W3C Recommendation 31-October-96, http://www.w3.org/pub/WWW/TR/REC-PICS-labels-961031.html
V Kumar (1995) "MBone: Interactive Multimedia On The Internet", Macmillan Publishing, November 1995.
C Lagoze, C Lynch and R Daniel (1996) "The Warwick Framework: a container architecture for aggregating metadata objects", http://cs-tr.cs.cornell.edu:80/Dienst/UI/2.0/Describe/ncstrl.cornell%2fTR96-1593
A Rao, R Lanphier (1996) "Real Time Streaming Protocol (RTSP)", Internet Engineering Task Force Internet-Draft 26 November 1996. ftp://munnari.oz.au/internet-drafts/draft-ietf-mmusic-rtsp-00.txt.Z
AM Rutkowski (1994) "Today's Cooperative Competitive Standards Environment For Open Information and Telecommunication Networks and the Internet Standards-Making Model", NIST Standards Development and Information Infrastructure Workshop, June 1994 (http://www.isoc.org/papers/standards/amr-on-standards.html).
H Schulzrinne, S Casner, R Frederick and V Jacobson (1996) "RTP: A Transport Protocol for Real-Time Applications", RFC 1889, January 1996. ftp://munnari.oz.au/rfc/rfc1889.Z
H Schulzrinne, A Rao, R Lanphier (1997) "Real Time Streaming Protocol (RTSP)", Internet Engineering Task Force Internet-Draft 27 March 1997. ftp://munnari.oz.au/internet-drafts/draft-ietf-mmusic-rtsp-02.txt.Z
B Simpson-Young and K Yap (1996) "FRANK: Trialing a system for remote navigation of film archives", SPIE International Symposium on Voice, Video and Data Communications, Boston MA, November 1996.
WR Stevens (1994) "TCP/IP Illustrated, Volume 1: The Protocols", Addison-Wesley.
K Yap, B Simpson-Young and U Srinivasan (1996) "Enhancing Video Navigation with Existing Alternate Representations", First International Workshop on Image Databases and Multi Media Search, Amsterdam, August 1996.
Bill Simpson-Young, Ken Yap ©, 1997. The authors assign to Southern Cross University and other educational and non-profit institutions a non-exclusive licence to use this document for personal use and in courses of instruction provided that the article is used in full and this copyright statement is reproduced. The authors also grant a non-exclusive licence to Southern Cross University to publish this document in full on the World Wide Web and on CD-ROM and in printed form with the conference papers, and for the document to be published on mirrors on the World Wide Web. Any other usage is prohibited without the express permission of the authors.