It is becoming harder to access, manage, and transmit multimedia content according to the meaning it embodies. As text-based search engines give way to content- and context-aware engines, which personalize not only searching and delivery but also the format of the content, advanced network infrastructures are emerging that are capable of end-to-end ubiquitous transmission of multimedia content to any device (fixed or mobile), on any network (wired or wireless), at any time. This has opened up new markets for content and service providers who, recognizing the value of individual users, are investing in technologies that adapt and personalize content. In response, standards bodies have released new standards, such as MPEG-7, MPEG-21, and VC-1, which support content adaptation and personalization. Consequently, a broad range of applications is emerging across many industry sectors, including music, film, games, television, and sports, to name but a few.

Personalizing and adapting content to the preferences and needs of users requires processing of the content, on the one hand, and recognizing patterns in users’ behavior, on the other. The former involves stages such as extraction and analysis of content semantics and structure, modeling of the resulting content metadata, filtering of the content metadata through user profiles, and adaptation of the content to suit the usage environment (i.e., the user, the client, the network, and the natural environment) or of the usage environment to suit the content. The latter requires construction of user models that record usage history and user preferences for content types, browsers, and interface modalities, in order to tailor content to these preferences and to predict future usage behavior without stereotyping the user too much.
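To make these stages concrete, the following is a minimal sketch in Python of such a pipeline: content metadata is filtered through a user profile and the surviving items are adapted to the usage environment. All names (ContentItem, UsageEnvironment, filter_content, adapt) and the adaptation rule are illustrative assumptions, not part of any standard or of the papers in this issue.

```python
from dataclasses import dataclass

@dataclass
class ContentItem:
    """Content metadata produced by semantic analysis (hypothetical structure)."""
    title: str
    concepts: set    # e.g. {"football", "goal"}
    formats: set     # renditions available, e.g. {"hd", "mobile"}

@dataclass
class UsageEnvironment:
    """The user, client device, and network side of the usage environment."""
    preferred_concepts: set   # from the user model / usage history
    device_formats: set       # formats the client can render
    bandwidth_kbps: int

def filter_content(items, env):
    """Filter content metadata through the user profile."""
    return [i for i in items if i.concepts & env.preferred_concepts]

def adapt(item, env):
    """Pick a rendition to suit the usage environment (a crude stand-in rule)."""
    usable = item.formats & env.device_formats
    if not usable:
        return "transcode"   # nothing fits: adapt the content to the environment
    if "hd" in usable and env.bandwidth_kbps > 4000:
        return "hd"
    return "mobile" if "mobile" in usable else next(iter(usable))

env = UsageEnvironment({"football"}, {"hd", "mobile"}, 2500)
catalogue = [ContentItem("Match highlights", {"football", "goal"}, {"hd", "mobile"}),
             ContentItem("Cookery show", {"cooking"}, {"hd"})]
for item in filter_content(catalogue, env):
    print(item.title, "->", adapt(item, env))   # -> Match highlights -> mobile
```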

Personalizing and adapting the semantic content of multimedia enables applications to make just-in-time intelligent decisions regarding this content, which, in turn, makes interaction with the multimedia content an individual and individually rewarding experience.

The Semantic Media Adaptation and Personalization (SMAP) Initiative was founded in the summer of 2006 to bring together researchers and practitioners working in this area to discuss the state of the art, recent advances, and future outlooks for semantic media adaptation. The First International Workshop on Semantic Media Adaptation and Personalization (SMAP 2006), held in December 2006 in Athens, Greece, was the initiative’s first meeting. It outgrew all initial expectations and had to be extended from a one-day to a two-day event. The Second International Workshop on Semantic Media Adaptation and Personalization (SMAP 2007), held in December 2007 in London, UK, experienced similar growth: the event was extended to three days and gained a doctoral consortium. As a result of the overwhelming and continuing interest in and support for the first two SMAP events, SMAP has become an annual event, with the third workshop taking place in December 2008 in Prague, Czech Republic.

This special issue comprises extended versions of five papers which were originally presented at SMAP 2007 and which have successfully made it through at least three additional rounds of review. The selection process was particularly tough because a large number of high-quality contributions presented at SMAP 2007 were under consideration. We made no effort to select papers on matching topics, but rather chose papers that are representative of the work presented at the workshop and that promote understanding of the wider problems and issues pursued by researchers and practitioners working in the field. While some papers of truly high quality had to be omitted from this special issue, they and other deserving papers are being published in an edited volume by CRC Press entitled Semantic Media Adaptation and Personalisation, Volume 2.

Spyrou et al. of the National Technical University of Athens in Greece propose a video analyzer which extracts colour and texture features from coarse regions of each frame in order to detect the concepts depicted, and then constructs a visual thesaurus which comprises clusters of these regions carrying local information. A model vector constructed for each frame over the thesaurus is then used to select key frames by means of Latent Semantic Analysis.
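As a rough illustration of this kind of pipeline (and not the authors’ implementation), the sketch below clusters synthetic region descriptors into a visual thesaurus with k-means, builds a per-frame model vector as a histogram over the clusters, and applies LSA via truncated SVD to rank candidate key frames; all parameters and data are made up.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
# Stand-in colour/texture descriptors for coarse regions, grouped per frame:
# 30 frames, 16 regions per frame, 8-dimensional descriptors.
frames = [rng.random((16, 8)) for _ in range(30)]

# 1. Visual thesaurus: cluster all region descriptors into region types.
thesaurus = KMeans(n_clusters=10, n_init=10, random_state=0).fit(np.vstack(frames))

# 2. Model vector: per-frame histogram over the region types.
def model_vector(regions: np.ndarray) -> np.ndarray:
    labels = thesaurus.predict(regions)
    return np.bincount(labels, minlength=thesaurus.n_clusters) / len(labels)

X = np.array([model_vector(f) for f in frames])

# 3. Latent Semantic Analysis over the frame-by-region-type matrix;
#    frames far from the mean in the latent space are candidate key frames.
latent = TruncatedSVD(n_components=3, random_state=0).fit_transform(X)
scores = np.linalg.norm(latent - latent.mean(axis=0), axis=1)
key_frames = np.argsort(scores)[-5:]
print("candidate key frames:", sorted(key_frames.tolist()))
```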

Agius and Angelides of Brunel University in the UK argue that the range of content-relevant preferences that may be expressed using the MPEG-7 user interaction tools is very limited compared with the much wider range of metadata that may be represented using the MPEG-7 content tools, while, at the same time, the user preference and history metadata within the MPEG-7 user interaction tools can greatly complement these specific content preferences. Consequently, they propose a set of bridges over this “content-user gap”, supported by isomorphic user and content metadata.

López-Nores et al. of the University of Vigo in Spain argue that hosting personalization engines on dedicated servers is not useful in broadcasting because it is impossible for the hosting server to know its broadcast users. In contrast, the authors propose hosting the semantic reasoning process in the DTV receivers themselves, where content is collated according to user stereotypes.

Kaiser and Hausenblas of Joanneum Research and Umgeher of Graz University of Technology in Austria argue that most web tools for authoring and managing interactive, non-linear audio–visual content and its metadata according to user preferences assume a degree of human intervention that renders them non-dynamic. In response, the authors propose a web system which dynamically and continuously updates its content and metadata with very limited human intervention, according to domain-specific user preferences.

Asteriadis et al. of the National Technical University of Athens in Greece argue that collecting user feedback indirectly in e-learning environments, for example by asking the users, may not be as reliable as once thought. In response, they propose a mechanism which collects user feedback directly by observing the user’s head, eye, and hand movements through a web camera.