1 Introduction

In recent years, social media have transformed from simple digital platforms for informal conversations to massive networks that shape modern society (Akram and Kumar 2017). They now encapsulate a broad spectrum of daily life, ranging from communication and commerce to politics and entertainment, serving as a valuable data source providing insights into how people perceive and engage with the world.

The versatility of social media data makes it valuable for analyzing and gaining insights into a wide array of vastly different domains, including but not limited to tourism, public policy, or healthcare. For example, in the tourism sector, stakeholders are increasingly leveraging social media data for diverse purposes (Hvass and Munar 2012). These purposes range from analyzing frequently practiced leisure activities and their correlations with climatic or temporal factors, to understanding typical tourist routes and gauging levels of satisfaction. This data can assist in tasks such as the creation of tailored tour packages or the identification of tourist areas needing refinement. In the domain of public policy, social media data helps to contextualize public sentiment, enabling government agencies to make informed decisions (Charalabidis and Loukis 2012) to improve public services, enhance land use planning or address societal issues. Lastly, in healthcare (Smailhodzic et al. 2016), social media provide a platform for patients to share their experiences and for healthcare providers to disseminate information and monitor health trends, ultimately contributing to a better understanding of medical interventions and patient outcomes.

Analysis of social media data often involves the calculation of indicators (Neiger et al. 2012), which are metrics that aim to measure or evaluate the state or level of a particular aspect of interest (e.g., the affluence associated with a given place, the level of friendship between users, etc.). Typically, indicators are defined based on the domain of application. Indicators are usually related to heterogeneous social media entities. These entities may be directly associated with the social media platform, such as individual users, groups of users, posts, etc., but they can also be informational entities extracted from the content or metadata of social media posts, including cities, themes, events, persons, organizations, etc. A prevalent category of indicators are similarity measures. These are quantitative indicators used to assess the degree of resemblance or closeness between several entities. For instance, in the context of social media, a similarity measure might compare user profiles based on shared features (e.g., similar age, same gender, nearby home location, same language) to suggest potential connections with other users or content recommendations (Mazhari et al. 2015).

Similarity measures are crucial in various domains for many applications, including content recommendation (Jiang and Yang 2017), targeted advertising (Zhang et al. 2019), understanding social dynamics, and event detection (Becker et al. 2010). In the context of content recommendation, similarity measures are essential because they allow the recommender system to identify and suggest content that is most relevant and appealing to the user. We will give two examples. (1) In collaborative filtering on a social media platform, user-based collaborative filtering calculates the similarity between users based on their engagement history, such as likes and shares. If User A and User B have liked many of the same posts, the recommender system infers that they could have similar tastes and recommends to User A the posts that User B has liked but User A has not yet seen. (2) Conversely, in item-based collaborative filtering, the similarity between content items is assessed, recommending posts to a user based on their engagement with similar posts (Zangerle and Bauer 2022). This helps in personalizing the content feed, kee** the user engaged by presenting them with content that aligns closely with their interests.

Table 1 The 5 Dimensions of Proxemics, as defined by Greenberg et al. (2011)

When interpreted correctly, they can provide insights into user connectivity and preferences, the nature of interactions, and common patterns found online.

However, determining what similarity means in the context of social media can be challenging. Different platforms, users, and objectives lead to multifaceted interpretations of similarity (Anderson et al. 2012). Consequently, there is a pressing need for universal similarity measures capable of serving as meaningful indicators in an extensive array of domains, using data from social media. These measures should not be domain-specific but instead, offer versatility and modularity to adequately satisfy the diverse requirements of end users.

In response to these challenges, this paper introduces ProxMetrics, a modular and generic toolkit designed to assess the similarity between multidimensional social media entities (that encompass users, groups, places, times, and themes). Proxemics is the study of human use of space and the effects of distances on behavior, communication, and social interaction (Hall 1966). Traditionally applied to physical environments, we hypothesize that adapting the 5 dimensions of proxemics (DILMO), namely Distance (D), Identity (I), Location (L), Movement (M) and Orientation (O) (Greenberg et al. 2011) to the digital world of social media could provide modular and scalable dimensions to build personalized indicators to evaluate similarity between multidimensional social media entities in a domain-independent and versatile manner. We define proxemic similarity on social media as the perceived relational closeness or association between entities on social media (e.g., users, groups, places, periods, themes), based on the nature, frequency, and depth of their interactions or mentions within social media platforms. Various factors can be considered when evaluating proxemic similarity, including analysis of individual posts, user feeds, engagement levels, sentiment analysis, or even profile information. The ProxMetrics toolkit has been designed around the 5 dimensions of proxemics with both modularity and genericity in mind (see Fig. 1). It can be used with any social media platforms (source genericity) along any domain that can be described semantically (thematic genericity) without requiring extensive modifications or ad hoc solutions. Additionally, a user-friendly dashboard (ProxViewer) is provided to visualize the results; it is aimed at non-computer scientists, such as stakeholders in various domains.

Fig. 1
figure 1

Overview of the ProxMetrics toolkit

This paper is structured as follows: We begin by reviewing existing work related to proxemics and similarity measures (Sect. 2). Subsequently, we introduce ProxMetrics, a modular toolkit designed for universal similarity measurement in social media, grounded in the theory of Proxemics and its foundational data model (Sects. 3 and 4). We then proceed to experiment it and evaluate its performance based on requirements in the tourism domain (Sect. 5). We conclude by discussing future prospects (Sect. 6).

2 Related work

To design the ProxMetric toolkit, we conducted an evaluation of existing work in 3 key areas: (1) the use of proxemics in various contexts (Sect. 2.1), (2) the commonly used measures for assessing similarity (Sect. 2.2), and (3) the current approaches to combining measures in order to obtain a composite one (Sect. 2.3).

2.1 The proxemics theory

Table 2 Overview of Existing Research Works Using Proxemics for Practical Applications

Before going further, it is important to define the concept of proxemics, as it is a core element of our project. Proxemics was introduced in the seminal work of the American anthropologist Edward T. Hall (Hall 1966). He defines proxemics as “the science that studies the organization of space and the effect of distance on interpersonal relations”. Hall studied physical distance and the way it affects and regulates interactions between people. He then went further and linked the concept of distance to proxemic zones (Hall 1966). There are 4 core proxemic zones: (1) the intimate zone (0 to 0.45m) which is mainly used for close physical contact, (2) the personal zone (0.45 to 1.2m) for interactions with very close people such as family or friends, (3) the social zone (1.2 to 3.6m) for regular conversations with strangers, and finally (4) the public zone (more than 3.6m) which is used when speaking to an audience or gathering. It is crucial to note that cultural, social, and physical factors can affect the definition of proxemic zones. In 2006, Greenberg et al. extended Hall’s definition of proxemics to introduce the notion of proxemic dimensions (Greenberg et al. 2011) (also referred to as DILMO dimensions). They identified 5 dimensions that can be used to express proxemics (Distance, Identity, Location, Movement and Orientation, DILMO). These dimensions are presented in Table 1.

Proxemics can be studied at several levels: (1) the individual level (how and why does an individual express specific traits and cognitive or affective states through their proxemic behavior?) and (2) the group level (how does the behavior of individuals affect the group?) (McCall 2015). A crucial concept in proxemics is known as centrality. Proxemics requires the selection of a reference, central entity (e.g., a specific individual, an object, etc.) that will serve as the focal point for behavioral observations and analyses. For example, in an urban planning context, a central landmark building can be chosen as the reference entity in a city. Urban planners might study the behavior and movement patterns of individuals in and around this central entity to gain insight into how people use the space, how they interact with each other, and how other entities (like nearby businesses) relate to it.

When it comes to existing usage, (refer to Table 2) proxemics is used primarily to analyze physical interactions using tangible, physical metrics, such as for robot navigation (Rios-Martinez et al. 2015) and classroom behavior analysis (Castañer et al. 2013). The concept has also been extended to other applications like picture annotation (Yang et al. 2012) and assessing the impact of social distancing during COVID-19 (Mehta 2020). Recent works have tried adapting proxemics to virtual spaces but typically still relied on physical metrics, such as in video games or Virtual Reality (VR) worlds (Llobera et al. 2010). Recently, the concept of digital proxemics has emerged, focusing on non-physical interactions in virtual spaces, such as in the analysis of cybercrimes (Gunawan et al. 2021) or for middleware reconfiguration (Luxey 2019).

We hypothesize that by adapting proxemics to social media and redefining the 5 DILMO dimensions, we can establish a foundation for a generic and modular approach to measuring similarity in social media. This approach could then be used by domain stakeholders to easily build custom indicators to analyze behaviors on social media around their domain. Several factors motivate this choice:

  • Flexibility (versatility) proxemics is versatile and can be adapted to various requirements. Its five dimensions are broad and can be used to model many use cases. From Table 2, it can be observed that proxemics has been extensively used to model user-oriented similarity measures for a wide range of objectives, including physical metrics like human-drone interactions, behavior analysis in classrooms, picture annotation, social distancing during the COVID-19 pandemic, and other metrics like middleware reconfiguration and cybercrime analysis. These diverse applications highlight the versatility of the proxemics theory in addressing various requirements.

  • Domain-Agnostic it has no strong correlation to a specific domain; it is a very domain-unaware theory. Table 2 shows that existing work leveraging proxemics spans vastly different domains of application: Virtual Reality, Health, Video Games, Robotics, Education, Software Engineering, Digital Forensics, and more. This demonstrates the domain-agnostic nature of proxemics.

  • Fitness for Social Media As shown previously, proxemics can be applied to spaces of different natures (e.g., physical space, VR space, social media space) with interacting entities (e.g., real people, video game characters, social media users). As social media platforms can be conceptualized as spaces where various entities interact and maintain distances between each other, it is possible to hypothesize that proxemics could be applied to social media and that many aspects of proxemics could be naturally linked with social media concepts. For example, distance can refer to the distance between social media users, posts, or entities contained in posts such as hashtags, place mentions, etc. Additionally, location can refer to the community in which a social media user is positioned in terms of their interests. Orientation might signify a sentimental orientation toward certain topics, and identity can characterize a social media user by attributes like age, language, etc.

  • Tangible Dimensions Proxemic dimensions were originally designed around the physical world. They are practical and tangible, making them easier to understand and manipulate, even for non-computer scientist users. These dimensions can therefore serve as abstractions for more complex concepts. For example, the distance between individuals can be used to represent their level of interaction or social engagement. By leveraging these tangible dimensions, we aim to create similarity measures that simplify the understanding of more abstract or complex phenomena.

Let’s now examine existing similarity measures that could be used to assess similarity between entities’ proxemic dimensions.

2.2 Existing similarity measures

Social media platforms have become a focal point for the research and application of various algorithms (Anderson et al. 2012), especially with regard to trend analysis (Bhor et al. 2018), content recommendation (Jiang and Yang 2017), event detection (Huang et al. 2021), and ad targeting (Knoll 2016). One of the key components for these tasks is the ability to assess similarity between entities, whether these items are words, posts, users, or media. We have chosen to divide existing similarity measures into 4 core families (traditional, series-based, graph-based and deep learning).

Table 3 provides a side-by-side comparison of existing similarity measures aligned with the various types of social media data contained in the APs Trajectory Model (refer to Fig. 2 for details).

Table 3 Comparison of the Applicability of Existing Similarity Measures for Social Media Data

We consider that a user’s attributes can be either categorical (e.g., gender, occupation, etc.) or numerical (e.g., age, height, etc.).

We have specified with the sets keyword when the similarity is calculated between unordered sets of entities (rather than individual ones), while the seq keyword indicates ordered sequences. The ✓ symbol indicates that the measure is fully compatible and fits the given data type, requiring little to no modification. The ✳ symbol indicates that the measure could be adapted but would need substantial redesign or modification. The absence of a symbol suggests that the measure is not fit or that adapting the measure would require extensive work.

The underlined measures are those we selected to adapt in the ProxMetrics toolkit. We hypothesized that we could rely on these measures as a foundation for the ProxMetrics toolkit and adapt them to assess similarity based on proxemic dimensions. These similarity measures are widely used, cover most of the data types we are dealing with, are less complex than graph or series-based ones, and do not necessitate time consuming or costly training dataset.

These include the Haversine (Nguyen et al. 2017; Baucom et al. 2013) and Euclidean (Johansson et al. 2013) distances to measure the straight-line distance between two coordinates, the first taking into account the curvature of the Earth, which is essential for dealing with spatial coordinates such as geocoded toponyms or post geotags. The Jaccard Index assesses the similarity of two sets of data, and is applicable for evaluating social media trends like hashtag usage (Zangerle et al. 2013). Lastly, the Wu-Palmer similarity determines the closeness of semantic concepts in a taxonomy, it calculates similarity by considering the elements’ depths in the taxonomy and the depth of their Least Common Subsumer (LCS), making it particularly useful for semantic similarity measures (Wu and Palmer 1994).

We will now examine how similarity metrics can be combined to create composite ones, which is crucial in our use case, given that we consider five distinct proxemic dimensions (DILMO).

2.3 Main techniques for criteria combination

The diversity of social media data often requires a nuanced approach to measuring similarity. A single similarity measure may not adequately address the multifaceted nature of the data. Therefore, the combination of various similarity measures can provide a more robust and comprehensive understanding. This section outlines strategies for combining multiple measures to improve indicators’ accuracy in various domains.

In healthcare, criteria such as patient history, symptom severity, and laboratory test results are often combined. While weighted means are common, other methods like decision trees or Bayesian networks might be used, depending on the complexity and nature of the data. For example, Khan et al. (2008) used fuzzy decision trees to combine various biological indicators for disease prediction.

Financial institutions might use weighted means, logistic regression, or machine learning models to combine criteria like credit history, current debts, and income levels for risk assessment. The approach chosen often depends on the need for interpretability compared to predictive accuracy. For example, Bolton et al. (2010) explored the application of logistic regression in credit scoring models.

In education, combining criteria such as student engagement, performance metrics, and feedback can involve weighted means, cluster analysis, or even neural networks. The choice depends on the educational context and the specific goals of the analysis. For instance, Ng et al. (2016) employed cluster analysis to categorize student motivation and learning behaviors based on multiple metrics.

Lastly, environmental scientists often rely on spatial analysis techniques to combine criteria like pollutant levels, biodiversity indices, and land use patterns. The method chosen often reflects the scale and complexity of the environmental data. A study by Lu et al. (2015) leveraged spatial analysis to assess the impacts of various environmental factors on ecosystem health.

Fig. 2
figure 2

APs Trajectory Model - Designed around the 5 dimensions of proxemics (DILMO)

In the ProxMetrics toolkit, we plan to first experiment the weighted mean. This decision is driven by three main factors. Firstly, as demonstrated above, various domains employ distinctly different methods for combining indicators. Therefore, to ensure domain independence, it is crucial for us to adopt a combination technique that can be flexibly adapted to each domain’s specific requirements and that is not too specialized. Secondly, we aim to integrate the expertise of domain stakeholders into the measure. Stakeholders should have the ability to adjust the impact of each dimension to tailor the indicators to their unique requirements. Thirdly, we require a combination technique that is easily interpretable by users who are not computer scientists. A weighted mean, in contrast to more complex combination methods such as machine learning models, provides a clear and understandable rationale for the assigned weights of different measures.

We will now explain: (1) how we adapted traditional physical proxemics to social media data into a generic data model, and (2) how this model and the similarity measures we have selected are used in the ProxMetrics toolkit.

3 From physical to social media proxemics

The ProxMetrics toolkit rely on a common data model. This model adheres to the five dimensions of proxemics: Distance, Identity, Location, Movement, and Orientation (DILMO dimensions) (Greenberg et al. 2011). However, we have adapted these dimensions to model entities and interactions in social media rather than in physical spaces. The process behind this adaptation is detailed in Masson et al. (2023b).

The model is designed to be compatible with any type of social media data as long as it conforms to this model, thus ensuring its applicability across various domains and social media platforms. Figure 2 presents the UML class diagram of the model, which is instantiated step-by-step throughout the framework’s pipeline.

The Identity (I) dimension allows for the modeling of the studied population: individual users (along with their profile attributes) or demographics (user groups) featuring common characteristics or traits.

The Movement (M) dimension provides the ordered sequence of posts belonging to a given user. It gives a comprehensive view of a user’s activities on the chosen social media and allows linking posts together to create multidimensional trajectories. Additionally, it can be broken down into several sub-trajectories (spatial, thematic, spatio-thematic, tempo-sentimental, etc.).

The Location (L) dimension models the posts themselves along with their associated locations. We move away from the solely physical definition of location and consider 3 types of locations. These can be (1) spatial (places based on toponyms extracted from posts or geotag metadata), (2) temporal (temporal entities or timestamps) or (3) thematic (domain concepts aligned with a semantic resource). Thematic locations are defined according to the studied domain’s description (domain specific ontology, thesaurus or dictionary). These semantic resources provide additional hierarchy information. When it comes to spatial locations, they are associated with a unique identifier linked to a spatial database. This allows for featuring relationships (e.g., a city is within a region, itself within a country). A given post can be in several locations at the same time, making the model ubiquitous.

The Orientation (O) dimension contains contextual and enrichment data, such as sentiment of the associated post (positive, negative or neutral) and the engagement associated with it (based on the number of replies, likes and quotes). The classes for both locations and orientations are designed to be extensible, thus providing the flexibility to incorporate new classes as desired (e.g., when studying politics, one could imagine a political orientation dimension).

Lastly, the Distance (D) dimension helps in modeling and storing static distances between entities of the same type, specifically between (1) identities (user or group) and (2) locations (spatial, temporal or thematic). These static distances will be used to calculate proxemic similarity, we will detail this aspect later (see Sect. 4.6).

Let’s now explore the ProxMetrics toolkit, which relies on this model and novel redefinition of proxemics, allowing for the calculation of similarity indicators between various types of social media entities.

4 ProxMetrics: modular toolkit to evaluate proxemic similarity in social media

We introduce ProxMetrics, a modular toolkit designed to assess the proxemic similarity between multidimensional entities on social media based on the 5 dimensions of proxemics (DILMO). This toolkit is designed to be fully modular and generic, adaptable to any social media platform and flexible to accommodate a wide variety of user requirements.

4.1 Entity definitions

We define proxemic similarity on social media as the perceived relational closeness or association between entities on social media, based on the nature, frequency, and depth of their interactions or mentions within social media platforms. Various factors can be considered when evaluating proxemic similarity, including analysis of individual posts, user feeds, engagement levels, sentiment analysis, or even profile information. In this work, we consider the following categories of social media entities. Firstly, we have dynamic entities: users and groups. They actively interact and move within the landscape of social media, analogous to people in physical proxemics.

  • Users (). Individual social media users, whether physical (e.g., a person) or corporate, institutional ones (e.g., an institution, a company).

  • Groups or demographics (). Groups of social media users defined according to the domain of study. They can be based on shared traits (e.g., French users, influencers, foreign tourists, etc.).

Secondly, there are static entities: places, dates (or periods), and themes. These are extracted from users’ posts and, unlike dynamic entities, do not interact on their own. Instead, they appear in the user posts. These static entities are analogous to objects in physical proxemics.

  • Places or spatial entities (). Places mentioned on social media. Different levels of granularity are possible, such as points of interest, districts, cities, regions, countries, etc. Extracted from posts’ metadata (geotags) or from the content of the posts.

  • Themes or thematic entities (). Domain-specific thematic concepts mentioned on social media are assigned to a semantic resource (e.g., dictionary, thesaurus, ontology). Different levels of granularity, like levels within a thesaurus or ontology, are possible. Extracted from posts’ content.

  • Dates/Periods or temporal entities (). Dates or time periods associated with social media posts. Different levels of granularity are possible, including hour, day, week, month, year, season, day of the week, etc. Extracted from posts’ timestamps (metadata) or content.

4.2 Proxemic similarity design

In proxemics, the selection of a reference entity, or the center entity (\(E_{ref}\)), is essential. While the traditional physical context of proxemics mostly uses individuals or sensors as references (whose interactions are under observation), the landscape of social media provides a more diverse range of entities as potential references. This could be, for example, a specific user (u), a group (g), a place (s), a theme (th), or a date (t).

$$\begin{aligned} E_{ref} \in \{u, g, s, th, t\} \end{aligned}$$
(1)

Proxemic similarity (\(P_s\)) is measured between a chosen reference entity \(E_{ref}\) and a set of target entities denoted as \(\tau\). This relationship gives rise to a wide array of potential entity pairings (25 combinations), including user to users, user to groups, place to users, place to places, among others. To systematically categorize these pairings, we have identified 4 primary proxemic similarity patterns, as illustrated in Fig. 3.

In the center of the proxemic reticle is the reference entity \(E_{\text {ref}}\). Surrounding it is a set of target entities \(\tau\). We denote entities in this set as \(E_{\text {target}}\) (with \(E_{target} \in \tau\)). The visual distance between these and the reference entity represents their relative proxemic similarity. This distinction in pattern is necessary because, unlike in physical proxemics, determining the proxemic similarity between dynamic entities (users or groups issuing posts), static entities (informational entities found in posts), or a combination of both cannot be done in the same manner.

Fig. 3
figure 3

The 4 proxemic similarity patterns in the ProxMetrics toolkit

Depending on their level of proxemic similarity in regard to the reference entity \(E_{ref}\), target entities \(\tau\) might be categorized into different proxemic zones. These zones are demarcated by the light gray dashed lines in Fig. 3. The number and range of proxemic zones are determined based on the study domain and by the end users. For example, in the tourism domain, they might represent the degree of attraction a tourist feels towards nearby POIs (Points of Interest). Unlike in traditional physical proxemics (with intimate, personal, social, and public zones), there are no universal definitions of zones when it comes to social media.

4.3 Proxemic similarity definition

Fig. 4
figure 4

Overview of the ProxMetrics toolkit

The formula for proxemic similarity (\(P_s\)) is grounded in the 5 proxemic dimensions (Distance, Identity, Location, Movement, Orientation), as detailed in Fig. 4. It is therefore composite and modular; the user can modulate (amplify or reduce) the impact of certain dimensions depending on the requirements he wishes to address. This is done through 5 coefficients (here \(\alpha\), \(\beta\), \(\gamma\), \(\delta\), \(\omega\)).

Given a reference entity \(E_{\text {ref}}\) and a target entity \(E_{\text {target}}\), we introduce a composite measure \(P_s(E_{\text {ref}}, E_{\text {target}})\). This measure is an aggregation of 5 sub-measures, each corresponding to one of the proxemic dimensions (DILMO). These sub-measures quantify the similarity or dissimilarity between the 2 entities across each dimension, providing a comprehensive assessment of their proxemic relationship.

Proxemic similarity (\(P_s\)) is expressed as a numerical value ranging from 0 to 1. A value of 1 indicates a strong proxemic similarity, while a value of 0 points to a lack of similarity.

$$\begin{aligned}{} & {} P_s(E_{ref}, E_{target}) = \alpha D(E_{ref}, E_{target}) \nonumber \\{} & {} \quad +~\beta I(E_{ref}, E_{target}) +~\gamma L(E_{ref}, E_{target}) \nonumber \\{} & {} \quad +~\delta M(E_{ref}, E_{target}) +~\omega O(E_{ref}, E_{target})\nonumber \\{} & {} \quad \text {with} ~ \alpha + \beta + \gamma + \delta +~\omega = 1 \nonumber \\{} & {} \quad \text {and} ~ 0 \le P_s, D, I, L, M, O \le 1 \end{aligned}$$
(2)

We hypothesize that by blending and modulating these five dimensional measures, we can create domain-specific indicators. This approach aims to address business requirements in a manner that is generic, functioning across various social media platforms, domain-independent, applicable to different domains of application, and versatile, capable of accommodating a wide range of end user requirements. We will now explore how each component is calculated.

4.4 Distance similarity (D)

As previously mentioned, we opted to align with physical proxemics for the Distance \(D(E_{ref}, E_{target})\). For simplicity, we currently allow the distance dimension to be applied only when calculating proxemic similarity between two entities of the same kind (e.g., two users, two cities, two themes, two dates or time periods, etc.). Our approach allows us to leverage established and efficient similarity measures for this dimension.

For users and places, we leverage the Haversine distance (Nguyen et al. 2017; Baucom et al. 2013) (denoted as \(D_{physical}\)), which measures the straight-line distance between two points while accounting for the Earth’s curvature. This decision is grounded in the utility of physical metrics in social media contexts. For example, it helps determine whether two social media users are in close proximity or if two cities are in the same region. This physical metric could be useful in recommendation scenarios, where we want to recommend POIs (Points of Interest) to users that are physically close to them. In cases where real-time positioning of a user is not supported by the social media platform, we rely on their most recently recorded location.

For user groups, we calculate the centroid position of all members, providing an indication of the group’s predominant location.

For themes, we evaluate their semantic similarity (denoted as \(D_{semantic}\)) using the Wu-Palmer methodology (Wu and Palmer 1994), which determines similarity based on their least common subsumer (LCS). This method helps ascertain whether two themes are semantically close (e.g., Beach with Sea) or far (e.g., Beach with Museum).

For dates, we evaluate the interval (\(D_{interval}\)) in hours between them, and for periods, we reference the median date. This enables us to determine the temporal proximity of dates or periods, determining whether they occurred close or not.

Lastly, it is important to note that the Haversine distances and time intervals undergo normalization between 0 and 1 using min-max normalization. Normalization parameters need to be tweaked depending on the spatial area and time range covered by the social media corpus in use.

4.5 Identity similarity (I)

The Identity dimension \(I(E_{ref}, E_{target})\) uses profile features to calculate the similarity. Social media users possess various profile features such as age, gender, occupation, and more. These features are crucial for understanding behavior by emphasizing the unique characteristics of individual profiles. The primary goal of this dimension is to bridge demographic differences and to detect similar groups or users. If we refer back to the patterns illustrated in Fig. 3, here are examples of how this dimension is applied:

  • Pattern 1 \(E_{ref} = u, \forall E_{\text {target}} \in \tau , E_{\text {target}} = u\) Detection of users (u) with similar profiles based on their features (e.g., user connection based on their age, language, gender,etc.).

  • Pattern 2 \(E_{ref} = u, \forall E_{\text {target}} \in \tau , E_{\text {target}} = th\) Recommendation of places (p) or themes (th) to users (u) based on places visited or themes mentioned by similar users (based on their profiles features).

  • Pattern 3 \(E_{ref} = s, \forall E_{\text {target}} \in \tau , E_{\text {target}} = g\) Detection of which demographics (g) primarily visit a given city (p) or are active during a certain period of the day or year (t).

  • Pattern 4 \(E_{ref} = s, \forall E_{\text {target}} \in \tau , E_{\text {target}} = s\) Comparison of the demographics associated with themes (th), places (p), or time periods (t). For example, two places might want to identify how similar their visitors are.

Let’s begin by defining I between 2 individual users. Given 2 users \({E_{ref}}\) and \({E_{target}}\), the component \(I_{individual}(E_{ref}, E_{target})\) is defined as:

$$\begin{aligned}{} & {} I_{individual}(E_{ref}, E_{target}) \nonumber \\{} & {} \quad = \frac{\sum _{i=1}^{n} w_i \times s_{feature}(E_{ref}^{i}, E_{target}^{i})}{\sum _{i=1}^{n} w_i} \nonumber \\{} & {} \qquad \text {with} ~ E_{ref} \in u ~ \text {and} ~ E_{target} \in u \end{aligned}$$
(3)
  • \(n\) represents the number of attributes considered.

  • \(w_i\) represents the weight of the \(i^{th}\) attribute.

  • \(s_{feature}(E_{ref}^{i}, E_{target}^{i})\) denotes the similarity between the \(i^{th}\) attributes of users \(E_{ref}\) and \(E_{target}\) respectively.

For numerical attributes (e.g., age, height), we use the normalized Manhattan distance (Wang et al. 2016):

$$\begin{aligned} s_{feature}(E_{ref}^{i}, E_{target}^{i}) = \frac{|E_{ref}^{i} - E_{target}^{i}|}{\text {max}_{\text {attr}} - \text {min}_{\text {attr}}} \end{aligned}$$
(4)
  • \(\text {max}_{\text {attr}}\) and \(\text {min}_{\text {attr}}\) are the maximum and minimum possible values for the attribute respectively.

For categorical attributes (e.g., gender):

$$\begin{aligned} s_{feature}(E_{ref}^{i}, E_{target}^{i}) = {\left\{ \begin{array}{ll} 0 &{} \text {if } E_{ref}^{i} = E_{target}^{i} \\ 1 &{} \text {if } E_{ref}^{i} \ne E_{target}^{i} \end{array}\right. } \end{aligned}$$
(5)

We can now extend this to groups of users, given two groups of users \(E_{ref}\) and \(E_{target}\), the average similarity \(I_{group}(E_{ref}, E_{target})\) is:

$$\begin{aligned}{} & {} I_{group}(E_{ref}, E_{target}) \nonumber \\{} & {} \quad = \frac{1}{m \times p} \sum _{i=1}^{m} \sum _{j=1}^{p} I_{individual}(E_{ref}^{i}, E_{target}^{j}) \nonumber \\{} & {} \qquad \text {with} ~ E_{ref} \in g ~ \text {and} ~ E_{target} \in g \end{aligned}$$
(6)
  • \(m\) is the number of profiles in group \(E_{ref}\).

  • \(p\) is the number of profiles in group \(E_{target}\).

  • \(I_{individual}(E_{ref}^{i}, E_{target}^{j})\) is the similarity between the \(i^{th}\) profile in \(E_{ref}\) and the \(j^{th}\) profile in \(E_{target}\) computed using the individual formula described before.

When one of the entities considered (\(E_{ref}\) or \(E_{target}\)) is static (e.g. themes, places, or dates / periods), we assimilate it to the subset of users who have referenced it in their posts. This enables us to assess whether these entities are associated with similar demographics.

4.6 Location similarity (L)

The Location dimension \(L(E_{ref}, E_{target})\) operates differently based on the pattern of proxemic similarity in use (see Fig. 3). Specifically, it has three variations.

For Pattern 1, which covers 4 proxemic similarity entity pairing (user to users, user to groups, group to users, and group to groups), we adapt the Jaccard Index (Zangerle et al. 2013) used to compute similarity between sets of spatial, temporal, and thematic locations. We evaluate the cooccurrences of locations between \(E_{\text {ref}}\) and \(E_{\text {target}}\). The more locations (spatial, temporal and thematic) two users or groups share, the more similar they are considered to be. For example, if two users frequently mention visiting the same cities or attending the same events, they are considered to have a high location similarity. Additionally, we want locations mentioned in recent posts to weight more (we consider that, as they are more recent, they are more relevant to domain stakeholders). Therefore, we introduce a time decay factor \(w_{freshness}\), which is based on the freshness of posts (e.g., a post issued today will have a weight of 1, while a post issued x days ago will weight less).

$$\begin{aligned} w_{freshness}(\text {p}) = e^{-\lambda \times (currentDate - p.date)} \end{aligned}$$
(7)
  • \((currentDate - p.date)\) is the difference in days between the current date and the post’s issuance date.

  • \(\lambda\) is a constant that controls the decay rate (how fast older locations are weighted less). We use \(\lambda = 0.01\) for a balanced effect.

The time-weighted locations similarity score between users or groups could be defined as:

$$\begin{aligned}{} & {} L_{individual}(E_{ref},E_{target}) \nonumber \\{} & {} \quad = \sigma \times S_{spatial} + \theta \times S_{temporal} + \zeta \times S_{thematic} \nonumber \\{} & {} \qquad \text {with} ~ \sigma + \theta + \zeta = 1 ~ \text {and} \nonumber \\{} & {} \qquad ~ E_{ref} \in \{u,g\} ~ \text {and} ~ E_{target} \in \{u,g\} \end{aligned}$$
(8)
  • \(\sigma , \theta , \zeta\) are coefficients used to modulate the strength of each type of locations (spatial, temporal, thematic).

  • \(S_{spatial}, S_{temporal}, S_{thematic}\) are individual locations similarity scores for the three types of locations, as defined below.

$$\begin{aligned}{} & {} S_{type} = \nonumber \\{} & {} \quad \frac{\sum _{l \in E_{ref}^{type} \cap E_{target}^{type}} w_{freshness}^{ref}(l.p) + w_{freshness}^{target}(l.p)}{\sum _{l \in E_{ref}^{type} \cup E_{target}^{type}} w_{freshness}^{ref}(l.p) + w_{freshness}^{target}(l.p)} \nonumber \\{} & {} \quad \text {with} ~ type \in \{spatial,temporal,thematic\} \end{aligned}$$
(9)
  • type is type of locations considered (spatial, temporal or thematic).

  • \(E_{ref}^{type}\) is the set of locations mentioned by \(E_{ref}\) of type type. The same applies for \(E_{target}\).

  • \(w_{freshness}^{ref}(l.p)\) represents the time decay factor \(w_{freshness}\) applied to the most recent post in \(E_{ref}^{type}\) where location l is mentioned. Similarly, \(w_{freshness}^{target}(l.p)\) applies to \(E_{target}\).

This formula allows us to assess whether users or groups are similar based on: (1) where they interacted (spatial, where), (2) when they were active (temporal, when), and (3) what they interacted about, their interests (thematic, what).

When applying Pattern 2 and Pattern 3 (e.g., user to themes, places, etc.), we use the ratio of the users’ posts containing the theme to his other posts. This allows us to detect the affinity of users or groups to specific themes, places, and time periods. We also weight occurrences with the time decay factor to give more weight to recent ones. We define \(L_{occurrences}(E_{ref},E_{target})\) as:

$$\begin{aligned}{} & {} L_{occurrences}(E_{ref},E_{target}) \nonumber \\{} & {} \quad = \frac{\sum _{p \in P_{E_{\text {ref}} | E_{\text {target}} \in p.locations}} w_{freshness}(p)}{\sum _{p \in P_{E_{\text {ref}}}} w_{freshness}(p)} \nonumber \\{} & {} \quad \text {with} ~ E_{ref} \in \{u,g\} ~ \text {and} ~ E_{target} \in \{s,t,th\} \end{aligned}$$
(10)
  • \(P_{E_{\text {ref}} | E_{\text {target}} \in p.locations}\) represents the set of all posts from user or group \(E_{ref}\) that contain \(E_{target}\).

  • \(\sum _{p \in P_{E_{\text {ref}} | E_{\text {target}} \in p.locations}} w_{freshness}(p)\) is the sum of the weights of all posts from user or group \(E_{ref}\) that contain \(E_{target}\).

  • \(\sum _{p \in P_{E_{\text {ref}}}} w_{freshness}(p)\) is the sum of the weights of all posts from user or group \(E_{ref}\).

Lastly, when dealing with Pattern 4, which consists in determining the proxemic similarity between themes, places and dates or periods (e.g., place to themes, theme to periods, etc.), we consider the co-occurrences of spatial, temporal, and thematic locations found within posts or user feeds. For instance, consider the post: “We went swimming at the beach in Paris”. In this example, “swimming” (thematic), “beach” (thematic), and “Paris” (spatial) are co-occurring elements. For this, we can use a variant of the time-weighted locations similarity score described above (the time decay factor \(w_{freshness}\) is the same as defined before).

$$\begin{aligned}{} & {} L_{coocurences}(E_{ref},E_{target}) \nonumber \\ = \frac{\sum _{p \in E_{ref}^{posts} \cap E_{target}^{posts}} w_{freshness}(p)}{\sum _{ps \in E_{ref}^{posts} \cup E_{target}^{posts}} w_{freshness}(p)} \nonumber \\{} & {} \qquad \text {with} ~ E_{ref} \in \{s, t, th\} ~ \text {and} ~ E_{target} \in \{s, t, th\} \end{aligned}$$
(11)
  • \(E_{ref}^{posts} \cap E_{target}^{posts}\) is the set of posts containing both \(E_{ref}\) and \(E_{target}\).

  • \(E_{ref}^{posts} \cup E_{target}^{posts}\) is the set of posts that contain either \(E_{ref}\) or \(E_{target}\).

4.7 Movement similarity (M)

The Movement dimension \(M(E_{ref}, E_{target})\) focuses on sequential relationships within multidimensional trajectories. A sequence is formed when two locations are mentioned consecutively on different posts. Using an example, if a post reads “We are visiting the museum in Paris”, following another that mentions “swimming”, sequences like \({swimming} \xrightarrow {} {visiting}\), \(swimming \xrightarrow {} museum\), and \(swimming \xrightarrow {} Paris\) emerge. Here, for simplicity reasons, we limit ourselves to sequences of two contiguous locations, representing transitions from one place, theme, or time period to another.

For Pattern 1 (such as user-to-users or group-to-users relations), we employ the same formula as location similarity (\(L_{\text {individual}}\)). However, instead of considering occurrences of spatial, temporal, or thematic locations, we compare the sequencings of these elements across posts.

Fig. 5
figure 5

Two sets of spatial, temporal, and thematic sequencings

This approach (\(M_{individual}\)) helps identify common travel patterns among tourists, for example. An illustrative example of the sequencing sets compared is presented in Fig. 5.

Table 4 Implementation of the ProxMetrics toolkit. Each line represents a Proxemic Similarity Pattern from Fig. 3

For Pattern 2 and Pattern 3, the goal is to determine if users consistently stay (or focus) in a particular location (e.g., a place, a theme, or a time period) in their posts. We achieve this by calculating the entropy associated with that location. In the following example, \(E_{\text {ref}}\) represents a user, and \(E_{\text {target}}\) a theme.

  1. 1.

    We start by counting the number of posts containing occurrences of the location \(E_{\text {target}}\) (denoted as \(C_{\text {target}}\)) and all other locations in \(E_{\text {ref}}\)’s sequence of posts (denoted as \(C_{\text {others}}\)):

  2. 2.

    We calculate the total number of posts (\(C_{\text {total}}\)):

  3. 3.

    We compute the entropy (\(M_{entropy}\)) using the following formula (the lower this value is, the less predictable the user’s sequence of posts is in regard to the reference entity):

$$\begin{aligned} M_{entropy}(E_{ref},E_{target}) \nonumber \\ = 1 - \sum _{T \in \{\text {target}, \text {others}\}} \left( \frac{C_T}{C_{\text {total}}} \right) \log _2 \left( \frac{C_T}{C_{\text {total}}} \right) \nonumber \\ \text {with} \nonumber \\ C_{\text {target}} = |\{ p \in E_{\text {ref}} \mid E_{\text {target}} \in p.locations \}|\nonumber \\ C_{\text {others}} = |\{ p \in E_{\text {ref}} \mid E_{\text {target}} \notin p.locations \}|\nonumber \\ C_{\text {total}} = C_{\text {target}} + C_{\text {others}} \nonumber \\ E_{ref} \in \{u,g\} ~ \text {and} ~ E_{target} \in \{s,t,th\} \end{aligned}$$
(12)

In Pattern 4, where we evaluate the relationships among themes, locations, or time periods (or any combination thereof), we use the conditional probability to determine how frequently one entity consistently appears after another. Given a sequencing set (see Fig. 5), the probability that an entity \(E_{target}\) follows \(E_{ref}\) is:

$$\begin{aligned} M_{sequencing}(E_{ref},E_{target}) = \frac{N(E_{ref} \cap E_{target})}{N(E_{ref})} \nonumber \\ \text {with} ~ E_{ref} \in \{s,t,th\} ~ \text {and} ~ E_{target} \in \{s,t,th\} \end{aligned}$$
(13)
  • \(M_{sequencing}\) is the probability that \(E_{target}\) occurs after \(E_{ref}\).

  • \(N(E_{ref} \cap E_{target})\) is the count of times both \(E_{ref}\) and \(E_{target}\) appear sequentially.

  • \(N(E_{ref})\) counts occurrences of \(E_{ref}\).

This measure can help in determining social media users’ subsequent destinations after visiting a particular city or what people tend to do after practicing a given activity (theme). It can also easily be reversed to determine what people do before.

4.8 Orientation similarity (O)

In the orientation, both contextual sentiment and engagement data are taken into account, and they are linked on a per-post basis. In our model, sentiment is categorized with labels: positive, negative, or neutral. Engagement is quantified by aggregating metrics such as the number of likes, reposts, comments, etc.

For every proxemic similarity pattern, our formula mirrors that of the location dimensions with one key modification: the weighting. Rather than applying the time-decay factor \(w_{freshness}\), we introduce a new orientation factor, \(w_{orientation}\), that adjusts post weights based on sentiment and engagement levels associated with it.

$$\begin{aligned} w_{orientation}(p) = \mu \times p.eng \times \nu \times {p.sent} \nonumber \\ \text {with} ~ e \in {\mathbb {N}} ~ \text {and} ~ s \in \{0, 1, 2\} \end{aligned}$$
(14)

p.sent represents the sentiment value of the post p, which we map to the following values: \(\{0, 1, 2\}\). Here, 0 corresponds to negative posts (reducing the weight), 1 to neutral posts (no change in weight), and 2 to positive posts (increasing the weight). This scaling, for instance, helps in identifying whether multiple users or groups favor the same content, or in emphasizing positive posts when establishing connections between themes or places (as positive posts will weight more than neutral ones and negative posts will not be considered).

p.eng stands for the engagement value of the post p. A post with higher engagement will carry greater weight. For example, this can help to discern whether various users gain popularity around similar themes.

Lastly, \(\mu\) and \(\nu\) are coefficients employed to prioritize either sentiment or engagement in the weighting calculation.

Formulas with \(w_{orientation}\) weighting are called \(O_{individual}\), \(O_{occurrences}\) and \(O_{cooccurrences}\). It is important to note that the ProxMetrics toolkit is extensible. While we have chosen to implement it using these formulas, users are free to select others, provided they adhere to the proxemic patterns and the APs Trajectory Model. Table 4 offers a summary of the appropriate formulas for each similarity pattern. We will now experiment with the toolkit and manipulate the dimensions to address domain-specific requirements.

5 Experimentation: application to the tourism domain

Firstly (Sect. 5.1), we present the input data that we use for this experiment. This includes (1) the dataset (its collection and preprocessing for use with the ProxMetrics toolkit), as well as (2) the tourism knowledge base, and (3) the requirements from stakeholders in the domain of tourism (specifically, tourism offices) that we gathered.

Secondly (Sect. 5.2), we demonstrate how these requirements can be expressed as proxemic similarity indicators through the ProxMetrics toolkit (e.g., choice of proxemic pattern, relevant dimensions). We then elaborate on a chosen request from tourism offices.

Thirdly (Sect. 5.3), we generalize from this and qualitatively evaluate the four defined proxemic patterns with various dimensions, as well as the methodology used to combine them, to assess the validity of our working hypotheses and the relevance of the indicators produced by the toolkit. Additionally, we conclude the experiment by comparing the toolkit with existing platforms dedicated to analyzing touristic data to highlight its complementarity in this domain.

5.1 Input data and enrichment

Fig. 6
figure 6

Overview of the Instantiation Process of the APs Trajectory Model

Let’s begin by describing the input data and the enrichment process used in our experiments with the ProxMetrics toolkit. Fig. 6 illustrates the pipeline used to instantiate the APs Trajectory Model (note that the toolkit is applied to this model). We have also specified the dimensions of the model and the corresponding steps at which they are instantiated. The data collection and enrichment process was organized as follows:

  1. 1.

    We collected data from the popular social media platform Twitter (now X). Specifically, we gathered a corpus of 3,154 multilingual tweets in French, English, and Spanish that originated from the French Basque Coast during the summer of 2019 (see Fig. 6, ①). This area is known as one of the top tourist destinations in France. These tweets were posted by 655 unique users. Details on the data collection process can be found in Masson et al. (2022).

  2. 2.

    The data model was instantiated using two types of data: metadata and tweet content (see Fig. 6, ②). Metadata-based instantiation was a straightforward process, encompassing elements like engagement metrics, profile features, timestamps, and geotags, among others. The textual content of the tweets was processed using three deep learning-based NLP modules to generate automatic annotations at the token level for (1) locations and (2) thematic concepts, and at the text level for (3) sentiment. We divided the dataset into three parts with a split of 60% for training, 20% for validation, and 20% for testing, ensuring a uniform distribution of languages and users across these splits. More precisely:

    • For Tourism-related Concepts: we used the Thesaurus on Tourism & Leisure Activities provided by the World Tourism Organization (2002) to define them. This extensive resource covers approximately 1,300 touristic concepts. These concepts were linked to tweets through prompt-based few-shot learning with EntLM (Ma et al. 2022) paired with a multilingual BERT (Devlin et al. 2018) language model. The hyperparameters used are those recommended in the EntLM repository,Footnote 1 namely a batch size of 4, a learning rate of 1e-4, and a weight decay of 0. We found that prompt-based few-shot learning with EntLM was significantly more effective than fine-tuning approaches, achieving an F1-score of 0.840, compared to 0.241 for fine-tuning for this task.

    • For Spatial Locations: we extracted them by geotagging named entities from toponyms aligned with OpenStreetMap and from the geotags of posts. Here, fine-tuning multilingual BERT (Devlin et al. 2018) demonstrated superior performance with an F1-score of 0.848, compared to 0.788 with the few-shot technique for sequence labeling implemented by EntLM (Ma et al. 2022). We used the OSM Nominatim API (Clemens 2015) to geocode extracted toponyms.

    • For Sentiments: they were derived using the XLM-RoBERTa language model (Conneau et al. 2019). More precisely, we used the version which was fine-tuned for sentiment analysis (Barbieri et al. 2022). It achieved a very high accuracy of 0.939.

    • Users in the dataset were manually categorized into three groups (local tourists, external tourists, and photographers). This technique is not efficient, but the objective is to use this annotated dataset as a backbone to train deep learning-based classifiers to automate the task, as there were no existing annotated resources for this purpose publicly available at the time of our study.

More details on these NLP-related aspects are given in Masson et al. (2023a). In the end, this experimentation dataset of 3,154 tweets covers around 315 unique concepts (which were instantiated 7,022 times) and 7,128 locations. 1,357 tweets were positive, 366 negative and 1,238 neutral. The dataset has then been converted to our proxemic model (from Fig. 2) and loaded into the ProxMetrics toolkit (see Fig. 6, ③).

Below is a set of simplified requirements for indicators in the tourism domain, compiled through collaborative discussions with local tourism authorities and stakeholders on the French Basque Coast. Stakeholders are looking for indicators on:

  1. 1.

    Leisure Activities Practiced Together (which activities do tourists often practice together)

  2. 2.

    Cities Visited in Sequence (after visiting a given city, where do tourists usually go)

  3. 3.

    City-Specific Demographics (types of tourists per city)

  4. 4.

    Weather-Based City Preferences (choice of city by tourists based on the weather)

  5. 5.

    Seasonal Activity Preferences (choice of leisure activity by tourists based on the season)

  6. 6.

    Identification of Popular Events (which events are popular in the region)

  7. 7.

    Lacking Tourism Infrastructure (places where infrastructure is found disappointing)

  8. 8.

    Tourist Satisfaction about POIs (what POIs do tourists primarily enjoy)

  9. 9.

    Trends in Cross-Border Tourism (who, where, when, and what)

  10. 10.

    Connection of Similar Tourists (for a tourists connection system)

These requirements vary in scope, with some being broad and others more specific. Tourism stakeholders need indicators to better understand these diverse aspects of tourism in their region. To address this, we will take advantage of the ProxMetrics toolkit to address these requirements and calculate relevant proxemic similarity indicators, allowing for a deeper understanding of the various aspects of tourism in the region.

5.2 Toolkit experiment on tourism

Table 5 Example of end users requirements for social media analysis: the case of tourism

Table 5 presents the requirements from tourism offices as shown above. Each line number corresponds to a requirement from the list in Sect. 5.1. For each requirement, we have proposed a manner to express it as a proxemic similarity indicator using the ProxMetrics toolkit. This includes: (1) the proxemic environment that could be used to model the requirement, identifying both the reference entity (\(E_{ref}\)) and the target entities (\(\tau\)), whose proxemic similarities to the reference element will be assessed; and (2) the relevant proxemic dimensions of our model that are essential for addressing it. While a variety of dimension combinations and proxemic environments (e.g., reference and target entities) may be applicable for most requirements, we present only a selected one here due to space constraints.

5.2.1 Designing tourism indicators with ProxMetrics

Let’s examine the requirement “City-Specific Demographics” (requirement 3 in Table 5). In this context, the goal is to calculate indicators allowing tourism stakeholders to identify the types of tourists most commonly associated with various cities. Using the ProxMetrics toolkit, this requirement can be modeled by selecting a specific user group (e.g., short-stay tourists) as the proxemic reference and then determining the proxemic similarity of different cities to this group. Therefore, it falls into the proxemic Pattern 2 (refer to Table 4), which links a dynamic reference entity (in this case, a user group) to a static entity (in this case, a place).

To calculate this similarity, it may be useful to take into account multiple proxemic dimensions to achieve a more comprehensive analysis, such as the characteristics of the reference user group and of users visiting cities (Identity dimension), the frequency with which members of the reference group mention the city in their posts (Location dimension), and the sentiments they generally express towards it (Orientation dimension). For example, a given group of tourists and city could be perceived similar if positive sentiments are frequently expressed by the group about the city. This requirement could be articulated in various alternative ways (e.g., by choosing a specific city as the reference point instead of a user group). As the toolkit is modular, the end user is able to give more importance to certain dimensions by adjusting the dimensional weighting. In summary, Table 5 shows that the ProxMetrics toolkit is versatile and capable of modeling various domain-specific requirements.

5.2.2 Indicator case study: City-specific demographics

We will now go into more details with the example requirement presented above. Figure 7 focuses on the indicator “City-Specific Demographics”. Visuals were created using the ProxViewer dashboard, a web interface powered by the ProxMetrics toolkit, able to visualize results through proxemic reticles. This interface also helps end users setup the toolkit (dimensions, entities, etc.). Due to space constraints, we are unable to thoroughly present this dashboard. However, we have prepared a demonstration video. It is available here.Footnote 2

Fig. 7
figure 7

Visualization of the indicator “City-Specific Demographics” in proxemic reticles, using proxemic patterns 2 and 3

In this example, the “City-Specific Demographics” indicator is expressed using two distinct proxemic environments (Perspective 1 and 2 in Fig. 7). One where the demographic (user group entity) is the reference (here, Photographers), and similar cities are positioned relative to it, and another perspective where a chosen city of interest (here, the touristic city of Biarritz) is at the center, and we want to identify groups similar to this city. These 2 perspectives correspond to the proxemic Pattern 2 and Pattern 3 defined earlier (see Fig. 3). By default, we set proxemic dimensions (ILO) to equal weighting (\(\frac{1}{3} \approx 0.33\)), but this default weighting can be dynamically changed by end users if they wish to give more weight to profile features or city mentions.

  • The Identity (I) dimension influences the result based on whether the typical user profiles of a city are similar to the group based on profile features (e.g., are the tourists of the city of Biarritz usually of the same age compared to photograph tourists, etc.).

  • The Location (L) dimension is based on how often members of the user groups mention the given city in their posts, which may indicate a specific affinity for the city.

  • The Orientation (O) dimension is considered and weights city mentions by sentiment and engagement values, which means that influential or positive users will have more impact on the similarity results.

Let’s illustrate Perspective 1 from Fig. 7 using the pairing \(E_{ref} = Photographers\) and \(E_{target} = Biarritz\). We can refer back to Table 4 to determine the formula for this proxemic environment pattern, specifically Pattern 2. We calculate the proxemic similarity between Biarritz and Photographers as follows, using the \(I_{group}\), \(L_{occurrences}\), and \(O_{occurrences}\) formula corresponding to the Identity, Location, and Orientation (ILO) dimensions considered in this example. We obtain a proxemic similarity of around 0.403 by aggregating the I, L and O dimensions with equal weighting. It appears that users classified as Photographers have a moderate interest in this city when considering these 3 dimensions.

$$\begin{aligned} P_{s}(Photographers,Biarritz) \\ = \frac{1}{3} \times \underbrace{I_{group}}_{0.52} + \frac{1}{3} \times \underbrace{L_{occurences}}_{0.34} + \frac{1}{3} \times \underbrace{O_{occurences}}_{0.36} \nonumber \\ \approx 0.4026 \\ \end{aligned}$$

In Fig. 7, results are displayed through a proxemic reticle with the reference entity in the center and the target entities around it, scattered in different proxemic zones. Here, we have 3 zones: strong affinity, medium affinity, and weak affinity. However, other zones could be defined according to the domain and the requirement of interest. Any number of zones is possible. As we can observe, these visualizations are useful and flexible for domain stakeholders as they allow immediate identification of similar or dissimilar entities.

To further elaborate and demonstrate the ProxMetrics toolkit’s capability to calculate a wide array of indicators, we will provide an overview of four additional selected indicators.

5.2.3 Overview of additional indicators

Figure 8 presents example results for four additional indicators from those introduced in Sect. 5.1.

Fig. 8
figure 8

Results of Four Selected Proxemic Similarity Indicators from the Tourism Office Requirements

These four indicators were chosen to demonstrate the capabilities of the ProxMetrics toolkit, as they leverage various types of entity combinations and proxemic dimensions. The proxemic similarity indicators displayed are as follows:

  • Leisure Activities Practiced Together (Fig. 8, Indicator A): for this indicator, we are examining similar touristic activity-related themes compared to a reference one, in this example, Surfing. We use two equally weighted dimensions: Location (L), to consider the co-occurrences of themes in posts, and Orientation (O), so that positive co-occurrences weigh more heavily, as we consider those to be more significant. The results show that Surfing is often paired with Photography (see ①) and Outings (see ②). Perhaps, it appears that photographing surfers is particularly popular in the region.

  • Cities Visited in Sequence (Fig. 8, Indicator B): here, we are interested in the spatial movements of tourists, namely, where they tend to go after visiting a given place. In this case, we are focusing on the city of Ustaritz, positioned as the reference entity, and it will be compared to other cities in the region. We use the Movement (M) dimension with stronger weighting because we are particularly interested in the places that are often sequenced in tourists’ trajectories. Minor weighting is given to the Distance (D) dimension to slightly boost places that are physically close, as they are more likely to attract tourists. The results show that the cities of Biarritz (see ①) and Saint-Jean-De-Luz (see ②) are more likely to attract tourists after having visited Ustaritz. An extension of this indicator could be to calculate it for specific user categories, for example, Photographers. This would allow a recommender system to determine where to recommend another photographer to go based on the behavior of other photographers (via a collaborative filtering approach (Liu et al. 2014)).

  • Lacking Tourism Infrastructure (Fig. 8, Indicator C): for this requirement, the objective is to identify themes related to infrastructure that are broadly considered lacking in a given city. We use the city of Biarritz as a reference and observe similar themes based on the Orientation dimension. This dimension will bring positive themes closer to the reference city and push negative ones further. The results show that Shop** (see ①) and Exhibitions (see ②) are quite dissimilar, indicating that these aspects are severely lacking in this city. On the other hand, Sports (see ③) is very similar, suggesting that the city is viewed very positively in this aspect by tourists.

  • Connection of Similar Tourists (Fig. 8, Indicator D): Finally, the last requirement is to detect similar tourists to build a system that connects them. If we select a reference tourist, here Pierre, we can observe close matches. This indicator is based on all dimensions because we want to get the overall similarity based on various criteria. This can be tweaked depending on the use case, as the toolkit is modular. ① and ② show examples of other tourists that are closer to the reference one and could therefore be recommended because they have similar behavior and likely share interests with the reference user. This indicator could also be used for a recommender system to suggest themes or places based on what similar users liked.

We have demonstrated that ProxMetrics effectively models a wide range of diverse indicators in the tourism domain by combining various entities and modular proxemic dimensions according to the requirements of domain stakeholders. However, it is now essential to evaluate the relevance and significance of the produced proxemic similarity indicators for the stakeholders.

5.3 Qualitative evaluation of proxemic similarity patterns

Table 6 Qualitative Evaluation of the ProxMetrics toolkit on the 4 proxemic similarity patterns (see Fig. 3) with selected indicators

To determine whether the ProxMetrics toolkit produces indicators that are accurately representative of real-world phenomena or behaviors on social media, we conducted a qualitative evaluation of the results and compared the results obtained by ProxMetrics with the assessments of domain experts (colleagues specialized on cultural heritage and tourism practices), as depicted in Table 6. For each identified proxemic similarity pattern (from Fig. 3), we selected an indicator within this pattern as a case study. As mentioned earlier, the proxemic similarity is typically calculated between a reference entity and multiple target entities. However, for this evaluation, we focused on calculating the similarity between the reference entity and a single target entity to facilitate the work of the experts.

We implemented the following protocol. Firstly, for each of the four proxemic patterns, we asked five domain experts to assess the similarity (on a scale from 1 to 10, with 1 being extremely dissimilar and 10 being extremely similar) of the reference and the selected target entity for each dimension individually. To do this, experts were provided with only excerpts of the data model corresponding to the given dimension. For example, in the case of Pattern 1, when evaluating the proxemic similarity between two users within the identity dimension, two sets of user profiles along with their characteristics (age, number of followers, etc.) were given but no information on the tweets they posted, places visited, etc. When dealing with the location dimension, they were only given two sets of tweets, but not the sequence in which they were issued or the users to whom they belong, etc. This allows us to determine whether our individual dimension formulas are relevant and meaningful to domain stakeholders. Then, we asked the domain experts to assess the similarity of all relevant dimensions combined in regard to the example indicator chosen. This helps us to determine whether our method of combining dimensions (weighted mean) is relevant. For this evaluation, we deliberately selected entities that are not overly represented in the entire dataset, aiming to facilitate the experts’ evaluation process.

As seen in Table 6, for each dimension and pattern, we calculated the standard deviation (depicted as \(\sigma\)) between the five experts’ measures (1, 2, 3, 4, and 5) in order to assess their degree of agreement. Then, we averaged the measures provided by the experts (depicted as \(\bar{x}\)) and compared them against the results obtained by ProxMetrics. This allowed us to calculate the difference (depicted as \(\Delta\)) between the experts’ results and those of ProxMetrics, thus determining whether they are in concordance or not.

As we can observe in Table 6, for the individual dimensional evaluation, out of the 18 evaluation cases, there are only 3 cases ( in Table 6) where the ProxMetrics assessment significantly differs from that of the evaluators (e.g., the \(\Delta\) is greater than 2). These are Pattern 1 with the I dimension (\(\Delta = 3.50\)), Pattern 2 with the O dimension (\(\Delta = 3.60\)), and Pattern 3 with the O dimension (\(\Delta = 2.40\)). In cases of poor assessment for the O dimension, we notice they are also correlated with very low agreement between evaluators (\(\sigma = 1.90\) for Pattern 2 and \(\sigma = 1.85\) for Pattern 3), indicating there are various ways to interpret the similarity for this dimension. Therefore, other formulas might be more appropriate depending on the domain requirements. For the other dimensional measures, ProxMetrics provides assessments that are quite similar to those of the experts, demonstrating the relevance of our formula in the domain of tourism.

In the case of multidimensional combinations (DILMO, ILO, ILMO, and LO in the examples chosen), for most of them (3 out of 4, in Table 6), the aggregation with default parameters (equal weighting) of ProxMetrics closely matches that of the evaluators, with \(\Delta = 1.06\), \(\Delta = 0.87\), and \(\Delta = 0.93\). However, for Pattern 4, the result diverges with \(\Delta = 2.85\) ( in Table 6), and the agreement among evaluators is also significantly weaker (\(\sigma = 1.72\)) compared to the other patterns. In all cases, it would have been possible to improve the accuracy of the results by altering the weighting of each dimension. Let’s consider Pattern 1 as an example: if we had doubled the impact of the L and O dimensions and reduced that of D by half, we would have obtained a result of \(3.14\), similar to that of the experts. This highlights the importance of choosing appropriate weights during the process to get accurate results. These weight values are highly dependent on the domain and specific requirements; therefore, it is necessary for experts in each domain to tweak them.

Let’s conclude this experiment by comparing the ProxMetrics toolkit applied to the domain of tourism with indicators produced by other platforms in this domain. ProxMetrics is highly complementary for several reasons:

  • Dynamic, User-Parameterized Indicators: It is a dynamic and modular tool that allows users to build their own indicators in real time using proxemic dimensions (DILMO). This contrasts with static dashboards such as Pilat Tourisme (2022), Isère Attractivité (2023), or Atout France (2023), where users have no say about what is presented to them.

  • Distinctive Analytical Insights: ProxMetrics introduces a type of indicator (proxemic similarity) rarely seen in existing solutions. This allows tourism stakeholders to gain new analytical perspectives by assessing the similarity of domain-specific entities (tourist activities, points of interest, types of tourist, etc.). Additionally, usual tourism dashboards like the one from Visit Paris Region (2023) usually allow combined filtering but are limited in terms of blending multidimensional indicators together.

  • New Visualization (Proxemic Reticles): This offers a mode of visualization that contrasts with the classic visualizations used in tourism dashboards such as UNWTO (2023), INSEE (2023), or OECD (2023), which contain spatial maps, timelines, tables, and numerical charts but do not highlight the similarity of entities relative to each other.

6 Conclusion and future perspectives

In this paper, we introduced ProxMetrics, a modular toolkit designed to assess the similarity of social media entities (namely users, groups, themes, places, and time periods). This toolkit is grounded in the theory of proxemics, a traditionally physical theory that we adapted for social media. The adaptation of this theory allows the toolkit to be highly versatile, enabling domain stakeholders to build insightful indicators for various requirements by manipulating the five dimensions of proxemics (DILMO): Distance, Identity, Location, Movement, and Orientation.

Our experimentation within the tourism domain demonstrated not only the practical applicability of ProxMetrics but also its potential to contribute meaningfully to domain stakeholders. The proxemic similarity scores that it generates facilitate a deeper understanding of the relationships and behavioral patterns within social media. Through the proxemic viewer dashboard (ProxViewer), we have also addressed the accessibility of the toolkit, ensuring that it can be used by non-computer scientist users.

We are currently seeking to enhance this work in various ways. Firstly, we plan to conduct a more extensive evaluation of the toolkit on larger datasets (to determine whether it scales effectively to massive volumes of data, on the order of millions of posts) and across other domains (such as fashion and public policy). This evaluation will be conducted in collaboration with management researchers and will cover a broader range of end users through semi-structured interviews. Secondly, although our current choice of foundational measures to evaluate proxemic similarity has proven effective in the tourism experiment, we want to explore other measures (such as series-based or graph-based ones) to determine whether they would significantly improve the quality of results for each proxemic dimension. Thirdly, we aim to improve the explainability of how these dimensions influence the calculated proxemic similarity scores for non-specialist users by adding reactive visual aids to the ProxViewer dashboard. Lastly, we envisage using the ProxMetrics toolkit for the detection of bots and avatars on social media, addressing one of the world’s most pressing issues (Ferrara 2023). The ProxMetrics toolkit can calculate proxemic similarities between users, which may be indicative of automated behavior. For effective bot detection, it will be necessary to incorporate additional dimensions, such as temporal activity patterns (Chavoshi et al. 2017), interaction diversity (Kosmajac and Keselj 2019), linguistic consistency (Cardaioli et al. 2021), and network centrality (Shinan et al. 2023), to enhance the accuracy and reliability of the detection process.

The toolkit and associated dashboard will be made open source and released to the public in the coming months.