1 Introduction

Geospatial data on buildings is important for a wide array of applications. For example, it can be used to study the urban fabric, while adding building attributes such as type and height facilitates the generation of 3D building models and supports energy simulations, climate studies, disaster mitigation, land administration, and urban morphology analyses (Park and Guldmann 2019; Li et al. 2020b; Huang & Wang 2020; Agugiaro et al. 2020; Yuan et al. 2020; Palliwal et al. 2021; Abdelrahman et al. 2021; León-Sánchez et al. 2021; Ning et al. 2021; Wu & Biljecki 2021; Koeva et al. 2021; Bourdeau et al. 2019; Chen et al. 2020; Li et al. 2021; Florio et al. 2021; Hopf 2018). Information on nearby amenities (POIs) and the surroundings is also important in this context, as it is often associated with buildings, e.g. as an indicator of housing value, demographics, and accessibility (Feng & Humphreys 2012; Kang et al. 2021; Yang et al. 2021; Mirkatouli et al. 2018; Szarka & Biljecki 2022; Su et al. 2021).

However, in practice, such data is still complex to obtain, and many issues prevail despite the significant developments in the GIScience and remote sensing communities, such as the proliferation of Volunteered Geographic Information (VGI), e.g. OpenStreetMap (OSM), and advancements in data acquisition techniques. First, such data remains unavailable for most of the world, especially considering open data instances. Second, when such features are mapped, they often lack semantic information (attributes), e.g. year of construction, number of storeys, and type of building. This omission is most evident in recent efforts mapping buildings at a large scale but without considering any descriptive information on them (Huang et al. 2020; Li et al. 2020a; Sirko et al. 2020) analyse movement trajectories to infer attributes of roads, Chen et al. (2021b) mine social media data to map and understand amenities, Lines & Basiri (2021) exploit obstructions in satellite signals to reconstruct the vertical extent of buildings, Wu & Biljecki (2022) map buildings from street networks, Milojevic-Dupont et al. (2020) infer the heights of buildings by developing a regression model that predicts them from the characteristics of the footprint and surrounding context, and Delmelle & Nilsson (2021) assess the ability of using property listing text for neighbourhood type prediction. However, to the best of our knowledge, the potential of real estate data in building and amenity data acquisition remains uninvestigated despite their abundance; investigating it is the key contribution of this paper.

An illustration of our hypothesis and the work is given in Fig. 1. Real estate data mainly span two forms: texts and images. For example, transaction data are mostly recorded as text (e.g. address, price, and flat type), while typical rental or sale listings can have both text descriptions and images of the property and its attached amenities such as common spaces, gyms, and swimming pools. In most cases, such records are about subdivisions of buildings (e.g. units, flats). However, even when they pertain to a subset of a building (e.g. one out of a few hundred flats in a residential estate), they are representative of entire buildings, as by extension, they provide building information such as the building’s year of construction, tenure, location, and common amenities. Some of these forms of data provide information that could be extracted directly without much effort, while some would require a degree of processing. For example, text and/or photos in ads often feature amenities such as sports courts that are part of or are near the advertised property. In such instances, these amenities could be detected from ads using computer vision approaches, which are now mature and readily available for that purpose (Chen et al. 2021a). Further, textual data may require processing as well. Free-text descriptions of a property may contain valuable information that could be extracted using basic text mining or natural language processing techniques.

Fig. 1

The idea of our research: real estate datasets (i.e. property ads and transactions) provide a wealth of information on buildings but are a latent type of data in GIS that has a diverse set of untapped applications — ranging from quality control to rapid data updates thanks to their omnipresence, amount of information, and dynamic nature. This illustration is based on real data of a setting in Singapore. Notes: (i) the exemplified property transaction data has been adopted from another undisclosed building to preserve privacy; (ii) this example is simplified and it is not exhaustive, as it presents rather a subset of data that is in the scope of this research; (iii) some building information may be found in both ads and transactions (e.g. year of construction, floor area of apartments). Data sources: OpenStreetMap, PropertyGuru, Urban Redevelopment Authority of Singapore, and OSM Buildings

The same goes for transaction data. For example, while the information about the number of storeys of a building may be available in an ad, depending on the jurisdiction and other aspects, it may also be available indirectly from transaction data, from the address (i.e. unit number) of the apartment that is sold. That is, provided a long horizon of transaction data, we may be able to pick up the apartment sold on the highest floor (or at least very close to it), presenting an equivalent of the height of a building with an accuracy sufficiently reasonable for a number of analyses and to indicate the rough urban form (Manoli et al. 2019; Liu et al. 2020; Wang et al. 2021).

The extracted and processed data may provide value for multiple applications. To elaborate on the illustrated concept, here we provide four examples. First, the location of the building and the presence of an amenity such as a swimming pool in its vicinity (e.g. deduced from a photo in the listing of a currently advertised apartment) may be used to check the content of an existing spatial database such as OSM for quality. If the feature is missing from the building’s vicinity, the spatial dataset can be flagged as suffering from completeness issues, with the ability to identify issues at a high spatial resolution. Second, extracted descriptive information about buildings, such as their type and year of construction, may be compared to the existing attributes in a spatial database, potentially indicating a discrepancy warranting further attention (e.g. errors or outdated records). Third, properties may be listed and also sold even before construction of the buildings has started (off-plan, pre-sales properties). Therefore, ads and transactions may hold value as a signal for new or future buildings that are yet to be mapped, and provide information about the future situation that may support use cases in urban studies and democratisation of the planning process, and make such data available to a broader audience, as data on future developments is rarely available openly, especially in datasets such as OSM. Fourth, the extracted information can be used to enhance the database if it has not been available previously, e.g. the number of storeys of a building, which is often the case. Expanding the set of attributes in a spatial database may open the door for further spatial analyses that require such information, but have not been possible previously due to the lack of such data. Since online real estate marketplaces are nowadays prevalent around the world, unlike (open) data on buildings, we believe that such an idea holds great potential. Therefore, much of our work focuses on investigating how we can make the best out of such data in the geospatial realm, by developing a proof of concept and designing experiments for various scenarios.

Another characteristic of real estate data is that they are highly dynamic. Their continuous update is especially true for advertisement services, in which multiple ads representing a portion of real estate in a city may be added every minute. As these data are uploaded by the users of the property websites, we postulate that real estate data may be considered as a latent form of volunteered geographic information (VGI), and one that warrants further investigations, as contemplated above (Goodchild 2007). To be more specific, for the first time, we deem that they are a type of passive and implicitly volunteered VGI (Craglia et al. 2012; See et al. 2016; Ghermandi & Sinclair 2019; Hopf 2018), as contributing spatial data is not the contributor’s primary intention, similarly to social media and geo-tagged imagery such as Flickr (Yan et al. 2017, 2018). Considering real estate data from such an angle, we posit that they may also double as reference data to enrich and verify building and amenity databases, which has not been investigated yet even though such data have various applications in a few other areas such as socio-economic studies (You et al. 2017; Liu et al. 2019; Kang et al. 2020; Su et al. 2021). This topic is also relevant in the context of the growing interest in VGI among the smart city and sustainable development communities (Milojevic-Dupont & Creutzig 2021; Nitoslawski et al. 2019).

With the aim of developing a new method of improving geospatial databases of buildings and amenities, we pursue the following research questions: What is the potential of using real estate data to update or create spatial databases of buildings and associated amenities? How can we develop an automated mechanism to collect, maintain, and ensure the quality of building and amenity information in spatial databases? For maintaining a database and assuring its quality, there are two lines of work that we consider. The first one is focused on developing a new data quality assessment method, investigating whether we can leverage real estate data to examine the completeness of the building footprints and locations of amenities. The second one zeroes in on collecting new data on buildings: the use of real estate data to add unfilled attributes and new buildings or amenities into the existing database.

2 Background

2.1 Information on buildings and their update

Acquisition of building information, the primary focus of this paper, has been thoroughly investigated in the last decade. A variety of data sources and approaches, from satellite/aerial imagery to point clouds to street view imagery, have been used to extract information on buildings, in forms ranging from points and footprints to semantically rich 3D building models, for diverse applications and at various scales (Bshouty et al. 2019; Frantz et al. 2021; Ledoux et al. 2021). Some parts of the world have benefited from these developments, and an increasing number of jurisdictions are rich in data on buildings, which are often released openly. Nevertheless, in many other parts of the world, mapping buildings remains a manual task, e.g. by OpenStreetMap contributors, due to the lack of input data such as high-resolution imagery, often leaving areas unmapped or partially mapped.

While the acquisition of building information has been a rapidly developing topic, and while there has been an increasing body of research focusing on developing mechanisms to update spatial databases automatically (Zhang et al. 2018; Guo et al. 2016; Cheng et al. 2008; Tian et al. 2012), the update of building information also remains a challenge. Most existing approaches rely on manual updates from cadastral data and change detection (Shi et al. 2020). In this paper, we also seek to understand whether we can take advantage of the dynamic nature of real estate data to fetch up-to-date information on buildings for the purpose of their update.

2.2 OpenStreetMap and quality control

OpenStreetMap has gained considerable attention in recent years, and it is now being used across academia, government, and companies as a reliable source of spatial data (Yan et al. 2020). In fact, in some cases, it is the only freely available source of spatial data (So & Duarte 2020). While OSM started with a focus on roads, the community is now increasingly spotlighting buildings, and in some locations they are fully mapped, together with attributes (Brovelli & Zamboni 2018; Biljecki 2020). As such, OSM has ascended to support a variety of spatial analyses that require building data in academia and beyond (Westrope et al. 2014; Cerri et al. 2020; Schilling and Tränckner 2020; Braun et al. 2021; Ma et al. 2022; Zhang et al. 2022; Komadina & Mihajlovic 2022). Nevertheless, the completeness of building data, including in developed countries, remains heterogeneous. Therefore, two lines of effort have emerged: ameliorating the data and assessing their quality. Both often require a reference dataset, another instance of assumedly sufficient reliability that can be freely used to either ingest it in OSM or use it to cross-check the content of OSM (Zielstra et al. 2013; Zheng & Zheng 2014; Brovelli et al. 2016; Juhász & Hochmair 2018; Zhou 2018; Witt et al. 2021; Majic et al. 2021). The same concepts apply to other instances beyond OSM.

In this study, we investigate the potential of property transactions and commercial real estate advertisement data to serve as reference data for both purposes. This aspect may be particularly interesting as another contribution in the topic of VGI quality, as using one form of VGI to assess the quality of another has not been documented much.

2.3 Applications of real estate data

Property transactions have been used routinely for various types of real estate analyses (Fesselmeyer & Seah 2018; Lee & Ooi 2018). Information obtained by scraping real estate websites, such as property listings and ads for short-term accommodation, has been used primarily for studies such as understanding patterns driving prices and analysing socio-economic distributions (Boeing & Waddell 2016; Li & Biljecki 2019; Boeing 2020; Delmelle & Nilsson 2021; Kang et al. 2020; Liang et al. 2021; Zhang et al. 2021b; Nowak & Smith 2016; Su et al. 2021). For example, You et al. (2017) estimate prices of houses from a large number of house photos posted by real estate brokers on property websites.

Outside of the real estate focus, there are few papers making use of such data. To the best of our knowledge, the paper most related to ours is the study of Hopf (2018), who extracted information from textual data in 8341 real estate advertisements in Switzerland to support predictive energy data analytics. Building attributes such as dwelling type, number of rooms, and dwelling level from the real estate advertisements are used for household classification. These attributes are linked to the geographic coordinates of the buildings, and all real estate advertisements within a radius of 1000 m of each household address are considered. This work demonstrates how textual data in real estate ads can be exploited to extract some information on buildings. We take inspiration from the work, and considerably expand it by positioning the work in the geospatial domain and volunteered geographic information, considering property transactions for the first time (besides property listings only), increasing the scope of the extracted information, investigating and elaborating a broad range of use cases (e.g. data quality control), and conducting a city-wide analysis to comprehensively investigate the potential of the approach and several aspects neglected hitherto, such as the influence of the urban form.

Liu et al. (2019) analyse over 200,000 images in rental ads to study the geographical differences of interior decorations in ten major cities in the United States, taking advantage of the rare opportunity that photos in ads give us an insight into homes at a large scale. In a related work using the same type of data, Rahimi et al. (2016) investigate the decor of home spaces to study the presence of geographic culture and globalisation trends. Finally, Chu et al. (2016) take advantage of floor plans, which are often included in real estate ads, to generate indoor 3D models. However, the work is not standalone as it also requires other data.

These studies demonstrate that real estate data, which are often widely and easily available, contain a rich set of information pertaining to buildings in the shape of text and photos covering a large geographical area, and therefore they may hold much potential in geospatial research. However, to the best of our knowledge, there has been no research on using real estate data to systematically and comprehensively extract information on buildings for purposes such as geodatabase update.

3 Methodology

3.1 Study Area

Singapore is a densely populated city-state in Southeast Asia with a vibrant real estate market. Buildings in Singapore are classified into three main categories: residential, commercial, and industrial. For residential buildings, there are three subcategories: Housing & Development Board (HDB) (public housing), private housing, and hybrid housing. Buildings managed by HDB accommodate more than 80% of residents, and their quality in OSM has been deemed as very high with near full completeness (Chen 2020; Biljecki 2020).

3.2 Data

3.2.1 Data collection, cleaning, and processing

For real estate data, we use two instances: property sales transactions and real estate ads (listings). In the former, we have two datasets, which are similar but come from two different sources: transaction data of public housing are downloaded from the Singapore Government’s open data portal, and transaction data of private/hybrid properties (private housing, commercial buildings, and industrial buildings) are obtained from REALIS (Real Estate Information System), which is a real estate database maintained by the Urban Redevelopment Authority (URA). Transaction data of public housing properties are available for a 20-year period, and transaction data from REALIS are of the last five years. They contain the address, price, type of flat, floor area, and storey of each transacted property (Fig. 1), and the dataset is similar to what many other governments elsewhere provide.

On the other hand, listings data, including rent and sale properties, are scraped from PropertyGuru, a popular property website in Singapore (see Figs. 1 and 2), also similar to those that are available in many other countries around the world.

Fig. 2

Example of a listing for a property for rent. Besides the location and building characteristics, the ads contain information about associated amenities in both textual and photo forms. Source: PropertyGuru

The total number of collected transactions and listings will be described later. In our analysis, we will also take a subset of only the data available in the last month, to understand the continuous potential of this work, e.g. collecting data on a monthly basis and understanding how many buildings can be captured with only one month's worth of transactions and ads. There is a certain overlap between these sources in terms of the information they provide, and in the workflow we describe them together, but in the results, we consider these sources separately, to provide an interpretation of the results for each source, as not all of these might be available in other locations at the same time.

Regarding transaction data, building locations (addresses) and attributes of sold properties can be extracted directly from the datasets, thanks to the clean and structured dataset. For real estate advertisements, besides the locations of the building (available directly as coordinates), the descriptions and photos of the listings contain much information about building characteristics, but require a degree of processing. Further, unlike transactions, ads provide information about the attached amenities. In our work, we focus on sports facilities (primarily gyms and swimming pools), but the method is generalisable to further types of amenities such as playgrounds and parking lots.

To ensure the reliability of the mechanism, the reference data require automatic cleaning and processing to make it usable for harmonisation with a spatial database. There are three main steps of data cleaning and processing in this study: localisation, removal of duplicates, and extraction of information from text and photos.

First, the transaction data collected from authoritative resources include the address instead of geographic coordinates. Geocoding is applied to obtain the geographic coordinates of each building by using the Google Maps Platform API. For listing data, geographic coordinates are already available.

Second, we have found multiple records pertaining to the same property in all sources of data. In the transactions, this is because the same unit was sold multiple times in the timeframe of the dataset. Regarding the listings, the reason is that one or multiple real estate agents may advertise the same property concurrently. Such duplicates were detected, and only the latest record was preserved.
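The duplicate removal described above can be sketched as follows. This is a minimal illustration rather than the implementation used in the study: the field names (`address`, `unit`, `date`, `price`) and the choice of key are assumptions made for the example, and the sample records are invented.

```python
from datetime import date


def keep_latest(records, key_fields, date_field="date"):
    """Return one record per key, keeping the most recent one by date_field."""
    latest = {}
    for rec in records:
        key = tuple(rec[f] for f in key_fields)
        # Replace the stored record only if this one is newer.
        if key not in latest or rec[date_field] > latest[key][date_field]:
            latest[key] = rec
    return list(latest.values())


# Invented sample: the same unit sold twice, plus one other unit.
transactions = [
    {"address": "Blk 1 Example Rd", "unit": "#05-12",
     "date": date(2019, 3, 1), "price": 400000},
    {"address": "Blk 1 Example Rd", "unit": "#05-12",
     "date": date(2021, 7, 15), "price": 455000},
    {"address": "Blk 1 Example Rd", "unit": "#09-03",
     "date": date(2020, 1, 20), "price": 430000},
]

deduplicated = keep_latest(transactions, key_fields=("address", "unit"))
```

For listings, the same function could be applied with a key that identifies the advertised property regardless of the agent who posted it; how such a key is best constructed depends on the listing source.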

Third, the extraction of relevant information from the transactions and the textual portion of listings was trivial using existing approaches and required little processing (thus, much of the paper will be devoted to the results and discussion rather than the method). The photos in the listings were used to extract the presence of amenities in them. These are detected using computer vision (object detection) with the Google Cloud Vision API, giving reliable results without the need to develop a custom prediction model, making the method accessible to researchers who have little expertise with such techniques.

3.2.2 Summary of the datasets at stages of data processing

After the steps of localisation and removal of duplicates, we have identified that rent and sale listings cover 4371 and 8865 buildings with at least one listing, respectively (Table 1). This amount represents about one tenth of the building stock of Singapore, and about half of residential buildings. Transactions covered more than 90% of public housing buildings (9316), and 14,192 private/hybrid residential buildings (also a high share) (Table 1). In addition, 1800 commercial and industrial buildings have been identified (Table 1). The information extraction will be based on these cleaned datasets.

Table 1 Summaries of cleaned datasets

3.3 Connecting the extracted data with a spatial database (OSM)

The first step towards making use of relevant information from real estate data is to associate the extracted items with the ones in OpenStreetMap (or any other spatial database). The idea is to first identify the locations of the extracted items in OSM, and then compare them with the existing data in OSM. For buildings, the listings are converted to data points, which indicate the locations; while for amenities, we define the nearest buildings in OSM as the approximate locations. Next, buffering and spatial intersection are applied to compare the two databases. We define the intersected points as ‘covered’ points and unintersected ones as ‘uncovered’ points, indicating whether the extracted items can be matched. The methodology of this process for both buildings and related amenities is illustrated in Figs. 3 and 4. Following this method, the functions of the developed mechanism are outlined in Table 2. Buildings are described first in this section.
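The covered/uncovered classification can be sketched in plain Python as below. This is an illustrative sketch under stated assumptions, not the implementation used in the study: coordinates are assumed to be in a projected system with metre units, footprints are simple polygons without holes, and the 5 m buffer is a hypothetical value. Testing whether a point lies within the buffer distance of a footprint is equivalent to buffering the point and intersecting it with the footprint.

```python
import math


def point_in_polygon(pt, polygon):
    """Ray-casting test: True if pt lies inside the polygon."""
    x, y = pt
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def distance_to_segment(pt, a, b):
    """Shortest planar distance from pt to the segment a-b."""
    px, py = pt
    ax, ay = a
    bx, by = b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0:
        return math.hypot(px - ax, py - ay)
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))  # clamp the projection onto the segment
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def is_covered(pt, polygon, buffer_m):
    """True if pt is inside the footprint or within buffer_m of its outline."""
    if point_in_polygon(pt, polygon):
        return True
    n = len(polygon)
    return any(
        distance_to_segment(pt, polygon[i], polygon[(i + 1) % n]) <= buffer_m
        for i in range(n)
    )


def classify_points(points, footprints, buffer_m=5.0):
    """Split point ids into covered and uncovered lists."""
    covered, uncovered = [], []
    for pid, pt in points.items():
        if any(is_covered(pt, poly, buffer_m) for poly in footprints):
            covered.append(pid)
        else:
            uncovered.append(pid)
    return covered, uncovered


# Invented example: one rectangular footprint and three listing points.
footprints = [[(0, 0), (20, 0), (20, 10), (0, 10)]]
points = {"listing_a": (10, 5), "listing_b": (22, 5), "listing_c": (100, 100)}
covered, uncovered = classify_points(points, footprints, buffer_m=5.0)
```

For city-wide data, a GIS library with a spatial index would be the practical choice; the brute-force loop here only serves to make the logic explicit.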

Fig. 3

Workflow (left) and spatial operations (right) of matching extracted data of buildings with a spatial database (OSM). The yellow squares in the left represent the data used in the method and implementation

Fig. 4

Workflow (left) and spatial operations (right) of matching extracted data of amenities with a spatial database (OSM). The yellow squares in the left represent the data used in the method and implementation

Table 2 Functions of the developed approach

The features that cannot be matched may indicate those that are missing from the spatial database, and thus, the process doubles as a method to detect unmapped instances, because they are overlooked (omission), have been demolished in the meantime, or because they have not yet been constructed. While from the real estate data we are not able to extract building footprint polygons (the most common geometric form of building information), the locations of buildings (as points) are sufficient for this purpose. Further, they are also sufficient for those databases in which buildings are mapped as points.

For both covered and uncovered building points, the same set of building information can be extracted. Table 3 highlights the building attributes that can be extracted from the three reference datasets based on our exploration. While our work is focusing on a particular study area, we assert that it is generalisable because similar services and records in other countries contain comparable information.

Table 3 Building attributes extracted from real estate reference data

There are two kinds of methods of attribute extraction for different building information. First, attributes including building address, lease commence date (or year of construction and completion year), tenure, and building type can be extracted directly; for these, only the latest record/listing in the datasets is necessary. Second, the remaining two building attributes that can be extracted from the real estate datasets are approximate building levels and statistics of floor area. They are extracted indirectly and require multiple records/listings in a building, as the accuracy of extraction converges towards the true value with more records. Figure 5 illustrates the approach of estimating the number of floors of a building. Among all of the records/listings of transacted units in a building, the one with the highest level corresponds to the approximate building levels. As more records/listings are created continuously, the approximate building levels will be updated if a unit of higher level appears over time, ultimately capturing a property at the highest floor, or otherwise one that is close to the highest floor, at least resulting in an approximate building form (i.e. distinguishing between mid-rise and high-rise buildings). The method of extracting statistics of floor area applies a similar idea. Mean, maximum and minimum floor area of units in the building can be established from all of the records/listings of units in the building.
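The indirect extraction can be sketched as a simple aggregation per building. The snippet below is a minimal sketch, assuming each record carries a numeric `storey`, a `floor_area_sqm` and a `month` string in `YYYY-MM` form; these field names and the sample values are invented for illustration and do not reflect the schema of the actual datasets.

```python
from collections import defaultdict
from statistics import mean


def summarise_buildings(records):
    """Aggregate unit-level records into per-building estimates."""
    by_building = defaultdict(list)
    for rec in records:
        by_building[rec["building"]].append(rec)

    summary = {}
    for building, recs in by_building.items():
        areas = [r["floor_area_sqm"] for r in recs]
        summary[building] = {
            # Highest storey seen so far approximates the building levels.
            "approx_levels": max(r["storey"] for r in recs),
            "area_mean": mean(areas),
            "area_min": min(areas),
            "area_max": max(areas),
            "n_records": len(recs),
        }
    return summary


def running_estimate(records):
    """Chronological (month, best level estimate so far) pairs for one building."""
    best = 0
    history = []
    for rec in sorted(records, key=lambda r: r["month"]):
        best = max(best, rec["storey"])
        history.append((rec["month"], best))
    return history


# Invented records for a single hypothetical building.
records = [
    {"building": "A", "month": "2017-01", "storey": 4, "floor_area_sqm": 70},
    {"building": "A", "month": "2017-03", "storey": 12, "floor_area_sqm": 90},
    {"building": "A", "month": "2017-06", "storey": 9, "floor_area_sqm": 110},
    {"building": "A", "month": "2017-11", "storey": 25, "floor_area_sqm": 90},
]

summary = summarise_buildings(records)
history = running_estimate(records)
```

The running estimate never decreases, which mirrors the convergence behaviour described above: each new record can only confirm or raise the estimated number of levels.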

Fig. 5

Indirect method of estimating building levels. This example is based on the transaction data of a high-rise residential building (Blk 702 West Coast Rd, Singapore). It shows the maximum unit levels that appear in the transaction data in each month in a five-year period and their convergence to the true value over time. The approximate level of this building is the highest unit level that appears in the transaction records. In this example, as it is the case in many other instances, the method identified the building level correctly, and about one year of transactions is usually enough to obtain accurate information on the vertical extent of the building, which provides indispensable value to some use cases

Moving on to amenities, real estate data can be used to check the correctness and completeness of them in OSM. However, the unmapped amenities cannot be added to the database, as only their approximate location can be determined. When viewed as part of buildings, the presence of amenities, on the other hand, could be added as an attribute of buildings.

To identify amenities from real estate data, both text and image data from listings are used (Fig. 2). The texts of the listings include labels of amenities that the buildings have in their surroundings. Besides the amenity information in texts, nearly all ads also contain photos of amenities associated with the buildings. After the photos are classified, each image has a list of description labels. The image labels are then attached to the corresponding listings, which contain the geographic coordinates of the buildings.

Next, to compare these amenity labels with the existing data of OSM, both the building dataset and the amenity dataset of OSM are used. The rationale of the approach is that if an amenity appears in the photo of an ad of an apartment of a building, it must be very close to the building, if not part of it. Thus, while we will not be able to map its exact location, we are able to use it for quality assessment purposes, i.e. verify whether there is such an amenity in the immediate vicinity of the mapped building. The process is straightforward: it consists of buffering the corresponding building and checking whether the same amenity is inside the buffer. If not, the amenity can be flagged as unmapped.
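A minimal sketch of this check is given below. It simplifies the described procedure by representing each building as a point and using a great-circle distance instead of buffering the footprint polygon; the 100 m radius, the amenity type names, and the coordinates are hypothetical values chosen for the example.

```python
import math


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two WGS84 points."""
    radius = 6371000.0  # mean Earth radius in metres
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * radius * math.asin(math.sqrt(a))


def flag_unmapped(detected, osm_amenities, buffer_m=100.0):
    """Return detected amenities with no same-type OSM amenity within buffer_m."""
    flagged = []
    for item in detected:
        nearby = any(
            a["type"] == item["type"]
            and haversine_m(item["lat"], item["lon"], a["lat"], a["lon"]) <= buffer_m
            for a in osm_amenities
        )
        if not nearby:
            flagged.append(item)
    return flagged


# Amenities detected in listings, located at the (invented) building position.
detected = [
    {"building": "B1", "type": "swimming_pool", "lat": 1.3000, "lon": 103.8000},
    {"building": "B1", "type": "fitness_centre", "lat": 1.3000, "lon": 103.8000},
]
# Amenities already mapped in the spatial database (invented positions).
osm_amenities = [
    {"type": "swimming_pool", "lat": 1.3003, "lon": 103.8002},   # roughly 40 m away
    {"type": "fitness_centre", "lat": 1.3100, "lon": 103.8000},  # roughly 1.1 km away
]

unmapped = flag_unmapped(detected, osm_amenities, buffer_m=100.0)
```

In this example the swimming pool is confirmed by a nearby mapped feature, while the fitness centre is flagged because the only mapped one lies far outside the buffer.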

3.4 Validation

To evaluate the performance of the described mechanism and the feasibility of the idea of using real estate data for such a purpose, samples of the results of building and amenity database updating are manually checked by comparing them with the ground truth (mostly satellite and street view imagery). For building database updating, two perspectives are checked: (i) whether the building points from real estate data uncovered by OSM building footprints can identify the locations of unmapped buildings; and (ii) whether the building points from real estate data covered by OSM building footprints can detect the corresponding buildings in OSM. For each category in the three reference datasets, at least 50 samples are selected (e.g. 50 uncovered extracted points from sale ads are checked). In total, 200, 200, and 300 samples are selected from the three datasets, respectively (Tables 11, 12 and 13). For amenities, the accuracy of the approximate locations of unmapped swimming pools and fitness centres is checked. The sample size is 50 for each kind of amenity. The samples are randomly selected from all over the city.

4 Results

4.1 Extraction of building information

4.1.1 Matching between real estate data and OSM data

Table 4 outlines the result of geometry matching between the three real estate datasets and OSM building data. The total number of points in the table indicates the number of ads, followed by the number of buildings detected in the processed datasets. The table also includes the number of matched and unmatched buildings, the latter suggesting potentially unmapped buildings or those that are yet to be constructed, and thus are missing in OSM. A majority of buildings could be matched to counterparts in OSM, likely due to the high completeness of OSM in the study area. Nevertheless, these results also indicate how the method may perform in areas that are not mapped well in the considered spatial database such as OSM, as we determined that about one tenth of the building stock could be inferred from real estate data, a value that is useful for any scenario discussed in the paper.

Table 4 The results of matching buildings extracted from real estate data with those in OpenStreetMap

4.1.2 Building information extracted from real estate advertisements

For both rent and sale listings, four building attributes can be extracted, and a sample of the extraction is given in Table 5 as an example. The method proves to be useful to borrow such attributes and enrich existing building records.

Table 5 Samples of building information extracted from rent listings, which have been matched to counterparts in OSM, and can be used to update the existing record with previously unavailable attributes

To put the results in perspective, and to underline the idea of periodically or continuously scraping real estate websites to mine data on buildings that can be used to enrich spatial databases, we focus only on the last month of scraped data to understand how much information can be extracted from solely one month of listings. The results in Table 6 suggest that listings posted in a span of one month can update the attributes of 8115 buildings (around 7% of the city’s building stock) and identify 3806 uncovered buildings (around 3.3% of the city’s building stock). However, it should be noted that listings do not always contain all these attributes, which we indicate in the same table. The unmatched buildings will be discussed in later sections.

Table 6 Summary of extracted information from rent and sale listings in a period of one month

4.1.3 Building information extracted from public housing transaction data

There are nine building attributes that can be extracted from HDB transaction data (Table 7). The data have a high degree of matching with OSM, and the extracted attributes do not suffer from any missing values, which is unsurprising given that the transactions are sourced from the government. Table 7 indicates that HDB transactions that occurred in a single month can update the attributes of 1535 buildings and identify 101 unmapped buildings.

Table 7 Summary of extracted information from HDB transactions in a period of one month. For all records, all attributes are extracted

4.1.4 Building information extracted from transaction data from private/hybrid properties

For hybrid/private housing transaction data, ten attributes can be extracted (Table 8). Building name, completion year, lease commence date, and approximate level are the four attributes with the most missing values; nevertheless, the majority of transactions contain such information. Interpreting the results, private/hybrid housing transactions generated in one month can update the attributes of 721 buildings (around 0.6% of the building stock) and identify 450 unmapped buildings. The results for commercial and industrial buildings are congruent, and are not included here for space considerations.

Table 8 Summary of extracted information from private/hybrid housing transaction data in a period of one month

4.2 Extraction of amenity information

Fitness centres and swimming pools are extracted from both text descriptions and photos in the listings: 3051 unique fitness centres and 2176 unique swimming pools are extracted from texts, and 99 unique fitness centres and 1452 unique swimming pools are extracted from photos. After removing the duplicates between amenities from texts and photos, 3096 fitness centres and 2947 swimming pools remain.
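The de-duplication between the two sources can be illustrated with a minimal sketch. It assumes that each detected amenity is identified by a key such as a property identifier; the key and the function name `merge_amenities` are hypothetical and are not taken from the implementation used in this study.

```python
def merge_amenities(from_text, from_photos):
    """Return the de-duplicated set of amenity keys and the keys found in both sources."""
    text_keys, photo_keys = set(from_text), set(from_photos)
    merged = text_keys | photo_keys    # every amenity seen in at least one source
    overlap = text_keys & photo_keys   # amenities seen in both text and photos
    return merged, overlap


# Hypothetical property identifiers, for illustration only
text_hits = ["condo_a", "condo_b", "condo_c"]
photo_hits = ["condo_c", "condo_d"]
merged, overlap = merge_amenities(text_hits, photo_hits)
```

If the de-duplication is indeed a plain union of this kind, the counts reported above imply that 3051 + 99 - 3096 = 54 fitness centres and 2176 + 1452 - 2947 = 681 swimming pools were found in both text and photos.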

After extracting the amenities from listing data, we converted the listings with amenities into spatial points. To compare them with the amenity data of OSM, the nearest building in OSM to each spatial point is identified (Fig. 4). This is because the location of each spatial point indicates the location of the building near the amenity rather than that of the amenity itself. Finding the nearest buildings in OSM thus identifies the buildings that correspond to the amenities extracted from ads. We assumed that the locations of the nearest buildings are the approximate locations of the extracted amenities.

In some cases, the amenities are fairly far (over 100 metres) from their nearest buildings, which is uncommon. This might be because the building locations in the listings are inaccurate or because the buildings in the listings have not been mapped in OSM yet. To avoid the influence of these cases and identify the corresponding buildings, amenities whose distance to the nearest building exceeds a certain threshold are removed. To decide on the threshold, results with various distances have been checked. When the distance is shorter than 25 metres, in most cases the nearest building is the one to which the amenity pertains, whereas at distances of around 30 or 40 metres, the nearest building is usually not the one to which the amenity is attached. Moreover, according to the results, 85% of fitness centres and 88% of swimming pools are closer than 25 metres to their nearest buildings. Hence, to ensure the accuracy and completeness of the results, amenities farther than 25 metres from the nearest building are removed.
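The thresholding step can be sketched as follows. This is a simplified illustration and not the implementation used in the study: it assumes projected coordinates in metres and represents each OSM building by a single point (e.g. a centroid) rather than by its footprint polygon. The function names and sample coordinates are hypothetical.

```python
import math


def nearest_building(point, buildings):
    """Return (building_id, distance in metres) of the closest building point."""
    best_id, best_xy = min(buildings.items(), key=lambda kv: math.dist(point, kv[1]))
    return best_id, math.dist(point, best_xy)


def filter_by_threshold(amenities, buildings, max_dist_m=25.0):
    """Keep amenities whose nearest building is within max_dist_m; map each to that building."""
    kept = {}
    for amenity_id, xy in amenities.items():
        building_id, distance = nearest_building(xy, buildings)
        if distance <= max_dist_m:
            kept[amenity_id] = building_id
    return kept


# Hypothetical projected coordinates in metres
buildings = {"b1": (0.0, 0.0), "b2": (100.0, 0.0)}
amenities = {"gym1": (10.0, 0.0), "gym2": (50.0, 0.0), "pool1": (100.0, 20.0)}
kept = filter_by_threshold(amenities, buildings)
```

Here `gym2` is 50 metres from both buildings and is therefore discarded, while `gym1` and `pool1` are assigned to `b1` and `b2`, respectively.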

After processing, 2623 fitness centres and 2600 swimming pools extracted from listing data remain. They correspond to 2512 and 2272 unique buildings, respectively. These numbers are lower than the numbers of amenities, which means that some amenities extracted from listings are close to each other and share the same nearest building.

After detecting the approximate locations of extracted amenities in OSM, buffering and spatial intersection are applied to identify whether the extracted amenities are already mapped in OSM (Fig. 4). To achieve a higher accuracy of the comparison, the buffer distance should be long enough to cover the amenities attached to the buildings but not so long that it covers other amenities. Hence, quantiles of two distances are calculated: between amenities and their nearest neighbouring amenities, and between amenities and their nearest buildings in OSM. In this study, 20 metres is selected as the buffer distance because, for both fitness centres and swimming pools, more than 75% of amenities are farther than 20 metres from their nearest neighbouring amenity, and fewer than 25% of amenities are farther than 20 metres from their nearest building.
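The selection criterion for the buffer distance can be expressed as a small check. The sketch below is illustrative only: the distance values are invented, and the function names `share_above` and `buffer_is_suitable` are hypothetical. It computes the shares directly rather than reading them off precomputed quantiles, which is equivalent for the purpose of testing a candidate distance.

```python
def share_above(distances, threshold):
    """Fraction of distances that are strictly larger than the threshold."""
    return sum(d > threshold for d in distances) / len(distances)


def buffer_is_suitable(amenity_to_amenity, amenity_to_building, buffer_m):
    """A buffer is suitable if it rarely reaches a neighbouring amenity
    (more than 75% of nearest-amenity distances exceed it) and usually
    reaches the attached building (fewer than 25% of nearest-building
    distances exceed it)."""
    return (share_above(amenity_to_amenity, buffer_m) > 0.75
            and share_above(amenity_to_building, buffer_m) < 0.25)


# Invented distances in metres, for illustration only
to_nearest_amenity = [15, 40, 60, 80, 120, 200, 35, 90]
to_nearest_building = [2, 5, 8, 12, 15, 18, 19, 30]

suitable_20 = buffer_is_suitable(to_nearest_amenity, to_nearest_building, 20)
suitable_50 = buffer_is_suitable(to_nearest_amenity, to_nearest_building, 50)
```

With these sample values, a 20 metre buffer satisfies both conditions, whereas a 50 metre buffer would reach too many neighbouring amenities.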

After identifying the buffer distances, intersections are applied between the buffers of the buildings with amenities and the existing amenities of OSM. Table 9 shows that 99% of the fitness centres in the approximate locations are not mapped in OSM yet, while 19% of the swimming pools are already mapped in OSM.
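A minimal sketch of the buffer and intersection test is given below. It makes the simplifying assumption that both buildings and OSM amenities are points in projected coordinates (metres), so that the buffer reduces to a circle around the building point; the study itself buffers building footprints. The function name `split_by_osm_coverage` and the sample data are hypothetical.

```python
import math


def split_by_osm_coverage(buildings_with_amenity, osm_amenities, buffer_m=20.0):
    """Split buildings into those with and without an OSM amenity inside their buffer."""
    mapped, unmapped = [], []
    for building_id, xy in buildings_with_amenity.items():
        # The buffer "intersects" an OSM amenity if any amenity lies within buffer_m
        hit = any(math.dist(xy, amenity_xy) <= buffer_m for amenity_xy in osm_amenities)
        (mapped if hit else unmapped).append(building_id)
    return mapped, unmapped


# Hypothetical projected coordinates in metres
buildings_with_pool = {"b1": (0.0, 0.0), "b2": (100.0, 0.0), "b3": (300.0, 50.0)}
osm_pools = [(12.0, 9.0), (400.0, 400.0)]
mapped, unmapped = split_by_osm_coverage(buildings_with_pool, osm_pools)
```

Buildings in the `unmapped` list are the candidates for omissions in the database, as summarised in Table 9.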

Table 9 Comparison between extracted amenities and existing amenities data of OSM

While this method cannot detect the exact coordinates and shapes, the results suggest that it can signal omission issues in the database with a high degree of reliability and approximate expected location of the omitted feature (Fig. 6).

Fig. 6 Approximate locations of unmapped fitness centres detected thanks to our method. Basemap: OpenStreetMap contributors

4.3 Validation of the results

4.3.1 Potential of updating and validating a spatial database of buildings

Table 10 illustrates a few samples of the validation. The table also exposes some issues in real estate data, which affect the performance of the method. Tables 11, 12 and 13 outline the performance of the method for the validation set.

Table 10 Examples of matching the extracted real estate data for updating and/or validating a building database. The OSM data we have used is denoted in purple, while the basemap is also from OSM, but from a few months later with some unmapped buildings added in the meantime

In general, the method seems to be successful in using real estate data to enrich existing building data with previously unavailable attributes, but it is less so in detecting unmapped buildings. The validation set suggests that most building points from real estate data that are not covered by building polygons in OSM are unmatched because of inaccurate building locations in the real estate data (case exhibited in Table 10(a)). For uncovered points of residential buildings, no more than 15% of them detect unmapped buildings in OSM (Table 10(b)), but this low figure does not necessarily indicate limited performance of the method, as it may rather reflect that there are simply not many buildings left unmapped in the study area. For commercial and industrial buildings, the percentages are slightly higher (18% and 34%). This difference might be because there are more unmapped commercial and industrial buildings than residential buildings in OSM in the study area. It is worth noting that, while the method is in some instances able to detect buildings that are missing from the targeted spatial database, some buildings from older transactions can represent demolished buildings rather than those that are unmapped or yet to be constructed (Table 10(c)), which may be either an advantage or a disadvantage, depending on the perspective: on the one hand, it may be possible to reconstruct historical data, while on the other hand, this may be undesirable if only the current or future situations are sought. Whether a building is unmapped because it was demolished or because it is unbuilt could be distinguished by checking the year of completion of the building, which is usually available in both the ads and transactions.
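The completion-year check suggested above can be sketched as a simple rule. This is an illustrative assumption about how such a check might look, not part of the implemented method; the function name, the labels, and the `reference_year` parameter are hypothetical. Note that the year alone only separates future buildings from completed ones: a completed but unmatched building may be either demolished or simply unmapped.

```python
def classify_unmatched(completion_year, reference_year):
    """Label an unmatched building using its completion year, if available."""
    if completion_year is None:
        return "unknown"  # the ad or transaction does not state a completion year
    if completion_year > reference_year:
        return "not yet built"
    return "completed (possibly demolished or unmapped)"


# Hypothetical records: (building name, completion year)
records = [("Tower A", 2026), ("Block 12", 1975), ("Villa X", None)]
labels = {name: classify_unmatched(year, reference_year=2022) for name, year in records}
```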

These results should also be placed in the context of the very high building completeness of OSM data in Singapore. Applying the method in areas of partial and heterogeneous completeness may result in detecting many more unmapped buildings. Thus, the high performance of matching buildings in the real estate data with their counterparts in OSM (or any other spatial database), can be interpreted also as the method having high potential of detecting missing buildings.

For building points covered by building polygons in OSM, the accuracy rate of detecting the corresponding buildings in OSM (Table 10(d)) is fairly high. Authoritative real estate data including transactions can detect over 94% of OSM buildings correctly (Tables 12 and 13), and for these buildings a bridge can be established to transfer the semantic information from one dataset to the other. The building locations in real estate advertisements are less reliable but can still achieve accuracy rates of 86% and 88% for rent and sale listings, respectively (Table 11), suggesting a high potential as a rapid acquisition method for areas lacking completeness.

Table 11 Potential of updating and validating a spatial database of buildings: using real estate advertisements (listings)

Among the real estate records that can be correctly associated with buildings in OSM, some points are covered by multiple building polygons. This is because these building footprints are clustered together and their buffers overlap. Tables 12 and 13 indicate that public housing and industrial buildings have much less overlap than private/hybrid housing and commercial buildings, which is caused by the urban form of our study area: some properties (e.g. terraced and semi-detached houses) and shophouses are close to each other.

Table 12 Potential of updating and validating a spatial database of buildings: using public housing transactions
Table 13 Potential of updating and validating a spatial database of buildings: using private property transactions

4.3.2 Potential of validating a spatial database of amenities

The performance of the method for collecting data on amenities is outlined in Table 14. The method proves to be highly effective: among the samples of unmapped swimming pools and fitness centres, 90% and 76%, respectively, are detected correctly. The inaccuracies of the mechanism have two causes. One is that the approximate locations of the amenities are correct, but the amenities are already mapped. The other is that the mechanism does not detect the approximate location accurately, and there is no amenity at that location. Most of the inaccuracies pertain to the second situation, which is caused by errors in the locations in real estate data.

Table 14 Result validation of amenity database updating

In conclusion, the strongest point of the method is the enrichment of spatial databases with descriptive information of buildings, highly valuable for a variety of use cases, and the detection of unmapped amenities associated with buildings. In theory, the method holds value for detecting unmapped buildings, but its performance in practice is burdened by imperfect real estate data and it depends on the level of completeness of existing data.

5 Discussion

5.1 Observations, limitations, and opportunities for further investigations

The results uncover the high potential of real estate data to serve spatial data infrastructures and affirm the role of such data as a latent form of implicit VGI. In a relatively simple way, real estate data can fill gaps in data that are otherwise not easily obtainable, and alleviate the challenging problem of acquiring spatial data on fine and dynamic urban features. Considering that such an approach has been employed for the first time, there are limitations and research opportunities.

In developing the method, we have considered its applicability in other study areas. Real estate markets are highly digital in many parts of the world, and datasets such as listings and transactions are quite similar in content and structure, so for the most part, the method can be applied elsewhere with few modifications. Nevertheless, a limitation of the work is that a specific study area was investigated, which may have its own particularities, and the results will inevitably differ elsewhere. In the next section, we will discuss applications and scalability of the method in other study areas. Another limitation of the work is related to transactions: they suffer from survivorship bias, as transaction data contain only the apartments that have been sold successfully, not all that are on the market.

Next, in our work, we use a mix of authoritative data (property transactions) and user-generated data (real estate ads). The first one is used routinely in assessing the quality of VGI. The second one, however, is not, and essentially, in this paper, we are using one form of VGI to check another one. Further, this instance of VGI can also be used to check authoritative datasets, another uncommon exercise, but a potentially viable novelty and contribution (e.g. using listings to detect whether the function of a building remains the same as recorded in the cadastral registry). Such approaches would be interesting to consider for further investigations.

Regarding amenities, we considered a small set (gyms and swimming pools) that is specific to the study area and has limited coverage (only those in the immediate vicinity of the property being sold could have been captured), and we could only use them for quality assessment rather than for mapping them (since their exact location is not known). In this work, the listings predominantly advertise residential real estate, so we focused on amenities that pertain to such properties. However, these amenities are not common in many other countries, and there are many more amenities that are relevant in GIS and urban studies that could be investigated in the future, e.g. parking lots. Further, some amenities may be more characteristic of commercial properties. For instance, an ad for commercial space may include a photo of a restaurant in the same building as an added value of the property. Next, we noticed that some ads feature amenities in the wider catchment area of the building, e.g. shopping malls and public transportation in the neighbourhood. While detecting these is not a challenge, because of the much larger radius that is entailed, it may be meaningless to include them in this work.

In terms of amenities, another limitation is that many of these have restricted access, so the benefit of mapping them depends on the downstream application. While a substantial portion of Singapore’s housing landscape is public, with buildings and their surroundings accessible without restrictions, it is mostly the private properties (e.g. Fig. 1) that include the amenities we have detected in this work. The difference in importance between a gym in a condominium and a private gym available to the public for a fee depends on the mapping and application context. In our case, we noticed that both tend to be mapped in OSM, but it can be argued that the relevance of the former may not be at the same level as that of the latter, chiefly due to public access. At the same time, it might be very useful to map features such as automated external defibrillator units in private areas despite their restricted access. Therefore, this is not necessarily a limitation, but it may certainly result in an imbalance in mapping, which, on the other hand, also reflects the state of OSM. However, there are other types of amenities that may be possible to detect, such as childcare and parks, which are public but integrated in residential estates, and to which this limitation does not apply. Nevertheless, amenities are the secondary purpose of this work, and this limitation does not affect the main goal: extracting data on buildings.

As another direction for future work, we suggest developing a measure of ‘confidence’ of the results that would encapsulate the accuracy of the derived data. For example, the reliability of the method in deriving the number of building levels (Fig. 5) depends largely on the number of transactions and on time, aspects that may be taken into account by such a metric.

5.2 Applications elsewhere

5.2.1 Influence of the urban form

The mechanism in this study is built based on the buildings and amenities in Singapore, which is a city that has experienced intense urban development and has high building density (Fig. 7). As much as possible, we investigated this idea in general with scalability in mind, but inevitably, the results are tied to the study area and the performance of the method elsewhere remains to be investigated.

Fig. 7 Variable urban form in Singapore. The performance of this method is driven by urban morphology. The photos are courtesy of Unsplash contributors

Many landed properties (private housing, terraced houses) and shophouses (historic commercial buildings) in Singapore stand shoulder to shoulder, so minor building location inaccuracies in real estate data can cause errors in identifying corresponding buildings in OSM, and overlapping buffers may cause the derived information to be mismatched. For high-rise buildings, which are usually well separated from each other, the mechanism is more effective in both detecting unmapped buildings and finding corresponding existing buildings in OSM (Tables 11, 12 and 13). Hence, the approach might be more reliable in regions with lower building density and those that have detached real estate, but at the same time the high-rise urban form ensures that there are many data points for a single building, maximising the data acquisition. A convenient particularity of our study area is that for most buildings, at any moment, there is at least one apartment being advertised for rent or sale, which is sufficient for our method to work and collect data on the entire building.

Another influence of the urban form on the mechanism concerns the estimation of approximate building levels, which are calculated as the highest level among the units of a building appearing in all of the records/listings. Urban areas consisting mainly of low-rise houses rather than high-rises might not be well suited to this method because there might not be enough records/listings per building to estimate its total number of levels. At the same time, a small number of listings in a building is precisely something that could indicate its size as well, i.e. it can also serve as a proxy for its height.
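The level estimation described above amounts to taking, per building, the maximum unit level across all records. A minimal sketch follows, assuming each record is a pair of a building identifier and the level of the advertised or transacted unit; the identifiers and the function name `estimate_levels` are hypothetical. The number of records per building is returned alongside the estimate because the estimate is a lower bound whose reliability grows with the number of records.

```python
def estimate_levels(records):
    """Return {building_id: (highest observed unit level, number of records)}."""
    highest = {}
    counts = {}
    for building_id, unit_level in records:
        highest[building_id] = max(highest.get(building_id, 0), unit_level)
        counts[building_id] = counts.get(building_id, 0) + 1
    return {b: (highest[b], counts[b]) for b in highest}


# Hypothetical records: (building identifier, level of the listed unit)
records = [("blk_1", 3), ("blk_1", 12), ("blk_1", 7), ("blk_2", 2)]
levels = estimate_levels(records)
```

In this example, `blk_1` is estimated to have at least 12 levels based on three records, while the single record for `blk_2` gives only weak evidence about its height.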

5.2.2 Reference data

This study used multiple sets of real estate data as reference data: advertisements and property sale transactions from both commercial and authoritative sources. In many countries, there are property websites providing abundant rent or sale listings containing the same building information, for example, Redfin in the US, Rightmove in the UK, Funda in the Netherlands, and Beike in China, ensuring replicability elsewhere. However, the authoritative transaction data might not be available in as many countries. In our analysis, we have considered each source of data in isolation to give an understanding of the results when only one of them is available.

5.3 Legal matters

Web scraping is an important step to collect the listings and transform them into data useful for our approach. While much of the method relies on data such as transactions, web-scraped listings add much value because they contain additional data, such as on amenities, and may indicate buildings that will be constructed in the future, something rarely available in conventional spatial data sources. However, the legality of web scraping is still in a ‘grey area’ (Krotov et al. 2020), despite its abundant use in academia and elsewhere. It might constitute copyright infringement or a breach of the website’s terms of use, but it is also a question whether an online marketplace actually owns copyright on ads posted by its users (Footnote 1). In the context of this study, the Copyright Act 2021 of Singapore permits copying copyrighted work specifically for the purpose of computational data analysis (Footnote 2), and it appears that the online marketplace does not prohibit web scraping for non-commercial use, which may be the case in many other countries.

Another legal concern is that the data added to certain spatial resources such as OSM should not come from any copyrighted data sources. Hence, only real estate data sources with licenses compatible with the targeted database (i.e. OSM) can be used for the mechanism. It remains to be investigated whether these issues would affect use cases for research purposes with data that is not distributed. Nevertheless, in the context of implicit VGI, these aspects of real estate data are not necessarily much different from other VGI instances such as Flickr, which are well established in the community.

6 Conclusion

Real estate data sources provide a wealth of information that has been underused outside its primary purposes in the real estate domain, and whose value may be augmented by extending its application to other domains. In this paper, we bring attention to the spatial aspect of various forms of real estate data, and we put forward a new idea: exploiting real estate data as an unexplored and underused form of crowdsourced data and volunteered geographic information, which, among other applications, can help to provide a rapid and efficient method to keep spatial databases updated and correct. We provided a proof of concept: we have demonstrated that data obtained from real estate ads and property transactions may be a new source for collecting building data for the geospatial domain. The method is simple and powerful, and we demonstrate its feasibility by performing an experiment in Singapore. The results suggest that we are able to retrieve several sets of building information for a large number of buildings in the country, many of which have not been available in OSM in the study area (and are rarely available elsewhere), and are key ingredients in a breadth of spatial analyses that require semantic data on buildings, such as in planning, and in energy and microclimate simulations. By considering real estate data as reference (ground truth) data, the method doubles as an instrument to carry out data quality assessment studies, and our paper essentially contributes to the field by introducing a new method for spatial data quality assurance.
While this method is unlikely to be able to provide information on all buildings in a study area (some buildings will take time to be advertised or transacted, if ever), its heterogeneity is in line with other forms of VGI such as OpenStreetMap and Mapillary (Juhász & Hochmair 2016; Quinn & León 2019), and the number of buildings that are enriched with this method may be considered a significant advancement given that this data source has been overlooked hitherto. In fact, our implementation suggests that just one month of listings uncovers information on a large portion of a city’s building stock, which is ahead of some other acquisition techniques. New buildings, which are yet to be constructed, can also be detected, but much of the method is burdened by errors in the listings. Future buildings can also be identified from property transactions by extracting those with a completion year in the future, and the same goes for historical data on demolished ones, possibly allowing analyses of the evolution of the urban form.

In the work, we have taken advantage of amenities that are being advertised as part of real estate. The mechanism is reliable in identifying the approximate locations of amenities missing in databases, providing both a mechanism to assess completeness and one that signals omissions in the database to mappers and other parties managing a spatial source.

As this work presents a new channel to collect spatial data, and it argues that real estate may be a form of user-generated geospatial data that was previously not considered in the field, it provides prospects for multiple directions for future work. For example, it would be beneficial to investigate whether we could use photographs provided in listings for further applications, such as 3D model reconstruction and to infer footprints. While points extracted from real estate data have been sufficient to serve the purpose of this research, reconstructing footprints may bring the research forward. Next, regarding implementation, it would be useful to provide a system that would continuously scrape ads and download property transactions for always-on sensing of building information and translating it to a building database.