Introduction

More than 55% of the world’s population now live in cities (Vilar-Compte et al., 2021). Residential location choices (RLCs) have significant implications for the sustainable development of cities. From a city perspective, existing research shows that people’s choices of residence can have a significant impact on the local economy (Li et al., 2013), spatial structure (De Vos et al., 2018; Næss et al., 2019), the environment (Engebretsen et al., 2018; Huu Phe and Wakely, 2000), the urban transport system (Taniguchi et al., 2014) and epidemic prevention and control (Liu and Tang, 2021). From an individual perspective, residential satisfaction can contribute significantly to overall life satisfaction (Campbell et al., 1976). Residential location modelling is therefore considered to be the core of one of the grand challenges of contemporary social science (Pagliara et al., 2010).

Numerous studies have been conducted to develop universal models of RLC in light of the importance of residential location. In general, existing RLC models can be categorised into two types. One type of research takes RLC models into account as an integral part of urban complex models (Albeverio et al., 2007; Baynes, 2009; Tonne et al., 2021), including MUSSA II, RELU-TRAN (Anas and Liu, 2007) and UrbanSim (Waddell, 2002). Models such as these are based on the interaction between the land market, labour market, the distribution of industry, and transportation to analyse RLCs (Ahlfeldt et al., 2015). Enabled by the use of massive data from multi-dimensions, they focus more on the interdependencies between sub-modules, rather than on the nature of location choices. The second type of RLC model explores factors that affect RLCs. Using the Multinomial Logit Model, they have examined the impact of individual characteristics and location characteristics on RLCs, such as age, gender, the number of family members and accessibility to infrastructure (Baum-Snow, 2007; Buzar et al., 2007; Campbell et al., 1976; Chen et al., 2016; Delgado and Bonnel, 2016; Garcia-López, 2012; Lee et al., 2010; Levinson, 2008; Melia et al., 2018; Portnov et al., 2011). Nevertheless, households and spaces can be characterised by a variety of dimensions, leading to unmanageable arrays or model specifications that are difficult to assemble for effective calibration (Pagliara et al., 2010).

Travel behaviours significantly impact RLCs (De Vos and Singleton, 2020), with studies indicating that people prefer to live in neighbourhoods that facilitate satisfying trips (De Vos and Witlox, 2016; Ettema and Nieuwenhuis, 2017). Low levels of travel satisfaction may encourage individuals to move to a different type of neighbourhood that allows for more frequent use of preferred modes of transportation (De Vos and Witlox, 2017). This illustrates that RLCs are shaped not just by amenities but also by personal preferences: for example, a car enthusiast may prefer to live in a suburban neighbourhood, while someone who enjoys walking or cycling may opt for an urban area (De Vos and Singleton, 2020). Therefore, an RLC model that focuses solely on amenities fails to account for the influence of individual preferences on RLC. However, up to now, there has been no RLC model that takes individual travel behaviour into account. We aim to fill this gap by constructing an RLC model from the perspective of travel behaviours. During the modelling process, we rely on the allocation of travel time between home-based travels to build the RLC model, which not only diminishes the need for various types of data but also aids in simplifying the model’s structure.

Big data and related analytics bring new opportunities for understanding RLCs. Human mobility data derived from spatiotemporal mobile phone trajectory data could be helpful to develop the travel-behaviour based RLC model. Mobile phone trajectory data has the advantage of a high sampling rate, large geographic coverage, low collection cost, and accurate information about space and time (Ni et al., 2018). Based on the time budget and a working-resting timeframe, by combining mobile phone data with geocoded location information, we identify residential locations and workplaces through the comparison of stay durations across different times and places (Phithakkitnukoon et al., 2012; Yan et al., 2019; Zhao and Gao, 2023). Other locations where the stay exceeds 30 min are considered non-work sites. We can obtain a comprehensive picture of residents’ home-based travels by analysing travel behaviour originating from or destined for residential locations, encompassing both commuting and non-commuting purposes. In this paper, we analyse trajectory data collected from over 16 million mobile phone users in three consecutive years between 2018 and 2020 in two megacities in China, Beijing and Shenzhen.

Compared to existing research, this paper has the following novelties: (1) We focus on analysing residents’ revealed preferences rather than their stated preferences in RLCs. Revealed preferences are based on real decisions made in real-life situations. Stated preferences, on the other hand, are derived from what individuals say they would do, often in response to hypothetical scenarios, which may not always translate to actual behaviour due to biases or the hypothetical nature of the situation (Fujii and Gärling, 2003). Additionally, the use of mass mobile phone signalling data also reduces the issue of small samples, which is commonly encountered in stated preference studies (Thorhauge et al., 2016; Wan et al., 2021). (2) This study extends existing RLC models by considering individual residential preferences, which are proxied by home-based travel behaviours. We test the validity of the model in multiple ways, including adding control variables, changing the spatial scale of the observation unit, testing for endogeneity, and considering historical RLC. (3) This RLC model can be used not only to analyse the spatial distribution of residential locations at the group level, but also to analyse the RLC at the individual level. As an example of the model’s application, we assess dynamic changes in RLC behaviours and make predictions based on previous travel behaviours.

Analytical framework

We develop the RLC model from home-based travel behaviour; Fig. 1 describes the modelling process. From the population’s perspective, we construct the RLC model according to the gravity model (Batty et al., 1974), and from the viewpoint of individuals, we analyse RLCs based on the assumption of utility maximisation. Ultimately, both approaches yield the same RLC model.

Fig. 1: Analytical framework.

The gravity model and the RLC model

The population-level RLC model we employed is the constrained gravity model, which is essentially grounded on a balance between benefits and costs (Batty, 1983; Batty et al., 1974), as shown in Eq. (1).

$$T_{ij} = O_jP_{ij} = O_j\frac{{m_if\left( {r_{ij}} \right)}}{{\mathop {\sum}\nolimits_k {m_kf\left( {r_{kj}} \right)} }}$$
(1)

In this model, the attraction or benefit (\(m_i\)) of residing in any given location is weighed against the deterrence or cost (\(f( {r_{ij}} )\)) to that location from another, with commuting commonly acknowledged as a form of deterrence (Barbosa et al., 2018; Pagliara et al., 2010). Owing to the constraints of financial budgets, the choice of residential location is inevitably affected by housing prices (DeSalvo and Huq, 1996; Zhuge et al., 2016). However, quantifying a region’s attractiveness is quite challenging. Built environment and demographic characteristics are frequently seen as factors that affect a location’s attractiveness (Bhat and Guo, 2007; Ettema and Nieuwenhuis, 2017; Schirmer et al., 2014), while relocation choice, which is also a form of RLC, is mainly influenced by individual preferences, such as the impact of historical factors and individual habits (Clark and Lisowski, 2017). However, to date, no studies have attempted to include individual preferences in the RLC model. In this study, we use home-based non-commuting (HBNC) time to represent a location’s attractiveness. Firstly, HBNC time is a reflection of residents’ revealed preferences, which can indicate their real needs. Secondly, HBNC travel is often for the consumption of the built environment. The greater the demand for a certain amenity, the greater the weight of travel for this type of amenity in the HBNC time. Therefore, HBNC time includes information about an individual’s preferences. Compared to using amenities as a measure of a location’s attractiveness, using HBNC time as a proxy variable more closely aligns with our understanding because residents may choose a location mainly based on some features of its built environment, rather than all of them. When

$$m_i = e^{\alpha {{{\mathrm{log}}}}\left( {hc_i} \right) \,+\, \gamma HBNC\_time_{ij}}$$
(2)
$$f\left( {r_{ij}} \right) = e^{\beta C\_time_{ij}}$$
(3)

where

$$C\_time_{ij} = \frac{{Time_{ij} + Time_{ji}}}{{N_{ij} + N_{ji}}}$$
(4)
$$HBNC\_time_{is} = \frac{{Time_{is} + Time_{si}}}{{N_{is} + N_{si}}}$$
(5)

where i is a residential location, j is a workplace and s is a non-work site; \(C\_time_{ij}\) is the average commuting time for an individual in a month and \(HBNC\_time_{is}\) is his/her average HBNC time in the same month; \(Time_{ij}\) (\(Time_{ji}\)) is the total travel time from residential location i (workplace j) to workplace j (residential location i); \(N_{ij}\) (\(N_{ji}\)) is the total number of trips from residential location i (workplace j) to workplace j (residential location i); \(Time_{is}\) (\(Time_{si}\)) is the total travel time from residential location i (non-work site s) to non-work site s (residential location i); and \(N_{is}\) (\(N_{si}\)) is the total number of trips from residential location i (non-work site s) to non-work site s (residential location i).

Then, we can get the RLC model:

$$T_{ij} = O_jProb_{ij} = O_j\frac{{e^{\alpha {{{\mathrm{log}}}}\left( {hc_i} \right) \,+ \,\gamma HBNC\_time_{ij}\, +\, \beta C\_time_{ij}}}}{{\mathop {\sum}\nolimits_k {e^{\alpha {{{\mathrm{log}}}}\left( {hc_k} \right) \,+ \,\gamma HBNC\_time_{kj} \,+ \,\beta C\_time_{kj}}} }}$$
(6)

where \(T_{ij}\) is the number of residents who work in location j and live in location i, \(O_j\) is the number of people who work in location j, \(Prob_{ij}\) is the probability of residents choosing to live in location i and work in location j, \(hc_i\) is the housing expenditure of location i, \(HBNC\_time_{ij}\) is home-based non-commuting time, \(C\_time_{ij}\) is commuting time, and α, β and γ are parameters to be estimated. Consistent with the settings of quantitative spatial modelling, we include variables related to time in exponential form, while other variables are included in power-law form (Eaton et al., 2004; Heblich et al., 2020).
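To make the structure of Eq. (6) concrete, the following minimal sketch evaluates it for a single workplace j and three candidate residential tiles. All numbers, including the parameter values, are hypothetical and chosen only for illustration; the function name `rlc_probabilities` is not from the original study.

```python
import math

def rlc_probabilities(hc, hbnc_time, c_time, alpha, gamma, beta):
    """Eq. (6) for one workplace j: probability of each candidate residential tile."""
    scores = [alpha * math.log(h) + gamma * nc + beta * c
              for h, nc, c in zip(hc, hbnc_time, c_time)]
    top = max(scores)  # subtract the maximum before exponentiating, for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical inputs for three candidate tiles (assumptions, not data from the paper)
hc = [60000.0, 45000.0, 30000.0]   # housing expenditure per tile
hbnc_time = [35.0, 30.0, 25.0]     # average HBNC time in minutes
c_time = [20.0, 40.0, 70.0]        # average commuting time in minutes

probs = rlc_probabilities(hc, hbnc_time, c_time, alpha=-0.5, gamma=0.01, beta=-0.03)
O_j = 1000                         # hypothetical number of people working in j
flows = [O_j * p for p in probs]   # T_ij = O_j * Prob_ij
```

Because the denominator sums over all candidate tiles, the probabilities add up to one and the predicted flows add up to \(O_j\), which is the constraint that makes this a constrained gravity model.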

The RLC model attempts to use residents’ travel behaviour and housing costs to explain jobs-housing relationship. The model’s dependent variable is the probability of a residential location being chosen, which means the model tries to figure out the distribution pattern of where the workforce lives in relation to their places of work. In comparison to traditional gravity models, our RLC model is not only simpler in form but also encompasses more information regarding individual preferences.

Utility maximisation and the RLC model

The individual-level RLC model is based on the assumption of utility maximisation (Ahlfeldt et al., 2015; Heblich et al., 2020; Schirmer et al., 2014). We assume that the utility function of a risk-neutral resident o who works in location j and resides in location i is defined by the resident’s travel behaviour, housing expenditure and an idiosyncratic shock, as shown in Eq. (7). As commuting travel is a mandatory form of travel, we include commuting time (\(C\_time_{ij}\)) in the utility function as an iceberg cost. HBNC time (\(HBNC\_time_{ij}\)) has two parts, one that relates to consumption (\(\alpha C_{ij}\)) and the other to travel (\(l_{ij}\)) that itself brings utility. Residents make optimal choices whatever their individual preferences, so a heterogeneity parameter (\(z_{ijo}\)) that follows an extreme value distribution is included.

$$U_{{{{\mathrm{ijo}}}}} = \frac{{z_{{{{\mathrm{ijo}}}}}w_j}}{{e^{\beta C\_time_{ij}}hc_i}}\left( {\frac{{\alpha C_{{{{\mathrm{ij}}}}}}}{\xi }} \right)^\xi \left( {\frac{{l_{ij}}}{{1 - \xi }}} \right)^{1 - \xi }$$
(7)
$$s.t.\,\alpha C_{{{{\mathrm{ij}}}}} + l_{ij} = e^{\mu HBNC\_time_{ij}}$$
(8)
$$F\left( {z_{{{{\mathrm{ijo}}}}}} \right) = e^{ - z_{{{{\mathrm{ijo}}}}}^{ - \alpha }}$$
(9)

where \(w_j\) is the average wage level in location j. When individuals attempt to maximise \(U_{ijo}\), the equilibrium utility is

$$u_{{{{\mathrm{ijo}}}}} = \frac{{z_{{{{\mathrm{ijo}}}}}w_je^{\mu HBNC\_time_{ij}}}}{{e^{\beta C\_time_{ij}}hc_i}}$$
(10)
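For readers who want the intermediate step, the move from Eqs. (7) and (8) to Eq. (10) can be sketched as a standard Cobb–Douglas maximisation. The shorthand \(M_{ij}\) for the right-hand side of Eq. (8) is introduced here for illustration only and is not part of the original notation; the sketch assumes \(0 < \xi < 1\).

$$\max_{\alpha C_{ij},\,l_{ij}}\left( \frac{\alpha C_{ij}}{\xi} \right)^{\xi}\left( \frac{l_{ij}}{1 - \xi} \right)^{1 - \xi}\quad s.t.\quad \alpha C_{ij} + l_{ij} = M_{ij},\quad M_{ij} = e^{\mu HBNC\_time_{ij}}$$
$$\alpha C_{ij}^{*} = \xi M_{ij},\quad l_{ij}^{*} = \left( 1 - \xi \right)M_{ij}\quad \Rightarrow \quad \left( \frac{\xi M_{ij}}{\xi} \right)^{\xi}\left( \frac{\left( 1 - \xi \right)M_{ij}}{1 - \xi} \right)^{1 - \xi} = M_{ij}$$

Substituting this maximised value for the two bracketed terms in Eq. (7) gives the expression in Eq. (10).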

By summing up the individual utilities, we can estimate the probability of choosing a residential location within a city. Hence, the probability that a resident chooses to live in location i and work in location j is

$$\begin{array}{l}Prob_{ij} = \Pr \left[ {{{{\mathrm{u}}}}_{{{{\mathrm{ijo}}}}} \ge \max \left\{ {u_{{{{\mathrm{rso}}}}}} \right\};\forall {{{\mathrm{r}}}},{{{\mathrm{s}}}}} \right]\\ \quad = \,\frac{{\left( {e^{\beta C\_time_{ij}}hc_i} \right)^{ - \alpha }\left( {e^{\mu HBNC\_time_{ij}}w_j} \right)^\alpha }}{{\mathop {\sum}\nolimits_{k = 1} {\left( {e^{\beta C\_time_{kj}}hc_k} \right)^{ - \alpha }\left( {e^{\mu HBNC\_time_{kj}}w_j} \right)^\alpha } }}\\ \propto \,\frac{{e^{\alpha {{{\mathrm{log}}}}\left( {hc_i} \right) \,+\, \gamma HBNC\_time_{ij}\, +\, \beta C\_time_{ij}}}}{{\mathop {\sum}\nolimits_k {e^{\alpha {{{\mathrm{log}}}}\left( {hc_k} \right)\, +\, \gamma HBNC\_time_{kj}\, +\, \beta C\_time_{kj}}} }}\end{array}$$
(11)

\(Prob_{ij}\) in the individual-based RLC model has the same structure as in the population-based gravity model. Although the two models share the same form, their interpretations differ: the former explains patterns in population spatial distribution, while the latter explains patterns in individual residence choices.
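The link between the extreme value shock in Eq. (9) and the share form of Eq. (11) can be checked numerically. The sketch below is a small Monte Carlo experiment under stated assumptions: the deterministic utility components `v` and the shape value are hypothetical, and the shape parameter of Eq. (9) is named `shape` in the code to avoid confusion with the housing coefficient α in Eq. (6).

```python
import math
import random

def frechet_draw(rng, shape):
    """Inverse-CDF draw from F(z) = exp(-z**(-shape)), the form used in Eq. (9)."""
    u = rng.random()
    while u == 0.0:          # guard against log(0)
        u = rng.random()
    return (-math.log(u)) ** (-1.0 / shape)

def closed_form_shares(v, shape):
    """Choice shares v_k**shape / sum_m v_m**shape implied by Frechet shocks."""
    powered = [x ** shape for x in v]
    total = sum(powered)
    return [p / total for p in powered]

def simulated_shares(v, shape, n_draws, seed=0):
    """Share of simulated residents for whom each location gives the highest utility."""
    rng = random.Random(seed)
    counts = [0] * len(v)
    for _ in range(n_draws):
        utilities = [x * frechet_draw(rng, shape) for x in v]
        counts[utilities.index(max(utilities))] += 1
    return [c / n_draws for c in counts]

v = [1.0, 0.8, 0.6]          # hypothetical deterministic utility of three locations
shape = 3.0                  # hypothetical shape parameter
exact = closed_form_shares(v, shape)
approx = simulated_shares(v, shape, n_draws=20000)
```

With enough draws, the simulated shares should approach the closed-form shares, which is the property that lets the individual-level model collapse to the same expression as the gravity model.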

The generalisation and contribution of the RLC model

The generality of the RLC model in this study is reflected in the following two aspects: (1) The construction of the RLC model is based on both population-level method and individual-level method, providing a solid theoretical basis for examining the behaviour of both individuals and groups. (2) RLC analysis based on the gravity model has been applied in the West Midlands Conurbation in central England (Batty et al., 1974), while utility maximisation-based modelling analysis has been applied in London (Heblich et al., 2020) and Berlin (Ahlfeldt et al., 2015). These different applications illustrate the flexibility and effectiveness of the model’s base structure.

Our version of the RLC model adds a new dimension: residents’ preferences. The addition of this information adds greater depth to our understanding of how demographic factors impact where people choose to live. While our version of the RLC model introduces new analytical perspectives, variables and functional forms that differ from existing studies, our aim is to extend the application rather than to challenge previous RLC models.

Data and variables

Study area

We selected two megacities in China as the study area, namely Beijing and Shenzhen. Beijing, the capital of China, maintained a stable population of 21 to 22 million from 2018 to 2020. It is an inland city in northern China. Shenzhen is situated in southern China, next to Hong Kong; its population rose from 16.66 million in 2018 to 17.63 million in 2020 (statistics were drawn from the China Statistical Yearbook). Both cities are economic hubs of their regions and have the highest GDP in their respective urban agglomerations. Based on statistics for 2018–2020, Beijing contributed 42% of the GDP in the Jing-Jin-Ji urban agglomeration, which encompasses 13 cities; Shenzhen contributed over 30% of the GDP in the Pearl River Delta urban agglomeration, which includes 9 cities.

There are also significant differences between Beijing and Shenzhen. First, their geographical structures are different. Beijing is mostly situated on a plain, which allows for easy urban expansion, while Shenzhen’s expansion is restricted by hills and its coastline. According to the layout of residential locations, workplaces and home-based non-workplaces of both cities in Fig. 2, Beijing has a single centre, while Shenzhen shows a polycentric layout. Second, the two cities have different industrial structures. Beijing’s workforce is primarily engaged in IT, business services and finance, which require less industrial space. Shenzhen, on the other hand, has a significant manufacturing workforce (Chandra et al., 2023; Chen and Kenney, 2007). Third, administrative influences differ in the two cities. While Beijing, as the capital, is subject to more top-down government decisions regarding urban planning, Shenzhen, as a special economic zone, has fewer administrative restrictions.

Fig. 2: The distribution of workplaces, residential locations and home-based non-workplaces in Shenzhen and Beijing.

Graphs (a) and (b) depict the distribution of residential locations in Shenzhen and Beijing, respectively. Graphs (c) and (d) depict the distribution of workplaces in Shenzhen and Beijing, respectively. Graphs (e) and (f) depict the distribution of home-based non-workplaces in Shenzhen and Beijing, respectively.

Datasets and data processing

Mobile signalling data

We test the above RLC model with spatiotemporal travel trajectory data extracted from more than 4 million regular mobile phone users in Shenzhen and more than 12 million regular mobile phone users in Beijing (see Table 1). The main data is mobile phone signalling data, with trajectories derived from the time the user communicated with a base station and the coordinates of the base station. We selected samples from November 2018, November 2019 and November 2020, specifically choosing users who appeared on more than 10 days in a month. To reduce the impact of extreme values, commuting times over 180 min and HBNC times over 300 min were excluded. Because the COVID-19 outbreak began in early January 2020, our pre-pandemic months are November 2018 and November 2019, while the post-pandemic period is November 2020, allowing us to test the effectiveness of the RLC model following the pandemic.

Table 1 Distribution of people in categories for analysis.

The individual’s coordinate point position was calculated by the Operator using a multi-base station weighting algorithm. According to the Operator’s processing logic, points with a stay of more than 30 min are considered stay points. Moreover, the workplace is the longest stay point during weekdays from 5 a.m. to 8 p.m., and the residence is the longest stay point from 8 p.m. to 5 a.m. Using these details, along with the start stay point and the end stay point of each trip and their exact times, we calculated the duration of each trip (the travel time), identified the purpose of each trip and counted the number of trips of each type. From this analysis, we obtain the residents’ commuting time and HBNC time (see Table 2). Due to the Operator’s data protection rules, we can only extract the values of the above variables on a grid of square tiles. Notably, only tiles with more than 5 identified residents were considered. To process the data, the study area was divided into square tiles, and we took the monthly average of the commuting time and HBNC time of residents whose residences fall in the same tile.
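The identification rules described above can be sketched for a single user as follows. This is an illustrative reading, not the Operator’s actual algorithm: it assumes that "longest stay point" means the tile with the most accumulated minutes inside the relevant time window, and the record layouts, tile labels and function names are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta

MIN_STAY = timedelta(minutes=30)   # stays must exceed 30 min to count as stay points

def is_work_window(t):
    """Weekdays, 5 a.m. to 8 p.m."""
    return t.weekday() < 5 and 5 <= t.hour < 20

def is_home_window(t):
    """8 p.m. to 5 a.m."""
    return t.hour >= 20 or t.hour < 5

def longest_stay_tile(stays, in_window):
    """Tile with the most minutes of qualifying stays inside the window (assumed rule)."""
    minutes = defaultdict(int)
    for tile, start, end in stays:
        if end - start <= MIN_STAY:
            continue
        t = start
        while t < end:             # minute-by-minute scan, simple rather than fast
            if in_window(t):
                minutes[tile] += 1
            t += timedelta(minutes=1)
    return max(minutes, key=minutes.get) if minutes else None

def average_trip_time(trips, home, other):
    """Eqs. (4)/(5): mean duration over trips between home and `other`, both directions."""
    durations = [m for origin, dest, m in trips
                 if origin != dest and {origin, dest} == {home, other}]
    return sum(durations) / len(durations) if durations else None

# Hypothetical stays: (tile, start, end)
stays = [
    ("A", datetime(2019, 11, 4, 21, 0), datetime(2019, 11, 5, 7, 0)),
    ("B", datetime(2019, 11, 5, 8, 0), datetime(2019, 11, 5, 18, 0)),
    ("C", datetime(2019, 11, 5, 18, 30), datetime(2019, 11, 5, 19, 45)),
    ("D", datetime(2019, 11, 5, 19, 50), datetime(2019, 11, 5, 20, 10)),  # too short
]
# Hypothetical trips: (origin tile, destination tile, duration in minutes)
trips = [("A", "B", 45.0), ("B", "A", 55.0), ("A", "C", 20.0)]

home = longest_stay_tile(stays, is_home_window)
work = longest_stay_tile(stays, is_work_window)
c_time = average_trip_time(trips, home, work)      # commuting time, Eq. (4)
hbnc_time = average_trip_time(trips, home, "C")    # HBNC time to site C, Eq. (5)
```

Tile-level values would then be obtained by averaging these per-resident figures over all residents whose home falls in the same tile.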

Table 2 Statistical descriptions of commuting time (minutes), HBNC time (minutes) and housing prices (yuan).

Housing expenditure

The housing data we used include housing prices and government guideline prices. Housing prices refer to the listed prices of individual housing units, which were obtained from public websites. We have provided statistical descriptions of our housing price data in Table 2. However, there is an issue that in some areas, the number of housing units listed may be limited, leading to an inaccurate representation of the area. To minimise this error, we calculated the average listing price for each neighbourhood (referred to as ‘jiedao’, the smallest administrative unit within a city) and then assigned this average price to each tile based on the jiedao where the centre of the tile is located.

Other data

We also utilised Point of Interest (POI) data, which are all publicly accessible from OpenStreetMap. These data were associated with each tile to generate control variables for the RLC model. This primarily included calculating the distance from the centre of each tile to the nearest subway, bus stations, hospitals, retail markets, parks and schools (Næss, 2006a; Næss et al., 2019; Rivas et al., 2019; Sander, 2006). To validate the robustness of the RLC model using the instrumental variables method, we also used precipitation data.

Empirical implementations

Our empirical analysis consists of two parts: model verification and model application, as shown in Fig. 3. The individual-level RLC model posits that individuals’ idiosyncratic preferences, which adhere to the Extreme Value Theory (EVT), are crucial. Therefore, we employ a fitting analysis method to determine whether residents’ travel behaviour aligns with an extreme value distribution. Next, the RLC model is fitted using Generalised Linear Models (GLM) and verified by adding control variables, using instrumental variables and analysing the impact of scale effects (Barbosa et al., 2018). Finally, we utilise the RLC model to examine shifts in residential location preferences due to COVID-19 and to assess whether it can accurately capture dynamic changes in RLC, as well as to make forecasts based on historical travel patterns.

Fig. 3: Road map of empirical analysis.

Verification of the RLC model

Home-based travel behaviour and EVT

Individuals’ idiosyncratic preferences aligning with the EVT is a crucial hypothesis in our RLC model. Given that each tile may contain a different number of people, we assign a weight to each tile based on the number of included residents. We then use the Generalized Extreme Value (GEV) distribution to check if commuting time and HBNC time align with EVT. The fitting results for commuting time in Fig. 4 show that residents selected a residential location that enables them to achieve minimum commuting time, given the spatial distribution of amenities and housing prices. Likewise, the fitting results for HBNC time in Fig. 4 show that HBNC time is maximised during RLC. This means that the way people travel from home aligns with our model’s hypothesis.
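As an illustration of how a weighted P-P comparison against a GEV distribution can be computed, the sketch below evaluates the GEV cumulative distribution function and pairs it with the weighted empirical distribution of tile-level travel times, using the number of residents per tile as weights. The location, scale and shape values are hypothetical placeholders rather than fitted estimates, and the estimation step itself is not shown.

```python
import math

def gev_cdf(x, loc, scale, shape):
    """GEV CDF: exp(-(1 + shape*z)**(-1/shape)) with z = (x - loc)/scale."""
    z = (x - loc) / scale
    if shape == 0.0:                       # Gumbel limit
        return math.exp(-math.exp(-z))
    t = 1.0 + shape * z
    if t <= 0.0:                           # outside the support
        return 0.0 if shape > 0 else 1.0
    return math.exp(-t ** (-1.0 / shape))

def weighted_pp_points(values, weights, loc, scale, shape):
    """(empirical, theoretical) probability pairs for a weighted P-P plot."""
    pairs = sorted(zip(values, weights))
    total = sum(weights)
    cumulative = 0.0
    points = []
    for value, weight in pairs:
        cumulative += weight
        points.append((cumulative / total, gev_cdf(value, loc, scale, shape)))
    return points

# Hypothetical tile-level mean commuting times (minutes) and resident counts
tile_times = [45.0, 20.0, 60.0, 30.0]
tile_residents = [25, 12, 8, 40]
points = weighted_pp_points(tile_times, tile_residents, loc=30.0, scale=8.0, shape=0.1)
```

If the assumed distribution describes the data well, the pairs lie close to the 45-degree line of the P-P plot.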

Fig. 4: Sample distribution of home-based travel time and corresponding fitted GEV distribution.

Graphs (a–d) are the sample distribution of commuting time and corresponding fitted GEV distribution. Graphs (a, b) are the fitting diagram and P-P plot for Shenzhen, respectively. Graphs (c, d) are the fitting diagram and P-P plot for Beijing, respectively. Graphs (e–h) are the sample distribution of HBNC time and corresponding fitted GEV distribution. Graphs (e, f) are the fitting diagram and P-P plot for Shenzhen, respectively. Graphs (g, h) are the fitting diagram and P-P plot for Beijing, respectively.

There are reasonable explanations for the above findings. Travel is primarily driven by the expected benefits at the destination (Næss et al., 2019; Wang et al., 2018). While travel time constitutes a cost paid to participate in out-of-home activities, its impact on individual utility is highly dependent on whether activities are mandatory or optional (Ye et al., 2020). Commuting is rigid travel since work is the primary source of income, and stress-related effects (high blood pressure, self-reported tension and reduced task performance) may extend beyond the journey itself (Kluger, 1998). As a result, it is seen as unproductive time (Lyons and Chatterjee, 2008). Comparatively, HBNC travel offers greater flexibility, since residents not only have the option of choosing the departure time and destination of their trips, but also whether to travel. In other words, residents can decide not to travel to a particular destination if the travel cost is greater than the utility gained at that location. By maximising HBNC trips derived from leisure time, residents can increase their utility. In comparison to distance indicators between residential locations and amenities (schools, parks, etc.) which are primarily a reflection of the accessibility of amenities, HBNC travel reflects people’s personal preferences as well.

According to the fitting results for commuting time and HBNC time, both are more concentrated for Beijing residents than for Shenzhen residents. This is possibly due to differences between the two cities in urban structure and natural characteristics. Beijing is a single-centre city (Yang et al., 2021), and its urban expansion is not limited by space. In contrast, Shenzhen is a polycentric city (Lai et al., 2022), where mountains, rivers and seas largely constrain the city’s expansion.

Regression analysis of the RLC model

In this section, we first analyse whether the results of the RLC model conform to our expectations, and then discuss the robustness of the results. Fitted using GLMs, a consistent pattern of parameters is observed in both Beijing and Shenzhen, despite the differences in their spatial structures (Table 3). The regression results of the RLC model show that the probability of a tile being chosen as a residential location decreases as the average commuting time within the tile increases (commuting time was significantly negatively correlated with \(Prob_{ij}\)), and that this probability increases as the average HBNC time within the tile increases (HBNC time was significantly positively associated with \(Prob_{ij}\)). In addition, the housing expenditure in a given tile was inversely related to the probability of that tile being selected as a residential location. That is, α, β < 0 and γ > 0, which is consistent with our expectations.

Table 3 Regression results of RLC model.

The more mandatory the activity, the greater the influence on the location choice of residence (As, 1978; Stopher et al., 1996). Hence we compare the coefficients of HBNC time and commuting time using the Wald test. The results in Table 4 show that the coefficient size of commuting time is significantly larger than that of HBNC time, suggesting a greater impact of commuting time on RLCs. As compared to HBNC travel, commuting travel is more mandatory. The destinations for HBNC travel are, in most cases, highly substitutable, while the workplace is generally more rigid. In addition, commuting travel is a prerequisite for HBNC travel, especially maintenance travel related to consumption, such as grocery shopping and medical appointments (Loa et al., 2021). Therefore, this result is in line with our expectations.
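One way such a comparison can be set up is sketched below. It assumes, for illustration only, that the null hypothesis is that the two coefficients have equal magnitude and opposite sign (β + γ = 0); the exact hypothesis tested in Table 4 is not restated here, and the coefficient values and standard errors are hypothetical.

```python
import math

def wald_equal_magnitude(beta, gamma, var_beta, var_gamma, cov_beta_gamma=0.0):
    """Wald test of H0: beta + gamma = 0 (equal magnitude, opposite sign)."""
    diff = beta + gamma
    variance = var_beta + var_gamma + 2.0 * cov_beta_gamma
    statistic = diff * diff / variance
    # Upper tail of a chi-square distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value

# Hypothetical estimates: commuting coefficient, HBNC coefficient and their variances
statistic, p_value = wald_equal_magnitude(
    beta=-0.030, gamma=0.010, var_beta=0.002 ** 2, var_gamma=0.003 ** 2
)
```

A small p-value would lead to rejecting equal magnitudes; combined with the signs of the estimates, this indicates which travel time carries more weight.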

Table 4 Wald test on coefficients of key variables.

Robustness test 1: control variables

Amenities have an impact on RLCs (Campbell et al., 1976). To reduce errors caused by omitted variables, amenity variables are added to the RLC model. The results in Table 3 indicate that there were no significant changes in the significance and sign of the core explanatory variables, demonstrating the robustness of our RLC model. The HBNC time proposed in this study not only reflects the convenience of amenities associated with the residence but also the residents’ revealed preferences. Therefore, HBNC time can, to a certain extent, act as a proxy for these amenities. We observed changes in the explanatory power of the model by adding control variables to it. As shown in Fig. 5 (see Table 3 and Supplementary Tables 1–2), including control variables improved the model’s goodness of fit (i.e., \(R^2\)) by 2% in Shenzhen and by 11% in Beijing. Similar modest changes are noted in the coefficient of HBNC time, especially in Shenzhen, but the change in the coefficient of commuting time is negligible in both cities. Hence, HBNC time serves as a good proxy for the availability of amenities and individual preferences for these amenities in both Shenzhen and Beijing.

Fig. 5: The results of robustness test.

Graphs (a, b) depict the percentage change in coefficient and goodness of fit with the inclusion of amenity variables in Shenzhen and Beijing, respectively. The vertical axis indicates the percentage change of the coefficient of HBNC time/coefficient of commuting time/goodness of fit of the regression model. On the horizontal axis, reg_1 is the baseline regression, and reg_2 to reg_7 represent regressions that gradually add the variables of subway, bus station, hospital, retail market, park and school. Graphs (c–f) show the relative importance of commuting time and HBNC time in determining residential locations on diverse spatial scales (tiles of 250 m × 250 m, 500 m × 500 m, 1000 m × 1000 m, and 2000 m × 2000 m, respectively). To ensure the reliability of the results, we choose the maximum value of the coefficients of commuting and HBNC time at the 5% confidence level.

Robustness test 2: endogeneity

Although the results of the model are significant, there may still be self-selection bias. For example, population aggregation promotes infrastructure growth, which in turn leads to further aggregation. We use an instrumental variable framework to verify the robustness of the RLC model. As our RLC model is based on human mobility, weather is an ideal instrument (Aral and Nicolaides, 2017). Gender and age could cause gaps in commuting, income and individual preferences (Dökmeci and Berköz, 2000; Fuchs, 1986; Green and Hendershott, 1996; Huebner and Pleggenkuhle, 2015; Shin and Tilahun, 2022; Venter et al., 2007). We used the amount of precipitation per month per tile, the percentage of age per tile, and the percentage of gender per tile as instrumental variables, employing the two-stage least squares method to examine endogeneity (see Supplementary Table 3). All groups passed the weak identification test, indicating that our model is robust.

Robustness test 3: scale effect

Due to the use of mobile signalling data in this study, the accuracy of individual positions will increase with the size of the tile. Therefore, we need to test the robustness of the RLC model on different scales. The platform developed by the operator provides tiles of 250 m × 250 m. Based on this, we further divide the two cities into tiles of 500 m × 500 m, 1000 m × 1000 m and 2000 m × 2000 m, respectively. Through training our RLC model at different scales, we find that housing prices, commuting time and HBNC time all register consistent coefficients that are significant at the 1% level, despite modest changes in the coefficient size (see Supplementary Tables 4–9). This indicates that our model is applicable at different scales.
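A minimal sketch of one way to move from fine tiles to coarser ones is given below. It assumes that coarse-tile values are resident-weighted averages of the fine-tile values and that tiles are indexed by integer grid coordinates; both are illustrative assumptions, since the aggregation procedure is not detailed in the text, and the tile values are hypothetical.

```python
from collections import defaultdict

def aggregate_tiles(tiles, factor):
    """Merge fine tiles into coarser ones by integer-dividing their grid indices.

    tiles: dict mapping (col, row) -> (residents, mean_minutes)
    Returns a dict mapping coarse (col, row) -> (residents, resident-weighted mean).
    """
    totals = defaultdict(lambda: [0, 0.0])
    for (col, row), (residents, mean_minutes) in tiles.items():
        key = (col // factor, row // factor)
        totals[key][0] += residents
        totals[key][1] += residents * mean_minutes
    return {key: (n, weighted / n) for key, (n, weighted) in totals.items() if n > 0}

# Hypothetical 250 m tiles: (col, row) -> (number of residents, mean commuting minutes)
tiles_250 = {
    (0, 0): (10, 30.0),
    (1, 0): (30, 50.0),
    (0, 1): (20, 40.0),
    (2, 0): (5, 60.0),
}
tiles_500 = aggregate_tiles(tiles_250, factor=2)   # 2 x 2 blocks of 250 m tiles
```

Using `factor=4` or `factor=8` on the same input would give the 1000 m and 2000 m grids under the same assumptions.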

The relative importance of commuting time and HBNC time is also examined at different spatial scales. To assess the impact of these two factors, a new index (RAV) is created. As shown in Eq. (12), this index is the absolute value of the ratio of the commuting time coefficient to the HBNC time coefficient.

$$RAV = \left| {\frac{{Coefficient\,of\,commuting\,time}}{{Coefficient\,of\,HBNC\,time}}} \right|$$
(12)

An RAV greater than 1 indicates that commuting time has a greater impact than HBNC time. As shown in Fig. 5, commuting time has a consistently greater impact than HBNC time on the choice of residential location across different scales. This is in line with our expectations, so we consider the results to be robust.
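Eq. (12) is straightforward to compute once the coefficients are estimated. The sketch below applies it across tile sizes; the coefficient values are hypothetical and serve only to show the calculation.

```python
def rav(commuting_coef, hbnc_coef):
    """Eq. (12): absolute ratio of the commuting-time to the HBNC-time coefficient."""
    if hbnc_coef == 0:
        raise ValueError("HBNC coefficient must be non-zero")
    return abs(commuting_coef / hbnc_coef)

# Hypothetical (commuting, HBNC) coefficients keyed by tile size in metres
coefficients = {
    250: (-0.030, 0.010),
    500: (-0.028, 0.011),
    1000: (-0.025, 0.012),
    2000: (-0.022, 0.013),
}
rav_by_scale = {size: rav(c, h) for size, (c, h) in coefficients.items()}
commuting_dominates = all(value > 1 for value in rav_by_scale.values())
```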

Robustness test 4: time-lagged terms

We incorporate a time-lagged term in the model to test its robustness, an approach inspired by prospect theory and the collective mobility model (Clark and Lisowski, 2017; Xu et al., 2021). When other conditions remain constant, it is possible to explain current RLCs by using historical RLCs. After including the probability of a residential location being chosen in the previous period, as shown in Table 5, all results are consistent with the baseline regression. The goodness of fit of the models in both cities has improved significantly, suggesting that the choice of current location is significantly influenced by the historical distribution of residential locations.

Table 5 Regression results for the RLC model to capture the change in jobs-housing relationship.

Application of the RLC model

We explore two applications of our RLC model. First, we examine whether external shocks affect the applicability of the RLC model. As a result of an exogenous disruption that eliminates the cues that trigger individual behaviours, people are forced to resort to deliberate decision-making (Verplanken et al., 2008; Verplanken and Wood, 2006) and make rational changes regarding their residential locations. Considering that rational choice is a fundamental assumption in our modelling process, we expect the RLC model to capture such changes. Second, we assess the extent to which our RLC model can be used for prediction. Prior research has confirmed the predictive power of RLC models based on amenities and population characteristics. Our model, which focuses on travel behaviour, considers not only spatial characteristics but also individual travel preferences. We therefore expect our RLC model to have good predictive power.

The impact of external shocks

We consider COVID-19 as an external shock and test its impact on RLCs through our model. To minimise the risk of infection, many individuals began to work and study remotely and reduced their optional travel after the outbreak of COVID-19 (Zhang et al., 2021). There are concerns that the pandemic may have changed residents’ living and working patterns (Gerwe, 2021; Liu and Tang, 2021). Therefore, we estimate the parameters of the RLC model separately for the pre-pandemic and post-pandemic periods to test the impact of the pandemic. According to the results (see Supplementary Tables 10–25), neither the sign nor the significance of commuting time or HBNC time has changed following the pandemic. RAV remains larger than 1, indicating that the ordering of importance between commute and non-commute travel has not changed. However, we observe a significant increase in the RAV, as shown in Fig. 6. This indicates that, as a result of the pandemic, commuting time has become more influential on residential location decisions relative to HBNC time. Due to safety considerations, each trip requires weighing not only the utility it brings but also the risk of infection. Thus, the importance of HBNC time in the decision-making process diminishes. For instance, residents have noticeably reduced their use of amenities (Yu et al., 2023).

Fig. 6: The absolute value of the ratio between the coefficient of commuting time and the coefficient of HBNC time before and after the pandemic.

Graphs (a, b) test the change in the impact of home-based travel time due to the outbreak of COVID-19 in Shenzhen and Beijing, respectively. Reg_1 to reg_7 correspond to those used in Fig. 5.

Prediction of the RLC model

In this section, we assess the predictive power of the RLC model. Initially, we utilise 2019 data to train the model, which is then employed to forecast individuals’ RLCs for 2020. The predictive power of the model is assessed by contrasting the actual and forecasted values for 2020, as depicted in Fig. 7. The model’s predicted values are positively correlated with the actual values across various spatial scales. Furthermore, since our sample covers only a subset of the urban population, we draw on the methods of ordinal utility theory to reduce errors caused by differences in magnitude, and compare the predicted ranks with the actual ranks, as shown in Fig. 7. The predicted ranks from the model are also positively correlated with the observed ranks across all spatial scales.
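The rank-based comparison can be summarised with a rank correlation. The sketch below uses Spearman's coefficient, which is an assumption made for illustration: the text compares predicted and observed ranks graphically and does not name a specific statistic. The tile values are toy numbers, and all function names are hypothetical.

```python
def ranks(values):
    """Rank values from 1 (largest) downward; ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of tied values starting at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(predicted, observed):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(predicted), ranks(observed))


# Toy example: predicted choice probabilities and observed resident counts
# for five tiles (assumed values on different scales).
predicted = [0.40, 0.25, 0.20, 0.10, 0.05]
observed = [120, 90, 95, 30, 10]
rho = spearman(predicted, observed)
print(ranks(observed), rho)
```

Working with ranks makes the comparison insensitive to the difference in magnitude between predicted probabilities and sampled counts, which is the motivation given above for the ordinal approach.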

Fig. 7: The prediction results of the RLC model.

Graphs (a–h) are the P-P plots for predicted residential location distribution and observed residential location distribution. Graphs (a–d) are the P-P plots for Shenzhen on the grid with tiles of 250 m × 250 m, 500 m × 500 m, 1000 m × 1000 m and 2000 m × 2000 m, respectively; Graphs (e–h) are the P-P plots for Beijing on the grid with tiles of 250 m × 250 m, 500 m × 500 m, 1000 m × 1000 m and 2000 m × 2000 m, respectively. Graphs (i–p) are the scatter plots for predicted residential location rank and observed residential location rank. Graphs (i–l) are the scatter plots for Shenzhen on the grid with tiles of 250 m × 250 m, 500 m × 500 m, 1000 m × 1000 m and 2000 m × 2000 m, respectively; Graphs (m–p) are the scatter plots for Beijing on the grid with tiles of 250 m × 250 m, 500 m × 500 m, 1000 m × 1000 m and 2000 m × 2000 m, respectively.

Conclusions

In a rapidly expanding urban environment, residents are experiencing both the convenience of agglomeration and its negative externalities (Arnott, 2007; Hong et al., 2020; Peng et al., 2017). RLC is essential not only for residents’ life satisfaction (Campbell et al., 1976), but also for the urban spatial structure (Næss, 2006b). Exploring RLC patterns is therefore a critical global issue. In this context, we develop an RLC model based on home-based travel and housing expenditure. This model aligns with both the population-level gravity model and the individual-level utility maximisation model. Analysing trajectory records of over 16 million mobile phone users from Beijing and Shenzhen across three years, we ascertain two main points: (1) residents aim to minimise commuting time, aligning with existing research (Guidon et al., 2019; Jang and Yi, 2021), and (2) they seek to maximise HBNC time. The RLC model is not only robust but also demonstrates broad applicability: (1) it suits cities with varying urban structures and geographical features, (2) it is valid across different spatial scales and regressions, and (3) it can detect the effect of external shocks and be used for prediction.

This paper offers a novel perspective on analysing RLC behaviour, not only incorporating individual preferences into the RLC model but also reducing data demands and diminishing the statistical correlation between sub-modules of the urban complex model (Anas and Liu, 2007; Waddell, 2002). The model is capable of explaining patterns of residence choice, as well as forecasting housing demand because of its strong predictive performance. Furthermore, since our RLC model is based on revealed preferences, the model can be combined with other models based on spatial characteristics to evaluate the efficiency of infrastructure provision and the impact of external shocks on the jobs-housing relationship (Næss, 2006b; Næss et al., 2019).

Although the proposed RLC model has many advantages, several limitations need to be mentioned. First and foremost, there may be omitted variables. Our RLC model is constructed based on residents’ travel behaviours, and it includes factors related to the built environment that are associated with travel. Nevertheless, it does not consider factors such as noise and air quality, which can influence RLCs but are less related to travel behaviour. In future studies, these environmental variables should be better considered. Secondly, we obtained secondary travel trajectory data rather than original call detail records. There is no way to verify the quality of the travel behaviour data, which is essential to the test of our RLC model. Although the same dataset has been applied in published works, there remains the need to cross-check its reliability. Data security concerns limit the use of individual travel trajectory data, which affects the accuracy of the empirical tests of our RLC model. Moreover, a binary distinction is made between mandatory and optional travel, thereby reducing the accuracy of using HBNC time as a proxy for amenities and individual preferences. In addition, HBNC time was underestimated because co-occurrences of non-work site visits were ignored. That is, we failed to account for leisure trips that do not start or end at home. Future studies may attempt to verify the laws found in this paper by identifying the different types of HBNC travel, which will help improve the explanatory power and predictive ability of this model.