Crowdsourced Data Mining for Urban Activity: Review of Data Sources, Applications, and Methods

The penetration of devices integrated with location-based services and internet services has generated massive data about the everyday life of citizens and tracked their activities in cities. Crowdsourced data generated by the crowd, such as social media data, points of interest (POIs) data, and data from collaborative websites, have become fine-grained proxy data of urban activity and are widely used in urban studies. However, because crowdsourced data types are heterogeneous and previous studies have mainly focused on a specific application, a systematic review of crowdsourced data mining for urban activity is still lacking. To fill this gap, this paper conducts a literature search in the Web of Science database, selecting 226 highly related papers published between 2013 and 2019. Based on these papers, the review first conducts a bibliometric analysis identifying underpinning domains, pivotal scholars, and papers around this topic. The review also synthesizes previous research into three parts: main applications of different data sources and data fusion; application of spatial analysis in mobility patterns, functional areas, and event detection; and application of sociodemographic and perception analysis in city attractiveness, demographic characteristics, and sentiment analysis. The challenges of this type of data are discussed at the end. This study provides a systematic and current review for both researchers and practitioners interested in the applications of crowdsourced data mining for urban activity.
DOI: 10.1061/(ASCE)UP.1943-5444.0000566. This work is made available under the terms of the Creative Commons Attribution 4.0 International license, http://creativecommons.org/licenses/by/4.0/.
Author keywords: Crowdsourced data; Data mining; Urban activity; Review of methods.


Introduction
The development of technologies such as Information and Communications Technology (ICT) and Web 2.0 technology has brought a data revolution to the world (Kitchin 2014, p. 26). As an emerging type of big data, crowdsourced data have attracted growing interest in many disciplines (Gray et al. 2015; Garcia-Molina et al. 2016). Two core technologies supporting crowdsourced data have emerged from the multitude of approaches and clustered around two main themes: device/platform-captured data; and user/system-interaction data. The former stems from the current wave of ICTs, such as digital devices, mobile phones, and the Internet of Things, which have penetrated almost every aspect of daily activity, such as work, residency, commuting, communication, consumption, leisure, and travel, and which capture these activities with explicit or implicit content at unprecedented spatial and temporal resolutions (Kitchin 2014, p. xv). The latter stems from the emergence of Web 2.0 technology, which encourages internet users to generate and interact with, rather than only consume, online content (Batty 2012). This allows internet users to create, modify, and supply content to websites, boosting the production of user-generated content related to activities of the public. The penetration of these technologies has undoubtedly led to an explosion of crowdsourced data that are highly related to people's everyday behavior (Kitchin 2014, p. 80). Consequently, crowdsourced data have been used in a large body of research and have quickly become an essential source for data-driven analysis in geography and urban studies (Miller and Goodchild 2015).

Background of VGI and Crowdsourced Data
In the field of geography and urban studies, several scholars have added different perspectives to the basic concept of crowdsourced data, and therefore, it is essential to place crowdsourced data in their context. For example, Crooks et al. (2015) stated that the term crowdsourcing, coined by Howe (2006), implied a coordinated bottom-up grassroots effort to contribute information, which is not necessarily limited to geographical information. Adopting this principle, Goodchild (2007) introduced the term volunteered geographical information (VGI) to refer to the geographical content generated by nonexpert users. However, Harvey (2013, p. 34) questioned the misuse by researchers who use VGI to refer to data sets that are contributed rather than volunteered by people. He argued that both volunteered and contributed data should be aggregated into the concept of crowdsourced data. Sui et al. (2013, p. 2) also pointed out that VGI is a type of crowdsourced data for geographic knowledge production. In the book The Data Revolution, Kitchin (2014, p. 96) reviews concepts of various data types in the context of humanities and social sciences and mentions that data sourced from a large group of people, for example, social media data, could be recognized as being crowdsourced. When applied to urban studies, Crooks et al. (2015) argued that crowdsourced data include explicit sources of collaborative, user-generated mapping and implicit sources such as social media. Over time, given the spread and depth of the type of data that have been generated from devices and platforms, the definition has broadened. Supporting this, See et al. (2016) reviewed the abstracts of 25,338 scientific papers about citizen-derived geographic information published between 1990 and 2015. The literature described this phenomenon using a multitude of terms, which have emerged from different disciplines; some focused on the spatial nature of the data such as volunteered geographic information (VGI) and neogeography, while other terms have much broader applicability, e.g., crowdsourcing, citizen science, and user-generated content, to name but a few. After identifying the sharp rise of the term crowdsourcing among 27 other relevant terms in academia, See et al. used the term crowdsourced geographic information as an umbrella term to represent the different terms mentioned previously. Building on research by See et al. (2016), the concept of crowdsourced data in this paper refers to data both volunteered and contributed by individuals through ICT-integrated devices and user/system interaction with Web 2.0 technology. The term crowdsourced emphasizes the process of data collection, which refers to data sourced by the crowd, rather than the process of data generation. In this context, the main types of crowdsourced data in this paper cover social networking data, points of interest (POIs), and collaborative websites.

1 Ph.D. Candidate, Lab of Interdisciplinary Spatial Analysis, Dept. of Land Economy, Univ. of Cambridge, CB3 9EP, Cambridge, UK (corresponding author). ORCID: https://orcid.org/0000-0002-0182-3573. Email: hn303@cam.ac.uk

Crowdsourced Data and Urban Activity
In the age of big data, digital data and cities have formed a wide-ranging, diverse, and complex relationship (Kitchin et al. 2017a, p. 44). Crowdsourced data have shown potential in understanding urban activity and its underlying patterns and have been used to solve complex problems or fill important gaps that traditional data sets could not cover in urban analysis (Thakuriah et al. 2016). First of all, since crowdsourced data emerged with location-based services, they are able to provide geographic information such as geotags or geolocation, which is the most rudimentary and vital attribute for urban spatial analysis (Kitchin et al. 2017b, p. 6; Thatcher et al. 2018, p. 123). Second, crowdsourced data are updated at high frequency and therefore reflect what is happening at present. Furthermore, crowdsourced data are far more cost-effective than traditional data such as surveys or government censuses. Most importantly, this type of data is collected from volunteering individuals, and its content includes rich information related to urban activity. It should be noted that although the aforementioned advantages of crowdsourced data have been widely perceived by scholars, scholars are still far from the view that traditional data sets, such as census and questionnaire-based data, should be abandoned for understanding urban activity. The users and producers of crowdsourced data represent only a small fraction of the population; therefore, it would be erroneous to even consider replacing robust census data collection methods with crowdsourced data harvesting as a solution for all data problems.
Although the advantages of crowdsourced data have been widely recognized and applied, a systematic understanding of how crowdsourced data contribute to urban activity analysis is still lacking. Previous studies either examined crowdsourced data in a general context or focused on a specific application of crowdsourced data in urban studies. It is still difficult for researchers in urban studies to gain an overall understanding of crowdsourced data in terms of data types, metrics, and methodologies, and furthermore to apply the data in their studies (Shelton et al. 2015; Chen et al. 2017; Xu et al. 2017). Particularly in the current context of big data, how to engage powerful data mining techniques from computer science is also an obstacle for the majority of researchers in urban dynamics (French et al. 2017). Therefore, this paper aims to investigate how crowdsourced data mining helps understand urban activity and to examine the established perception that crowdsourced data will replace other types of data collection. To achieve these goals, this paper not only focuses on the types and characteristics of crowdsourced data but also critically presents how the methods are applied to data processing. It is anticipated that this paper will offer urban researchers the opportunity to develop more robust applications while analyzing urban activity.
This study reviews the literature on crowdsourced data applications in the domain of urban activity since 2013. It first introduces the review methods, especially for literature inclusion and bibliometric analysis. Based on the cocitation analysis of included papers, it then identifies the fundamental domains, key researchers, and papers on the topic of crowdsourced data. This is followed by a qualitative review synthesizing data sources, applications, and methods engaged in spatial analysis and in sociodemographic and perception analysis. This review also summarizes the potential challenges of crowdsourced data mining.

Literature Search
This study first conducted a literature search on the Web of Science database to include papers in the review process. The search query covers two key concepts: crowdsourced data; and urban activity analysis. Each concept is an umbrella for a set of search terms. Concept I includes terms regarding crowdsourced geographic information, e.g., crowdsourced data, social media, and geotagged. Concept II refers to terms such as urban, city, space, planning, and so forth to select papers focusing on urban analysis. In this way, papers retrieved by this search query are highly relevant to the topic: urban activity analysis with crowdsourced data. To retrieve the targeted literature more accurately, this paper adjusted the search terms after multiple searches. The list of search terms eventually included in the search query is given in Table 1. In the search query, the Boolean AND is used to combine the two main concepts, while OR is used within each concept so that a paper matching any of its terms is included. The search terms are expected to appear in the field TS (topic), which covers the title, abstract, and keywords. Also, the search query restricts the publication year to the time span from 2013 to 2019. The final search query is: TS = (crowd*sourc* OR social media OR social networks data OR microblog* OR POI$ OR point*of*interest* OR VGI OR location-based OR LBS OR LBSM or LBSN OR volunteered geographic information OR user*generated content OR

Literature Inclusion
To exclude irrelevant studies, this paper refines the results with Web of Science Categories where urban studies and regional urban planning are chosen. Then, this paper screens the results with research question-based criteria to further narrow down the data. The inclusion criteria are:
(1) Does the paper apply crowdsourced data to conducting urban spatial analysis?
(2) Does the paper clearly state the method of processing crowdsourced data?
(3) Does the paper discuss the trends or challenges of crowdsourced data?
Finally, after reviewing papers against the aforementioned criteria, 226 papers are selected and subjected to review. The papers come from journals and edited books such as Computers, Environment and Urban Systems; Cities; Landscape and Urban Planning; International Journal of Geographical Information Science; ISPRS International Journal of Geo-Information; Urban Planning; Journal of Urban Technology; Seeing Cities Through Big Data: Research, Methods and Applications in Urban Informatics; Springer Geography; and Applied Geography. Almost 19% of the articles are from Computers, Environment and Urban Systems because most applications of crowdsourced data are multidisciplinary research involving computer science and urban studies (see Table 2).

Bibliometric Analysis
Before reviewing the selected papers for qualitative synthesis, a bibliometric analysis based on the metadata of the literature was conducted. Bibliometric analysis is used to explore the relationships among publications in terms of citation information, bibliographic information, abstract, keywords, funding details, and other metadata. As a method of bibliometric analysis, cocitation analysis measures the frequency with which two documents/authors are cited together by others. The more often two documents/authors are cocited, the higher the cocitation strength between them, and the more likely they are related semantically (van Eck and Waltman 2011). Author cocitation was proposed by White and Griffith (1981) and described as a measure of the relatedness of authors' works. Through author cocitation, this analysis shows from which disciplines/domains a topic is derived, who the pivotal researchers/scholars in each domain are, and how they connect. This analysis not only gives a broad conception of the background but also specifically answers the questions: What are the fundamental domains of crowdsourced data mining in urban activity? Who are the impactful researchers in these domains? Additionally, through cocitation analysis based on the reference lists of all the selected papers, this review identifies key studies that are highly cocited by the included papers, answering the third question: What are the key papers on this topic? In addition, bibliometric analysis helps to cover those highly cited, valuable, and fundamental papers that were excluded in the previous process because of the limited time span. In this paper, these cocitation analyses are conducted and visualized in VOSviewer (van Eck and Waltman 2011).
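The cocitation count described above can be illustrated with a minimal sketch. This is not the VOSviewer implementation; the function name cocitation_counts and the three reference lists are hypothetical, and the only assumption is that each citing paper is represented by the list of items it cites.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(reference_lists):
    """Count how often each pair of cited items appears in the same reference list."""
    counts = Counter()
    for refs in reference_lists:
        # Deduplicate and sort so every unordered pair has one canonical key.
        for pair in combinations(sorted(set(refs)), 2):
            counts[pair] += 1
    return counts


# Hypothetical reference lists of three citing papers.
papers = [
    ["Goodchild 2007", "Haklay 2010", "Blei 2003"],
    ["Goodchild 2007", "Haklay 2010"],
    ["Goodchild 2007", "Blei 2003"],
]
counts = cocitation_counts(papers)
for pair, n in counts.most_common():
    print(pair, n)
```

Pairs with higher counts are cocited more often and would sit closer together in a cocitation map.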

Domains and Key Researchers
According to the results of the cocitation analysis, 6,867 authors are cited in the 226 selected papers, of whom 62 received more than 15 citations. The network between cocited authors is visualized in Fig. 1. Each author is represented as a node in the figure whose size refers to the frequency with which two authors are cited together by others. The distance between nodes indicates the relatedness between two authors, and the thickness of lines indicates the total link strength. Here, total link strength indicates the sum of the link strengths of an author with other authors. Four clusters, assigned different colors, are identified in the cocitation map. Although cocitation analysis illustrates the disciplinary structure well, it does not identify the topic of each cluster. To interpret the semantic clusters shown in Table 3, we searched the expertise and research interests of key researchers from their university profile pages, Google Scholar, ResearchGate, LinkedIn, personal websites, and other sources. By synthesizing their research interests and expertise, we identified the clusters as four domains: GIScience, Data Science, Urban Studies, and Human Geography.
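As a small illustration of the two quantities used in this paragraph, the sketch below sums link strengths per author and applies a citation threshold. The author labels, link weights, and citation counts are hypothetical placeholders, and the helper names are assumptions rather than part of any bibliometric tool.

```python
from collections import defaultdict


def total_link_strength(links):
    """Sum, for each author, the strengths of all links to other authors."""
    totals = defaultdict(int)
    for (a, b), strength in links.items():
        totals[a] += strength
        totals[b] += strength
    return dict(totals)


def frequently_cited(citations, minimum):
    """Return authors whose citation count exceeds `minimum`, sorted by name."""
    return sorted(author for author, n in citations.items() if n > minimum)


# Hypothetical cocitation links (pair -> number of cocitations).
links = {("A", "B"): 3, ("A", "C"): 1, ("B", "C"): 2}
strength = total_link_strength(links)

# Hypothetical citation counts, filtered with the "more than 15" rule.
citations = {"A": 20, "B": 15, "C": 40}
kept = frequently_cited(citations, 15)
```

With these toy inputs, author B has the largest total link strength even though it falls below the citation threshold, which shows that the two measures capture different things.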
GIScience plays a vital role in this topic. It is led by Michael Goodchild, Sarah Elwood, and Greg Brown, who clearly have a background in geography and geographic information systems. Their studies dominate the whole network of authors. Another big group with a data science background, led by Zheng Yu, Justin Cranshaw, and Daniele Quercia, also contributes findings to this research topic. Most of them work with internet companies such as Microsoft Research, Foursquare, and Google, which have access to big location-based data. With massive location-based social media data, they focus on exploring spatiotemporal patterns of human mobility to solve practical problems such as transportation, location recommendation, route planning, and emergency management. Researchers from Urban Studies also provide building blocks by applying crowdsourcing not only to understanding urban structure and urban form but also to supporting urban models and planning support systems. Michael Batty is the pivot for proposing the complex nature of urban systems and emphasizing the importance of data derived from citizens. Another domain with a geographic background is led by Liu Yu, focusing on time geography and human geography. Their studies emphasize the relationship between mobility and demographic characteristics and also explore patterns beyond spatiotemporal attributes.

Core Papers
According to the results of cocitation based on references, 44 of the 9,499 references cited in the 226 papers are cited more than 10 times. The visualization of cocitation networks is shown in Fig. 2, where each dot represents one reference item. Those most cocited papers are the fundamental and core findings/arguments for studies in crowdsourced data mining within an urban context (see Table 4). To understand the main arguments of those core papers and how they link with others, we analyzed the relationship between them alongside the scatterplot shown in Fig. 3, in association with the key domains and researchers identified in the previous section. In general, based on the chronological order of these core papers, we found that this topic was first driven by GIScience and then fueled by applications in research from data science and human geography. Concerning the topic of crowdsourced data, Goodchild (2007) keenly noticed the potential arising from the blooming of techniques such as Web 2.0 and the Global Positioning System in the context of GIScience and coined the concept VGI. In this most highly cited paper, Goodchild also highlighted the vast potential of citizens who carry sensors and voluntarily contribute geographic information. Since then, this concept has been broadly accepted by many researchers. Haklay (2010) examined the quality of crowdsourced data and stated its robustness. After this, there was a rapid increase in applications of crowdsourced data. Before crowdsourced data became a hot topic, another type of data, mobile phone data, had shown the potential of a massive and fine-grained location-based data set. At the level of application, Ratti et al. (2006) creatively used mobile phone data to understand human mobility. However, this topic began to attract considerable attention only from 2008, when González et al. (2008) published a paper in Nature about the law of human trajectories at the individual level, which stated that "human trajectories show a high degree of temporal and spatial regularity, each individual being characterized by a time-independent characteristic travel distance and a significant probability to return to a few highly frequented locations." This robust finding based on complex systems and statistics has attracted massive attention from different disciplines and has contributed to countless studies on human mobility and location-based data. It led a trend in the domains of data science (some of which are from computer science) of exploring human mobility and urban dynamics with different algorithms. The most cited algorithm is latent Dirichlet allocation (LDA), proposed by Blei et al. (2003); since then, the topic model has been widely applied to cluster human behavior or mobility patterns from crowdsourced data. At the same time, the trend of data mining expanded from mobile phone data to crowdsourced data such as social media data (Cheng et al. 2011; Noulas et al. 2012) and POIs data (Yuan et al. 2012). One interpretation is that social media data are more accessible than mobile phone data and taxi trajectory data, which are exclusively owned by telecommunication carriers and private companies, respectively.
Benefiting from data mining techniques in data science, the use of crowdsourced data started to appear in Geography, especially Human/Time Geography. The overlapping clusters shown in Fig. 1 reveal the relatedness between these two domains. This obvious shift started approximately in 2013 with two papers (Crampton et al. 2013; Li et al. 2013) that summarize the potential of crowdsourced data in finding patterns of urban dynamics, especially the socioeconomic patterns that had been neglected by previous research. Their argument is later echoed by two almost simultaneous papers: Liu et al. (2015), who propose the concept of social sensing that denotes capturing socioeconomic features from big data on the individual level, and Shelton et al. (2015).

Sources of Crowdsourced Data
By reviewing the 226 papers included in the search result and the 20 most cited references, the sources of crowdsourced data can be divided into (a) social media, published by individuals for sharing or social networking; (b) points of interest, specific locations of functional buildings and facilities; and (c) collaborative websites, volunteered or contributed information produced collaboratively through online platforms. From the aspect of analysis, applications of crowdsourced data include activity patterns, patterns of mobility, social behavior, land use, semantic analysis, event detection, location inference, disaster management, traffic management, and so forth. Both the attributes of each type of data and the method of each facet of analysis are discussed subsequently.

Social Media
As a technology built on Web 2.0, social media, for example, Facebook, Twitter, Instagram, and Sina Weibo, provide internet users with a platform to actively generate or contribute content (text, places, images, and videos) through sharing and interacting with others in the digital community (Kitchin 2014). With the proliferation of ICTs, social media data (streams/posts) have become ubiquitous. On Twitter alone, approximately 500 million tweets are sent by millions of users every day (Internet Live Statistics 2016). Sina Weibo, an equivalent of Twitter in China, had up to 376 million monthly active users in the third quarter of 2017. Most social media platforms provide sample data sets or an application programming interface (API) to access historical or real-time social media streams. Data harvested in this way record one social media post per row, with metadata such as content, created time, location, and tags, together with information about users, the so-called data subjects, such as user id, user location, and profile image. Built on these high-dimensional features, social media data work particularly well for mining patterns of human mobility, activity, and social networks, and for sentiment detection, sociodemographic analysis, event detection, and disaster management (see Table 5).
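The row-per-post structure described above can be sketched as a small record type. The field names below mirror the metadata listed in the paragraph but are assumptions; they do not correspond to the schema of any particular platform API, and the two example posts are invented.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional, Tuple


@dataclass
class SocialMediaPost:
    """One harvested post; field names are illustrative, not a platform schema."""
    content: str
    created_at: datetime
    user_id: str
    location: Optional[Tuple[float, float]] = None  # (latitude, longitude) if geotagged
    tags: List[str] = field(default_factory=list)
    user_location: Optional[str] = None  # free-text location from the user profile


def geotagged(posts):
    """Keep only posts that carry an explicit coordinate pair."""
    return [p for p in posts if p.location is not None]


# Two invented posts: only the first one is geotagged.
posts = [
    SocialMediaPost("Morning run by the river", datetime(2019, 5, 1, 7, 30), "u1",
                    location=(52.2053, 0.1218), tags=["running"]),
    SocialMediaPost("Stuck in traffic again", datetime(2019, 5, 1, 8, 5), "u2"),
]
usable = geotagged(posts)
```

Filtering for geotagged posts is typically the first step before any spatial analysis, since location is optional in most post metadata.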
Characterized by high spatiotemporal resolution, social media data are commonly used in the identification of human mobility patterns at intraurban, interurban, and global levels (Gabrielli et al. 2014; Hawelka et al. 2014; Liu et al. 2014; Wu et al. 2014; Abbasi et al. 2017), activity patterns (Noulas et al. 2013; Hasan and Ukkusuri 2014; Steiger et al. 2015; Martín et al. 2019) including specific patterns of tourists in urban areas, and event detection and emergency management (De Albuquerque et al. 2015; Granell and Ostermann 2016; Kim and Hastak 2018). Focusing on contextual information, user-generated content of social media provides a source for describing the actual activities behind digital footprints (Lansley and Longley 2016) and becomes an indicator for sentiment detection with natural language processing techniques (Hollander and Hartt 2018). In recent years, the application of social media in urban dynamics has expanded to urban social problems by integrating with census data to assign socioeconomic/demographic features to users (Li et al. 2013; Shelton et al. 2015) or by extracting demographic features (gender, age, and ethnicity) from profile information through methods such as face detection and name analysis (Bocconi et al. 2015; Luo et al. 2016). Moreover, social media sources such as Instagram and Flickr contain massive image content, which has been examined for urban perception (Liu et al. 2016b) and patterns of tourists (Li et al. 2018).

POIs Data
POIs refer to location points associated with commercial facilities, public areas, and transportation facilities (see Table 6). POIs data are normally obtained from open-source databases such as OpenStreetMap POIs, business map service POIs databases (Google Places and Gaode Map), and POIs-based social networking data, known as check-in data, e.g., Foursquare or Yelp. POIs-related data contain location name, function, postcode, and address, which reflect the distribution of facilities. Recently, there has been noticeable integration between common social media platforms and location-based social networking services. For example, Twitter partnered with Foursquare and Yelp in 2015 and 2016, respectively, whereas Sina Weibo developed its own POIs creator. This paper differentiates POI check-ins from social media data because POIs data are basically location/venue-oriented data gathering users' visit logs or reviews, while social media streaming is content-based data in which location is optionally included by individuals. POIs data, including POI check-ins, are applied by researchers in exploring urban dynamics in terms of urban structure, urban functional use detection, patterns of urban activity, urban vibrancy, and population mapping. As location-based data, POIs have a strong connection with the urban built environment. They have been commonly used to explore the physical features of urban areas such as urban form/structure (Pan et al. 2018; Song et al. 2018; Deng et al. 2019), urban boundary, and urban growth (Long et al. 2015; Daggitt et al. 2016). Based on the subdivided categories of places, a group of researchers use POIs to identify land use at the store level and detect urban function (Yuan et al. 2012; Zhan et al. 2014; Frias-Martinez and Frias-Martinez 2014; Liu et al. 2017).
To explore patterns of activity, Foursquare Labs presented the global distribution of urban activity and enabled activity distribution analysis by visualizing 500 million POIs data (Foursquare Labs). With the increasing integration between social media data and location-based services, POIs data have become more prevalent in exploring the purpose of human activity (Hasan and Ukkusuri 2015; Pouke et al. 2016; Yu et al. 2016) and urban dynamics related to human activity such as urban vibrancy (Jin et al. 2017; Yue et al. 2017; Wu et al. 2018), urban sociospatial inequality (Shelton et al. 2015), and urban deprivation (Quercia and Saez 2014). Overall, POIs data link human activities with the built environment, which provides researchers with an opportunity to understand purposeful urban activity and the function of land use.
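To make the link between POI categories and land use function concrete, the following deliberately simple sketch bins POIs into square grid cells and labels each cell with its most frequent category. This is an illustrative assumption, not the method of any study cited above (several of which use far richer models); the coordinates, categories, and cell size are invented.

```python
from collections import Counter, defaultdict


def dominant_category_by_cell(pois, cell_size):
    """Label each grid cell with the most frequent POI category inside it.

    `pois` is an iterable of (x, y, category) in projected coordinates (e.g., meters).
    """
    cells = defaultdict(Counter)
    for x, y, category in pois:
        cell = (int(x // cell_size), int(y // cell_size))
        cells[cell][category] += 1
    return {cell: counts.most_common(1)[0][0] for cell, counts in cells.items()}


# Invented POIs in a 200 m x 100 m strip, binned into 100 m cells.
pois = [
    (10, 10, "restaurant"),
    (20, 30, "restaurant"),
    (40, 40, "office"),
    (150, 20, "school"),
]
labels = dominant_category_by_cell(pois, cell_size=100)
```

A dominant-category label ignores mixed use; weighting categories or clustering cells by their full category profile would be natural refinements.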

Collaborative Websites
As a primary source of crowdsourced data, collaborative websites refer to web services to which individuals contribute by uploading and editing exclusively thematic content, such as geographic maps and geotagged images, in compliance with a specific acceptance policy. OpenStreetMap is a wiki-like mapping service focusing on geospatial features such as land, transportation infrastructure, and buildings. This volunteered geographic information has been applied to urban land parcel identification, land cover assessment (Estima and Painho 2015), and urban boundary identification (Schlesinger 2015). Geotagged photo-sharing websites such as Geograph Worldwide and Panoramio provide platforms for photo sharing according to geolocation, content, and categories of photos. This type of data has been applied to the detection of location preference, patterns of travel routes, perception of the built environment, and city attractiveness (Paldino et al. 2015; García-Palomares et al. 2015; Dubey et al. 2016). Apart from these sources, there is an increasing number of crowdsourcing projects focusing on specific aspects of urban dynamics. Recently, Rae (2016) built a crowdsourcing website for citizens to digitize the boundaries of several cities on an interactive map. CASA (2018) from UCL released the Colouring London project to invite residents of London to assign colors according to the features of buildings with which they are familiar.
Among the three main sources, social media is the dominant one because of its extensive coverage, deep usage penetration, massive size, and relatively open accessibility. Social media will further show its value in analyses of urban dynamics because of its growing integration with location-based services. Collaborative websites show more flexibility in collecting knowledge from citizens because platforms can be designed to focus on different aspects of activities. The crowdsourcing mode used in those projects with online websites will also be adopted by researchers in this area. Although this paper discusses the three primary sources and their main applications separately, it should be mentioned that most of the studies require multiple sources of crowdsourced data. Data fusion (within crowdsourced data, or with other data sets including census data, smart card records, and bank transactions, among others) is broadly used in big data-driven analysis (Hasan et al. 2013; Lenormand et al. 2015; Jin et al. 2017).
In the following sections, this paper synthesizes the applications to urban dynamics mentioned previously and illustrates the state-of-the-art methods engaged in spatial analysis, sociodemographic analysis, and perception analysis.

Mobility Patterns: Dynamic Flows of Urban Activity
With the development of ICTs, location-aware devices such as smartphones, mobile devices, and vehicles allow users to record and share the whereabouts of their urban activity. Based on spatial and temporal information, most of the studies on human mobility have been carried out with geospatial data and analytical models (see Table 7), including the gravity model, the generalized potential model, the rank-based movement model, and the radiation model (Abbasi et al. 2017; Barbosa et al. 2018).
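For readers unfamiliar with these models, the sketch below gives the forms in which the gravity and radiation models are commonly written; the review itself does not state the formulas, so the parameter names (k, beta, t_i, s_ij) and all numbers are assumptions for illustration. In the gravity form, the flow grows with the product of the two populations and decays with distance; in the radiation form, it depends on the population s_ij living within the circle of radius d_ij around the origin, excluding origin and destination.

```python
def gravity_flow(m_i, m_j, d_ij, k=1.0, beta=2.0):
    """Gravity form: T_ij = k * m_i * m_j / d_ij ** beta."""
    return k * m_i * m_j / d_ij ** beta


def radiation_flow(t_i, m_i, n_j, s_ij):
    """Radiation form: T_ij = t_i * m_i * n_j / ((m_i + s_ij) * (m_i + n_j + s_ij)).

    t_i  : total number of trips leaving origin i
    m_i  : population of origin i
    n_j  : population of destination j
    s_ij : population within distance d_ij of i, excluding i and j
    """
    return t_i * (m_i * n_j) / ((m_i + s_ij) * (m_i + n_j + s_ij))


# Toy numbers only: two places of 100 and 200 people, 10 distance units apart.
g = gravity_flow(100, 200, 10)
# With no intervening population, half of the trips go to an equally sized neighbor.
r = radiation_flow(1000, 100, 100, 0)
```

Unlike the gravity form, the radiation form has no free parameters to calibrate, which is one reason the two are often compared against observed flows.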
These methods have been applied to different scales of mobility studies including intraurban, interurban, national, and global studies. At the intraurban level, Wu et al. (2014)   underlying patterns of trips and spatial interaction between 370 cities in China. The study suggested that crowdsourced data perform better when revealing the collective level of spatial interaction, compared with intraurban human mobility. At the global level, social media has been a suitable enough form of datasource to explore mobility between nations or districts, since it is difficult for mobile phone data to cover worldwide scales due to the high fragmentation of the mobile telecom market. Hawelka et al. (2014) uncovered global patterns of human mobility against the volume of international travelers, characteristics of flows between nations, temporal patterns of international travel, and mobility networks. Recently, a rank-based model has been developed to predict human mobility by ranking the probability of commuting between venues (Noulas et al. 2012;Liang et al. 2015;Chen et al. 2016). Furthermore, Abbasi et al. (2017) applied check-in weighting schema to rank the probability. However, the accuracy of the computing rank is not sufficient, which affects the estimation of mobility. To further subset mobility patterns, Yang et al. (2017) labeled users as natives and nonnatives and proposed the indigenization coefficient to estimate the extent of natives. Similarly, Li et al. (2018) examined the spatial interaction between locals and tourists. Studies elaborated previously only depict the spatiotemporal distribution of individuals, without considering many other features embedded in crowdsourced data. Steiger et al. (2016) identified spatiotemporal clusters of urban activity with a self-organizing maps (SOM) algorithm, which handles high-dimensional data sets well. 
Importantly, the clustering process also considers the semantic similarity of tweets supported by the LDA model, which further reveals the patterns of mobility.
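The rank-based idea discussed above can be sketched in a few lines. This is a simplified illustration in which the probability of moving to a venue is proportional to its distance rank raised to a negative exponent; the exponent `a`, the 1-based ranking, and the venue coordinates are assumptions rather than values from the cited studies.

```python
import math

def rank_based_probabilities(origin, venues, a=1.0):
    """Return P(origin -> venue) proportional to rank**(-a); rank 1 is the nearest venue."""
    ordered = sorted(venues, key=lambda name: math.dist(origin, venues[name]))
    weights = {name: (rank + 1) ** (-a) for rank, name in enumerate(ordered)}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

# Hypothetical venues with planar coordinates.
venues = {"cafe": (1.0, 0.0), "park": (0.0, 2.0), "mall": (3.0, 4.0)}
probs = rank_based_probabilities((0.0, 0.0), venues)
```

Because only the ordering of venues matters, the same physical distance yields a different probability in a dense city center than in a sparse suburb, which is the property that distinguishes this approach from purely distance-based models.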

Functional Areas: Activity-Based Analysis
To understand urban activity, it is not enough to identify the mobility patterns of human beings while neglecting the purpose and content of their activities. Because human behavior is strongly connected to the built environment in which it takes place, the activities happening in urban space define the function of urban areas (Crooks et al. 2015). In addition, the activities and habitats of citizens form functional areas that differ from administrative units in extent and structure. To identify those functional areas in cities, traditional methods such as remote sensing are limited in capturing the socioeconomic attributes of human dynamics, and social science methods such as interviewing, observing, and cognitive mapping are usually costly and time-consuming for researchers (Zhou and Zhang 2016; Chen et al. 2017). By leveraging fine-grained geospatial information and continuously updated content on individuals' behavior, researchers have attempted to delineate urban functional areas with crowdsourced geospatial data (see Table 8). It is fair to say that the introduction of crowdsourced data allows studies to move from movement-based analysis to activity-based analysis (Wu et al. 2014). Hollenstein and Purves (2010) analyzed 8 million Flickr images with georeferenced tags to understand how people describe city core areas with different names and how these areas are distributed. At the neighborhood level, Cranshaw et al. (2012), in a representative case, the Livehoods project, developed a clustering model that considers both spatial proximity and social proximity, using check-in data to map neighborhoods dynamically. For identifying functional areas or land use patterns, clustering is important because it aggregates objects into spatial groups. Wang et al. (2016) compared three representative spatial clustering algorithms, density-based spatial clustering of applications with noise (DBSCAN), expectation-maximization (EM), and K-means, arguing that K-means, as an algorithm based on the distance between objects, is appropriate for processing high-dimensional objects when identifying land use patterns. Departing from the commonly used kernel density estimation (KDE) approach, Aadland et al. (2016) developed an algorithm employing fuzzy-set theory to identify the boundaries of neighborhoods.
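As an illustration of the distance-based clustering discussed above, the following is a minimal sketch of K-means (Lloyd's algorithm) using only the standard library. The zone feature vectors (hypothetical shares of residential, commercial, and leisure POIs) and the choice of initial centroids are assumptions for demonstration, not data from the cited work.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain Lloyd's algorithm: assign each point to its nearest centroid, then update centroids."""
    labels = []
    for _ in range(iterations):
        labels = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(tuple(sum(v) / len(members) for v in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break  # assignments have stabilized
        centroids = new_centroids
    return labels, centroids

# Hypothetical zones described by (residential, commercial, leisure) POI shares.
zones = [(0.8, 0.1, 0.1), (0.7, 0.2, 0.1), (0.1, 0.8, 0.1), (0.2, 0.7, 0.1)]
labels, centers = kmeans(zones, centroids=[zones[0], zones[2]])
```

K-means requires the number of clusters in advance and is sensitive to initialization, so in practice the result is usually checked against several starting points.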
To identify urban functional areas with more than spatial location alone, Yuan et al. (2012) introduced a probabilistic topic model that regards a region as a document and a function as a topic, delineating urban functional areas through clustering based on LDA. From the perspective of urban planning, Crooks et al. (2015) provided insights into urban form and function powered by crowdsourced data and explained how to conduct implicit function classification based on POIs with the LDA topic model at three scales (buildings, streets, and neighborhoods). However, the LDA topic model considers only the frequencies of POIs and neglects their inner spatial correlations, so Yao et al. (2017) used a deep learning model (Google Word2Vec) to identify functions by considering the high-dimensional features of POIs at the level of traffic analysis zones. In detecting functional areas, social media check-in data attract more attention than POIs because it is challenging to match human movements consistently with POIs alone. However, as mentioned earlier, not all social media data come with a POI reference. Zhou and Zhang (2016) trained a support vector machine (SVM) classifier on tweets with Foursquare venues and applied it to all the social media data to evaluate the content of activities. Liu et al. (2017) integrated the topic model with SVM for classification while including remote sensing to extract urban functional areas. Without predefining categories, Zhi et al. (2016) built a model based on low-rank approximation (LRA) to detect functional regions and their temporal patterns with a large social media check-in data set covering one year. To identify functional areas at the building level, Chen et al. (2017) applied a dynamic time warping (DTW) distance with k-medoids to perform time series clustering.
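The DTW distance mentioned above can be illustrated with the standard dynamic-programming recurrence. This sketch shows only the distance computation, not the k-medoids clustering step, and the hourly activity profiles are made-up values rather than data from the cited study.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two numeric sequences (absolute-difference cost)."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible warping moves.
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

# Hypothetical activity profiles for three buildings.
office = [0, 1, 5, 5, 1, 0]
office_shifted = [0, 0, 1, 5, 5, 1, 0]  # same shape, delayed by one step
residential = [5, 1, 0, 0, 1, 5]        # opposite daily rhythm

same_shape = dtw_distance(office, office_shifted)
different_shape = dtw_distance(office, residential)
```

Because DTW tolerates shifts in time, two buildings with the same rhythm but slightly different peak hours remain close, which a pointwise Euclidean distance would not guarantee.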
On top of all these, a group of researchers calculated the mixture of functions to evaluate urban vibrancy through Shannon entropy, spatial entropy, and Hill numbers (Yue et al. 2017).
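Two of the functional-mix measures named above are simple to compute from POI category counts. The sketch below assumes the Hill number of order 1, which equals the exponential of Shannon entropy; the order used in the cited study is not stated here, and the POI lists are invented for illustration.

```python
import math
from collections import Counter

def shannon_entropy(categories):
    """Shannon entropy (natural log) of the category distribution in an area."""
    counts = Counter(categories)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def hill_number_order1(categories):
    """Hill number of order 1: the 'effective number' of equally common categories."""
    return math.exp(shannon_entropy(categories))

# Hypothetical POI categories observed in two areas.
mixed_area = ["shop", "cafe", "office", "home"]
single_use_area = ["home", "home", "home", "home"]

mixed_diversity = hill_number_order1(mixed_area)
single_diversity = hill_number_order1(single_use_area)
```

An area with four equally common categories has an effective diversity of four, whereas a single-use area has an effective diversity of one, which makes the Hill number easier to interpret than raw entropy.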
Despite the advantages of crowdsourced data in delineating functional areas, the biases that arise when geospatial data are used to analyze urban activity have also been highlighted. Because these data rely heavily on mobile phone devices, night activities cannot be captured when the mobile phone is powered off.

Event Detection: Crowd-Based Monitoring
Crowdsourced data, especially social media data, are generated at high frequency and describe what is happening in which parts of the city (Xia et al. 2015). Therefore, multiple crowdsourced data streams have been collected to detect and depict local or emergency events (see Table 9).
When information from crowdsourced data is extracted and aggregated, it is not enough to collect only the location and timestamp. Instead, it is essential to capture information from the content, including text, tags, or images. Chen and Roy (2009) exploited event-related tags from annotated photos and then grouped photos based on tag usage occurrence. Multiple events could be identified in association with temporal and locational attributes. To reduce the workload of preselecting event-relevant tweets, Walther and Kaisser (2013) preselected posts based on geographical and temporal proximity and introduced a machine learning algorithm to evaluate whether detected events happened in the real world. As one type of event detection, some studies have also attempted to identify traffic anomalies with crowdsourced data. For example, Pan et al. (2013) used a traditional GPS trajectory data set to detect changes in drivers' routines and fused it with social media data related to traffic anomalies to conduct an in-depth temporal analysis.
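Preselection by geographical and temporal proximity, as described above, can be sketched as grouping posts that are close in both space and time. This is an illustrative simplification and not the pipeline of the cited study: the thresholds (`max_km`, `max_minutes`, `min_posts`), the planar coordinates in kilometers, and the sample posts are all assumptions, and the subsequent machine learning step is omitted.

```python
import math

def candidate_events(posts, max_km=1.0, max_minutes=30, min_posts=3):
    """Group posts linked by spatial and temporal proximity; keep groups with enough posts.

    Each post is (x_km, y_km, minute). Returns lists of post indices.
    """
    parent = list(range(len(posts)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, (xi, yi, ti) in enumerate(posts):
        for j in range(i + 1, len(posts)):
            xj, yj, tj = posts[j]
            if math.hypot(xi - xj, yi - yj) <= max_km and abs(ti - tj) <= max_minutes:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(posts)):
        groups.setdefault(find(i), []).append(i)
    return sorted(g for g in groups.values() if len(g) >= min_posts)

# Hypothetical posts: three close together in space and time, one far away, one much later.
posts = [(0.0, 0.0, 0), (0.2, 0.1, 5), (0.1, 0.3, 12), (8.0, 8.0, 10), (0.1, 0.1, 300)]
events = candidate_events(posts)
```

The pairwise comparison is quadratic in the number of posts, so a real system would likely index posts by grid cell and time window before linking them.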
Another branch of study under this analysis is emergency event management. Web users can be seen as social sensors who provide additional information when emergency events happen. To estimate the location of a specific emergency event, Sakaki et al. (2010) examined different methods and found that particle filtering works better in estimating the epicenter of earthquakes. Crooks et al. (2013) built a similar sensor system with social media data and used the signal-to-noise ratio to detect the epicenter and impact area. Shifting from the physical location of emergency events to public opinion, Xu et al. (2016) conducted semantic analyses to extract the main topics from related social media.

Sociodemographic and Perception Analysis: City Attractiveness, Demographic Characteristics, and Sentiment Detection
The analyses summarized in the previous section are oriented toward the spatiotemporal variation of urban activity. Because crowdsourced data are collected at the individual level, they carry multiple features of data subjects beyond spatial and temporal information. In general, sociodemographic and perception features may be included directly in user profiles and generated content, or they may be hidden in the spatiotemporal preferences of users' activities. These features further allow the exploration of patterns of urban activity and their underlying mechanisms. This section summarizes applications that use features of crowdsourced data beyond geographic information.

City Attractiveness
Evaluating city attractiveness based on crowdsourced data provides insights into several fields, such as urban planning, flow forecasting, transportation, and economics. City attractiveness refers not only to mobility but also to how people experience the city. Focusing on local attractiveness, Girardin et al. (2009) quantified attractiveness by fusing mobile phone data and geotagged photos from Flickr and tracked the evolution of central areas. However, this research could depict only the distribution and density of digital footprints without considering the driving force of attractiveness. To overcome this, Huang et al. (2010) used POIs and GPS trajectories to identify spatiotemporal attractiveness. To quantify city attractiveness more accurately, Sobolevsky et al. (2015) fused multisource data, including Flickr, Twitter, and bank card transactions, to identify foreign visitors and their mobility patterns. Regarding studies on global attractiveness, one representative case is that of Paldino et al. (2015), who analyzed a data set of geotagged photos spanning 10 years by ranking the total number of photographs taken by tourists. It also provided a novel method for defining the home country of a user based on the number of photos taken in different locations.

Social Demographics
Although the total volume of social media data is massive, its users are still a sample rather than representative of the entire population. To understand the sociodemographic background of social media creators, Li et al. (2013) detected the home location of users based on their timelines and linked social media with sociodemographic characteristics from census data by location. However, the linked sociodemographic features are only collective features of a census unit. To detect features such as gender, age, and ethnicity at the individual level, a group of researchers turned to name analysis (see Table 10). Longley et al. (2015) emphasized the relationship between demographic features and the characterization of forename-surname pairs and applied it to demographic classification. In line with this, both Hofer et al. (2015) and Luo et al. (2016) conducted text mining to explore the demographic characteristics of social media users from their profile information and investigated the spatiotemporal characteristics of the resulting patterns. No matter how socioeconomic or demographic features are extracted, these enriched features from crowdsourced data help to understand the urban dynamics of concentration, dispersion, and segregation. In the area of segregation research, Shelton et al. (2015) proposed a methodological framework to group users according to the neighborhoods they frequently visited and explored the sociospatial mobilities between those groups in Louisville, Kentucky. Davis et al. (2019) used Yelp reviews to infer the home or work locations of reviewers by mining location-related context and linked them to census data to identify the segregation of urban consumption in New York City. Focusing on internal migration, Fiorio et al. (2017) mined long-period data from Twitter and explored the characteristics of demographic mobility, which helps to understand long-term migration.
Because of the high dimensionality of crowdsourced data, more studies focusing on age, gender, sexuality, consumption power, economic status, and other identities are likely to be produced, with such attributes extracted accurately from unstructured information (Shelton et al. 2015).

Sentiment Analysis
Content, as the central part of social media data, has become a growing research subject in recent years as applications of natural language processing (NLP) techniques have increased (see Table 11). With NLP technologies, contextual information can be extracted and analyzed to detect the content of activities and public sentiment. The increasing interactions between citizens and online social media have provided an opportunity to conduct sentiment analysis for a better understanding of urban human geography. Quercia et al. (2012) detected the sentiment variance in different areas of London and found a positive correlation between sentiment and socioeconomic well-being. In one of the most comprehensive studies, Mitchell et al. (2013) investigated the correlations between expressed sentiment (happiness) on Twitter and emotional and demographic characteristics across all 50 US states. This study provides a novel methodology by using the Mechanical Turk word list that scores the average happiness of each word. Similarly, Frank et al. (2013) applied the same assessment tool to examine the relationship between happiness and the patterns of life in the US. To explore the relationship between sentiment and socioeconomic parameters, Guo et al. (2016) conducted unigram-based sentiment analysis with geotagged tweets for different sociodemographic groups and found that the number of jobs, the number of children, and transportation availability can explain the sentiment variations well. However, the content of social media is not just text but also emojis used to express users' emotions. To fill this gap, Li et al. (2017) applied the multinomial Naïve Bayes classifier to evaluate these special features. Recognizing the contribution of sentiment analysis to smart governance, Hollander and Hartt (2018) introduced sentiment analysis to investigate the propensity of resident sentiment in declining cities around the US.
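Word-list scoring of the kind described above reduces to averaging per-word scores over the words of a text that appear in the list. The sketch below is a simplified illustration: the `word_scores` values are invented, the 1 to 9 scale is an assumption, and the tokenization is deliberately naive.

```python
def average_happiness(text, word_scores):
    """Average the happiness scores of the words in `text` that appear in `word_scores`.

    Returns None when no word in the text has a score.
    """
    words = [w.strip(".,!?;:\"'()").lower() for w in text.split()]
    scored = [word_scores[w] for w in words if w in word_scores]
    return sum(scored) / len(scored) if scored else None

# Hypothetical per-word happiness scores on an assumed 1 (sad) to 9 (happy) scale.
word_scores = {"love": 8.4, "park": 7.1, "traffic": 3.3, "hate": 2.3}

happy_post = average_happiness("Love this park!", word_scores)
unhappy_post = average_happiness("I hate traffic", word_scores)
unscored_post = average_happiness("Meeting at noon", word_scores)
```

Aggregating these per-post averages by area or by time window gives the kind of regional happiness indicator that can then be compared with census variables.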

Potential Challenges of Crowdsourced Data
The review above shows that crowdsourced data have become widely used in the field of urban activity analysis. Although the studies described in this paper highlight their advantages, crowdsourced data also bring challenges and difficulties that need to be clarified and tackled, such as the challenges involved in data collection, data processing, and analysis. The first challenge concerns representativeness, which is frequently raised in connection with biases (Huang and Wong 2016; Liu et al. 2016a). Although the proliferation of crowdsourced data is obvious, the users who generate this new form of data are a relatively small group compared with the overall population that needs to be studied (and even smaller percentages are represented when studies focus on geotagged data). For instance, approximately 1% of tweets worldwide are geotagged, i.e., include location information (Morstatter et al. 2013). The problem of representativeness consequently brings another problem relevant to statistical analysis, where appropriate sampling is needed for valid inference. This is because the collection of crowdsourced data is completed automatically through APIs. Hence, some data may be oversampled or undersampled. Another concern and source of bias is the reliability of crowdsourced data, since the data are generated by individuals who may upload false or fake information on social media or collaborative websites. For example, the location tagged on social media can be any place in the world. Another challenge is linked with processing data that originate from multiple sources (Li et al. 2016b); according to the review, one identified trend is that researchers are attempting to fuse and integrate different types of data. However, the data formats and the structure of the metadata differ.
Specifically, researchers need to convert files such as CSV, KML, KMZ, AML, TXT, and JSON into a uniform format before conducting analysis. Adding to this is the danger of merging crowdsourced data sets of unknown granularity during data fusion. When it comes to analysis, Liu et al. (2016a) also point out that this new type of data faces a methodological challenge, since traditional approaches are limited in their ability to fully leverage the value of crowdsourced data because of its volume, granularity, structure, and so forth.
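The format-conversion step mentioned above can be sketched for two of the listed formats. The example below reads point records from a CSV string and from a GeoJSON-style JSON string and maps both onto one record layout; the column names `lat` and `lon`, the helper names, and the sample coordinates are illustrative assumptions.

```python
import csv
import io
import json

def from_csv(text):
    """Parse CSV rows with 'lat' and 'lon' columns into uniform point records."""
    return [
        {"lat": float(row["lat"]), "lon": float(row["lon"]), "source": "csv"}
        for row in csv.DictReader(io.StringIO(text))
    ]

def from_geojson(text):
    """Parse GeoJSON point features into the same uniform record layout."""
    records = []
    for feature in json.loads(text)["features"]:
        lon, lat = feature["geometry"]["coordinates"][:2]  # GeoJSON order is lon, lat
        records.append({"lat": float(lat), "lon": float(lon), "source": "geojson"})
    return records

# Hypothetical inputs from two different platforms.
csv_text = "lat,lon\n51.5,-0.12\n"
geojson_text = json.dumps({
    "type": "FeatureCollection",
    "features": [{
        "type": "Feature",
        "properties": {},
        "geometry": {"type": "Point", "coordinates": [-0.13, 51.51]},
    }],
})

merged = from_csv(csv_text) + from_geojson(geojson_text)
```

Keeping a `source` field on every record is one simple way to retain provenance, which helps when the fused data sets differ in granularity or reliability.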
The studies reviewed in this paper have made various attempts to eliminate these biases and have leveraged crowdsourced data in specific fields with appropriate methods, whether by engaging data fusion or by applying mixed-method research. This paper argues that crowdsourced data mining has provided an unexpected opportunity to produce novel and meaningful research on urban activity.

Summary
This paper conducted a systematic review of studies in crowdsourced data mining for urban activity analysis. While there is no standard definition of crowdsourced data (Crooks et al. 2015), they can be explained as types of data that are collected from the crowd actively and passively (contributed by the user based on the terms of service) through the interaction between citizens and ICT-supported services. Following a coordinated bottom-up process, crowdsourced data contain rich spatial, temporal, sociodemographic, and perception information related to urban activity, providing opportunities to gain insights into urban dynamics from the perspective of the public. In the era of big data, crowdsourced data have advantages due to their massive volume, accessibility, fine granularity, real-time availability, and high frequency. Given these characteristics, an increasing number of studies, varying in nature and scope, conduct urban activity analysis using the main crowdsourced data sources, namely social media, POI data, and collaborative websites, with different content such as text, images, tags, profiles, and so forth, and each data source has its advantages in specific domains.
This review highlighted the application of crowdsourced data to spatial analysis, including mobility patterns, functional areas, and event detection, with representative studies. The high-volume spatiotemporal information makes it possible for mobility analysis to explore dynamic flows of urban activity rather than static distributions. Moreover, the content of crowdsourced data is used to identify the purpose of movement and functional areas, which leads to activity-based analysis. Other content, such as text, tags, and images, provides crowd-based information for event detection and management. In addition, this review examined the application of sociodemographic and perception analysis and outlined the possibilities of crowdsourced data mining. Three main fields, city attractiveness, demographic characteristics, and sentiment analysis, were identified. By reviewing the various applications listed previously, it was found that crowdsourced data support the shift from static analysis to human dynamic analysis in the field of urban studies. This also provides building blocks for real-time modeling and dynamic simulation in the future.
Potential challenges, such as the biases of crowdsourced data, are discussed at the end of this review. Researchers should recognize problems in data collection, data processing, and analysis, i.e., representativeness, coverage bias, and heterogeneity of data formats. These problems need to be tackled by eliminating irrelevant content, fusing multisource data, introducing data cleaning algorithms, and integrating both qualitative and quantitative methods. While there are concerns and challenges about crowdsourced data, it is important to value how such new forms of data can be explored and leveraged to reveal the spatial, temporal, sociodemographic, and perception characteristics of urban activity, and to recognize that a new data-driven urban analysis, involving GIScience, human geography, urban studies, and data science, has developed during the era of the digital data revolution.

Data Availability Statement
Some or all data, models, or code generated or used during the study are available from the corresponding author by request, such as the network file of cocitation of authors and the network file of cocitation of references.