Today, with the ever-growing use of social networks, volunteered geographic information has grown significantly. Among the various kinds of information, user-generated textual content is usually not shared in a defined structure. One of the main characteristics of this kind of information is that it is place-based. The places people talk about are usually ambiguous and context-dependent. Place functionality, that is, the main activities people carry out at a place, serves as a context in place descriptions and is one of the main distinguishing features of a place. The aim of this research is to extract place functionality by analyzing user-generated textual content shared by users. To this end, places and user reviews of those places are first collected from the TripAdvisor website as textual content, and different natural language processing methods are then used to prepare and preprocess the data. Next, a bag of words is built for each user review, with TF-IDF scores as the values of the feature vector. Then, in a supervised approach, these values and the place functionalities are given as input to a logistic regression classifier to train the model, which is used to predict place functionality on the test data. Evaluation based on a confusion matrix shows that the overall accuracy of the proposed method is about 96 percent, which is a notable figure. The highest precision and F1 score are obtained for food serving places, while vacation rentals have the lowest precision and F1 score because of their functional similarity to hotels. Nevertheless, their results are also reliable and satisfactory.
Extracting Place Functionality from User-Generated Textual Contents Using Machine Learning Methods
1. Introduction
In GIScience, spatial information has usually been represented in terms of space. However, human reasoning, behavior, and perception are mainly based on place, not space. Places are usually ambiguous and context-dependent, and they are tied to human experience of the world. Place functionality, that is, the main activities people carry out at a place, serves as a context in place descriptions and is one of the main distinguishing features of a place. Today, with the growing use of social networks, the volume of volunteered geographic information (VGI) and crowdsourced information has grown significantly. However, information obtained from social networks, e.g. check-ins, often does not give a complete and clear view of the concept of place, and it does not include spatial information relating phenomena, land uses, and points of interest (POIs). This ultimately limits how useful such information is for working with the concept of place. GIS should therefore be able to detect place functionality, which is not necessarily stated simply and clearly in the stored data.
2. Materials and Methods
To address these issues, this paper aims to extract place functionality by analyzing user-generated textual content. To this end, places and user reviews of those places were first collected from the TripAdvisor website through web crawling. The advantage of these data over other place-based data is their independence from formal descriptions of place. The data were collected in October 2020, and only English reviews were considered. New York City (NYC) was selected as the case study area. For each place type, we first extracted all corresponding places; then, for each place, we extracted up to 1000 top reviews. To prepare the data, we removed places without geographic coordinates, places outside the study area, duplicates, and places of unknown type. TripAdvisor has five place categories: Attraction, Food Serving Place, Hotel, Shop, and Vacation Rental. Different natural language processing (NLP) methods were then used to preprocess the reviews. Each review was converted to lower case and tokenized, punctuation and stop words were removed, and all tokens were stemmed and lemmatized. Next, suitable features had to be selected for knowledge discovery. We used a bag-of-words (BoW) feature selection method in which the feature values for each user review are weighted by TF-IDF scores. Finally, in a supervised setting, a logistic regression classifier was trained on these values together with the place functionality labels and then used to predict place functionality on the test dataset.
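The preprocessing and TF-IDF weighting steps described above can be sketched in plain Python. This is a minimal illustration and not the authors' implementation: the stop-word list, the whitespace tokenizer, the smoothed IDF formula, and the three sample reviews are all assumptions made for the example, and stemming and lemmatization are left out to keep it short.

```python
import math
import string
from collections import Counter

# Tiny illustrative stop-word list (an assumption; the paper does not name one).
STOP_WORDS = {"the", "a", "an", "and", "was", "is", "to", "of", "in", "we", "at", "for", "our"}


def preprocess(review):
    """Lowercase, strip ASCII punctuation, tokenize on whitespace, drop stop words."""
    cleaned = review.lower().translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in cleaned.split() if tok not in STOP_WORDS]


def tfidf_vectors(token_lists):
    """Return (vocabulary, one TF-IDF vector per review)."""
    vocab = sorted({tok for tokens in token_lists for tok in tokens})
    n_docs = len(token_lists)
    # Document frequency: number of reviews that contain each term.
    df = Counter(tok for tokens in token_lists for tok in set(tokens))
    # Smoothed IDF, one common variant (chosen here as an assumption).
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        total = len(tokens) or 1  # guard against empty reviews
        vectors.append([counts[t] / total * idf[t] for t in vocab])
    return vocab, vectors


# Hypothetical reviews, invented for illustration only.
reviews = [
    "The pizza was great and the pasta was tasty!",
    "We stayed at the hotel; the room was clean.",
    "Great pizza, great service.",
]
tokens = [preprocess(r) for r in reviews]
vocab, vectors = tfidf_vectors(tokens)
```

Each row of `vectors` is the feature vector of one review. Terms that occur in fewer reviews receive a larger weight, so a word such as "pasta" counts for more than "pizza" in the first review above. These vectors, paired with the place category of each review, are what a classifier would be trained on.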
3. Results and Discussion
We randomly assigned 75% of the data set to training the model and 25% to testing it. The results were evaluated with common machine learning evaluation measures computed from a confusion matrix. The evaluation shows that the overall accuracy of the proposed method is about 96%, which is remarkable. For Food Serving Places, the predictions are very close to reality: the algorithm correctly predicted 98% of them, and about 0.8% were labeled as Attractions. For Hotels, the accuracy is 97%, although about 1.8% of Hotels are incorrectly categorized as Food Serving Places. Attractions are correctly predicted in 93% of cases, and about 3.8% of them are mistaken for Food Serving Places. For Shops, the accuracy is about 74%. One reason is that fewer reviews are available for this functionality, although weighting the samples partially resolved this issue. A second reason is that people often visit shopping malls for entertainment and not just for shopping, which led to about 15% of Shops being classified as Attractions. In addition, about 11% of Shops are classified as Food Serving Places. One of the most important reasons is that buying food in these places is itself a kind of purchase; moreover, some shopping malls contain places that serve food and drink. Because Vacation Rentals had fewer reviews than the other functionalities, they show the lowest accuracy (about 65%). In 25% of cases, Vacation Rentals are classified as Hotels. This result is not surprising, as Vacation Rentals and Hotels are very similar in function and are both commonly used to accommodate travelers and tourists. A further 4.8% and 4.6% of Vacation Rentals are classified as Attractions and Food Serving Places, respectively.
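The per-class percentages above read as rows of a confusion matrix normalized by the number of true samples in each class. The following sketch shows a random 75/25 split and that row normalization in plain Python. The function names, the fixed seed, and the toy labels are hypothetical and chosen only for illustration; the paper does not describe its tooling.

```python
import random

# The five TripAdvisor categories named in the text.
LABELS = ["Attraction", "Food Serving Place", "Hotel", "Shop", "Vacation Rental"]


def train_test_split(samples, test_fraction=0.25, seed=0):
    """Shuffle a copy of the samples and split off the test fraction."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def confusion_matrix(y_true, y_pred, labels):
    """Count predictions: rows are true classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def row_normalize(matrix):
    """Turn each row into fractions of that true class (0.0 for empty rows)."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([c / total if total else 0.0 for c in row])
    return normalized


# Split 100 hypothetical sample ids into 75 training and 25 test ids.
train_ids, test_ids = train_test_split(range(100))

# Toy labels, invented for illustration: one of four Shops is predicted as an Attraction.
y_true = ["Shop", "Shop", "Shop", "Shop", "Hotel", "Hotel"]
y_pred = ["Shop", "Shop", "Shop", "Attraction", "Hotel", "Hotel"]
rates = row_normalize(confusion_matrix(y_true, y_pred, LABELS))
```

In this toy case the Shop row of `rates` holds 0.75 on the diagonal and 0.25 in the Attraction column, which is the same kind of reading as "about 15% of Shops being classified as Attractions" in the text.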
The highest precision and F1-score are achieved for Food Serving Places, while Vacation Rentals show the lowest precision and F1-score because their functionality is similar to that of Hotels. Nevertheless, their results are also reliable and satisfactory.
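Precision, recall, and F1-score can all be derived from a confusion matrix whose rows are true classes and whose columns are predicted classes. The sketch below uses a hypothetical two-class matrix with invented counts; it does not reproduce the paper's results and only illustrates why confusion between two similar classes lowers their scores.

```python
def per_class_scores(matrix, labels):
    """Compute precision, recall, and F1 for each class of a confusion matrix."""
    scores = {}
    for i, label in enumerate(labels):
        tp = matrix[i][i]                           # correctly predicted as this class
        fn = sum(matrix[i]) - tp                    # this class predicted as another
        fp = sum(row[i] for row in matrix) - tp     # other classes predicted as this one
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        scores[label] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


def overall_accuracy(matrix):
    """Fraction of all samples that lie on the diagonal."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total if total else 0.0


# Hypothetical counts, invented for illustration (rows: true, columns: predicted).
labels = ["Hotel", "Vacation Rental"]
matrix = [
    [90, 10],  # true Hotels
    [30, 70],  # true Vacation Rentals, 30 of them predicted as Hotels
]
scores = per_class_scores(matrix, labels)
accuracy = overall_accuracy(matrix)
```

With these invented counts, the Vacation Rentals predicted as Hotels reduce the recall of Vacation Rental and the precision of Hotel at the same time, which is the mechanism behind the lower scores reported for functionally similar categories.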
4. Conclusion
In this study, we extracted place functionality by analyzing user-generated textual content shared by users on the TripAdvisor website. To this end, different NLP methods were used to prepare and preprocess the data. The bag-of-words representation constructed for each user review was then fed to a logistic regression classifier, and place functionality was predicted on the test data. In future work, the effectiveness of other feature selection methods and other classifiers for extracting place functionality can be evaluated and compared. In addition, place functionality should be extracted at a finer level of detail, so that different types of attractions can be distinguished.