Data-Driven Analysis of Borobudur Ticket Sentiment Using Naïve Bayes

The recent growth of social media is hugely influential and plays a significant role in various aspects of people's lives in the digital era. Twitter is a social media network that is widely used in Indonesia. Twitter users can engage in multiple activities, such as communicating with individuals and groups, writing about daily activities, promoting businesses, arguing, and expressing ideas about a topic of discussion. At the beginning of June 2022, the plan to raise the entrance charge for Borobudur Temple caused a lot of conversation both in the real world and on social media platforms, including Twitter, and drew various pro and con reactions in the community. This study analyzes public sentiment toward the planned increase in ticket prices for Borobudur Temple. Sentiment analysis of Twitter data can be implemented using a classification algorithm. The classification algorithms widely used in sentiment analysis research are Naïve Bayes (NB) and Decision Tree (DT). Naïve Bayes and Decision Tree were chosen because they are among the most popular algorithms for text classification: they are simple, efficient, and perform well. The dataset for this study was collected from Twitter. Based on the evaluation of the test results, the Naïve Bayes method produces the highest accuracy, 100%, while the Decision Tree method yields a test accuracy of 35.97%.


Introduction
The development of social media has recently been very influential and has played a significant role in various aspects of people's lives in the digital era [1] [2] [3] [4] [5]. A Statista report shows 18.45 million users of Twitter, the application founded by Jack Dorsey, in Indonesia as of January 2022. This figure ranks Indonesia fifth in the world by number of Twitter users. Through Twitter, users can carry out various activities, including communicating with individuals and groups, writing about daily activities, promoting business, arguing, and expressing opinions related to a topic of discussion. Uploads are status messages (commonly called tweets) limited to 280 characters. Tweets written by users can be used to analyze public sentiment toward the issues discussed, because they contain sentiments that can serve as a source of evaluation and consideration in making a decision [6] [7].
Various studies have been conducted to analyze public opinion using data from Twitter. Examples include Twitter sentiment analysis of public figures [8], film reviews, the performance of the House of Representatives, a review of National BMKG Twitter data [9], GoFood sentiment analysis [10], a word-cloud-based review of Shopee e-commerce [11], and Indonesian-language Twitter sentiment analysis of the PeduliLindungi app [12]. Responses written on Twitter can be regarded as sentiment and categorized into three classes: responses that support an event are called positive sentiment, responses that oppose or reject an event are called negative sentiment, and responses that neither support nor reject it are called neutral. In this study, the reactions that emerged among the Indonesian people, especially on Twitter, are interesting to analyze. Sentiment analysis on Twitter data can be implemented using classification algorithms. Classification algorithms widely used in sentiment analysis research are Naïve Bayes (NB) and Decision Tree (DT).
This study aims to classify sentiment values and determine the class of tweet data from Twitter users related to the Borobudur Temple ticket price increase using the Naïve Bayes and Decision Tree algorithms. The result is the percentage of public sentiment toward the discourse on increasing Borobudur Temple ticket prices. This research is expected to benefit both the authors and readers by giving a picture of Twitter users' sentiment toward the plan to increase Borobudur Temple tourist ticket prices. The data used in this study are Indonesian-language tweets about the Borobudur Temple ticket price increase, obtained by crawling Twitter data. The number of tweets used is 591, with data taken in June 2022. The contribution of this study is an analysis of public sentiment toward the discourse of increasing Borobudur Temple ticket prices using the Naïve Bayes and Decision Tree methods.

Research Method
The stages of the research method serve as a framework and guide so that the research process can be carried out in a directed, orderly, and systematic manner. The method proposed in this study is shown in Figure 1.

Data Collection
The first stage is collecting data from Twitter, also called data crawling. Crawling data on Twitter is the process of downloading data in the form of users or tweets from Twitter servers with the help of the Application Programming Interface (API) (Eka Sembodo et al., 2016). In this process, the keyword used was "Borobudur ticket increase", which yielded a total of 591 tweets.

Data Preprocessing
Data preprocessing is one of the essential stages in text mining (Alasadi & Bhaya, 2017). It converts raw data into ready-to-use, more structured data through cleaning and standardization, because collected data is usually still dirty. The preprocessing stages applied in this study are normalization, case folding, cleansing, tokenization, stopword removal, transform cases, and labeling. The purpose of these stages is to produce clean tweet data so that the Naïve Bayes and Decision Tree algorithms can be tested more optimally. An illustration of the data preprocessing process is shown in Figure 2. Data preprocessing has seven stages:

Normalization
Normalization is a process that aims to correct words that contain writing or spelling errors and to expand words that are written in abbreviated form.
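As an illustration, normalization of abbreviated words can be sketched as a dictionary lookup. The abbreviation map below is hypothetical; the study does not list the dictionary it used, and the function name `normalize` is chosen here only for the example.

```python
# Hypothetical abbreviation map (Indonesian informal spellings); the
# dictionary actually used in the study is not given in the text.
NORMALIZATION_MAP = {
    "yg": "yang",
    "tdk": "tidak",
    "hrg": "harga",
}

def normalize(text, mapping=NORMALIZATION_MAP):
    """Replace every whitespace-separated word that appears in the mapping."""
    words = text.split()
    return " ".join(mapping.get(word.lower(), word) for word in words)

print(normalize("hrg tiket yg baru tdk masuk akal"))
```

Words that are not in the map pass through unchanged, so the quality of this step depends entirely on how complete the dictionary is.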

Case Folding
Case folding is a technique for replacing the capital letters found in the data with lowercase letters.

Cleansing
Cleansing is a process that aims to eliminate various information that is not needed in the sentiment analysis process in the form of links (http, https, pic.twitter), hashtags, usernames (written @username), and other special characters to obtain better analysis results.

Tokenization
Tokenization is the process of breaking each sentence in the data into individual words, using spaces as the delimiter that separates each word.

Stopword Removal
Stopwords are words that carry little meaning on their own, such as "in", "and", "with", and "by". Stopword removal is the process of eliminating these words, which add no valuable meaning to the data to be used.

Transform Cases
Transform cases is the process of converting all letters into lowercase or all capital letters.

Labelling
Labeling processes the data that results from stopword removal: a polarity calculation is applied to each comment so that it can be classified with a positive, negative, or neutral label.
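The text does not specify how polarity is calculated. One common approach, shown here purely as an assumption, is to sum word scores from a sentiment lexicon and map the sign of the total to a label. The lexicon entries and the function name `label_tweet` are hypothetical.

```python
# Hypothetical sentiment lexicon (word -> polarity score); not taken from the study.
LEXICON = {
    "setuju": 1, "bagus": 1, "wajar": 1,
    "mahal": -1, "tolak": -1, "kecewa": -1,
}

def label_tweet(tokens, lexicon=LEXICON):
    """Sum the polarity of known words and map the sign to a label."""
    score = sum(lexicon.get(token, 0) for token in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(label_tweet(["harga", "tiket", "mahal"]))   # one negative word
print(label_tweet(["setuju", "harga", "wajar"]))  # two positive words
print(label_tweet(["tiket", "borobudur"]))        # no lexicon words
```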

Data Splitting
At this stage, the labeled dataset is divided into training data and test data using a ratio of 80% training data to 20% test data. This ratio is used because previous research obtained good results with it. The total tweet data is 367 data, which yields 294 training data and 73 test data.

Naive Bayes Classification
Naive Bayes is a classification method that calculates the probability of class membership from an available dataset [13]. According to [14], Naive Bayes is a classifier that predicts future probabilities using probability and statistical methods based on previous experience. One advantage of the Naive Bayes algorithm is that it requires only a small amount of training data to estimate the parameters needed for classification [15]. According to [16], Naive Bayes is also easy to implement and gives good results in many cases; its disadvantage is that it treats the features as independent of one another, whereas in reality relationships between features exist and cannot be modeled by the Naive Bayes classifier. The classification is based on Bayes' theorem: Naïve Bayes assumes that the effect of an attribute value on a particular class is independent of the values of the other attributes. The process begins by entering the training data. The formula of the Naive Bayes algorithm is shown in equation (1):

$$P(H \mid X) = \frac{P(X \mid H)\,P(H)}{P(X)} \tag{1}$$

Where:
X: Sample data with unknown class (label)
H: The hypothesis that X belongs to class (label) C
P(H|X): The probability that the hypothesis is correct (valid) for the observed sample data X
P(X|H): The probability of sample data X, when it is assumed that the hypothesis is valid
P(H): The probability of hypothesis H
P(X): The probability of the observed sample data X

The probability that word i belongs to a particular category or class is then calculated using equation (2).
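The classification rule can be illustrated with a minimal sketch in plain Python. It assumes a multinomial word-count model with add-one (Laplace) smoothing for the per-word probabilities, which is a common choice but is not confirmed by the text. The training tweets are made up, and the names `train_nb` and `predict_nb` are chosen only for this example.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Estimate log P(H) and Laplace-smoothed log P(word | H) from token lists."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    log_prior = {c: math.log(n / len(docs)) for c, n in class_counts.items()}
    log_like = {}
    for c in class_counts:
        # Assumption: add-one smoothing over the vocabulary size.
        denom = sum(word_counts[c].values()) + len(vocab)
        log_like[c] = {w: math.log((word_counts[c][w] + 1) / denom) for w in vocab}
    return log_prior, log_like, vocab

def predict_nb(tokens, model):
    """Return the class with the highest posterior score."""
    log_prior, log_like, vocab = model
    scores = {}
    for c in log_prior:
        # P(X) is identical for every class, so it is omitted from the comparison.
        scores[c] = log_prior[c] + sum(log_like[c][w] for w in tokens if w in vocab)
    return max(scores, key=scores.get)

# Toy, made-up training data (not from the study's dataset).
docs = [
    ["harga", "mahal", "tolak"],
    ["tiket", "mahal", "kecewa"],
    ["setuju", "harga", "wajar"],
    ["dukung", "setuju", "lestari"],
]
labels = ["negative", "negative", "positive", "positive"]
model = train_nb(docs, labels)
print(predict_nb(["tiket", "mahal"], model))
print(predict_nb(["setuju", "wajar"], model))
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied together.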

Decision Tree Classification
Decision Tree is the most widely used algorithm for classification problems [17]. The Decision Tree algorithm is powerful, popular, logic-based, and easy to understand [18]. An interesting feature of the Decision Tree is its use of a tree structure to represent the rules formed from the classification results [19]. Decision Tree is a supervised machine learning method, a learning process in which new data is classified based on existing training samples. Information gain is used to decide which attribute of the dataset (S) becomes the root node or a branch node: the attribute with the highest gain value is chosen. Information gain can be computed using entropy, the Gini index, or the classification error. The attribute of the training data with the maximum gain value is first selected as the root of the decision tree. The search for attributes that become branches is then repeated until a leaf is reached, which is the class label. Entropy is an expression for measuring the uniformity of the sample data (S) with respect to an attribute (A).
The measures used are Entropy(S_i), information gain with the Gini index value, and information gain with the classification error, where the classification error value is obtained from the smallest attribute value of the label class. A decision tree has roots/nodes, branches, and leaves like a tree.
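For reference, the usual textbook definitions of these measures are given below. They are stated as an assumption about the forms used in the study. Here $p_i$ denotes the proportion of data in $S$ that belongs to class $i$, $k$ is the number of classes, and $S_v$ is the subset of $S$ for which attribute $A$ takes the value $v$.

```latex
\begin{align}
\mathrm{Entropy}(S) &= -\sum_{i=1}^{k} p_i \log_2 p_i \\
\mathrm{Gain}(S, A) &= \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v) \\
\mathrm{Gini}(S) &= 1 - \sum_{i=1}^{k} p_i^{2} \\
\mathrm{Error}(S) &= 1 - \max_{i}\, p_i
\end{align}
```

When the Gini index or the classification error is used instead of entropy, the gain is computed in the same way, with that measure substituted for $\mathrm{Entropy}$ in the second equation.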

Evaluation
The data testing results are evaluated using the Confusion Matrix table, with accuracy as the metric. The Confusion Matrix represents the predictions and actual conditions of the algorithm-generated data. Accuracy is used to determine the ratio of correct predictions to the overall data. Information:
1. True Negative / Actual Negative = actual data that is in the negative class and the model has predicted negative
2. True Positive = actual data that is in the positive class and the model has predicted positive
3. True Neutral / Actual Neutral = actual data that is in the neutral class, and the model has predicted neutral
4. False Negative = actual data that is in the positive or neutral class, but the model has predicted negative
5. False Positive / Positive Prediction = actual data that is in the negative or neutral class, but the model has predicted positive
6. False Neutral = actual data that is in the negative or positive class, but the model has predicted neutral
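The definitions above can be sketched as a three-class confusion matrix in which rows are actual classes and columns are predicted classes. Accuracy is then the sum of the diagonal (true positive, true neutral, true negative) divided by the total. The label lists below are made up for illustration, and the function names are chosen only for this example.

```python
from collections import Counter

LABELS = ["positive", "neutral", "negative"]

def confusion_matrix(actual, predicted, labels=LABELS):
    """Rows are actual classes, columns are predicted classes."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]

def accuracy(matrix):
    """Correct predictions (the diagonal) divided by all predictions."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# Made-up labels for illustration only.
actual = ["positive", "neutral", "negative", "negative", "positive"]
predicted = ["positive", "positive", "negative", "neutral", "positive"]

cm = confusion_matrix(actual, predicted)
for label, row in zip(LABELS, cm):
    print(label, row)
print("accuracy:", accuracy(cm))
```

In this toy case one neutral tweet is a false positive and one negative tweet is a false neutral, so three of five predictions are correct.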

Findings

Data Collection
The first stage is collecting data from Twitter, also called data crawling. Crawling data on Twitter is the process of downloading data in the form of users or tweets from Twitter servers with the help of the Application Programming Interface (API) (Eka Sembodo et al., 2016). In this process, the keyword used was "Borobudur ticket increase", which yielded a total of 591 tweets. The following is the process of crawling data at the data collection stage:

Data Preprocessing
Data preprocessing is one of the most important stages in text mining (Alasadi & Bhaya, 2017). It converts raw data into ready-to-use, more structured data through cleaning and standardization, because collected data is usually still dirty. The preprocessing stages applied in this study are cleansing, tokenization, transform cases, stopword removal, filter tokens (by length), and labeling.

Cleansing
Cleansing is a process that aims to eliminate various information that is not needed in the sentiment analysis process in the form of links (http, https, pic.twitter), hashtags, usernames (written @username), numbers, and other special characters to obtain better analysis results.

Processing | Text
Text before cleansing | RT @Bambang_DP: After seeing this video, it seems that the entrance ticket to Borobudur temple so 750k makes sense. https://t.co/atjgGnUdCT
Text after cleansing | After seeing this video, it seems that the entrance ticket to Borobudur temple so k makes sense
Source: Research Results (2023)
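A minimal sketch of the cleansing step using regular expressions follows. The patterns are assumptions, not the operators used in the study. Unlike the study's example, this sketch also strips all punctuation, including commas, and it removes the retweet prefix as the example does.

```python
import re

def cleanse(text):
    """Strip retweet prefixes, links, mentions, hashtags, numbers and symbols."""
    text = re.sub(r"^RT\s+@\w+:\s*", "", text)                  # retweet prefix
    text = re.sub(r"https?://\S+|pic\.twitter\S+", " ", text)   # links
    text = re.sub(r"[@#]\w+", " ", text)                        # usernames and hashtags
    text = re.sub(r"\d+", " ", text)                            # numbers
    text = re.sub(r"[^A-Za-z\s]", " ", text)                    # other special characters
    return re.sub(r"\s+", " ", text).strip()                    # collapse whitespace

tweet = ("RT @Bambang_DP: After seeing this video, it seems that the entrance "
         "ticket to Borobudur temple so 750k makes sense. https://t.co/atjgGnUdCT")
print(cleanse(tweet))
```

Removing the digits from "750k" leaves the stray token "k", which matches the behavior shown in the table above.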

P-ISSN: 2655-8807 Vol. 5 No. 2Sp 2023 E-ISSN: 2656-8888

Tokenization
Tokenization is the process of breaking each sentence in the data into individual words, using spaces as the delimiter that separates each word.

Processing | Text
Text before tokenization | After seeing this video, it seems that the entrance ticket to Borobudur temple so k makes sense
Text after tokenization | After seeing this video, it seems that the entrance ticket to Borobudur temple so k makes sense
Source: Research Results (2023)
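Whitespace tokenization as described can be sketched with Python's built-in `str.split`. The function name `tokenize` is chosen only for this example.

```python
def tokenize(text):
    """Split a sentence into tokens wherever whitespace occurs."""
    return text.split()

sentence = ("After seeing this video, it seems that the entrance ticket "
            "to Borobudur temple so k makes sense")
tokens = tokenize(sentence)
print(tokens)
print(len(tokens))
```

Because only spaces act as separators, punctuation stays attached to the neighboring word (for example "video,") unless it was removed during cleansing.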

Transform Cases
Transform Cases is the process of converting all uppercase letters into lowercase letters so that the same word is matched consistently when it is related to sentiment.

Stopword Removal
At this stage, the operator used is a stopword filter (dictionary) because the dataset is in Indonesian. In this process, a list of stopwords is compiled into a file, which is then uploaded to the stopword filter (dictionary) operator. Irrelevant words such as "after", "this", "it seems", and "so" are removed because they have no meaning of their own when separated from other words and are not related to sentiment-bearing adjectives.
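The two steps above, transform cases followed by dictionary-based stopword removal, can be sketched as follows. The stopword set is hypothetical and uses English words to match the translated example. The study loads an Indonesian stopword file that is not included in the text.

```python
# Hypothetical stopword list matching the translated example;
# the study's Indonesian stopword dictionary is not given.
STOPWORDS = {"after", "this", "it", "seems", "so", "that", "the", "to"}

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Lowercase every token, then drop tokens found in the stopword set."""
    lowered = [token.lower() for token in tokens]  # transform cases
    return [token for token in lowered if token not in stopwords]

tokens = ["After", "seeing", "this", "video", "it", "seems", "that", "the",
          "entrance", "ticket", "to", "Borobudur", "temple", "so", "k",
          "makes", "sense"]
filtered = remove_stopwords(tokens)
print(filtered)
```

Lowercasing before the lookup is what lets "After" match the stopword "after".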

Filter Tokens (by Length)
In this process, words with a length of less than 4 or more than 25 characters are deleted, such as the words "in", "and", "with", "by", and so on, which have no meaning when separated from other words and are not related to sentiment-bearing adjectives.
Table 6. Text Comparison Before and After Process Filter Token (By Length)

Processing | Text
Text before doing filter token (by length) | After seeing this video, it seems that the entrance ticket to Borobudur temple so k makes sense
Text after done filter token (by length) | After seeing the video, it seems that the entrance ticket to Borobudur Temple makes sense
Source: Research Results (2023)
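The length filter can be sketched as a simple comparison against the two thresholds stated above (minimum 4, maximum 25 characters). The function and parameter names are chosen only for this example.

```python
def filter_by_length(tokens, min_len=4, max_len=25):
    """Keep only tokens whose length lies between min_len and max_len inclusive."""
    return [token for token in tokens if min_len <= len(token) <= max_len]

tokens = ["seeing", "video", "entrance", "ticket", "borobudur", "temple",
          "k", "makes", "sense"]
kept = filter_by_length(tokens)
print(kept)
```

With these thresholds the stray token "k" left over from cleansing "750k" is dropped, while every other token in the example is kept.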

Labelling
Labeling processes the data that results from the preceding cleaning steps: a labeling calculation is applied to each comment so that it can be classified with a positive, negative, or neutral label. Of the 591 tweets collected, 367 remain after preprocessing. The following are the labeling results for these 367 tweets.

Data Splitting
At this stage, the labeled dataset is divided into training data and test data using a ratio of 80% training data to 20% test data. This ratio is used because previous research obtained good results with it [20] [21] [22]. The total tweet data is 367 data, which yields 294 training data and 73 test data.
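The 80/20 division can be sketched as below. Shuffling with a fixed seed is an assumption, since the text does not say how the data were ordered before the division. The function name `split_dataset` is chosen only for this example, and integers stand in for the 367 labeled tweets.

```python
import random

def split_dataset(items, test_ratio=0.2, seed=42):
    """Shuffle a copy of the data and cut it into training and test parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # assumption: random shuffle, fixed seed
    n_train = round(len(shuffled) * (1 - test_ratio))
    return shuffled[:n_train], shuffled[n_train:]

data = list(range(367))  # placeholders for the 367 labeled tweets
train, test = split_dataset(data)
print(len(train), len(test))
```

Rounding 80% of 367 (293.6) gives 294 training items and leaves 73 test items, matching the counts reported above.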

Naive Bayes Test Results
The design of the testing process for the Naive Bayes model is as follows:

Figure 6. Naive Bayes Model Testing
Based on the test results above, the results can be seen in the following table:

Evaluation
The data testing results are evaluated using the Confusion Matrix table, with accuracy as the metric. The Confusion Matrix represents the predictions and actual conditions of the data generated by the algorithm used. Accuracy is used to determine the ratio of correct predictions to the overall data.

Based on the confusion matrix table above, using the Naïve Bayes method to classify the tweet data gives an accuracy of 100%, or 367 of 367 data classified correctly, and an error rate of 0% (0 data), meaning there were no detection errors. The positive class data correctly predicted as positive is 132 data. The neutral class data correctly predicted as neutral is 123 data. The negative class data correctly predicted as negative is 112 data.
Based on the confusion matrix table above, using the Decision Tree method to classify the tweet data gives an accuracy of 35.97%, or 132 of 367 data classified correctly, and an error rate of 64.03%, or 235 of 367 data. The positive class data correctly predicted as positive is 132 data. The neutral class data predicted as positive is 123 data. The negative class data predicted as positive is 112 data.
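The reported percentages follow directly from the counts above, assuming accuracy is the number of correctly predicted data (true positive, true neutral, and true negative) divided by the total of 367 data:

```latex
\begin{align}
\mathrm{Accuracy} &= \frac{TP + TNeu + TNeg}{N} \\
\mathrm{Accuracy}_{\text{Naive Bayes}} &= \frac{132 + 123 + 112}{367} = \frac{367}{367} = 100\% \\
\mathrm{Accuracy}_{\text{Decision Tree}} &= \frac{132 + 0 + 0}{367} = \frac{132}{367} \approx 35.97\% \\
\mathrm{Error}_{\text{Decision Tree}} &= \frac{123 + 112}{367} = \frac{235}{367} \approx 64.03\%
\end{align}
```

The Decision Tree accuracy equals the share of the positive class because every one of the 367 data was predicted as positive.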

Conclusion
Based on the analysis conducted in this study, it can be concluded that this study classifies sentiment on tweets with the Text Mining process using the Naïve Bayes and Decision Tree methods.
The Naive Bayes and Decision Tree calculations were based on data obtained by crawling the Twitter API: 591 data were collected and preprocessed into 367 data, of which 80% (294 data) were used as training data and 20% (73 data) as testing data. Based on the evaluation of the test results, the Naive Bayes method produces the highest accuracy, 100%, compared with the Decision Tree method. At the same time, the accuracy results of the Decision Tree method resulted in a test accuracy value of