DATA PREPARATION
1. SENTIMENT ANALYSIS USING DECISION TREES
The dataset used is a reviews dataset containing many review-related columns, such as 'title', 'length', and 'rating'.
(a) Data Cleaning and Processing
A subset of this dataset is used for further processing: 'text' contains the reviews, 'summary' contains the title or summary of each review, and 'rating' is the rating given by the customer. The ratings are mapped to a sentiment label, where 1 denotes positive sentiment and 0 denotes negative sentiment. Ratings above 3 receive the positive (1) label and ratings below 3 receive the negative (0) label.
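The rating-to-label mapping can be sketched in a few lines of plain Python. This is only an illustration: the function name `rating_to_sentiment` and the sample rows are hypothetical, and because the text does not say what happens to a rating of exactly 3, the sketch assumes such rows are simply left out.

```python
def rating_to_sentiment(rating):
    """Map a numeric rating to a sentiment label.

    Returns 1 for ratings above 3 and 0 for ratings below 3.
    Assumption: a rating of exactly 3 has no label here, so None is returned.
    """
    if rating > 3:
        return 1
    if rating < 3:
        return 0
    return None


# Hypothetical sample rows using the 'text', 'summary', and 'rating' columns.
reviews = [
    {"text": "Works perfectly, very happy.", "summary": "Great", "rating": 5},
    {"text": "Stopped working after a week.", "summary": "Poor", "rating": 1},
    {"text": "It is okay.", "summary": "Average", "rating": 3},
]

# Keep only rows that received a label (assumption: neutral rows are dropped).
labeled = []
for row in reviews:
    label = rating_to_sentiment(row["rating"])
    if label is not None:
        labeled.append({**row, "sentiment": label})

for row in labeled:
    print(row["rating"], "->", row["sentiment"])
```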
The data cleaning process includes:
- Lowercasing: All text is converted to lowercase to standardize the data.
- Removing Special Characters: Special characters and numbers are removed, as they are often not useful for sentiment analysis.
- Tokenization: The text is broken down into individual words or tokens.
- Removing Stopwords: Commonly used words (e.g., "the", "is", "in") are removed since they usually don't carry sentiment.
- Aggregating Tokens: The tokens are rejoined into a processed string for vectorization.
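The cleaning steps listed above might look roughly like the following sketch. It uses only the Python standard library. The function name `clean_text`, the whitespace-based tokenization, and the very small `STOPWORDS` set are assumptions made for illustration; the original pipeline may well use a dedicated tokenizer and a fuller stopword list.

```python
import re

# Assumption: a tiny illustrative stopword list, not the one used in the project.
STOPWORDS = {"the", "is", "in", "a", "an", "and", "it", "this", "of", "to", "was"}


def clean_text(text):
    """Apply the five cleaning steps to a single review string."""
    # 1. Convert to lowercase.
    text = text.lower()
    # 2. Replace everything except letters and whitespace (this drops
    #    special characters and numbers).
    text = re.sub(r"[^a-z\s]", " ", text)
    # 3. Tokenize on whitespace (a simple stand-in for a real tokenizer).
    tokens = text.split()
    # 4. Remove stopwords.
    tokens = [token for token in tokens if token not in STOPWORDS]
    # 5. Rejoin the tokens into one string, ready for vectorization.
    return " ".join(tokens)


print(clean_text("The battery is GREAT, lasted 3 days in the field!"))
# prints: battery great lasted days field
```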
Dataset after Cleaning and Preprocessing

(b) Training and Testing Dataset
Why a Disjoint Split? Keeping the training and testing data disjoint ensures that the model is evaluated on data it has not seen, which approximates how it would perform in real-world applications. It helps detect overfitting, where the model performs well on the training data but poorly on new data.
Creating the Split: The dataset is divided into training and testing sets in an 80:20 ratio. The training set is used to train the model, and the testing set is used to evaluate its performance.
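An 80:20 disjoint split can be sketched with the standard library alone, as below. The helper name `split_disjoint`, the fixed seed, and the placeholder rows are assumptions for illustration. In practice a library routine such as scikit-learn's `train_test_split` is commonly used for the same purpose, and the source does not state which tool was used.

```python
import random


def split_disjoint(rows, test_ratio=0.2, seed=42):
    """Shuffle row indices and split them into disjoint train and test sets."""
    indices = list(range(len(rows)))
    # A seeded generator makes the shuffle reproducible (the seed is arbitrary).
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(rows) * test_ratio))
    # Each index lands in exactly one of the two sets, so they cannot overlap.
    test = [rows[i] for i in indices[:n_test]]
    train = [rows[i] for i in indices[n_test:]]
    return train, test


# Hypothetical stand-in for the cleaned reviews: ten placeholder rows.
rows = [f"review_{i}" for i in range(10)]
train_rows, test_rows = split_disjoint(rows)

print(len(train_rows), len(test_rows))
# prints: 8 2
```

Because every index is assigned to exactly one side, no review appears in both sets, which is the property the disjoint split relies on.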
Link to Training dataset can be found here.
Link to Testing dataset can be found here.
​
Training Dataset

Testing Dataset
