
DATA PREPARATION

1. SENTIMENT ANALYSIS USING DECISION TREES

The dataset is a reviews dataset with several review-related columns, such as 'title', 'length', and 'rating'.


(a) Data Cleaning and Processing 


A subset of this dataset is used for further processing: 'text' contains the review body, 'summary' contains the title or summary of the review, and 'rating' is the rating given by the customer. Ratings are mapped to a sentiment label, where 1 denotes positive sentiment and 0 denotes negative sentiment. Ratings above 3 receive the positive label (1) and ratings below 3 receive the negative label (0).
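The rating-to-label mapping described above can be sketched in a few lines of Python. This is a minimal illustration, not the original code: the function name `rating_to_sentiment` and the sample rows are hypothetical, and because the text does not say how a rating of exactly 3 is handled, the sketch assumes such rows are left unlabeled and dropped.

```python
from typing import Optional


def rating_to_sentiment(rating: float) -> Optional[int]:
    """Map a rating to a sentiment label: 1 = positive, 0 = negative."""
    if rating > 3:
        return 1  # ratings above 3 are positive
    if rating < 3:
        return 0  # ratings below 3 are negative
    # Assumption: a rating of exactly 3 is not covered by the text,
    # so it gets no label here.
    return None


# Hypothetical sample rows using the 'text', 'summary', and 'rating' columns.
reviews = [
    {"text": "Loved it, works perfectly.", "summary": "Great", "rating": 5},
    {"text": "Broke after two days.", "summary": "Poor", "rating": 1},
    {"text": "It is okay.", "summary": "Average", "rating": 3},
]

# Attach the label and drop rows that have no label.
labeled = []
for row in reviews:
    sentiment = rating_to_sentiment(row["rating"])
    if sentiment is not None:
        labeled.append({**row, "sentiment": sentiment})
```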

The data cleaning process includes:

  • Lowercasing: All text is converted to lowercase to standardize the data.

  • Removing Special Characters: Special characters and numbers are removed, as they are often not useful for sentiment analysis.

  • Tokenization: Text is broken down into individual words or tokens.

  • Removing Stopwords: Commonly used words (e.g., "the", "is", "in") are removed since they usually don't carry sentiment.

  • Aggregating Tokens: Rejoin tokens into a processed string for vectorization.
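The cleaning steps listed above can be sketched as a single function. This is a minimal sketch using only the standard library, not the original implementation: the name `clean_text` is hypothetical, the stopword set is a tiny illustrative sample rather than a full list, and whitespace splitting stands in for whatever tokenizer was actually used.

```python
import re

# Assumption: a small illustrative stopword set; a real pipeline would
# typically use a much larger list from an NLP library.
STOPWORDS = {"the", "is", "in", "a", "an", "and", "of", "to", "it", "this", "was"}


def clean_text(text: str) -> str:
    """Apply the five cleaning steps to one review string."""
    # 1. Convert to lowercase.
    text = text.lower()
    # 2. Replace special characters and numbers with spaces.
    text = re.sub(r"[^a-z\s]", " ", text)
    # 3. Tokenize (here: split on whitespace).
    tokens = text.split()
    # 4. Remove stopwords.
    tokens = [tok for tok in tokens if tok not in STOPWORDS]
    # 5. Rejoin tokens into one string for vectorization.
    return " ".join(tokens)


cleaned = clean_text("The battery is GREAT, lasted 12 hours!!!")
```

Replacing removed characters with a space rather than deleting them keeps words separated when punctuation sits between them with no surrounding whitespace.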


Dataset after Cleaning and Preprocessing

Screenshot 2024-03-27 at 12.18.37 AM.png

(b) Training and Testing Dataset 

Why a Disjoint Split?: Keeping the training and testing data disjoint ensures that the model is evaluated on unseen data, which approximates how it would perform in real-world use. Evaluating on held-out data exposes overfitting, where the model performs well on training data but poorly on new data.

Creating the Split: The dataset is divided into training and testing sets in an 80:20 ratio. The training set (80%) is used to train the model, and the testing set (20%) is used to evaluate its performance.
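A disjoint 80:20 split can be sketched with the standard library alone. This is an illustration under stated assumptions, not the original code: the function name `split_80_20`, the seed value, and the sample rows are hypothetical. scikit-learn's `train_test_split` with `test_size=0.2` gives an equivalent split.

```python
import random


def split_80_20(rows, test_fraction=0.2, seed=42):
    """Shuffle row indices and cut them into disjoint train/test sets."""
    indices = list(range(len(rows)))
    # A fixed seed (assumed value) makes the split reproducible.
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(rows) * test_fraction))
    # Each index lands in exactly one of the two sets, so they are disjoint.
    test = [rows[i] for i in indices[:n_test]]
    train = [rows[i] for i in indices[n_test:]]
    return train, test


# Hypothetical cleaned reviews standing in for the real dataset.
rows = [f"review {i}" for i in range(10)]
train_rows, test_rows = split_80_20(rows)
```

Splitting by shuffled index rather than by position avoids bias if the source file is ordered, for example by rating or date.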


The training dataset can be found here.

The testing dataset can be found here.


Training Dataset

image.png

Testing Dataset

Screenshot 2024-03-27 at 12.28.45 AM.png

CODE

The Python code for the decision tree can be found here.

