Data Preparation and Code | Data Quality Classif

DATA PREPARATION

1. SENTIMENT ANALYSIS USING DECISION TREES

The dataset used is a reviews dataset containing many review-related columns, such as 'title', 'length', and 'rating'.

(a) Data Cleaning and Processing

A subset of this dataset is used for further processing: 'text' contains the review text, 'summary' contains the title or summary of each review, and 'rating' is the rating given by the customer. The ratings are mapped to a sentiment label, where 1 denotes positive sentiment and 0 denotes negative sentiment. Ratings above 3 receive the positive (1) label and ratings below 3 receive the negative (0) label.
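The rating-to-label mapping can be sketched as follows. This is a minimal illustration rather than the project's actual code: the function name `rating_to_sentiment` and the sample records are hypothetical. The text does not say how a rating of exactly 3 is handled, so the sketch assumes such reviews are left unlabeled and dropped.

```python
from typing import Optional


def rating_to_sentiment(rating: float) -> Optional[int]:
    """Map a rating to a sentiment label: 1 = positive, 0 = negative."""
    if rating > 3:
        return 1
    if rating < 3:
        return 0
    # Assumption: a rating of exactly 3 is treated as neutral and left unlabeled.
    return None


# Hypothetical sample rows using the three columns kept in the subset.
reviews = [
    {"text": "Works great", "summary": "Love it", "rating": 5},
    {"text": "Broke after a day", "summary": "Disappointed", "rating": 1},
    {"text": "It is okay", "summary": "Average", "rating": 3},
]

# Attach a 'sentiment' column and drop rows that received no label.
labeled = []
for row in reviews:
    label = rating_to_sentiment(row["rating"])
    if label is not None:
        labeled.append({**row, "sentiment": label})
```

Here `labeled` keeps only the 5-star and 1-star reviews, tagged 1 and 0 respectively.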

The data cleaning process includes the following steps:

Lowercasing: All text is converted to lowercase to standardize the data.
Removing Special Characters: Special characters and numbers are removed, as they are often not useful for sentiment analysis.
Tokenization: The text is broken down into individual words or tokens.
Removing Stopwords: Commonly used words (e.g., "the", "is", "in") are removed since they usually don't carry sentiment.
Aggregating Tokens: The tokens are rejoined into a processed string for vectorization.
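
The cleaning steps above can be sketched in a single function. This is an illustrative sketch under stated assumptions, not the project's actual code: the name `clean_text` is hypothetical, tokenization is assumed to be a simple whitespace split, and the stopword set is a tiny stand-in for a fuller list such as the one shipped with NLTK.

```python
import re

# Assumption: a tiny illustrative stopword list, not the list used in the project.
STOPWORDS = {"the", "is", "in", "a", "an", "and", "it", "this", "to", "of"}


def clean_text(text: str) -> str:
    """Apply the five cleaning steps and return a single processed string."""
    # 1. Convert to lowercase.
    text = text.lower()
    # 2. Replace everything except letters and whitespace with a space
    #    (this removes digits, punctuation, and other special characters).
    text = re.sub(r"[^a-z\s]", " ", text)
    # 3. Tokenize on whitespace.
    tokens = text.split()
    # 4. Drop stopwords.
    tokens = [t for t in tokens if t not in STOPWORDS]
    # 5. Rejoin the remaining tokens for later vectorization.
    return " ".join(tokens)


cleaned = clean_text("The battery is GREAT, lasted 10 hours in testing!")
# cleaned == "battery great lasted hours testing"
```

Note that replacing punctuation with spaces splits contractions such as "don't" into "don" and "t"; whether that matters depends on the vectorizer used downstream.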

Dataset after cleaning and preprocessing (screenshot)

(b) Training and Testing Dataset

Why a Disjoint Split?: Keeping the training and testing data disjoint ensures that the model is evaluated on unseen data, simulating how it would perform in real-world applications. It helps detect overfitting, where the model performs well on training data but poorly on new data.

Creating the Split: The dataset is divided into training and testing sets in an 80:20 ratio. The training set is used to train the model, and the testing set is used to evaluate its performance.
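
A disjoint 80:20 split can be sketched with the standard library alone. The function name `split_80_20`, the fixed seed, and the sample rows are assumptions made for illustration; the project may well have used a library helper such as scikit-learn's `train_test_split` instead, which the text does not specify.

```python
import random


def split_80_20(rows, test_ratio=0.2, seed=42):
    """Shuffle row indices and split rows into disjoint train and test lists."""
    indices = list(range(len(rows)))
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(rows) * test_ratio))
    # Each index lands in exactly one of the two lists, so they cannot overlap.
    test = [rows[i] for i in indices[:n_test]]
    train = [rows[i] for i in indices[n_test:]]
    return train, test


# Hypothetical stand-in for the cleaned review rows.
rows = [f"review {i}" for i in range(10)]
train_rows, test_rows = split_80_20(rows)

# Sanity checks: sizes follow the 80:20 ratio and no row appears in both sets.
assert len(train_rows) == 8 and len(test_rows) == 2
assert set(train_rows).isdisjoint(test_rows)
```

Because every index is assigned to exactly one list, the two sets are disjoint by construction, which is the property the evaluation relies on.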

The training dataset can be found here.

The testing dataset can be found here.

Training and testing datasets (screenshot)

CODE

The Python code for the decision tree can be found here.