
DATA PREPARATION

For video and text summarization, especially when generating textual summaries from video transcripts, Naive Bayes can be used for related tasks such as:

  • Classifying sentences or segments of the video transcript as important or not important, based on features extracted from the text, to help identify key points for inclusion in a summary.

  • Sentiment analysis to determine the overall sentiment of sections of the video, which can guide the summarization process by highlighting significant emotional peaks or areas of interest.

The two models can be combined into a comprehensive Text Summarizer model that can summarize reviews, news articles, journals, and even videos!

Given the large diversity of available datasets, such as YouTube video transcripts, Amazon product reviews, research papers, and news articles, sentiment analysis can be performed on the reviews dataset, while classifying sentences as important or not can be done on the journal dataset.

1. SENTIMENT ANALYSIS USING NAIVE BAYES

(a) Data Cleaning and Preparation

The first step is data preparation, which includes selecting suitable columns to create a labeled dataset.

Three columns are chosen: text, which contains the full review text; summary, which provides a brief summary or headline for the review; and rating, which gives the numerical rating associated with each review. This structure is suitable for sentiment analysis, as the numerical ratings serve as sentiment labels. The link to the cleaned dataset can be found here.
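This column selection can be sketched with pandas. The two-row frame and its extra columns below are hypothetical stand-ins for the real reviews file:

```python
import pandas as pd

# Hypothetical raw reviews frame; the real dataset has many more rows and columns.
raw = pd.DataFrame({
    "reviewerID": ["A1", "A2"],
    "text": ["Great product, works well.", "Broke after a week."],
    "summary": ["Great product", "Disappointing"],
    "rating": [5, 1],
})

# Keep only the three columns used for sentiment analysis and drop empty rows.
reviews = raw[["text", "summary", "rating"]].dropna().reset_index(drop=True)
```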

Dataset after data preparation and cleaning


(b) Balancing

The second step is to check whether the dataset is balanced. This can be done by plotting a histogram of the rating column. The histogram shows that the data is well balanced, with roughly equal numbers of reviews for each rating from 1 to 5.
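The balance check can be sketched with pandas; the toy ratings below are a hypothetical stand-in for the full cleaned dataset:

```python
import pandas as pd

# Toy ratings column; the real check runs on the full cleaned dataset.
reviews = pd.DataFrame({"rating": [1, 2, 3, 4, 5] * 4})

# Count reviews per rating; a (near-)uniform distribution means the data is balanced.
counts = reviews["rating"].value_counts().sort_index()
print(counts)

# The same information can be drawn as a histogram (requires matplotlib):
# reviews["rating"].hist(bins=5)
```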

Histogram of Rating


(c) Splitting into test and training dataset

The ratings are mapped to sentiment labels: 1 for positive sentiment and 0 for negative sentiment. All ratings above 3 are given the positive (1) label, and ratings below 3 are given the negative (0) label.
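A minimal sketch of this mapping, assuming neutral ratings of exactly 3 are dropped (the page does not say how they are handled); the four toy ratings are hypothetical:

```python
import pandas as pd

# Toy ratings; the real mapping runs on the full review dataset.
reviews = pd.DataFrame({"rating": [1, 2, 4, 5]})

# Ratings above 3 -> positive (1), below 3 -> negative (0);
# neutral 3s are excluded here (an assumption, not stated on the page).
reviews = reviews[reviews["rating"] != 3].copy()
reviews["sentiment"] = (reviews["rating"] > 3).astype(int)
```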

The next step is to create a train-test split, which partitions the dataset into two segments: one for training the machine learning model (the training set) and one for testing the model's performance on unseen data (the testing set). This approach is essential for accurately evaluating a model's ability to generalize.

A disjoint split is created to avoid overfitting: it ensures that the model learns to generalize from the training data rather than memorize it. Overfitting occurs when a model performs exceptionally well on training data but poorly on new, unseen data. A disjoint split also allows realistic performance evaluation; by testing on unseen data, we can assess how the model would perform in real-world scenarios, providing a more accurate picture of its effectiveness. Finally, the test set can guide tuning of the model's hyperparameters to improve performance without risking overfitting. The dataset is split into training and testing sets by setting test_size to 0.2 and random_state to 42.
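The split itself is a one-liner with scikit-learn; the ten-row toy frame below is a hypothetical stand-in for the labeled review dataset:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy labeled data; the real split is applied to the full labeled review dataset.
reviews = pd.DataFrame({
    "text": [f"review {i}" for i in range(10)],
    "sentiment": [0, 1] * 5,
})

# Disjoint 80/20 split; random_state=42 makes the partition reproducible.
train_df, test_df = train_test_split(reviews, test_size=0.2, random_state=42)
```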

 

Link to training dataset can be found here.

Link to testing dataset can be found here.

 Training Dataset


 Testing Dataset


2. CLASSIFYING SENTENCES AS IMPORTANT OR NOT

The second approach is to classify the sentences in video transcripts, reviews, or articles based on their importance. This can help the Text Summarizer model accurately summarize any given input.

(a) Data Cleaning and Preparation

This step involves creating a labeled dataset with two columns: text, which contains full sentences or paragraphs, and summary, which holds a summary or title of the corresponding text. To classify sentences as important or not, we first need to establish what constitutes "important" in this context. One approach is to label a sentence as important if it appears in both the summary and the text. Since there is no direct mapping between sentences in the text and their contribution to the summary, a heuristic labeling approach can be taken:

The steps are:

  1. Extract Sentences: Split the text in the text column into individual sentences.

  2. Labeling Strategy: For each sentence, determine if it is "important" based on specific criteria related to the summary. One approach could be to check for overlap in key terms or phrases between each sentence and the summary. Sentences with a higher degree of overlap with the summary can be considered more important.

  3. Feature Extraction: Transform the sentences into a format suitable for machine learning, such as using TF-IDF (Term Frequency-Inverse Document Frequency) to convert text into a numeric format.

  4. Train Naive Bayes Classifier: Use the labeled sentences as input to train a Naive Bayes classifier. The classifier will learn to predict the importance of new sentences based on the features extracted in the previous step.
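Steps 3 and 4 above can be sketched with scikit-learn. The four sentences and their importance labels below are hypothetical stand-ins for the heuristically labeled dataset:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy labeled sentences (1 = important, 0 = not important);
# real labels come from the summary-overlap heuristic described above.
sentences = [
    "The study found a 40 percent improvement in accuracy.",
    "Thanks for watching the video.",
    "Results show the model outperforms the baseline.",
    "Please like and subscribe.",
]
labels = [1, 0, 1, 0]

# TF-IDF features feeding a multinomial Naive Bayes classifier.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(sentences, labels)

# Predict importance of a previously unseen sentence.
pred = model.predict(["The experiment shows improved accuracy."])
```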

Dataset after data preparation and cleaning


(b) Splitting the dataset into testing and training datasets


The labeling function is first applied to a small subset as a test: the text is tokenized into sentences using a simple method, cosine similarity is calculated between each sentence and the summary, and a similarity threshold determines whether each sentence is labeled 'important' or 'not important'. This dataset is then split into training and testing sets by setting test_size to 0.2 and random_state to 42.
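A minimal sketch of this labeling heuristic, using a hypothetical review text and summary, a naive period-based sentence split, and an assumed similarity threshold of 0.2 (the page does not state the actual cutoff):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical text/summary pair standing in for one row of the dataset.
text = ("The new battery lasts two days on a single charge. "
        "The packaging was plain. Shipping took a week.")
summary = "Battery lasts two days per charge."

# Naive sentence split on periods; the page's "simple method" is assumed comparable.
sentences = [s.strip() + "." for s in text.split(".") if s.strip()]

# Vectorize the sentences together with the summary, then compare each sentence to it.
vec = TfidfVectorizer().fit(sentences + [summary])
sims = cosine_similarity(vec.transform(sentences), vec.transform([summary])).ravel()

THRESHOLD = 0.2  # assumed cutoff; the page does not give the exact value
labels = [int(s >= THRESHOLD) for s in sims]
```

The battery sentence shares most of its key terms with the summary, so it scores highest and gets the 'important' label.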


Link to training dataset can be found here.

Link to testing dataset can be found here.

 Training Dataset


 Testing Dataset


CODE

Code for Sentiment Analysis using Python can be found here.

Code for Classifying Sentences as Important or Not in Python can be found here.
