Data Preparation and Code | Data Quality Classif

DATA PREPARATION

1. FEATURE SELECTION USING SVM

Support Vector Machines (SVM) are supervised learning models used for classification and regression analysis. However, they require specific data formats to function correctly:

Labeled Data: SVMs are supervised learning models, meaning they require labeled data. Each data instance must have a predefined label that the model will try to predict.
Numeric Data: SVMs cannot directly handle text or categorical data; they require all input features to be numeric. For text summarization, text must first be converted into a numerical format, typically through TF-IDF (Term Frequency-Inverse Document Frequency) or word embeddings, while categorical variables are usually one-hot encoded.
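As a rough illustration of the numeric-data requirement, the sketch below turns a few short documents into a TF-IDF matrix with scikit-learn's TfidfVectorizer. The example sentences are hypothetical and are not taken from the dataset used on this page.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy documents; the real dataset is not reproduced here.
docs = [
    "the sensor reading is missing",
    "the sensor reading is valid",
    "duplicate record found in table",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)  # sparse matrix: one row per document

print(X.shape)                        # (number of documents, vocabulary size)
print(sorted(vectorizer.vocabulary_)) # the words that became feature columns
print(X.toarray().round(2))           # the numeric features an SVM would see
```

Each row is a numeric vector, so it can be passed to an SVM directly. With the default settings each row is also scaled to unit length.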

Dataset used for SVM

(b) Training and Testing Sets

Splitting Data: The labeled dataset must be divided into two disjoint sets: a Training Set and a Testing Set. The Training Set is used to build and train the SVM model, while the Testing Set is used to evaluate its performance and ensure that the model generalizes well to new, unseen data.
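Why Disjoint: The Training and Testing sets must be disjoint so that overfitting can be detected. Overfitting occurs when a model learns the details and noise in the training data to an extent that it negatively impacts the performance of the model on new data. If test examples also appear in the training data, the test score rewards memorization and overstates how well the model generalizes.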
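To make the overfitting point concrete, here is a small sketch on synthetic data, assuming scikit-learn and NumPy are available. The labels are random, so there is nothing real to learn, and the kernel settings (gamma=50, C=100) are deliberately extreme choices for illustration, not recommended values.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic features with purely random labels: no genuine pattern exists.
X_train = rng.normal(size=(200, 5))
y_train = rng.integers(0, 2, size=200)
X_test = rng.normal(size=(200, 5))
y_test = rng.integers(0, 2, size=200)

# A very flexible RBF kernel (large gamma, large C) can memorize the training set.
model = SVC(kernel="rbf", gamma=50.0, C=100.0).fit(X_train, y_train)

train_acc = model.score(X_train, y_train)  # scored on data the model has seen
test_acc = model.score(X_test, y_test)     # scored on disjoint, unseen data

print(f"training accuracy: {train_acc:.2f}")
print(f"testing accuracy:  {test_acc:.2f}")
```

The training score alone would suggest a strong model. Only the score on the disjoint test set shows that the model has memorized noise.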

Link to the data can be found here

Training Dataset

Testing Dataset

Data Preparation for SVM:

To use SVM for this task, text data needs to be converted into a numeric format. This can be done through the following steps:

Text Vectorization: Convert both 'text' and 'summary' columns into numeric forms using techniques like TF-IDF. This method will convert the text into a matrix of TF-IDF features.
Labeling: In supervised learning, each input feature set (vectorized text) must be associated with a label. For summarization, SVM can be adapted in two ways: by turning the problem into a classification task if the summaries can be discretized into categories, or by using regression to predict numeric aspects of the summaries.
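The vectorization and labeling steps can be sketched as a single scikit-learn pipeline. This is a minimal sketch under stated assumptions, not the code linked from this page: the texts and the category labels ("sports", "finance") are hypothetical stand-ins for a 'text' column and for categories derived from the summaries.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical 'text' values and hypothetical category labels.
texts = [
    "team won the match",
    "player scored a goal",
    "coach praised the team",
    "stock prices fell sharply",
    "bank raised interest rates",
    "market closed higher today",
]
labels = ["sports", "sports", "sports", "finance", "finance", "finance"]

# TF-IDF converts text to numeric features; LinearSVC then learns the classes.
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(texts, labels)

new_text = ["the team scored a goal"]
prediction = clf.predict(new_text)[0]
print(prediction)
```

Keeping the vectorizer inside the pipeline means it is fitted only on the data passed to fit, so vocabulary and IDF weights are never learned from test documents.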

Splitting Data:

Creating Training and Testing Sets: Typically, data is split in a ratio such as 80/20, where 80% of the data is used for training and the remaining 20% for testing. This ensures that the model is tested on unseen data.
Disjoint Sets: It's crucial that these sets do not overlap to ensure the model’s ability to generalize well to new data.
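An 80/20 split of this kind might look as follows with scikit-learn's train_test_split. The ten placeholder documents and the alternating labels are hypothetical. The random_state and stratify arguments are optional choices made here for reproducibility and class balance.

```python
from sklearn.model_selection import train_test_split

# Hypothetical placeholder data: ten unique documents with alternating labels.
texts = [f"document {i}" for i in range(10)]
labels = [i % 2 for i in range(10)]

X_train, X_test, y_train, y_test = train_test_split(
    texts,
    labels,
    test_size=0.2,     # 20% held out for testing, 80% for training
    random_state=42,   # fixed seed so the split is reproducible
    stratify=labels,   # keep the class proportions equal in both sets
)

print(len(X_train), len(X_test))

# Each sample lands in exactly one set, so the two sets cannot overlap.
overlap = set(X_train) & set(X_test)
print("overlap:", overlap)
```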

Why Numeric and Disjoint:

SVM works on geometric principles in a high-dimensional space, making numeric data essential as it involves calculations with vectors. The disjoint nature of training and testing sets is fundamental to validate the model's predictions against truly independent samples.
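The geometric view can be illustrated with a linear SVM on a few 2-D points. The sketch below uses hypothetical toy points and assumes scikit-learn and NumPy. The model's decision reduces to the sign of w · x + b, which is why every feature has to be a number.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical 2-D points forming two linearly separable groups.
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0],
              [3.0, 3.0], [4.0, 3.0], [3.0, 4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

model = SVC(kernel="linear").fit(X, y)

w = model.coef_[0]        # normal vector of the separating hyperplane
b = model.intercept_[0]   # offset of the hyperplane

# Signed score of each point relative to the hyperplane: w . x + b
manual_scores = X @ w + b
manual_labels = (manual_scores > 0).astype(int)

print(manual_scores.round(3))
print(manual_labels)
```

The manually computed scores match what decision_function returns for a linear kernel. A positive score is assigned to the second class and a negative score to the first.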

CODE

Code for SVM using Python can be found here.