
RESULTS
The Decision Tree model used for feature selection achieved an accuracy of 89.29%.
The visualizations, particularly the bar chart of feature influences, show what the linear SVM model focuses on in the data. This is useful for understanding model behavior and for guiding feature engineering and the choice of kernel and hyperparameters.
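The feature-selection step described above can be sketched roughly as follows. This is an illustrative reconstruction, not the report's actual code: the synthetic TF-IDF-style matrix, the labels, and the tree depth are all stand-in assumptions.

```python
# Hypothetical sketch: selecting informative word features with a decision
# tree before training the SVM. Data here is synthetic stand-in data.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_selection import SelectFromModel

rng = np.random.default_rng(0)
X = rng.random((200, 50))                 # stand-in matrix: 200 sentences x 50 word features
y = (X[:, 0] + X[:, 1] > 1).astype(int)   # stand-in "sentence in summary" labels

# Fit a shallow tree and keep only features above mean importance.
tree = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X, y)
selector = SelectFromModel(tree, prefit=True, threshold="mean")
X_reduced = selector.transform(X)
print(X_reduced.shape[1], "features kept out of", X.shape[1])
```

The reduced matrix would then be passed to the SVM instead of the full feature set.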
Top Features Influencing Summary Inclusion by SVM (Linear Kernel)
The bar chart illustrates the SVM coefficients for different words, indicating their importance in the decision-making process of the SVM model. Words with higher coefficient magnitudes are more influential in classifying a sentence as being part of the summary.
Top Words Influencing Summary

Confusion Matrices and Accuracy
1. Linear Kernel (C=0.1): Accuracy 87%
2. RBF Kernel (C=1): Accuracy 89%
3. Polynomial Kernel (C=10): Accuracy 88.25%
Kernel Comparison
Performance: The RBF kernel with C=1 achieves the highest accuracy among the three kernels, suggesting that it might be better at handling the non-linearities in the data.
Precision and Recall: From the confusion matrices, the RBF kernel also shows a better balance between precision (few false positives) and recall (few false negatives) than the polynomial and linear kernels. It classifies more true positives with fewer false negatives, which matters for summary identification, where missing key information (a high false-negative rate) is costly.
The polynomial kernel with C=10 and the linear kernel with C=0.1 produce similar true-negative and false-positive counts but differ in their true-positive and false-negative rates; the polynomial kernel is slightly better at minimizing false negatives. Overall, the RBF kernel with C=1 is the best performer in this scenario, based on both accuracy and its balance of precision and recall.
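The three-kernel comparison can be reproduced in outline as follows. This runs on a synthetic dataset, so the accuracies and confusion matrices will not match the report's numbers; only the kernel and C settings are taken from the text.

```python
# Illustrative comparison of the three kernel settings from the report
# (linear C=0.1, RBF C=1, polynomial C=10) on synthetic stand-in data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix

X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

configs = {"linear": dict(kernel="linear", C=0.1),
           "rbf": dict(kernel="rbf", C=1),
           "poly": dict(kernel="poly", C=10)}

accs = {}
for name, params in configs.items():
    model = SVC(**params).fit(X_tr, y_tr)
    pred = model.predict(X_te)
    accs[name] = accuracy_score(y_te, pred)
    print(name, accs[name])
    print(confusion_matrix(y_te, pred))  # rows: true class, columns: predicted class
```

Reading the confusion matrices row by row (true class) against columns (predicted class) gives the TP/FP/TN/FN counts discussed above.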
Linear Kernel with C=0.1

RBF Kernel with C=1

Polynomial Kernel with C=10

CONCLUSION
In exploring the use of Support Vector Machines (SVM) for text summarization, we discovered that the RBF kernel performs best, suggesting it adeptly handles the complex, non-linear relationships in text data. This capability is crucial for identifying which sentences are important for summaries and which are not. The insights gained from the importance of specific words in decision-making can guide the refinement of summarization tools, making them more effective for digesting large volumes of text quickly and accurately. This is particularly valuable in areas like legal document review, academic research, and news aggregation, where efficient information processing is essential.