CLUSTERING

RESULTS
(a) K-Means Clustering
Choosing the right value of k is an important aspect of K-Means clustering. Several methods can be used to select the best value of k; three common ones are the Elbow Method, the Silhouette Method, and the Gap Statistic.
Elbow Method
The elbow method is a graphical approach to finding the optimal number of clusters in a dataset. The processed dataset is fit with the K-Means clustering algorithm for a range of cluster counts; here the values {2, 3, 4} are used. For each value of k the distortion is calculated, i.e. the inertia: the sum of squared distances from each point to its assigned cluster center. Plotting the distortion values reveals the point where the reduction in distortion starts to slow down (the "elbow").
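The procedure above can be sketched as follows, assuming scikit-learn is available. The feature matrix X here is a synthetic placeholder (the report's actual features are text_length and abstract_length), so the data and names are illustrative, not the report's code.

```python
# Sketch of the elbow method: fit K-Means for k in {2, 3, 4} and
# record the distortion (inertia) for each fit.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# three well-separated synthetic blobs standing in for the real features
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in (0, 5, 10)])

distortions = {}
for k in (2, 3, 4):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    distortions[k] = km.inertia_  # sum of squared distances to assigned centers

for k, d in sorted(distortions.items()):
    print(f"k={k}: distortion={d:.2f}")
```

On data with three underlying groups, the drop in distortion from k=2 to k=3 is much larger than from k=3 to k=4, which is exactly the "elbow" the plot is meant to expose.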

Three different K values using Elbow method
Visualization of the K values obtained from Elbow method

Inference
From the graph, there is a significant reduction in distortion from k=2 to k=3, while the reduction from k=3 to k=4 is less pronounced. Therefore, k=3 is a reasonable choice based on the elbow method.
Silhouette Method
The silhouette method evaluates clustering quality based on how well separated the clusters are from one another and how cohesive they are internally. The clustering algorithm is fit for the same range of cluster counts {2, 3, 4}, and a silhouette score is calculated for each: the score measures how similar each point is to its own cluster compared to the other clusters. The scores for the different cluster counts can then be compared, and the value with the highest silhouette score is selected.
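A minimal sketch of this selection loop, again on placeholder data rather than the report's actual feature matrix:

```python
# Sketch of the silhouette method: fit K-Means for each candidate k,
# score the labeling, and keep the k with the highest mean silhouette.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(42)
# synthetic stand-in for the processed dataset
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in (0, 5, 10)])

scores = {}
for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # mean silhouette over all points

best_k = max(scores, key=scores.get)
print(f"best k = {best_k}")
```

Silhouette scores always lie in [-1, 1]; values near 1 indicate compact, well-separated clusters, which is why the maximum is taken.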
Three different K values using Silhouette method

Visualization of the K values obtained from Silhouette method

Inference
From the graph, the silhouette score is highest for k=3 (0.3798). Hence k=3 is the best value of k for K-Means clustering.
Both the elbow and silhouette methods indicate that the best value of k is 3. K-Means clustering is therefore performed with k=3, and the resulting clusters are visualized for further analysis.
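The final fit at the chosen k can be sketched as below; the data is again a synthetic placeholder, and in the real pipeline the processed feature matrix would be reused.

```python
# Final K-Means fit at the selected k=3; the labels and centers are
# what a scatter plot of the clusters would be built from.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in (0, 5, 10)])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
labels = km.labels_            # cluster assignment per point
centers = km.cluster_centers_  # one center per cluster, for plotting
print(np.bincount(labels))     # cluster sizes
```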

(b) Hierarchical Clustering
Hierarchical clustering is a method for grouping similar data points into clusters in a hierarchical manner. It builds a tree-like structure, known as a dendrogram, where each leaf represents an individual data point and internal nodes represent clusters of varying sizes. The algorithm operates by iteratively merging or splitting clusters based on the proximity of data points, combining a linkage criterion (e.g. single, complete, or average linkage) with a dissimilarity measure such as Euclidean distance. Here, cosine similarity is used as the distance metric for the unlabeled numeric data.
Hierarchical clustering can be agglomerative, starting with individual data points as clusters and progressively combining them, or divisive, beginning with a single cluster and recursively partitioning it. The resulting dendrogram provides a visual representation of the hierarchical relationships between data points, and the final clusters can be selected based on the desired level of granularity. This method is flexible and well suited to exploratory analysis, but its computational cost grows with the dataset size. The cut height is set to 0.0067, which yields the desired number of clusters, 3, according to hclust.
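A hedged sketch of agglomerative clustering with a cosine metric, using SciPy. The 0.0067 cut height is specific to the report's data, so this sketch cuts the dendrogram to a fixed number of clusters instead, and the data is a synthetic placeholder with directionally distinct groups (cosine distance separates points by direction, not magnitude).

```python
# Agglomerative (bottom-up) hierarchical clustering with average
# linkage and cosine distance, cut into 3 clusters.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# three synthetic groups pointing in different directions
X = np.vstack([
    rng.normal(loc=c, scale=0.2, size=(30, 2))
    for c in ((5.0, 0.5), (0.5, 5.0), (4.0, 4.0))
])

# build the dendrogram (linkage matrix) with a cosine metric
Z = linkage(X, method="average", metric="cosine")

# cut the tree into 3 clusters (the report instead cuts at height 0.0067)
labels = fcluster(Z, t=3, criterion="maxclust")
print(np.bincount(labels)[1:])  # cluster sizes; fcluster labels start at 1
```

Cutting by `maxclust` and cutting by a fixed height are equivalent ways of choosing the granularity: a height of 0.0067 on the report's dendrogram happens to produce exactly 3 clusters.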

Inference
Hclust suggests that the optimal number of clusters is 3 for hierarchical clustering.
Both K-Means and hierarchical clustering therefore agree that the best value of k is 3.
CONCLUSION
Clustering techniques such as K-Means and hierarchical clustering can effectively group data into distinct categories based on selected features, in this case text_length and abstract_length. Both techniques identified three as the best-fitting number of clusters. This suggests that the documents within each group share similar length characteristics, which can help in categorizing or organizing information by content length and may offer insight into document types or writing styles.