Data Preparation and Code | Data Quality Classif

CLUSTERING

Sample data for clustering after data preparation

DATA PREPARATION

K-means and hierarchical clustering both require unlabelled numeric data. Of the three available datasets, the research documents dataset is chosen for clustering. Because this dataset consists mostly of text, numeric features derived from it, such as title_length and abstract_length, are used as the clustering inputs. A sample of the data used for clustering is shown on the right and in the image below.
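The feature derivation described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual preparation script: the records, the field names title, abstract and label, and the helper to_numeric_features are all hypothetical, and lengths are assumed to be measured in characters (the source does not say whether characters or words are counted).

```python
# Hypothetical records standing in for the research-document dataset.
# Field names ("title", "abstract", "label") are assumptions for illustration.
documents = [
    {"title": "Clustering research papers", "abstract": "We group papers by length.", "label": "A"},
    {"title": "Data quality", "abstract": None, "label": "B"},
    {"title": "  A survey of text mining  ", "abstract": "Short.", "label": "A"},
]


def to_numeric_features(docs):
    """Return unlabelled numeric rows: one dict of length features per document."""
    rows = []
    for doc in docs:
        # Treat missing text as empty and ignore surrounding whitespace.
        title = (doc.get("title") or "").strip()
        abstract = (doc.get("abstract") or "").strip()
        # Assumption: length is counted in characters, not words.
        rows.append({
            "title_length": len(title),
            "abstract_length": len(abstract),
        })
    return rows


features = to_numeric_features(documents)
for row in features:
    print(row)
```

The label column is dropped on purpose, since clustering works on unlabelled data; only the two numeric columns are carried forward.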

Sample data before data preparation

Visualization of sample data

CODE

Code for k-means clustering in Python can be found here.
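The linked code is not reproduced here. As an illustration only, the sketch below shows how k-means could be applied to (title_length, abstract_length) pairs using just the Python standard library. The sample points, the choice of k = 2, and the use of the first k points as starting centroids are assumptions made for the example; a real analysis would more likely rely on a library implementation such as scikit-learn's KMeans.

```python
import math


def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm on a list of equal-length numeric tuples."""
    # Assumption: start from the first k points so the run is deterministic.
    centroids = [tuple(p) for p in points[:k]]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(values) / len(members) for values in zip(*members))
                )
            else:
                # Keep the old centroid if a cluster ends up empty.
                new_centroids.append(centroids[c])
        if new_centroids == centroids:
            break  # Converged: centroids did not change.
        centroids = new_centroids
    return labels, centroids


# Made-up (title_length, abstract_length) pairs, for illustration only.
points = [(20, 300), (25, 320), (22, 310), (90, 1500), (95, 1550), (88, 1480)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

Because abstract_length is on a much larger scale than title_length, it dominates the Euclidean distance here; standardizing the features before clustering may be worth considering.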

Code for hierarchical clustering in R can be found here.