DATA PREPARATION
Every machine learning project rests on its data, and choosing the right dataset is fundamental to any model. Ideally, the data used for training should closely match the kind of content the model is expected to summarize. But since a text summarization model serves diverse use cases, a wide range of datasets is needed. The data chosen for this project therefore includes academic journal abstracts, Amazon reviews, and news articles posted on the web. Finally, all these datasets are combined into a single dataset, which is divided into training, validation, and test sets for model training and evaluation. The code for all the data cleaning and visualization can be found here.
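The combine-and-split step described above can be sketched in pandas as follows. The small inline data frames are illustrative stand-ins for the three cleaned datasets (in the project they would be read from the CSV files produced later), and the 80/10/10 split ratio is an assumption, since the exact proportions are not stated.

```python
import pandas as pd

# Illustrative stand-ins for the three cleaned datasets; in the project
# these would be read from the CSV files produced by each cleaning step.
arxiv = pd.DataFrame({"text": ["abstract one"] * 10, "summary": ["title one"] * 10})
news = pd.DataFrame({"text": ["article body"] * 10, "summary": ["headline"] * 10})
reviews = pd.DataFrame({"text": ["review body"] * 10, "summary": ["review title"] * 10})

# Combine everything into one dataset with a common text/summary schema.
combined = pd.concat([arxiv, news, reviews], ignore_index=True)

# Shuffle, then take an assumed 80/10/10 train/validation/test split.
combined = combined.sample(frac=1, random_state=42).reset_index(drop=True)
n = len(combined)
train = combined.iloc[: int(0.8 * n)]
val = combined.iloc[int(0.8 * n): int(0.9 * n)]
test = combined.iloc[int(0.9 * n):]
```

Shuffling before the split matters here: without it, each split would be dominated by a single source dataset, since the frames are concatenated one after another.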

The first dataset is arXiv, a publicly available dataset featuring abstracts and titles of academic papers. It is available on Kaggle as metadata in JSON format. The arXiv dataset is licensed under the Creative Commons CC0 1.0 Universal Public Domain Dedication and contains information about the submitter, the authors, the title of the paper, additional comments, and the abstract. Because the dataset contains a huge number of records, it is shuffled and limited to 25,000 records. The abstract column serves as the text for summarization, while the title functions as the reference summary. Cleaning this dataset involves dropping all other columns so that only the title and abstract remain; the abstract column is renamed text and the title column is renamed summary. Both columns are checked for null/NaN or missing values, and any such rows are dropped. Only rows with a text length greater than 20 and a summary length greater than 5 are kept for the final dataset. The final dataset is converted to CSV and is shown in the image on the right.
Raw Data from Kaggle containing journals
Data after cleaning
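The arXiv cleaning steps above can be sketched as follows. The inline data frame is a minimal stand-in for the loaded metadata, and the length thresholds are interpreted as character counts (the text does not specify characters or words).

```python
import pandas as pd

# Minimal stand-in for the arXiv metadata after loading the JSON;
# the real frame would also hold submitter, comments, etc.
df = pd.DataFrame({
    "submitter": ["a", "b", "c"],
    "title": ["A short title", "Tiny", "Another usable title"],
    "abstract": [
        "An abstract that is certainly longer than twenty characters.",
        "too short",
        "Another abstract comfortably longer than twenty characters.",
    ],
})

# Keep only abstract and title, renamed to the common text/summary schema.
df = df[["abstract", "title"]].rename(columns={"abstract": "text", "title": "summary"})

# Drop rows with null/NaN values in either column.
df = df.dropna(subset=["text", "summary"])

# Keep rows with text length > 20 and summary length > 5 (assumed: characters).
df = df[(df["text"].str.len() > 20) & (df["summary"].str.len() > 5)]

df.to_csv("arxiv_clean.csv", index=False)
```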

Raw Data from API containing news articles
The second dataset comes from NewsAPI and consists of news articles posted on the web. The base URL for the API is https://newsapi.org/v2/everything. The code to get the JSON response from the API can be found here. This dataset includes information about the author, the article, its title, and its content. The Title and Content columns are kept and all other columns are discarded. After basic data cleaning such as removing null and NaN values, the "Content" column is renamed "Text" and the "Title" column is renamed "Summary", mirroring the first dataset. Because a single request returns only 100 records, more news articles are captured in the same manner. Finally, all the news articles obtained from the different API requests are combined into a single data frame, which is converted to CSV.
Data after cleaning
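The response-to-CSV step can be sketched as below. The response dict is a hand-written illustration of the shape of a NewsAPI /v2/everything response (the real call, shown in the comment, needs an API key), and the field names `title` and `content` follow that API's article schema.

```python
import pandas as pd

# A live call would look roughly like (API key required):
#   resp = requests.get("https://newsapi.org/v2/everything",
#                       params={"q": "technology", "apiKey": API_KEY}).json()
# Here we use a minimal illustrative response instead.
response = {
    "articles": [
        {"author": "A. Writer", "title": "Headline one",
         "content": "Full content of the first article."},
        {"author": None, "title": "Headline two",
         "content": "Full content of the second article."},
    ]
}

# Keep content and title, renamed to match the first dataset's schema.
df = pd.DataFrame(response["articles"])
df = df[["content", "title"]].rename(columns={"content": "Text", "title": "Summary"})

# Drop rows with null/NaN values in either column.
df = df.dropna(subset=["Text", "Summary"])

df.to_csv("news_clean.csv", index=False)
```

Responses from repeated requests would each be turned into a frame this way and concatenated with `pd.concat` before writing the CSV.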

Raw Data from hugging face containing reviews
The third dataset consists of Amazon reviews obtained from Hugging Face. The raw data is in JSON format and is read into a data frame for preprocessing. It contains information about the product, review_body, review_title, and the product category. For the text summarization model, only the review_body and review_title columns are retained, and they are renamed text and summary respectively. The data is checked for null values, and any such rows are dropped. The final dataset is converted to CSV for further analysis.
Data after cleaning
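The review-cleaning steps can be sketched as follows. The inline records are an illustrative stand-in for the Hugging Face JSON, using the field names described above.

```python
import pandas as pd

# Illustrative stand-in for the raw review records read from JSON.
records = [
    {"product_id": "P1", "review_body": "Works great, battery lasts for days.",
     "review_title": "Great battery", "product_category": "electronics"},
    {"product_id": "P2", "review_body": None,
     "review_title": "No body", "product_category": "home"},
]

# Keep only review_body and review_title, renamed to the common schema.
df = pd.DataFrame(records)
df = df[["review_body", "review_title"]].rename(
    columns={"review_body": "text", "review_title": "summary"})

# Drop rows with null values (the second record is removed here).
df = df.dropna(subset=["text", "summary"])

df.to_csv("reviews_clean.csv", index=False)
```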

An experimental venture as part of this project involves YouTube video summarization: a YouTube video is downloaded, converted to audio, and the audio is then transcribed to text. All of this is done in Python within a single function, which can be run on multiple video links to build a new dataset. This work is still in progress and requires more research.