
DATA PREPARATION

Data is at the heart of every project, and choosing the right dataset is fundamental to any machine learning model. Ideally, the data used for training should closely match the kind of content the model is expected to summarize. But since a text summarization model serves many use cases, a wide range of datasets is needed. The data chosen for this project therefore includes academic journal abstracts, Amazon reviews, and news articles posted on the web. Finally, all these datasets are combined into a single dataset, which is split into training, validation, and test sets for model training and evaluation. The code for all the data cleaning and visualization can be found here.
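The combine-and-split step can be sketched as follows. This is a minimal illustration, assuming each cleaned dataset has already been loaded into a pandas DataFrame with `text` and `summary` columns; the 80/10/10 split ratio and the function name are assumptions, not the project's exact code.

```python
import pandas as pd

def combine_and_split(frames, seed=42):
    """Concatenate the cleaned datasets and split them 80/10/10
    into training, validation, and test sets."""
    combined = pd.concat(frames, ignore_index=True)
    train = combined.sample(frac=0.8, random_state=seed)  # 80% for training
    rest = combined.drop(train.index)
    val = rest.sample(frac=0.5, random_state=seed)        # half of the rest (10%)
    test = rest.drop(val.index)                           # remaining 10%
    return train, val, test
```

Fixing the random seed keeps the split reproducible across runs, so the same rows always land in the same partition.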


The first dataset is the publicly available arXiv dataset, featuring abstracts and titles of academic papers. It is available on Kaggle as a metadata file in JSON format. The arXiv dataset is licensed under the Creative Commons CC0 1.0 Universal Public Domain Dedication and contains the submitter, authors, title, journal reference, additional comments, and abstract of each paper. Because the dataset contains a huge number of records, it is shuffled and limited to 25,000 records. The abstract column serves as the text for summarization, while the title serves as the reference summary. Data cleaning for this dataset consists of dropping all other columns, keeping only the title and abstract. The abstract column is renamed text and the title column is renamed summary. The two columns are checked for null/NaN or missing values, and any such rows are dropped. Only rows with a text length greater than 20 and a summary length greater than 5 are kept in the final dataset, which is converted to CSV and shown in the image on the right.

Raw Data from Kaggle containing journals

{"id":"0704.0001","submitter":"Pavel Nadolsky","authors":"C. Bal\\'azs, E. L. Berger, P. M. Nadolsky, C.-P. Yuan","title":"Calculation of prompt diphoton production cross sections at Tevatron and\n  LHC energies","comments":"37 pages, 15 figures; published version","journal-ref":"Phys.Rev.D76:013009,2007","doi":"10.1103/PhysRevD.76.013009","report-no":"ANL-HEP-PR-07-12","categories":"hep-ph","license":null,"abstract":"  A fully differential calculation in perturbative quantum chromodynamics is\npresented for the production of massive photon pairs at hadron colliders. All\nnext-to-leading order perturbative contributions from quark-antiquark,\ngluon-(anti)quark, and gluon-gluon subprocesses are included, as well as\nall-orders resummation of initial-state gluon radiation valid at\nnext-to-next-to-leading logarithmic accuracy. The region of phase space is\nspecified in which the calculation is most reliable. Good agreement is\ndemonstrated with data from the Fermilab Tevatron, and predictions are made for\nmore detailed tests with CDF and DO data. Predictions are shown for\ndistributions of diphoton pairs produced at the energy of the Large Hadron\nCollider (LHC). Distributions of the diphoton pairs from the decay of a Higgs\nboson are contrasted with those produced from QCD processes at the LHC, showing\nthat enhanced sensitivity to the signal can be obtained with judicious\nselection of events.\n","versions":[{"version":"v1","created":"Mon, 2 Apr 2007 19:18:42 GMT"},{"version":"v2","created":"Tue, 24 Jul 2007 20:10:27 GMT"}],"update_date":"2008-11-26","authors_parsed":[["Bal\u00e1zs","C.",""],["Berger","E. L.",""],["Nadolsky","P. M.",""],["Yuan","C. -P.",""]]}

Data after cleaning
Raw Data from API containing news articles

{"status":"ok","totalResults":183,"articles":[{"source":{"id":"the-wall-street-journal","name":"The Wall Street Journal"},"author":"Steven Stalinsky","title":"Welcome to Dearborn, America's Jihad Capital...","description":"Imams and politicians in the Michigan city side with Hamas against Israel and Iran against the U.S.","url":"https://www.wsj.com/articles/welcome-to-dearborn-americas-jihad-capital-pro-hamas-michigan-counterterrorism-a99dba38","urlToImage":"https://images.wsj.net/im-919624/social","publishedAt":"2024-02-03T19:16:39Z","content":"Dearborn, Mich.Thousands march in support of Hamas, Hezbollah and Iran. Protesters, many with kaffiyehs covering their faces, shout Intifada, intifada, From the river to the sea, Palestine will be fr… [+251 chars]"},{"source":{"id":"the-wall-street-journal","name":"The Wall Street Journal"},"author":"www.wsj.com","title":"The Pope's Inquisitor Riles Conservatives. Some Call Him Heretic...","description":"Cardinal Victor Manuel Fernández has alarmed some Catholics with guidelines on blessing same-sex couples and a book on orgasms","url":"https://www.wsj.com/world/europe/the-popes-inquisitor-riles-conservatives-some-call-him-a-heretic-f70add41","urlToImage":"https://images.wsj.net/im-920473/social","publishedAt":"2024-02-03T17:00:03Z","content":"The Pope's Inquisitor Riles Conservatives. Some Call Him Heretic...Click here to read the full article The post The Pope's Inquisitor Riles Conservatives. Some Call Him Heretic... captured from The D… [+413 chars]"},

The second dataset is from NewsAPI and consists of news articles posted on the web. The base URL for the API is https://newsapi.org/v2/everything. The code to get the JSON response from the API can be found here. This dataset includes the author, title, description, and content of each article. The title and content columns are kept and all other columns are discarded. After basic data cleaning such as removing null and NaN values, the content column is renamed text and the title column is renamed summary, just like the first dataset. A single response contains only 100 records, so more news articles are captured in the same manner with additional requests. Finally, all the news articles retrieved from these API calls are put together in a single data frame, which is converted to CSV.
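The fetch-and-clean steps above can be sketched as below. This is a minimal sketch, assuming the standard-library `urllib` for the HTTP call; function names and the query parameter are assumptions, and a real run requires a valid NewsAPI key and a network connection.

```python
import json
import urllib.parse
import urllib.request

import pandas as pd

NEWSAPI_URL = "https://newsapi.org/v2/everything"

def fetch_articles(api_key, query):
    """Call the NewsAPI /everything endpoint and return the raw
    article list from the JSON response. (Needs network access.)"""
    params = urllib.parse.urlencode({"q": query, "apiKey": api_key})
    with urllib.request.urlopen(f"{NEWSAPI_URL}?{params}") as resp:
        return json.load(resp)["articles"]

def clean_articles(articles):
    """Keep only content/title, rename to text/summary, drop nulls."""
    df = pd.DataFrame(articles)[["content", "title"]]
    return df.rename(columns={"content": "text", "title": "summary"}).dropna()
```

Keeping the fetch and clean steps separate makes the cleaning logic reusable across the multiple API requests mentioned above, since each response can be cleaned the same way and concatenated afterward.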

Data after cleaning
Raw Data from hugging face containing reviews

The third dataset consists of Amazon reviews, obtained from Hugging Face. The raw data is in JSON format and is read into a data frame for preprocessing. It contains information about the product, review_body, review_title, and product category. For the text summarization model, only the review_body and review_title columns are retained, renamed text and summary respectively. The data is checked for null values, and any such rows are dropped. The final dataset is converted to CSV for further analysis.
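The preprocessing for the reviews can be sketched as follows, assuming the raw JSON has already been read into a DataFrame (e.g. with `pd.read_json(path, lines=True)` if it is line-delimited); the function name is an assumption.

```python
import pandas as pd

def clean_reviews(df):
    """Keep review_body/review_title, rename them to text/summary,
    and drop rows with null values."""
    df = df[["review_body", "review_title"]]
    df = df.rename(columns={"review_body": "text", "review_title": "summary"})
    return df.dropna()
```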

Data after cleaning

An experimental venture as part of this project involves YouTube video summarization: a YouTube video is converted to audio, and the audio is then transcribed to text. All of this is done in Python in a single function that can be run on multiple video links to build a new dataset. This work is still in progress and requires more research.

CODE

 

The code for all the data cleaning and visualization can be found here.

