
Automatic Text Summarization System Using Transformers


Trinh Nguyen

Aug 03, 2022

Are you tired of reading long papers? An automatic text summarization system using Transformers can help you deal with that. In this article, we’ll show you how to build a summarization system using HuggingFace and Streamlit.

 

Let’s try to summarize a paper about “How BTS Became The Undisputed Kings Of K-Pop”.

Figure 1: Paper about BTS. Source

 

The summarized result is amazing:

Figure 2: Summarized result.

Easy text summarization using Transformers

Abstract

We present a system that can summarize a paper using Transformers. It uses two models: BART and PEGASUS. The former pre-trains a model that combines a Bidirectional encoder with an Auto-Regressive decoder, while the latter, PEGASUS, is a state-of-the-art model for abstractive text summarization.

Nowadays, there are two main approaches to automatic text summarization in AI: Extractive Summarization and Abstractive Summarization. In this post, we focus on Abstractive Summarization because it is more advanced and closer to human-like interpretation. It shows more potential and is generally more interesting for researchers and developers.

Introduction

Summarizing reduces a text to its main idea and the most necessary information. This process helps you better understand a document and learn its key ideas. You can use summaries for annotation and study notes.

Today, we will talk about several attempts to automate the summarizing process and see how they work.

Related work

If you aren’t familiar with Transformers and attention mechanisms, check our previous blog post to get a general understanding of them.

Extractive Summarization: The extractive approach selects the most important phrases and lines from your documents. It then combines all the important lines to create a summary. So, in this case, every line and word of the summary actually belongs to the original.

Abstractive Summarization: This approach uses new phrases and terms that are different from the original document, keeping the meaning the same, just like how humans do in summarization. So, it is much harder than the extractive approach.

Figure 3: Extractive summarization and Abstractive summarization example.
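
To make the contrast concrete, here is a toy extractive sketch: it scores sentences by word frequency and copies the top-scoring ones verbatim, which is exactly what the abstractive models below do not do. The scoring rule is purely illustrative, not a production method.

from collections import Counter
import re

# Toy extractive summarizer: rank sentences by average word frequency
# and return the top-k sentences verbatim (no new words are generated).
def extractive_summary(text, k=2):
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    freq = Counter(re.findall(r'\w+', text.lower()))
    def score(sentence):
        tokens = re.findall(r'\w+', sentence.lower())
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)
    top = set(sorted(sentences, key=score, reverse=True)[:k])
    # Keep the selected sentences in their original order
    return ' '.join(s for s in sentences if s in top)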

Overall architecture

At the end of 2019, researchers at Facebook AI published a new model for Natural Language Processing (NLP) called BART (Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension). The BART transformer has outperformed other models in the NLP field, achieving new state-of-the-art results on a range of abstractive dialogue, question answering, and summarization tasks, with gains of up to 6 ROUGE.

Then, in 2020, researchers at Google AI introduced another new NLP model called PEGASUS (Pre-training with Extracted Gap-Sentences for Abstractive Summarization). It achieves state-of-the-art results on 12 diverse summarization datasets.

Table of Contents

  • What is BART?
  • What is PEGASUS?
  • Dataset
  • Implement an automatic text summarization system using HuggingFace.

What is BART?

BART (Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension) was developed by Facebook AI in 2019. It uses a standard Transformer-based neural machine translation architecture which, despite its simplicity, can be seen as generalizing BERT (bidirectional encoder) and GPT (left-to-right decoder).

Figure 4: BERT encoder. Source

BERT: Random tokens are replaced with the token [MASK], and the document is encoded bidirectionally. Missing tokens are predicted independently, so it’s difficult to use BERT for generation.
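
For instance, with the HuggingFace fill-mask pipeline (using bert-base-uncased here as an example checkpoint), each [MASK] token is predicted independently from its bidirectional context:

from transformers import pipeline

# BERT-style masked prediction: every [MASK] is filled independently.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
predictions = fill_mask("The BTS concert was completely [MASK] out.")
for p in predictions[:3]:
    print(p["token_str"], round(p["score"], 3))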

Figure 5: GPT decoder. Source

GPT: Tokens are predicted auto-regressively, meaning GPT can be used for generation. However, words can only condition on the leftward context, so the model cannot learn bidirectional interactions.
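
In code, this left-to-right behaviour is just plain text generation; here is a minimal sketch with the text-generation pipeline and the gpt2 checkpoint (chosen only as an example):

from transformers import pipeline

# GPT-style autoregressive generation: each new token conditions only on the left context.
generator = pipeline("text-generation", model="gpt2")
result = generator("BTS became the undisputed kings of K-pop because",
                   max_length=40, num_return_sequences=1)
print(result[0]["generated_text"])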

Figure 6: BART, which combines BERT and GPT. Source

BART: Inputs to the encoder don’t need to be aligned with the decoder outputs. Here, a document has been corrupted by replacing spans of text with [MASK] symbols. The corrupted document (left) is encoded with a bidirectional encoder, and then the likelihood of the original document (right) is calculated with an autoregressive decoder.

Because the BART transformer has an autoregressive decoder, it can be fine-tuned for sequence generation tasks such as summarization. In summarization, information is copied from the input but manipulated, which is closely related to the denoising pre-training objective. Here, the encoder receives the input sequence, and the decoder generates the output autoregressively.
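
As a quick sanity check before the full Streamlit demo below, the fine-tuned facebook/bart-large-cnn checkpoint can be called through the summarization pipeline; the length limits here are arbitrary choices for illustration, not values from the paper:

from transformers import pipeline

# Abstractive summarization with a BART checkpoint fine-tuned on CNN/DailyMail.
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
text = ("BTS debuted in 2013 and went on to top the Billboard charts, "
        "sell out stadiums worldwide, and address the United Nations. "
        "Their fanbase, known as ARMY, is credited with much of that rise.")
print(summarizer(text, max_length=60, min_length=20, do_sample=False)[0]["summary_text"])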

What is PEGASUS?

PEGASUS, which stands for Pre-training with Extracted Gap-Sentences for Abstractive Summarization, was developed by Google AI in 2020. The authors propose pre-training large Transformer-based encoder-decoder models on massive text corpora with a new self-supervised objective.

In PEGASUS, several complete sentences are masked out of a document, and the model is trained to predict them. The input is the document with the missing sentences; PEGASUS recovers them, and the output consists of the missing sentences concatenated together. This task is called Gap Sentence Generation (GSG).

Figure 7: Gap Sentence Generation. Source

Although the main contribution of PEGASUS is Gap Sentence Generation, its base architecture includes both an encoder and a decoder, so PEGASUS also pre-trains the encoder as a masked language model.

In the encoder module, we randomly mask words in the sequence and use the other words in the sequence to predict these masked words.

 

Figure 8: Masked language modeling (encoder). Source

In PEGASUS, the encoder objective (MLM) and the decoder objective (GSG) are trained simultaneously.

Originally there are three sentences. One is masked with [MASK1] and used as the target generation text (GSG). The other two are still retained in the input, but some of their words are randomly masked with [MASK2] (MLM).

Figure 9: PEGASUS architecture. Source
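
To make the scheme in Figure 9 concrete, below is a simplified, purely illustrative sketch of how one training example could be assembled: one sentence is removed and becomes the target (GSG), while random words in the remaining sentences are masked (MLM). This is not the official PEGASUS preprocessing code, just a toy reconstruction of the idea.

import random

# Toy construction of a PEGASUS-style training example (not the official preprocessing).
def build_example(sentences, gap_index, mlm_prob=0.15, seed=0):
    random.seed(seed)
    inputs, target = [], sentences[gap_index]
    for i, sent in enumerate(sentences):
        if i == gap_index:
            inputs.append("[MASK1]")  # GSG: the whole sentence becomes the target
            continue
        words = [w if random.random() > mlm_prob else "[MASK2]"  # MLM: mask some words
                 for w in sent.split()]
        inputs.append(" ".join(words))
    return " ".join(inputs), target

doc = ["BTS debuted in 2013.",
       "They went on to top the Billboard charts.",
       "Their fans are known as ARMY."]
print(build_example(doc, gap_index=1))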

The Dataset

The dataset we can use for training both BART and PEGASUS is the CNN/DailyMail dataset.

This dataset has two features:

  • The article – The full text of the news article.
  • The highlights – The key points of the article, which serve as the reference summary.

The CNN/DailyMail dataset (Hermann et al., 2015) contains more than 300,000 articles (about 93k from CNN and 220k from the Daily Mail), and each article comes with several highlights.

  • Average article length: ~766 words
  • Average summary length: ~53 words
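
The dataset is hosted on the HuggingFace Hub, so one convenient way to inspect these two fields is the datasets library (the "3.0.0" configuration used here is the commonly used non-anonymized version):

from datasets import load_dataset

# Load CNN/DailyMail and look at one article/highlights pair.
dataset = load_dataset("cnn_dailymail", "3.0.0", split="train")
sample = dataset[0]
print(sample["article"][:300])   # the news article text
print(sample["highlights"])      # the reference summary (bullet-style highlights)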

Let’s have a walk-through of the code!

In this code, we use Newspaper3k and Streamlit to build a simple demo.

Install dependencies

pip install transformers
pip install newspaper3k
pip install streamlit

Run the code

from transformers import BartTokenizer, BartForConditionalGeneration
from transformers import PegasusForConditionalGeneration, PegasusTokenizer
import torch
import streamlit as st
import time
from newspaper import Article

# Run on GPU when available, otherwise fall back to CPU
if torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")

st.title('Text Summarization Demo')
st.markdown('Orient Development team')
model = st.selectbox('Model', ["Bart", "Pegasus"])

# Fetch the article text from the given URL with Newspaper3k
link_paper = st.text_area('URL of paper')
input_text = ""
if link_paper:
    article = Article(link_paper)
    article.download()
    article.parse()
    input_text = article.text
    st.text(input_text)

def run_model(input_text):
    start_time = time.time()
    if model == "Bart":
        bart_model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn").to(device)
        bart_tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")

        input_text = ' '.join(input_text.split())
        input_tokenized = bart_tokenizer.encode(input_text, truncation=True, max_length=1024,
                                                return_tensors='pt').to(device)

        summary_ids = bart_model.generate(input_tokenized,
                                          num_beams=4,
                                          num_return_sequences=1,
                                          no_repeat_ngram_size=2,
                                          length_penalty=1,
                                          min_length=12,
                                          max_length=128,
                                          early_stopping=True)

        output = [bart_tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False)
                  for g in summary_ids]
        st.write('Summary')
        st.success(output)
    else:
        pegasus_model = PegasusForConditionalGeneration.from_pretrained("google/pegasus-cnn_dailymail").to(device)
        pegasus_tokenizer = PegasusTokenizer.from_pretrained("google/pegasus-cnn_dailymail")

        input_text = ' '.join(input_text.split())
        batch = pegasus_tokenizer(input_text, truncation=True, padding='longest',
                                  return_tensors="pt").to(device)

        summary_ids = pegasus_model.generate(**batch,
                                             num_beams=6,
                                             num_return_sequences=1,
                                             no_repeat_ngram_size=2,
                                             length_penalty=1,
                                             min_length=30,
                                             max_length=128,
                                             early_stopping=True)

        output = pegasus_tokenizer.batch_decode(summary_ids, skip_special_tokens=True,
                                                clean_up_tokenization_spaces=False)
        st.write("Summary")
        st.success(output)
    print("--- %s seconds ---" % (time.time() - start_time))

if st.button('Submit'):
    run_model(input_text)
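
Assuming the script is saved as app.py, it can be launched locally with streamlit run app.py. Streamlit then serves the demo in the browser, where you paste an article URL, pick a model, and press Submit.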

Result

We’ve tried several articles and got good summarization results.

Text: https://e.vnexpress.net/news/news/pm-orders-covid-19-inoculation-starting-this-week-4242380.html

Summary generated by BART:

“Prime Minister Nguyen Xuan Phuc requires the Health Ministry to commence Covid-19 vaccination for prioritized groups from this week. The poor, families under preferential treatment and some priority groups need to be inoculated with the vaccine quickly, he told a Tuesday meeting.”

Summary generated by PEGASUS:

“Vietnam has set a target of immunizing 10 million people against the deadly Covid-19 influenza virus by the end of the year, the government has said.”

Text: https://www.dw.com/en/coronavirus-home-tests-will-give-germany-more-freedom/a-56677136

Summary generated by BART:

“Jens Spahn says home coronavirus tests are an important step on the return to normalcy. Three such self-administered rapid antigen tests have been given special approval for use. German Chancellor Angela Merkel echoed her health minister in emphasizing the importance of treating those who are and are not vaccinated the same.”

Summary generated by PEGASUS:

“Germany’s health minister says the country is “on the right path” in its efforts to return to normal following the H1N1 pandemic.”

Text: https://www.euronews.com/2021/03/02/covid-19-unrealistic-for-pandemic-to-be-over-by-year-end-says-who

Summary generated by BART:

“World Health Organisation says it’s “unrealistic” to think the COVID-19 pandemic will be over before the end of the year. The number of new cases rose globally in the week ending February 22 — the first weekly increase recorded since early January. Confirmed cases roses in Americas, Eastern Mediterranean, Europe, and South-East Asia.”

Summary generated by PEGASUS:

“It started “unrealistic” to think the first CO-19 pandemic will be over before the end of the year, a top World Health Organisation official stressed on Monday.”

Conclusion

The BART transformer model trained on CNN/DailyMail data delivers good performance and produces fluent summaries. However, we think it still has some weaknesses. Since the BART model is trained on an English vocabulary, we can’t use it for other languages. Plus, BART may miss some keywords that researchers might want to see as part of the summary.

The PEGASUS model, which is also trained on CNN/DailyMail data, produces shorter summaries than the BART model. However, the summary isn’t always meaningful or correct. In the second text, PEGASUS mistakes Covid-19 for the H1N1 pandemic. Sometimes, PEGASUS produces incorrect information like that.

Medium link: Automatic Text Summarization System Using Transformers