Introduction
Hello, world! I am back with another cool project: an "NLP text summarizer", an NLP-based tool that shortens a long document. In this article, I will show you how to summarize a long text file, and we will also see how to summarize a Wikipedia page. I will show you two methods for document summarization: the first one teaches you how things actually work, and the second one is for when you just want a ready-made tool. The model we will build is basic and easy to follow, so you can understand how a simple text summarizer is put together. Okay, let's do it!
What is a text summarizer?
An NLP text summarizer is one of the cool applications of NLP (Natural Language Processing): it condenses a long document into a shorter one while retaining the important information. You can see it in action in many news apps and websites.
Read more –> Everything you need to learn about NLP
Idea behind this project
So how are we going to do this? There are lots of ways, but today I will show you the most basic and easy version of an NLP text summarizer, which you can build by yourself very easily. Below are the steps we are going to perform.
Steps
- After loading the data, we will create two copies of it: one for the word vector and one for the sentence vector.
- On the word-vector copy, we will first clean the text by removing useless elements like stopwords, punctuation marks, etc.
- Then we will tokenize the cleaned text (split it into words) and create a count vector out of it. If you don't know what a count vectorizer is, here's the idea: it extracts all the unique words from the text (sentence/paragraph/document) and builds a word matrix where each unique word is a column and each text is a row, so the matrix shows the number of occurrences of each word in each document (see the small example after this list). In case you are wondering why we do this weird stuff: it helps us find the most frequently used words, which usually turn out to be the words that add the most value to the document.
- After calculating the count vector, we will move to the second part of the process: handling the sentence vector. Here we will not remove stopwords and punctuation marks, because if we did, we would no longer be able to read the document normally. We will create a sentence vector, which is nothing but a list of all the sentences in the document.
- Now comes the most important part. In this step we will score each sentence by taking each word of the sentence, looking up its score in the word vector, and summing up the values of all the words in that sentence. After doing that we will have a score for every sentence, which we will use to filter out the most important sentences from the document.
- In this final step, we will create the actual summary of the document. We will set a threshold score (for example, 1.5 × the average sentence score) and keep a sentence only if its score passes the threshold.
The threshold can be anything; you can tweak it yourself to find out how short a summary fits you best.
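To make the count-vector idea concrete, here is a tiny sketch using scikit-learn's CountVectorizer. This is only for illustration; in the walkthrough below we build the frequency table ourselves with NLTK.
# a minimal count-vector demo on two toy texts (illustration only)
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the cat slept"]
vec = CountVectorizer()
matrix = vec.fit_transform(docs)

print(vec.get_feature_names_out())  # unique words become columns: ['cat' 'mat' 'on' 'sat' 'slept' 'the']
print(matrix.toarray())             # one row per text: [[1 1 1 1 0 2], [1 0 0 0 1 1]]
(On older scikit-learn versions you may need get_feature_names() instead of get_feature_names_out().)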
Also read –> How to do twitter sentiment analysis? Hands on with Elon Musk’s Tweets.
NLP Text Summarizer
I am doing this in Google Colab and I am attaching its notebook file, so don't worry about the source code. We will start by importing all the libraries.
# importing libraries
import re
import nltk
import spacy
import string
import numpy as np
import pandas as pd
# downloading required files (download if needed)
nltk.download('stopwords')
nltk.download('punkt')
!pip install wikipedia
import wikipedia
In this section we will read a document or any text file. If you have a pre-written document you can upload it; just uncomment the file-reading lines below and comment out the Wikipedia lines. For demonstration purposes we will use Albert Einstein's bio from Wikipedia; if you want, you can enter someone else's name to extract their data.
# Input text - to summarize
# text = open("Elon Musk.txt",'r',encoding='utf8')
# text=text.read()
# text
# from Wikipedia
Albert_Einstein = wikipedia.page('Albert_Einstein')
text = Albert_Einstein.content
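If you try other names, the wikipedia library can raise a DisambiguationError for ambiguous titles. A small optional guard, not required for this demo, could look like this:
# optional: handle ambiguous Wikipedia titles (illustrative sketch)
try:
    page = wikipedia.page('Albert_Einstein', auto_suggest=False)
    text = page.content
except wikipedia.exceptions.DisambiguationError as e:
    print('Ambiguous title; some options are:', e.options[:5])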
Okay, it's time to clean the extracted text: we will remove all stopwords and punctuation. We will store the cleaned data in a separate variable so that our original data remains unchanged. However, we remove useless elements (reference markers like "[34]", section headings like "== See also ==", etc.) from the original data as well:
# removing waste elements
text = re.sub(r'\[[a-zA-Z0-9\s]+\]', '', text)  # remove reference markers like [34]
text = re.sub(r'=+\s+\d?-?[a-zA-Z0-9\:–\s-]+\s+=+', '', text)  # remove section headings like == See also ==
exclist = string.punctuation
# build a translation table that strips punctuation, then lowercase the text
table_ = str.maketrans('', '', exclist)
newtext = text.translate(table_).lower()
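Here is a quick sanity check of the two regex substitutions on a made-up snippet (illustration only; the real text comes from Wikipedia):
# illustrative check of the cleaning regexes on a toy string
sample = "Einstein won the Nobel Prize.[12] == See also == His theory of relativity."
sample = re.sub(r'\[[a-zA-Z0-9\s]+\]', '', sample)
sample = re.sub(r'=+\s+\d?-?[a-zA-Z0-9\:–\s-]+\s+=+', '', sample)
print(sample)  # "Einstein won the Nobel Prize.  His theory of relativity." (a leftover double space is harmless here)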
Now, from the cleaned text we will separate each word (word tokenization) and then create a word vector, which shows the number of occurrences of each word in the document.
# creating frequency table
def freqTab(text, stopwords):
    # split the text into words
    wrd_tkn = nltk.tokenize.word_tokenize(text)
    without_stpwrds = [x for x in wrd_tkn if x not in stopwords]
    # count word frequency
    word_dict = {}
    for i in without_stpwrds:
        if i in word_dict:
            word_dict[i] += 1
        else:
            word_dict[i] = 1
    return word_dict

stpWrd = nltk.corpus.stopwords.words('english')
word_count = freqTab(newtext, stpWrd)
FreqTable = pd.DataFrame({'Key': list(word_count.keys()),
                          'Value': list(word_count.values())})
FreqTable.sort_values(by='Value', ascending=False)
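To see what freqTab returns, here is a quick check on a toy sentence (illustration only, not the Wikipedia data):
# toy check of freqTab: stopwords like "the", "on", "and" are dropped
toy = "the cat sat on the mat and the cat slept"
print(freqTab(toy, stpWrd))  # {'cat': 2, 'sat': 1, 'mat': 1, 'slept': 1}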
Here we separate the document into sentences (sentence tokenization) and define a function that scores each sentence using the word counts.
# creating sentence tokenizer
sent_token = nltk.tokenize.sent_tokenize(text)  # using the original text to keep punctuation marks

def sent_weight(sentence, word_count):
    sent_rank = {}
    for sent in sentence:
        for word, freq in word_count.items():
            if word in sent.lower():
                if sent in sent_rank:
                    sent_rank[sent] += freq
                else:
                    sent_rank[sent] = freq
    return sent_rank

sent_rank = sent_weight(sent_token, word_count)
sent_token
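A quick toy example of how sent_weight scores a sentence: each word's frequency is added once if that word appears anywhere in the sentence.
# toy check of sent_weight (illustration only)
toy_counts = {'cat': 2, 'sat': 1, 'mat': 1}
toy_sents = ['The cat sat on the mat.', 'Nothing relevant here.']
print(sent_weight(toy_sents, toy_counts))  # {'The cat sat on the mat.': 4} -> 2 (cat) + 1 (sat) + 1 (mat)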
Now that every sentence has been ranked based on the occurrence counts of its words from the word vector, let's look at the average sentence score, which we will use to set the threshold.
# let's see what the average score is
average = sum(sent_rank.values()) / len(sent_rank)
print(f'Total sentences: {len(sent_rank)} \nTotal score: {sum(sent_rank.values())} \nAverage score: {average}')
After that we will keep every sentence whose score is above the threshold value, which in our case is 2.2 × the average score (you can change the threshold according to your needs). And finally we will generate our summary.
# let's summarize it
# storing the selected sentences in our summary
summary = ''
counter = 0
for sentence in sent_token:
    if (sentence in sent_rank) and (sent_rank[sentence] > (2.2 * average)):
        summary += " " + sentence
        counter += 1
summary
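To get a rough feel for how much the document was compressed, you can compare the sentence and character counts (a simple check, nothing more):
# rough compression check
print(f'Sentences: {len(sent_token)} -> {counter}')
print(f'Characters: {len(text)} -> {len(summary)}')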
The second method we were talking about uses Gensim. It is as simple as it looks: you just import the library and throw your data at it, that's it. You can adjust the size of the summary with the ratio parameter. Note that the summarization module ships with Gensim 3.x; it was removed in Gensim 4.0, so you may need to pin an older version.
If you want to learn more about Gensim and its tutorials, you can read them here.
# Gensim's extractive summarizer (requires gensim < 4.0, e.g. pip install "gensim<4")
from gensim.summarization import summarize
summarize(text, ratio=0.15)
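If you prefer to cap the summary by length instead of a ratio, Gensim's summarize also accepts a word_count parameter:
# cap the summary at roughly 200 words instead of using a ratio
summarize(text, word_count=200)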
If you want, you can add some extra steps to this NLP text summarizer to make it even better, for example lemmatization, which converts words to their base form (running –> run, better –> good, etc.); a tiny sketch of that is shown below.
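As a hint of what that would look like, here is a minimal lemmatization sketch with NLTK's WordNetLemmatizer (not part of the summarizer above; note that it needs the part of speech to handle cases like "better"):
# minimal lemmatization sketch (illustration only)
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('running', pos='v'))  # run
print(lemmatizer.lemmatize('better', pos='a'))   # good
Well, I hope you liked this basic NLP text summarizer project 😀. What do you think? Comment down below.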