DEV Community

Cover image for Finding The Most Important Sentences Using Data Science
Autumn
Autumn

Posted on • Originally published at skerritt.blog on

Finding The Most Important Sentences Using Data Science

The GitHub repo is here:

GitHub logo bee-san / tldr-News

🏖️ Generates a TL;DR of news using Natural Language Processing 🏖️

TL;DR

Uses an algorithm to determine the most important sentences in a news article and displays them at the top of the news article. Only works for BBC news articles. Isn't a proper Firefox extension either (although you're free to create one).

img

Everything inside the red box has been selected by the algorithm as the most important sentences, ordered from most important to least important.

What algorithm?

Term Frequency * Inverse Document Frequency

I explain all the code and the algorithm in the below blog post. Enjoy

https://skerritt.blog/tfidf/




We’re going to create a summary of BBC News Articles and place them at the top using a Firefox extension. This article is about the gnarly algorithm Term Frequency-Inverse Document Frequency (TF-IDF). We’re going to create a real-world usage in the form of a Firefox extension. I know what you’re thinking. “TF-IDF? Yawn 😴” but bare with me, it’s quite interesting!

When we’re done, it’ll look like this:

News article found here. Red box highlights the most important sentences. The red box is not in the final product, is only used for illustration purposes.

I promise you it's not as hard / boring as the algorithm's name makes it out to be!

Term frequency * Inverse Document Frequency

Don’t worry, the name of the algorithm makes me fall asleep every time I hear it said out loud too. This algorithm is 2 algorithms multiplied together. Let’s see how both of these work:

Term Frequency

Term frequency (TF) is how often a word appears in a document, divided by how many words there are.

The Term Frequency (TF) of a term, t, and a document, d.

Let’s say you’re reading a news article on Brexit. The word “Brexit” will appear a lot, so the term frequency of the word “Brexit” is high.

Quite often, we would want to build a dictionary (hashmap) of term frequencies alongside the term. Like {word: term frequency of that word} and then iterate through this dictionary to find out which word appears the most times.

Now, what if I told you that the term frequency dictionary would look a little like this:

{"and": 0.87, "the": 0.73}
Enter fullscreen mode Exit fullscreen mode

You can see how these common English words aren’t useful to us. Of course, most English texts have these words in them, but we call English words like these stopwords. Stopwords usually refer to the most common words in a language, although there isn’t one singular definition. You have to choose stopwords per usage. You have to decide on what words to use. Before processing some text you’ll usually want to remove stopwords to better process the text.

Words with capital letters in them differ from words without capitals. In programming, “Africa” and “africa” are two different things. Because of this, we want to turn everything into lowercase or uppercase to better process our text. We’re going to turn all words into lowercase.

Given a string, we want to remove stop words and turn it into lowercase. Our extension will give us a string of all text on a BBC news article. Don’t worry about where we get the text from yet, that’s done later in the Firefox extension section. For now, assume we have text that looks like this:

... These are external links and will open in a new windowA neat feature of many modern laptops is the ability to power them up through the USB port. Unlike the rectangular USB ports of old, the newer type - USB-C - can carry enough power to charge your 
machine.That’s great news: it means ...
Enter fullscreen mode Exit fullscreen mode

The above text is shortened to prevent the reader from falling asleep.

function prettify(document){
    // Turns an array of words into lowercase and removes stopwords
    const stopwords = ["a", "share", "linkthese", "about", "above", "after", "again", "against", "all", "am", "an", "and", "any","", "are","aren't","as","at","be","because","been","before","being","below","between","both","but","by","can't","cannot","could","couldn't","did","didn't","do","does","doesn't","doing","don't","down","during","each","few","for","from","further","had","hadn't","has","hasn't","have","haven't","having","he","he'd","he'll","he's","her","here","here's","hers","herself","him","himself","his","how","how's","i","i'd","i'll","i'm","i've","if","in","into","is","isn't","it","it's","its","itself","let's","me","more","most","mustn't","my","myself","no","nor","not","of","off","on","once","only","or","other","ought","our","ours","ourselves","out","over","own","same","shan't","she","she'd","she'll","she's","should","shouldn't","so","some","such","than","that","that's","the","their","theirs","them","themselves","then","there","there's","these","they","they'd","they'll","they're","they've","this","those","through","to","too","under","until","up","very","was","wasn't","we","we'd","we'll","we're","we've","were","weren't","what","what's","when","when's","where","where's","which","while","who","who's","whom","why","why's","with","won't","would","wouldn't","you","you'd","you'll","you're","you've","your","yours","yourself","yourselves", "this"];
    // turn document into lowercase words, remove all stopwords
    var document = document.replace(/[.,]/g, '');
    let document_in_lowercase = document.split(" ").map(function(x){ return x.toLowerCase() });
    return document_in_lowercase.filter( x => !stopwords.includes(x) );
}
Enter fullscreen mode Exit fullscreen mode

This is the function which will “prettify” our documents. Line 3 is an array of stopwords I found on StackOverflow. I added “share” and “linkthese” since these are commons words in the news article we don’t want.

Line 5 is Regex. The square brackets mean or. [,.] means “activate on a comma or a full stop ”. /g means global. Once you find one ‘,’ or ‘.’ don’t stop, continue searching the string. The empty string is what we’re replacing it with. If we find a full stop or a comma, replace it with nothing — delete it. This is because the words “Africa.” and “Africa” would be classified as two different words without this.

From my Instagram, @Brandon.codes

Line 4 splits the document up into separate words. The map function applies a function to every element in an array. Once the string is split up into an array of words we apply the toLowerCase() method to every element. It makes every word lowercase.

From my Instagram, @Brandon.codes. The arrows are Arrow Functions (Lambda / Anonymous functions).

We then return the lowercase words once we’ve filtered out stop words. Filter() creates a new array with only the elements for which the function inside returns True.

If a word is a stop word, it’ll result in True which means we’ll get a new array of only the stopwords in the document. We use the negation operator “!” to get the opposite, which is what we want. To return a list of words with no stopwords in it.

Now we want to count how many times each word appears in the document. This will be useful for both Term Frequency and Inverse Document Frequency. First, we want to get all the unique words from an array of words.

function uniqueWords(words){
    const unique_words_set = new Set(words);
    return unique_words = Array.from(unique_words_set);
}
Enter fullscreen mode Exit fullscreen mode

We convert the array into a set because sets have no repetitions. This lets us get only the unique words in the array. Sets also don’t have an order, so we can’t use array indices to access elements. We need to turn it straight back into an array. For more about Set Theory, check out this article I wrote.

Okay, now time to count how many times a word appears in the words array.

function countWords(words){
    // returns a dictionary of {WORD: COUNT} where count is
    // how many times that word appears in "words".
    const unique_words = uniqueWords(words);
    let dict = {};
    // for every single unique word
    for (let i = 0; i <= unique_words.length - 1; i++){
        dict[unique_words[i]] = 0
        // see how many times this unique word appears in all words
        for (let x = 0; x <= words_without_stopwords.length -1; x++){
            if (unique_words[i] == words[x]){
                dict[unique_words[i]] = dict[unique_words[i]] + 1;
            }
        }
    }
    return dict;
}
Enter fullscreen mode Exit fullscreen mode

This function goes through every single unique word and counts how many times that word appears in the array of words. The Term frequency function is quite long, so I’m going to break it down.

function termFrequency(document){
    // calculates term frequency of each sentence
    words_without_stopwords = prettify(document);

    // gets rid of trailing spaces
    const sentences = document.split(".").map(item => item.trim());
    sentences[0] = sentences[0].substring(146);

    const TFVals = countWords(words_without_stopwords)
    const unique_words = uniqueWords(words_without_stopwords);

    // actually makes it TF values according to formula
    for (const [key, value] of Object.entries(TFVals)){
        TFVals[key] = TFVals[key] / words_without_stopwords.length;
}
Enter fullscreen mode Exit fullscreen mode

Line 6 divides the document up into sentences. Sometimes sentences have white space before them. “Brandon. Dogs.” Has whitespace before “Dogs.” we apply the trim() method to each item to get rid of these trailing whitespace.

Regarding line 7, The first 146 characters of the first word is social media links. The rest of that word is a title or sub-title. Here, look:

Share this withEmailFacebookMessengerMessengerTwitterPinterestWhatsAppLinkedInCopy this linkThese are external links and will open in a new window **Ryanair is tightening the rules on what passengers pay to take luggage onto the plane, in order to "speed up boarding".**
Enter fullscreen mode Exit fullscreen mode

This is annoying, as the title is an essential part of the story and needs to be taken into account. So we remove the first 146 characters of the first word to get:

Ryanair is tightening the rules on what passengers pay to take luggage onto the plane, in order to "speed up boarding"
Enter fullscreen mode Exit fullscreen mode

Remember this formula?

The Term Frequency (TF) of a term, t, and a document, d.

The variable “TFVals” is calculating this formula. If we run the sentence “Hello, my name is Brandon. Brandon Brandon. The elephant jumps over the moon” through the term frequency function, we’ll get something that looks like this:

TF only returns 1 “Brandon”, so not every Brandon has a TF score. For this example, bare in mind that TF only returns 1 Brandon because we are calculating the singular term (word) frequency.

We have the term frequencies of words, but we want to calculate the most important sentences , not words. To do that, we go through every single sentence and see what words come up in that sentence which are in TFVals.

Remember earlier how I said we don’t count all 3 “Brandon”’s? I’ve done that now. Also, note that we’re splitting this up into sentences. There are 3 sentences in this image.

We just need to add them all up and divide by how many words we have. Since we’re only adding up the TF values of non stop words, it’s only fair if we divide by how many non stop words there are, instead of how many words there are in a sentence. If we don’t divide by how many words we have, long sentences have an advantage over shorter ones.

The sentence with the highest TF is “Brandon Brandon.”

This is what line 20 onwards does below. We go through every single sentence and calculate the TF values of each sentence, just as we did above.

function termFrequency(document){
    // calculates term frequency of each sentence
    words_without_stopwords = prettify(document);

    // gets rid of trailing spaces
    const sentences = document.split(".").map(item => item.trim());
    sentences[0] = sentences[0].substring(146);

    const TFVals = countWords(words_without_stopwords)
    const unique_words = uniqueWords(words_without_stopwords);

    // actually makes it TF values according to formula
    for (const [key, value] of Object.entries(TFVals)){
        TFVals[key] = TFVals[key] / words_without_stopwords.length;
    }

    // splits it up into sentences now
    var TFSentences = {};
    // for every sentence
    for (let i = 0; i <= sentences.length - 1; i ++){
        // for every word in that sentence
        let sentence_split_words = sentences[i].split(" ");
        // get the assiocated TF values of each word
        // temp.add is the "TF" value of a sentence, we need to divide it at the end
        let temp_add = 0.0;
        let words_no_stop_words_length = prettify(sentences[i]).length;
        for (let x = 0; x <= sentence_split_words.length - 1; x++){
            // get the assiocated TF value and add it to temp_add
            if (sentence_split_words[x].toLowerCase() in TFVals){
                // adds all the TF values up
                temp_add = temp_add + TFVals[sentence_split_words[x].toLowerCase()];
            }
            else{
                // nothing, since it's a stop word.
            }
        }
        // TF sentences divide by X number of items on top
        TFSentences[sentences[i]] = temp_add / words_no_stop_words_length;
    }

    return TFSentences;
}
Enter fullscreen mode Exit fullscreen mode

And that’s it. But we have a problem with using only Term Frequency. As you may have seen earlier, “Brandon Brandon” was the highest scoring TF out of all 3 sentences we looked at.

Popularity isn’t enough. We don’t want sentences that have the most keywords in them as they may not make sense, or they may be repetitions of each other. Such as in the “Brandon” Brandon” sentence. It has a high TF value but doesn’t hold much content.

It doesn’t contain much information and isn’t useful. We want a sentence that is both rare, unique and contains keywords common in the article. This is where inverse document frequency comes in.

Inverse document frequency

Term frequency is how common a word is, inverse document frequency (IDF) is how unique or rare a word is. The formula for IDF is:

t is the term and d is the documents. We have multiple documents, we’re treating each sentence as its own document.

IDF used over many documents, whereas TF is built for one document. You can decide what a document is. In this article, each sentence is its own document.

The first few steps of IDF is the same as TF. We prettify the document, count the words in the document and get all the unique words.

function inverseDocumentFrequency(document){
    // calculates the inverse document frequency of every sentence
    const words_without_stopwords = prettify(document);
    const unique_words_set = uniqueWords(words_without_stopwords);

    const sentences = document.split(".").map(item => item.trim());
    sentences[0] = sentences[0].substring(146);

    const lengthOfDocuments = sentences.length;
    // prettifys each sentence so it doesn't have stopwords

    const wordCountAll = countWords(words_without_stopwords);

    // counts words of each sentence
    // as each sentence is a document
    wordCountSentences = [];
    for (let i = 0; i <= lengthOfDocuments - 1; i ++){
        wordCountSentences.push(countWords(prettify(sentences[i])));
    }

    // calculate TF values of all documents
let IDFVals = {};
Enter fullscreen mode Exit fullscreen mode

Lines 1–6 is nothing new. The for loop on line 17 loops through every sentence in the document. Since each sentence is a new ‘document’, we need to count the words of each sentence individually. We have to prettify them to get rid of the stopwords and turn them into an array of words. We push the wordcount object of each new sentence into wordCountSentences.

There are 3 sentences here. We don’t care how many times Brandon appears over the whole document. I tried to pick the words to give you something interesting without making the maths too large, but in the process I developed narcissism.

We’re now going to go through every single word and count how many times that word appears in every sentence and calculate the IDF score using the below formula.

I use log 10 here. Note: all logarithms are constantly scaled, so it doesn’t matter much on what logarithm you use.

A small example of how IDF is used.

Now we just do this for every non stop word.

And the code for this is:

function inverseDocumentFrequency(document){
    // calculates the inverse document frequency of every sentence
    const words_without_stopwords = prettify(document);
    const unique_words_set = uniqueWords(words_without_stopwords);

    const sentences = document.split(".").map(item => item.trim());
    sentences[0] = sentences[0].substring(146);

    const lengthOfDocuments = sentences.length;
    // prettifys each sentence so it doesn't have stopwords

    const wordCountAll = countWords(words_without_stopwords);

    // counts words of each sentence
    // as each sentence is a document
    wordCountSentences = [];
    for (let i = 0; i <= lengthOfDocuments - 1; i ++){
        wordCountSentences.push(countWords(prettify(sentences[i])));
    }

    // calculate TF values of all documents
    let IDFVals = {};

    // how many times that word appears in all sentences (documents)
    wordCountSentencesLength = wordCountSentences.length;
    // for every unique word
    for (let i = 0; i <= unique_words_set.length - 1; i++){
        let temp_add = 0;
        // count how many times unique word appears in all sentences
        for (let x = 0; x <= wordCountSentencesLength - 1; x++){
            if (unique_words_set[i] in wordCountSentences[x]){
                temp_add =+ 1;
            }
        }
        IDFVals[unique_words_set[i]] = Math.log10(wordCountAll[unique_words_set[i]] / temp_add);
}
Enter fullscreen mode Exit fullscreen mode

Now we want to get the IDF Values of all the sentences, we use the same code from TF here but replace some things to make it work.

If I’m being truthful with you, I did a simple “find and replace” the variables. Instead of “TF” in the comments, I replaced them with IDF. Instead of “TFVals”, I replaced it with “IDFVals”. Nothing important has happened here, so feel free to skip this part.

function inverseDocumentFrequency(document){
    // calculates the inverse document frequency of every sentence
    const words_without_stopwords = prettify(document);
    const unique_words_set = uniqueWords(words_without_stopwords);

    const sentences = document.split(".").map(item => item.trim());
    sentences[0] = sentences[0].substring(146);

    const lengthOfDocuments = sentences.length;
    // prettifys each sentence so it doesn't have stopwords

    const wordCountAll = countWords(words_without_stopwords);

    // counts words of each sentence
    // as each sentence is a document
    wordCountSentences = [];
    for (let i = 0; i <= lengthOfDocuments - 1; i ++){
        wordCountSentences.push(countWords(prettify(sentences[i])));
    }

    // calculate TF values of all documents
    let IDFVals = {};

    // how many times that word appears in all sentences (documents)
    wordCountSentencesLength = wordCountSentences.length;
    // for every unique word
    for (let i = 0; i <= unique_words_set.length - 1; i++){
        let temp_add = 0;
        // count how many times unique word appears in all sentences
        for (let x = 0; x <= wordCountSentencesLength - 1; x++){
            if (unique_words_set[i] in wordCountSentences[x]){
                temp_add =+ 1;
            }
        }
        IDFVals[unique_words_set[i]] = Math.log10(wordCountAll[unique_words_set[i]] / temp_add);
    }

    let IDFSentences = {};
    // for every sentence
    for (let i = 0; i <= lengthOfDocuments - 1; i ++){
        // for every word in that sentence
        let sentence_split_words = sentences[i].split(" ");
        // get the assiocated IDF values of each word
        // temp.add is the "IDF" value of a sentence, we need to divide it at the end
        let temp_add = 0.0;
        let words_no_stop_words_length = prettify(sentences[i]).length;
        for (let x = 0; x <= sentence_split_words.length - 1; x++){
            // if the word is not a stopword, get the assiocated IDF value and add it to temp_add
            if (sentence_split_words[x].toLowerCase() in IDFVals){
                // adds all the IDF values up
                temp_add = temp_add + IDFVals[sentence_split_words[x].toLowerCase()];
            }
            else{
                // nothing, since it's a stop word.
            }
        }
        IDFSentences[sentences[i]] = temp_add / words_no_stop_words_length;
    }
    return IDFSentences;
}
Enter fullscreen mode Exit fullscreen mode

Now that we have written code to calculate the IDF value of sentences; we have completed the formula from earlier. “is a dog” has the highest IDF value.

We now know how unique or rare a sentence is. This isn’t that useful since we want the sentence to be information-rich too. We want some way to combine the popularity of TF with the uniqueness of IDF. This leads us to our next section…

TF-IDF revisited

We now have TF and IDF functions implemented. The only thing left to do is to multiply them together.

function TFIDF(documents){
    // calculates TF*IDF
    const TFVals = termFrequency(documents);
    const IDFVals = inverseDocumentFrequency(documents);

    let TFidfDict = {};

    for (const [key, value] of Object.entries(TFVals)){
        if (key in IDFVals){
            TFidfDict[key] = TFVals[key] * IDFVals[key];
        }
}
Enter fullscreen mode Exit fullscreen mode

The objects TF and IDF both stem from the same data, so TF isn’t going to contain something that isn’t in IDF. Because of this, we can iterate through one object and use the same key. We multiply the value in TFVals by the value from in IDFVals.

Our next step is to calculate the 3 most important sentences in our TF-IDF Object. Iterating over the [key, value] of the object with a couple of if statements works perfectly.

function TFIDF(documents){
    // calculates TF*IDF
    const TFVals = termFrequency(documents);
    const IDFVals = inverseDocumentFrequency(documents);

    let TFidfDict = {};

    for (const [key, value] of Object.entries(TFVals)){
        if (key in IDFVals){
            TFidfDict[key] = TFVals[key] * IDFVals[key];
        }
    }


    let max = 0.0;
    let max2 = 0.0;
    let max3 = 0.0;

    let max_sentence = "";
    let max2Sent = "";
    let max3Sent = "";


    // finds the top 3 sentences in TFidfDict
    for (const [key, value] of Object.entries(TFidfDict)){
        if (TFidfDict[key] > max){
            max = TFidfDict[key];
            max_sentence = key;
        }
        else if (TFidfDict[key] > max2 && TFidfDict[key] < max){
            max2 = TFidfDict[key];
            max2Sent = key;
        }
        else if (TFidfDict[key] > max3 && TFidfDict[key] < max2 && TFidfDict[key] < max){
            max3 = TFidfDict[key];
            max3Sent = key;
        }
    }
    return ("<br>" + "" + max_sentence + "<br><br>" + "" + max2Sent + "<br><br>" + "" + max3Sent);
}
Enter fullscreen mode Exit fullscreen mode

You’ll see at the bottom we return the formatted string. We format it so it looks nice when we insert it into the webpage. Each <br> is a line break, a space in the text. The black dots are bullet points. We’re now going to implement this algorithm into a Firefox extension. 🔥🦊

Getting & changing text in a BBC news article

Go to any BBC news article, right-click and press “inspect element”. You’ll see a nice box at the bottom of the screen. Use the element selector tool in the top left-hand corner and hover over the article. We can see that the whole article is encompassed within a CSS class of ‘story-body’.

If we go further in, we can see that all the actual text in the article is encompassed by paragraph tags, inside this CSS class.

We’re going to use JQuery to select the text.

// get all text from .story-body within p tags on a BBC news web article
let $article = $('.story-body').find('p').text();
Enter fullscreen mode Exit fullscreen mode

This line selects all the <p> tags within the story-body class. Now we want to get the text, we do this by applying the method .text().

We want to add our text to the top of the article. JQuery has a method called prepend which lets us prepend data to the top of an object.

// get all text from .story-body within p tags on a BBC news web article
let $article = $('.story-body').find('p').text();
// insert text into body of document
let insert = $('.story-body').prepend(TFIDF($article));
Enter fullscreen mode Exit fullscreen mode

And we’re done! We can now identify the most important sentences in a BBC News article and display them right at the top. Just time to turn it into an extension.

Firefox extension basics

Firefox extensions have 2 main parts. The Javascript you wrote and the manifest.json file which tells Mozilla what your extension does. We’re going over manifest.json now.

{
    "manifest_version": 2,
    "name": "TL;DR - Summary of BBC news articles",
    "version": "1.0",

    "description": "This extension creates a summary of BBC news articles using TF*IDF", 
    "content_scripts": [
      {
        "matches": ["*://www.bbc.co.uk/news/*"],
        "js": ["jquery.js", "tldr.js"]
      }
    ]

}
Enter fullscreen mode Exit fullscreen mode

manifest_version tells Firefox what version of manifest you are using. Name tells Firefox what the name of your extension is. Version tells Firefox what version number your extension is. These 3 are mandatory.

description tells Firefox what your extension does.

content_scripts tells Firefox what scripts to load when the URL matches what you have inputted. In order for the scripts you have specified to run, the current URL must match at least one of the URLs you have specified. You can use 2 special characters here:

  1. * ” Matches zero or more characters. In this instance, I don’t know whether the user will load HTTP or HTTPS so I have it step to load both. I also don’t know what exact article the user will look at, so I have it set to activate on any article.

  2. ? ” matches exactly one character.

The Mozilla Developer Network has a nice explanation of these:

For example: "*na?i" would match "illuminati" and "annunaki", but not "sagnarelli".

Since we’re going to be using jQuery, we’ll import the jQuery JS file as well into the website, before our script executes. You can grab the jQuery file from here. Copy and paste into a file named “jquery.js”.

Enter “about:debugging” into your Firefox URL to load this page:

From here, click “Load Temporarary Add-on…” and then click any of the files in the extension. Once you do, you should see this:

Mozilla have a nice article on the basics of Firefox extensions, here.

Now load any BBC news article to play with it!

Conclusion

You’ve now seen the awesome power of TF-IDF and a real-world application for it. This idea came to me because I have email anxiety. I get so nervous about reading emails that I wanted a quick summary of them to calm my thoughts. Alas, this is my first time ever writing Javascript. I started out on something easier like BBC news articles.

Here are some ways you can improve upon this code, if you so wish:

  • Dynamically select how many sentences in a summary you want. You can find out the average TF*IDF value in the whole article and anything over X you can include in the summary. This makes it so long articles are treated as equally as shorter articles.
  • Extending this to work on any other websites you wish.

Want to become a better developer? Sign up to my email list. Find out more here. You'll receive 7 one-a-day articles of my best content. No spam. No commitments. Unsubscribe anytime.


Top comments (0)