Basics of Natural Language Processing / #NLP and Python Implementation
The Bag of Words (BoW) model is a simplifying representation used in Natural Language Processing (NLP) and Information Retrieval (IR). In this model, a text such as a sentence or a whole document is represented as the multiset (the "bag") of its words: grammar and word order are disregarded, but the number of times each word occurs is kept. The model has also been applied in computer vision.
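To make the "multiset" idea concrete, here is a minimal sketch using Python's standard `collections.Counter`. The helper name `bag_of_words` and the two sample sentences are assumptions made for this illustration, and the tokenization (lowercasing, dropping periods, splitting on spaces) is deliberately crude.

```python
from collections import Counter

def bag_of_words(text: str) -> Counter:
    """Return the multiset (bag) of lowercase words in `text`."""
    # Crude tokenization: lowercase, drop periods, split on whitespace.
    words = text.lower().replace(".", "").split()
    return Counter(words)

a = bag_of_words("The dog bit the man.")
b = bag_of_words("The man bit the dog.")

print(a)       # word -> number of occurrences
print(a == b)  # the two bags are equal although the sentences differ in meaning
```

Both sentences produce the same bag, which shows what the model throws away: word order, and with it part of the meaning.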

Bag of Words is most commonly used in document classification, where the number of occurrences of each word serves as a feature for training the classifier.
Let’s look at an example.
Consider the three sentences below: –
- He is a good boy.
- She is a good girl.
- Boy and Girl are good.
In these sentences, words such as "is", "are" and "am" (and likewise "a", "he", "she" and "and") carry little information, so we leave them out. Next, we count the frequency of each remaining word:
Words | Frequency |
Good | 3 |
Boy | 2 |
Girl | 2 |
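The frequency table above can be reproduced with a few lines of Python. This is only a sketch: the `STOP_WORDS` set and the `tokenize` helper are assumptions chosen to match this example, not a general-purpose stop-word list.

```python
from collections import Counter

sentences = [
    "He is a good boy.",
    "She is a good girl.",
    "Boy and Girl are good.",
]

# Assumed stop-word list, covering only the words ignored in this example.
STOP_WORDS = {"he", "she", "is", "a", "and", "are"}

def tokenize(sentence: str) -> list[str]:
    """Lowercase, drop periods, split on whitespace, and remove stop words."""
    words = sentence.lower().replace(".", "").split()
    return [w for w in words if w not in STOP_WORDS]

# Count every remaining word across all three sentences.
frequency = Counter(word for s in sentences for word in tokenize(s))
print(frequency.most_common())
```

Lowercasing matters here: without it, "Boy" in the third sentence and "boy" in the first would be counted as two different words.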
Using this table, we vectorize each sentence, i.e. compute its feature values: –
Each row below corresponds to one of the sentences 1, 2 and 3 above, and each column records how many times that word occurs in the sentence.
Sentence | F1: Good | F2: Boy | F3: Girl |
1 | 1 | 1 | 0 |
2 | 1 | 0 | 1 |
3 | 1 | 1 | 1 |
After these feature calculations, we have a new table of numeric vectors that can be used as input for NLP data processing.
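The vectorization step can be sketched in plain Python as follows. The fixed `vocabulary` list (Good, Boy, Girl, in the column order F1 to F3) and the `vectorize` helper are assumptions made for this illustration.

```python
sentences = [
    "He is a good boy.",
    "She is a good girl.",
    "Boy and Girl are good.",
]

# Column order of the feature table: F1, F2, F3.
vocabulary = ["good", "boy", "girl"]

def vectorize(sentence: str, vocabulary: list[str]) -> list[int]:
    """Return one count per vocabulary word, in vocabulary order."""
    words = sentence.lower().replace(".", "").split()
    return [words.count(term) for term in vocabulary]

vectors = [vectorize(s, vocabulary) for s in sentences]
for number, vector in enumerate(vectors, start=1):
    print(number, vector)
```

Because only vocabulary words are counted, the stop words never need to be removed explicitly; they simply have no column.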
Machine Learning Application Implementation:
Python Implementation
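# Note: Tokenizer is deprecated in recent Keras releases; this import
# assumes a version that still ships keras.preprocessing.text.
from keras.preprocessing.text import Tokenizer

sentences = ["Abhishek Tyagi is a Mechatronics Engineering Student."]

def print_bow(sentences: list[str]) -> None:
    # Build the vocabulary (word -> integer index) from the input texts.
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(sentences)
    # Convert each text into a list of word indices.
    sequences = tokenizer.texts_to_sequences(sentences)
    word_index = tokenizer.word_index
    # Count how often each vocabulary word occurs in the first sentence.
    bow = {}
    for key in word_index:
        bow[key] = sequences[0].count(word_index[key])
    print(f"Bag of words for sentence 1:\n{bow}")
    print(f"We found {len(word_index)} unique tokens.")

print_bow(sentences)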
Applications/Advantages: –
- Useful for spam filtering.
- Handles small datasets easily.
- Easy to implement.
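As an illustration of the spam-filtering use case, here is a rough sketch of a classifier built on word counts. The article does not name a specific classifier, so multinomial Naive Bayes with add-one smoothing is used here as an assumption; the training messages, labels, and function names are all invented for this example.

```python
import math
from collections import Counter

# Hypothetical toy training data, invented for illustration only.
TRAIN = [
    ("win a free prize now", "spam"),
    ("free money win now", "spam"),
    ("meeting at noon tomorrow", "ham"),
    ("lunch tomorrow at noon", "ham"),
]

def train(examples):
    """Collect one bag of words per class, plus the number of messages per class."""
    word_counts = {}          # label -> Counter of words seen in that class
    class_counts = Counter()  # label -> number of training messages
    for text, label in examples:
        word_counts.setdefault(label, Counter()).update(text.lower().split())
        class_counts[label] += 1
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, class_counts, vocabulary

def predict(text, word_counts, class_counts, vocabulary):
    """Return the label with the highest log-probability for `text`."""
    total_docs = sum(class_counts.values())
    scores = {}
    for label, counts in word_counts.items():
        total_words = sum(counts.values())
        score = math.log(class_counts[label] / total_docs)
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen in training
            # Add-one smoothing avoids log(0) when a class never saw the word.
            score += math.log((counts[word] + 1) / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)

model = train(TRAIN)
print(predict("free prize tomorrow", *model))
print(predict("meeting at noon", *model))
```

With this toy data the first message is scored as spam and the second as ham. A real filter would need far more training data and a proper tokenizer.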
Disadvantages: –
- Can’t prioritize the important words.
- Every word has the same value.
- Scales poorly to large corpora: the vocabulary grows with the data, so the vectors become long and sparse.