Why do we need word2vec when we already have the Bag of Words and TF-IDF algorithms in NLP?
Word2vec gives us a more compact representation of text and, more importantly, a better interpretation of it: it learns from the context in which each word appears in the dataset and stores that semantic information in the word's representation.
Bag of Words and TF-IDF lack this ability to interpret meaning in Natural Language Processing (NLP).
Points to ponder: –
- Neither Bag of Words nor TF-IDF can store the semantic information of the given dataset. In addition, TF-IDF sometimes gives importance to uncommon words: words that are not in the stop-word list but that may still behave like stop words in a particular figure of speech or part of speech.
- Both representations are sparse and high-dimensional (one dimension per vocabulary word), which increases the risk of overfitting in NLP models.
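To make the first point concrete, here is a minimal sketch of a Bag of Words comparison. The three sentences are made up for illustration, and the helper names `bow_vector` and `dot` are hypothetical, not part of any library.

```python
from collections import Counter

def bow_vector(sentence, vocabulary):
    """Return the word counts of `sentence`, one entry per vocabulary word."""
    counts = Counter(sentence.lower().split())
    return [counts[word] for word in vocabulary]

def dot(u, v):
    """Dot product; for count vectors it grows only with shared words."""
    return sum(a * b for a, b in zip(u, v))

# Made-up example sentences (not from any real dataset).
sentences = [
    "the movie was good",
    "the film was great",
    "the table was wooden",
]
vocabulary = sorted({word for s in sentences for word in s.lower().split()})
vectors = [bow_vector(s, vocabulary) for s in sentences]

# Sentences 0 and 1 mean nearly the same thing; sentences 0 and 2 do not.
overlap_similar_meaning = dot(vectors[0], vectors[1])
overlap_unrelated = dot(vectors[0], vectors[2])
print(overlap_similar_meaning, overlap_unrelated)
```

Both scores come out equal, because the only words the sentences share are "the" and "was". Bag of Words gives every word its own independent dimension, so it has no way to know that "movie" is closer to "film" than to "table".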
To overcome these problems, word2vec was introduced. The word2vec algorithm uses a shallow, two-layer neural network (not a recurrent network) to learn word associations from a large corpus of text. Once trained, such a model can detect synonymous words or suggest additional words for a partial sentence. As the name implies, word2vec represents each distinct word with a particular list of numbers called a vector. The vectors are chosen such that a simple mathematical function (the cosine similarity between the vectors) indicates the level of semantic similarity between the words represented by those vectors.
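The cosine similarity mentioned above can be sketched in a few lines. The three-dimensional vectors below are invented purely for illustration; real word2vec vectors are learned from a corpus and have many more dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings, NOT taken from a trained model.
toy_vectors = {
    "man":   [0.9, 0.1, 0.3],
    "woman": [0.8, 0.2, 0.4],
    "table": [0.1, 0.9, 0.0],
}

sim_man_woman = cosine_similarity(toy_vectors["man"], toy_vectors["woman"])
sim_man_table = cosine_similarity(toy_vectors["man"], toy_vectors["table"])
print(f"man/woman: {sim_man_woman:.3f}")
print(f"man/table: {sim_man_table:.3f}")
```

A value close to 1 means the two vectors point in nearly the same direction, while a value close to 0 means they are nearly unrelated. With these toy vectors, "man" and "woman" score much higher than "man" and "table".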
The solution: –
- In this model, each word is represented as a vector of 32 or more dimensions, instead of a single number as in the previous algorithms.
- The semantic information and the relations between different words are also preserved.
- In a graphical representation of the model, every word is plotted as a vector with its own value on Dimension 1 (x-axis) and Dimension 2 (y-axis). Because every word has its own vector, words that are likely to be similar lie close to each other, which gives us a clear picture of which words in the data are related in nature, quality or some other way. For example, "man" and "woman" both refer to humans, so in such a plot they are likely to be much closer to each other than to unrelated words such as "table", "chair" or the name of an animal. Representing every word as a vector in this way is what gives word2vec its advantage over the other two algorithms. This is the semantic information that we need to store in order to predict the output the user wants from the dataset.
- Implementation: – Python3

    # Note: recent gensim versions load pretrained vectors through KeyedVectors;
    # the older Word2Vec.load_word2vec_format call and its norm_only argument are no longer available.
    from gensim.models import KeyedVectors

    # Load the pretrained Google News vectors (the .bin file must already be downloaded)
    model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
    # The model is loaded. It can be used to perform all of the tasks mentioned above.

    # Get the word vector of a word
    dog = model['dog']

    # Perform the king/queen analogy: king - man + woman
    print(model.most_similar(positive=['woman', 'king'], negative=['man']))

    # Pick the odd one out
    print(model.doesnt_match("breakfast cereal dinner lunch".split()))

    # Print the similarity score
    print(model.similarity('woman', 'man'))
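To show the idea behind the king/queen analogy without downloading the pretrained model, here is a simplified sketch using hand-made two-dimensional vectors. The vectors, the meaning assigned to each dimension, and the `analogy` helper are all assumptions made for this example. This is not gensim's actual implementation, which differs in details such as how input vectors are normalized and combined.

```python
import math

def normalize(v):
    """Scale v to unit length so that a dot product equals cosine similarity."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

def analogy(vectors, positive, negative):
    """Return the word closest to sum(positive) - sum(negative),
    ignoring the query words themselves."""
    dims = len(next(iter(vectors.values())))
    target = [0.0] * dims
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    target = normalize(target)
    excluded = set(positive) | set(negative)
    scores = {
        word: sum(a * b for a, b in zip(target, normalize(vec)))
        for word, vec in vectors.items()
        if word not in excluded
    }
    return max(scores, key=scores.get)

# Hypothetical toy vectors: dimension 0 ~ "royalty", dimension 1 ~ "maleness".
toy_vectors = {
    "king":   [0.9, 0.9],
    "queen":  [0.9, 0.1],
    "man":    [0.1, 0.9],
    "woman":  [0.1, 0.1],
    "prince": [0.8, 0.8],
}

best = analogy(toy_vectors, positive=["woman", "king"], negative=["man"])
print(best)
```

With these toy values, subtracting "man" from "king" removes the "maleness" component and adding "woman" leaves a vector that points in the same direction as "queen". A trained model works on the same principle, only with learned vectors in hundreds of dimensions.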