What Is Embedding and What Can You Do with It


Word2vec, published by a team of Google researchers led by Tomas Mikolov, is now eight years old and remains a breakthrough technique in the natural language processing field. The team built the technique on the concept of word embedding, which it helped pioneer.

So far, the use of embedding has been applied to a wide range of analyses and has had a significant impact. Many concepts in machine learning or deep learning are built on top of one another, and the concept of embedding is no exception. Having a solid understanding of embedding will make learning many more advanced ML techniques much easier.

Therefore, in today’s blog, I’ll guide you through the following topics to help you gain a thorough understanding of the definition and applications of embeddings:

  1. Layman Explanation: A Task-Specific Dictionary
  2. Intuitive Explanation: Points That Walk
  3. Why We Want Embeddings
  4. The Various Applications of Embeddings
  5. Summary

Google's Machine Learning Crash Course offers this description of embedding: "An embedding is a relatively low-dimensional space into which you can translate high-dimensional vectors. Embeddings make it easier to do machine learning on large inputs like sparse vectors representing words. Ideally, an embedding captures some of the semantics of the input by placing semantically similar inputs close together in the embedding space. An embedding can be learned and reused across models."

That's fantastic! That is probably the most precise and succinct description of embedding I could find online. Nonetheless, it is still a little perplexing and vague. So, how can we explain it in layman's terms to someone with only a rudimentary understanding of machine learning?


Here is an example: when applied to a text mining project, embeddings can assist us in learning the semantic meaning of a word by studying what other words it often appears next to. Then we can produce a list of embeddings, which can be treated as a task-specific dictionary. If you want to learn more about a particular word in your corpus, go to the “dictionary” and look it up. However, instead of providing you with the human language definition, it will return a vector of numerical values to reflect its semantic meaning. Furthermore, the distance between those vectors measures the similarity and relationship between the terms in the project.

As the name word2vec ("word to vector") suggests, the model turns a word into a vector of numbers. In other words, an embedding is a string of numbers that serves as a unique identifier. We can use the embedding technique to assign a unique numerical ID to a word, an individual, a voice sound, an image, etc., in our research. Building on this idea, scientists have created many fascinating 2vec-style models to facilitate the machine learning process.
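To make the "dictionary lookup" idea concrete, here is a minimal sketch using the gensim library (assuming gensim 4.x; the three-sentence corpus is a toy placeholder, and a real project would train on far more text):

```python
from gensim.models import Word2Vec

# A toy corpus: each document is a list of tokens.
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "queen", "wears", "a", "crown"],
]

# Train a small skip-gram model; vector_size sets the embedding dimensionality.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, seed=42)

# Look up the "dictionary": a 50-number vector that stands in for the word "king".
print(model.wv["king"])

# The distance (here, cosine similarity) between two entries reflects how related they are.
print(model.wv.similarity("king", "queen"))
```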

Normally, embeddings are saved as vectors. A vector is "a quantity having direction as well as magnitude, especially as determining the position of one point in space relative to another." Since embeddings have direction and magnitude in the space, we can intuitively treat each vector as a road map for a walking point. Imagine that all the embeddings start from the same origin and then walk through the space following their directions and magnitudes. After the walk, each embedding arrives at a different endpoint, and neighboring endpoints are more similar to each other, so they should be classified into the same groups. Therefore, we tend to use embeddings for community detection or clustering with the help of cosine similarity, and they are commonly used in classification tasks as well.
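As a rough illustration of grouping those "endpoints," here is a small sketch with scikit-learn; the embeddings below are random placeholders, and in practice they would come from a trained model:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical embeddings: six items, each described by an 8-dimensional vector.
rng = np.random.default_rng(42)
embeddings = rng.normal(size=(6, 8))

# Pairwise cosine similarity: values near 1 mean two "walking points" ended up close together.
print(cosine_similarity(embeddings).round(2))

# Group nearby endpoints into clusters (a simple stand-in for community detection).
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)
print(labels)
```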

To illustrate the idea above, we can use dimensionality reduction techniques (PCA, UMAP, etc.) to shrink the high-dimensional vectors down to 2D/3D and draw the points on a plot. Here is a blog that shows specifically how to achieve that in a 3D space:

Visualize High-Dimensional Network Data with 3D 360-Degree-Animated Scatter Plot: use node2vec, networkx, PCA, seaborn, etc. to visualize high-dimensional network data (towardsdatascience.com)
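For the simpler 2D case, a minimal sketch might look like the following (the 64-dimensional embeddings here are random placeholders; the linked blog covers the full 3D animated version):

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical 64-dimensional embeddings for 200 items.
rng = np.random.default_rng(1)
embeddings = rng.normal(size=(200, 64))

# Project the vectors down to two dimensions so the endpoints can be drawn.
points_2d = PCA(n_components=2).fit_transform(embeddings)

plt.scatter(points_2d[:, 0], points_2d[:, 1], s=10)
plt.title("Embeddings projected to 2D with PCA")
plt.show()
```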

Here are several reasons why we want to include embeddings in our project:

First, current machine learning models still favor numerical values as inputs. Like math nerds, they quickly capture vital information when fed numbers, but they are slow with discrete, categorical variables. However, when working on computer vision, voice recognition, and the like, the data we collect for our targets/dependent variables is often not numerical at all. Converting such discrete, categorical variables to numbers helps with model fitting.


Second, it helps reduce dimensions. Someone may argue that one-hot encoding is how we already handle categorical variables. However, in today's data science world, it has proven much less effective than embeddings once the number of categories grows.

When dealing with a variable with four distinct types, we usually create four new dummy variables to cope with it, and that has worked well in the past.
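For reference, this is what that familiar step looks like with pandas (the column name product_type is an invented placeholder):

```python
import pandas as pd

# A categorical variable with four distinct types becomes four dummy columns.
df = pd.DataFrame({"product_type": ["A", "B", "C", "D", "A"]})
print(pd.get_dummies(df, columns=["product_type"]))
```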

Yet consider the following scenario: we are researching consumer feedback for three products, and we only have one variable for each observation, the review content. We can build a term-document matrix and then feed it to a classifier or some other algorithm. However, suppose we have 50 thousand reviews for each product and the total number of unique words in the corpus is one million. We end up with a matrix whose shape is (150K x 1M), a ridiculously large input for any model. That is when we need to bring in the idea of embedding.
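Here is a tiny sketch of how that matrix blows up; the three reviews are invented placeholders, but the shape logic is the same at scale:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three toy reviews standing in for the 150K-review corpus.
reviews = [
    "battery life is great and the screen is sharp",
    "screen is sharp but the battery drains fast",
    "shipping was slow and the box arrived damaged",
]

# Rows are reviews, columns are unique words; with 150K reviews and a
# 1M-word vocabulary this grows into the (150K x 1M) matrix described above.
matrix = CountVectorizer().fit_transform(reviews)
print(matrix.shape)
```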

Assume we reduce the dimensions to 15 (a 15-number ID for each product), take the average of the embeddings for each product, and then colorize them based on the numerical values; this is what we get:

(Figure: the averaged 15-dimensional embeddings for products A, B, and C, shown as a color-coded matrix.)

Even though no human language is presented, we can still perceive that customers' perceptions of products A and B are more similar to each other, while product C is perceived differently. And this matrix's shape is only (3 x 15).
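A rough sketch of how such a figure could be produced (the review embeddings here are random placeholders rather than learned ones):

```python
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# Placeholder review embeddings: 50 reviews per product, each a 15-dimensional vector.
rng = np.random.default_rng(0)
review_embeddings = {p: rng.normal(size=(50, 15)) for p in ["A", "B", "C"]}

# Average each product's review embeddings into a single 15-number ID, giving a (3 x 15) matrix.
product_matrix = np.vstack([review_embeddings[p].mean(axis=0) for p in ["A", "B", "C"]])

# Colorize the matrix so the three products can be compared at a glance.
sns.heatmap(product_matrix, yticklabels=["A", "B", "C"], cmap="coolwarm")
plt.show()
```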

Google's Machine Learning Crash Course discusses another example, using embeddings for a movie recommender system: Embeddings: Categorical Input Data.
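As a loose sketch of the general idea (not the Crash Course's implementation), an embedding layer in Keras might be wired up like this, with all the numbers below invented for illustration:

```python
import numpy as np
import tensorflow as tf

# Hypothetical setup: 1,000 movies, each assigned a 15-dimensional embedding.
num_movies, embedding_dim = 1_000, 15

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=num_movies, output_dim=embedding_dim),
    tf.keras.layers.GlobalAveragePooling1D(),        # average the embeddings of the movies a user watched
    tf.keras.layers.Dense(1, activation="sigmoid"),  # predict, e.g., whether the user will like a candidate
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# Fake training data: 32 users, each described by 10 watched-movie IDs, plus a 0/1 label.
watched = np.random.randint(0, num_movies, size=(32, 10))
liked = np.random.randint(0, 2, size=(32, 1))
model.fit(watched, liked, epochs=1, verbose=0)

# The learned embedding matrix: one 15-number vector per movie.
print(model.layers[0].get_weights()[0].shape)  # (1000, 15)
```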

The third reason is to reduce complexity, which is in a sense an extension of the second. Embeddings can also help translate very complex information into a vector of numbers. Here is an example from social network analysis:

(Figure: a 14-node social network graph and the node embeddings derived from it.)

Initially, we collected the data from social media and converted it into a social network analysis graph. In the graph, we can use the distance between the nodes and the color of the ties to interpret the similarities between the nodes. However, it is complicated and hard to read. Right now, we only have 14 nodes in the graph, and it is already a mess. Can you imagine what would happen if we were to investigate 100 nodes? This is referred to as complex (high-dimensional) data. However, by using certain techniques to aid in dimensionality reduction, we can transform the graph into a list of embeddings. As a result, instead of the jumbled graph, we now have a new, clean “dictionary” for the nodes. We can use the “dictionary” to make a human-readable visualization.
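Here is a minimal sketch using the node2vec package mentioned later in this post, together with networkx; the built-in karate-club graph is just a stand-in for real social media data:

```python
import networkx as nx
from node2vec import Node2Vec  # pip install node2vec

# A small built-in graph standing in for the social network.
graph = nx.karate_club_graph()

# Learn a 16-dimensional embedding for every node via biased random walks.
node2vec = Node2Vec(graph, dimensions=16, walk_length=10, num_walks=50)
model = node2vec.fit(window=5, min_count=1)

# The resulting "dictionary": look up a node's vector or its nearest neighbors.
print(model.wv["0"])
print(model.wv.most_similar("0"))
```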

Now that you have discovered what embedding is and why we want it, you are probably excited to see some applications in practice. So I selected a list of interesting applications of the idea of embedding, along with some related literature and usage demonstrations.

Natural Language Processing: word2vec, sent2vec, doc2vec

Image Analysis: Img2vec

Social Network Analysis: node2vec

In the context of machine learning, an embedding functions as a task-specific dictionary. It encodes our targets as a series of numbers that serves as a unique ID. We like to use embeddings because they help transform discrete, categorical variables into model-readable data, and they also help reduce the data's dimensionality and complexity. I also listed several selected 2vec-style models.

Later this month, I will probably write another article demonstrating one of the usages of embedding as well. The algorithms above are only a small sample of the 2vec-style models out there. If you are interested, I think this website may help you.

Please feel free to connect with me on LinkedIn.


