Introduction to Embedding and GPT

In the vast realm of artificial intelligence (AI) and machine learning, a foundational truth stands firm: these systems thrive on numbers. While we live in a world filled with rich nuances and myriad forms of data, at their core, AI models recognise and understand numeric representations.

Rewinding to 2019, during my pursuit of “100 Days of Machine Learning,” I found that a staggering 80% of the effort was invested not in modeling, but in data wrangling – sifting through, cleaning, and transforming raw data into a structured format ready for machine learning consumption.

So, how do we bridge the gap between the varied forms of organisational data and the number-crunching prowess of AI models? Enter the concept of “embedding.” This post will delve into the intricate world of embeddings and how they can be harnessed to unlock the capabilities of advanced models like GPT, tailored specifically for an organisation’s unique data.

Understanding Neural Networks

First, what is a neural network? It is a computational model inspired by the structure of the human brain: interconnected nodes (analogous to neurons) that process information in layers. Neural networks are a subset of machine learning (ML), which is itself a branch of artificial intelligence (AI).

In the AI/ML landscape, neural networks are especially popular for tasks like image and speech recognition. When they are applied to language tasks, such as translation or sentiment analysis, the work falls under the domain of Natural Language Processing (NLP).

Demystifying Embeddings

Embedding is a technique, closely associated with NLP, in which words or phrases are converted into vectors of numbers, allowing a computer to understand and process language. By representing words as vectors, we can capture their meaning and their relationships with other words. This is crucial for tasks like text analysis, translation, and sentiment detection: embeddings make it possible for machines to grasp the nuances of human language.

Embedding is mapping discrete or categorical variables to a vector of continuous numbers …

What does that actually mean, you ask?! 🙂

Well, a discrete or categorical variable can take on one of a limited set of values, like days of the week or colours of a rainbow. It’s like choosing a flavour at an ice cream shop; you pick one specific option from the available choices.

And “a vector of continuous numbers”?!

I have got you … 🙂

Think of a row of water bottles lined up on a table. Each bottle can be filled to any level, not just full or empty. One might be half-full, another a quarter full, and another filled almost to the top. A vector of continuous numbers is like a list of the water levels in each bottle. It captures the exact amount in each bottle, even if it’s somewhere in between full and empty.

Ok, so we have a collection of numbers but what does that mean with regard to embedding?

Imagine you visit a vast library with millions of books. Each book represents a unique category or topic. Now, if you had to describe each book’s content, it would be a lengthy process. Instead, what if you could summarise each book with a small set of keywords or a short summary? That would make things much more manageable!

So, each book represents a discrete or categorical variable. The set of keywords or the short summary for each book represents the embedding – a simpler, condensed representation of the book. Instead of describing the entire book, you use a vector (a list) of continuous numbers (the keywords or summary) to represent it; a book’s genre is a simple example of such a summary.

When we talk about neural networks, summarising each book is like training a neural network to understand and represent each category meaningfully.

Doing this reduces the library’s vastness (or dimensionality) to just a set of keywords or summaries, making it easier to work with and draw insights from.

In essence, embeddings help us take something vast and complex (like the content of a book) and represent it in a simpler, more digestible format.

OK, hopefully you are still with me! 🙂

Embeddings in Practice

So, how do embeddings help us? Here is an example of a shallow (i.e. simple) embedding.

Say I have three words:

  • apple
  • pear
  • hammer

I then convert each of these into an embedding. Next, I supply a fourth word to see which of the above are similar to it. Let’s take “food”, for example. Through some maths (cosine similarity – the details aren’t important at this stage, just for interest), we can determine which words are similar to “food” by comparing their vectors, and it would return “apple” and “pear”. Similarly, if I asked for “fruit” I would get the same result, but if I asked for “tool” I would only get “hammer” back.
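The idea above can be sketched in a few lines of Python. The three-dimensional vectors below are made up for illustration (real embeddings have hundreds or thousands of dimensions), but the cosine-similarity maths is the real thing:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical, hand-made 3-dimensional "embeddings" for illustration only.
embeddings = {
    "apple":  [0.9, 0.8, 0.1],
    "pear":   [0.8, 0.9, 0.2],
    "hammer": [0.1, 0.1, 0.9],
}

query = [0.9, 0.7, 0.1]  # pretend this is the embedding for "food"

# Rank the words by similarity to the query vector, most similar first.
ranked = sorted(embeddings, key=lambda w: cosine_similarity(embeddings[w], query), reverse=True)
```

With these toy vectors, `ranked` puts “apple” and “pear” ahead of “hammer”, which is exactly the behaviour described above.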

So now let’s imagine a business with a large number of documents – support documentation, policies, compliance material and so on. You want your team to be able to ask general questions about a topic that this documentation covers and get a meaningful, “correct” answer back without having to trawl through every document… here enter embeddings.

The concept is exactly the same: we “chunk up” the documents, let’s say into 200-word blocks, and turn them into embeddings. Then, a team member can ask a question. We turn the question into an embedding and then find the “chunks” similar to the question.
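The “chunking” step is simple in practice. A minimal sketch (splitting on whitespace; real pipelines often split more carefully, e.g. on sentence or paragraph boundaries):

```python
def chunk_words(text: str, size: int = 200) -> list[str]:
    """Split a document into blocks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

# Example: a 450-word document becomes chunks of 200, 200 and 50 words.
chunks = chunk_words("word " * 450)
```

Each chunk is then sent off to be embedded, and the resulting vectors are stored alongside the original text so the matching chunks can be retrieved later.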

Visualizing Embeddings

Because embeddings are vectors, we are able to visualise them. Below is such a visualisation:

Each dot on the graph represents a single “chunk” (200 words) of embedded data. The question was “Give me a detailed overview of your DR and BCP policies”. The RED dot indicates the question’s embedding. The BLUE dots represent the chunks most similar to the question.

Now sit and soak in that for a second …

What this means is that people can ask a question and get the “correct” answer back! Now, I can hear the objections flying in: “they are just chunks of info” and “they aren’t useful in that form”!

If that were it, you would be correct.

So, what has been released in the last 6-12 months that has been the catalyst for so much innovation? Large Language Models, and GPT in particular.

OpenAI has a very easy-to-use Embeddings API endpoint to create the embeddings for you. You can then use GPT-4 to formulate a response.

A Glimpse into the Technical Side

So, to finish up, here is a basic prompt I have been playing with that has helped format the response from GPT-4.

First, I set the context; this helps to “tune” GPT by telling it who to “act as” and how you want the response formatted:
You are a senior engineer and customer support agent for <our product>
You are an expert at interpreting documentation and giving helpful answers.
Your answers are always super detailed but clear.
You use markdown to answer in a well-structured and easy-to-read way.
You are great at interpreting the actual problem the user is trying to solve.
You quote, in double quotes, the relevant parts of the documentation in the answer where necessary.
You first answer the question, then proceed with an explanation.
I then inject the following:
Below are extracts from RedEye’s documentation that will help formulate the response:
<chunk_1>
<chunk_2>
<chunk_3>
<chunk_4>
<chunk_5>
<chunk_6>
<chunk_7>
<chunk_8>
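Assembling the context, the retrieved chunks and the user’s question into a single request can be sketched like this (the message structure follows OpenAI’s chat format; `build_prompt` and its parameters are illustrative names of my own):

```python
def build_prompt(system_context: str, chunks: list[str], question: str) -> list[dict]:
    """Assemble chat messages: the 'act as' context, the retrieved
    documentation chunks, and the user's question."""
    extracts = "\n\n".join(chunks)
    return [
        {"role": "system", "content": system_context},
        {
            "role": "user",
            "content": (
                "Below are extracts from our documentation that will help "
                f"formulate the response:\n{extracts}\n\nQuestion: {question}"
            ),
        },
    ]

messages = build_prompt(
    "You are a senior engineer and customer support agent.",
    ["<chunk_1>", "<chunk_2>"],
    "Give me a detailed overview of your DR and BCP policies",
)
```

The `messages` list is then sent to the chat model, which uses only the supplied extracts to formulate its answer.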

Here is the response:

Conclusion and Looking Ahead

In the coming weeks, I will share a little more on HOW this is done technically, but hopefully this has piqued your interest, and you can see the huge potential this could have for just about every business. There are already fully fledged services that do this for you out of the box, e.g. https://www.chatbase.co/ (disclaimer: I haven’t tested this one). I was interested in building one myself so I could see the end-to-end process.
Stay tuned for the next article from my learnings on this 100 days of AI journey. 🙂