How to Generate Simple Document Embeddings with Python


Document embeddings are one of the simplest ways to give machines an understanding of text, and in our previous article, Document Embeddings Explained: A Guide for Beginners, we explored how they turn entire documents into dense numerical vectors that capture meaning and context. Now that you understand what embeddings are and why they’re useful for tasks like semantic search, classification, and clustering, this tutorial will show you how to generate them in practice using Python. Whether you’re working with short paragraphs, long articles, or a collection of documents, the steps in this guide will help you create embeddings that you can analyze, compare, and use in real applications.



This guide will walk you through the process of generating document embeddings using Python and a pretrained transformer model. Even if you’ve never worked with embeddings before, you’ll see how easy it is to convert text into meaningful numerical vectors that can be used for search, similarity comparison, clustering, and many other AI applications.

Objectives

By the end of this article, you will be able to:

  • Understand how to generate document embeddings using Python.
  • Install and use the SentenceTransformers library to create text embeddings.
  • Load a pretrained transformer model suitable for document-level representation.
  • Convert sentences, paragraphs, or full documents into numerical vector formats.
  • Compute semantic similarity between documents using cosine similarity.
  • Apply embeddings to simple, practical tasks such as comparing text similarity.

Prerequisites

Before we dive into the guide on how to generate simple document embeddings with Python, make sure you have the following ready:

  • We will be using Google Colab as our development environment for this guide. If you don’t know how to set it up, you can follow the setup steps in our previous article: Data Preprocessing Guide for Beginners in ML.
  • You need to install the necessary libraries for this guide. Running this command in a Google Colab cell will install them.
!pip install sentence-transformers scikit-learn

Generating the Document Embeddings

Importing the libraries

After installing, you can immediately import the libraries in the next cell. 

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

Loading the pretrained model

Now that the required libraries are imported, the next step is to load a pretrained transformer model. In this tutorial, we use all-MiniLM-L6-v2, a lightweight model from the SentenceTransformers library. This model has been trained to generate dense vector representations of text that capture semantic meaning, making it ideal for tasks like semantic similarity, clustering, and search.

# Load the pretrained embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Encode a sample sentence and check the shape of the resulting vector
sample_text = "Artificial intelligence is transforming businesses."
embedding = model.encode(sample_text)
print(embedding.shape)

After you run the code, you should see the shape of the embedding vector. For all-MiniLM-L6-v2, this is a 384-dimensional vector:

(384,)

Preparing your data

Now that the model is loaded and working, the next step is to create a small sentence dataset to practice on. You can write your own list of sentences or copy the following code.

# Example documents
documents = [
    "Artificial intelligence is transforming businesses.",
    "Machine learning allows computers to learn from data.",
    "Python is a popular programming language for AI applications.",
    "Deep learning models require large amounts of data.",
    "AI can help automate repetitive tasks and improve efficiency."
]

Generating the embeddings

With our simple dataset ready, we can now generate the document embeddings. The following code encodes the documents with the pretrained model and prints the shape of the resulting embeddings array.

# Generate embeddings for all documents
embeddings = model.encode(documents)
# Check the shape of the embeddings array
print("Embeddings shape:", embeddings.shape)

The code should give you this output.

Embeddings shape: (5, 384)

Here, 5 is the number of documents and 384 is the dimensionality of each embedding vector produced by all-MiniLM-L6-v2.

We can also inspect the first embedding vector with this code. It should output an array of 384 floating-point values.

print("First embeddings vector: ", embeddings[0])

Comparing similarity

To measure how semantically similar the documents are to one another, we first compute the cosine similarity between every pair of embeddings with this code.

# Compute cosine similarity between all embeddings
similarity_matrix = cosine_similarity(embeddings)
# Print the similarity matrix
print(similarity_matrix)

It should give you this matrix as output. Each value in the similarity matrix shows how closely two sentences are related in meaning, with 1 meaning identical, 0 meaning unrelated, and values in between indicating partial similarity.

[[1. 0.527647 0.3797139 0.31519628 0.44177282]
 [0.527647 1.0000001 0.3662578 0.4742912 0.46388942]
 [0.3797139 0.3662578 1. 0.19788864 0.42667586]
 [0.31519628 0.4742912 0.19788864 0.9999999 0.26827574]
 [0.44177282 0.46388942 0.42667586 0.26827574 1.0000001]]
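
Reading a raw matrix gets harder as the number of documents grows. As an optional sketch, assuming pandas is available (it comes preinstalled in Google Colab), you can label the rows and columns so each score is easy to trace back to its sentence pair:

import pandas as pd

# Label rows and columns with document indices for readability
labels = [f"doc{i}" for i in range(len(documents))]
print(pd.DataFrame(similarity_matrix, index=labels, columns=labels).round(2))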

From this matrix, we can extract the most similar pair of sentences using this code.

import numpy as np
# Set diagonal to -1 so we ignore similarity of a sentence with itself
np.fill_diagonal(similarity_matrix, -1)
# Find the index of the maximum similarity
most_similar_idx = np.unravel_index(np.argmax(similarity_matrix), similarity_matrix.shape)
# Print the most similar pair
i, j = most_similar_idx
print("Most similar sentences:")
print("Sentence 1:", documents[i])
print("Sentence 2:", documents[j])
print("Similarity score:", similarity_matrix[i][j])
 
It should give you this output, showing the most similar sentence pair and their similarity score. The score ranges from -1 to 1, where 1 means the sentences are very similar in meaning, -1 means they are opposite, 0 means they are unrelated, and values in between indicate partial similarity.

Most similar sentences:
Sentence 1: Artificial intelligence is transforming businesses.
Sentence 2: Machine learning allows computers to learn from data.
Similarity score: 0.527647

Summary

In this tutorial, we generated document embeddings using Python and a pretrained transformer model. We converted sentences into numerical vectors, computed cosine similarity to measure how closely they relate, and identified the most similar sentence pairs.

These steps form the foundation for tasks like semantic search, clustering, and text analysis, giving you a practical starting point to work with embeddings on your own data.
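
For example, the same pieces can power a tiny semantic search. The sketch below reuses the model, documents, and embeddings variables defined earlier in this tutorial; the query string is just an illustrative example. It encodes the query with the same model and ranks the documents by cosine similarity:

# Encode a search query with the same model used for the documents
query = "How do computers learn from examples?"
query_embedding = model.encode([query])

# Rank documents by cosine similarity to the query
scores = cosine_similarity(query_embedding, embeddings)[0]
best = scores.argmax()
print("Best match:", documents[best])
print("Similarity score:", scores[best])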

A copy of the Google Colab script can be accessed through this link.

Next Steps

Now that you understand how to generate document embeddings and measure similarity between sentences, you can explore more advanced applications in our related articles.


Written by: Jaime Lucero

Jaime is a Bachelor of Science in Computer Science student, majoring in Data Science, at the University of Southeastern Philippines. His journey is driven by the goal of becoming a developer specializing in machine learning and AI-driven solutions that create meaningful impact.
