On any blog, it's common to link to related posts near the end of an article. Doing so keeps readers on your website by pointing them to another post they might be interested in, and it can help with SEO. For a long time, Jekyll has provided site.related_posts as a convenient way to link to related posts. Unfortunately, the default implementation just lists the ten most recent posts, which might not actually be closely related at all. Jekyll does offer a better implementation that uses Latent Semantic Indexing (LSI) via classifier-reborn. This plugin tries to populate related_posts with posts that are actually related, but it's difficult to install and doesn't always produce the best results.
I've been familiar with the problems and challenges of this approach since I updated classifier-reborn for Ruby 3 back in 2022. Classifier-reborn isn't too bad (it was useful enough to me that I bothered updating it), but I've wished for a long time that it were easier to use and produced better results. More recently, with the rapid growth of ChatGPT and LLMs, I've been wanting to try a personal project that could make use of modern AI. I read an interesting blog post on Hacker News about how embeddings are a good starting point, and it occurred to me that embeddings from OpenAI would be a great way to get better related post functionality into Jekyll! I decided to try building my own Jekyll plugin for related posts to see if AI would work well here, and I got some really great results!
OpenAI offers an Embeddings API that’s very easy to use. You provide some input text, and the API returns a vector embedding from OpenAI’s LLM. These vectors can be compared outside an LLM using simple vector similarity algorithms like cosine similarity. With a little searching, I found a SQLite plugin that extends SQLite with vector database functionality. This seemed like a great solution to me! I could cache vector embeddings from OpenAI in a small SQLite database, and use the same database (with the plugin) to perform a vector similarity search to find related posts!
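Cosine similarity itself is simple enough to sketch in a few lines of Ruby. (This is just an illustration of the math, not the plugin's actual code; in practice the SQLite plugin performs this comparison for you.)

```ruby
# Cosine similarity: the cosine of the angle between two vectors.
# 1.0 means the vectors point in the same direction (very similar text);
# values near 0 mean the embeddings are essentially unrelated.
def cosine_similarity(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  mag_a = Math.sqrt(a.sum { |x| x * x })
  mag_b = Math.sqrt(b.sum { |x| x * x })
  dot / (mag_a * mag_b)
end

# Toy 3-dimensional vectors (real OpenAI embeddings have many hundreds
# of dimensions, but the math is identical).
a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0] # same direction as a
cosine_similarity(a, b) # => 1.0 (within floating point error)
```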
Jekyll has a rich plugin ecosystem, and provides hooks that plugins can use to integrate with various steps in the build process. I designed my plugin as a generator plugin.
Generators run after Jekyll has made an inventory of the existing content, and before the site is generated.
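A generator is just a class that subclasses Jekyll::Generator and implements generate(site). Here's a rough sketch of the shape mine takes. The class and method names here are illustrative, not the plugin's actual code, and a small Jekyll stub is included so the snippet runs standalone (a real plugin would `require "jekyll"` instead).

```ruby
# Stub standing in for `require "jekyll"` so this sketch runs standalone.
module Jekyll
  class Generator; end
end

# Sketch of a generator plugin: Jekyll calls #generate after it has read
# all content, and before any pages are rendered.
class AiRelatedPostsGenerator < Jekyll::Generator
  def generate(site)
    site.posts.docs.each do |post|
      # 1. Ensure a cached embedding exists for this post (call the
      #    OpenAI Embeddings API only if it's missing from the cache).
      # 2. Run a vector similarity search against all cached embeddings.
      # 3. Expose the results to Liquid templates via the page data.
      post.data["ai_related_posts"] = related_posts_for(post)
    end
  end

  private

  # Placeholder: the real work (embedding cache + similarity search) goes here.
  def related_posts_for(post)
    []
  end
end
```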
When my plugin gets called during site generation, the first thing it does is
ensure that we’ve cached a vector embedding for every post. If any posts are
missing an embedding, we make a request to the OpenAI API to get it. Then, with
embeddings for all posts in our database, we perform a vector similarity search
for each post in SQLite, making this data available to use in the post itself
(via a Liquid template) as ai_related_posts
in the page data. The approach is
very simple, and turned out to work great!
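Because the results land in the page data, rendering them in a post layout is plain Liquid. The markup here is illustrative; only the ai_related_posts key comes from the plugin:

```liquid
{% if page.ai_related_posts %}
  <h2>Related Posts</h2>
  <ul>
    {% for post in page.ai_related_posts %}
      <li><a href="{{ post.url }}">{{ post.title }}</a></li>
    {% endfor %}
  </ul>
{% endif %}
```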
One of my biggest concerns when designing this plugin was accuracy. Could I design a solution that would produce better results than classifier-reborn?
I think I was successful. Let’s look at some examples from my own blog, mikekasberg.com.
Here’s an example from one of my recent blog posts, 3D Printing Map Figurines with GPS. The table below shows the related posts produced by each approach.
It seems obvious to me that the posts on the right are much better related posts. All the posts generated by ai_related_posts are about 3D printing. Very relevant! In contrast, classifier-reborn only produced one related post about 3D printing. I’m sure there’s some reason the LSI approach thought the posts on the left might be related, but they seem somewhat random!
Let’s look at another example, Home WiFi Upgrades: Adding an Access Point with Wired Backhaul.
These results are interesting because two out of the three results are the same (but in a different order), and I don’t think either set of results is bad. But I do think the results on the right are, again, definitely better than those on the left. The most closely related article to “Adding an Access Point with Wired Backhaul” is “How to Test and Optimize Your Home Wifi Coverage”, and the AI plugin got this right! I also think the article about installing coax cable is indeed the next most closely related article, and the AI plugin got this right too while classifier-reborn missed this completely!
With evidence like the above, it seems clear to me that my AI plugin produces good results – much more accurate than classifier-reborn, which I was previously using on my blog. I could find many other examples where the AI approach produced better results, but I think the examples above illustrate the point.
Another concern I had was performance. LSI is compute-intensive, but when it uses numerical libraries like Numo (a Ruby interface to LAPACK) it works fairly quickly. A Jekyll build on my machine using LSI with Numo averages about 3.5 seconds.
When I tested my AI plugin, the first Jekyll site build was very slow. But this was expected since it needed to fetch embeddings for every post for the first time. My blog currently has 84 posts, and this took 40 seconds (or about 0.5s per post). While not ideal, this is fine for a first run, and because we cache the embeddings the performance is much better after that. Any subsequent run takes about 4 seconds total. (Even with the cached embeddings, we perform a vector similarity search for each post on every build, for now.) So the performance isn’t faster than LSI, but it’s at least not noticeably slower. At only 4 seconds for a full site build with nearly 100 posts, I’m happy with the performance and it feels like a win to get better results than classifier-reborn in about the same amount of time!
Classifier-reborn is an open source plugin, so it's free to use. My AI related
posts plugin is also open source and free to use, but requires calling OpenAI
APIs, which aren’t free. Fortunately, since we only need to call the API once
per post and we cache the results, the costs are minimal. I paid $5 for OpenAI
API access to get off the free plan and get higher rate limits. It turns out I
might not have even needed to do this – I got embeddings for all 84 posts in my
blog for $0.00 in API fees, using 1,277 tokens on the text-embedding-3-small
model. So while you do need an API key, it doesn’t seem like cost will be a
prohibitive factor to using the AI plugin. For most blogs, you can get
embeddings for all your posts from the OpenAI API for a few pennies!
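To put "a few pennies" in perspective, here's the arithmetic for my blog. The price here is an assumption (the published text-embedding-3-small rate of $0.02 per million tokens when I checked; verify against OpenAI's current pricing page):

```ruby
# Rough cost estimate for embedding my whole blog.
# Assumption: text-embedding-3-small at $0.02 per 1M tokens (check
# OpenAI's pricing page for the current rate).
PRICE_PER_MILLION_TOKENS = 0.02
tokens_used = 1_277 # tokens reported for all 84 posts

cost = tokens_used / 1_000_000.0 * PRICE_PER_MILLION_TOKENS
# => about $0.000026, which rounds to $0.00 on the bill
```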
I’m excited that relatively new AI technologies allowed me to build a plugin, with relatively little code, that produces better related posts than the LSI plugin that’s been used with Jekyll for a long time. And I’m excited to already be using it to make the related posts on my own blog better!
The plugin is open source on GitHub, and I’d love to see others start using it. I’d also like to collaborate to make it better! While it already produces great results, I think there’s potential to make the results even better, and to add integrations with other models and APIs besides OpenAI. (The approach should work with any model that can produce an embedding vector.) It’s exciting to see advancements coming out rapidly in the AI field and to think about how we might use them in the future!
👋 Hi, I'm Mike! I'm a husband, I'm a father, and I'm a senior software engineer at Strava. I use Ubuntu Linux daily at work and at home. And I enjoy writing about Linux, open source, programming, 3D printing, tech, and other random topics. I'd love to have you follow me on X or LinkedIn to show your support and see when I write new content!