Today there is a rush to throw AI at every problem. There are certainly use cases where LLMs are useful, and they can accelerate development in many cases. However, the economics get tricky if you rely on them too heavily: every process that calls an LLM adds to the cost of your application, and as your application scales, so do the costs.
When I initially built my Bluesky feeds, an LLM was a convenient way to disambiguate the meaning of posts and classify them into the correct feed. I didn't need to invest time in collecting data for a model. With a little prompting I had a classifier with over 90% accuracy up and running in a few hours. This was perfect for a hobby project.
As I've added more feeds I replicated this pattern. Each feed increased the volume of posts going to the LLM classifier, and the costs started adding up. For a hobby project whose main benefit is learning, with no real prospect of a financial return, the increasing costs were a threat to the long-term sustainability of the project. So I asked: do I really need an LLM?
Do I really need an LLM for this?
Absolutely not. At least not to the extent I was using one. Basically, I was sending every post that contained a related keyword to the LLM classifier to determine whether the post was actually about the feed topic. With the NFL and CFB seasons starting, the daily LLM costs were becoming unsustainable. I needed to reduce the LLM spend associated with the increased volume.
I started reviewing the posts that were being sent to the LLM classifier. What I discovered was that most posts going to the classifier are off-topic noise. For example, for the keyword "Georgia" I only care if it's used in the context of the Georgia Bulldogs college football team. In practice only a small number of posts mentioning "Georgia" are about the Bulldogs; the majority are about the state in general and its politics (this is Bluesky, after all!).
Based on this observation I began thinking about less expensive alternatives that don't have to perform the entire classification task, but could clear out the low-hanging fruit so that only the more ambiguous posts are sent to the LLM classifier.
Detecting obvious off-topic posts
While designing this filter I had three goals: it must be cheaper than an LLM call, maintain similar accuracy, and fit within the existing infrastructure (adding a new service negates the cost savings of reducing LLM calls).
To keep things simple I decided to build a logistic regression classifier that separates posts worth sending to the LLM from those that most definitely are not. Since social media posts don't contain a lot of text, I used embeddings as the features to train the model against, capturing more of the semantic meaning of the text. Embeddings also have the benefit of reducing the dimensionality of the input features: instead of a sparse feature space with thousands of features (words), the embeddings give a dense representation of only 1,536 features, generated by the OpenAI text-embedding-3-small model.
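For reference, here's a minimal sketch of how those embedding features might be generated with the OpenAI Python client; the helper name and batching are my own, not taken from the actual pipeline.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed_posts(texts: list[str]) -> np.ndarray:
    """Return an (n_posts, 1536) matrix of text-embedding-3-small vectors."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in resp.data])
```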
For labeling the data I used Claude Code (waaaaay better than Cursor) to build a simple annotation CLI tool. The tool loaded posts from the database so they could be labeled with a true/false flag indicating whether the post belongs in the feed. About 2,000 posts were labeled to train the model.
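The tool itself doesn't need to be fancy; something along these lines works, though the database schema here is a guess rather than the real one:

```python
import sqlite3

conn = sqlite3.connect("feeds.db")  # hypothetical database and schema
rows = conn.execute(
    "SELECT id, text FROM posts WHERE label IS NULL LIMIT 100"
).fetchall()

for post_id, text in rows:
    print("\n" + text)
    answer = input("On-topic? [y/n/q] ").strip().lower()
    if answer == "q":
        break
    # Store 1 for on-topic, 0 for off-topic
    conn.execute(
        "UPDATE posts SET label = ? WHERE id = ?",
        (1 if answer == "y" else 0, post_id),
    )
    conn.commit()
```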
The training pipeline generated embeddings for the labeled posts using the OpenAI embeddings endpoint to use as the input features. Since the data was imbalanced (far more off-topic posts than on-topic posts), the class weights were adjusted to account for this, and a grid search was performed to find the optimal hyperparameters. The model was evaluated against a holdout dataset.
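A sketch of that training step, assuming scikit-learn's logistic regression with balanced class weights and a small grid search over the regularization strength:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# X: embeddings of the labeled posts (e.g. from embed_posts above)
# y: numpy array of labels, 1 = on-topic, 0 = off-topic
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

grid = GridSearchCV(
    LogisticRegression(class_weight="balanced", max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1, 10]},
    scoring="average_precision",
    cv=5,
)
grid.fit(X_train, y_train)
model = grid.best_estimator_
```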
After the model was trained, a threshold analysis was performed to find the cutoff that filters out as many off-topic posts as possible while losing as few on-topic posts as possible. I found that a majority of off-topic posts had extremely low probability estimates from the model. Setting a threshold of 0.05 captured 90% of the off-topic posts, and nearly 99% of posts below the threshold were off-topic. While a small number of on-topic posts were filtered out, the cost savings were worth it.
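The threshold analysis amounts to sweeping a cutoff over the holdout probabilities and checking, for each value, how many off-topic posts fall below it and how pure that filtered set is. Roughly:

```python
import numpy as np

probs = model.predict_proba(X_test)[:, 1]   # P(on-topic) for the holdout set
off_topic = (np.asarray(y_test) == 0)

for threshold in [0.01, 0.02, 0.05, 0.10, 0.20]:
    below = probs < threshold
    captured = (below & off_topic).sum() / off_topic.sum()
    purity = off_topic[below].mean() if below.any() else 0.0
    print(f"threshold {threshold:.2f}: "
          f"{captured:.1%} of off-topic posts filtered, "
          f"{purity:.1%} of filtered posts are off-topic")
```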
Outcomes
Models were trained separately for the NFL and CFB feeds. The model pipeline was updated to generate embeddings with the OpenAI endpoint, and the logistic regression models were loaded into the existing service. The models fit comfortably into the existing service infrastructure, so there was no need to provision additional services. While the OpenAI embedding API does incur a cost, it is significantly cheaper than LLM calls: a call to the gpt-4o-mini model costs $0.15 per 1M input tokens, while the embedding call costs only $0.02 per 1M tokens. Plus, the LLM prompt contains additional tokens for the classification instructions, which don't need to be included with the embedding call.
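In the serving path this boils down to a small gate in front of the LLM call; a sketch reusing the hypothetical helper, model, and threshold from above:

```python
THRESHOLD = 0.05  # posts below this probability are dropped without an LLM call

def needs_llm(post_text: str) -> bool:
    """Return False when the cheap model is confident the post is off-topic."""
    features = embed_posts([post_text])
    p_on_topic = model.predict_proba(features)[0, 1]
    return p_on_topic >= THRESHOLD
```

Everything above the threshold still goes to the LLM, so the expensive call is reserved for the genuinely ambiguous posts.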
After a few days of running the models against production traffic, the number of LLM calls has been reduced by more than 80% for the two feeds the models were trained for, with no noticeable degradation in quality. The cost savings have been significant and make the project much more sustainable.
What's Next
Using the classifier to filter out obviously off-topic posts has been a great first step. It solved the immediate problem of keeping LLM costs in check as post volume rises with the start of football season. Next, this approach will be rolled out to the other feeds. The volume will be lower, but the cost savings will still be meaningful.
The end goal is to further reduce the reliance on LLMs in the classification phase: expand the modeling so that custom models perform the classification and LLMs are reserved for only the most ambiguous posts.