Since joining Bluesky, custom feeds have been one of my favorite features. They are my preferred way to browse topics I'm interested in, like woodworking, basketball, AI, and random pictures people share of #GoodLunch, without having to search for specific people to follow.
With custom feeds, developers can curate experiences for other users on the platform. Feeds can be simple, like following specific hashtags, keywords, or a curated set of accounts. Feeds can also be more complex, using ML models to decide which skeets to include. Lots of freedom to do what you want!
One topic I like to follow is the NBA. The feeds available on Bluesky all either track specific hashtags (e.g. #NBA or #NBASky) or are a curated list of accounts that predominantly cover the NBA. I wanted to build a more comprehensive feed to capture a larger portion of skeets related to the NBA on Bluesky.
The result is the NBA Now feed! The most complete feed of NBA content available on Bluesky! Building the feed required finding a balance between maximizing the reach of the feed to capture as many NBA skeets as possible while minimizing noise from unrelated skeets making their way into the feed.
This post covers the challenges of building this feed, how I designed the feed to address them, and what's next for NBA Now!
Challenges
NBA-related skeets are a tiny fraction of Bluesky activity. The main challenge is efficiently identifying them in the stream of skeets coming from Bluesky.
Using a simple string search for hashtag and keyword identifiers (NBA, #NBASky, #nba etc.) to identify skeets is one approach. Keyword matching alone will retrieve some NBA content, but the feed will be incomplete since not all NBA skeets explicitly mention the NBA or use a hashtag.
What about all the skeets referencing NBA players and teams? Those are certainly related to the NBA and should be in the feed. Adding more keywords to capture these skeets creates a new problem with the relevance of the matches. Consider the following scenarios:
- Steph Curry is commonly referred to as simply "Steph" in skeets, which is a fairly common name. A keyword search for "Steph" is going to return a bunch of non-NBA skeets.
- How can team names like the "Kings" and "Spurs" be differentiated from other sports franchises with the same name?
- If a user mentions "Heat" how can it be identified as related to basketball and not the temperature outside? What about the Magic or the Jazz?
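To make the ambiguity concrete, here is a small sketch of a naive keyword match. The keyword set and the sample skeets are invented for illustration and are not taken from the feed itself.

```python
# Hypothetical keyword set; the real feed uses a much larger list.
KEYWORDS = {"steph", "kings", "heat"}


def naive_match(text: str) -> bool:
    """Return True if any whitespace-separated word is a keyword."""
    words = {w.strip(".,!?#").lower() for w in text.split()}
    return not KEYWORDS.isdisjoint(words)


# Invented examples: two are about basketball, two are not.
samples = [
    "Steph hit another logo three tonight",
    "My cousin Steph just got a new puppy",
    "The Heat need a backup point guard",
    "This heat is unbearable, stay hydrated",
]

# Every sample matches, so keywords alone cannot separate the two groups.
matches = [s for s in samples if naive_match(s)]
```

All four samples pass the keyword check, which is exactly the noise problem described above.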
The objective becomes maximizing the capture of NBA skeets on Bluesky while reducing the noise of unrelated skeets reaching the feed.
To solve this challenge, I implemented a hybrid solution combining keyword search and an LLM classifier to generate a complete feed of NBA skeets.
Building a Hybrid Solution
Skeets from the Bluesky stream are processed in two steps: filtering and classifying.
In the filtering step, keywords are used to identify candidate skeets from the Bluesky stream. A robust set of keywords covering NBA teams, nicknames, and players helps to identify candidates for the feed. Since NBA skeets are a small portion of overall skeets on Bluesky, the keyword filters are an efficient mechanism for discarding the bulk of unrelated skeets.
Keywords are also used for exclusion logic. Skeets with gambling-related terms are automatically excluded from the feed. Similarly, each skeet's author is checked against a list of bot accounts that are blocked from the feed.
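A minimal sketch of this filtering step might look like the following. The keyword lists, the blocked account identifier, and the function names are all assumptions made up for illustration; the actual lists are not shown in this post.

```python
import re

# Hypothetical lists for illustration only.
INCLUDE_KEYWORDS = ["nba", "nbasky", "steph", "kings", "spurs", "heat"]
EXCLUDE_KEYWORDS = ["parlay", "sportsbook", "betting odds"]
BLOCKED_DIDS = {"did:plc:examplebot"}


def _compile(words):
    # Word boundaries keep "heat" from matching inside "theater".
    pattern = r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b"
    return re.compile(pattern, re.IGNORECASE)


INCLUDE_RE = _compile(INCLUDE_KEYWORDS)
EXCLUDE_RE = _compile(EXCLUDE_KEYWORDS)


def is_candidate(text: str, author_did: str) -> bool:
    """Decide whether a skeet should move on to the classifier."""
    if author_did in BLOCKED_DIDS:
        return False  # author is on the bot block list
    if EXCLUDE_RE.search(text):
        return False  # gambling-related term found
    return INCLUDE_RE.search(text) is not None
```

Running the cheap checks first means only a small share of the stream ever reaches the more expensive classification step.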
Relying on keywords alone is not sufficient. To maximize the signal-to-noise ratio in the feed, all candidates are put through a classifier to determine if the skeet is about the NBA. The classifier is a call to an LLM that passes the skeet text and instructs the model to decide whether it is related to the NBA. If the classifier returns true, the skeet is added to the Postgres database of skeets to include in the feed. This helps to remove the noise introduced by skeets using keywords in a non-basketball context.
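The classification step can be sketched roughly as below. The prompt wording, the function names, and the true/false reply format are assumptions; the LLM call itself is passed in as a plain function (`ask_llm`) because the post does not say which provider or client library is used.

```python
from typing import Callable

# Hypothetical prompt; the wording used by the real feed is not shown here.
PROMPT_TEMPLATE = (
    "You are classifying social media posts. "
    "Answer with exactly one word, true or false: "
    "is the following post about the NBA?\n\nPost: {text}"
)


def build_prompt(text: str) -> str:
    return PROMPT_TEMPLATE.format(text=text)


def parse_verdict(reply: str) -> bool:
    # Anything other than a clear "true" counts as a rejection, so a
    # malformed reply keeps the skeet out of the feed.
    return reply.strip().lower().rstrip(".") == "true"


def is_nba_skeet(text: str, ask_llm: Callable[[str], str]) -> bool:
    """Ask the supplied LLM function whether the skeet is about the NBA."""
    return parse_verdict(ask_llm(build_prompt(text)))
```

Keeping the LLM call behind a function argument also makes the classifier easy to exercise with a stub, for example `is_nba_skeet("Steph for three!", lambda prompt: "true")`.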
A Note on Implementation
Huge kudos to MarshalX for sharing his repo for building a bluesky-feed-generator! I worked off a clone of the repo for constructing my feed.
For the most part the code is the same. I modified data_filter.py to perform the filtering and classification steps. I changed the database to Postgres. A table was added to track who visits the feed so I can build activity metrics. Then I deployed to Heroku to serve the feed.
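As an illustration of the visit-tracking idea, here is a small sketch of such a table and one activity metric. The table name, column names, and functions are hypothetical, it assumes the visitor's DID is known when the feed is requested, and it uses Python's built-in sqlite3 in place of Postgres so the example stays self-contained.

```python
import sqlite3

# In-memory SQLite stands in for the Postgres database used by the feed.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE feed_visit (
        id INTEGER PRIMARY KEY,
        visitor_did TEXT NOT NULL,
        visited_at TEXT NOT NULL  -- ISO 8601 timestamp, UTC
    )
    """
)


def record_visit(visitor_did: str, visited_at: str) -> None:
    """Store one feed request."""
    conn.execute(
        "INSERT INTO feed_visit (visitor_did, visited_at) VALUES (?, ?)",
        (visitor_did, visited_at),
    )


def daily_unique_visitors():
    """Return (day, unique visitor count) pairs, oldest day first."""
    rows = conn.execute(
        """
        SELECT substr(visited_at, 1, 10) AS day,
               COUNT(DISTINCT visitor_did)
        FROM feed_visit
        GROUP BY day
        ORDER BY day
        """
    )
    return rows.fetchall()
```

Counting distinct visitors per day is only one possible metric; the same table could also support counts of total requests or returning visitors.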
All of this was pretty easy to follow with the GH repo!
Ongoing Challenges
Bots are an ongoing nuisance. Every day I find myself adding a couple more accounts to the block list for slinging slop into the feed. Even though bots don't seem to be a rampant issue, I still cringe when I see a skeet that's clearly from a bot account. If the problem gets worse, I'm going to need to consider an automated way to identify these accounts.
Content moderation is also something I try to stay on top of. I made a conscious choice to remove all gambling-related content. Every now and then some sneaks in, so I'm tightening the restrictions to limit the number of those skeets finding their way into the feed.
Similarly, the classifier isn't perfect. There are still instances where unrelated posts are getting into the feed. The main issues have been confusion with the "Kings" NHL team and the "Spurs" in the EPL. The unrelated content isn't a huge problem, but tuning the classifier to limit the impact would boost the feed's value.
Finally, all the discussion about these ongoing challenges is purely anecdotal. I'm not capturing hard data to quantify the value of the feed. For example, I don't know what the signal-to-noise ratio is. I also don't know what share of the NBA skeets on Bluesky I'm actually capturing. Capturing more data around these metrics would allow me to track the health of the feed.
Next Steps
Compared to other NBA feeds, NBA Now is extracting a wider variety of posts, and the level of noise seems low. The hybrid approach to stream filtering is working well and is economical, costing only a few cents a day.
At the top of my priority list is quantifying the performance of this hybrid system. How accurate is the classifier? Is a lot of relevant content being incorrectly excluded? I want to start gathering data to be more prescriptive in future updates.
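One way to put numbers on those two questions is precision and recall over a hand-labeled sample. The sketch below is an assumption about how that could be done, not something the feed does today; the `LabeledSkeet` type and its fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LabeledSkeet:
    in_feed: bool  # did the pipeline include this skeet?
    is_nba: bool   # human judgment: is it really about the NBA?


def precision_recall(samples):
    """Return (precision, recall) for a list of labeled skeets."""
    tp = sum(s.in_feed and s.is_nba for s in samples)
    fp = sum(s.in_feed and not s.is_nba for s in samples)
    fn = sum(not s.in_feed and s.is_nba for s in samples)
    # Precision: share of feed skeets that are truly about the NBA.
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: share of true NBA skeets that made it into the feed.
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Precision corresponds to the signal-to-noise question, while recall corresponds to how much relevant content is being missed. Measuring recall means labeling skeets that were not included, so the sample would have to come from the raw stream and not just from the feed.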
I had a lot of fun building this feed, so why not build another? F1 and MLB are around the corner! Stay tuned!