AI is in the headlines daily. The industry is moving rapidly, but behind the scenes enterprises are using AI to deliver value by solving real business problems. In this garden I'm collecting AI use cases from enterprises by reviewing the technical blogs they've written about developing their AI solutions.
By examining these implementations, I'm trying to better understand where AI is creating practical value in business settings and learn from the technical approaches these companies have taken.
Below is a handful of blogs I've read recently, with more coming soon. If you have any suggestions, email me at matt@mattvielkind.com and let me know!
There's not enough here to draw too many conclusions just yet. The one takeaway so far is the sheer range of applications AI is being used for (developer tools, chatbots, data management, customer support, etc.). In some cases AI is enabling enterprises to solve problems that were otherwise not economically feasible to address, which to me is pretty cool. Looking forward to discovering more examples of enterprises generating value from AI with practical business solutions.
Empowering Engineers with AI
Slack
November 8, 2024
Use Cases: Slack uses LLMs for three main purposes: automating responses to support escalations in channels, enabling contextual information search and retrieval across their documentation, and assisting engineers in developing AI applications.
Design: Their system is built on a unified LLM platform that provides access to multiple approved LLMs with a standardized input/output schema, so engineers can switch LLM providers without breaking other areas of their code. They use Amazon Bedrock Knowledge Bases for vectorizing and storing data from internal sources. The system includes CLI tools, a playground environment for testing, and comparison tools for LLM outputs. Each channel can have customized configurations for prompts and response parameters.
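The post doesn't show the schema itself, but a provider-agnostic interface along these lines would achieve the swap-without-breaking property. This is a minimal sketch with made-up names, not Slack's actual implementation:

```python
# Hypothetical provider-agnostic LLM interface; class and field names are
# illustrative, not Slack's actual schema.
from dataclasses import dataclass, field


@dataclass
class LLMRequest:
    prompt: str
    temperature: float = 0.0
    max_tokens: int = 512
    metadata: dict = field(default_factory=dict)


@dataclass
class LLMResponse:
    text: str
    model: str
    usage: dict = field(default_factory=dict)


class LLMProvider:
    """Common interface every approved provider adapter implements."""

    def complete(self, request: LLMRequest) -> LLMResponse:
        raise NotImplementedError


class BedrockProvider(LLMProvider):
    def complete(self, request: LLMRequest) -> LLMResponse:
        # Call Amazon Bedrock here and map its response into LLMResponse.
        ...


def generate(provider: LLMProvider, prompt: str) -> str:
    # Callers depend only on the shared schema, so swapping the provider
    # doesn't require changes anywhere else in the codebase.
    return provider.complete(LLMRequest(prompt=prompt)).text
```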
What Worked: The app was deployed across a number of different internal engineering channels. Giving users the ability to customize the bot for their channel (prompt adjustments, response lengths, retriever filters, etc.) helped reduce hallucination rates. Developers had a variety of preferences for interacting with the bot (Slack, web app, IDE, etc.), so delivering the app through those preferred channels helped reduce context switching among engineers.
To help engineers develop AI applications, Slack introduced a number of targeted tools for prototyping new AI applications. These include a CLI and UI playgrounds for interacting with their AI system to test different prompts and settings, plus a separate comparison tool that lets developers quickly compare LLMs to decide which one is best for their use case.
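The comparison tool could be as simple as fanning the same prompt out to each approved provider. Reusing the hypothetical interface sketched above, purely as an illustration rather than Slack's actual tooling:

```python
# Toy side-by-side comparison helper built on the LLMProvider sketch above.
def compare_llms(prompt: str, providers: dict[str, LLMProvider]) -> dict[str, str]:
    """Run one prompt against every provider and collect outputs side by side."""
    return {
        name: provider.complete(LLMRequest(prompt=prompt)).text
        for name, provider in providers.items()
    }
```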
What Didn't Work: Hallucinations and multi-step problem solving remain an issue with current LLMs, which limits the system's ability to help with more complex projects. Getting the retriever to reliably include relevant third-party content from the knowledge base in the prompt context took quite a bit of tuning. More generally, the AI landscape is constantly evolving, and keeping up with the latest trends is time-consuming.
Takeaways: Success in implementing LLMs requires balancing technology with user needs, maintaining strong security and compliance, and fostering continuous improvement. Slack learned that providing multiple interaction points (Slack, web, IDE) was crucial for adoption, and that customization capabilities were essential for different team needs. They also recognized the importance of measuring impact while accounting for time lost to limitations like hallucinations.
Automation Platform v2: Improving Conversational AI at Airbnb
Airbnb
October 28, 2024
Use Cases: Airbnb primarily uses LLMs to enhance customer support operations by providing a natural dialogue experience and responding to dynamic conversation turns.
Design: Their Automation Platform utilizes a Chain of Thought (CoT) workflow where LLMs act as reasoning engines to determine tool usage and sequence. The platform constructs the prompt from the user task and provided context (previous chat history, user ID, user role, trip information, etc.). Based on that information the LLM decides which tool to execute and performs the execution. Depending on the response, the CoT workflow will either call the next tool or generate the final response for the user.
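In pseudocode the loop might look something like the sketch below. The tool names, context fields, and decide_next_step call are my assumptions for illustration, not Airbnb's actual Automation Platform API:

```python
# Rough sketch of a CoT tool-calling loop; all names are hypothetical.
def run_cot_workflow(user_message: str, context: dict, tools: dict, llm) -> str:
    history = [
        {"role": "system", "content": f"Context: {context}"},
        {"role": "user", "content": user_message},
    ]
    for _ in range(5):  # cap the number of reasoning/tool steps
        decision = llm.decide_next_step(history, available_tools=list(tools))
        if decision["action"] == "respond":
            return decision["message"]           # final answer for the user
        tool = tools[decision["action"]]         # e.g. "lookup_reservation"
        result = tool(**decision.get("arguments", {}))
        history.append({"role": "tool", "content": str(result)})
    return "Sorry, I couldn't resolve this. Connecting you with a support agent."
```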
What Worked: Not relying entirely on LLMs for all user interactions. Airbnb used a hybrid of LLM-based and traditional conversational AI workflows. For cases dealing with confidential information and strict validations, where accuracy is of the utmost importance, Airbnb utilizes traditional conversational AI approaches, whereas more dynamic conversations are better handled by LLMs. To mitigate the impact of hallucinations and potentially inappropriate content in the LLM response, a Guardrails Framework was created. The guardrails are LLM-based and can be called as part of the Automation Platform to monitor LLM communications, ensuring responses meet standards in helpfulness, relevance, and ethics.
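The post doesn't detail the guardrail implementation, but an LLM-based check can be as simple as a second model call that scores a draft reply before it goes out. A minimal sketch, with the prompt wording entirely my own:

```python
# Illustrative LLM-based guardrail; not Airbnb's actual Guardrails Framework.
GUARDRAIL_PROMPT = """You are reviewing a customer support reply.
Rate it PASS or FAIL on helpfulness, relevance to the question,
and compliance with content policy. Reply with PASS or FAIL only.

Question: {question}
Draft reply: {reply}"""


def passes_guardrails(question: str, reply: str, llm) -> bool:
    verdict = llm.complete(GUARDRAIL_PROMPT.format(question=question, reply=reply))
    return verdict.strip().upper().startswith("PASS")
```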
What Didn't Work: Related to the point above, relying on LLMs alone for large-scale customer interactions wasn't feasible due to issues with latency and hallucination. Using the right tool for the specific scenario helped alleviate some of the hallucination issues with LLMs.
Takeaways: As the old adage goes, use the right tool for the right job. A hybrid approach combining traditional workflows with LLM capabilities works best for enterprise applications. With the Automation Platform, developers can build dynamic experiences by leveraging the context retrievers and tools that the CoT workflow is built upon. Having guardrails in place improves confidence that the LLM is providing users with helpful responses. LLMs are constantly evolving, and as new advances become available the platform can expand to incorporate them.
Feedly AI Actions: Insights from the development process
Feedly
June 25, 2024
Use Cases: LLMs are implemented for three main use cases: performing recurring actions on individual articles (like summarizing or translating), analyzing multiple articles simultaneously, and answering questions across large sets of articles.
Design: There are limited details about the specific design of their system. In general, separate solutions were explored for different use cases: for example, filling a large context window for analyzing multiple articles, but opting for RAG to answer questions across a corpus of documents.
What Worked: Continuous evaluation and adoption of new LLMs helped maintain and improve performance. As new models were released they were incorporated into the testing pipeline to determine the level of improvement. Significant investments were made in prompt engineering to maximize accuracy and improve reliability while limiting side effects like increased latency.
One specific breakthrough was the use of in-line citations. Providing in-line citations:
- Reduced the complexity (no need to manage a separate references section in the UI!)
- Improved the end-user experience by showing users exactly where each citation applies
- Increased development flexibility since the output and streaming formats were simplified.
Details about how the in-line citations were achieved are missing. Based on the information provided I'm assuming the citations were generated through some prompt engineering.
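If that's right, the instruction might look something like this sketch; the exact wording and citation format are my guess, not anything Feedly has published:

```python
# Hypothetical prompt shape for producing in-line citations.
CITATION_PROMPT = """Answer the question using only the articles below.
After every claim, cite the supporting article in-line as [1], [2], etc.,
matching the numbers in the article list. Do not add a references section.

Articles:
{numbered_articles}

Question: {question}"""


def build_prompt(articles: list[str], question: str) -> str:
    numbered = "\n".join(f"[{i + 1}] {a}" for i, a in enumerate(articles))
    return CITATION_PROMPT.format(numbered_articles=numbered, question=question)
```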
What Didn't Work: Off-the-shelf RAG tools proved unreliable, producing inconsistent results and failing to retrieve all relevant articles. Due to the poor performance of the off-the-shelf RAG tools, Feedly opted to leverage large context windows for most of their use cases.
Takeaways: The core learnings were that success required significant investment in prompt engineering and maintaining flexibility to adapt to rapidly evolving LLM technology. They found that careful attention to source attribution and citation was crucial for user trust, and that staying responsive to new model releases was essential despite the operational challenges it created.
Musings on building a Generative AI product
LinkedIn
April 25, 2024
Use Cases: The system helps users get quick takeaways from posts, learn the latest company information, assess their fit for job opportunities, receive profile improvement suggestions, prepare for interviews, and get guidance on career pivots. These applications focus on making information more accessible and actionable for LinkedIn members.
Design: LinkedIn built their system around a 3-step RAG (Retrieval Augmented Generation) pipeline architecture. The pipeline begins with routing, which selects the appropriate AI agent for the task, followed by retrieval to gather data from internal APIs and Bing, and finally generation to synthesize the response. The system uses multiple specialized AI agents for different tasks, with organization split between a horizontal team managing common components and vertical teams developing specific agents.
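Stripped of the organizational details, the routing/retrieval/generation flow reduces to something like the sketch below; the agent interface and model split are placeholders, not LinkedIn's internal APIs:

```python
# Simplified route -> retrieve -> generate pipeline; all names are hypothetical.
def answer(query: str, llms: dict, agents: dict) -> str:
    # 1. Routing: a smaller model picks the specialized agent for the query.
    agent_name = llms["small"].classify(query, options=list(agents))
    agent = agents[agent_name]

    # 2. Retrieval: the chosen agent gathers supporting data (internal APIs, Bing, etc.).
    context = agent.retrieve(query)

    # 3. Generation: a larger model synthesizes the final response from the context.
    prompt = agent.prompt_template.format(query=query, context=context)
    return llms["large"].generate(prompt)
```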
What worked: The fixed 3-step pipeline structure provided a solid foundation, while using smaller models for routing/retrieval and larger ones for generation optimized performance. Embedding-based retrieval effectively injected examples into the system, and dividing work between specialized agents improved development speed. The centralized evaluation pipeline, shared prompt templates, and UX components helped maintain consistency. End-to-end streaming and async non-blocking pipeline design significantly improved performance and user experience.
What didn't work: Maintaining uniform experience across different agents proved difficult, and the manual evaluation process was slow and resource-intensive. Automatic evaluation remained a persistent challenge throughout development. LLMs initially struggled with API parameter formatting, showing a 10% error rate. Quality improvements plateaued after initial rapid progress, making the final polish more difficult than anticipated. The team also faced challenges with throughput versus latency trade-offs, and their initial capacity planning didn't adequately account for Chain of Thought token usage.
Key Learnings: While achieving 80% quality happened quickly, reaching 95%+ quality proved extremely challenging. The team found that prompt engineering felt more like tweaking expert systems than traditional ML approaches. Resource management, including GPU capacity, latency, and throughput, required careful balancing throughout the system. End-to-end streaming and async processing proved crucial for performance optimization. Finally, the team identified a clear need for better automatic evaluation systems to enable faster iteration in development.
How we built Text-to-SQL at Pinterest
Pinterest
April 2, 2024
Use Cases: Generating SQL from text to improve task completion speed among data users. With data spread across multiple domains, finding the right data for a given analytical problem is a significant challenge for data users, one the Text-to-SQL solution aims to simplify.
Design: The system has two main components: a table search system and the LLM-powered Text-to-SQL generator. Pinterest maintains hundreds of thousands of tables in their data warehouse, so knowing which tables to use for a given task is a big challenge for data users. To help users discover the tables they need, Pinterest generated embeddings for the tables in their warehouse. An LLM was used to create a summary for each table, using the table schema and recent queries as inputs (see the summarization prompt in the article!). Embeddings are then generated from the LLM summaries and stored in a vector store. Similarly, embeddings were generated for query-level summaries and stored in the vector store.
In the Text-to-SQL generator, if the user does not explicitly specify which tables to use, the user's analytical question is transformed into embeddings and a vector search retrieves the top N candidate tables. A follow-up prompt then selects the best candidates to carry into the Text-to-SQL generation.
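A rough sketch of that retrieval step, using hypothetical embedding and vector store helpers rather than Pinterest's actual stack:

```python
# Illustrative table-selection step; the embed, vector_store, and llm
# interfaces are assumptions made for this sketch.
def select_tables(question: str, embed, vector_store, llm, top_n: int = 10) -> list[str]:
    # Embed the analytical question and pull the closest table summaries.
    question_vector = embed(question)
    candidates = vector_store.search(question_vector, k=top_n)

    # A follow-up prompt asks the LLM to keep only the tables it actually needs.
    prompt = (
        "Given the question and candidate tables below, return only the tables "
        "required to answer the question, one per line.\n"
        f"Question: {question}\n"
        f"Tables: {[c.table_name for c in candidates]}"
    )
    return llm.complete(prompt).splitlines()
```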
The article provides a bunch of prompt examples so you can see how they are implemented. Some of the prompts are available in the Pinterest querybook GitHub repo as well.
What Worked: Experimenting with different weights between the table- and query-based vector searches yielded huge improvements. Giving more weight to the table embeddings increased the hit rate from 40% to 90%.
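One plausible reading of "weights between the searches" is a weighted blend of the similarity scores from the two indexes, something like the sketch below; the 0.9/0.1 split is made up to show the mechanic, not Pinterest's actual weighting:

```python
# Hypothetical blending of table-summary and query-summary search scores.
def blended_table_scores(table_hits: dict, query_hits: dict,
                         table_weight: float = 0.9) -> dict:
    # table_hits / query_hits map table name -> similarity score from each index.
    tables = set(table_hits) | set(query_hits)
    return {
        t: table_weight * table_hits.get(t, 0.0)
           + (1 - table_weight) * query_hits.get(t, 0.0)
        for t in tables
    }
```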
What Didn't Work: No explicit negative learnings are mentioned, but a few areas of improvement are: enhancing table metadata, further refining the retrieval step, adding a step to validate the generated SQL, and collecting better metrics from end users for evaluation.
Takeaways: The value of iterative product improvements. The initial Text-to-SQL system focused on generating SQL for a user's analytical question, with the assumption that the user provided the tables to use. After refining the ability to generate SQL from a user's question using LLMs, Pinterest expanded the capabilities. Recognizing that knowing which tables to use is burdensome for users, the second iteration directly addressed that problem and provided additional value to end users.
Summarizing Post Incident Reviews with GPT-4
Canva
November 13, 2023
Use Cases: Automating generation of Post Incident Report (PIR) summaries using GPT-4, replacing what was previously a manual process by reliability engineers. Manually writing summaries was time-consuming and produced inconsistent results. The summaries help engineers quickly review and understand incidents.
Design: PIR reports are fetched from Confluence. Content from the reports is parsed from HTML and sensitive data is removed. The sanitized text is sent to the GPT-4 chat completion endpoint, where a prompt performs the summarization task. The generated summary is archived in a data warehouse. A copy of the generated summary is also maintained in Jira, where webhooks capture manual modifications made to the summary. Tracking manual modifications in Jira provides a passive evaluation of the summary generation.
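End to end the pipeline is roughly the following; the helper objects here stand in for Canva's internal integrations and are purely illustrative:

```python
# High-level sketch of the PIR summarization pipeline described above.
def summarize_pir(page_id: str, confluence, sanitize, gpt4, warehouse, jira) -> str:
    html = confluence.fetch_page(page_id)   # pull the PIR from Confluence
    text = sanitize(html)                   # parse the HTML and strip sensitive data
    summary = gpt4.summarize(text)          # GPT-4 chat completion with the summary prompt
    warehouse.archive(page_id, summary)     # keep a copy in the data warehouse
    jira.attach_summary(page_id, summary)   # Jira webhooks later capture manual edits
    return summary
```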
What Worked: Utilizing GPT-4 proved effective, with engineers rarely needing to modify the AI-generated summaries. Their prompt engineering strategy using two example PIRs worked well, as it prevented rigid copying while maintaining consistency. The system successfully maintained a blameless approach in summaries and stayed within reasonable cost parameters at about $0.60 per summary.
What Didn't Work: Canva also attempted fine-tuning, but didn't get good performance given the limited training data available (only ~1,500 examples). They found that directly limiting the API response length led to truncated outputs, leading them to instead specify the desired length in the prompt.
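The difference between the two approaches is easy to picture; the request shapes below follow the common chat-completions convention and are only illustrative:

```python
# Hard cap on output tokens: cheap, but summaries can get cut off mid-sentence.
def truncating_request(pir_text: str) -> dict:
    return {
        "max_tokens": 120,
        "messages": [{"role": "user", "content": f"Summarize this incident:\n{pir_text}"}],
    }


# Ask for the length in the prompt instead, so the model finishes its sentences.
def prompt_length_request(pir_text: str) -> dict:
    return {
        "messages": [{"role": "user",
                      "content": f"Summarize this incident in 3-4 sentences:\n{pir_text}"}],
    }
```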
Takeaways: GPT-4 chat was more effective than both fine-tuning and standard completion APIs for their use case. The project demonstrated that AI can effectively reduce operational workload while maintaining or improving quality. Their success relied heavily on careful prompt engineering and appropriate model selection rather than fine-tuning.
LLM-powered data classification for data entities at scale
Grab
October 2023
Use Case: Grab primarily uses LLMs for data classification and metadata generation, specifically to identify sensitive data and PII across their database tables and Kafka message schemas. The system automatically tags data fields with appropriate sensitivity levels and data types.
Design: An orchestration service manages classification requests from data platforms. Requests are aggregated into mini-batches where both a third-party classification service and GPT-3.5 assign category tags to fields. For example, a tag could indicate whether the field contains personally identifiable information (PII). The prompt contains a list of available tags with their definitions, along with examples of how the model should respond.
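A prompt along those lines might be built like the sketch below; the tag names, definitions, and output format are examples I made up, not Grab's actual taxonomy:

```python
# Hypothetical classification prompt with tag definitions and a few-shot example.
TAGS = {
    "PII_EMAIL": "The column contains personal email addresses.",
    "PII_PHONE": "The column contains phone numbers.",
    "NON_SENSITIVE": "The column contains no sensitive or personal data.",
}

CLASSIFICATION_PROMPT = """Classify each column of the table below with exactly
one tag from this list:
{tag_definitions}

Respond as JSON: {{"column_name": "TAG", ...}}

Example:
Columns: ["email", "signup_date"]
Answer: {{"email": "PII_EMAIL", "signup_date": "NON_SENSITIVE"}}

Columns: {columns}
Answer:"""


def build_classification_prompt(columns: list[str]) -> str:
    tag_definitions = "\n".join(f"- {tag}: {definition}" for tag, definition in TAGS.items())
    return CLASSIFICATION_PROMPT.format(tag_definitions=tag_definitions, columns=columns)
```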
What Worked: Prompt engineering techniques proved effective, particularly clear requirement articulation, few-shot learning, and schema enforcement. The LLM showed high accuracy in semantic understanding, with users typically changing less than one tag per table.
What Didn't Work: Nothing is mentioned as being a dead end during the development.
Takeaways: The problem fits into a category that only LLMs could solve efficiently. Previous approaches with regex patterns had too many false positives, and building an in-house classifier would have required significant investment in a dedicated data science team. LLMs produced an accurate and scalable solution: 80% of users found the new system helpful for tagging data entities, and in the month after its rollout 20,000 data entities were scanned, saving 360 man-days per year.
Changelog
- 2024-11-16: Seed planted.