LLM evals + Hacktoberfest = ❤️ Learn how to contribute new LLM evaluation metrics to the open-source Evidently library
Community

LLM hallucinations and failures: lessons from 4 examples

September 25, 2024

Large language models (LLMs) are powerful tools that enable AI applications like chatbots, content generation, and coding assistants. However, LLMs are not flawless. One common issue is LLM hallucination, where the model generates factually incorrect or fabricated information. Some LLMs hallucinate spectacularly, making the headlines, losing their company’s money, and leaving users frustrated. 

This post will explore four real-world examples of LLM hallucinations and other failures that can occur in LLM-powered products in the wild, such as prompt injection and out-of-scope usage scenarios. These LLM failures make good examples for developers, AI engineers, and product managers to learn from.

Each case highlights critical gaps that can occur when working with LLM-powered applications from development to production. We will explore what went wrong and suggest how to avoid these pitfalls.

[fs-toc-omit]Get started with AI observability
Try our open-source library with over 20 million downloads, or sign up to Evidently Cloud to run no-code checks and bring all the team to a single workspace to collaborate on AI quality.

Sign up free ⟶


Or try open source ⟶

LLM hallucination examples

Support chatbot cites nonexistent policy

Air Canada, the flag carrier and the largest airline in Canada, was ordered to compensate a passenger who received incorrect information concerning refund policies from the airline's chatbot.

When the customer sought a refund, the airline admitted the chatbot’s error but refused to honor the lower rate. However, a tribunal ruled that Air Canada is responsible for all information on its website, whether it comes from a static page or a chatbot. The tribunal also ruled that the airline “failed to take reasonable care in ensuring its chatbot's accuracy” and ordered payment of the fare difference.

Support chatbot cites nonexistent policy

Making up a nonexistent policy—as Air Canada’s chatbot did—is a typical example of LLM hallucination, where the AI-generated response isn’t grounded in factual information like a company policy document or a support knowledge base. If not monitored and tested properly, sometimes LLMs can make things up! 

Takeaway: use RAG to ground the responses and build an evaluation system to test it.

The Air Canada case is a clear example of an LLM hallucination. To prevent LLM from hallucinating, it is common to implement RAG - retrieval-based generation, where you first look up documents that can support the answer and then generate the answer itself. Ideally, you can also instruct your LLM app to accompany its responses with references or links to verified sources to ensure accuracy.

To make sure that this architecture works well and that LLM’s responses are grounded in the appropriate context, you also need evaluations and testing, both before you deploy the system and to continuously test the model’s responses in production. One way to evaluate the groundedness of responses is to use another LLM to evaluate the outputs of your AI system, a method known as "LLM-as-a-judge". (To learn how to create and run LLM evaluators, check out this guide on LLM judges and a hands-on tutorial). 

ChatGPT references fake legal cases

During the New York federal court filing, one of the lawyers was caught citing non-existent cases. It turned out he was using ChatGPT to conduct legal research—the bot referenced fake cases to the attorney.

Responding to the incident, a federal judge issued a standing order that anyone appearing before the court must either attest that “no portion of any filing will be drafted by generative artificial intelligence” or flag any language drafted by AI to be checked for accuracy. 

ChatGPT references fake legal cases
Takeaway: clearly communicate to the users how your product works.

This case is another LLM hallucination example, where the AI confidently generates sources that appear plausible but are entirely fabricated. Beyond rigorous testing, it’s crucial to communicate to users that AI can make mistakes: ChatGPT should not be treated as a knowledge database. In scenarios where freeform answers are generated without strict contextual grounding, users should be cautious and verify the LLM's outputs. Today, many LLM-powered features already include disclaimers, acknowledging that errors may occur: if you are creating a product that can be misused, it’s important to provide such clarity (like labeling specific responses as AI-generated) so that users must remain vigilant in double-checking critical information.

LLM hallucination is not the only–though common–LLM flaw. Here are more examples of LLM failures to learn from.

Unintended behavior and prompt injection

Support chatbot generates code

A Swedish fintech company, Klarna, launched its AI chatbot that handles two-thirds of customer support chats. A month after the chatbot’s release, Klarna announced that its AI assistant had 2.3 million conversations, equivalent to 700 full-time support agents. The company reported a 25% drop in repeat inquiries and a 5X decrease in errands resolving time. The chatbot is available in 23 markets, 24/7, and communicates in more than 35 languages. 

Users who have deliberately tested the chatbot for common behavior failures report that it is well-built. However, one user got the Klarna chatbot to generate code by asking for some help with Python, a conversation scenario that should not be expected from a support chatbot. 

LLM prompt injection example
Source: Collin Fraser account on X

Preparing the chatbot to detect and handle off-topic scenarios is an important consideration in the AI product design. Beyond benign occasions where customers treat your chatbot as a coding assistant for fun, you should consider the risks of prompt injection–a technique used to manipulate or override LLM's intended behavior by inserting specific instructions or content into the input prompt.

Takeaway: consider implementing a classifier to detect types of queries your product should not handle.

The key lesson from this example is that AI-powered systems should be thoroughly tested for how they respond to off-topic or inappropriate prompts. One effective solution is using filters to check whether the user's inquiry fits the expected scenarios and implement this validation directly inside your product. If it doesn’t, you can return a preset response like “I can’t answer this question” or redirect the query to a live agent to prevent misuse and ensure the chatbot remains focused on its core function.

Chatbot sells a car for $1

Another example of unintended behavior is from a Chevrolet customer service chatbot. By instructing the LLM to agree with every demand, the user got the chatbot to sell him a late-model Chevrolet Tahoe for a dollar and posted the screenshots on X. 

LLM prompt injection example
Source: Chris Bakke account on X

The post went viral and sparked a series of other funny incidents: users managed to make the chatbot do a Python script, offer a 2-for-1 deal for all new vehicles, and recommend a Tesla.

LLM prompt injection example
Source: Colin Fraser account on X
Source: Collin Muller account on X
Takeaway: implement output guardrails and perform adversarial testing during product development. 

The lack of guardrails in Chevy’s chatbot allowed skilled users to generate responses beyond the customer service chatbot scope and obtain almost any kind of response from the app. While some instances may seem harmless, prompt injection can potentially cause security breaches in AI-powered systems and undermine AI assistants’ reliability, as well as harm your brand perception. 

To address this risk, you can implement protections inside your product, both by adding specific instructions to your prompt and implementing real-time output guardrails. To see how well this works, you should test your LLM product for these behaviors during development using adversarial testing. You can try to purposefully steer your LLM product to misbehave, like asking it to talk about competition and see how it handles this. In production, you should continue monitoring: for example, you can use custom LLM judges to detect unintended AI-system behavior. 

What can we learn from the examples?

LLM-powered systems are not the “deploy-and-forget” type of solutions. To mitigate the risk of LLM hallucinations and other failures–like prompt injection–we must thoroughly test AI products, evaluate their performance on specific tasks and monitor it over time. Here are some considerations to keep in mind when building an LLM applications:

When to evaluate? At every step of the process, actually! While developing your app and experimenting with prompts and models, you can create evaluation datasets to test specific behaviors. When making changes, you’ll want to perform regression testing to ensure that tweaking a prompt does not break anything and that the new version performs as expected. In production, you can run checks on live data to keep tabs on the outputs of your LLM-powered system and ensure they are safe, reliable, and accurate. 

Define what quality is to you. Evaluating LLM systems is task-specific: the perceived quality of an airline support chatbot will differ from that of a coding assistant. Specify the criteria that matter to you–whether it's relevance, correctness, or safety, and design evaluation datasets that are representative of your use case. 

Often, simple metrics are not enough. LLM systems are difficult to evaluate. And when it comes to generative tasks, simple metrics like semantic similarity are rarely enough. One of the ways to approach this problem is to use LLM-as-a-judge. LLM judges assess outputs based on specific criteria–you can create multiple judges and provide precise guidelines for each task. 

Monitor metrics in time. By tracking the system’s inputs and outputs over time, you can monitor trends, recognize changes in user behavior, and spot recurring issues. Having a visual dashboard makes monitoring and debugging much easier. 

LLM evaluations with Evidently

Evidently is an open-source framework to evaluate, test, and monitor AI-powered apps with over 20 million downloads. It helps build and manage LLM evaluation workflows end-to-end. 

With Evidently, you can run 100+ built-in checks, from classification to RAG. To run LLM evals, you can pick from a library of metrics or easily configure custom LLM judges that fit your quality criteria.

Evidently LLM evaluation dashboard
Example Evidently monitoring dashboard with custom evaluations.

To work as a team and get a live monitoring dashboard, Try Evidently Cloud, a collaborative AI observability platform for teams building LLM-powered products. The platform lets you run evaluations on your LLM outputs code-free. You can trace all interactions, create and manage test datasets, run evaluations, and create LLM judges from the user interface.

Sign up for free, or schedule a demo to see Evidently Cloud in action.

You might also like

Get Started with AI Observability

Book a personalized 1:1 demo with our team or sign up for a free account.
Icon
No credit card required
By clicking “Accept”, you agree to the storing of cookies on your device to enhance site navigation, analyze site usage, and assist in our marketing efforts. View our Privacy Policy for more information.