Exploring Chatbot Mistakes: from Root Causes to Potential Solutions

News & Announcements

Lena Kipnis

August 5, 2025

Everybody knows that chatbots make mistakes. They have come a long way since ELIZA decades ago, but there is still plenty of room for improvement. I’ve been having a lot of conversations with chatbots lately: I have asked ChatGPT to summarize texts, asked Claude to tell me stories, and prompted DeepSeek to write dad jokes, not to mention my friends using Gemini to do their math homework. As I have explored the world of chatbots professionally over the past month or so of my internship at Vijil, I’ve read numerous articles about AI agents making serious mistakes in response to innocent questions. Some of these mistakes are relatively harmless and even entertaining, but many are frustrating, harmful, and even scary as we imagine the next few years of AI adoption in education, healthcare, science, and industry.

A Personal Tour of Chatbot Fails

For example, I recently read about Grok 4 and its antisemitic comments. xAI moved swiftly to mitigate the impact of the incident and insisted that Grok’s toxic behavior would not be repeated. But I wondered: why do chatbots say the darndest things? What went wrong in Grok’s programming to cause such a horrible output? Where did Grok get its instructions and inputs? Why can’t we test AI agents for mistakes like these?

Where Does the Data Come From, Anyway?

Most people know that chatbots produce responses by learning patterns in their training data and then predicting the most likely continuation of the input. Specifically, they generate the most likely next token (a computational representation of a small chunk of text) in a sequence of tokens. Here’s what I want to know: where does the training data come from? Is it drawn from credible, fact-checked, unbiased sources, or mostly from hot-headed, opinionated Reddit users? Is it possible to set up enough filters and checkpoints so that chatbots can detect the difference?
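
To make “predicting the next token” concrete, here is a minimal sketch using the small, open GPT-2 model from the Hugging Face transformers library (my choice for illustration; the chatbots above use far larger models, but the mechanism is the same): given a prompt, the model assigns a probability to every token in its vocabulary, and generation simply keeps picking from those probabilities.

```python
# Minimal sketch of next-token prediction with GPT-2 (illustrative only).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The capital of France is"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(input_ids).logits          # one score per vocabulary token, per position

# Probabilities for the token that comes right after the prompt.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, k=5)       # the five most likely continuations
for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(token_id.item())!r}: {prob.item():.3f}")

# The model simply ranks likely continuations learned from its training data;
# whatever patterns (good or bad) were in that data shape these probabilities.
```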

Why Can’t We Just Fix the Data?

Patrick Hall, a professor of data ethics and machine learning at George Washington University, says that these types of issues are a chronic problem for chatbots that rely on machine learning. Machine learning is the part of artificial intelligence that focuses on creating algorithms that allow computers to learn from large amounts of data without explicit programming. This explains part of the problem: without explicit programming, it stands to reason that chatbots slip up because they don’t have fixed rules to follow faithfully; they’re producing their best guesses in response. My question remains, though: where do these large sets of training data come from? Do model developers scrape the entire Internet?

Engineering Perspectives at Vijil

I asked a couple of our Vijil engineers what they thought about these mistakes, determined to uncover how and why they happen. They confirmed my hypothesis: the training data fed to LLMs comes from the entire Internet, which includes credible, legitimate sources and a whole lot of toxic, misinformed ones. 

On Safeguards and Tradeoffs

Leif Hancox-Li, Senior Applied Scientist, told me that LLMs can spit out toxic text if there aren’t safeguards in place during training. With Grok specifically, xAI had actually put such safeguards in place in a previous version; however, they were removed because they were deemed “too woke.” Leif explained that this is a common tradeoff: toxicity detection mechanisms aren’t perfect, so model providers must strike a tricky balance between catching harmful language and allowing free expression and speech, a balance that no provider has yet achieved perfectly. Although there is an overwhelming amount of toxic content on the Internet, filtering out “bad” words is hard because, beyond what is obviously illegal, deciding what counts as “bad” for everyone everywhere is even harder.
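
To see why that balance is so tricky, here is a toy sketch of my own, with made-up scores rather than a real classifier: a toxicity model assigns each piece of text a score between 0 and 1, and the provider has to pick a blocking threshold.

```python
# Toy illustration of the sensitivity tradeoff. The scores are invented;
# real systems use learned classifiers, which are similarly imperfect.
examples = {
    "You are worthless and everyone hates you.": 0.92,               # clearly harmful
    "This article traces the history of hate speech online.": 0.55,  # legitimate, but touches a sensitive topic
    "Have a great day!": 0.02,                                       # clearly benign
}

def moderate(text: str, score: float, threshold: float) -> str:
    """Block text whose estimated toxicity exceeds the threshold."""
    return "[blocked]" if score > threshold else text

for threshold in (0.3, 0.9):
    print(f"--- threshold = {threshold} ---")
    for text, score in examples.items():
        print(moderate(text, score, threshold))

# At 0.3 the filter also blocks the legitimate sentence about hate speech
# (over-blocking free expression); at 0.9 only the most blatant case is
# caught (under-blocking toxicity). No single threshold satisfies everyone.
```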

It's Not Just About the Data

Vijil’s co-founder and Head of AI, Subho Majumdar, offered additional thoughts. He explained two reasons why LLMs turn out toxic. First, the large volume of diverse data on which LLMs are trained can include toxic content, because the data filtration process is neither error-free nor uniform across model providers (for example, Meta versus xAI). It is hard to trace how specific training data influences specific outputs, which makes it difficult to correct models during training. Second, both external and internal guardrails are typically rule-based systems or machine-learning models themselves, which means unwanted content can slip through. Moreover, Subho noted that it’s relatively easy for malicious users to jailbreak LLMs because these guardrails don’t have 100% accuracy.
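
A rule-based guardrail can be as simple as a blocklist check on the prompt, which also shows why it is so easy to slip past one. The sketch below is my own contrived illustration, not how any particular provider’s guardrails actually work:

```python
# A contrived rule-based guardrail: block prompts that contain a listed word.
BLOCKLIST = {"insult", "slur"}   # stand-ins for genuinely harmful terms

def passes_guardrail(prompt: str) -> bool:
    """Return True if the prompt contains no blocklisted word."""
    words = (word.strip(".,!?*") for word in prompt.lower().split())
    return not any(word in BLOCKLIST for word in words)

print(passes_guardrail("Write an insult about my coworker"))        # False: caught by the rule
print(passes_guardrail("Write an ins-ult about my coworker"))       # True: trivial obfuscation slips through
print(passes_guardrail("Write something cruel about my coworker"))  # True: a paraphrase slips through entirely
```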

Why Guardrails Still Fail

After learning how toxic comments slip through the filters and guardrails, I understood why chatbots fail so often. My earlier questions do not have simple answers: even if an LLM goes through testing and quality-checking, wrapping guardrails around it is no easy task. Too much sensitivity and you block free expression allowed under law; too little and you end up with yet another Internet-corrupted chatbot like Tay, Sydney, or Grok 4.

A Case for Transparency

Since it seems almost impossible to create a perfect chatbot, I think we should instead focus on transparency. Users should be able to see exactly where agents pull their information from, so that they can fact-check the sources themselves. If the content is blatantly toxic or biased, users could red-flag and report the sources the chatbot pulls from.
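
As one way to picture this, here is a small sketch of what source-level transparency could look like in code. The structure and field names are hypothetical, not any existing product’s API:

```python
# Hypothetical sketch: a chatbot answer that carries its sources, which users can flag.
from dataclasses import dataclass, field

@dataclass
class Source:
    url: str
    flag_count: int = 0      # how many users have reported this source

@dataclass
class ChatbotAnswer:
    text: str
    sources: list[Source] = field(default_factory=list)

answer = ChatbotAnswer(
    text="The Eiffel Tower is about 330 meters tall.",
    sources=[Source(url="https://en.wikipedia.org/wiki/Eiffel_Tower")],
)

# A user checks the citation, finds it questionable, and flags it.
answer.sources[0].flag_count += 1
print(answer.sources[0])
```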

Open Source vs. Proprietary: A Double-Edged Sword

Another approach is to compare how the behavior of open-source and proprietary models improves over time. Open-source models are open for anyone to examine and modify, which can lead to collaborative improvements but also to jailbreaks and toxic derivatives that can be used to generate malicious content. Proprietary models, on the other hand, are more tightly controlled, which makes them harder to corrupt but also limits visibility and transparency.

The Road Ahead

Overall, creating a perfect, error-proof chatbot is a work in progress. The delicate balance of guardrails needed makes it a challenge to filter out all of the toxic information and keep all of the good. Here at Vijil, our sense of purpose stems from these problems: we love picking apart issues until we get to the core and rebuilding from first principles. On a personal level, I hope to explore the institutional choices and values built into AI agents. Who decides how to control agent behavior and what to let slide? Are there better mechanisms to build LLMs using only credible sources? Could we even build AI agents that can synthesize the full range of human behavior—digesting all the good and the bad from the Internet—and turn out to be wiser and kinder than any one human? I want to believe that we can build AI technologies that bring out the best in humanity. I am excited about digging deeper into the human alignment of AI agents. Do you think we can build chatbots that turn out to behave better than we do?

© 2025 Vijil. All rights reserved.
