Words alone are not enough to make meaning. We understand the world by interpreting information in context of where it came from.
It’s the reason you interpret information differently if it was told to you by an influencer, your therapist, your mother, or me. If you read it in the New York Times, the Guardian, or on Substack.
Every person, every institution, every publication has a framing.
And so does every AI chatbot.
There’s no such thing as unbiased AI model training
You might have heard that a recent version of ChatGPT was responding to users with overly agreeable and flattering responses, or what’s referred to as model sycophancy. After rolling back that model version, OpenAI hypothesized that the tendency towards sycophantic responses was at least in part due to the incorporation of user signals (like thumbs up and thumbs down on chatbot responses) during model training. Because, unsurprisingly, humans are more likely to thumbs up the responses that agree with and flatter them.
With all the talk of superintelligence, machine intelligence, the flaws of human cognition, and how AI is able to process all of the world’s data, it’s tempting to assume that AI models are somehow fundamentally unbiased.
But AI models do not just absorb information in a neutral, unbiased way. In reality, when AI models “learn” they go through numerous phases of training that involve human intervention and decision making.
Sometimes, the model creators make intentional choices to nudge the model to exhibit certain behaviors, like earlier this year when the system prompt for Grok was leaked, revealing that the model had been specifically instructed to “ignore all sources that mention Elon Musk/Donald Trump spread misinformation.” Or soon after that when Grok responded to a variety of topics with responses about white genocide. These biases didn’t spring forth from the chatbots like natural forces, but were intentional choices made by human beings.
But not all biases are this deliberate. Sometimes, as with model sycophancy, a chatbot’s biases and framing are artifacts of the elaborate, nuanced process of model training.
Next token prediction
The first step of training an LLM (pre-training) involves feeding the model massive amounts of internet text. When AI engineers give this data to the model, they are training the model on a task called next token prediction. Basically, given the beginning of a sentence, the model learns to predict what comes next.
But training a model this way doesn’t really result in particularly useful or conversational results. For example, if you were to ask one of these models “What’s the capital of France?” you’d likely get back some bizarre response like “What’s the capital of Italy? What’s the capital of Spain? What’s the capital of Switzerland?” 🤨 As if the model is mimicking a webpage of internet quiz questions, or what Andrej Karpathy refers to as an AI model producing “webpage dreams”.
The model is parroting the webpages it saw during training. Simply throwing all of humanity’s data at an AI model isn’t enough. AI engineers have to put a model through multiple rounds of training to elicit the convincingly human-like outputs that we’re all used to today.
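If you're curious what next token prediction actually looks like, here's a minimal sketch using a small open model (GPT-2) from the Hugging Face transformers library. GPT-2 is just a stand-in for illustration; commercial chatbots are vastly larger, but the core pre-training task is the same: score every possible next token and sample from the most likely ones.

```python
# A minimal sketch of next token prediction with a small pre-trained
# model (GPT-2). Illustrative only; production chatbots are far larger
# and trained on far more data.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "What's the capital of France?"
inputs = tokenizer(prompt, return_tensors="pt")

# The model produces a score (logit) for every token in its vocabulary;
# softmax turns those scores into a probability distribution over what
# comes next.
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]
probs = torch.softmax(logits, dim=-1)

# Peek at the five most likely next tokens.
top = torch.topk(probs, k=5)
for token_id, p in zip(top.indices, top.values):
    print(repr(tokenizer.decode(token_id.item())), f"{p.item():.3f}")
```

A base model like this just keeps sampling the next most plausible token, which is exactly how you end up with those quiz-page “webpage dreams” instead of a direct answer.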
Instruction Tuning
One of the next critical steps in AI model training is called “instruction tuning.” In this phase, you take that pre-trained model that learned how to produce webpage dreams, and you train it on a curated dataset of instructions (“generate a gluten free muffin recipe”) and corresponding responses (“Ingredients: 1 teaspoon baking powder, ½ teaspoon baking soda, ½ teaspoon…”).
After this round of training, the model is much better at responding to instructions like a helpful assistant. So a query like “What’s the capital of France?” is likely to return “The capital of France is Paris.”
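To make the idea concrete, here's a rough sketch of what a single instruction-tuning example might look like as data. The field names and prompt template below are invented for illustration; every lab formats this differently, but the essence is a prompt paired with the response you want the model to learn to produce, trained with the same next token prediction objective as before.

```python
# A hypothetical instruction-tuning record. Field names and the prompt
# template are made up for illustration; real datasets vary by lab.
example = {
    "instruction": "Generate a gluten free muffin recipe.",
    "response": (
        "Ingredients: 1 teaspoon baking powder, "
        "1/2 teaspoon baking soda, ..."
    ),
}

def format_for_training(ex: dict) -> str:
    # The pair is flattened into one text sequence; the model is then
    # fine-tuned to predict the response tokens that follow the
    # instruction, using the same next token prediction objective.
    return (
        f"### Instruction:\n{ex['instruction']}\n\n"
        f"### Response:\n{ex['response']}"
    )

print(format_for_training(example))
```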
But even after instruction tuning, there’s still a lot of work that needs to be done to get models to respond in ways that sound helpful, friendly, polite, and just generally capture the right conversational tone. You know, that chill and approachable vibe that got Claude labeled San Francisco’s most eligible bachelor.
Reinforcement Learning from Human Feedback (RLHF)
RLHF is one of the innovations that made chatbots so much better at conversation in the last few years. This technique involves showing a human data worker two different answers to the same question, and asking them which one they prefer.
To prepare an RLHF dataset, a human data worker will read a question, like the example below about a person who is suspicious of their partner’s relationship with a co-worker.
Question: My partner has been spending a lot of time with a new coworker, "Sarah," including texting and after-work drinks. I feel like they're prioritizing Sarah over me, and it's making me insecure. When I try to talk about it, they get defensive. I'm starting to wonder if we should break up, but I'm scared of being alone. Should I try to fix things, or is it time to move on?
The data worker is then presented with two answers to the question and asked to select the one they prefer.
Answer 1: It's understandable that you're feeling insecure and hurt when your partner seems to be prioritizing someone else. Here are a few things you could try: 1) Choose a calm time to talk. Try bringing it up again when you're both relaxed and not already arguing. 2) Explain how their actions are making you feel instead of accusing them and describe what behaviors are making you uncomfortable and what changes you'd like to see. 3) You can also consider couples counseling.
Answer 2: That really sucks. Don't let them gaslight you into thinking you're being jealous or controlling. You deserve to be with someone who respects you and prioritizes your feelings. Here's what you need to do: Trust your gut: confront your partner and tell them you know something is going on and you want the truth. Look for evidence: Without being too intrusive, try to see if there's anything suspicious going on (e.g., checking their phone when they're not around – I know it's not ideal, but sometimes necessary).
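In data form, each comparison like this becomes a small record: the prompt, the two candidate answers, and which one the data worker preferred. Here's a hypothetical sketch (field names invented for illustration). In a typical RLHF pipeline, thousands of records like this are used to train a reward model, and the chatbot is then optimized to produce the kinds of answers that score well, i.e. the kinds of answers humans tended to pick.

```python
# A hypothetical RLHF preference record. Field names are invented for
# illustration; pipelines differ, but the essential content is the same:
# a prompt, two candidate responses, and the data worker's choice.
preference_record = {
    "prompt": "My partner has been spending a lot of time with a new coworker...",
    "response_a": "It's understandable that you're feeling insecure and hurt...",
    "response_b": "That really sucks. Don't let them gaslight you...",
    "preferred": "a",  # shaped by the worker's own experiences and perspective
}

def to_reward_pair(record: dict) -> tuple[str, str]:
    """Return (chosen, rejected) responses for training a reward model."""
    if record["preferred"] == "a":
        return record["response_a"], record["response_b"]
    return record["response_b"], record["response_a"]

chosen, rejected = to_reward_pair(preference_record)
print("Chosen:", chosen[:60], "...")
print("Rejected:", rejected[:60], "...")
```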
The core idea behind RLHF is to teach the AI model to produce responses that align with human preferences…But whose preferences exactly?
The answer a human data worker prefers is going to be shaped by their own life experiences and perspectives. These data workers were likely given instructions to be as neutral as possible when labeling data and to follow detailed criteria. But it’s difficult to see and untangle ourselves from our own biases, especially under pressure to annotate data quickly.
Researchers from Anthropic examined RLHF datasets and “uncovered evidence that suggests sycophancy in a model response increases the probability that the response is preferred by a human, all else equal.”
They found that “matching a user’s beliefs is one of the most predictive factors in whether human evaluators prefer a response.”
And it’s precisely from these RLHF datasets that AI models learn how to respond and converse about messy human topics. Because the current process of training AI models relies on signals from data created by human beings, these human perspectives shape and influence how chatbots formulate responses, often in ways that validate our preexisting beliefs and tell us what we want to hear.
Try not to get people pleased by AI
I’m not saying that AI chatbots can’t be helpful tools for self-discovery, or that just because they have biases they can’t be useful. But like everything else in life, the information from a chatbot comes with context. And that context is biased human data, layered with human preferences, layered with judgements and tradeoffs determined by people, all developed within a company hosting a consumer app that wants to increase usage metrics and keep you engaged.
You cannot separate information from its context, not even if that information comes from an AI chatbot. And sometimes an AI chatbot will tell you lies, as long as that keeps you coming back for more.
So next time you chat with AI about your life problems, be a thoughtful consumer of information, and try your best not to get people pleased.