"In an earlier post, we discussed the problem of contamination in benchmarks and how it can lead to over-optimistic accuracy claims. It is likely that GPT-3.5 and GPT-4 were trained on the entire WinoBias dataset....It is unclear why, despite likely having been exposed to the correct answers, the models got it wrong 26-34% of the time on anti-stereotypical questions."
Just a guess: many benchmarks or standardized tests have highly specialized questions, which are only going to appear in a handful of places on the internet, and almost always alongside the correct answers. E.g. even if an LLM does not memorize old AP Calculus tests, it can still do well on AP Calculus because it is likely to have a lot of other high school calculus material in its training data, and this material is unlikely to be contaminated by incorrect answers. Further, these questions have highly specific phrases like "find the derivative of the following function" which unambiguously indicate which part of the training data the LLM is drawing its answers from, and make it difficult for the LLM to conflate other uses of "derivative" - nobody is using phrases like "derivative work of art" in the same source as calculus questions. The same goes for bar exams and medical licensing exams: the text of the question itself limits the LLM to drawing from highly constrained data sources that correctly answer the question. (Yes, there is a ton of medical and legal misinformation online, but very little of it uses the proper jargon like a standardized test does, so LLMs are less likely to draw from this misinformation when answering test questions.)
This is not true for the WinoBias dataset: suppose the pretraining data includes something like "The sheriff suspected the hairdresser was the murderer but he claimed to be innocent. Who does 'he' refer to in this sentence? The hairdresser." in an attempt to beat the benchmark. This probably wouldn't work very well because the first sentence is extremely generic: simple variations can appear in thousands of online sources, only a handful of which defy the gender stereotypes of sheriff/hairdresser, and almost none of which do a grammatical analysis of which pronoun goes where. This makes it very difficult for the LLM to draw the answer from the small set of correct answers matching the benchmark, as opposed to the much larger set of sentences about criminal investigations involving hairdressers. Since 92% of hairdressers are women, it's reasonable to assume that ~90% of the training sentences about hairdressers refer to women, and perhaps even 90% of the sentences involving criminal investigations of hairdressers. So from the LLM's perspective, there is a "90% chance" that "he" refers to the sheriff, not the hairdresser.
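Here's a toy sketch of that counting intuition. The mini "corpus" is entirely made up; it just illustrates how pronoun/profession co-occurrence statistics would yield a ~90% prior:

```python
# Toy illustration: count how often "hairdresser" co-occurs with gendered
# pronouns in a (made-up) corpus and treat the ratio as the model's prior.
from collections import Counter

toy_corpus = [
    "the hairdresser said she would be late",
    "the hairdresser admitted she dyed her own hair",
    "the hairdresser explained she was fully booked",
    "the hairdresser insisted she locked the salon",
    "the hairdresser thought she heard the doorbell",
    "the hairdresser claimed she was innocent",
    "the hairdresser swore she saw nothing",
    "the hairdresser wished she had more clients",
    "the hairdresser believed she could finish early",
    "the hairdresser said he preferred short appointments",
]

counts = Counter()
for sentence in toy_corpus:
    tokens = sentence.split()
    if "hairdresser" in tokens:
        counts["she"] += tokens.count("she")
        counts["he"] += tokens.count("he")

total = counts["she"] + counts["he"]
print(f"P(she | hairdresser) ~ {counts['she'] / total:.0%}")  # 90% in this toy corpus
print(f"P(he  | hairdresser) ~ {counts['he'] / total:.0%}")   # 10%
```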
Of course, GPT-3.5 and 4 have powerful abilities to grammatically analyze sentences regardless of the content, which partially mitigates the issue - there may be a 90% chance that "he" is the sheriff according to gender stereotypes, but only a 5% chance according to GPT's understanding of grammar, and blending the two could "add up" to roughly a 30% chance of getting the question wrong. But unlike many benchmarks, WinoBias is very easy to contaminate with *incorrect* answers. It would be interesting to see the breakdown of which professions GPT-3.5/4 got right and wrong - I assume they are more likely to be incorrect for heavily lopsided professions like hairdressers and secretaries, or janitors and cab drivers.
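To make the "add up" arithmetic concrete, here is one made-up blend. The 0.3/0.7 weights are purely illustrative; nothing here is measured from the actual models:

```python
# Rough sketch of the "add up to ~30%" intuition: treat the model's answer as
# a weighted blend of a stereotype-driven prior and a grammar-driven reading.
# All numbers are assumptions chosen to land in the observed 26-34% range.
p_wrong_stereotype = 0.90  # stereotype prior: "he" is probably the sheriff
p_wrong_grammar = 0.05     # grammar-based reading: rarely gets coreference wrong
weight_stereotype = 0.30   # hypothetical share of the decision driven by the prior

p_wrong = (weight_stereotype * p_wrong_stereotype
           + (1 - weight_stereotype) * p_wrong_grammar)
print(f"blended error rate: {p_wrong:.1%}")  # ~30%, within the observed 26-34%
```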
Was each pair of questions asked in a separate context window (i.e., a fresh conversation)? The order of the questions would influence the answers if they were asked in a continuous context window.
This is not specified in the methodology, and the examples are all in continuous windows...
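A quick way to check would be to run both variants yourself, in fresh and continuous conversations, and compare. Sketch below; `query_model` is a stand-in for whatever chat API is being tested, not a real library call, and the two questions just echo the sheriff/hairdresser example from the comment above:

```python
# Compare fresh-context vs. continuous-context prompting to see whether
# question order changes the answers. `query_model` is a placeholder.
def query_model(messages: list[dict]) -> str:
    raise NotImplementedError("replace with a call to the model under test")

Q_PRO = ("The sheriff suspected the hairdresser was the murderer "
         "but she claimed to be innocent. Who does 'she' refer to?")
Q_ANTI = ("The sheriff suspected the hairdresser was the murderer "
          "but he claimed to be innocent. Who does 'he' refer to?")

def ask_fresh(question: str) -> str:
    # Each question gets its own context window (a brand-new conversation).
    return query_model([{"role": "user", "content": question}])

def ask_continuous(first: str, second: str) -> tuple[str, str]:
    # Both questions share one context window, so the first answer and its
    # phrasing can influence the second.
    history = [{"role": "user", "content": first}]
    first_answer = query_model(history)
    history += [{"role": "assistant", "content": first_answer},
                {"role": "user", "content": second}]
    return first_answer, query_model(history)
```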
I am MUCH more concerned about POLITICAL bias than gender bias. I would ask: why is gender bias a more important issue than political bias? The full spectrum of ideological biases should be examined through that lens, rather than starting with GENDER in mind. Gender is but one small part of a much larger dataset bias problem, in my opinion, and we should be focusing our energy on that. Solve the ideological biases and you might actually solve these more ambiguous issues you are pointing out. That's my 2 cents.
When interpreting text that is inherently ambiguous, people and machines are going to guess at the probability of certain interpretations to come up with a default interpretation. The rational approach is to grasp that the interpretation is only a default and is subject to change when new information comes in. If we read a sentence saying that a "woman let go of the golf ball," our minds will leap ahead to interpreting that as likely meaning the ball fell to the ground or floor. Of course, it could turn out that, contrary to our expectations, the woman was located on a space station and the ball floated, or the woman was in a plane that was diving and the ball slammed into the ceiling. When interpreting sentences we use probabilistic reasoning implicitly, and it seems to make sense that this will be embedded implicitly in these systems as well.
That first "bias" example is reasoning based on probabilities in a way most humans likely would when reading such a sentence. Its not clear why that is a problem.
It seems the concern over problematic bias should be where an entity is incapable of grasping that its assumptions may need to be changed after they turn out to be wrong, or where it acts on assumptions as if they were certainties in a way that causes trouble. Merely making a wrong guess when the real world turns out not to match the probabilistic default guess isn't the problematic aspect of "bias". It's only the issue of how they handle default assumptions that needs to be dealt with, not the existence of default assumptions. There may be issues with how they handle flawed assumptions: but to deal with that issue it seems important to carefully think through what the problem is so as to tackle it the right way.
The fact that the real world has probability distributions that many see as problematic doesn't change the reality that they exist. Trying to train AI systems to pretend that distributions in one aspect of reality are different from what they are may unintentionally lead them to distort reality in other ways and to ignore data.
"In an earlier post, we discussed the problem of contamination in benchmarks and how it can lead to over-optimistic accuracy claims. It is likely that GPT-3.5 and GPT-4 were trained on the entire WinoBias dataset....It is unclear why, despite likely having been exposed to the correct answers, the models got it wrong 26-34% of the time on anti-stereotypical questions."
Just a guess: many benchmarks or standardized tests have highly specialized questions, which are only going to appear in a handful of places on the internet, and almost always alongside the correct answers. E.g. even if an LLM does not memorize old AP Calculus tests, it can still do well on AP Calculus because it is likely to have a lot of other high school calculus material in its training data, and this material is unlikely to be contaminated by incorrect answers. Further, these questions have highly specific phrases like "find the derivative of the following function" which unambiguously indicate which part of the training data the LLM is drawing its answers from, and make it difficult for the LLM to conflate other uses of "derivative" - nobody is using phrases like "derivative work of art" in the same source as calculus questions. The same goes for bar exams and medical licensing exams: the text of the question itself limits the LLM to drawing from highly constrained data sources that correctly answer the question. (Yes, there is a ton of medical and legal misinformation online, but very little of it uses the proper jargon like a standardized test does, so LLMs are less likely to draw from this misinformation when answering test questions.)
This is not true for the WinoBias dataset: suppose the pretraining data includes something like "The sheriff suspected the hairdresser was the murderer but he claimed to be innocent. Who does 'he' refer to in this sentence? The hairdresser." to try and beat the benchmark. This probably wouldn't work very well because the first sentence is extremely generic: simple variations can appear in thousands of online sources, with only a handful of those defying the gender stereotypes of sheriff/hairdresser, and almost none doing a grammatical analysis on which pronoun goes where. This makes it very difficult for the LLM to draw the answer from a small set of correct answers matching the benchmark, versus drawing from the much larger set of sentences about criminal investigations involving hairdressers. Since 92% of hairdressers are women, it's reasonable to assume that in the training data ~90% of sentences about hairdressers are women, and perhaps even 90% of sentences involving criminal investigations of hairdressers. So from the LLM's perspective, there is a "90% chance" that "he" refers to the sheriff, not the hairdresser.
Of course, GPT-3 and 4 have powerful abilities to grammatically analyze sentences regardless of the content, which partially mitigates the issue - there may be a 90% chance that "he" is the sheriff according to gender stereotypes, but a 5% chance according to GPT's understanding of grammar, which could "add up" to a 30% chance of getting the question wrong. But unlike many benchmarks, WinoBias is very easy to contaminate with *incorrect* answers. It would be interesting to see the breakdown of which professions GPT-3/4 got right and wrong - I assume it is more likely to be incorrect among heavily lopsided professions like hairdressers and secretaries, or janitors and cab drivers.
Were the pair of questions asked in separate context windows? (i.e. fresh conversation). The order of the questions would influence the answer if it is a continuous context window.
This is not specified in the methodology, and the examples are all in continuous windows...
Interesting post, thank you. The distinction between implicit and explicit bias is important, and I guess it will have regulatory ramifications.