Speakers: Max Niederman (head of quality, host), Stephen Yang (senior software engineer), Ege Erdil (CTO).
Max [00:00:00]
I’m Max Niederman, head of quality at Mechanize, and I’m here today with Stephen Yang, a senior software engineer here, and Ege Erdil, our CTO, to talk about evals. I want to start off with a very basic question, which is just: what is an eval exactly? And how does it relate to benchmarks and to RL environments?
Stephen [00:00:18]
An eval, as well as an RL environment and benchmarks, are just various ways to quantitatively measure the performance of an LM at a task. And this is very important whether you’re a user of a model trying to decide what model to use for a specific use case, or a researcher at a lab trying to improve your models at a task. The way that an eval works is that it consists of multiple test cases, and each of these test cases has three components: a prompt, which is the task that the model needs to complete; an environment that the model is situated in; and a grader that takes the work that the model completes and assigns a quantitative score to it. This is the same three components as what makes up an RL environment. So the difference is largely in how it is used. An RL environment is used during training with various RL methods like GRPO, for example, and that’s directly used for updating the weights of the model. Whereas an eval is used by researchers at a lab, for example, to pick what RL method to use or to tune the hyperparameters. This is one of the most fundamental differences between the two: how they’re used. And this has a lot of cascading effects as well. For example, one is the price that you might be willing to pay for an eval versus an RL environment. Because RL environments are run during training, you need much more of them, because the RL method is going to be much more sample inefficient than a researcher. And because you need so many more of them, you end up wanting to buy cheaper RL environments and just buying a very large quantity of them. But on the other hand, human researchers, their time is very scarce, and they’re also relatively sample efficient learners. So at the stage where the human researcher is involved, you want higher quality test cases with less noise in them. And so labs are willing to invest more in evals than in RL environments on a per test case basis, for example.
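As a rough illustration of the three components Stephen describes, a test case might be sketched as below. The names `EvalCase`, `run_eval`, and `toy_model` are hypothetical and the dict-based environment is a toy stand-in, not any lab's actual interface.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable


@dataclass
class EvalCase:
    prompt: str                      # the task the model needs to complete
    environment: dict                # toy stand-in for the state the model is situated in
    grader: Callable[[str], float]   # maps the model's work to a quantitative score


def run_eval(model: Callable[[str, dict], str], cases: list[EvalCase]) -> float:
    """Average grader score across all test cases."""
    return mean(case.grader(model(case.prompt, case.environment)) for case in cases)


# Hypothetical toy cases: the "work" is just reporting a configured limit.
cases = [
    EvalCase("Report the maximum password length.", {"max_len": 16},
             lambda out: 1.0 if out == "16" else 0.0),
    EvalCase("Report the maximum password length.", {"max_len": 8},
             lambda out: 1.0 if out == "8" else 0.0),
]


def toy_model(prompt: str, environment: dict) -> str:
    # Stand-in for a real model rollout.
    return str(environment["max_len"])


score = run_eval(toy_model, cases)
```

The same structure could serve as an RL environment; in this sketch the only difference would be whether `score` feeds a weight update or a researcher's decision.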
Ege [00:02:05]
I think that’s not the only reason why you might want to spend more money on evals. The other difference is that in a typical RL run, a single task usually will be used maybe a couple of times. You don’t want to reuse the same task too many times in RL, because if there’s not enough diversity, exactly because of the sample efficiency being poor, you won’t get enough generalization. So you really care about diversity. But that means that the amount of compute that you spend on inferencing any single task is kind of small. While if you have an eval suite that you really trust, then you’re probably going to be running it very frequently, much more frequently than you will be running a single task in an RL environment. So you’re spending way more compute on this task, which also means that the quality of the task is kind of complementary to that, which means you should also be willing to spend more money on it. It’s the same kind of thing where if you’re buying an expensive car, then you don’t want it to have very cheap tires, right? If you’re buying a very cheap car, that might make sense because the rest of the car is also very cheap, and so saving on costs by making the tires very cheap makes sense. But if you’re buying something expensive, which is the entire bundle of you take the eval and then it produces a score, then trying to save money by making the data part really cheap, when you’re going to spend a very large amount of money on the compute part, doesn’t really make a lot of sense.
Stephen [00:03:31]
And this is something that might be surprising to a lot of people. Because when people think about evals, they might think about public benchmarks like SWE-bench Verified, for example, and they might only see a new public announcement about a score once a month. So could you talk more about why labs need to run evals so many times?
Ege [00:03:47]
Yeah. So even for public evals, it’s not a matter of whether an eval is public or private, but rather whether the model it’s being run on is public or private. Labs do very few model releases compared to how many different checkpoints of a model they have. Because the way a model gets built is that you start from some initial state, some initial model. Maybe it’s a base model, maybe it’s a model you’ve already post-trained in some way and you’re doing further post-training on top of it. But it just gradually changes over time. People come up with many different ways of training the model. Those are applied maybe in parallel. Maybe there are merge steps where the results of those training steps get merged and somehow you merge the weights. And over time, every day, you have a slightly different checkpoint of the model. And every day people are doing different things and changing things around. And then you want to make sure that those changes people are making are not making the model worse and are indeed making it better, so you’re sort of gradually getting better. And once you reach a performance threshold that you’re satisfied with, it makes sense to spend a lot of resources on optimizing the inference and rolling this out to your inference clusters, because there is a cost to doing that. You can’t serve too many different checkpoints at once because you want the batch sizes to be large. There are other overheads with serving like 200 different models. Labs usually want to have, for various reasons, a small number of models that they release. This means the number of scores that a lab releases is going to be way less than the number of scores they measure internally, because for every model they release, they’re going to have like hundreds of checkpoints internally that they have never released. But they still need the scores on those checkpoints, because otherwise, if they are experimenting, say, with a new training method, they might go totally off the rails. 
They might think it’s working, but if they have no measurement of whether it’s working well or not, they might just go totally off the rails, for example. And that’s the reason why evals are run so many more times internally than results will be announced on them externally.
Max [00:05:56]
Yeah. So as Stephen alluded to, within labs there’s kind of this two-level optimization structure where researchers are trying to optimize the model at the high level, and then they use this lower level inner optimizer that’s like an RL algorithm that automatically updates a huge number of weights. And the outer optimization loop runs much less frequently. So you have maybe on the order of like hundreds of samples from an eval, whereas you have many, many orders of magnitude more for the RL environment. And then you also have users as maybe even a third tier. They’re trying to optimize over which lab’s model they choose, but they really aren’t going to spend that much time on this. They’re going to select from maybe a few different, maybe six or ten different models, maybe at the very high end. And so they really do not need to run their benchmarks very many times. And in practice, they just look at the benchmarks that either the labs or some other organization publishes.
Ege [00:07:03]
Yeah, I think the point about evals being run at a much lower frequency than RL environments might not be true. Because if you think about the wall clock time you spend, you’re running many, many tasks in parallel when you’re doing RL, but you’re not necessarily running many more tasks in sequence, because each rollout takes the same amount of time. So if you’re interpreting frequency literally, I don’t think that’s true.
Max [00:07:31]
Yeah. So I guess it’s more the throughput of samples rather than the number of times that any given task will be sampled from. I mean, actually, is that true? Because during an RL run, any given task will be run at a much higher parallelism than during evals. So if you’re thinking how many times does a task run, I don’t know if that’s true.
Ege [00:07:56]
Because when you do an eval, you often want to get the-- it depends on how large your eval suite is. But even if it’s hundreds of tasks, usually you’ll want to run the same task many dozens of times to get a low amount of noise. Otherwise, in an incremental checkpoint, it’s very difficult to tell if there was actually an improvement or not. If the model quality difference is very large, then the signal to noise ratio is high enough, even with a small number of samples. But usually with incremental checkpoints, where you only perturb it a very small amount, that’s not going to be true. So I think you’re still going to want to do a lot of parallel samples, even in eval.
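To make the noise argument concrete, here is a minimal sketch of how the standard error of an average pass rate shrinks with repeated samples. It assumes every rollout is an independent pass/fail trial with the same success probability, which is a simplification; the function name and the numbers are illustrative only.

```python
import math


def pass_rate_standard_error(pass_rate: float, num_tasks: int, samples_per_task: int) -> float:
    """Standard error of the mean pass rate, treating each rollout as an
    independent Bernoulli trial (a simplifying assumption)."""
    total_rollouts = num_tasks * samples_per_task
    return math.sqrt(pass_rate * (1.0 - pass_rate) / total_rollouts)


# Hypothetical suite: 200 tasks, model passes about half of them.
for samples in (1, 10, 50):
    se = pass_rate_standard_error(0.5, 200, samples)
    print(f"{samples:>3} samples per task: +/- {100 * se:.2f} points")
```

Under these assumptions, one sample per task gives a standard error of about 3.5 percentage points, while 50 samples per task brings it down to about 0.5 points, which is the scale at which a small change between incremental checkpoints starts to be distinguishable.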
Max [00:08:38]
Yeah. I guess what’s definitely true is that you will spend more compute across all of your tasks during an RL run than during evals. Across all of your tasks, I think that’s true.
Ege [00:08:45]
Across any one task, that’s not true. Because the same one task in an eval will probably be used way more frequently than during an RL run.
Max [00:08:57]
Yeah. I do think it’s valuable to zoom out a little bit, though, and just reiterate this point that all three of these things are very similar. I think this is kind of underappreciated. But really, it’s the same problem to make a good eval, to make a good benchmark, to make a good RL environment, in a bunch of ways. Actually, I want to hear you guys talk about how exactly you go about this problem. Say you want to measure a very concrete ability, a very useful thing, like the ability of a model to do software engineering. This is a very broad thing. How do you go about creating an eval, creating an RL environment, creating these tasks to measure that ability?
Ege [00:09:41]
I think the best way to start is just: what do you try to use the model to do? So if the model can’t do something at all, that’s usually not a good target for an eval, because then the model will just get a score of zero. For example, we could do a benchmark on whether a GPT-5.5 can pick up this microphone, and it will get a zero on that benchmark. It won’t be able to do that. So this is kind of a trivial point, but it does mean that there’s this very easy benchmark you could technically create, but that will be kind of useless. So you want to pick something that is at the edge of what the LLM can do, or maybe it can’t quite do it, but it’s not too far off. It’s at least the right modalities to attempt the problem or something. And given that, you can just try to use the model to do various things, and you can see that it doesn’t work. And then you have to think about how would I measure that in an automatic way? If someone gave me an attempt at completing a task that I would ask a model or a human to do, and they sent me an output, how would I measure the quality of that output? Because the thing you’re really trying to do is to get away from “oh, I think this output looks good” as opposed to “looks bad,” which involves a lot of human judgment and is very subjective, to something that is more objective. So one very easy way to do this is to give the model a list of very specific requirements its output has to meet, and then you can grade it automatically, if the requirements are specific enough, on whether its final output actually met those requirements. This is how most early environments for software engineering used to be constructed, like a year ago, because models were not yet good enough that they could even execute these very specific, concrete instructions in a correct way. And that was very easy, because you could just literally, in many cases, even scrape data of public repositories and use their own test cases. 
And then even if you told the model, okay, this is exactly what you’re supposed to do, and the test case is just testing something that’s very much in distribution for what the model is told to do, it’s not testing some weird edge case that the model has to be creative and extrapolate the given prompt to think about or something. And that would work. But now models are good enough that that doesn’t work anymore. So this is something that worked a year ago, but doesn’t really work anymore. Now, I think you often need to come up with tasks where the model has something else it needs to do other than simply implement a simple sequence of instructions. For example, if you ask an engineer to build a feature in an app, you usually won’t give them a very detailed list of requirements.
Max [00:12:36]
I think it might be helpful to think about a concrete example of what a simple list of instructions means. Because would a simple list of instructions be like “build a GBA”?
Ege [00:12:43]
That would not be a simple instruction. A simple instruction would be like, go and change the input length validation on this password text box from eight characters to 16 characters, from a maximum of eight characters to a maximum of 16 characters. That would be very simple.
Max [00:13:06]
And what makes that simple as opposed to what I said? Because they’re similar length as a sentence.
Ege [00:13:12]
Well, they’re very different lengths in terms of the length of the code that you need to write to do it. That’s one thing that makes one of them simpler. The other thing is that one of them does not really have any weird edge cases you need to think about. The way you build a complicated piece of software is that there’s going to be lots of weird interactions that people can have with it. In the case of a GBA emulator, for example, you can have any number of games that you want to play on it. And those games might go very deep and they might have tons of interactions that you need to think about. The system itself has many components because it has audio, it has video output, the video output has colors. There are problems that you need to solve on the rendering side, on the side of making sure the game has the right logic and making sure that it doesn’t crash randomly and so on. While for a simple task, like changing the maximum length allowed on a password text box, you’re literally changing one number in the code. And it’s very easy to do that. And it’s very easy to tell if you’ve succeeded. Because if you just go to the page and try a couple of different kinds of inputs, and you try five or six inputs and they work, it’s very unlikely that some weird corner case of this feature was missed, and then in practice a user will complain that it doesn’t work. While if you’re working with a really complicated piece of software, often getting the basic functionality is not that hard. But then if you want to make it work in every single conceivable situation, then that’s really hard. Because you have to think about every single possible edge case, you have to think about how people will use the software. And that’s really hard.
Max [00:15:06]
Yeah. And I think it’s not just about the size of the code change either, or the size of the software that’s being written. It’s also kind of about how clear and objective it is. For example, if you ask Claude to write a good novel API contract, even if the API contract is pretty small to write out as function signatures or whatever, it’s very bad at this task.
Ege [00:15:31]
Yeah, I think it’s bad at this task because it can’t really imagine what people will use the API contract for. And so it’s just coming up with things that are similar to what it has seen in its training data. When you’re coming up with an API contract for a service that you use, you usually know what your needs are. If you’re a consumer of this service, you’re like, I really need this to be structured in this way, because otherwise it’s going to be useless for me. But when Claude is doing that, or GPT is doing that, it’s not doing it from that point of view. It’s doing it from the point of view of what would be a good answer to this question in some general way. And it’s just drawing on knowledge from its pretraining prior, what it’s seen on the internet basically. And that’s a very different way of solving this problem. I think the other thing that makes this complicated is even when the output is very objectively verifiable, if there are just too many weird edge cases, then it can make it hard. For example, the way a model would today go about this kind of task, if you give it a verifiable task it needs to accomplish, is it might write a set of tests, which initially, of course, will fail. And then it might try to iterate on the feature until it gets its own tests to pass. And that’s a very normal flow. That’s also what humans do sometimes. But the issue is that then the feature is only as good as your tests were anticipating. And if you didn’t think about a bunch of realistic ways in which this feature will get used, and users are just very high entropy in the real world, a set of real users are going to do things that you never imagined that they would do. For example, if you’re building a password validator, they might just paste in some characters that your system can’t recognize. Or they might paste in a huge input because they forgot that their clipboard had some input in it, and then they just missed that. 
For example, they try to paste in a huge document into your password text box, and suddenly the website hangs because it can’t handle the amount of input that is being put in. Things like this that a model is just not going to think about, because it seems like a weird thing to think about. Who would paste in 100,000 characters, or a million characters, into a password text box? That could easily happen to a real user. And then it’s kind of bad if what happens in that scenario is that the website crashes or hangs. So this is a very good example of a failure mode where a model would never think of this unless this was a problem encountered by humans and solved in some standard way that was very widely used across similar login forms. Because then the model will know the solution from its pretraining prior, even if it can’t do this explicit reasoning of, what happens if a user does that weird thing, I should probably address that. Even if it can’t think of that, some other humans will have thought of it for it, and then it will just draw on their solution. So LLMs are best when they’re able to do that. And they’re pretty bad when they’re working with a novel system where it has some novel failure mode, and they have to do this thinking on their own.
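The pasted-document scenario can be sketched as a validator that checks size before doing anything else. The limits, messages, and function name here are hypothetical, and treating non-printable characters as "unrecognized" is an assumption made for illustration.

```python
MAX_PASSWORD_LENGTH = 16    # hypothetical limit, echoing the earlier example
MAX_ACCEPTED_INPUT = 1024   # hypothetical hard cap, checked before any other work


def validate_password(raw: str) -> tuple[bool, str]:
    """Return (is_valid, reason) for a candidate password."""
    # Reject huge pastes first so later checks never scan a million characters.
    if len(raw) > MAX_ACCEPTED_INPUT:
        return False, "input too large"
    if not raw:
        return False, "password is empty"
    if len(raw) > MAX_PASSWORD_LENGTH:
        return False, "password too long"
    # Assumption: control characters and similar are what the system "can't recognize".
    if not raw.isprintable():
        return False, "unsupported characters"
    return True, "ok"


print(validate_password("a" * 1_000_000))
print(validate_password("hunter2"))
```

The point of the ordering is that the cheap length check runs before any per-character work, so an accidental paste is rejected without the rest of the validation ever touching it.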
Max [00:18:59]
Yeah. So the prompt I mentioned earlier of implement a GBA emulator was not random. The reason I bring it up is because Stephen has been working on, and we’ve just recently published, this eval where we have models implement a GBA emulator. And it runs into this exact kind of failure mode that Ege has been talking about, where it’s unable to think about and test the software the way that a user would. Can you talk about how this failure mode shows up in GBA Eval, and also just briefly describe what GBA Eval is?
Stephen [00:19:33]
Yeah. So in GBA Eval, we task models with building from scratch, and with a very long time horizon, 24 hours, a software emulator for the Game Boy Advance, which is one of the retro Nintendo consoles. And in a sense, this is a completely verifiable task. And the reason is that the GBA is a deterministic piece of hardware. And what this means is that there’s no source of entropy in the hardware. So if you take a given set of inputs and a given ROM, it should always produce the exact same sequence of frames. And so it’s verifiable in this sense. But that doesn’t mean that it’s trivial for the model to solve. And the reason is exactly this point that we’ve been talking about, that even though it’s verifiable, there are edge cases and you need to write tests for those edge cases. And models are just really, really bad at actually playing the games on their emulator when they’re testing their work. And so in the environment for GBA Eval, models actually have the ability to observe the behavior of the perfect reference emulator given any set of inputs and a ROM. But they write very simple test cases. Most models don’t even write test cases with any inputs at all. They just let the game boot. Some of the more sophisticated models like Opus 4.7 or GPT-5.5 might write test cases where they load past the start menu by pressing the inputs like A. But no model actually will play the games, which means that a lot of the bugs that you’ll actually encounter when playing the games never surface. And this is why, for example, the emulators that they produce perform very well typically when booting the games, but perform very poorly when actual gameplay starts and the emulator needs to handle real inputs from the user.
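Because the hardware is deterministic, a check against a reference can be sketched as a frame-by-frame comparison. Everything below is a toy stand-in: the `Emulator` signature, the one-byte "frames", and both emulators are hypothetical and are not the GBA Eval harness.

```python
from typing import Callable, Iterable, Optional, Sequence

Frame = bytes  # one rendered frame as raw bytes (representation is assumed)
Emulator = Callable[[bytes, Sequence[str]], Iterable[Frame]]  # (rom, inputs) -> frames


def first_divergent_frame(candidate: Emulator, reference: Emulator,
                          rom: bytes, inputs: Sequence[str]) -> Optional[int]:
    """Index of the first frame where the two emulators disagree, or None."""
    cand = list(candidate(rom, inputs))
    ref = list(reference(rom, inputs))
    for index, (c, r) in enumerate(zip(cand, ref)):
        if c != r:
            return index
    if len(cand) != len(ref):
        return min(len(cand), len(ref))
    return None


def toy_reference(rom: bytes, inputs: Sequence[str]) -> Iterable[Frame]:
    presses = 0
    for button in inputs:          # one input per frame; "-" means nothing pressed
        if button == "A":
            presses += 1
        yield bytes([presses % 256])


def toy_buggy(rom: bytes, inputs: Sequence[str]) -> Iterable[Frame]:
    for _ in inputs:
        yield bytes([0])           # bug: ignores button presses entirely


boot_only = first_divergent_frame(toy_buggy, toy_reference, b"", ["-", "-", "-"])
with_input = first_divergent_frame(toy_buggy, toy_reference, b"", ["-", "A", "-"])
```

In this toy setup the boot-only test reports no divergence, while the test that presses a button exposes the bug, which mirrors the gap between boot tests and real gameplay described above.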
Ege [00:21:08]
Yeah, and this is important because the hardware and the task is complicated enough that it’s not good enough to just be good at the boot to generalize to the case when the gameplay actually starts. For some kind of other hardware, maybe some kind of other software, that might work; the software might be simple enough that if you check its correctness just during the first X frames, then you can be confident that it will be correct the rest of the way. But a lot of real software, and this is also true in this case, the failure modes when you’re trying to implement it look more like a power law. And by that, what I mean is, on any kind of implementation, you can spend a small amount of effort to fix the really obvious failures. But then there’s going to be failures that are very rarely triggered. For example, that specific failure would only occur if you played the game for a thousand frames or whatever. And as you incrementally make progress, for example, let’s say you boot up the game, you play it for a short amount of time, and you notice a bug, you’re like, okay, I noticed a bug, let me just fix that. And then you patch one bug. But then there will be a bunch of other ways in which your emulator differs from the characteristics of either a reference emulator or the actual hardware. And patching all of those ways, there isn’t really a clean way to do it. The actual way you solve this problem in the real world is you get more and more data on what the failure modes are, maybe through bug reports, maybe through, if you’re a developer, you can literally get a lot of data by just playing the games yourself and observing what’s happening. But past some scale, you need to rely on bug reports sourced from a community of people who are using it, because that’s the only way you get enough data. 
And then you will encounter all these failure modes way out in the tails, which occur once every million frames averaged across some suite of games, or once every 10 million frames, or once every billion frames. And then you will incrementally try to reproduce them, write tests for them, patch them. And this is the way, in the real world, this kind of software gets to something like parity, approaches parity with the underlying thing it’s trying to replicate. It’s not a simple task where you just understand the logic of what you’re supposed to do, and then you just perfectly reproduce it based on this conceptual understanding of how this software should work. It’s more like, you can try to do that at the start, but it’s not going to work. And then what you’re going to have to do is incrementally patch these cases and improve your design until you’re no longer observing blatant failures, people are no longer reporting blatant failures.
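A back-of-the-envelope way to see why tail failures survive light testing: if a bug fires independently on each frame with a small probability, the chance of seeing it at least once in a test session follows directly. The independence assumption and the specific rates are illustrative, not measurements.

```python
import math


def chance_of_seeing_bug(rate_per_frame: float, frames_tested: int) -> float:
    """P(bug appears at least once) = 1 - (1 - rate)^frames, assuming the bug
    fires independently on each frame. Uses log1p/expm1 for tiny rates."""
    return -math.expm1(frames_tested * math.log1p(-rate_per_frame))


frames = 1_000_000  # hypothetical amount of testing
for rate in (1e-6, 1e-7, 1e-9):
    print(f"rate {rate:.0e} per frame: {chance_of_seeing_bug(rate, frames):.4f}")
```

Under these assumptions, a million frames of testing would catch a once-per-million-frames bug a bit under two times out of three, and would almost never catch a once-per-billion-frames bug, which is why the tail failures only show up with far more usage data.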
Max [00:23:48]
Yeah. And to be clear, this is not unique to implementing an emulator for the GBA. This kind of thing happens in almost all software development. Can you talk about what the analogs of this are for other kinds of software development?
Ege [00:24:06]
So one good analogy might be with finding security vulnerabilities and exploits in software, which is very similar. So if you’ve never spent any time trying to find vulnerabilities in your software, then probably someone can spend a couple of hours and find something. It’s very likely, especially if you’re doing a lot of custom stuff. And okay, let’s say that you have spent some number of hours and you couldn’t find anything, or you found some obvious things and you fixed them. Now, this doesn’t mean that you fixed all the issues, because in any large piece of software, there is essentially a scaling curve of how much resources you want to spend trying to find vulnerabilities in it, which can literally mean how many human hours someone tries to figure out a way to exploit it. It can also mean how much compute you throw at the problem. But the case I want to focus on here is really just human hours interacting with the software and looking at the code and trying to test different things out. And often the more you want to spend per exploit, the more weird edge cases and things you will find that are constructed in much more elaborate ways that would not, for example, be something you think of when you’re trying to attack the software from the outside. But if you have access to the code, then you can often find really complicated exploit chains if you’re willing to spend millions of dollars or tens of millions of dollars on a single exploit, that you would never think of if you only spent two or three hours testing the software. At the two or three hour time horizon, the software could look good. But at a 20,000 hour time horizon where someone is doing everything they can to break it, then it might not be so good. And the basic thing here is that in a complicated system, the fact that you couldn’t find anything after say 10 hours of testing only really gives you enough confidence that the next 10 or 20 hours of testing won’t find anything. 
It doesn’t give you that much confidence that the next 10,000 hours of testing won’t find anything. And in fact, you should assume they will, because you don’t have that much evidence that they won’t. And in practice, this is more true the more complicated your software is in many ways. It doesn’t have to be just in lines of code, though that’s a very crude measure of complexity. It can also be how many different components it has, how unusual it is compared to what is industry standard, because if you’re just using standard components and you’re connecting them up, then the only place in which there could be exploits is places where you’re connecting them, so the surface is much smaller. But if you’re writing an entire 10,000 line app or whatever from scratch, then almost certainly there will be lots of ways it can be exploited, and there’s no way in which you can trust that. And it’s very similar on the feature side. Why is it that software that is used by hundreds of millions of people worldwide still has tons of bugs? For example, if I try to use a Waymo, which is now being used by a lot of people here, sometimes it’s going to take me to the wrong location. This has happened to me many times. I input an address, and the address it shows me is in the middle, it’s like a 10 or 15 minute walk away from where I want to go, and I don’t understand why that happens. But I think probably the reason it hasn’t been fixed is because they have other priorities and this bug has not been reported by sufficiently many people and they are not prioritizing it. And I think this happens in a lot of software, where there are known bugs that don’t get fixed because they are not being prioritized, because there are other issues that are more important.
Max [00:28:01]
I think it’s worth emphasizing this point that models are kind of working with an extreme lack of any of this user feedback, which is normally crucial to developing software. As you said, if you’re making a GBA emulator, typically that means that you’re going to, first of all, use it yourself for many hours, many more than the 24 hours that is available to models in GBA Eval, and then also you’re typically going to ship it to users and have a very iterative development cycle over the course of months or years, even decades. And this is also true of the least buggy software. For example, cryptography libraries frequently are many decades old, and this is just what you need to do if you want to get software that is actually bug free, or very, very few bugs or whatever.
Stephen [00:28:57]
Yeah, so I think there’s a really important point here when you’re making an eval, which is the trade-off between realism and cost. Because maybe ideally, if you wanted to measure the ability of an LLM to perform the job of a real software engineer, you would have to simulate an entire company around it, the process happening in that company, the communication with their co-workers.
Max [00:29:19]
Yeah, so obviously the model can’t have this very long development cycle where it talks to lots of users or whatever in an eval, and it also can’t do it in real life. Why is that?
Ege [00:29:32]
So these are different things. One of them is, it can’t do it in an eval because it’s difficult to put realistic users in an eval. And I think this is because LLMs have this other problem that they are very collapsed, they don’t take very high entropy actions. So it’s too expensive to put human users in any eval, so you really have to put simulated users, or an LLM is pretending to be a user, and then that doesn’t work very well because if you have hundreds of LLMs and they’re interacting with an app constantly, they’re just going to collapse into these same modes where they will always do the same things. You can give them instructions to do different things, in which case you are the source of entropy, you’re giving the entropy of which different things they should do, and then they can do different things, but only insofar as you told them to try it. Otherwise they will all do the same thing. And it’s very different from human users. Human users will do tons of random stuff, and that will give you way more information about how your software behaves in unusual situations compared to what will happen if you just ask an LLM, please test the software. It will just test it in the most superficial way and then say, oh yeah, it looks good. And for that reason, you can’t really put LLM users in these evals. So why can’t you make an LLM get this feedback in the real world, why is it still bad in the real world? One of the reasons, which maybe you’re trying to get at, is that because you can’t put them in an eval or an RL environment or whatever, it’s very hard to train this capability, because you can’t measure it, and as a result, of course it’s bad, because it hasn’t been trained to be good at that. But I think there is a separate thing, which is that models are not good at understanding what the users want. They’re bad at having a model of a user and thinking about what a user wants. 
They’re good at, you give them a specific piece of feedback like this exact thing is broken, and if it’s something that is reproducible for them, then they can try to reproduce it and fix it as an individual thing. But they’re not going to be good at proactively anticipating that, which is often what you want to do as a developer. If you relied on user feedback for every single thing about your app, then you could just ship something completely broken, and then users might not even like it enough to give feedback, in which case it’s not going to get better. You have to give them something that is at least good enough that you will actually get users and they will actually give you feedback. If it’s just really bad, then they won’t even bother giving feedback, because it will be too bad. And I think any purely LLM-written software is just going to fail that test, because an LLM won’t understand what the problems of actual users are, and so it will fail to solve them. Any software is supposed to solve some kind of problem that a user might have, it’s supposed to be useful for that user, but because LLMs have such poor theory of mind and model of other users, they can’t anticipate what will be useful for other users, and that just breaks this entire feedback loop. If you can’t build something that is useful for some users, then they won’t give you feedback, and then you won’t get this data collection process that enables you to then make the software better so that you can get more users and they give you more feedback. This is often how a lot of real world software is able to get better, but you need some seed to start from. If you just don’t ship anything, then users are not going to give you feedback, and then this won’t get good.
Max [00:33:32]
I think that first signal of whether you get users is probably valuable, and I think that if you did have this in an environment, models would probably be able to learn. We do see that in some very limited circumstances models are pretty good at predicting what users want. For example, models are pretty good at predicting what you’re going to write unit tests about. In fact, could you guys maybe talk about how we see this, especially in public evals, like SWE-bench Verified?
Ege [00:34:06]
Yeah, so first of all, what SWE-bench Verified is and how it’s great and whatever. SWE-bench Verified is a subset of an original benchmark called SWE-bench, which was created by going to, I think, under 10 or 20 open source repositories on GitHub and scraping a bunch of pull requests that were paired with specific issues. So basically there was an issue someone opened, and then there was a pull request that was later opened that was recorded as resolving that particular issue. And then the idea was that if this pull request bundles in some number of tests to validate its own functionality, then you can use those tests to grade the model’s work. The way you would do this is that you would give the model the original issue description as a prompt, then you would give the model the task of resolving that issue, and you would measure how well it resolved that by running the tests in the PR against the model’s solution. So you would take the tests from a real PR, but the solution, the code that actually implements the fix for the issue, would be the model’s own code. You can get many tasks like this because there are a lot of pull requests, even after you filter out issues that are so vague that there’s no way someone resolving the issue could have known about a specific expectation of the tests. Sometimes people just write the tests as regression tests and assume a bunch of details about their own implementation of the feature that someone who was just given the issue description will have no way of knowing, and in that case of course you can’t pass the tests in a fair way.
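As a rough illustration of the grading scheme just described, here is a minimal Python sketch. Everything in it is hypothetical: the `Task` class, the `grade` function, and the toy `slugify`-style issue are invented for illustration and are not SWE-bench’s actual harness, which runs a repository’s real test suite against a code patch.

```python
from dataclasses import dataclass
from typing import Callable, List

# A "solution" here is just a function; in the real benchmark it is a code patch.
Solution = Callable[[str], str]


@dataclass
class Task:
    issue: str                                   # prompt shown to the model
    pr_tests: List[Callable[[Solution], bool]]   # tests from the human PR, hidden from the model


def grade(task: Task, solution: Solution) -> bool:
    """Pass only if every test bundled with the original PR passes on the submitted code."""
    for test in task.pr_tests:
        try:
            if not test(solution):
                return False
        except Exception:
            # A crashing solution counts as a failed test.
            return False
    return True


# Hypothetical issue text and the tests that shipped with the (hypothetical) human fix.
task = Task(
    issue="slugify() should lowercase the text and turn spaces into hyphens.",
    pr_tests=[
        lambda f: f("Hello World") == "hello-world",
        lambda f: f("A B C") == "a-b-c",
    ],
)


def model_solution(text: str) -> str:
    return text.lower().replace(" ", "-")


def broken_solution(text: str) -> str:
    return text.replace(" ", "-")  # forgets to lowercase


print(grade(task, model_solution))   # True
print(grade(task, broken_solution))  # False
```

The point of the sketch is only the division of labor: the prompt comes from the issue, the tests come from the human PR, and the code under test comes from the model.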
Max [00:35:47]
Could you give like a concrete example?
Ege [00:35:47]
Yeah, so for example, let’s say that an issue description said, add an additional field. Let’s say we have a form, like an application form on an application tracking site or something, and it’s currently only allowing people to input a certain number of fields, and you want to add an additional field at the bottom. It’s a very simple task. But the issue description might be vague about exactly what the, say, HTML selector for the new text box should be. It might be that you need to come up with a short slug version of what it represents. Let’s say you’re asking people to tell you about their background, or you’re giving them a text box where they can describe anything extra that they want the people reading the application to know. What do you really call that text field? There’s a lot of possible names. And a regression test added for that specific feature might involve going to the page and then clicking the text box identified by some internal test ID tag, and that might work as a regression test because the shipped PR of course has that tag on that field, so the test just passes. But the original issue description is very unlikely to say, oh, you should use this exact test ID tag on this text box, because that’s not the kind of thing you would put in an issue. It’s just a very mundane sort of implementation detail that doesn’t really affect the functionality of the feature, and it’s not really something that you would think about when making the issue. So this is a very typical example where there is a specific expectation that the test has about some very minor or even technical detail invisible to a real user, but that just makes it convenient to write tests. A lot of testing frameworks, when you go to a web page, they will click things by locators that are invisible to a human, because that’s the convenient way to automate it. And then they will expect those locators to be present, and they will fail if they are not present. 
But those are not things you will actually put in the issue description. So you have to filter out a lot of these tests. When constructing SWE-bench, people tried to filter these out, but they didn’t have manual review of everything, so a lot of this stuff still got left behind. So then some people said, oh, we should really take a subset of this and have humans go through each example to make sure that we don’t have these bad cases which happen to not be caught by LLM filtering. And LLMs back then were much worse than they are now, so their filtering was also much worse for this kind of issue. They took a subset of the SWE-bench set and just showed it to humans, who were like, yes, I think this task is fair, I think the tests can be passed reasonably from the given issue description. And you form that test set according to that. But now you have another problem, which is that often the tests that are written as parts of a PR are not really intended to be adversarially optimized against. When I write tests as part of a PR and bundle them in, I’m not thinking that, oh, these should be so comprehensive that someone who can pass all these tests, I should be happy to merge in a second PR which is doing the same thing as mine that happens to pass all my tests. Because that’s really the criteria you care about. If someone opened this PR that the model has, the model’s different...
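The test ID example from this answer can be made concrete with a small Python sketch. The markup, the `data-testid` values, and the helper names are all assumptions made up for illustration; the sketch only shows how a regression test can hard-code a detail that the issue text never mentions.

```python
from html.parser import HTMLParser


class IdCollector(HTMLParser):
    """Collects every data-testid attribute value found in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.test_ids = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "data-testid":
                self.test_ids.add(value)


def has_test_id(html: str, test_id: str) -> bool:
    collector = IdCollector()
    collector.feed(html)
    return test_id in collector.test_ids


# What the merged PR shipped (hypothetical markup).
shipped = '<textarea data-testid="additional-info" name="extra"></textarea>'
# An equally reasonable implementation written from the issue text alone.
alternative = '<textarea data-testid="anything-else" name="notes"></textarea>'


def regression_test(html: str) -> bool:
    # Bundled with the PR: it hard-codes the PR author's own test ID.
    return has_test_id(html, "additional-info")


print(regression_test(shipped))      # True
print(regression_test(alternative))  # False
```

Both fragments add the requested text box, but only the one that happens to reuse the PR author’s ID passes, which is the kind of task the human review for SWE-bench Verified was meant to filter out.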
Max [00:39:36]
What do you mean by adversarially optimized? Isn’t this an eval and not an RL environment?
Ege [00:39:42]
By adversarially optimized, I mean this, regardless of whether it’s an RL environment or an eval or whatever: the problem you’re trying to solve is to take a submission and assign a score to it in a way that measures how good it is. And if you just have very simple test cases, then that will very easily be saturated even though the feature actually doesn’t work. So it’s possible that you can get a perfect score, even if you’re not adversarially optimizing. You’re just doing something that seems straightforward, and it passes all the simple test cases, but then it will fail some complicated interaction, or it has code quality issues. There are a million ways in which it can be bad even if it passes the tests I bundled into this PR. And that’s because I’m not thinking about that problem. The tests I bundled into my PR essentially exist to catch regressions, like in case someone in the future breaks something I want to know so I can fix it. That’s very different from trying to grade another person’s implementation to make sure it is at feature parity with mine. That’s not a task that tests are typically used to solve in a lot of repos. So the test coverage is often very low compared to the actual surface area of features and interactions exposed to the model. And the reason I said adversarially optimized against is because of this earlier point that RL environments are very similar to evals. So when you train a model to be good at this kind of task, then you train it in part to anticipate the kinds of test cases that it will be tested against, and then you train it to write something that it is confident will pass whatever tests you will run, which is different from training it to make a high quality feature.
So models will often have this, like if you work with a model you will often see models have this weird laziness behavior, or sloppiness, where they will do something that sort of looks like it works, and maybe it works in some very limited sense, but it just doesn’t have the richness and convenience to use and all the functionality that you would expect from a human implementing the same feature. And it might also handle various edge cases badly, things like someone pasting in a very long input that I was talking about earlier, like someone pasting a long document into a password text box, they might just not pass that. And this exact example is kind of contrived, but there are other examples. For example, let’s say that you ask a model to implement a pop-up window for something on a website where you click a button and there’s going to be a pop-up, and then you can fill in some text inside it. A common failure mode that I have seen on this kind of task is that often a human will want to select the text that they’ve put in the text box inside a pop-up, so they will want to do this motion of clicking on the left side of the box and then dragging their mouse to the right. And often when you do that, your cursor will end up outside the box, and then when you let go, the model’s code will interpret that action as if you had clicked outside the pop-up, and it will close the pop-up. So you won’t be able to select the text, and then you will lose your text, and then you will reopen the pop-up and your text won’t be there. So this is a very common kind of failure mode that happens because a unit test is extremely unlikely to exercise this specific interaction with the pop-up. It’s just likely to test something happy, like let’s open the pop-up, let’s put in some text, let’s click save, and then let’s see if it was correctly registered. And the model would pass that test with this implementation. 
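The pop-up failure can be modeled in a few lines of Python. This is a sketch under stated assumptions: the `Popup` class, its coordinates, and the `close_requires_press_outside` flag are hypothetical and stand in for real browser event handling. It shows why a happy-path test passes for both implementations while the drag-to-select interaction separates them.

```python
class Popup:
    """A pop-up with a fixed bounding box (hypothetical page coordinates)."""

    def __init__(self, close_requires_press_outside: bool):
        self.left, self.top, self.right, self.bottom = 100, 100, 300, 200
        self.is_open = True
        self.close_requires_press_outside = close_requires_press_outside
        self._press_started_outside = False

    def _inside(self, x: int, y: int) -> bool:
        return self.left <= x <= self.right and self.top <= y <= self.bottom

    def mouse_down(self, x: int, y: int) -> None:
        self._press_started_outside = not self._inside(x, y)

    def mouse_up(self, x: int, y: int) -> None:
        released_outside = not self._inside(x, y)
        if self.close_requires_press_outside:
            # Careful version: a click counts as "outside" only if it also began outside.
            should_close = released_outside and self._press_started_outside
        else:
            # Naive version: any release outside the box closes the pop-up.
            should_close = released_outside
        if should_close:
            self.is_open = False


def happy_path(popup: Popup) -> bool:
    # What a typical unit test does: click inside the box and nothing else.
    popup.mouse_down(150, 150)
    popup.mouse_up(150, 150)
    return popup.is_open


def drag_select(popup: Popup) -> bool:
    # Press at the left of the text, drag right, release past the edge of the box.
    popup.mouse_down(110, 150)
    popup.mouse_up(320, 150)
    return popup.is_open


print(happy_path(Popup(close_requires_press_outside=False)))   # True
print(drag_select(Popup(close_requires_press_outside=False)))  # False: pop-up closed, text lost
print(drag_select(Popup(close_requires_press_outside=True)))   # True
```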
But this kind of thing, like you put the cursor here and drag it over and then the cursor goes outside and then you let go, that is a very high entropy interaction that you would really only get if a user was really using the feature. And a model is very bad at passing that kind of test because people usually don’t write unit tests for that kind of thing.
Max [00:44:11]
It’s kind of hard to write in a test because you have to move the cursor and there’s some trajectory, and you have to know what the starting point and the ending point are for different layouts and all these other things, whereas it’s much easier to write like select this element, put in some text, click this button. And in general you just see models very much try and implement things such that they will pass easy-to-write tests. And also be very sophisticated about guessing what kinds of tests will happen. For example, in the original example of adding a field to a form, the model may very well try and guess the test ID that you’re going to use. I mean, this is very hard for it, but if it thinks that it’s going to have to match a specific test ID, it might as well try. I’m anthropomorphizing a bit here by saying it’s thinking.
Ege [00:45:09]
But sometimes it literally is. Like, sometimes this behavior is more subtle. If you read the transcript of the model trying to do a task like this, sometimes it will just happen to pick a test ID without any explicit reasoning, and then it will just happen to be sampling from a distribution which is similar to the distribution that people actually use when they write these tests, so it just happens that it works. And in that case maybe it’s a bit debatable whether you call it thinking. But in other cases there will literally be a chain of thought where it’s explicitly thinking, oh, I was not told what this test ID should be, I wonder what it is, I wonder if I can find that out. And it would try to find it out, and then it can’t, and it’s like, oh, I can’t find it, so I should just make a reasonable guess. It will just be explicitly reasoning about it. And I think the reason that happens is very obvious, because even if you have some kind of very basic quality filtering for these tasks, the base quality of things you can scrape from the internet is so bad that the LLM will have been trained on tons of broken RL environments. Broken, for example, in the sense that there’s no way for the model to pass the test fairly, meaning that the tests had some specific thing they expect that was impossible to infer from the description of the task that was given to the model, like the issue description for example. And then the model will be under very strong optimization pressure throughout this kind of RL on broken tasks to just infer what the test will want and then try to do that, and then not do the things the test will not measure. Because if you waste time trying to do the things the test will not measure, then that’s just more tokens in your context that you could have spent on making sure you get right the things that you’re confident the tests will measure.
So it just creates this perverse incentive, which is very similar to what you might expect if you have a human employee and you’re giving them bonuses or compensation for doing some specific set of tasks.
Max [00:47:27]
Or literally imagine giving them bonuses for passing unit tests and that’s the only way you evaluate their performance.
Ege [00:47:33]
Yeah. Well, of course then you will see some very strange behavior in the resulting software, and that is, I think, basically what’s happening. And it’s even more opaque in the case of models, because often what happens is that there is no explicit reasoning. It’s rare that the model will explicitly reason about this kind of decision. Often what will happen is that it will just implicitly do something where the behavior it does happens to have been reinforced in this way.
Max [00:47:57]
This is why I was saying it’s anthropomorphizing somewhat to say that it’s thinking, because it’s just the thing that it has learned to do.
Ege [00:48:09]
Yes, it’s just incredibly natural for it to do this. So it’s not like it’s trying to adversarially attack the grader in this way. Yes. And in fact, some of these selection pressures are so strong that even when you can explicitly tell the model over and over and over again to not do it, it doesn’t care, it still does it. For example, a behavioral pattern that is very similar to this: you can tell the model to not pipe command outputs to tail or head or grep or whatever, which is something that models will often do. They will run this really long-running command which takes like 10 minutes to run, and then they will send its outputs to tail dash 10, which just gives them the last 10 lines, and then they will realize that, oh, the last 10 lines of the command output didn’t have the information I want, so now I’m going to do tail dash 30, and then they wait and it’s another 10 minutes, and they go, that still doesn’t have the information I want, now I should do grep, I should search the output. The reason they do this is because during training they are not under strong optimization pressure to save clock time, but they are under strong pressure to save on tokens to preserve their context. So if you understand that, then this behavior makes perfect sense, because they’re completely insensitive to how long the command takes to run. So they don’t think of something like, oh, let me write the output of this command to a file and then let me just search that file, because from their point of view it’s not any different. But they are very sensitive to the risk that it uses more tokens.
Max [00:49:45]
It is a little bit different, right, because it uses more tokens.
Ege [00:49:45]
Yeah, that’s right. So they don’t want to do that because it uses slightly more tokens, even though it saves a lot of time. If saving time was not an optimization target, then you don’t care about it. The actual thing is that because often you’re limited to a fixed context size in RL, even if there is no explicit penalty for using more tokens, there is an implicit penalty, because the more tokens you use on something that is incidental, that’s fewer tokens you can use on actually getting a high score. And that results in this very weird behavior. And even if you tell the model, you can put it in your CLAUDE.md or whatever, all you want, you can say do not pipe command outputs to tail, do not do that, and the model will just not care. Maybe immediately after you give the instruction, in the next turn it won’t do it, like the next couple of turns. But then the moment it starts doing a long-running agentic task a little after, it will immediately revert to its old behavior. The prior it learns from this RL is just so strong that you can’t overcome it by just putting very explicit instructions in its context. And that’s exactly the same for the behavior of not writing features in a hacky way. An LLM just has a very strong prior to write things in a hacky way, because writing things in a hacky way that just happens to look good to a simple set of unit tests is what they’ve been trained to do. So it’s very hard for them to break out of that mindset, because they have so many habits and ways of doing things that have been learned in that kind of setting. So it’s sort of like trying to tell a human to walk in a completely different way or something. The human might understand the explicit instruction, but it’s just very hard for them to do that, because it’s very against their instincts.
Max [00:51:49]
Yeah, a human might be able to do it for one step, but then if they’re trying to do other things, maybe they’re talking on the phone, they’ll very quickly just forget and start walking normally.
Ege [00:51:59]
That’s right, that’s right. I think that’s a very good analogy to what they do.
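The alternative to the tail-piping pattern that came up a few turns back, writing the command’s output to a file once and then searching the file, can be sketched in Python as follows. The slow command is simulated with a short Python one-liner, and the helper names and the `build.log` filename are assumptions for illustration only.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# Stand-in for a slow command: prints 100 lines with the interesting one in the middle.
SLOW_COMMAND = [
    sys.executable,
    "-c",
    "for i in range(100): print('ERROR: disk full' if i == 50 else f'line {i}')",
]


def run_once_to_file(command, log_path: Path) -> None:
    """Run the command a single time and keep its full output on disk."""
    with log_path.open("w") as log:
        subprocess.run(command, stdout=log, stderr=subprocess.STDOUT, check=True)


def last_lines(log_path: Path, n: int) -> list:
    """Rough equivalent of `tail -n`, reading the saved log instead of rerunning."""
    return log_path.read_text().splitlines()[-n:]


def matching_lines(log_path: Path, needle: str) -> list:
    """Rough equivalent of `grep needle`, again without rerunning the command."""
    return [line for line in log_path.read_text().splitlines() if needle in line]


with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / "build.log"
    run_once_to_file(SLOW_COMMAND, log_path)    # pay the runtime cost once
    tail = last_lines(log_path, 10)             # the last 10 lines miss the error
    errors = matching_lines(log_path, "ERROR")  # a second query costs no extra runtime
    print(len(tail), errors)
```

Each follow-up query reads the saved log, so the 10-minute wait in the example from the conversation would be paid once rather than once per guess.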
Max [00:52:04]
I think it’s also worth going back to GBA Eval, because the reason that LLMs have this kind of perverse behavior is because there’s this thing that’s easy to test and they’re optimizing for that, instead of optimizing for what you truly want, which is good software, which is a very nebulous concept, you can’t really write tests for it very easily. But GBA Eval is a place where it seems much easier. So why do models still not do very well on it?
Stephen [00:52:37]
Yeah, so again I think the biggest bottleneck for models is this point we touched on earlier, which is that they’re really bad at writing test cases and actually playing the games. And it’s very rarely, at least for the frontier models, an issue with agentic capabilities, or an issue with actually understanding the logic of some code that they need to write. Maybe if you look at some of the open source models, they might just have issues with staying coherent over a lot of turns, they might just write code that doesn’t compile. But when you look at the frontier models, this is no longer the bottleneck that we see. The bottleneck is really this issue where they just don’t play the games, and thus they don’t encounter realistic scenarios. And this is why, when we grade GBA Eval, if we just booted games and checked whether they boot to the start screen correctly, the scores on the eval would be much higher for all the models. But that’s not a realistic use of the emulator, right? When you have an emulator, you want to play games on it. And so the majority of the grade in GBA Eval comes from what we call the replay score, which is the ability to take these sequences of inputs on various ROMs and then output the correct gameplay. And this is where models perform much more poorly than on booting the game.
Max [00:53:42]
Yeah, there’s also this procedural test subscore, right? Do you see this kind of gaming the grader, like trying to pass the easy tests there?
Stephen [00:53:55]
So models actually perform pretty poorly on the procedural tests. I think one part here is that these are tests that are unlikely to have been optimized against very heavily during training. For example, if you take a more common piece of software, which would be like a browser, there’s a very common existing open source test suite called the web platform tests, and models tend to be very good at writing features that pass the web platform tests. Presumably this is because they’re subject to optimization pressure on these specific test cases, whereas they were much less likely to have been tested against, for example, the mGBA test suite, which is an example of an open source procedural test suite that we use.
Max [00:54:33]
Wait, why are they subject to optimization pressure on this test suite but not on mGBA?
Stephen [00:54:38]
Presumably there are more RL environments related to writing browsers, because it’s a much more common piece of software, and those RL environments might use WPT tests as the grader.
Ege [00:54:48]
Yeah, and the other thing is that this doesn’t generalize very well. The skill of trying to anticipate which tests will be written, which tests you will be graded against, I think doesn’t generalize very well to other domains, especially because in a lot of cases that skill is implicit. It’s not like the model acquires an explicit habit of constantly reasoning about it. It’s more that it acquires a set of implicit habits, so you can’t really tell that’s what it’s doing if you just read the transcript. But then if you look at the final results and you compare what it wrote to how a human who is unfamiliar with the test suite might write the same feature, then you can tell that clearly there was a big effect of the fact that it is familiar with the tests that it expects to be graded against.
Max [00:55:35]
Yeah, why do you think that is? Because I feel like in humans this does generalize reasonably well, or at least something similar, which is, if you are very good at writing one piece of software like a browser or something and anticipating how users will use it and what kinds of issues they’ll run into, what kinds of things they’ll like, I think that does actually generalize fairly well to working in other domains. So why is that different for models?
Ege [00:56:03]
I think there’s a much more general question: why is it that generalization in RL, and also in pre-training, is poor for models? I think it’s hard to answer that question, because even for pre-training, models need to be trained on just so much information to do what they are able to do. The amount of knowledge that a model has about just the most random things in the world is so vastly greater than what a human knows, because a model’s trained on like a hundred trillion tokens, that’s like a hundred trillion words. And a human might see, I don’t know, by the time you’re 30 years old you’ve lived for like a billion seconds, so even if you read like one word every second on average, you only have a billion words. Of course you don’t read anywhere near that amount, so you have such vastly less information than an LLM. But an LLM trained on a billion tokens just doesn’t seem intelligent, even though it might know a bunch of facts. And this is just a sample efficiency issue, where these more general cognitive skills do not seem to be learned efficiently by the way we train the models now, so we just have to put in way more data. And the only way in which you get that generalization is when you train against an extremely broad, diverse set of tokens. And then another thing that is very strange is that adding additional just garbage tokens to the training set of an LLM, and by garbage I mean just really low quality stuff from random website scripts, stuff no human would ever read, that seems to just help the model. It just improves its performance on other things, even if the tokens are kind of low quality. Just adding them into pre-training can often just make the model better, and that’s very different again from humans. Humans don’t really benefit from reading these garbage tokens.
Max [00:58:06]
I mean, don’t they, when they’re like children and learning language?
Ege [00:58:12]
Initially no, because I think those tokens are way higher quality. I mean, probably yes.
Max [00:58:12]
But I think if you had a human learn language from really shitty, low quality Reddit comments, they’d probably still learn.
Ege [00:58:25]
No, Reddit comments are way better than the data I’m talking about. You can just get trained on the raw HTML of some random Reddit. Because there’s actually so few Reddit comments compared to, there isn’t going to be a hundred trillion words of data that is of a comparable quality to Reddit comments on the internet. High quality data is just not that common. For example, if you train on all arXiv papers ever written, you know, all the papers on arXiv, that’s like, I don’t know, five orders of magnitude less data than that or something like that. It’s like a billion tokens, maybe a couple billion tokens. It’s a very small amount of data compared to what the LLM is trained on. But it’s a very large amount: no human would read every single paper ever put on arXiv. And if you could somehow do that, and you could remember all that information, then probably as a human you would just be making so many connections across so many different things. You would see one thing and you’re like, oh, I’ve read a paper about that. But the LLM never does that. The LLM’s learning seems very superficial compared to what a human would get from the same amount of data. And this is a central problem: why is it that the models are so sample inefficient? And I don’t think we know the answer to this question, it just happens to be true. I think LLMs are more sample efficient at learning in context. So if you give examples to an LLM when you’re actually using it, like, here are some, I want you to tell me if a response to this question is good, and here are some examples of good responses and here are some examples of bad responses. If you’re prompted like that, then it’s essentially trying to do this in-context learning of your preferences over what is a good answer to this question. And at that kind of thing it’s much more sample efficient than it is in training. But I think it’s still worse than a good human. It still doesn’t have this taste that a good human would have. 
But it’s way better than it is in pre-training. And the answer to this I do not know. I don’t know why we need to give models tens of trillions of tokens for them to be as capable as today’s frontier models. That just seems like a mystery.
Max [01:00:47]
The claim you make about humans being more efficient at this in-context learning, how would you measure this? Because I feel like almost any scenario where you would have, say, few-shot examples given to a human, they would have a bunch of other context. Because part of the reason that LLMs need to be better at in-context learning is because they just don’t have very much context. They really do not have that much to go on. If you think about auto-regression, you have the first half of a Reddit thread, it seems very hard to predict the next comment, but this is what they’ve been trained on. I mean, obviously there’s other data, HTML and whatever, but there’s just not very much data to go on compared to all the context that a human would have about a realistic scenario.
Ege [01:01:41]
Yeah, I think that that is part of the reason why a human can perform better. I think it’s difficult to disentangle that, because you would have to put a human where they don’t know anything. You can give a human a few-shot in-context learning task in which they don’t have any context. You can just have humans who are data labelers do a short classification task, and then they have to make classification decisions over the next few things. You can set this up. If you do that, though, I think that it is no longer true that humans are better, especially because those humans are also not spending a lot of time on the problem and they have no context. The only thing they have is the prompt. And I think the issue here is that that’s just not a good modality for learning. One example I like, from Dwarkesh, is this: imagine the way you had to learn to play the saxophone was that you tried to do it, and of course it didn’t work, because it’s your first time ever trying to play a saxophone, so you can’t do it, so you fail. And then what you do is you write down a bunch of notes about how you failed, and then you hand that to the next guy who comes in, who, it’s also his first time playing saxophone, but he has to read your notes, and then he gets to try again, and of course it doesn’t work again, and then he writes more notes and hands them to the next person. This would just not work. You can’t learn to play a saxophone like this. But when you’re using the models for some kind of custom flow, anything where you needed them to have any awareness of context, where it’s not just something they’ve been trained to do, this is the only way you can teach them. You can ask them to write notes for themselves for the next context; that’s the only modality available to them. And maybe from that modality they actually are very efficient in-context learners. But the issue is that that modality just kind of sucks.
So it doesn’t matter if you’re really efficient at learning from that modality. It’s like you’re being very efficient at extracting a very small amount of information that is available in this modality. That still means in practice, whenever I use an LLM, I just feel frustrated by its inability to learn.
Max [01:04:00]
Yeah, this is just continual learning. Yeah, that’s right. So continual learning is one of the more important places that LLMs are currently lacking. You usually want to have some eval that you can hill climb against, that you can use to select models, that you can train against. How would you go about making an eval for this problem of continual learning?
Ege [01:04:25]
I think the basic structure would be that there is a context that the model can learn something about, and you ask the model to work in that context for a long time, whatever that means, and then over time you expect it to accumulate some kind of learning which allows its performance on future tasks in the same context to get better. And there is a question of how does the model accumulate more learning. An eval should probably be agnostic about that. You should not care how the model does that. But one way in which current LLMs try to do this, if you’ve ever used coding harnesses in the past month or something you might have noticed this, because labs have been putting more attention into it, is that often when you tell the model, oh, you shouldn’t have done that, or, you did that, that was a mistake, it will go into its memory files and edit a memory file. It will be maintaining a project memory, which is just a directory with all the markdown files, and it will create a new file and then go to its main memory document and refer to the file which it created, saying, oh, if the user asks you to do X then read this file, or follow the instructions file. And currently that’s the way that they’re managing memory. And that’s just like the example of the guy trying to learn how to play a saxophone. You’re just taking these notes, and this is not very effective. It also can’t get the model to really diverge very strongly from its pre-training prior, unfortunately, or RL priors, because it’s fundamentally the same kind of thing as if you put instructions into the system prompt, which you can do. You can put whatever you want in your prompt, but often that’s not that effective at getting the model to change its behavior. So I think this specific way the labs are working on right now, of getting the model to have this memory and getting better over time by taking more notes in this memory, it’s not working very well. 
I think it’s just because it’s the only modality that they can think of to have this continual learning right now, but I don’t think it works very well. So an ideal eval for this would involve what I said earlier, which is, you have some kind of rich environment with a lot of potential for shared context across different things the model could do, different tasks, and then you just ask the model to work in that environment for a long time, and then you want to see, okay, does its performance on other things it can do in this environment get better, just like a human would. If a human is working at a company, then over a period of months or whatever, they’re going to build up this context, they’re going to understand what this company is doing, they’re going to get familiar with the internal tooling of the company. They won’t need to look at the documentation for some internal tooling every single time they need to use it. Every day you show up to work, they won’t need to spend like two hours reading documentation and then four hours talking to everyone else at the company about what they’re doing, and then in the last two hours of the day they might get some work done. Which, if you’ve used LLM coding agents, that’s sort of what they’re like. You give them a new task in a repository and then they will start exploring all the code and reading a million different files and reading past notes they’ve written, and then they will ask you questions that you probably already answered in past context, but the answer does not get recorded. They will make the same mistakes that they have made in past context, because not all of those mistakes got written clearly into a document that the model can read. Maybe at the end, after it has done all of that, it can do some work, but then it runs out of context. So this is a very common experience. It’s become less common now that the models have like a million tokens of context, but this was very common when they had like 200,000. 
I would just constantly run out of context, because by the time the model has gotten good enough to try to answer my request, it will just have to compact because it’s out of context. And this is just a very inefficient way in which they learn. So they’re just not good at this task right now. Insofar as the models get better at a given context, they get better because there is a human who is explicitly managing their memory. A lot of companies now have these extremely complicated skill directories for coding agents, where, for example, there’s a company-wide skill for doing something. A skill is essentially a markdown document which is given to the model upfront, and then it refers to other documents which the model should read if it’s asked to do something specific. So it’s essentially this tree of documents that the model should retrieve depending on the request from the user. And many companies build these giant trees full of documentation, so that when one of their employees wakes up in the morning and wants to ask an LLM to do something routine, the LLM doesn’t have to waste an enormous amount of time making obvious mistakes and running into frictions and not knowing what’s going on. There’s just a very simple sort of tree where it starts from the generic company prompt, and then that refers one level down to maybe some specific department, and that refers down to some specific tooling it can use, and then it just explores this tree, and then it has the information it needs to carry out that specific task. But if then the user asks for something else, then it has to read more documents from the tree. And there are a lot of people at a lot of companies now who are just dedicated to managing this growing tree and making sure it’s up to date, because you also need to update it when you change things about your software and your company context, because otherwise this massive web of documentation is going to drift. 
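As a rough illustration of the document tree described here, the sketch below models each skill as a node holding some text plus links to child documents keyed by topic, and walks from the generic company prompt down to the documents a request needs. The `SkillDoc` structure, the keyword matching, and the `collect_documents` helper are all hypothetical; real coding harnesses decide what to read in their own ways.

```python
from dataclasses import dataclass, field


@dataclass
class SkillDoc:
    """One markdown document in the tree (hypothetical structure)."""
    name: str
    text: str
    # Maps a topic keyword to the child document the model should read next.
    children: dict[str, "SkillDoc"] = field(default_factory=dict)


def collect_documents(root: SkillDoc, request: str) -> list[str]:
    """Walk down from the root, following every child whose keyword appears in the request."""
    words = set(request.lower().split())
    collected = [root.name]
    for keyword, child in root.children.items():
        if keyword in words:
            collected.extend(collect_documents(child, request))
    return collected


# A tiny made-up tree: company prompt -> department -> tooling.
deploy = SkillDoc("deploy-tool.md", "How to use the internal deploy tool.")
infra = SkillDoc("infra.md", "Infrastructure department notes.", {"deploy": deploy})
billing = SkillDoc("billing.md", "Billing department notes.")
company = SkillDoc("company.md", "Generic company prompt.", {"infra": infra, "billing": billing})

print(collect_documents(company, "infra team needs a deploy to staging"))
```

Every document collected this way costs context, and every link in the tree has to be kept in sync by hand whenever the underlying software changes, which is the maintenance burden being described.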
It’s going to become invalid as people are making changes to the actual underlying thing, and the model will still get confused. So now you’ve added this additional maintenance burden, where every single change you make, you also have to find all the relevant places in this tree of documents and then update them and keep them up to date. So right now this is basically the only way in which the models can learn, and it’s not really learning. What’s happening is that a human is noticing ways in which they fail, and then a human is requesting the model to make specific changes to these documents, and over time they grow. But that’s really not what you want. What you really want is that the model manages this tree. You don’t want to manage this tree, because if you’re managing the tree then it just feels extremely inconvenient, it’s imposing a very high overhead. Every time you notice the model make a mistake, you have to tell it, okay, now stop what we’re doing and go into this document and put a line here. That’s just way too inefficient. Really what you want is something like the model dreaming or something, where the model works for like a day and then at the end it just consolidates all of its learnings from that day, and then it just updates this entire tree of documents all at once. It’s like a batched update that just goes out to everyone at the company. That would be an improvement, if it works, over what we’re doing now. And people are trying to do something like this. I’ve heard Anthropic is trying to have a dreaming feature for the LLMs, but I don’t think it’s going to work very well when it first comes out. Kind of pessimistic. But even that is actually much worse than the way humans learn, because as a human I don’t need to read a bunch of notes every time I need to do something new at a company that I worked at for like six months. I can just, I think I just know what to do immediately. I don’t have to read this huge set of files that eat up my
Max [01:12:55]
context, and then I run out of context. Yeah, and then even if you don’t run out of context, it’s very annoying to manage this as a user. Whenever I’m using Claude or whatever, or Codex, or any other LLM tool, I’m always thinking about, oh, I’m going to do this prompt and it’s going to put a bunch of stuff into context, and now I’m managing this resource, and then it also takes a bunch of time to access, as you’re saying. So there are these dual problems, first of generating the memory, and then also being able to efficiently retrieve it. You need to generate the memory in some very compressed format where it’s all very easily available, and then be able to traverse it very efficiently, and right now LLMs just do not seem to
Ege [01:13:40]
have this ability at all. Yeah, I think this is true. And also it’s just a very small amount of information. This is the other thing that I think should stand out: an LLM, a frontier LLM now, has something like, maybe the largest ones have like 10 trillion parameters, the order-of-magnitude-smaller ones might have single digit trillions. And that seems like a lot of information, but it’s actually not that much, because most of that is just pre-training, it’s just filled up by stuff it sees in pre-training, because that’s 100 trillion tokens, so that has to go somewhere. And the actual amount of change that happens to the parameters of an LLM during RL is like a low rank matrix. It’s actually way, way less information than you might expect from a couple terabytes of parameter data. It’s not literally true that only a small fraction of the parameters is modified, but because the change is a low rank matrix, the total amount of information in the change of parameters is small. You can represent it in a very compressed way without any loss, or with very, very little loss. And as a result of that, during RL the model just doesn’t get that much new information. And in context, the amount of information you can put is just so small. Even with a 1 million token context window, each token is on the order of, I don’t know, some single digit number of bytes in terms of information. And
Max [01:15:15]
then probably less than that, probably less realistically, because during pre-training the loss might be on the order of half a bit or something. Yeah, that’s true, maybe it’s more information dense
Ege [01:15:25]
but probably not by more than a few times. Yeah, yeah, I agree with that. So it’s, okay, a million times one bit, that’s like 100 kilobytes, 150. It’s just such a small amount of information compared to, and then you look at the human brain, which has like a hundred trillion synapses, which is more than the total number of weights of an LLM, the entire number of weights of an LLM. And then the other thing is that the human does not have this pre-training phase. A human is just able to consistently pick up new skills. They might be faster at it when they’re younger, but even when they’re like 40 or 50 years old, a human can pick up new skills, new contexts. If you had to ask them to write down everything they know about a company after spending six months there, and they really had to write everything that’s in their brain, probably it would be way, way, way longer than 100 kilobytes. So the issue is that the amount of information you can fit in here is so small that, even if you had an optimal learner, it’s just not going to have enough information. So the way you have to manage this is then to have these complicated retrieval mechanisms, where the information exists on disk and then the model has to fetch parts of it into memory, into its context, but then it just becomes too myopic. It doesn’t have the full context of what it needs to know. And then the only way it can access that information, if I make a request which requires a lot of familiarity with the relevant context to interpret, the only way it can manage that is by using ordinary text search. It has to go to the memory directory and grep for keywords, and that’s just so inefficient. That’s not the way a human does it. So I think there’s fundamentally some issue with the way these models handle memory that I don’t know how to resolve. It just seems very different from the way humans do it. And that just seems like one of the fundamental obstacles to getting them to actually replace human workers. 
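The back-of-the-envelope comparison above can be written out as below. The per-token figures are the rough ones mentioned in the conversation (from about half a bit up to a few bytes per token), not measurements, and the three-byte case is just one arbitrary "single digit number of bytes" picked for illustration.

```python
def context_capacity_bytes(context_tokens: int, bits_per_token: float) -> float:
    """Information a full context window can hold, in bytes."""
    return context_tokens * bits_per_token / 8


CONTEXT_TOKENS = 1_000_000

# Per-token information estimates, from most to least pessimistic.
for label, bits in [("half a bit", 0.5), ("one bit", 1.0), ("three bytes", 24.0)]:
    kilobytes = context_capacity_bytes(CONTEXT_TOKENS, bits) / 1000
    print(f"{label} per token: about {kilobytes:,.1f} kB")

# Rough counts quoted in the conversation, for scale only.
LLM_PARAMETERS = 10e12   # "maybe the largest ones have like 10 trillion parameters"
BRAIN_SYNAPSES = 100e12  # "like a hundred trillion synapses"
print(f"synapses per parameter: {BRAIN_SYNAPSES / LLM_PARAMETERS:.0f}")
```

At one bit per token, a million-token window comes out to 125 kB, in line with the "100 kilobytes, 150" estimate.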
Right now I feel like a lot of what I do, that’s not the only thing, but a lot of what I do is being this context manager for an LLM. I just have all of this context in my head, and I know the right things I’m supposed to be doing, I know how to prioritize different things because I have all this context, and the LLM doesn’t have anything. So I have to think about, okay, how do I condense the relevant parts of the context that I have into a short prompt, so that the LLM will be able to go and do the thing that I need it to do without having context. And that’s very much not how you want an employee to be. You don’t want to have to do that with an employee. If you have to do that, then the management overhead of the employee is just way too high. You want an employee to be autonomous and know what you mean when you give them a vague instruction, because they have familiarity with the kind of instructions that you’ve given them in the past, they have familiarity with what the company is doing. But the LLM doesn’t have any of that, so it’s much more mentally taxing to manage an LLM, because you know it’s not going to know anything you didn’t tell it explicitly, or that didn’t at some point end up in a memory document. And even if it did end up in a memory document, it might just happen not to retrieve it, even though it would be relevant. So you feel this need to be much more precise and include so much context in your request that
Max [01:19:28]
it’s kind of exhausting. Yeah, and we could also see this in the labor market even prior to LLMs, because this is pretty similar to the difference between having a full-time employee and hiring a freelancer. The freelancer just doesn’t have all this context, and so there’s all this overhead from communicating it to them. And of course the market for all freelance work is much smaller. Part of that nowadays is because LLMs have replaced a lot of it, because LLMs don’t have this disadvantage when compared to freelancers, or have less of it. But even before then, freelance work was a much smaller market than full-time employees. Continual learning is one place where models are limited right now that makes them much less useful. But I think there are a bunch of others as well. Wouldn’t it be kind of nice to have an eval that just measures all of them? Why can’t you do this? Why is there not just a software engineering eval, the universal one, and when we get to 100% it’ll just fully automate software engineering?
Ege [01:20:31]
I think the reason why this doesn’t exist is very similar to why it’s very hard to make software that has all the functionality you would want without spending a lot of time on it. In practice, software engineering just has a lot of sub-capabilities, sub-dimensions, and if you want to measure some of them, some are very easy to measure. Some get increasingly hard to measure, and the harder ones are also more expensive to make tasks for, for a couple different reasons. One is that it might literally be really hard to set up the environment in a reproducible way. The central characteristic you want out of an eval is that you can just grab it and rerun it on a different model, and you know it’s in the same initial conditions, and the only relevant difference in the scenario is which model is being used. Because if there’s some other scenario difference, then you don’t know if the model did better because the scenario got easier for some reason or whether the model was actually better. And that’s really hard. The other thing is, how do you grade the quality of the output? If it’s a very simple functional correctness thing, like, did you change the maximum length on this text box from this to this, then it’s very easy. But if it’s more like there’s a user, they give you some kind of ambiguous instruction, which is how most instructions are in the real world, they’re ambiguous. And then you have to exercise judgment to interpret it in the right way, and then execute on that, and then maybe check back in with the user. But you can’t check in too much because then the user will get annoyed. You have to ask them about things that are important, but not about every trivial decision that you make, because then the user will be annoyed by you. Just like a normal employee. If an employee doesn’t escalate anything to a manager, that’s bad. But if the employee also escalates every single minor decision, that’s also bad. 
So how do you come up with a metric of, they escalated the right things, but not the wrong things? That gets very difficult to measure in an automated and precise way. So often the capabilities that we can’t target very well are the ones where either the scenario in which they happen is very hard to containerize and reproduce in a deterministic or sufficiently deterministic way, or there are things where, even if you can reproduce a scenario, it’s really difficult to judge the quality of the output. For example, this is also why we don’t really have good creative writing evals. Because if an LLM just writes a bunch of stuff, you have a prompt for “write a short story about this” or something, and then the LLM writes a thousand words or two thousand words, how do you judge the quality of that? What is the metric of that? It’s just very hard. So I think those are the two reasons why usually we don’t have an eval for something. And the reason why we don’t have an eval of everything is because often you have limited resources, and you have priorities because you have limited resources. You have to decide which capability right now is the most important one to target. Because you could just spread out your resources uniformly over all possible subsets of software engineering. But then what you’re going to get is some environments which are completely unnecessary because the model can already do those things, so you get no signal. And you’re going to get some environments which are either very hard, so you get no signal, or they’re just very low quality. And they have an enormous amount of noise and bias, which makes them basically unusable as proxies of the thing that you would actually want them to measure. You can imagine a creative writing eval where the way we judge the model’s output is by whether it’s avoided specific turns of phrase that we think are annoying or something. 
That’s a very poor measure of writing quality, but it’s something that you might come up with if you’re looking for a cheap way to evaluate someone’s quality of writing. Sometimes there are these rules of thumb, like, do not use the passive voice in your writing. And you could try to make that a rubric item and then grade an LLM’s writing output on that. But of course, you intuitively realize that that’s just a very bad grader. That’s not really what good writing is about. That’s just one small piece of what it might be about. And even that rule is probably too specific. Probably you can write in a good way using the passive voice if you know what you’re doing. So not only is it this very restricted grader, but it’s also biased against some kinds of writing that might actually be good. And you punish them for no reason. So that’s a very similar thing to what happens when you’re working on software engineering and you’re working on the parts of it that are much more based on taste and judgment. You often just can’t come up with a sufficiently good grader to test a lot of these capabilities. So in practice, the reasonable compromise ends up being: pick the things that you can measure with sufficiently low bias and sufficiently low noise, and that are not either too easy or too hard, because both of those mean that the eval will give you very little signal at this moment. Because you want an eval to really be decision relevant. If an eval always gives the same score, no matter which checkpoint or which model you test, then it’s useless. So you want some variation in how different models perform at the current level of capabilities or in the near future level of capabilities. And in practice, these are the evals that get built at any given time. You don’t target other capabilities because they’re too easy, too hard, too difficult to grade, or too difficult to reliably reproduce in a containerized or simulated environment.
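One way to make the "not too easy, not too hard" criterion concrete is to look at how pass rates on each task vary across the models or checkpoints you care about. The sketch below is a minimal illustration, assuming you already have per-task pass rates; the thresholds, the task names, and the `has_signal` helper are hypothetical choices, not an established recipe.

```python
from statistics import pstdev


def has_signal(pass_rates: list[float], low: float = 0.05, high: float = 0.95,
               min_spread: float = 0.05) -> bool:
    """Keep a task only if models neither all fail nor all pass it, and their scores differ."""
    mean = sum(pass_rates) / len(pass_rates)
    if mean <= low or mean >= high:
        return False  # too hard or too easy at the current level of capabilities
    # A task that scores the same for every model is not decision relevant.
    return pstdev(pass_rates) >= min_spread


# Hypothetical pass rates for three checkpoints on four tasks.
tasks = {
    "change-text-box-length": [1.0, 1.0, 1.0],        # saturated
    "debug-multi-service-incident": [0.0, 0.0, 0.0],  # out of reach
    "refactor-module": [0.2, 0.45, 0.7],              # separates the checkpoints
    "uniform-task": [0.5, 0.5, 0.5],                  # solvable, but identical across models
}

useful = [name for name, rates in tasks.items() if has_signal(rates)]
print(useful)
```

A filter like this only addresses difficulty. It says nothing about whether the grader behind each pass rate is biased or noisy, which is the other half of the problem.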
Max [01:26:38]
So you always need to get signal, which means that your scores are always increasing. I think this is something that people don’t necessarily appreciate. This is part of why AI progress always looks so fast on evals, because it always needs to look fast in order to be decision relevant. And then what this means is that for any given fixed eval or fixed benchmark, you’re going to get very fast progress and then eventually it’ll saturate and you’ll need a new benchmark. And this means that you can’t use any particular benchmark to say, oh, once we reach 100% on this, it’ll mean that AGI is solved or something. I mean, there are some benchmarks which are better for this. For example, lab revenue is a very, very good benchmark. It’s probably the best benchmark that exists. But unfortunately, it’s very difficult and time-consuming and noisy to run. So how exactly do you try and select things that will get good signal on this ultimate criterion of lab revenue? Maybe lab profit is a better term. Lab profit, because you don’t want to spend a million dollars per token or whatever. Yeah, that’s right.
Ege [01:27:57]
So I think it is difficult because often this final outcome of how much economic value was created by the models is affected by so many things that it’s even hard to attribute to specific decisions that you make. For example, if you are at a lab and you’re seeing your revenue grow very fast, you can’t necessarily tell if that’s because of something you did recently, or it’s just because the models happen to be seeing more adoption, or whether you increased the number of thinking tokens that the model generates. So the model now thinks longer before taking any action. So you’re effectively billing people for more thinking tokens. And you can’t know if people go along with that because you recently improved the capabilities of the model, or maybe they would have always gone along with it because you didn’t try to experiment three months earlier. Maybe if you had had the compute to support that, maybe it would have worked. So it’s very hard for you to tell, when you’re seeing this final metric, what exactly is influencing that number. So often there have to be a lot of decisions made based on taste and human judgment, which is also true for other products. When you’re at Apple and you’re designing the next model of the iPhone, it’s often very hard to know which improvements consumers will actually like. And even if you spend a lot of resources on it, Apple made the Vision Pro. And the Vision Pro is technically an impressive product, but it just happened to not get that much traction. And then it’s not clear how Apple could have foreseen that. How do you foresee that before you actually produce a product? Sometimes it’s just very hard to observe. So I think in practice, when you’re making the evals, you try to base them on trying to measure real pain points that you have as a user. I think that’s a very good way to do this. 
If you’re doing an eval and then your theory for why it is good is that, oh, some other people care about this capability, but I don’t, I don’t really care about this but probably some other people do, that’s really bad. That’s the same kind of mistake as when you make a startup and you think, oh, some other people will want this, some other people will want to spend money on this, not me, I wouldn’t spend money on it, but some other people might. That’s usually a very bad sign, because you don’t have that good of a model of other people to know that they’re different from you in this respect. But if you have a very particular pain point, that’s a good sign that there’s at least a large population of users who are sort of like you, because you’re probably not that unusual, and who are suffering from the same problems. And then you can try to create evals that target the capabilities that you are aware are bad. And another way to do this is to get user feedback. So if you’re at a lab, you can ask users, maybe large enterprise customers or whatever, what are you trying to do with the model that you can’t do? What is lacking? And they might even answer it, and maybe their answer is not always reliable because they might not understand the actual capability that’s lacking. They might just say, oh, we tried to do this and it didn’t work or something. And they might tell you, oh, could you just make this work? And then you have to have some people on the lab side who think about, okay, what is the reason why the models can’t do this thing? And you try to distill it down to some actual more abstract or more fundamental capabilities that are lacking. And then you try to create evals that you think target those capabilities. So this is important because you can’t actually reproduce the conditions in the real world under which people are using the models. That’s just way too hard. 
So in practice, you have to do some kind of simulation where you try to determine what are the relevant details of a situation that you can abstract away, and what are the details that are really making a task hard. And you retain the things that are important and that are really making a task hard in the real world that are important to measure. And then you can discard, for example, some of the things that are very hard to replicate in a faithful way that you don’t think are relevant. And that’s a judgment decision. It’s not always easy to make that decision. If you feel that you have good enough judgment, then you can do this. And I think in practice, the fact that the models are getting much better, in terms of the final benchmark of how much revenue the labs are making, does suggest this process works, because this is fundamentally the process that is used to improve the models. There isn’t some secret way you can do it that is not this. And so it works. It’s not perfect. But I think this is a basic approach you have to take. And I agree it’s a hard problem because there isn’t a deterministic recipe for it. But I think that’s also true for any other kind of consumer product that you’re trying to improve.
Max [01:33:07]
It’s interesting that you’re talking about all these things that are important for making good evals that are exercising judgment and exercising taste. Because I think there’s a popular perception that creating data is kind of very low taste. And you have these very low-skilled contractors or whatever making data. Why do you think this perception exists?
Ege [01:33:30]
I think part of it is that the taste is more important the higher level the decisions you’re making are. So, for example, definitely the researchers at the labs who are ordering the specific data products need to have a bunch of taste. Now, maybe you think the people who are creating the data products don’t need that much. But I think over time, even that’s changing, because the models are getting better. So you sort of have to think about, how do the models get better? Well, we train them only on capabilities that we have evals or training data for, that is, we need RL environments for training and we need evals to know if they’re getting better or not. So that’s really the only way the models can get better, other than just relying on the magic of scaling, where you just make the model bigger and it just happens to get better. That does help, but it’s a really expensive way of making the model better. In practice, if you want to be competitive, you want to have a more efficient way of making the model better than that. So you need to rely on all of these data products, and the more data you have of this kind, the better you can make the model. So the process that results is you improve the model first on the capabilities that it is easy to create data products for. And those are precisely the products where you don’t need a lot of taste or judgment, where it’s extremely straightforward to measure what is right and what is wrong and get a lot of volume of data and then train on it. But then the models just get too good at that sort of thing. So then the lacking thing increasingly becomes harder and harder to measure, just because you’ve trained on all the easy stuff. Over time, the level of spending on training these models is going up. So you’re saturating all of the easy-to-measure things, and you’re only leaving behind the hard-to-measure things. So why can something be hard to measure? 
Well, maybe it’s just really hard to come up with the right design for the eval. Maybe it’s easy to produce it once you have the design, or maybe it’s really hard to come up with the design. And that would suggest that maybe you can still have people who are producing data who are lower taste. It’s just that the person coming up with the design has to have taste. But I think that’s only one reason why something can be hard to produce. And I think it’s not the most common reason. The two other reasons why it can be hard are that, as I mentioned earlier, it can either be that the scenario which would expose the lack of capability is very difficult to containerize and simulate in a good way. And that becomes more and more true the more sophisticated the environments get. If all you have to do is make a simple code change to a repo and then you push it to a local GitHub clone and then we run some unit tests against it, that’s very easy to containerize. There isn’t really anything difficult about that. Maybe the only problem is that you have to figure out the build environment setup of each different repo or something. But that’s not a very hard problem. But now imagine that the problem you’re trying to solve is, I want realistic data for a company which has dozens of different services internally, and they have some weird incident where different services interacted and something went wrong, and the model was asked to debug that, and it just made some stuff up and took bad actions and destroyed some data trying to fix the incident. 
Now, trying to containerize this in a good way is much harder, because you not only have to place this entire setup inside a virtual machine, maybe, or set up virtual machines where it can be deterministically simulated over and over again, but you also have to make sure that the way you have containerized it is fair, so the model has enough affordances to do what it needs to do, but it doesn’t have more than that, so it can’t hack the environment, and also it has all the information that it needs to have to solve it, but not too many hints that just make the task completely trivial. So often there is this big problem of, how do you set up this entire environment, and how do you get people to iterate on it until it’s in a good enough shape that you think it’s a good task? And even if you’re basing it on real-world data, it’s just an extremely hard problem to take a real company and try to containerize them and try to build a simulation of their processes in a way that allows this task to be run deterministically. So this is one thing that’s really hard. I think in practice my guess is what people do is they try to hack this by not actually reproducing the entire environment faithfully, but maybe just doing like a single-turn training where you have the transcript and then you’re just training the model on, okay, in the next turn do you take the right action or not? Or maybe in the next few turns we start you from a prefill and then in the next few turns we have a model simulate the responses to your tool calls or something, instead of us having to fully reproduce the entire state of the environment. And that’s a proxy. I don’t think that works particularly well for an eval. And this is usually called a world model. Yeah, this is a world model. And I think this is something you can do, but you can understand why it’s dangerous here, because if the world model is bad, then you’re training the model on a bad signal. 
And also the longer you go in a trajectory with a world model, the more you will diverge from what would have actually happened in that scenario. So I think this is what is typically done for training, just because you can collect this data at much higher volume, but it’s not really good for the purpose of building an eval.
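The point about long trajectories can be illustrated with a toy calculation. Assume, purely for illustration, that each simulated tool response independently matches what the real environment would have returned with some fixed probability; neither the independence assumption nor the 0.98 figure comes from the conversation. Under that assumption, the chance that a whole trajectory stays faithful shrinks geometrically with its length.

```python
def faithful_trajectory_probability(per_step_fidelity: float, steps: int) -> float:
    """Probability that every simulated step matches the real environment,
    under the toy assumption that steps are independent."""
    return per_step_fidelity ** steps


# Hypothetical per-step fidelity of 98%, over increasingly long trajectories.
for steps in (1, 10, 50, 200):
    p = faithful_trajectory_probability(0.98, steps)
    print(f"{steps:>3} steps: {p:.3f}")
```

Real world models fail in more correlated ways than this, for example by contradicting their own earlier responses, but the toy model shows why single-turn or few-turn setups are far more forgiving than long-horizon ones.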
Max [01:39:13]
Sorry, what exactly does it look like to collect data for a world model? Why is it easier?
Ege [01:39:18]
Oh, you’re not directly collecting data for a world model. You’re collecting the transcripts where someone asks the model to, for example, do a migration or something, and then there is maybe a critical turn in that transcript where the model took some destructive action. And if the user flagged that action as bad, and you have that signal in the transcript, maybe they got mad at the model or something, then you can go back to that turn and you can train the model to not take that action, basically. And you can do that across a lot of different transcripts and a lot of different turns, and you just hope that over time that just makes the model better, and you have a lot of data for that. So I think it’s plausible that that can work. But it’s not a very long-horizon task. It’s very hard to construct long-horizon tasks with world models, just because the world model will get increasingly incoherent and start giving responses to the model that are inconsistent with each other, and then the student model, the model doing the task, will get very confused about what’s going on. So that’s one reason why creating a task might be hard. The other reason is that it is just very hard to come up with the scenario and the grading idea for specific tasks. And sometimes even if you have a high-level design for an eval, a researcher is like, oh, we have this problem that our models are bad at understanding user intent, so we want to make them better at this. We want to have a bunch of tasks where the user gives some kind of ambiguous instruction to the model, and there is a correct answer, it’s not like a user preference, there is a correct answer that can be figured out from the context if the model is aware enough and intelligent enough, but it’s not obvious, because if it’s obvious then the model will already be doing it. And that description is a kind of high-level description, but okay, how do you actually create tasks which meet all of those criteria? 
Because what will often happen is that someone will think that a particular behavior was bad, like they gave an ambiguous instruction and the model interpreted it in one way, and they will be annoyed by that and they will be like, oh, the model should not have interpreted it like that. But that’s not necessarily true, because the model doesn’t know anything. So if your instruction was genuinely ambiguous, and the model was not given any signal, for example, that it should consult with you about some ambiguity, or there’s no way of finding out about the ambiguity, or the instruction you gave it is actually naturally interpreted in a way you didn’t intend, if we train the model to respond in the way you would like to such a request, then we might make the model worse for other users who would give the same instruction and mean something else. For example, some people might be okay with the model pushing to the main branch of their repo without asking them for permission, and some people might not be okay with that. And then if you’re one of the people who is not okay with that, you might be like, oh, the model pushed to main without my permission and that was bad, shouldn’t have done that. But then if you train the model to not do that, then other users will get mad who are like, why is the model just always asking me for permission to do it? Why can’t it just do it? And then they will feel the model needs too much hand-holding. So you can see in this kind of situation, the correct behavior is very context dependent. It’s not obvious. A single change of a detail in the environment can alter what is correct for the model to do, and it’s very difficult to set that up in the right way, especially if you’re trying to get diverse data about examples of this kind. It’s not that every example in an eval like this about user intent will be about pushing to main. That’s one thing. Some other things might be about implementing a new piece of software. 
Some other things might be about communicating something to the user. For example, I asked the model to investigate something, and then I might naturally expect it to tell me about something if it notices, but not everything. I don’t want the model to tell me everything it notices, just some things that I think are important. And then there’s this decision of, okay, what should the model communicate and what should it not communicate, which is again the same kind of decision which is very hard to make without having good taste and judgment. And if every single task in your eval is like this, then it’s just very hard to scale the creation of that kind of thing while working with low-context and relatively lower-scale contractors, because they will just not make the right calls, and then you will train the model to be much more annoying to work with and take more destructive actions, or it will be too conservative and ask you for permission to do things that are very trivial, or it will be too aggressive and take destructive actions when clearly the user would not have wanted them to. So this kind of issue of how should the model navigate these very complicated situations in which there isn’t necessarily a single correct answer, that kind of thing is very difficult to build an eval for.
Max [01:44:52]
Yeah, and I think this is in part because it’s just very difficult to know when something is ambiguous or not as a human, because as a human using the model, you just have all of this context that you know about, and it’s very easy to accidentally say something which is ambiguous and have this illusion of transparency where you think, oh, well, obviously I mean this, but in fact the model knows nothing. It’s dropped into a context it knows basically nothing about, and it just has to cope with that. And you really have to learn this skill of figuring out what the model actually has access to and modeling what it can reasonably infer from the circumstances. And this is just a very unusual skill which contractors are not going to have.
Ege [01:45:39]
Yeah, and also it’s a very difficult skill to train, because you don’t get good feedback on it. The way you might get feedback is that someone else who has this skill reviews your output or something. It’s not like, for example, making your environment easy to reward hack, in which case it might be easier to get signal, because you can ask the model to actually hack it, and then the model might succeed. So then over time you might be able to learn this by looking at the ground-truth signal of, okay, was the model able to hack the environment to get a perfect score or not? But in this case, if you’re wrong, how do you even find out? The actual feedback cycle is very long, because the model gets optimized against these evals or other environments, and then it’s deployed, and then after deployment users complain, but then their complaints are usually non-specific. Users can be annoyed by one thing and then complain about something else, that’s very common. Or the user doesn’t really know what thing about the model is annoying them, they just feel it’s annoying to work with the model. And then you have to figure out, okay, why are people suddenly saying this model is annoying? And then it will be very not obvious to you as a researcher what went wrong. So that feedback has to get all the way back to the actual contractor, and that’s just very hard, a very long feedback loop. So I think the skill is also very difficult to train, which is generally true for things that are about taste or judgment. They’re just very hard to train, as opposed to things like, can you do competitive programming problems or something like that, where you get very clear feedback. If you can’t do it, you either fail to solve the problem, you submit a solution that didn’t pass test cases or that was too slow, and you can just do a lot of those problems and get a lot of feedback, and then that allows you to get better. But taste is just
Max [01:47:29]
something that’s much harder to get better at. I mean, for this specific thing of interpreting ambiguous instructions, why can’t the model that’s doing the task just tell you? Because if I get ambiguous instructions and then do it the wrong way and then get a bad review or something, I will know and I’ll be annoyed. But why can’t models just tell you when you have these ambiguous instructions? I think sometimes the models just don’t
Ege [01:48:02]
understand it’s ambiguous. Or they think that it is ambiguous, but there is a natural interpretation that would be correct, and raising it to the user would be too annoying. So the models always have to make this trade-off. As I said, if an employee asks you about every single decision in their work that is technically ambiguous, then they would create way too much overhead and they would not really be very useful. If they don’t ask you about anything, that’s also bad. So in some sense they have to have this judgment of what is important enough to escalate. And you might hope to get training signal on that, on when does the user get mad because we did not ask the user a question. But the issue is, how do you even know? If all you have is a bunch of transcripts from a bunch of users, it’s just very hard to tell that a downstream user annoyance was actually caused by you not resolving an ambiguity correctly earlier. That’s not a very easy judgment to make in a lot of situations. And another thing that can happen is that the model can just not realize a fact about the environment that makes the user request ambiguous, or a fact about the task. So for example, if you ask the model to build some app and run a bunch of tests against it to see if they all pass, the model might just choose one interpretation of what it means to build the app. Often apps have different build configurations, and whether tests pass or fail can depend on that. But of course, if you ask that question every single time someone makes such a request to you, then that would be too annoying. Every single time I tell a model, can you build this, and every single time it told me, oh, can you tell me exactly how to build it, I would be kind of annoyed. I would want that to be something the model figures out, and I would want it to ask me if there is actually a materially relevant ambiguity that changes the answer to my question. So it has to have this judgment. 
It has to figure that out by exploring the code base, and then if it notices that, then it should come to me and say, oh, actually, I just noticed this ambiguity, can you clarify? And that’s much harder to do than just every time asking the user, can you clarify what you said because it was ambiguous, or just never asking the user. So I think specifically the skill of escalating the correct things in the correct way and not escalating the bad things, that is hard to train.
Stephen [01:50:43]
I think something important for tasks like this is that if you have bad taste, you might introduce more than just noise into the grader, right? Noise in your eval graders is, to some extent, acceptable, because you can just scale the number of eval tasks that you’re running and average over those. But if there’s a systematic bias, then scaling the number of eval tasks doesn’t help with that. And this can happen very easily, not necessarily with bad taste, but for example, if people’s taste is off distribution, then essentially their graders will just be training the model towards the opposite of what most users
Ege [01:51:19]
actually want from the model. I mean, I think that’s not even the only issue. So another issue is just that you don’t grade the right behavior. The behavior you reward, there is a way to achieve it that a user will be very unhappy with. Even if in some sense your taste as a user, as the person creating a task, is in distribution, you can just end up writing the grader in such a way that the easy way to get a high score is to do a behavior that a user would consider very bad. Because you have to think about everything the model could do and how it would get graded, and these behavioral things are kind of subjective. So often you need to have some kind of agent grader to judge these things, because they’re too subtle for an automated grader, a procedural grader, that’s just a program, to judge. And then it’s just very easy to not give it sufficiently clear instructions about how to judge behaviors that you haven’t exactly seen in distribution. And then a next-generation model could be iterated against this eval and get a high score, but the way it got a high score could be exactly the same as the way a previous-generation model might have gotten a high score on SWE-bench, which is that it just became really good at understanding which unit tests it will be tested against and then doing something kind of hacky which just passes the unit test. That’s always the danger with this kind of thing: you’re going to train the model to have this behavior which is superficially good but actually not good, it’s actually something that a user would still be annoyed by. It’s just superficially good in a way that looks good to your grader. And that’s much more dangerous, because it’s going to be much more common than people being literally
Max [01:53:13]
biased in the wrong direction, yeah. And this is just Goodharted, to be clear. You’re overfitting to this bad grader and you just get unintuitive and bad results from it. In fact, I think there’s something probably even more perverse than guessing the right unit tests with SWE-bench Verified, which is that it’s pretty plausible that the only way to improve from current scores is to just already know what you’re going to be tested on, from having just memorized it. Yeah.
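A minimal sketch of the noise-versus-bias point Stephen raised above, assuming a toy grader whose score is the true score plus a fixed bias plus Gaussian noise. The function name, the numbers, and the additive model are illustrative assumptions, not taken from any real eval. It suggests why averaging over more tasks helps with noise but leaves a systematic bias untouched.

```python
import random
import statistics


def mean_grader_score(true_score, n_tasks, noise_sd, bias, seed=0):
    """Average of n_tasks grader scores, each = true_score + bias + Gaussian noise.

    Hypothetical toy model: real graders are not additive like this.
    """
    rng = random.Random(seed)
    scores = [true_score + bias + rng.gauss(0.0, noise_sd) for _ in range(n_tasks)]
    return statistics.fmean(scores)


TRUE_SCORE = 0.60  # assumed "real" quality of the model on this eval

for n in (10, 1_000, 100_000):
    noisy = mean_grader_score(TRUE_SCORE, n, noise_sd=0.2, bias=0.0)
    biased = mean_grader_score(TRUE_SCORE, n, noise_sd=0.2, bias=-0.15)
    # The noise-only error shrinks as n grows; the biased error stays near 0.15.
    print(f"n={n:>6}  noise-only error={abs(noisy - TRUE_SCORE):.4f}  "
          f"biased error={abs(biased - TRUE_SCORE):.4f}")
```

Under these assumptions, more tasks only buy a tighter estimate of the wrong number when the grader is biased.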
Ege [01:53:43]
Because it’s a public eval, even though people try to exclude these evals from the training sets of models, that exclusion is not always very reliable. I think when you test the models, they often do just have it memorized. Also, there’s just a lot of discussion of SWE-bench Verified problems online. There are people who post on Twitter about problems that they have manually investigated and found to be broken. So even if you exclude the repositories where the eval’s full contents are stored, the model can just see these Twitter threads, and it’s just really hard to filter out of your entire training data any examples at all that might be related to this benchmark. So I agree. We were seeing for a long time that the scores for SWE-bench Verified were leveling off around 80, 81, and then recent Anthropic models have gotten 88, 92, whatever, they’ve just gotten way better, and I think the most plausible explanation is that they’re memorizing the benchmark in some way.
Max [01:54:53]
I mean, they do try and mitigate this. In the Mythos system card they describe a technique that they use to try and figure out whether or not the benchmark was contaminated.
Ege [01:55:09]
I think that technique is not very good. The technique they used is they gave the problem to another LLM judge and they asked it, okay, on a scale from one to four or something, how likely is it that this was memorized, which I do not think is a particularly good approach. That’s just relying on the LLM being well calibrated about this question, on which it has no training and no particular expertise. It’s this ability of LLMs to model themselves, which they’re very bad at.
Max [01:55:39]
You can also see this, I don’t know if you’ve seen it while using a model: you’ll ask it to do some large change and it’ll say, oh, this would take me three months, I can’t do that, and then you’re like, okay, we’ll do it anyway, and then it takes 20 minutes or something. It’s just not really able to model itself. Even humans are somewhat bad at this, but LLMs are really bad.
Ege [01:56:08]
Because as a human, you see other humans and you have past experience of yourself to draw upon of how long you took to complete various tasks, but the LLM doesn’t have that because it’s not trained on that. So before the LLM exists, it literally can’t know how long it would take it to do something. It might know how long some other LLMs in its pretraining distribution would take to do something, but even that’s very optimistic, because usually the pretraining cutoff is many months, sometimes a year, before the LLM got released. So its knowledge is going to be just outdated, because it will be thinking about the reference point of LLMs of earlier generations, which were much worse. There was actually a similar thing with a recent open Erdős problem that got solved, which I think was in combinatorics or number theory, I don’t actually remember the field, well, I think it was number theory, where if you ask an LLM, you ask Claude, how likely is it that this problem would get solved by an LLM in the next two months or whatever, it gives an estimate like one percent. It’s very pessimistic. Even if you say in three years, it’s still like, oh, you know, ten percent, very low numbers. It doesn’t believe that it can do it. It’s very funny. I think Anthropic tried to get Mythos to solve it, and obviously OpenAI got their own internal model to solve it, and some people have gotten GPT-5.5 Pro to solve it, and the models don’t believe that they can do it, because in their training data models couldn’t do that, so they’re like, oh, I am an LLM, so I probably also can’t, but actually they can, but of course there’s no way they would have found that out. So I think they’re very miscalibrated about their own abilities, which is why I don’t trust that particular filter, because it requires them to have a very good model of what they would have memorized. 
This is also a reason why it’s just very hard to scale the creation of good eval tasks in a synthetic way, because the model won’t know what tasks are actually hard and it won’t know how to design the scenario in a way to avoid leaking information to the student model. It’s just bad at knowing what is hard, what is useful, what is fair for the student model to do and what is unfair. So if you try to do it, then you just end up with tons of bad tasks. I actually wonder how other data vendors who work with large numbers of contractors deal with this, because how do they avoid the contractors just submitting bad LLM-generated content? LLM-generated writing and content is just so widespread now. I just see it everywhere, you know.
Max [01:59:04]
And you can’t use an LLM to verify, you can’t even use another contractor to verify, because they might just be using an LLM. So it just seems like this very hard problem of figuring out which contractors are actually doing good work and ensuring that they keep doing good work. I would love to talk to somebody from Mercor about how they handle this or something, but it seems like a very difficult problem. I have no idea how to do this. To zoom out for a second, I think it’s interesting that almost none of the conversation that we’ve been having is actually about coding directly. We as a company are really focused on creating data to make models better at software engineering, but when you think of software engineering you probably don’t think about the stuff that we’re talking about. Why is that? Why do we have to be focused on all these things about modeling users and interpreting ambiguity and learning from previous contexts and all of these different things?
Stephen [02:00:13]
I think part of it is that, like we’ve been talking about, there is lower-hanging fruit in software engineering, and this also happens to be the type of work that a non-technical person might associate more with what a software engineer does. For example, given a very detailed spec, a model is already basically superhuman at implementing the feature. A human could never implement it as fast as an LM does. Or at tracing code in a code base that you’ve never seen before, an LM will just grep very efficiently and read lots of files and immediately understand, even without running the code, the behavior of this system much better than a human could in a similar amount of time. So these capabilities are no longer the bottleneck, which is why when you need to make evals now for software engineering, you no longer want to focus on these capabilities. To be clear,
Ege [02:00:56]
models could always get better, even better at these capabilities than they are now. It’s not like they’re perfect oracles at tracing code through reading files or whatever. But at any given point in time, for real-world use, you need a lot of capabilities to exist in the system at once, and at any given point some of those capabilities will be lagging behind relative to the others, and some of them will be the bottleneck to real-world use. For example, a model that is superhuman at anything to do with text but that can’t see images at all, it’s not really worth it to push the model to be even more superhuman at doing text-based tasks, because it’s already really good at that. At that point the users will just be complaining about, I just can’t show it a screenshot of, I have a problem and I can see it on my screen, but I can’t communicate it very well to the model because it can’t see the visual output that I’m seeing on my screen. At that point, for example, the lack of vision might become a very important bottleneck, and then you need to solve that. So I think it used to be that the model’s lack of ability to be agentic and do a lot of tool calls over long context and maintain coherence and maintain the awareness of what it was working on, and work on a single task for a long time to try to improve its work, was a bottleneck. Models would just give up, they would forget what they were doing before, they would start flailing and doing random things that don’t actually advance the goal that was given to them at the start of the context. But now they’re actually much better at this, so now I don’t think this is the bottleneck anymore. I think the bottleneck now is, again, all these other things that we’ve been talking about, and also to some extent something that we haven’t talked about, which is just the willingness of models to make stuff up, which is also a bottleneck. That makes them untrustworthy. 
If an LLM gives you an answer, it is very hard to tell if the answer is actually true or not if you don’t have a ton of relevant knowledge and expertise, because a wrong answer from an LLM sounds about as superficially plausible as a right answer. There isn’t an obvious tell that the answer is wrong, and so that just makes you trust the LLM less in all situations. It doesn’t really matter what situation, because you can’t discriminate the relevant situations where it is more likely to be lying. This is also true, for example, when you ask the LLMs to write math proofs which you’re not putting through a formal verifier. The LLM, even if it comes up with a wrong argument, the wrong argument will just seem very plausible, and as the LLMs get better, their mistakes get harder and harder to notice but not necessarily much less serious. That’s another capability we haven’t talked much about, and it’s also very hard to design an eval for that, because what’s happening now is that people are designing evals for that and they are optimizing against it, but the effect that’s having is not making the model generally more honest or well calibrated. It’s just pushing the models’ make-stuff-up tendency towards cases where it’s much harder to tell, which in some cases might be even worse, because you might trust the model more because it does give you good answers in simple situations where it’s easy to check if it’s telling the truth or not, but then in cases where it’s more subtle and complicated, it just reverts back to its old behavior from a couple of years ago when it would just constantly make stuff up. LLMs still do that all the time, and they will do it in a way that no human ever would. 
For example, they will see a short version of a commit hash in context somewhere, and then they will need the long hash as input to some other thing they need to do, maybe they need to write it to a file, maybe they need the long hash for some purpose, and they will just make up the continuation of the hash. They just completely make it up. And then when you ask them, why did you do that, they’re like, oh yeah, I made that up, I didn’t check if that’s true. A human would never just make up a continuation for a commit hash. It’s just not something a person would do. You would immediately realize that it’s a stupid thing to do. But a model just keeps doing that, and I think this capability, in some sense it’s gotten better, because they have extended the training to increasingly harder cases, but as the models get better you want to use them for more complicated tasks, so the incidence of the LM saying a bunch of nonsense to you might not have actually decreased by that much, because as the model gets better at the easier things, you start using it for the harder things, and then you go up to the point where the LM is making up so much stuff that it’s very annoying to manage and know if it’s telling the truth or not. I feel like this is another very important bottleneck for any domain where it’s really important for the LM to be reliable. Another funny example of this we’ve seen, which I remember now, is we sometimes use LMs to go over transcripts of meetings and then summarize them, and sometimes in the meeting one person will have joined from a meeting room where they’re not joining from their personal account, so their name is not visible in the transcript, it just says meeting room one or something, and then the LM will just hallucinate a name for that person. For example, Max was doing this for some meeting and it just started calling him Michael or something. There was no reason, the name is just completely made up. A human just wouldn’t do that. 
Even if a human doesn’t know something, or even if they would make false claims, they would not make something up in this way. This is just too simple, almost. It’s very hard to imagine a circumstance under which a human would do that when there was no pressure for them to do that. The LM could have just said, oh, whoever this person is, or something, but it doesn’t do that.
Max [02:07:31]
One way of framing this is that there’s kind of this adversarial game that’s being played between the model that’s trying to come up with plausibly correct bullshit and a verifier that’s trying to check all the bullshit, and it’s kind of interesting that in almost all of these circumstances it seems like the generative side wins. Even, as you mentioned, math proofs, unless it’s literally something that can be checked with a very simple program, even if it’s a line-by-line math proof, it still will manage to make some plausible, hard-to-catch mistake in the argument. Do you know why this is?
Ege [02:08:18]
There’s an intrinsic property of a lot of these spaces, which is that it’s very hard to catch. You can make errors just extremely subtle and hard to detect, and you can just add words which suggest that, if you’re giving an ordinary answer as in the math proof, you can just make yourself sound appropriately uncertain or something, and it’s just very hard for a verifier to detect an extremely subtle error. What’s happening, of course, is that people are using LMs to verify the outputs of LMs, and then you can try training on that signal, but because this task has a very high skill ceiling to be able to discern bullshit, at some point the verifier’s skill is just exhausted, and the best way to get a high score becomes to trick the verifier into thinking that you said something true when you didn’t. And then that is very bad, because then there’s a case where you’ve Goodharted against this verifier too much, and what is actually a good response and what looks good to the verifier come apart. That’s very dangerous, because what looks good to an LM verifier that is reasonably intelligent also looks good to a human. So then you end up in a situation where the model starts producing really plausible-sounding bullshit in a way that a human would not do. Usually human math proofs, for example, have this property that if they’re wrong, usually it’s very easy to tell they’re wrong, because they’re written by someone who’s just incompetent or who doesn’t know what they’re doing, so it’s very easy to tell the math proof is wrong from superficial signals that are very easy to detect. But for an LM that’s not true, because they’ve been trained in exactly this way, partially, where they’re trying to fool a verifier who is also an LM, and then the output that selects for is producing something that seems extremely plausible but just has a very subtle error somewhere. 
Then you give that to a human, and the human also can’t see the error, even if it’s something really dumb, because it’s just covered up and obscured in such a way that it’s very hard to detect. So why does this happen? Why is it so hard to catch these kinds of mistakes? I don’t really have a good answer. I do think reviewing a math proof for issues of correctness is really expensive. There are a lot of errors that are discovered in math proofs. Probably a majority of math papers which have long proofs have just errors in them. Now, that is usually not a problem when the papers are written by a human, because humans usually have this good intuition for what mathematical arguments will survive and what ones won’t, so usually even if a human makes an error, it’s usually almost always recoverable. It doesn’t make the proof invalid, it doesn’t invalidate the entire proof, it’s just an error that can easily be patched. That’s almost all human errors in math papers. But LM errors are not like that. LM errors are, because the LM has been trained so hard on the goal of succeeding at having appeared to prove something, it just produces an argument which seems extremely plausible except for a really dumb error, and then that error is just load-bearing. So if that assumption is false, the entire proof falls apart. That’s why LM-made errors are so much worse than human-made errors. I think this is probably also true in other domains, where humans have this intuition for how a system works, or have a domain where they have this domain expertise which allows them to tell which solutions are likely to work and which are not likely to work, even if they can’t get all the details right, while for an LM, if they give a bad answer, it will be that the error they have made is load-bearing in the bad answer. That’s much more common in LM responses than it is in human expert responses to this kind of thing. Why is it that you can cover up these errors so well? I don’t know. 
In the ideal case, if you had a verifier that people had way more resources for, then of course you would expect this to work, but I think there might be some kind of competition here where, when the verifier and the generator are about the same level of capability, the generator can just put in an error in this adversarial way, and then the verifier needs to check everything, while the generator, it’s sort of like the needle in the haystack problem: the generator can put the needle wherever they want, and it’s very easy to do that, while the verifier has to check everything to see if there is a needle. So it’s this kind of problem where finding an error is much harder than strategically placing an error in a really obscure point of an argument, and I think that’s probably why this dynamic exists when the generator and the verifier are at a similar level of capability, which is usually the case. If you want your frontier model to get better at something, then usually of course you don’t have access to a verifier that is better than your frontier model, because that’s the best model you have. I mean, you do if you’re a trailing lab. Yeah. Nice place to be in. That’s right. So in fact they do do this. Trailing labs often do do techniques like this, and that is part of what has allowed the gap between frontier models and open source models or trailing labs to not widen over time, because there’s always this forcing function that they can always train on the outputs of the frontier models. Even if they can’t train on the reasoning traces or whatever, even if the labs try to hide thinking tokens, it doesn’t really matter, because you can still get the final output, you can still use their tastes and judgments and intelligence to judge outputs, and that’s really all you need. You don’t really need to see the thinking tokens, you can just do your own RL once you have that signal, it doesn’t really matter. 
So once you have that judge as a thing you can rely upon, but of course a frontier lab doesn’t have that, so they are, for example, much more bottlenecked by this lack of good data. The fact that their verifier and their generator are about the same capability causes a lot of these synthetic pipelines to just break down, because they can’t overcome this problem. So they have to find other verifiers. Either they have to, in the case of math, use Lean or whatever, just formal proof-checking systems, just programs, but that doesn’t work in a domain which is very taste-based, judgment-based. You just can’t do that. And those domains are often also very bad because spending more inference compute, spending more reasoning compute, doesn’t even improve your performance by that much, which is also what you would expect for a human. If a human just has bad taste, then just giving them a hundred times the amount of time to think is not going to suddenly make them have much better taste. That’s not really the way taste works. As opposed to, if a human got a math problem wrong when you give them one minute but then you give them one hour, that has a big impact. They’re going to make way more progress on the problem if you give them one hour. So that’s, I think, exactly how reasoning models work right now. If you give them more reasoning tokens, then they get better at the kinds of things that humans would also get better at if they were given more time to think, and they don’t really get better at the kinds of things that humans wouldn’t get better at.
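The needle-in-the-haystack asymmetry described above can be sketched with a toy simulation. Everything here is an assumption for illustration: the argument has a fixed number of steps, the generator hides exactly one error at a random step, and the verifier can only afford to check a limited number of steps chosen uniformly at random. The name `catch_rate` is hypothetical.

```python
import random


def catch_rate(n_steps, check_budget, trials=20_000, seed=0):
    """Fraction of trials in which a budget-limited verifier finds the one bad step.

    Toy model: the generator makes a single cheap choice (where to put the error),
    while the verifier pays for every step it inspects.
    """
    rng = random.Random(seed)
    caught = 0
    for _ in range(trials):
        bad_step = rng.randrange(n_steps)                    # generator: one choice
        checked = rng.sample(range(n_steps), check_budget)   # verifier: pays per step
        caught += bad_step in checked
    return caught / trials


for budget in (1, 10, 50, 100):
    rate = catch_rate(100, budget)
    print(f"verifier checks {budget:>3}/100 steps -> catch rate ~ {rate:.2f}")
```

In this toy the catch rate tracks the fraction of steps checked, so the verifier's cost grows with the length of the argument while the generator's cost does not. A generator that deliberately targets the steps a verifier is least likely to inspect would presumably do better than the random placement assumed here.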
Max [02:16:13]
Which I think makes a lot of sense, in part because the reason that you can do better on a math problem if given an hour versus given a minute is that you can go back and check your work and see, oh, I got this wrong, and then try again. Whereas you really can’t do this if you’re doing something that’s very high taste. It’s just much harder to check, oh, I wrote this short story, is it good? There’s no procedure you can do. Or, I had this investment thesis, you know.
Ege [02:16:45]
And you can lose a bunch of money and then realize you were wrong, but that’s sort of too late.
Max [02:16:49]
Yeah. I also think it’s kind of counterintuitive, at least to me, that when we’re talking about this game of verifying versus generating, it seems to be easier to generate, because in computational complexity theory it’s the exact opposite. I don’t really know why this is, and it’s kind of interesting that it seems to be this way for real tasks, unlike in these theoretical arguments about abstract computers.
Ege [02:17:23]
I mean, I think it is not always true. There’s a large class of problems for which indeed generating a correct solution is a lot harder than verifying if a given solution is correct or not. But I think there are problems where that would not be true. I can’t think of any concrete examples right now, but I think the problems where verification is very easy just fall into a specific complexity class of problems, and there are just problems outside of that class.
Max [02:17:51]
Well, I think in complexity, right? Like you could never have the verifier be more complex or more expensive than the generating process, because you can always just run the generating process and then check whether the result is the same. And that’ll always be the same complexity.
Ege [02:18:08]
I think the issue here is that you’re assuming you have access to an ideal verifier. In our case, we don’t.
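Max’s reduction can be made concrete with a minimal sketch. It assumes the generator is deterministic and that each problem has exactly one correct answer. The helper names and the sorting task are illustrative choices, not something from the conversation.

```python
from collections import Counter
from typing import Callable, List


def verifier_from_generator(
    generate: Callable[[List[int]], List[int]],
) -> Callable[[List[int], List[int]], bool]:
    """Build a verifier by re-running the generator and comparing outputs.

    Only valid when `generate` is deterministic and the answer is unique.
    """
    def verify(problem: List[int], candidate: List[int]) -> bool:
        return generate(problem) == candidate
    return verify


def verify_sorted_directly(problem: List[int], candidate: List[int]) -> bool:
    """A cheaper dedicated check: same elements, in non-decreasing order."""
    in_order = all(a <= b for a, b in zip(candidate, candidate[1:]))
    return in_order and Counter(problem) == Counter(candidate)


# Hypothetical example task: sorting a list of integers.
verify_by_rerun = verifier_from_generator(sorted)
print(verify_by_rerun([3, 1, 2], [1, 2, 3]))  # True
print(verify_by_rerun([3, 1, 2], [3, 1, 2]))  # False
```

The re-run verifier costs one generation plus a comparison, which is the sense in which verification is never more expensive than generation. Neither check is available when there is no trusted generator and no exact checker, which is the situation Ege describes for taste-based tasks.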
Max [02:18:14]
Oh yeah. Obviously there are a bunch of ways in which this is disanalogous to the real world.
Ege [02:18:20]
But I think that that is the important issue. If you had a perfect verifier, I don’t think there would be a problem. The issue is that our verifier is only a proxy. And then if that’s the case, I don’t know if it is still true. Basically, you can imagine a theory of this actually. I don’t know if this has been studied, but imagine that you are in this generator versus verifier situation, but you don’t actually have access to the verifier. For some reason you can only run a bad approximation to it or something. And then what happens? I think probably the results might be quite different.
Max [02:18:58]
I also think it’s interesting to think about the more general adversarial game of using an LLM as your verifier. Like, how does it fail? For some arbitrary task, take GBA Eval for example, how does it work to use an LLM as the judge? If you just tell an LLM, grade this on a scale of 0 to 100, or maybe you do pairwise comparisons, maybe you say, compare these two results, which is better, and you don’t give it very specific instructions on what to do.
Stephen [02:19:34]
Yeah, so I think LLMs would fail horribly if given this task, right? And so there’s a few bottlenecks here. The part that they would be okay at is testing whether the code compiles, whether it’s able to load games at all, or if it’s just a black screen, right? But then, remember that these LMs are not capable of generating realistic input sequences to play the games in the first place. So even given two emulators and given ROMs, they won’t be able to actually play the games on each emulator and compare them. And then even if you take another step and you give it this affordance, which is that you give it pre-generated human replays, right, and ask it to compare images, then the bottleneck becomes the visual capability of the model. The model can look at two images where a human would obviously say that emulator B produced a much better result than emulator A, and the model could just say the opposite because it just lacks visual abilities. And it’s something that’s similar to, for example, the UI of a website. And this is why models aren’t just superhuman at UI. If you could just show two screenshots to a verifier model and ask it to compare them, and that gave you a perfect signal, then it’d be very easy to train models to make better UI. But this is a lack of visual capability.
Max [02:20:42]
Yeah, this is kind of interesting, because I think there’s kind of a popular conception that computer vision is basically solved. But at least when you’re trying to do these kinds of tasks, it very much is not. LMs are not very good at comparing two images in the same way that a human might. I also think it’s worth thinking about the other side of this game, which is improving a generative process via RL or via the researchers optimizing the training methods. Because I do think that you can get some signal. If you take a frontier model and you use it as a verifier for a trailing model, you would get some signal. But then when both models are the same capability, the generative process has already been trained to fool the verifier. Or not fool it, but just to do whatever the verifier is looking for. And so you can’t really ever push forward the frontier of capability by just using a very trivial verifier.
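Ege’s question about what happens when only a bad approximation of the verifier is available can be explored with a toy simulation. Everything here is an assumption made for illustration: true quality is drawn from a standard normal distribution, the proxy verifier sees true quality plus independent Gaussian noise, and the best of a fixed number of candidates is picked by proxy score. It models a noisy verifier only, not a generator that has been trained against the verifier.

```python
import random
from typing import Tuple


def select_with_proxy(n_candidates: int, noise: float, rng: random.Random) -> Tuple[float, float]:
    """Pick the candidate with the highest proxy score.

    Returns (true quality of the pick, best true quality available).
    """
    true_quality = [rng.gauss(0.0, 1.0) for _ in range(n_candidates)]
    # The proxy verifier only sees a noisy version of the true quality.
    proxy_score = [q + rng.gauss(0.0, noise) for q in true_quality]
    pick = max(range(n_candidates), key=proxy_score.__getitem__)
    return true_quality[pick], max(true_quality)


def mean_picked_quality(n_candidates: int, noise: float, trials: int = 2000, seed: int = 0) -> float:
    """Average true quality of the proxy-selected candidate over many trials."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        picked, _best = select_with_proxy(n_candidates, noise, rng)
        total += picked
    return total / trials


for noise in (0.0, 1.0, 3.0):
    print(noise, round(mean_picked_quality(16, noise), 2))
```

With zero noise the proxy always picks the truly best candidate. As the noise grows, the pick drifts toward an average candidate, so extra sampling from the generator buys less and less real quality.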
Stephen [02:21:40]
Yeah, one way you might be able to make a little bit of incremental progress is if you just spent much more compute on the verifier than the generator. Because the generator has a lot of responsibilities to think about when it’s attempting this task. And if you have the verifiers, you sample a hundred verifiers in parallel, and each one is told just to generate an input sequence for a particular game. For example, I tested this as well. If you give some of the frontier models five or six hours and their job is to generate an input sequence that makes it as far in a platforming game as possible, they’ll usually be able to get to the first platform of the game. Now, they won’t be able to get to the second platform. But the fact that they can get to the first platform is more than the generator can do while it’s testing its own work, just because more time was given. So in this sense you can make some incremental progress, but obviously there’s limitations to this, and it would be extremely expensive to run the verifier in this way.
Max [02:22:30]
Yeah, roughly how much would it cost to sample that input sequence using this method? Either with an LLM or with a human.
Stephen [02:22:38]
With a human it might take only like 30 minutes or something to get a replay sequence that plays through the entire game. Because a human is actually skilled at the game and doesn’t need a lot of iteration time. But an LLM is essentially brute forcing random input sequences and observing the results from the frames. And so an LLM could take millions of tokens, right? Dozens of dollars or even hundreds of dollars to generate an input sequence for a single game. And you need to think about doing that for dozens of games. It just becomes very expensive.
Ege [02:23:09]
Hundreds of dollars for a single game seems too optimistic to me — to get to like the first platform.
Stephen [02:23:11]
Yes, that I agree. And then for a human, this would be like maybe a dollar. But probably less.
Max [02:23:15]
So this kind of goes against the intuition that many people have, that LLMs are way cheaper. Because they are way cheaper for certain tasks that they’re very good at. But there are a lot of tasks where you just can’t really use an LLM.
Ege [02:23:34]
Yeah, I mean, if they’re not way cheaper, then you don’t use them. So this is almost like a tautological claim.
Max [02:23:40]
Well, yes. But I think a lot of people just have this heuristic that you should just always use an LLM because it’s cheaper. Because there’s this spectrum of how good an LLM is at a task: either it won’t be able to do it at all, or it’ll be able to do it for cheaper than a human. And in a lot of cases you can forget about the transition, but there is sometimes this transition where it’s actually significantly cheaper to use a human.
Ege [02:24:23]
Yeah, this is especially true for interactive tasks that have video — video modalities, for example. Like a video game.
Max [02:24:25]
Yeah, that’s true. Whereas it’s less common for text-based tasks or something. I think this is in part just because video is so much information, and LLMs are really slow at processing video, and they haven’t been trained on it for that reason.
Ege [02:24:42]
Yeah, that’s true.
Max [02:24:44]
So we’ve established that there are some of these tasks that are relatively easy to train against, like just basic coding, where you get a spec for the things you need to program, as opposed to needing to figure out what users would need exactly, or interpret these ambiguous instructions and these other things. But when you think about the jobs that are associated with these tasks, like the job of being a software engineer, these haven’t been fully replaced. Could you talk about why that is? Why have software engineers not been replaced even though coding is basically solved?
Ege [02:25:24]
So one thing I would point out is that this is not necessarily a very new trend. In fact, even the job of a software engineer could only exist because we created this layer of abstraction which allowed people to write some abstract representation of what they want the computer to do. Previously they would have actually had to design an explicit circuit that did the thing they wanted. They would have literally had to design the physical hardware to do what they wanted. And over time in software engineering we’ve had the progression towards higher levels of abstraction. People used to write explicit instructions that would get executed by a processor. And then we developed higher level programming languages like C or whatever, and then they would be compiled down to these machine instructions. So now you didn’t have to think about machine instructions at all. And most programmers today probably don’t even know how machine instructions work really, because they never have to think about that. And then over time we progressed up this abstraction ladder, such that what people write now, and what was called programming even before LLMs, in 2020 or something, is extremely different from what would have been called programming in 1970. Because we have so many modules and libraries that do things for us. We work with such high level languages where we never have to worry about a lot of things, like exactly which physical address some data is being stored in, or exactly, if you want to multiply matrices, in which order we multiply various things. We just have a library that does that on a GPU. Modern programming operates at a very high level of abstraction, because we’ve just built up this abstraction stack. But that has actually not reduced the demand for software engineers. In fact, the demand for software engineers has constantly gone up. And that’s because of what happens as we automate the parts of the job that were more tedious.
For example, writing machine instructions is just really tedious. You can’t really write any complicated software if you have to write out all the machine instructions for that software by hand. If you can just use abstractions that wrap around all of that, that just handle that bookkeeping for you, like compilers, like existing libraries and so on, then you can just focus on making higher level decisions about what the software should do. And then your time is much better spent. You just generate way more software and way more complicated software per unit time. So in fact you become more productive, and then there is more demand for your services, because now you can create this increasingly sophisticated software that satisfies more use cases for the end users. You can create, for example, a really compelling and fun to play video game. Well, if you had to write every machine instruction by hand, you couldn’t really do that. People tried, and some of them — there’s like RollerCoaster Tycoon, right?
Max [02:28:17]
Yeah, yeah. That’s probably the one compelling game that has actually been written in assembly. Which is very impressive, but there’s not very much capacity to make games that way, and it’s very expensive.
Ege [02:28:28]
Yeah. So I see, so far, the impact of LLMs on programming, I think, has just been the same. It’s just been the same trend. We are now moving towards an even higher level abstraction where instead of writing in these really abstract and high level languages like Python and TypeScript and whatever, which are not even very similar to the programming languages that people, it’s very different to write Python compared to writing machine instructions. It’s probably about similarly different to writing Python or writing English text about what a program should do. Because there is a massive gap between what you’re writing when you write a Python script and what will actually get executed on the machine. In fact you can’t even predict that that well. There’s so many decisions that are made by so many layers of abstraction about how things will get executed that it’s very hard for you to predict that. So it’s a continuation of the same trend, and I think it’s right now having the same effect, which is it is automating parts of the job. Being a software engineer involves a lot of different skills. And it used to be that one of these was writing machine code, and now that’s not one of them. And so far, before LLMs, it used to be that one of them is writing in high level programming languages. But now that’s also not going to be a key one. Like now you can write a Go program, with Go as a programming language, without ever having learned Go. Because now the LLM just knows it. So the LLM just handles the routine parts of writing it for you. And you can make the high level decisions about how the application should work, how it should be designed, what the data flow should be like, which modules should have which responsibilities, and how they should internally work at a conceptual level. You can design the system from the point of view of a product manager and a computer scientist. 
Or you can think about the algorithmic properties, the complexity properties of the system, data access patterns, data storage models, UX, user experience, design. As opposed to thinking about, okay, I have this map in my head of how this application should work, now I have to write this in this programming language in a way that the compiler of that language will understand. So that is the skill that is getting automated. Other skills, as we have discussed in the podcast, are not actually getting automated at the same pace. These skills of making good design decisions, of understanding what users want, and understanding what features they will need in an application to be satisfied with the application, these are not really getting automated at the same level.
Max [02:31:03]
Also though, we have to manage the LLM’s context for it. All of these things that we’re saying are bottlenecks for LLM usage are where currently human engineers are stepping in.
Ege [02:31:12]
Yes, that’s right. And if you take a very simple model of this, which is not exactly accurate but I think gives the right picture, you can view the productivity of the software engineering industry as a whole as being like a minimum function taken over a bunch of different inputs. Any one input can be the bottleneck. And what the LLMs allow us to do is automate parts of this. So the parts where the input intensity is very high are no longer the bottleneck. And then we can reallocate the time we were spending on those things and instead do the things that are still bottlenecks. And that raises the overall output of the industry. That also has the effect of raising the marginal value of time that engineers are spending on creating software, as long as the market is not saturated. So at some point, the world might just have so much software, and so much intricate software, that it doesn’t have any more demand for software. And in that case, the price of software might start falling, and that might offset this effect. That’s the point where we would expect to see software engineering jobs getting lost. But as long as that doesn’t happen, in fact you could expect software engineering employment to increase, because the marginal value of an hour of software engineer time is increasing. Now they have more leverage. They have more parts of their job they can automate, and that allows the time on the parts that are not automated to be spent more productively. Just like, I don’t know, if you’re doing agriculture. I think agriculture is actually a good example for this. Per unit time, someone doing agriculture today produces far more, and earns far more, than someone who was in agriculture a hundred years ago, because the job is much more automated. With agriculture, what has happened is that we have to some extent satiated the world’s demand for food. And as a result, instead of agriculture going up as a share of GDP, it went down, because people were like, oh, we have enough food.
And now what has become the bottleneck in the experience of consuming food? Well, it’s things like delivery, presentation, ambience of a restaurant, people who deliver the food to your door so that you don’t have to go to a restaurant anymore, people who are keeping it warm and whatever. It’s not the raw material. It’s not the things you grow in the fields that are the bottleneck. Those are a very small fraction of the cost of food. So if you take a narrow view of agriculture in that sense, then it has gotten smaller. But over time, for example over the past 20, 30 years, it’s possible that this broader industry of eating food in general, if you include restaurants and luxury dining and all of this stuff, food delivery, it’s possible it has gone up as a fraction of GDP. I don’t know that. But just because there are so many luxury components that people spend more money on as they are richer, like getting their food delivered to their house, that might have grown. And I think it’s a similar effect right now with software, where the parts that are not automated, because we have so much more software, we can afford to do more of. And at some point the entire software market might become saturated, in which case we would start to see the relative value of software decline relative to other things that are involved. For example, sales might become more important. If everyone can make some software, then maybe it’s more important to get a lot of users, because a lot of software has network effects. So maybe it’s not really the design that becomes the bottleneck, but marketing it to users and convincing them to join and building up this brand that people can recognize. So it’s easily possible that if software as a whole becomes way less valuable, because it’s a saturated market, then these other things become valuable. But I think we’re not in that regime yet. 
We’re still in the regime where the world has a lot of demand for software, and that demand is not being satiated. So what I expect to happen as we automate parts of software is the remaining parts get more valuable. The software engineers working on those things actually get more compensation. And the world’s fraction of output spent on software goes up instead of going down. This is sort of the Jevons paradox thing. It only holds up to a point, because if the rest of the economy just stays completely stagnant, then at some point we will have satiated our demand for software. But I think that point is still kind of far away.
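The minimum-function picture Ege describes can be sketched in a few lines. This is a toy model under stated assumptions: output is the minimum over tasks of time spent divided by time needed per unit, and the task names, the hours per feature, and the 10x speedup are hypothetical numbers chosen for illustration.

```python
from typing import Dict


def output(hours_spent: Dict[str, float], hours_per_unit: Dict[str, float]) -> float:
    """Minimum-function output: the scarcest input sets the pace."""
    return min(hours_spent[task] / hours_per_unit[task] for task in hours_per_unit)


def best_allocation(total_hours: float, hours_per_unit: Dict[str, float]) -> Dict[str, float]:
    """Split time in proportion to each task's cost so no task lags behind."""
    total_cost = sum(hours_per_unit.values())
    return {task: total_hours * cost / total_cost for task, cost in hours_per_unit.items()}


# Hypothetical numbers: hours of engineer time needed per shipped feature.
before = {"design": 1.0, "writing_code": 3.0}
after = {"design": 1.0, "writing_code": 0.3}  # assume LLMs make code writing 10x faster

week = 40.0
features_before = output(best_allocation(week, before), before)
features_after = output(best_allocation(week, after), after)
print(features_before, round(features_after, 1))
```

In this toy setup a 10x speedup on writing code raises weekly output by only about 3x, because the freed time is reallocated to design, which becomes the new bottleneck. Whether that extra output raises or lowers engineer pay depends on how saturated demand is, which the model does not capture.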
Max [02:35:41]
Yeah, unless you believe in a software singularity. There’s just going to be some point at which there’s no longer that much need for software, and other things become more important. Kind of like how, I think this happened with manufacturing, for example. At some point there was a lot of demand to increase the number of cars made or something. But then eventually there were just enough cars, pretty much. And then cars still improve, but in other ways.
Ege [02:36:12]
Or the experience of having a car improves. Yeah. I think that’s right.
Max [02:36:18]
Yeah, I mean, how does this look for other professions than software engineering?
Ege [02:36:24]
I think the basic trend I would expect is not very different, unless the profession is something that can be completely automated. In that case it’s possible that the human doesn’t have any absolute advantage, and it’s better for them to do something else. For example, travel agents might end up getting automated. The number of people working as travel agents has been falling quite steeply over the past 10, 15 years. I don’t know the exact time frame, but that makes sense. Because what is the job of a travel agent? They book hotels for you. They figure out what is a good hotel, what is a bad hotel. A lot of that can actually just be handled by ordinary software, not even by AI. You can build an entire software platform whose only purpose is to recommend hotels to you and handle bookings and make things very convenient for you. And it’s often better than having to talk to a person and explain your specific situation to them. This is also why, often, if you have an automated system for resolving customer issues that works well, it can be better than talking to a customer representative. And if anything, I would expect AI to make that even worse for travel agents. I personally use AI tools for this. Say I’m going to a city that I’ve never been to and I don’t know which hotels are good. I just tell it I want to visit this place and this place, I have some job I need to take care of in this location, which hotels should I stay at, which hotels are good, and these are my criteria. And it will just research and then give me some recommendations, and then I can pick one of them and just book. And I don’t need a travel agent to help me. So I can understand why that particular job, for example, would be getting automated. But I think that only is true if very close to all of the tasks that are involved in doing that job are getting automated at once.
If it’s only half of the tasks, or 20% of tasks by time spent or whatever, then what’s going to happen is that in fact the job will probably get more valuable and we will have more people do it, because the productivity of the people who are doing that job will go up. That is, if this is a good that we have high demand for. Some industries like this are software, medicine, law, finance. I think all these are sectors which have grown as a fraction of the economy as we have become richer. So our demand for them seems to be not satiated at any reasonable level. With medicine, it makes a lot of sense why it would not be. But I think it’s also true of some of the other sectors, like software, because there are so many efficiency gains you can get from just having better tracking of things and bookkeeping and accounting. Software is just this massive thing that helps us coordinate large-scale activities, which is very hard to do.
Max [02:39:27]
Yeah, all knowledge work. I agree.
Ege [02:39:29]
Some knowledge work is not about coordination, but I think a lot of it is. And software is a very crucial part of that. It’s very hard to do that without software. If you’re just using pen and paper and whatever, like people used to do, it was very inefficient. And right now I think there is still so much more software that will be useful, because right now people just have to use solutions that were built at scale to serve the needs of many different customers at once. Because that’s the only kind of software that can really be made high quality, where the economies of scale work such that you can invest a lot of effort to make it high quality. But often that just means that the software doesn’t fit your particular needs. It was built for an average of a very large number of users. And if you just had software that exactly fits your needs, if you could spin it up somehow, that would be really valuable. Probably, if it was very cheap, most companies would just have tons of internal custom software tooling, because each company is a different context. They have something different they’re doing, and the software product that looks best for them would not be the same as it would be for some other company. They just have to use the same software products now because that is the only feasible thing. It’s infeasible to make a really high quality software product for the use of a company of 50 people. That used to be infeasible. I think as AI increases software productivity, it is going to become more and more possible to do that. And as a result, we’re going to see way more custom software be built at smaller scales that is much higher quality than what we could afford in the past.
Stephen [02:41:09]
Yeah. And this is not just hypothetical, right? Even now we see a lot of companies vibe coding internal tooling, where maybe the objective quality of the software is lower than the general, very popular option. But just because it’s more suited to your use case, you’re willing to take this compromise of lower quality, maybe less reliable, but it’s just so well suited to your use case. Even internally at Mechanize, this is something we do.
Max [02:41:31]
Yeah. And this is kind of the flip side of the SaaS apocalypse. A lot of people are worried about all these SaaS companies failing, but the flip side of that is that we’re going to get all of this really personalized or individualized software, which I think is going to massively boost certain sectors of the economy, even if it’s at the cost of these more generic SaaS offerings. Stephen, Ege, thank you so much for coming on the podcast. I think it was a great conversation. And thank you for listening.


