The point about the stack starting lower than most people think is accurate. I spent months focused on prompts and instruction files before realizing the serving layer was the source of inconsistency. Same model, same prompt, different caching behavior = different results. For anyone starting out: the interaction contract section here is worth reading slowly.
The long-PDF-with-tool-calls example you used is exactly the kind of edge case that breaks beginners. Simple chat demos hide the brittleness completely. It only shows up under real load with complex tool use, by which point you've already built half your system.
Thanks, Pawel!
That’s exactly the failure pattern I was trying to get at. A lot of early “agent” debugging starts at the prompt layer, but the system usually gets more interesting, and more brittle, once the serving layer and interaction contract start showing through.
Simple chat demos hide a lot. Long documents, tool loops, retries, and real load force the stack to reveal what it actually depends on. Glad that section resonated.