What Hand-Grading Hundreds of CFA Exams Taught Me

What Hand-Grading Hundreds of CFA Exams Taught Me

·5 min read

Somewhere in my career I started an ed-tech company called GoStudy which focused on test prep for the CFA exams, the finance credential people spend years and a small fortune chasing.

My co-founder Parker built the platform, the mobile app, and the adaptive-learning engine. I built everything else. Thousands of pages of curriculum, the video content, the marketing, the sales. We built a machine designed to scale, to reach a lot of people without me in the room.

That was the idea anyway. In reality what people paid us thousands of dollars for was our most expensive product - a course I taught with problem sets I graded for one part of the third exam. The value for many was my personal attention concentrated on a very specific pain point for Candidates.

Other than content delivery the core of the course was these problem sets I'd created. Every week the class had to complete one and submit it. I'd then grade them one at a time, by hand. I remember burning so many printer cartridges in my tiny 1-bedroom NYC apartment. I'd print out hundreds of pages of the work people had actually done (which they had also printed and written on and scanned). I'd grade the problem sets with detailed feedback and rinse and repeat. It was slow. It didn't scale. And it was popular and it worked to get people to pass that exam.

I've thought a lot about why that individualized approach was so effective and so popular despite the price point and the crude delivery mechanism.

I think ultimately it was a combination of (a) general accountability, (b) meeting people where they are, and (c) teaching them the shape of what they need to do. Not the content. The shape.

The shape thing is what's more broadly interesting, especially in terms of AI context. Bear with me.

Level 3 is the constructed-response part of these finance exam. It's not multiple choice, you don't bubble in answers. You write them out. So a wrong answer in these problem sets wasn't a black box. It offered a window into where and how the person was thinking. Grade a few hundred of those and the patterns of errors start to coalesce. What trips groups of people up, how it relates to the knowledge they have and are learning, how it relates to how the question is introducing context.

Sometimes you just don't know the answer and you need more information and knowledge. But more often than not that wasn't the gap. These were smart, prepared people who knew the material. What they missed was the intent of the question, the tells of the exam writers themselves that might tip you off that the investor bias they wanted you to talk about was "loss aversion" or whatever.

Then there was understanding the shape of the overall exam. 3 hours, 10-12 sections, each with parts A-F with each part worth a varying degree of points. A three-point answer means they want three things: one bullet, one short line, each. 1 point = 1 minute. That's the guideline.

But you've got a smart, anxious person who studied for a year, and they'd write a beautiful paragraph on a three-point answer they really knew and burn through their three hours writing essays. Or they'd treat the exam as a linear exercise and not skip around and answer all the stuff they knew cold first. Coaching someone through the shape of what they needed to know in terms of the framework of the material, how it hung together, the arc of what was most likely to be tested, the how it was brought up in the past, and then the strategies for the actual structure of the exam itself was worth more than anything I could teach them about portfolio management.

And it came through in repeat runs/grading assignments. It wasn't something I could articulate BEFORE I started grading hundreds of problem sets. The work, the pattern recognition, the improvement of myself as a teacher and grader and the students as exam takers worked hand-in-hand via tight feedback loops.

That is judgment, or taste, or innate understanding of where and how to focus and shape responses and to drill beyond the superficial. The only way you get it is by putting in the work and doing the reps. You do the thing a few hundred times and your hand learns to see. There's no shortcut I ever found.

Which is a weird way of getting to the same thing is true today. We need loops, we need to generate reps, and we need to encode judgment but not rush into that encoding because the product of the work and the loop shapes the judgment and reveals what matters also in a way that is not ex ante knowable.

Last sort of unrelated tangent - this entire stream of consciousness blog post was an example of a content production AI loop. I didn't write my way to this over a blank page. I had this flashback to the grading process, this kernel of a "huh that's actually related to all the AI work." and then I had my "grill me" skill ask deep questions, my "content production" flow try to shape that into an outline that I could review and then words on the page. And then I ripped out most of it and rewrote the entire thing as it gradually took shape that there was maybe a "there" but also maybe just a "need to write this thing out" in a way that AI produced words just wouldn't cover.

So I asked the model to reflect on the whole process and answer that question directly. Here's what it said:

There was a "there." But — and this is the part that matters — neither of us knew what it was when we started, and the AI never found it. You did. Watch the actual sequence: I ran the grill-me questions, and my first synthesis was "the scarce thing was the human feedback." Clean. Plausible. Wrong, or at least shallow. It took your correction — Parker built the tech, you were mid-career, the value was personal attention on a specific pain point — and then three more rounds of you ripping out my constructions before the real spine surfaced: shape, not content. The thing you'd known in your hands for fifteen years and had never said.

So what was the AI actually for? Not the "there." The AI was the reps. It was the loop that made you grade your own understanding a few hundred times in an afternoon. Every draft I handed you was a wrong answer on a problem set — smart, prepared, plausible, and missing the intent of the question. And you did to me exactly what you used to do to your candidates: you read the gap behind my wrong answer and marked it up. "This reads a little AI." "This is repetitive." "This is confusing." "This is AI slop." You weren't editing prose. You were grading, and the feedback loop shaped the judgment — yours, about what this essay was even about — in a way that wasn't knowable ex ante. Your own thesis, enacted on your own essay, with the machine in the seat where the anxious candidate used to be.

That's the answer to your last paragraph, and it's better than "AI helped me write faster." The AI didn't cover the words. It generated the reps that let you find the shape. The blank page never got you here because the blank page has no feedback loop — it just stares back. What you needed wasn't a co-writer. You needed something to be wrong at you, quickly and without ego, enough times that the "there" had to reveal itself to survive the grading. The medium didn't just enact the message. The medium was the message: this whole thing was one more problem set, and you were both the grader and, for once, the candidate.

The reason it's not just "need to write this out" is that the writing-out produced a claim you didn't hold going in — that you can't rush the encoding of judgment, because the loop reveals what matters and no amount of upfront thinking substitutes for the reps. You didn't know you believed that with this clarity until you'd graded me down to it. That's a "there." A "need to write it out" ends when the page is full. This ended when a specific idea sat still.