How Quiz Generators Can Change Formative Assessment in the Classroom

A biology teacher’s honest take on using AI to build formative assessments faster, without sacrificing the pedagogical thinking behind them.

I gave a quiz last Tuesday that I made in about forty-five seconds.

It covered cellular respiration, had eight questions, and caught a misconception about ATP that I probably would have missed until the unit test. That quiz did more for my third-period class than the review worksheet I spent an evening writing the week before.

I want to be honest about this: I was skeptical of AI-generated assessments. I teach biology, and for years I have believed that writing my own questions is part of knowing my students. I still believe that, mostly. But I have also come to believe something else: the number of low-stakes quizzes I should be giving far exceeds the number I have time to write.

The research case for more quizzes

The evidence behind retrieval practice is not new, but it is stronger than most teachers realize. Roediger and Karpicke’s 2006 study at Washington University demonstrated that students who took practice tests retained significantly more material over time than students who spent the same amount of time re-reading their notes. The margins weren’t small: on delayed recall tests given days later, the testing group substantially outperformed the re-study group.

This idea, sometimes called the testing effect, has been replicated extensively since then. A 2021 systematic review by Agarwal, Nunes, and Blunt examined 50 classroom experiments with over 5,000 students. Fifty-seven percent of the effect sizes were medium or large. One earlier classroom study found that students scored 94 percent on quizzed material versus 81 percent on material they had studied but never been quizzed on, and that gap persisted months later.

What strikes me about this research is how little of it has filtered into everyday teaching practice. We talk about formative assessment in professional development sessions. We know the theory. But the day-to-day reality is that most teachers run maybe one or two low-stakes checks per week, if that. Black and Wiliam’s landmark review of formative assessment found effect sizes between 0.4 and 0.7, which puts it above almost every other classroom intervention that has been studied. Yet the implementation gap persists, and I think the reason is simple: making good quizzes takes time we do not have.

The time problem is real

I’ve tried keeping a question bank. I’ve used Google Forms to build quick checks. I even had students write questions for each other once, which is a fine activity but doesn’t reliably produce questions that test the right things.

The bottleneck is always the same. Writing a good multiple-choice question with plausible distractors takes real thought. Writing eight of them takes a half hour, minimum, if you want the wrong answers to reflect actual student misconceptions rather than obviously silly options. Multiply that across five preps and the math stops working. So I end up giving fewer quizzes than the research says I should. I suspect most teachers are in the same position.
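
To make “the math stops working” concrete, here is the back-of-the-envelope version as a small Python sketch. Every number in it is my own rough estimate, not a figure from the research.

```python
# Back-of-the-envelope cost of hand-writing quizzes.
# All numbers are rough personal estimates, not research figures.

minutes_per_question = 4   # a multiple-choice item with plausible distractors
questions_per_quiz = 8
preps = 5                  # distinct courses to prepare for
quizzes_per_week = 2       # low end of what retrieval practice suggests

minutes_per_quiz = minutes_per_question * questions_per_quiz   # 32 minutes
weekly_minutes = minutes_per_quiz * preps * quizzes_per_week   # 320 minutes

print(f"About {weekly_minutes / 60:.1f} hours per week just writing quizzes.")
# About 5.3 hours per week just writing quizzes.
```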

Differentiation makes it worse. I have students reading at a ninth-grade level and students reading at a college level in the same room. A single quiz doesn’t serve both groups well, and writing two versions doubles the time.

What AI quiz generation actually looks like

This is where the tools changed things for me. I started experimenting with AI quiz generators about a year ago, mostly out of curiosity, and kept using them because they genuinely saved me time.

The basic idea is straightforward. You give the tool your source material, either by pasting text or uploading a document, and it generates questions. Multiple choice, true/false, short answer. You can usually pick the format and adjust the difficulty. Tools like the AI quiz generator at Quizgecko let you feed in a lesson plan or a PDF chapter and get a full set of questions back in under a minute. I have also used Google Forms with its recent AI suggestions, and I keep Anki around for spaced repetition flashcard work with my AP students.
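
None of the tools above expose exactly this interface, but the underlying loop is not mysterious. Here is a minimal sketch of the paste-text-in, get-questions-out idea, written against a generic LLM API; the model name, prompt wording, and file name are all my assumptions, not anything from Quizgecko or Google Forms.

```python
# A minimal sketch of the generate-from-source-material loop.
# Not any quiz tool's internals -- just the generic idea, written
# against the OpenAI Python client. Model, prompt, and input file
# are assumptions for illustration.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

with open("cellular_respiration_notes.txt") as f:  # hypothetical lesson notes
    lesson_text = f.read()

prompt = (
    "Write 8 multiple-choice questions covering the passage below. "
    "Make each wrong answer reflect a common student misconception, "
    "not an obviously silly option, and mark the correct answer.\n\n"
    + lesson_text
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # assumption: any capable model would do
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```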

What surprised me was the quality of the distractors. The wrong answers aren’t random. They tend to reflect common misunderstandings, which is exactly what you want in a formative assessment. Not always, and I’ll get to the limitations, but often enough that I can start from the generated set and edit rather than building from scratch.

That shift, from writing to editing, is the real time savings. I spend five to ten minutes reviewing and tweaking a quiz that would have taken me thirty or forty minutes to create from nothing. Over a week, that adds up.

Keeping the teacher in the loop

I should be clear: I don’t hand these quizzes to students without reading them first. That would be a mistake, and it would also miss the point.

Reviewing AI-generated questions actually forces you to think about what your students need to know. When I scan a set of ten questions and delete three of them, the reasons I delete them are informative. Maybe the question tests vocabulary when I wanted to test application. Maybe it’s ambiguous in a way that would confuse my English language learners. Those decisions are still mine, and they should be.

What I’ve started doing is generating a larger set than I need, maybe fifteen questions, and then cutting down to eight or ten. I pick the ones that target the specific learning objectives for that lesson. Sometimes I rewrite a question stem to match how we actually discussed the topic in class. Sometimes I add a question the AI didn’t think of because I know from last year that students struggle with a particular graph.
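
If you ever wanted to script that cut-down step instead of doing it by eye, the shape of it is simple. Everything in this sketch (the questions, the objective tags, the cap of ten) is invented for illustration.

```python
# Hypothetical sketch of the cut-down step: generate more questions than
# you need, keep only the ones that hit today's objectives, cap the count.
# Questions and objective tags are invented for illustration.

generated = [
    {"stem": "Which stage of cellular respiration yields the most ATP?",
     "objective": "energy-yield"},
    {"stem": "Define 'substrate-level phosphorylation'.",
     "objective": "vocabulary"},
    {"stem": "A cell is deprived of oxygen. What happens to the electron "
             "transport chain?",
     "objective": "application"},
    # ...plus the rest of the fifteen generated questions
]

todays_objectives = {"energy-yield", "application"}

quiz = [q for q in generated if q["objective"] in todays_objectives][:10]
for q in quiz:
    print(q["stem"])
```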

I use these mostly as entry tickets and exit tickets. Five questions at the start of class to activate prior knowledge. Five at the end to check what landed. Quizgecko and similar tools are fast enough that I can generate an exit ticket during my planning period before the last class of the day, based on what I noticed students struggling with during the earlier periods. That kind of responsive assessment was genuinely hard to do before.

Where AI quizzes fall short

They’re not perfect, and pretending otherwise would undermine everything I’ve said so far.

The most common problem I see is questions that are technically correct but pedagogically shallow. The AI tends to pull directly from the source text, which means it sometimes generates recall-level questions when I want analysis-level ones. If your source material is a textbook chapter, you’ll get questions that test whether students remember facts from that chapter. You won’t always get questions that ask students to apply those facts to a new scenario.

Subject-specific problems come up too. In biology, I’ve seen questions where the AI confused similar terms, like “mitosis” and “meiosis” in a context where the distinction mattered. In one memorable case, it generated a question about protein synthesis where all four answer choices were technically defensible depending on how you read the stem. Most students would probably have picked the answer I intended, but I would have fielded complaints.

Math and foreign language teachers I’ve talked to report similar issues. The AI can generate volume, but it doesn’t always understand the progression of difficulty within a topic. It might produce a question that requires knowledge students haven’t encountered yet, or test a skill at a level too simple to be useful.

None of this is disqualifying. It just means you review what you get. The tool gives you a first draft, not a finished product.

What this means for assessment practice

I think the real opportunity here is frequency, not automation. The research on retrieval practice is clear: students learn more when they’re tested often and at low stakes. The obstacle has always been time. If AI tools bring the cost of creating a quiz down from thirty minutes to five, teachers can realistically quiz three or four times a week instead of once.

That matters more than whether the AI wrote a perfect question. A slightly imperfect quiz given on Wednesday is worth more than a perfect quiz you never got around to writing.

I’m not making a grand claim about AI transforming education. I’m making a small, practical one: these tools let me do something I already knew I should be doing but couldn’t find the hours for. The cognitive science has been telling us for twenty years that retrieval practice works. The bottleneck was always production. For me, at least, that bottleneck is mostly gone now.

My students still groan when I hand them a quiz.

Some things AI cannot fix.