Labelling Demo


Mark Betnel


May 7, 2022

In the high school class I teach about AI and the Future, I’ve been struggling to figure out which concepts we’ve covered the students really get and which they don’t. In my physics and math classes I have an easier time with this formative assessment, because at least some of what I want from them is the ability to perform certain skills. So I assess those skills and I have some (imperfect) sense of what I need to do next.

This class, being an elective, means at my school that I can’t assign (much) homework. It having no pre-requisites means I can count on no background knowledge about computing. This is by design, as I want everyone in the school to take the class. I’m getting about 80 kids per year in it, so maybe 50% of the school will have taken it at some point – I feel good about that. But that still means it’s challenging to teach.

What I want is for them to come away with an informed curiousity about the technology and its implications, so they can ask good questions about it and make good decisions about it in their own lives and for society. That’s a good goal, but it’s kind of nebulous for defining the performances I want them to be capable of during the class. What ends up happening is a bunch of reading, playing with little toy examples, and discussions that are a little one-sided – I have a lot to learn about how to teach a class like this.

So one of the concret-ish things I want from them is to understand the workflow of supervised learning and deployment: data gathering, curation, and labelling; model training; model testing; deployment; maintenance; end-user usage; iteration. With each of those steps I want them to know a little technical detail and to think about some of the fraught questions that can arise (and that they should ask!)

For example: Why do we need an ML tool for this task? Is it plausible that there is signal in the available data to power the ML tool? If the desired tool really works exactly as designed, does it pass the “Is it creepy?” test? What are the broader implications of success or failure with the project? Where does the data come from? How does it get labelled? How is privacy of data subjects protected? Is informed consent obtained? What are the training costs and issues? What does it mean to overfit vs underfit? How does one test the results? What are the error rates? Is there more than one kind of error? Do the different kinds of errors have costs (for the user and the target)? Do the errors (and successes) fall on population subgroups equitably? How is the model used in deployment? How is the performance monitored once deployed? Are there plans to maintain the system and to bring it down if results are not as expected? What would the next generation tool look like? Will it be more data, better labelling, more training compute, more sophisticated testing, better deployment infrastructure, better user education…

That’s a lot of questions. I asked the students this week to make a flowchart of what they thought the steps were – as a kind of formative assessment for me of what ideas had landed and which hadn’t. One thing I noticed is that a lot of them went straight from “Identify problem” to “Train computer to solve problem” without any kind of “acquire labelled data” step – that’s definitely a step that I thought we’d spent enough time discussing, so I was thinking about finding a demo of how labelling works – something like the workflow of a Mechanical Turk-er, and then present some stats on the quality of the labelling so they can think about issues of reliability of the labels, the effect of the labels on the training process, and how well the data set covers the relevant universe of things. I couldn’t find anything like that.

So I’ve been trying to make it. Progress reports to follow, but it turns out that I’ve never done server side programming before, and it’s a fine rabbit hole.