ChatGPT in evaluation – An opportunity for greater creativity?

As debate rages over the possibilities and risks to higher education of artificial intelligence (AI) tools such as ChatGPT, evaluators are also asking what role AI and machine learning can play in their field.

Speaking at a virtual symposium hosted by the Centre for Research Evaluation at the University of Mississippi in the United States on 24 March, independent evaluation consultant Silva Ferretti described ChatGPT as the perfect bureaucrat: pedantic and by the book.

She argued that the concern should not be that AI will substitute for humans in the field of evaluation, but rather the extent to which humans have begun thinking like machines.

The symposium was titled “Are We at a Fork in the Road?” and explored implications and opportunities for AI in evaluation. It was hosted by Dr Sarah Mason of the University of Mississippi and Dr Bianca Montrosse-Moorhead of the University of Connecticut, co-editors of New Directions for Evaluation, a publication of the American Evaluation Association.

They said that disciplines around the world were grappling with the question of whether ChatGPT heralded a fork in the road with respect to powerful new generative AI. “This potential fork emerges because generative AI is distinct from earlier AI models in that it can create entirely new content.”

Multiple roles for ChatGPT

For Ferretti, using ChatGPT in her work has been an excellent time-saver, offering ideas and even serving as a “fantastic sparring partner for thinking”.

“It can take on the role of feminist evaluator, conventional evaluator, technocratic expert and many other different viewpoints,” she said. “The AI can provide ‘by the book’ approaches offering [the human evaluator] free time to go ahead, explore the details, the alternatives and the possibilities.”

Ferretti described how she uses ChatGPT for tasks like developing concept notes, questionnaires, log frames and evaluation criteria.

“For me,” she said, “using ChatGPT really pushes me to another level. Because why should people pay me if I am going to produce something a machine can do? For people who go by the book, next year, ChatGPT can do it better.”

Fellow speaker Dr Tarek Azzam, director of the Centre for Evaluation and Assessment at the University of California, Santa Barbara, argued that AI has a long way to go before it poses any real threat to human evaluators.

“We have been trained for many years and through many experiences to think about our results, our outcomes, and to be able to make a validity argument to support the conclusions we come up with,” he said. “Primarily because of all the things that go into developing a valid argument or a valid conclusion.”

But, like Ferretti, Azzam sees a space for AI in research evaluation, in what he describes as collaborative intelligence.

“So rather than have an either-or: AI versus human, how can we actually leverage AI for both training and understanding?”

He pointed to one example put forward by the National Academies of Sciences: an automated scoring scenario in which an AI engine takes the place of one of the two human scorers. “This is something that can be a huge advantage from a time and cost standpoint,” he said.
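The scoring scenario Azzam described can be sketched as a simple adjudication workflow. This is a hypothetical illustration only (the function and parameter names are invented, and the actual National Academies design was not detailed at the symposium): an AI engine stands in for one of two human raters, and a second human is called in only when the human and AI scores diverge.

```python
# Hypothetical sketch of automated scoring with an AI engine as the second
# rater. A second human adjudicates only when human and AI disagree, which
# is where the time and cost saving Azzam mentioned would come from.

def score_with_ai_second_rater(response, human_score, ai_score,
                               adjudicate, tolerance=1):
    """Return (final_score, escalated). Calls the human adjudicator
    only when the human and AI scores differ by more than `tolerance`."""
    if abs(human_score - ai_score) <= tolerance:
        # Agreement: average the two ratings; no second human needed.
        return (human_score + ai_score) / 2, False
    # Disagreement: a second human resolves it, as in double-human scoring.
    return adjudicate(response), True

# Stand-in adjudicator for illustration.
final, escalated = score_with_ai_second_rater(
    "essay text", human_score=4, ai_score=3, adjudicate=lambda r: 4)
print(final, escalated)  # 3.5 False — within tolerance, no escalation
```

In this sketch, only the disagreements consume a second human rater's time; the rest of the workload is resolved by the human-plus-AI pair.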

Risks of ChatGPT

Ferretti noted that using ChatGPT does not come without risks, one of which is its tendency to “hallucinate”, or simply make things up. The issue was echoed by several speakers across the event.

Dr Izzy Thornton from the Centre for Research Evaluation at the University of Mississippi noted the issue of algorithmic discrimination, which stems from machine learning models being only as good as the data they are trained on.

As an example of how far off course models can veer, Thornton described a machine learning model trained to distinguish images of malignant skin tumours from images of benign ones. The model had approached near human-grade accuracy in telling one type of tumour from the other.

But when researchers took the model apart to see how it worked, they discovered the primary factor it was using to determine malignancy in the image was the presence or absence of a ruler in that image.

“The problem here,” she explained, “was in the training data. Most malignant tumours were photographed with a ruler alongside for scale while most of the benign tumours were not.”
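The shortcut Thornton described can be reproduced with a toy example. The data below is entirely made up for illustration: when a spurious feature (a ruler in the photo) tracks the label almost perfectly in the training set, a naive learner will latch onto it, and its apparent accuracy collapses on images where the correlation is broken.

```python
# Toy illustration (made-up data) of the "ruler" shortcut: a model trained on
# data where a spurious feature tracks the label will learn the shortcut.

# Each sample: (has_ruler, irregular_border); label: 1 = malignant, 0 = benign.
# In this fabricated training set, rulers appear in every malignant photo.
train = [
    ((1, 1), 1), ((1, 1), 1), ((1, 0), 1), ((1, 1), 1),  # malignant, ruler present
    ((0, 0), 0), ((0, 0), 0), ((0, 1), 0), ((0, 0), 0),  # benign, no ruler
]

def best_single_feature(data):
    """Pick the feature index that best predicts the label on its own."""
    best, best_acc = None, -1.0
    n_features = len(data[0][0])
    for i in range(n_features):
        acc = sum(x[i] == y for x, y in data) / len(data)
        if acc > best_acc:
            best, best_acc = i, acc
    return best, best_acc

feature, acc = best_single_feature(train)
print(feature, acc)  # feature 0 ("has_ruler") scores perfectly on training data

# On photos taken without rulers, the shortcut collapses:
test = [((0, 1), 1), ((0, 1), 1), ((0, 0), 0)]  # malignant cases, no ruler
test_acc = sum(x[feature] == y for x, y in test) / len(test)
print(test_acc)
```

The learner never sees the tumour at all; it scores perfectly on the training data by checking for the ruler, which is exactly the failure the researchers uncovered when they took the real model apart.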

There are also more sinister ways in which the training of these models reflects the biases and prejudices of those who develop and train them. Here, Thornton pointed to the well-known example of self-driving cars being less able to detect people with darker skin tones than those with lighter skin tones.

She also spoke to validity and reliability concerns in AI models.

“Machine learning models are trained to spit out language that sounds correct according to other things they know are correct. They don’t fact-check. And so, because we cannot see the patterns by which they generate their answers, we cannot say for sure whether the results are actually valid.”

Ferretti, too, noted experiencing this in her work with ChatGPT: its output needs to be fact-checked at every point, and its confidence in its own correctness can and does instil false confidence in its human users.

Issues of equity in use of AI

With every new advance in technology, there are accompanying concerns, said Dr Aileen Reid of the department of educational research methodology at the University of North Carolina at Greensboro.

There is considerable evidence that algorithms and AI can contribute to new modes of racial profiling and gender insensitivity, advantaging the privileged in society while harming the poor and perpetuating injustice.

“Knowing this,” said Reid, “to what extent can a computer make an ethical decision for us, and to what extent should we trust it to?”

Where all speakers concurred, however, is that AI is not going anywhere. The choice is either to engage actively with AI implementation in the field of evaluation, or to let it enter the field in an unfettered, unmanaged way, with unintended and potentially harmful consequences.

Reid described this as an opportunity for evaluators: “AI is unregulated. The internet is unregulated. And this is a public policy issue that will eventually come under scrutiny. And so we need to, as a field, foreshadow and create conversations to anticipate these intended and unintended effects, and really put ourselves at the table while things are developing.”