Facing facts: ChatGPT can be a tool for critical thinking

There once was a student in need,
Of an essay, a difficult deed,
So, they turned to ChatGPT,
For help, you see,
And aced the assignment with speed!

(Produced by ChatGPT to the prompt: “Write a limerick about using ChatGPT for a university essay.”)

The prompt that Professor William Kolbrener, who teaches in the department of English literature and linguistics at Bar-Ilan University in Israel, typed into ChatGPT was simple enough: “Homer is obsessed with body parts. Discuss.”

So, too, was the chatbot’s answer, which begins, “I am not aware of Homer being obsessed with body parts” – a claim that would seem odd to anyone who has ever cracked open the Iliad, given the number of arms, heads and hands hacked off in battle, to say nothing of the exquisite description in the Odyssey of Polyphemus’ eye boiling into ooze as Odysseus drives a red-hot stake into it.

The second part of the chatbot’s answer makes a usage error before falling into self-referentiality: “Homer is a fictional character from ancient Greek literature, specifically from the epic poem [sic] the Iliad and the Odyssey written by Homer.”

“These errors,” said Professor David Joyner, who, in addition to being a senior research associate, is executive director of online education and the online master of science in computer science at Georgia Institute of Technology in Atlanta in the United States, “are indicative of ChatGPT’s limits”.

He added: “Unless they fundamentally change the way it operates, it’s going to make some really silly mistakes. It’s going to be wrong about a lot of things for the foreseeable future because it doesn’t build ‘knowledge’; it builds relationships between words in much the same way predictive spelling does between letters.

“But it doesn’t ‘know’ what the words mean. In searching its huge database of books and websites, all it can find is that these words tend to happen together. And, as a result, it’s good at finding answers about something that has been written by someone in the past.”

Moral panic

The publication of ChatGPT by the San Francisco-based artificial intelligence company OpenAI last November touched off a ‘moral panic’ among many professors and academic integrity officers, reminiscent of the one that followed the advent of cheap and easily used calculators in the 1990s.

Stephen Marche, who has a PhD in early modern drama and has written sensitively on Shakespeare and trenchantly on the possibility of a civil war breaking out in the United States, declared in The Atlantic in early December that “the college essay is dead”.

Marche’s central argument is that since ChatGPT can produce fluid prose and, if the right prompts are used, can gather correct information, it undermines the undergraduate essay.

The prototypical undergraduate essay is the literary essay, which shows that students not only understand, for example, Edgar Allan Poe’s The Raven, but can also explain the poem’s structure, rhyme scheme and imagery to produce an interpretation of it. The poem begins with the ominous words: “Once upon a midnight dreary, while I pondered, weak and weary,/Over many a quaint and curious volume of forgotten lore.”

Coincidentally, just as ChatGPT was garnering its first reviews, the University of Chicago Press published Dr John Guillory’s Professing Criticism: Essays on the organization of literary study. In it, the retired chair of English at New York University provides a genealogy of the literary essay. Guillory shows that the essay, the major item professors use to determine grades, dates to the late 1940s and is intricately linked to two rather disparate occurrences.

The first was the GI Bill of 1944, which allowed almost eight million US Second World War veterans to go to college and university. Though the politicians in Washington expected most to study trades and farming, a surprisingly large number signed up for humanities courses, including English literature courses.

The second event – really events – occurred in classrooms and lecture halls across the nation in which New Criticism was the order of the day.

Associated with IA Richards (Principles of Literary Criticism, 1924), William Empson (Seven Types of Ambiguity, 1930) and WK Wimsatt (The Verbal Icon: Studies in the Meaning of Poetry, 1954), among others, the New Critics argued, broadly speaking, that the literary text should be examined without reference to the author’s biography or to the historical and philosophical issues of the time in which it was written. Poems, plays, novels and short stories were objects to be analysed via ‘close readings’ of their internal features: for example, rhyme, repetition, alliteration, irony, thematic structure, imagery and metaphor.

Not only could this method be taught to large classes, but students’ essays were roughly comparable within a class, between classes and across the discipline at other universities.

According to Boris Steipe, professor emeritus in the department of biochemistry at the Temerty Faculty of Medicine of the University of Toronto in Canada, to grasp how ChatGPT will likely affect the literary essay, we must first understand how universities came to rely on this essay and the ramifications of having done so.

The growth of the university in the 19th century led to the need to “commodify education, make it scalable and standardised so we could compare educational results between different universities and for things like professional certifications. To do all this, the universities developed proxy measures”, he said, before noting that the influx of millions into America’s universities after the Second World War compounded the need for a more or less standardised grading matrix.

“The literary essay is typically one of these proxy measures. We’re not interested in the essay itself. We are interested in teaching whoever writes the essay to think, and we’re taking the essay and saying: here is an indication that that thinking has been achieved.

“That’s why we used to be able to determine from the structure of the first paragraph of an essay what its grade was likely to be,” Steipe added, with a nod towards my three-decade-long career as an English professor.

Although all of the experts I interviewed for this article were, as I was, trained using the literary essay, each thought that, since the vast majority of undergraduate students are not going to become literature or history professors or, say, musicologists, the eclipse of the undergraduate essay need not be lamented.

The most welcoming was Professor Michelle Miller, who teaches in the department of psychological sciences and is the President’s Distinguished Teaching Fellow at Northern Arizona University in Flagstaff, Arizona, as well as being the author of the recent Remembering and Forgetting in the Age of Technology: Teaching, learning and the science of memory in a wired world (West Virginia University Press).

“If that [undergraduate literary essays] is what it [ChatGPT] is killing, then that’s kind of fine with me,” she told University World News. All too often, she said, professors assign essays because they were trained on them and, worse, without thinking about what the ‘cognitive revolution’ has shown about how people learn.

“We hope that writing an essay about a poem will transfer into writing a really persuasive legal brief or pitch when you’re a venture capitalist or something. But that’s not what the research shows. If you want skill ‘X’, you’ve got to train on skill ‘X’ and get feedback from an expert on it. And that is what’s really missing.”

For his part, Joyner thinks that despite the hype, we are some time away from the point at which either ChatGPT or Bard – Google’s entry into this field unveiled earlier this month – is able to generate a good essay in response to an arbitrary prompt dropped into it by a student.

He tested ChatGPT by asking it to write biographies of 10 video game composers and found the chatbot repeatedly used the description, “So and so’s music is known for its ability to perfectly capture the mood and atmosphere of the games he/she works on.”

A teaching assistant marking the essays would easily have spotted a further error: six of the 10 biographies said that their subject’s career highlight was being the first video game composer whose compositions were played by live orchestras. “Six of them,” Joyner added with a smile, “can’t all be the first to have had that happen.”

Yet, Joyner told me that ChatGPT’s ability to produce essays that in many cases are in the B-range raises important questions about the essay form, questions that will become even more important as the engineers at OpenAI work out the kinks that produce senseless answers.

“You have to go back to what your learning goals are for the actual class. Imagine a good learning goal; it should not be specified at the level of ‘a student should be able to write an essay about something’. The essay is the student’s way of demonstrating understanding of some knowledge. But you have a higher goal. It’s a higher level and that goal may get treated in different ways.

“Assessment tends to get linked to the plagiarism angle. I’ve seen that with professors who say you have to ask students things that can’t be put into the system, which is a natural kind of reaction, but I think it’s the wrong reaction.”

Weaknesses in math

Curiously, while ChatGPT can write computer code and, according to Steipe, is good at debugging it, mathematicians have found it wanting. In a Twitter post on 6 February, University of Warsaw mathematics Professor Andrzej Kozlowski wrote, “Look guys, stop telling me how bad this thing is! It is incredibly bad! I just finished grading a math exam (only one problem). I thought some of the ‘solutions’ were dumb, but now I know I didn’t know the meaning of dumb until I tried this thing.”

In a message to me, Kozlowski said: “It will produce ridiculous statements like: ‘n is divisible by 3 because n is an integer’ or ‘an even number divided by 2 is an even number’, which are just ‘imitations’ of mathematical statements without any relation to truth.”

When the topologist asked ChatGPT to compute the area of a triangle whose sides were 4, 7 and 10, it correctly chose the two-millennia-old Heron’s formula – and then made a mistake in multiplication.
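
The arithmetic the chatbot fumbled takes only a few lines to check. A minimal sketch in Python, using the sides from Kozlowski’s test (the function name here is illustrative):

```python
import math

def heron_area(a: float, b: float, c: float) -> float:
    """Area of a triangle from its three side lengths, via Heron's formula."""
    s = (a + b + c) / 2                       # semi-perimeter
    return math.sqrt(s * (s - a) * (s - b) * (s - c))

# Kozlowski's triangle: sides 4, 7 and 10
print(heron_area(4, 7, 10))  # ~10.93
```

The semi-perimeter is 10.5, so the area is the square root of 10.5 × 6.5 × 3.5 × 0.5 = 119.4375 – the multiplication ChatGPT got wrong.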

Understanding how it works

Both professors who want to incorporate ChatGPT into their pedagogy and their students’ activities, and those who want to design assignments that are as ChatGPT-proof as possible, need to understand how it and similar systems work.

ChatGPT’s training data was frozen at the end of 2021. It is a large corpus of electronically available books; all of English Wikipedia; Reddit; comments on Twitter feeds; information scraped from millions of webpages, from trivial user manuals to philosophical blogs; as well as news items.

This data was first parsed into ‘tokens’: word-stems, suffixes, punctuation and the like – and not only from English sources; dozens of other languages and writing systems are represented as well. There are about 500,000 tokens in all. In a sense, said Steipe, these can be thought of as Lego bricks, because anything you might be missing can be composed from the pieces you have.
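
The ‘Lego brick’ idea can be sketched in a few lines: split an unfamiliar word greedily into the longest sub-word pieces a vocabulary contains. Real tokenisers are built statistically from the training corpus, so the toy vocabulary and the greedy strategy below are only a simplified illustration:

```python
def tokenize(word: str, vocab: set) -> list:
    """Greedily split a word into the longest sub-word tokens found in vocab."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):      # try the longest match first
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])             # fall back to single characters
            i += 1
    return tokens

# A toy vocabulary of word-stems and suffixes
vocab = {"un", "break", "able", "think", "ing"}
print(tokenize("unbreakable", vocab))   # ['un', 'break', 'able']
print(tokenize("unthinkable", vocab))   # ['un', 'think', 'able']
```

A word the system has never seen whole, like ‘unbreakable’, is thus composed from bricks it does have.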

These tokens were used to train a ‘neural network’, a kind of very large spreadsheet that keeps track of relationships between tokens.

“During this training, a piece of text from the database is fed through the network, and the result is a prediction: what is the next token that follows in the text,” Steipe explained. “Since we know what the next token in the training data actually is, we can tune the network to make a good prediction for all of the training data. We are not copying this data, but interpreting it by learning the relationships between the tokens.

“When the system is given a prompt, that is, a question like, ‘Please explain to a child what was new about the Renaissance idea of Enlightenment’, this goes into a complicated neural network.

“The network responds by computing for each token how likely it is to be the next word. The first word of the response could be ‘The’ [idea of Enlightenment . . . ] or ‘The’ [new Renaissance idea . . . ], or even ‘Sure’ [I can explain . . . ]. It randomly picks one of the likely words, and then feeds the whole sentence, that is now one word longer, back into the beginning to predict the next word after that,” Steipe told University World News.

While similar in some ways to predictive spelling, which would suggest p, t, n or r after the letters ‘ca’, ChatGPT is vastly more powerful. It puts dialogue together word by word, recomputing the best choice after each one. The answers are never quite the same because each is made up on the spot. It will tailor its language to a child if asked to, or write in rhyme, in biblical style or as a college student would.
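
The loop Steipe describes (predict how likely each token is to come next, pick a likely one, append it, feed the longer text back in) can be sketched with a toy table of word-pair counts standing in for the neural network. The table and its counts are invented for illustration; a real model scores a vocabulary of hundreds of thousands of tokens at every step:

```python
import random

# Toy stand-in for the trained network: for each token, how often
# each other token followed it in some training text.
follow_counts = {
    "the":  {"cat": 3, "idea": 2},
    "cat":  {"ate": 4},
    "ate":  {"the": 5},
    "idea": {"of": 4},
    "of":   {"the": 3},
}

def next_token(token: str) -> str:
    """Sample a next token in proportion to how often it followed `token`."""
    options = follow_counts[token]
    return random.choices(list(options), weights=options.values())[0]

def generate(start: str, length: int) -> list:
    """Steipe's loop: predict, pick a likely word, append, feed it back in."""
    text = [start]
    for _ in range(length):
        text.append(next_token(text[-1]))
    return text

print(" ".join(generate("the", 6)))
```

Because the next word is sampled rather than fixed, two runs from the same prompt will usually differ – which is why, as Steipe notes, the answers are never quite the same.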

“That’s not at all how we think. We always have some kind of a hierarchical mechanism, where we have concepts and then we break them down into ideas. Then the ideas are broken down into words and new phrases, and nouns, and pronouns and whatever happens in language. ChatGPT takes it word by word.

“If there’s any grammar in there – and apparently there is, otherwise generating language would not be possible – it’s not explicit rules; it’s all implicit in the connections. It’s a kind of an interpretation of the human language and not something that is stored in the conventional way,” said Steipe.

“This is rather like how our thinking works as well. We also have single neurons that, taken individually, don’t mean very much. But the meaning can be represented in the many, many, many connections between the neurons in our brains.

“As ChatGPT and related large language models have shown, if you have enough of these artificial neurons, they work in surprising ways and can produce something emergent. That is, they are now more than the sum of the individual components; it becomes something that’s qualitatively new,” he said.


One workaround for professors concerned that ChatGPT undermines academic integrity takes advantage of the fact that its database ends in 2021: assign essay topics drawn from articles or books published since then.

However, according to Sarah E Eaton, professor of education at the University of Calgary in Alberta, Canada, this is a short-term fix; already Google’s Bard aggregates data from the internet in real time.

Drawing on my own experience as an English professor, I asked Allyson Miller, an academic integrity officer at Toronto Metropolitan University, whether reading journals is a way around the plagiarism problem. A professor could ask students to write how their experiences in middle or high school informed their understanding of Romeo and Juliet; this couldn’t be faked, I suggested, because ChatGPT wouldn’t know your high school social life.

To my surprise, Miller said: “It can be completely faked because it’s really good at creating stories. It will just take assumptions [found in the database, which includes innumerable stories about teenagers’ student life] about the high school social life of a young person and create a narrative from that.” The more detailed the prompt, and the more prompts, the more detailed the story it will spin.

(The lab journals Steipe has his students in medical school write are significantly less liable to be ghosted by ChatGPT because of the information they report about lab work.

“I have my bioinformatics students write lab journals all the time. It’s a skill they need for the laboratory because they need to document what they’ve been doing. I tell them that they are writing for themselves. Write in a way in which you record what you’ve been thinking, what was exciting or painful about what you are learning.”)

Programmes that purport to identify ChatGPT texts, such as Edward Tian’s GPTZero, are also not the answer to preserving academic integrity, at least at the moment. Designed over three days last Christmas, when Tian, a Princeton University computer science major, was home in Toronto, GPTZero measures two variables in texts.

The first, ‘perplexity’, is essentially a measure of the idiosyncratic aspects of the text – for example, word choice or punctuation – as compared to the texts the programme was trained on. The second, ‘burstiness’, captures the variation in sentence length: the mix of long and short sentences.

On 13 January, Tian told the BBC: “If you plot precisely over time, a human written article will vary a lot. It would go up and down, it would have sudden spikes.”
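
GPTZero’s exact formulas are not public, but a simplified stand-in for burstiness is easy to sketch: measure how much sentence length varies across a text. The function names and the sample sentences below are invented for illustration:

```python
import re
import statistics

def sentence_lengths(text: str) -> list:
    """Word counts per sentence, splitting naively on ., ! and ?"""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [len(s.split()) for s in sentences]

def burstiness(text: str) -> float:
    """Simplified stand-in for GPTZero's metric: the spread
    (standard deviation) of sentence lengths across the text."""
    return statistics.pstdev(sentence_lengths(text))

human = ("It rained. The storm tore shingles from every roof on the "
         "street and flooded the cellar by morning. We waited.")
flat = ("The weather was bad today. The storm was strong all day. "
        "The street was very wet.")
print(burstiness(human) > burstiness(flat))  # prints True
```

Human prose, with its sudden spikes, scores higher on this measure than the evenly paced sentences typical of machine-generated text.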

Although Tian claims that GPTZero has only a 2% false positive rate, Joyner said: “I wouldn’t put the results of any of those scans in front of our office of student integrity as evidence of student cheating because the false positive and false negative rate is so high.”

Nor will statements in programme handbooks or course outlines prohibiting the use of AI work, said Eaton and the other experts I interviewed.

At Eaton’s University of Calgary, “A First Response to Assessment and ChatGPT in Your Courses”, prepared by the Taylor Institute for Teaching and Learning, puts “Academic Misconduct” under the heading “Working with your teaching team”. Though academic misconduct is implicit in several parts of the section headed “Working with your students”, the university’s emphasis is on how to integrate ChatGPT and use it as a learning tool.

For her part, Allyson Miller told University World News that a blanket ban on AI programmes would quickly become unenforceable, even incoherent.

“You can put a statement in programme handbooks prohibiting the use of AI, but there’s murkiness around what exactly that means. When you’re writing an email on Gmail, it will predict your next word. There’s the AI programme Grammarly that students use to correct their grammar. That’s AI. So are translators.”

A more constructive response

What, then, do the experts suggest professors do now that ChatGPT and other such programmes are easily accessible?

The experts agree that concentrating on the issue of plagiarism is counterproductive. Text spinners, which produce paraphrases, and paper mills have been around for years, and long before the internet students had ways to cheat. They have been cribbing from Coles Notes for almost 65 years, for example.

Michelle Miller urges professors to think themselves out of the “adversarial battle of who can outsmart whom in the giant game of presumed cheating” and suggests that professors who require essays concentrate on the writing process.

Taking the role of an English professor for a moment, she told me: “I don’t want to read what they wrote 10 minutes to midnight the night before the essay is due. Instead, two months in advance, I want to see your ideas. I’m going to give feedback on it. A month before the due date, I want to see an outline and some of the citations you have come up with. Then, I want to see a rough draft that I will comment on. The final product, the essay, is then really a revision.”

Peter C Herman, an English professor at San Diego State University, offered another option: crafting the essay question so that it both requires higher-level analysis and is specific enough to confound ChatGPT. Writing earlier this month in the Times of San Diego, Herman gave the chatbot’s 1,000-word essay on the iguana a B minus.

The app’s essays on Shakespeare were solid Fs.

In one case he asked it: “Tell me about the relationship between Shakespeare’s The Taming of the Shrew and the Homily on Matrimony,” the homily being “a sermon about marriage that is really a warning against domestic violence”.

ChatGPT answered: “While there may be some similarities in the ideas presented in the two works, they are not directly related. The Homily on the State of Matrimony is a religious text and The Taming of the Shrew is a work of fiction.” All true, Herman averred: “But the point is to see how one part of the culture influences the other. That, ChatGPT cannot do.”

Both Joyner and Eaton think that ChatGPT can act as a guide to help students who have difficulty writing.

In Eaton’s example, professors would have students start their essays with an answer to a question, which can be thought of as an armature – but one that would first have to be fact-checked.

In a class on Shakespeare, for example, Eaton suggests, a professor could have students ask their favourite AI app what happens in Hamlet. Then they would bring the AI-generated essay into class, where students would work in groups.

“Let’s compare and contrast the essays. What did it get right? What did it get wrong? In order to answer those questions they have to know the plot of Hamlet, which means they would have to have read it. So, we’re not asking them to regurgitate what they may have learned or cribbed or seen in a movie, but rather to increase their fact-checking skills and their critical thinking skills. I suspect that that will become more of the focus of assessment rather than having students try and show us what they’ve learned through writing an essay.”

Since ChatGPT’s “confabulation rate”, as Eaton calls it, is 10% – when it doesn’t know an answer, it fills in the blanks as best it can – each of the experts I interviewed emphasised the need to train students to identify its errors.

“We’ve been talking for ages about technological literacy,” said Joyner. “AI is not going away and we are going to have to teach students how to interact with it, how to find its errors.”

Surprisingly, given that he was talking about a computer programme, Joyner also stressed the social aspects of ChatGPT and how it might be leveraged for some students.

“Let’s say you have an English class of 20 and six of them are really good friends. So, when they have a paper to do, they talk about it and help each other. And that’s wonderful for them. But what do you do about the loner, the student who is new to the school and doesn’t have any friends? Their learning is hindered by the lack of social structure around it. You can target that using such things as ChatGPT.

“I think it would be really nice if they [these programmes] can fill that role if you are learning on your own. Now, it can be the ‘person’ that you learn with, that you have a conversation with about the material.

“I like this analogy because it emphasises that ChatGPT has a lot of the same constraints you would have if you were learning with a peer whom you know. Your classmate doesn’t know everything about the subject. If they say something, it might be wrong. And ChatGPT is often wrong about things. You have to learn the ability to vet what other people say, put them in the context of your understanding, go out and do some fact-checking.”

Eaton made a similar point with regard to English-as-a-second-language students, reluctant writers and those with dysgraphia. They, she suggested, could use ChatGPT’s essay response as a sort of armature around which a more nuanced essay could be constructed.

When I asked Steipe, “Whither the essay?”, he began by again underscoring that the essay is often a proxy from which professors say they can assess thinking.

“Well, if you want to teach thinking,” he said, “we should not be looking at a proxy. We should be looking at the real thing. Perhaps the real thing is data represented in a list of bullet points or in a diagram or something like an idea map. What can be said about the topic? What supports it? What contradicts it? Where are things refuted? How do they interact with each other?

“Once we’ve established that, we can give it to a chatbot and it will give us a beautiful essay. And that, actually, is very wholesome. Because, as you know, Harry Frankfurt has talked a lot about this beautiful idea of ‘bullshit’ (On Bullshit, 2005) in which we eloquently express something that has no factual content. Being able to get away from that and going back to the argument, and clearly delineating the factual content is what’s important.”

Intelligent, but not sentient

At several points in our discussions, Steipe, Joyner and Eaton used the phrase ‘ChatGPT thinks’. When I asked if they were being metaphorical, they said no. They also said that while it can be said to ‘think’, it would be a mistake for us to think that it is sentient, a fallacy that is easy to fall into because we communicate with it in natural language, that is, by asking questions, and because it responds in prose; indeed, it even apologises if it cannot answer a question.

According to Steipe, ChatGPT passes a type of test renowned Stanford University computer science Professor Terry Winograd proposed to define thinking: disambiguation of pronouns. In the sentence, “The cat ate the mouse; it was tasty”, what was tasty? The cat or the mouse? The correct answer is the mouse. In the sentence, “The cat ate the mouse, it was hungry”, which was hungry? The correct answer is the cat, even though to get its value, the pronoun ‘it’ has to jump over the mouse, so to speak, to the cat at the beginning of the sentence.

“This is really quite striking for something that is just probabilistic reasoning. There’s a level of abstraction built in there that allows you to dereference pronouns, and it comes implicitly from the knowledge that is contained in the training data. This is why, rather than calling it ‘artificial intelligence’, I would think of it as ‘implicit’ intelligence.”

For Joyner, it comes down to how you define thinking.

“I think that in the way I would define thinking, it does do that – in the sense that it makes connections between things that previously may not have been connected. This is different from a Google search because it’s basically regurgitating from skimming. ChatGPT is doing something called thinking, but it falls far short of what defines consciousness or sentience.”

What would ChatGPT say?

ChatGPT could have written both Steipe’s and Joyner’s answers about whether it is thinking and about whether it is sentient.

To the question “Does ChatGPT think?” it answered: “As an artificial intelligence language model, ChatGPT does not have subjective experiences or consciousness in the way that humans do. However, ChatGPT is designed to simulate human-like thinking by processing and analysing text data, generating responses based on patterns and associations in the data. So, in that sense, you could say that ChatGPT engages in a type of ‘thinking’ when it generates responses to user input.”

And it answers the question “Is ChatGPT sentient?” this way:

“No, ChatGPT is not sentient. Sentience refers to the capacity for subjective experience, awareness and consciousness, which are traits that are currently only attributed to living beings. As an artificial intelligence language model, ChatGPT is a computer programme that is designed to process and generate text based on patterns and associations in large datasets, but it does not have subjective experiences, emotions or consciousness. While it can simulate human-like responses and carry out certain tasks, it is not capable of independent thought or self-awareness.”