How AI can help make HE campuses more diverse

An algorithm developed by a PhD student at the University of Pennsylvania and nine other researchers is able to ‘read’ college and university application essays and determine pro-social and leadership qualities, says a new study released in Science Advances, the journal of the American Association for the Advancement of Science.

The study, “Using artificial intelligence to assess personal qualities in college admissions”, could help solve a major problem that admissions officers have faced since 2020, when, because of the COVID-19 shutdown, graduating high school students were unable to take the Scholastic Aptitude Test (SAT) or the American College Testing (ACT) exam.

Without these standardised tests, which for almost a century have served as the de facto national college admissions tests in the United States, admissions officers had to put more weight on students’ high school grades and, especially, admissions essays.

These essays are time consuming to read, and the sheer volume of them – San Diego State University received 100,453 applications for this year’s incoming class, for example – is daunting. Even more importantly, admissions officers’ readings are subject to all-too-human ‘noise’.

“We know that colleges care about personal qualities such as teamwork, leadership and pro-social orientation. The question then is, how are they trying to measure these things? How are they collecting this information? It all comes down to a qualitative judgement by an expert [who has] years of expertise,” says Benjamin Lira Luttges, the PhD student who led the study and was lead writer on the paper.

“But human judgement is subject to noises and biases … [If] you’re an admissions officer, and you’re reading the 10th file right before lunch, you might not evaluate it in the same way as the first file after lunch.

“So what we set out to see is: if you could have a human with an artificial intelligence (AI) collaborating, could you reduce this kind of noise, because the algorithm doesn’t get tired? The algorithm uses the exact same judgement call for every essay; in that sense, there is no ‘noise’,” said Luttges.

The function of admission essays

Previous algorithms that ‘read’ admissions essays by focusing on their content and style, Luttges et al write, “have been shown to correlate more strongly with household income and SAT scores”.

Critics of SAT scores, like Dr Saul Geiser, a research associate at the Center for Studies in Higher Education at the University of California, Berkeley, have shown that the scores are biased in favour of white students.

In his contribution to the book The Scandal of Standardized Tests (2020), using data from the University of California system, Geiser concluded that “race has a large, independent, and growing statistical effect on students’ SAT-ACT scores after controlling for other factors. Race matters as much as, if not more than, family income and parents’ education in accounting for test score differences”.

Others have shown that since American public schools are mainly funded by property taxes, students who live where property taxes are higher, that is, in wealthier areas, attend better equipped high schools and are taught by better paid teachers. These students also come from families financially able to enrol them in courses designed to teach how to take the SAT.

A century ago, Luttges and his co-authors note, the antisemitic ‘holistic’ admissions practices instituted by universities such as Columbia University used essays to limit the number of Jewish students enrolling.

These essays, according to the sources Luttges and his colleagues quote, were used by admissions officers to judge what they considered to be “good character”, which was defined as including “students who come from homes of refinement” – and not, as historians have shown, from the Lower East Side of Manhattan, where the majority of New York’s Jews then lived.

Ironically, today, university and college admissions officers look to these essays as part of their ‘holistic’ judgement of an applicant, because the essays are believed to give students a chance to explain, through their life experiences, why they are a good fit for a particular higher education institution.

This function of applicants’ essays was recognised last June in the United States Supreme Court decision that ruled that affirmative action in college and university admissions was unconstitutional.

Writing for the majority, Chief Justice John Roberts noted: “At the same time, as all parties agree, nothing in this opinion should be construed as prohibiting universities from considering an applicant’s discussion of how race affected his or her life, be it through discrimination, inspiration or otherwise,” meaning that the information gleaned from these essays provides colleges and universities with a constitutional means of creating a diverse student body.

Fine-tuning a language model

The algorithm Luttges and his team developed is based on RoBERTa, an open-source language model developed by Facebook AI. RoBERTa was pre-trained on over 160 GB of text – including books, news and English Wikipedia entries – to predict words and sentences, not unlike ChatGPT.

This model was then fine-tuned (further trained) to produce probability ratings for personal qualities, using ratings made by trained research assistants (RAs) and experienced admissions officers.

They read 3,131 150-word essays in which applicants briefly elaborated on their extracurricular or work experiences, coding each response on a binary scale (0 for ‘No’ and 1 for ‘Yes’). These ratings were the data used to fine-tune RoBERTa.

RoBERTa was trained by having it look at a large amount of text from the internet and guess a masked word (that is, masked language modelling) and pick which sentence was likely to follow another (that is, next sentence prediction).
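
The masked-word objective can be illustrated in miniature. The sketch below is not the study’s code: it predicts a hidden word from its two neighbours using simple co-occurrence counts over a few invented sentences, whereas RoBERTa performs the same task with a neural network trained on over 160 GB of text.

```python
# Toy illustration of the masked-word objective: guess a hidden word from
# the words around it, using co-occurrence counts from a tiny invented corpus.
from collections import Counter

corpus = [
    "i volunteered at the local food bank",
    "i volunteered at the animal shelter",
    "she worked at the food bank every weekend",
]

# Count which words appear between each (left neighbour, right neighbour) pair.
context_counts = {}
for sentence in corpus:
    words = sentence.split()
    for i in range(1, len(words) - 1):
        key = (words[i - 1], words[i + 1])
        context_counts.setdefault(key, Counter())[words[i]] += 1

def predict_masked(left, right):
    """Guess the most likely word between two context words, or None."""
    counts = context_counts.get((left, right))
    return counts.most_common(1)[0][0] if counts else None

print(predict_masked("food", "every"))  # -> bank
```

Scaled up by many orders of magnitude, this kind of fill-in-the-blank training is what gives the model its statistical grip on language.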

This pre-training gives the model a basic ‘understanding’ of language, which allows it to express text as numbers. “We can then use that pre-trained model to fine-tune it for a particular task, such as classification, in this case,” said Luttges.
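
In miniature, fine-tuning for classification means turning each essay into numbers and training a classifier to map those numbers to a 0-to-1 probability. The sketch below is a stand-in, not the authors’ pipeline: it uses bag-of-words counts and logistic regression in place of RoBERTa, on invented essays with invented 0/1 labels.

```python
# Minimal stand-in for the fine-tuning step: represent each essay as numbers
# (a bag-of-words vector), then train a logistic-regression classifier on
# human 0/1 labels so it outputs a probability that a quality is present.
import math

train = [
    ("i volunteered to help elderly patients at the clinic", 1),
    ("i organised a fundraising drive for a children's charity", 1),
    ("i practised the piano for two hours every day", 0),
    ("i collected rare stamps from around the world", 0),
]

vocab = sorted({w for text, _ in train for w in text.split()})

def vectorise(text):
    """Turn text into a vector of word counts over the training vocabulary."""
    words = text.split()
    return [words.count(w) for w in vocab]

# Train logistic regression with plain gradient descent.
weights = [0.0] * len(vocab)
bias = 0.0
for _ in range(500):
    for text, label in train:
        x = vectorise(text)
        z = bias + sum(w * xi for w, xi in zip(weights, x))
        p = 1 / (1 + math.exp(-z))          # sigmoid: squash to 0..1
        err = p - label
        weights = [w - 0.1 * err * xi for w, xi in zip(weights, x)]
        bias -= 0.1 * err

def prosocial_probability(text):
    """Probability (0 to 1) that the toy 'pro-social' quality is present."""
    x = vectorise(text)
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1 / (1 + math.exp(-z))

print(round(prosocial_probability("i volunteered at a charity"), 2))
```

The real system replaces the bag-of-words vector with RoBERTa’s learned numerical representation of the essay, but the shape of the task – numbers in, a probability per quality out – is the same.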

Assessing correlations

The correlations between the seven personal qualities RAs found in the essays and the ratings determined by the algorithm were strong, meaning the algorithm largely agreed with the human raters. For example, for pro-social purpose (helping others) the correlation was 0.86, for leadership it was 0.81 and for learning 0.77, while for goal pursuit, intrinsic motivation, teamwork and perseverance it ranged from 0.73 down to 0.59.

Correlations measure the association between two sets of numbers on a -1 to +1 scale. A figure such as 0.86 implies that the algorithm tended to give high ratings to essays the humans gave high ratings to, and vice versa.
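
The figure being reported here is Pearson’s r, which can be computed directly from the two sets of ratings. The sketch below uses invented ratings for illustration, comparing binary human codes (0/1) with algorithm-style probabilities.

```python
# Pearson's r: covariance of two rating lists divided by the product of
# their standard deviations, giving a value between -1 and +1.
import math

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

human =     [1, 0, 1, 1, 0, 0, 1, 0]        # invented 0/1 human codes
algorithm = [0.9, 0.2, 0.8, 0.7, 0.1, 0.4, 0.9, 0.2]  # invented probabilities

print(round(pearson_r(human, algorithm), 2))  # -> 0.95
```

Because the algorithm’s high probabilities line up with the essays humans coded 1, the correlation comes out close to +1, just as in the study’s strongest qualities.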

The correlations between the algorithm and the admissions officers’ ratings were 0.80 for pro-social purpose and 0.73 for leadership, while the others ranged from 0.65 down to 0.05.

According to Luttges, the most plausible explanation for the difference between the ratings by the RAs and the admissions officers is that RAs convened and trained together to achieve consensus; this was impossible to do with the admissions officers.

Word clouds

The most visually arresting part of “Using artificial intelligence to assess personal qualities in college admissions” is the set of ‘word clouds’ containing the words the computer model tends to attend to when producing ratings for personal qualities. The cloud for ‘pro-social purpose’, for example, contains words such as charity, children, helping, AIDS, patients, orphan, volunteer, elderly, cancer, volunteer service and fundraising.

The words in these word clouds are, Luttges explained, the key to understanding what the algorithm was ‘looking at’ when extracting personal qualities from the essays.

By being trained to identify words that indicate leadership, perseverance, teamwork or intrinsic motivation, the algorithm extracts these values and activities from the essays irrespective of other aspects of the writing and assigns each essay a number that is, for the most part, unrelated to demographics.

The results produced by Luttges’ algorithm are, therefore, insulated from signals such as theme, syntax, grammar and other aspects of the mechanics of language that have been found to indicate the writer’s socio-economic status – and which can trigger biases in the reader.

Luttges et al cite two articles that show important differences between how students from wealthier families write and how those from disadvantaged families write.

One is AJ Alvero et al’s study, “Essay content and style are strongly related to household income and SAT scores: Evidence from 60,000 undergraduate applications” (Science Advances, 2021), which found that students from wealthier families tend to write about certain essay topics (for example, human nature), whereas disadvantaged students tend to write about others (for example, tutoring groups).

Accordingly, while a high school student from a higher income level might write about leadership by talking about teaching swimming at a YMCA summer camp, a student from a lower socio-economic background who grew up in an American ghetto might have written about how he taught his friends to sink a basket (that is, play basketball).

Or, as Luttges put it, one student might show teamwork by writing about their polo team while another might do so by writing about helping their parents to pay rent.

Contributing to diversity

While it would be a mistake to assume that one can read backwards from the algorithm’s scores and deduce the class or racial-ethnic background of any one student, or even a group of students, Luttges’ algorithm can aid in producing more diverse student bodies, as he explained using a thought experiment.

“The model takes text as an input and produces a number between zero to one for each personal quality, indicating the probability that the personal quality is present in the text, so you get seven numbers out,” said Luttges.

“These computer-generated scores were uncorrelated with demographics. That is, there were no meaningful differences in the scores given by the algorithm to different demographic groups. Thus, personal qualities represent information useful to colleges (they are valued by colleges, and students with higher personal qualities are more likely to graduate) that is, unlike standardised test scores, mostly independent of demographics.

“If I were to start my own college and I were to accept only people with the highest SAT scores, I would get a certain kind of demographic mix that would likely be skewed towards a whiter and more economically advantaged group.

“If I were in a different universe and started a college and asked you to write 100 words about your activities outside of school and I am going to pick the ones with the highest computer-generated scores for personal qualities, this one will have a more diverse campus,” explained Luttges.
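
Luttges’ thought experiment can be made concrete with a toy simulation, with all applicants and scores invented for illustration: admit the top three by SAT score, then the top three by computer-generated personal-quality score, and compare the two incoming classes.

```python
# Toy version of the thought experiment: the same applicant pool, ranked
# two different ways, produces two different admitted classes.
applicants = [
    # (name, sat, quality_score, background) -- all invented
    ("A", 1550, 0.55, "advantaged"),
    ("B", 1510, 0.40, "advantaged"),
    ("C", 1480, 0.85, "advantaged"),
    ("D", 1300, 0.90, "disadvantaged"),
    ("E", 1260, 0.80, "disadvantaged"),
    ("F", 1240, 0.35, "disadvantaged"),
]

def admit_top3(key_index):
    """Admit the three applicants ranked highest on the chosen score."""
    ranked = sorted(applicants, key=lambda a: a[key_index], reverse=True)
    return [a[0] for a in ranked[:3]]

by_sat = admit_top3(1)      # -> ['A', 'B', 'C']: all from one background
by_quality = admit_top3(2)  # -> ['D', 'C', 'E']: a mixed class
print(by_sat, by_quality)
```

Because the invented quality scores are spread across both groups while the SAT scores cluster by background, the quality-ranked class comes out mixed – the pattern the study reports for its demographics-independent scores.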