The Holy Grail of learning outcomes

The search for the Holy Grail – a one-size-fits-all test to measure learning gains – started in the US with the Collegiate Learning Assessment, but the Organisation for Economic Co-operation and Development (OECD) wants to take it global.

In 2008 the OECD began a process to assess whether it could develop a student assessment test for use internationally. A project emerged: the Assessment of Higher Education Learning Outcomes (AHELO) programme.

AHELO would assess the feasibility of capturing learning outcomes that are valid across cultures and languages. The effort was informed in part by the OECD’s success in developing the Programme for International Student Assessment (PISA) – a widely accepted survey of the knowledge and skills essential for students near the end of their compulsory education years.

The proclaimed objective of the ongoing AHELO feasibility study is to determine whether an international assessment is “scientifically and practically possible”. To make this determination, the organisers developed a number of study ‘strands’.

One of the most important is the Generic Strand, which depends on the administration of a version of the Collegiate Learning Assessment (CLA) to gauge ‘generic skills’ and competences of students at the beginning and close to the end of a bachelor degree programme.

This includes the desire to measure a student’s progression in “critical thinking, the ability to generate fresh ideas, and the practical application of theory”, along with “ease in written communication, leadership ability, and the ability to work in a group etc.”.

OECD leaders claim the resulting data will be a tool for the following purposes:
  • Universities will be able to assess and improve their teaching.
  • Students will be able to make better choices in selecting institutions – assuming that the results are somehow made publicly available.
  • Policy-makers will be assured that the considerable amounts spent on higher education are spent well.
  • Employers will know better whether the skills of graduates entering the job market match their needs.

Between 10,000 and 30,000 students in more than 16 countries will take part in the administration of the OECD’s version of the CLA. Full administration at approximately 10 universities in each country is scheduled for 2011 through to December 2012.

AHELO’s project leaders admit the complexity of developing learning outcome measures, for example: how to account for cultural differences and the circumstances of students and their institutions? “The factors affecting higher education are woven so tightly together that they must first be teased apart before an accurate assessment can be made,” notes one AHELO publication.

By March 2010, and at a cost of €150,000 (US$200,000) each, the ministries of education in Finland, Korea, Kuwait, Mexico, Norway and the United States agreed to commit a number of their universities to participate in the Generic Strand (that is, the OECD version of the CLA) of the feasibility study.

Validity in question

However, the validity and value of the CLA is very much in question and the debate over how to measure learning outcomes remains contentious.

Many institutions, including most major US research universities, view with scepticism the methodology used by the CLA and its practical applications in what are large institutions, home to a great variety of disciplinary traditions.

A product of the Council for Aid to Education, the CLA is a written test focusing on critical thinking, analytic reasoning, written communication and problem-solving. It is administered to small random samples of students, who write essays and memoranda in response to test material they have not previously seen.

The council is technically a non-profit but has a financial stake in promoting the CLA, which has emerged as its primary product – much like the Educational Testing Service, which hawks the SAT.

Prominent higher education researchers have challenged the validity of the CLA on a number of grounds. For one, CLA and SAT scores are highly correlated: after controlling for SAT scores, the variance in student learning outcomes that the CLA captures is remarkably small. Most institutions’ value-added scores will simply fall within the expected range, statistically indistinguishable from one another. Hence, why bother with the CLA?

The CLA results are also sample-dependent. Specifically, there is a large array of uncontrollable variables related to students’ motivation to participate in the test and to do well on it. Students who take the CLA are volunteers, and their results have no bearing on their academic careers. How does one motivate students to sit through the entire time allotted for essay writing and to take the chore seriously?

Some institutions provide extra credit for taking the test, or offer rewards for its completion. Even so, self-selection bias may be considerable. There are also concerns that institutions may try to game the test by selecting high-achieving final-year students. High-stakes testing is always subject to gaming, and there is no way to prevent institutions from cherry-picking – purposefully selecting students who will drive up learning gain scores.

Other criticisms centre on the assumption that the CLA has fashioned a test of agreed-upon generic cognitive skills that is equally relevant to all students. But recent findings suggest that CLA results are, to some extent, discipline-specific.

As noted, because of the cost and difficulty of evaluating individual student essays, the design of the CLA relies upon a rather small sample size to make sweeping generalisations about overall institutional effectiveness. It provides very little, if any, useful information at the level of the major.

To veterans in the higher education research community, the ‘history lessons’ of earlier attempts to rank institutions on the basis of ‘value-added’ measures are particularly telling.

There is evidence that all previous attempts at large-scale or campus-wide assessment in higher education on the basis of value-added measures have collapsed, in part due to the observed instability of the measures. In many cases, to compare institutions (or rank institutions) using CLA results merely offers the ‘appearance of objectivity’ that many stakeholders of higher education crave.

For the purposes of institution-wide assessment, especially for large, complex universities, we surmise that the net value of the CLA’s value-added scheme would be at best unconstructive, and at worst would generate inaccurate information used for actual decision-making and rankings.

The alternative

So what's the alternative? In a new study published in the journal Higher Education, we examine the relative merits of student experience surveys in gauging learning outcomes by analysing data from the Student Experience in the Research University (SERU) Consortium and Survey, based at the Center for Studies in Higher Education at the University of California, Berkeley.

There are real problems with student self-assessments, but there is an opportunity to learn more than what is offered in standardised tests.

Administered since 2002 as a census of all students at the nine undergraduate campuses of the University of California, the SERU survey generates a rich data set on student academic engagement, experience in the major, participation in research, civic and co-curricular activities, time use and overall satisfaction with the university experience.

The survey also provides self-reported gains on multiple learning outcome dimensions by asking students to retrospectively rate their proficiencies when they entered the university and at the time of the survey. SERU results are then integrated with institutional data.

In 2011, the SERU survey was administered at all nine University of California undergraduate campuses, and to students at an additional nine major research universities in the US. A SERU-International Consortium has recently been formed with six ‘founding’ universities in China, Brazil, The Netherlands and South Africa.

The technique of self-reported categorical gains (for example, ‘a little’, ‘a lot’) typically employed in student surveys has been shown to have dubious validity compared to ‘direct measures’ of student learning. The SERU survey is different. It uses a retrospective post-test design for measuring self-reported learning outcomes, which yields more valid data. In our exploration of those data, we show connections between self-reports and student GPA and provide evidence of strong face validity of learning outcomes based on these self-reports.

The overall SERU survey design has many other advantages, especially in large, complex institutional settings. It includes the collection of extensive information on academic engagement as well as a range of demographic and institutional data.

Without excluding other forms of gauging learning outcomes, we conclude that, designed properly, student surveys offer a valuable and more nuanced alternative for understanding and identifying learning outcomes in the university environment.

But we also note the tension between the accountability desires of governments and the needs of individual universities that should focus on institutional self-improvement. One might hope that they would be synonymous. But how to make ministries and other policy-makers more fully understand the perils of a silver-bullet test tool?

Blunt tool

The CLA is a blunt tool, creating questionable data that serve immediate political ends. It seems to ignore how students actually learn and the variety of experiences among different sub-populations. Universities are more like large, cosmopolitan cities full of a multitude of learning communities than like a small village with observable norms.

In one test run of the CLA, a major research university in the US received data showing that its students had actually experienced a decline in their academic knowledge – a negative return on their education? That seems highly unlikely.

But how to counteract the strong desire of government ministries, and international bodies like the OECD, to create broad standardised tests and measures of outcomes? Even with the flaws noted, the political momentum to generate a one-size-fits-all model is powerful. The OECD’s gambit has already captured the interest and money of a broad range of national ministries of education and the US Department of Education.

What are the chances the ‘pilot phase’ will actually lead to a conclusion to drop the pursuit of a higher education version of PISA? Creating an international ‘gold standard’ for measuring learning outcomes appears too enticing, too influential and too lucrative for that to happen – although we obviously cannot predict the future.

It may very well be that the data and research offered in our study using student survey responses will be viewed as largely irrelevant in the push and pull for market position and political influence. Governments love to rank, and this might be one more tool to help encourage institutional differentiation – a goal of many nation-states.

But for universities that desire data for making actionable improvements, we argue that student surveys, if properly designed, offer one of the most useful and cost-effective tools. They also offer a means to combat simplistic rankings generated by the CLA and similar tests.

* John Aubrey Douglass, Gregg Thomson and Chun-Mei Zhao are researchers with the Center for Studies in Higher Education at the University of California, Berkeley. This is an edited version of their recent article, "Searching for the Holy Grail of Learning Outcomes". It is republished with permission.