Sponsored Article

How LLM evolution is driving AI applications

This article is promoted by the IEEE Computer Society.

As artificial intelligence (AI) continues to grow in accessibility, companies are seeking ways to leverage large language models (LLMs), vision language models (VLMs) and other algorithmic approaches for more cost-effective and efficient deployment of the technology. In fact, this year’s IEEE Computer Society Technology Predictions Report indicated that new forms of LLMs will level the AI playing field.

New research drives LLM development

Consider, for instance, that today’s open-source communities are creating ways for developers to tap efficiently into successful models, and cloud services are providing LLM solutions with integrated prompt engineering. Add to that the fact that hardware will continue to evolve specifically to run LLMs optimally, and that model compression will continue to advance. All of these factors drive an intensive research focus on language models as the industry works to scale.

Case in point: This year’s Computer Vision and Pattern Recognition Conference (CVPR), which takes place from 11-15 June in Nashville, Tennessee, in the United States, features a number of oral presentations focused on ways in which language models are evolving and how those developments will contribute to the broader commercialisation of technology.

Enabling a more human experience

For example, in the CVPR awards finalist paper, “Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models”, a team from the Allen Institute for AI and the University of Washington introduces Molmo (Multimodal Open Language Model), a new family of state-of-the-art, open VLMs.

According to the paper, these VLMs leverage a highly detailed image caption dataset collected entirely from human annotators using speech-based descriptions.

As the researchers share: “The success of our approach relies on careful modelling choices, a well-tuned training pipeline, and, most critically, the quality of our newly collected datasets. Our best-in-class 72B model not only outperforms others in the class of open weight and data models but also outperforms larger proprietary models.”

Working toward efficiencies through generalisation

A team from Apple and Georgia Tech is also advancing LLMs for industry via their CVPR research, “From Multimodal LLMs to Generalist Embodied Agents: Methods and Lessons.” In this work, the team explores embodied AI, games, UI control, and planning – tasks that extend beyond typical training domains.

To do so, they transform a Multimodal Large Language Model (MLLM) into a Generalist Embodied Agent (GEA). While the GEA achieves strong generalisation performance on unseen tasks, the team points out that additional work in extending reinforcement learning may support better performance on specific tasks.

Remembering spatial relations

Another CVPR paper explores how MLLMs address visual-spatial intelligence and memory. In “Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces”, a team from New York University, Yale University and Stanford University presents a benchmark demonstrating that MLLMs exhibit competitive – though subhuman – visual-spatial intelligence, suggesting that they can learn to “think in space”.

The team found that linguistic reasoning techniques were less effective for the models than cognitive maps; this finding provides a key insight for supporting more intelligent AI in commercial applications.

Generating more efficient computational performance

On a similar note, the May issue of IEEE Transactions on Pattern Analysis and Machine Intelligence features the paper, “Uni-MoE: Scaling Unified Multimodal LLMs With Mixture of Experts”. This work unveils a pioneering attempt to develop a unified MLLM with a Mixture of Experts (MoE) architecture as a way to enable more efficient computational performance.

A team of researchers from the Harbin Institute of Technology, Alibaba Group, and Hong Kong University of Science and Technology reported results showing a significant reduction in performance bias in handling mixed multimodal datasets, alongside improved multi-expert collaboration and generalisation. As the paper notes, the research team hopes this work could “spark the research of utilising the MoE architecture to scale up MLLMs”.
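The efficiency gain behind an MoE architecture can be illustrated with a minimal sketch. This is not Uni-MoE itself – the gating weights, expert functions and top-k setting below are hypothetical placeholders – but it shows the core idea: a small gating network scores all experts, yet only the top-k are actually run, so compute stays roughly constant as more experts are added.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x, gate_weights, experts, top_k=2):
    """Sparse MoE layer: route input x to the top_k highest-scoring
    experts only; the remaining experts stay idle, saving compute.
    Returns the weighted output and the indices of the chosen experts."""
    # Gating network: one linear score per expert.
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in gate_weights]
    probs = softmax(scores)
    # Keep only the top_k experts and renormalise their weights.
    chosen = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    out = [0.0] * len(x)
    for i in chosen:
        y = experts[i](x)  # only the selected experts run
        weight = probs[i] / norm
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out, chosen
```

In a real MoE transformer the gate and experts are learned networks and routing happens per token, but the sparse-activation principle is the same: the model's total parameter count grows with the number of experts while the per-input compute grows only with top_k.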

The near-term for LLMs

LLMs continue to be a focus of technology news, and as research shows, recent advancements are powering new potential. Research initiatives continue to drive down computational and energy requirements and to train models on carefully curated data sets, and the industry can expect continued breakthroughs in these areas this year.

As the IEEE Computer Society Technology Predictions Report pointed out, more efficient LLMs may just be the great equaliser for AI. With these innovative approaches enabling lightweight and highly specialised LLMs, industry-specific versions for fields like healthcare, finance and law will emerge, improving accuracy and reducing computational overhead.

What’s more, with the efficiency gains this work will enable, custom language models will become available to companies of all sizes, helping to drive a future of accessible and scalable AI.

For more information on the evolution of LLMs and the latest research driving the future of AI, visit the IEEE Computer Society Digital Library (CSDL) at computer.org/csdl.
