Tuesday, May 27, 2025

The Dismal Failure of LLMs as EV Search Aids

For six years, I've been in the market for an electric compact SUV. I haven't yet found one with the features I want at a price I'm willing to pay. (As a rule, missing features have been a bigger impediment than price.) My last review took place about a year ago, so I decided it was time to look again.

This time I experimented with seven LLM-based chatbots as search assistants. I gave each the following prompt:

List all the fully electric compact SUVs for sale in the United States that have all-wheel drive, an openable moonroof or sunroof, an all-around (i.e., 360-degree) camera, an EPA range of at least 250 miles, and are no more than 180 inches in length.

My assistants were the unpaid versions of these systems (listed in the order in which I happened to test them):

The results were eye-opening. None of the systems listed the only vehicle that fulfills the criteria (the Volvo EX40), and all but one listed vehicles that violate the requirements. Worse performance is hard to imagine. The false positives waste your time pursuing dead ends, while the false negatives imply that no qualified EVs exist, even though one does.

Complete failure was averted by one system (You.com) mentioning, almost as an afterthought, the Volvo XC40 Recharge. That car was renamed the EX40 last year, but searching for the old name will quickly lead you to the new name, and that will finally put you on the trail of the only car that satisfies my criteria.

[Update 5/28/25: Per the comments on this post, paid versions of at least ChatGPT and Claude produce much better results than the ones I experienced. I've added links to the conversations I had with the various unpaid chatbots in discussions below.] 

The chatbots failed in a variety of ways (the links are to my conversations with the chatbots):

  • ChatGPT said "here are the models that meet all requirements," then listed five EVs and their specs. For four of the five, the displayed specs were contrary to the requirements, meaning ChatGPT "knew" (to the extent that LLMs "know" things) that these cars shouldn't have been listed. For the fifth car, one of the specs it listed was simply incorrect. There was no mention of the Volvo EX40.
  • Perplexity's behavior was similar to ChatGPT's: it claimed to list cars fulfilling the criteria, then "knowingly" listed ones that don't. The twist was that two of the three EVs Perplexity listed--the Rivian R2 and the Toyota C-HR EV--don't exist yet. (If they did, the Rivian would exceed the length constraint, and it looks like the Toyota would likely fail the openable-roof test.) Perplexity made no mention of the Volvo EX40.
  • Claude listed only one car, Tesla's Model Y, saying it "clearly meets all your requirements. It offers ... a panoramic glass roof ... and measures approximately 187 inches in length. However, this exceeds your 180-inch length requirement." Its dithering on the car's length was disappointing, and its failure to distinguish a panoramic glass roof from one that opens was worse. There was no mention of the Volvo EX40.
  • Gemini's response began with "Here's a breakdown of current and upcoming electric compact SUVs and how they stack up against your requirements," which was not what I had asked for. It listed five cars and their specs, noting for each vehicle the specs that violated my criteria. It ultimately concluded, "there may not be any currently available fully electric compact SUVs that precisely meet all conditions." Except there is, of course.
  • Copilot produced a refreshingly short response featuring a very nicely formatted table that summarized the two EVs it said satisfied my requirements. Neither does, though Copilot showed no specs that reflect that, so it may not have "known" it was wrong. There was no mention of the Volvo EX40.
  • You.com started with this rather confusing statement: "Based on the search results provided, there is no direct information listing fully electric compact SUVs in the United States that meet [your] criteria." It then launched into an explanation of how to perform my own search (<eyeroll/>). Then came a surprise. It introduced the Volvo XC40 Recharge and showed how it satisfied my requirements, though it seemed unsure of itself: "The Volvo XC40 Recharge appears to meet all your criteria. However, I recommend verifying [everything]." Ultimately, You.com found the rabbit in the hat and pulled it out, but its response was confusing and disjointed, and it referred to the rabbit by an obsolete name. 
  • Mistral followed Copilot's lead in producing a short, clear response built around a well-formatted table of information that was often incorrect or inconsistent with my requirements. There was no mention of the Volvo EX40.

As a group, the systems produced responses rife with claims that were incorrect, inconsistent, and/or incomplete. The last of these is the most disturbing. Six of the seven systems didn't mention the only SUV fulfilling the requirements. The one that did buried it beneath an explanation of how to do your own search, and even then it referred to that car by an outdated name.
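The requirements in my prompt are mechanical enough that checking a listed vehicle against them takes only a few lines of code. Here's a minimal Python sketch of such a check. The `Vehicle` record, the `violations` function, and the two sample entries are hypothetical, invented purely for illustration; the specs are not real data for any actual car.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vehicle:
    """Hypothetical record holding only the specs named in the prompt."""
    name: str
    all_wheel_drive: bool
    openable_roof: bool      # a fixed panoramic glass roof does not count
    camera_360: bool
    epa_range_miles: int
    length_inches: float


def violations(v: Vehicle) -> list[str]:
    """Return the requirements from the prompt that this vehicle fails."""
    failed = []
    if not v.all_wheel_drive:
        failed.append("all-wheel drive")
    if not v.openable_roof:
        failed.append("openable roof")
    if not v.camera_360:
        failed.append("360-degree camera")
    if v.epa_range_miles < 250:
        failed.append("EPA range >= 250 miles")
    if v.length_inches > 180:
        failed.append("length <= 180 inches")
    return failed


# Made-up entries standing in for a chatbot's answer (not real specs).
claimed = [
    Vehicle("Hypothetical A", True, True, True, 260, 175.0),
    Vehicle("Hypothetical B", True, False, True, 300, 187.0),
]

for v in claimed:
    print(v.name, violations(v) or "meets all requirements")
```

A response that displays specs and also claims every listed car qualifies could be checked for internal consistency this way. The hard part, of course, is getting correct specs in the first place.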

There is a lot of work to be done before LLM chatbots are reliable search assistants.

Monday, May 26, 2025

Three Experiences with Video and AI

Finding an Old TV Episode

I recently found myself wondering about a TV episode I saw decades ago. I had only the haziest memory of it, so I threw this at Gemini:

I'm looking for an episode from the original TV show The Outer Limits or The Twilight Zone. The story is about a man with a robotic hand that he has to add fingers to in order to increase its ability to help him figure out what is happening. Do you know this episode? 

Gemini did, correctly identifying it as "Demon with a Glass Hand" from the 1960s series The Outer Limits. Googling for that yielded a link to the episode at The Internet Archive, which I downloaded and added to my Plex server.

Less than 15 minutes elapsed between the time I thought about the episode and the time I had it in my video library. It's not the best television content in the world, but I marvel at how easily I was able to track down and watch a show from 60 years ago based on only a very sketchy memory.

Upscaling the Episode

"Demon with a Glass Hand" isn't terribly compelling, but that doesn't mean it shouldn't look good. Unfortunately, 1963 TV was SD, and these days we're used to a lot better resolution than the 496 x 368 I got from the Internet Archive. 

Earlier this year, I purchased a copy of VideoProc Converter AI to experiment with upscaling low-resolution 8mm family videos I'd had digitized. The results were impressive on everything except faces, which the upscaling process tended to turn into grotesque caricatures of the people behind them. But hope springs eternal, so I decided to see what VideoProc could do with "Demon with a Glass Hand." 

Invoking the program yielded a message excitedly telling me that a new version was, you know, faster and better, and I should upgrade immediately. It was free, so I did, but I didn't expect that V3 would be noticeably better than V2. When was the last time a program upgrade lived up to its PR?

In this case, I think it does. Check it out:

Upscaling is an interesting challenge, because it involves fabricating information (pixels) not present in the original images. Simple interpolation doesn't do a very good job, and VideoProc's V2 AI-based approach fell apart on faces. V3's faces aren't perfect, but I think they're good enough for casual viewing, and that's an impressive accomplishment.
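To see why simple interpolation falls short, here's a minimal sketch of bilinear upscaling in plain Python. It illustrates the general technique only. It is not what VideoProc does, and the function name `upscale_bilinear` and the tiny test image are made up for the example.

```python
def upscale_bilinear(img, factor):
    """Upscale a grayscale image (a list of rows of numbers) by an integer factor."""
    src_h, src_w = len(img), len(img[0])
    dst_h, dst_w = src_h * factor, src_w * factor
    out = []
    for y in range(dst_h):
        # Map the output row back to a fractional source row.
        sy = y * (src_h - 1) / (dst_h - 1) if dst_h > 1 else 0.0
        y0 = int(sy)
        y1 = min(y0 + 1, src_h - 1)
        fy = sy - y0
        row = []
        for x in range(dst_w):
            # Map the output column back to a fractional source column.
            sx = x * (src_w - 1) / (dst_w - 1) if dst_w > 1 else 0.0
            x0 = int(sx)
            x1 = min(x0 + 1, src_w - 1)
            fx = sx - x0
            # Blend the four nearest source pixels by distance.
            top = img[y0][x0] * (1 - fx) + img[y0][x1] * fx
            bottom = img[y1][x0] * (1 - fx) + img[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out


# A made-up 2 x 2 "image" upscaled to 4 x 4.
tiny = [[0, 100],
        [100, 200]]
big = upscale_bilinear(tiny, 2)
for row in big:
    print([round(p, 1) for p in row])
```

Every output pixel is a weighted average of pixels already in the source, so the result is bigger but contains no new detail. Edges get smoother rather than sharper. An AI-based upscaler has to invent plausible detail instead, which is presumably why faces are where things go wrong.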

Looking Forward

A few days ago, Andrei Alexandrescu brought my attention to this reddit post featuring a synthesized video by Ari Kuschnir using Google's Veo. The clip takes advantage of Veo's new ability to generate audio tracks, including dialogue and singing. I find the clip pretty amazing. There are legitimate questions about how Veo was trained and how its output could be used for ill, but I prefer to focus on the technical progress it represents and the creative promise it offers. 

Incongruously, I was reminded of the Veo demo after viewing another old TV episode I barely remembered, one Gemini identified from this prompt:

I'm now thinking of a different episode, again from The Twilight Zone or The Outer Limits. It involves a man who goes to a store to custom-order a woman. He chooses eye color, etc. Any idea which episode this is?

Again Gemini knew what I was looking for ("I Sing the Body Electric" from the original The Twilight Zone), Google found a downloadable link to it, and I had it on my Plex only a few minutes after issuing the query. 

The episode is quite terrible (much worse than "Demon with a Glass Hand"), but I liked the hopeful ending. Not the part summarizing grandma's data collection and sharing policy ("Everything you ever said or did, everything you ever laughed or cried about, I'll share with the other machines"), but the optimistic sentiment behind Rod Serling's closing voiceover:

Who's to say at some distant moment, there might be an assembly line producing a gentle product in the form of a grandmother, whose stock in trade is love?

For countries with an aging population requiring increasingly attentive personal care, I'd expect that gentle, loving robots rolling off an assembly line could be a pretty attractive prospect.