The View from Aristeia: The Dismal Failure of LLMs as EV Search Aids

Tuesday, May 27, 2025

The Dismal Failure of LLMs as EV Search Aids

For six years, I've been in the market for an electric compact SUV. I haven't yet found one with the features I want at a price I'm willing to pay. (As a rule, missing features have been a bigger impediment than price.) My last review took place about a year ago, so I decided it was time to look again.

This time I experimented with seven LLM-based chatbots as search assistants. I gave each the following prompt:

List all the fully electric compact SUVs for sale in the United States that have all-wheel drive, an openable moonroof or sunroof, an all-around (i.e., 360-degree) camera, an EPA range of at least 250 miles, and are no more than 180 inches in length.
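The prompt amounts to a conjunction of hard constraints, which can be written as a simple predicate. The sketch below is only an illustration: the `Vehicle` fields, the `meets_criteria` helper, and the three "Example" records are hypothetical and are not real vehicle specs. It also assumes the list has already been narrowed to compact SUVs sold in the United States.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vehicle:
    # Hypothetical record layout; the field names are assumptions for illustration.
    name: str
    fully_electric: bool
    all_wheel_drive: bool
    roof_opens: bool          # an openable moonroof/sunroof, not merely a glass roof
    camera_360: bool
    epa_range_miles: int
    length_inches: float


def meets_criteria(v: Vehicle) -> bool:
    """Return True only if every requirement in the prompt is satisfied."""
    return (
        v.fully_electric
        and v.all_wheel_drive
        and v.roof_opens
        and v.camera_360
        and v.epa_range_miles >= 250
        and v.length_inches <= 180
    )


# Made-up placeholder data, NOT real vehicle specifications.
candidates = [
    Vehicle("Example A", True, True, True, True, 260, 175.0),
    Vehicle("Example B", True, True, False, True, 300, 178.0),  # fixed glass roof
    Vehicle("Example C", True, True, True, True, 280, 187.0),   # too long
]

matches = [v.name for v in candidates if meets_criteria(v)]
print(matches)
```

Every condition is a hard requirement, so a vehicle that misses any one of them (a roof that doesn't open, a length over 180 inches) is excluded no matter how well it does on the rest.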

My assistants were the unpaid versions of these systems (listed in the order in which I happened to test them):

ChatGPT
Perplexity
Claude
Gemini
Copilot
You.com
Mistral

The results were eye-opening. None of the systems listed the only vehicle that fulfills the criteria (the Volvo EX40), and all but one listed vehicles that violate the requirements. Worse performance is hard to imagine. The false positives waste your time pursuing dead ends, while the false negatives imply that no qualified EVs exist, even though one does.

Complete failure was averted by one system (You.com) mentioning, almost as an afterthought, the Volvo XC40 Recharge. That car was renamed the EX40 last year, but searching for the old name will quickly lead you to the new name, and that will finally put you on the trail of the only car that satisfies my criteria.

[Update 5/28/25: Per the comments on this post, paid versions of at least ChatGPT and Claude produce much better results than the ones I experienced. I've added links to the conversations I had with the various unpaid chatbots in discussions below.]

The chatbots failed in a variety of ways (the links are to my conversations with the chatbots):

ChatGPT said "here are the models that meet all requirements," then listed five EVs and their specs. For four of the five, the displayed specs were contrary to the requirements, meaning ChatGPT "knew" (to the extent that LLMs "know" things) that these cars shouldn't have been listed. For the fifth car, one of the specs it listed was simply incorrect. There was no mention of the Volvo EX40.
Perplexity's behavior was similar to ChatGPT's: it claimed to list cars fulfilling the criteria, then "knowingly" listed ones that don't. The twist was that two of the three EVs Perplexity listed--the Rivian R2 and the Toyota C-HR EV--don't exist yet. (If they did, the Rivian would exceed the length constraint, and it looks like the Toyota would likely fail the openable-roof test.) Perplexity made no mention of the Volvo EX40.
Claude listed only one car, Tesla's Model Y, saying it "clearly meets all your requirements. It offers ... a panoramic glass roof ... and measures approximately 187 inches in length. However, this exceeds your 180-inch length requirement." Its dithering on the car's length was disappointing, and its failure to distinguish a panoramic glass roof from one that opens was worse. There was no mention of the Volvo EX40.
Gemini's response began with "Here's a breakdown of current and upcoming electric compact SUVs and how they stack up against your requirements," which was not what I had asked for. It listed five cars and their specs, noting for each vehicle the specs that violated my criteria. It ultimately concluded, "there may not be any currently available fully electric compact SUVs that precisely meet all conditions." Except there is, of course.
Copilot produced a refreshingly short response featuring a very nicely formatted table that summarized the two EVs it said satisfied my requirements. Neither does, though Copilot showed no specs that reflect that, so it may not have "known" it was wrong. There was no mention of the Volvo EX40.
You.com started with this rather confusing statement: "Based on the search results provided, there is no direct information listing fully electric compact SUVs in the United States that meet [your] criteria." It then launched into an explanation of how to perform my own search (<eyeroll/>). Then came a surprise. It introduced the Volvo XC40 Recharge and showed how it satisfied my requirements, though it seemed unsure of itself: "The Volvo XC40 Recharge appears to meet all your criteria. However, I recommend verifying [everything]." Ultimately, You.com found the rabbit in the hat and pulled it out, but its response was confusing and disjointed, and it referred to the rabbit by an obsolete name.
Mistral followed Copilot's lead in producing a short, clear response built around a well-formatted table of information that was often incorrect or inconsistent with my requirements. There was no mention of the Volvo EX40.

As a group, the systems produced responses rife with claims that were incorrect, inconsistent, and/or incomplete. The last of these is the most disturbing. Six of the seven systems didn't mention the only SUV fulfilling the requirements. The one that did hid it after an explanation of how to do your own search, and even then it referred to that car by an outdated name.
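The false positives and false negatives can be tallied with plain set arithmetic. The following is a minimal sketch, not a full evaluation: `score`, `QUALIFYING`, and `ALIASES` are names invented here, and only two of the answers described above are encoded (Claude's single listing and You.com's XC40 Recharge), since they are the ones where this post names everything that was listed.

```python
# Ground truth per this post: exactly one vehicle satisfies the criteria.
QUALIFYING = {"Volvo EX40"}

# Outdated model names mapped to current ones (the XC40 Recharge became the EX40).
ALIASES = {"Volvo XC40 Recharge": "Volvo EX40"}


def score(listed):
    """Compare a chatbot's listed vehicles against the qualifying set."""
    normalized = {ALIASES.get(name, name) for name in listed}
    return {
        "false_positives": normalized - QUALIFYING,  # listed, but not qualifying
        "false_negatives": QUALIFYING - normalized,  # qualifying, but not listed
    }


claude = score({"Tesla Model Y"})
you_com = score({"Volvo XC40 Recharge"})

print(claude)
print(you_com)
```

Without the alias mapping, You.com's answer would count as both a false positive and a false negative, which is why the obsolete name matters when judging its response.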

There is a lot of work to be done before LLM chatbots are reliable search assistants.

7 comments:

Anonymous said...: https://chatgpt.com/share/6836cec8-4984-800c-9002-020f932f7b1e; May 28, 2025 at 1:52 AM
Anonymous said...: The link above is the response of the ChatGPT o3 model; May 28, 2025 at 3:46 AM
Scott Meyers said...: That's very interesting. I don't know which ChatGPT model I was using, but my chat looked quite different: https://chatgpt.com/share/6837232f-08b4-8009-8483-9feaab2b02d0; May 28, 2025 at 7:56 AM
Anonymous said...: Most probably it's 4o, I think that o3 is only available in the paid plan.
I have also checked that Claude Opus 4 (also in their paid plan) gives a correct answer.
These are the so-called thinking or reasoning models. They are slower, more token hungry and expensive to run, but they produce better results.; May 28, 2025 at 11:07 AM
Scott Meyers said...: That's really useful information, thank you. It looks like LLM-based searches aren't as dismal as I thought, provided you are willing to pay for them.; May 28, 2025 at 12:14 PM
Anonymous said...: https://chat.deepseek.com/a/chat/s/8d3c6d71-46bc-45f8-bb30-9ead6faaa1fb
this is the response of the deepseek-V3 (non-thinking).
Still no EX40, but some 'Volvo EX30' appeared; June 25, 2025 at 6:41 AM
Scott Meyers said...: Thanks for checking deepseek. I wasn't able to view your chat transcript without logging in, and deepseek won't allow me to register using an email address at my domain. Sigh.

The Volvo EX30 lacks an openable moonroof or sunroof, so it doesn't satisfy my criteria. This suggests that the deepseek version you tested doesn't do any better than the LLMs I checked. More evidence that there is a lot of work to be done before LLMs are reliable search agents.; June 25, 2025 at 1:26 PM
