Saturday, July 20, 2024

Anthropic's Claude Aces my German Grammar Checker Tests

Yesterday I published the results of my latest testing of German grammar checking systems. Unlike my first round of tests, I included LLMs, in particular ChatGPT, Gemini, and Copilot. I had a nagging feeling that I should include Anthropic's Claude, too, but I didn't find out about Claude until near the end of my testing, and I wanted to be finished, so I decided to worry about Claude later. This was a terrible decision. Later turned out to be only a few hours after I'd published the article, when I unexpectedly found myself with enough free time to play around with the system. Claude proceeded to not only outperform every other system I'd tested, it aced my set of tests with a perfect score

My test set is hardly exhaustive, but no other system has managed to find and correct all 21 errors in the set. Kudos to Claude and Anthropic!

Here are the updated results of my testing after the addition of Claude to the list:


Friday, July 19, 2024

German Grammar Checkers Revisited

Update: I have since added Anthropic's Claude to my tested systems, and it outperformed every system I discuss below. Details here.

 A couple of months ago, I blogged about how I'd (superficially) tested a number of grammar checking tools for German. A comment from jbridge introduced me to the idea of using ChatGPT as a grammar checker, and in the course of exploring that option, I expanded my set of tests to make it a little less superficial. That led to new insights, so it seems like it's time for a German grammar checking tool follow-up.

If you're not familiar with my original post, I suggest you read it.

As before, I'm testing only free tools. I generally test without signing up for or logging into any accounts, but for ChatGPT, I created a free account and logged in so I'd have access to the more powerful ChatGPT 4o rather than the no-account-required ChatGPT 3.5.

Update on the Tools I Looked At Last Time

I remarked last time that Scribbr and QuillBot are sister companies using the same underlying technology, but Scribbr found more errors. That is no longer the case. In my most recent testing, they produce identical results, so we can speak of Scribbr/QuillBot as a single system. Unfortunately, the way this uniformity was achieved was by bringing Scribbr down to QuillBot's level rather than moving QuillBot up to Scribbr's. Even so, Scribbr/QuillBot remains the second-best grammar-checking tool I tested (after LanguageTool). It continues to find errors that LanguageTool misses, so my default policy remains to use both.

My prior test set consisted of six individual sentences. This time around, I added a letter I had written. It's about 870 words long (a little under two pages), and, as I found out to my chagrin, contains a variety of grammatical errors. Checking such a letter is more representative of how grammar checkers are typically employed. Just as we normally spell-check documents instead of single sentences, grammar checkers are usually applied to paragraphs or more.

That had an immediate effect on my view of DeepL Write. I noted in my original review that it's really a text-rewrite tool rather than a grammar checker. On single sentences, you can use it to find grammatical errors, but when I gave it a block of text, it typically got rid of my mistakes by rephrasing things to the point where the word choices I'd made had been eliminated. I no longer consider it reasonable to view DeepL Write as a grammar-checking tool. It's still useful, and I still use it; I just don't use it to look for mistakes in my grammar.

Adding the letter to my set of tests also led me to remove TextGears and Duden Mentor from consideration, because they both have a 500-character input limit. My letter runs more than ten times that, about 5300 characters. Eliminating these systems is little loss, because, as I noted in my original post, GermanCorrector produces the same results as TextGears, but it doesn't have the input limit. As for Duden Mentor, it flags errors, but it doesn't offer ways to fix them. I find this irritating. Using it is like working with someone who, when you ask if they know what time it is, says "Yes."

These changes don't really matter, because when I submitted my augmented test set (i.e., sentences from last time plus the new letter) to the systems from my original post, LanguageTool and Scribbr/QuillBot so far outperformed everybody else, I can't think of a reason not to use them. I'll provide numbers later, but first we need to talk about LLMs.

ChatGPT and other LLMs

In his comment on my original blog post, jbridge pointed out that ChatGPT 4o found and fixed all the errors in my test sentences. That was better than any of the systems I'd tested. Microsoft's LLM, Copilot, produced equally unblemished results. Google's Gemini, however, found and fixed mistakes in only four of the six sentences.

I asked all the systems (both conventional and LLM) to take a look at my letter. ChatGPT found and fixed 13 of 15 errors, the best performance of the group. LanguageTool also found 13 of the 15 mistakes, but ChatGPT fixed all 13, while LanguageTool proposed correct fixes for only 11.  

Given ChatGPT's stellar performance, I was excited to see what Copilot and Gemini could do. Copilot kept up with ChatGPT until it quit--which was when it hit its 2000-character output limit.  This was less than halfway through the letter, so Copilot found and fixed only four of the 15 errors. 

Gemini's output was truncated at about 4600 characters, which is better than Copilot, but still less than the full length of the text. Gemini was able to find and fix eight of the 15 errors. This is notably fewer than ChatGPT, and not just because Gemini quit too early. In the text that Gemini processed, it missed three mistakes that ChatGPT caught. 

I had expected that the hype surrounding ChatGPT was mostly hype and that the performance of ChatGPT, Copilot, and Gemini would be more or less equivalent. If their performance on my letter is any indication, ChatGPT is currently much better than the offerings from Microsoft and Google when it comes to checking German texts for grammatical errors. It also does better than LanguageTool, the best-performing non-LLM system, as well as the combination of LanguageTool and Scribbr/QuillBot together. It's really quite impressive.

However, just because it's better doesn't mean it's preferable. Read on.

Learning from LLMs

My interest in grammar checkers is two-fold. Sure, I want to eliminate errors in my writing (e.g., Email and text messages), thus sparing the people who receive it some of the kinks in my non-native German, but I also want to learn to make fewer mistakes. Knowing what I'm doing wrong and how to fix it will help me get there. At least I hope it will.

Traditional (non-LLM) grammar checkers are good at highlighting what's wrong and their suggestion(s) on how to fix it. This is what LanguageTool looks like on the second of my single-sentence tests. The problematic text is highlighted, and when you click on it, a box pops up with a suggested fix:

The other grammar checkers work in essentially the same way.

LLM systems are different. They can do all kinds of things, so if you want them to check a text for German grammar errors, you have to tell them that. This is the prompt I used before giving them the letter to check:

In the following text, correct the grammatical errors and highlight the changes you make.

For the most part, they did as they were told, but only for the most part. Sometimes they made changes without highlighting them. When that was the case, it was difficult for me to identify what they'd changed. It's hard to learn from a system that changes things without telling you, and it's annoying to learn from one that disregards the instructions you give it. But that's not the real problem.

The real problem is that LLMs can be extraordinarily good at doing something (e.g., correcting German grammar errors) and unimaginably bad at explaining what they've done. Here's ChatGPT's explanation of how it corrected the last of my single-sentence tests. (Don't worry if you don't remember the sentence, because it doesn't matter).

"die gute Flugverbindungen" is incorrect, as "Flugverbindungen" is plural and therefore "die" must be used, but it must also be "gute" instead of "gute".

That's nonsense on its own, but it also bears no relation to the change ChatGPT made to my text. Explanations from Gemini and Copilot were often in about the same league.

Lousy explanations may be the real problem, but they're not the whole problem. My experience with LLMs is that they complement their inability to explain what they're doing with a penchant for irreproducability. The nonsensical ChatGPT explanation above is what I got one time I had it check my test sentences, but I ran the tests more than once. On one occasion, I got this:

"Ort" is masculine, so the relative pronoun should be "der" to match the previous clause and maintain grammatical consistency.

This is absolutely correct. It's also completely different from ChatGPT's earlier explanation of the same change to the same sentence. 

Copilot upped the inconsistency ante by dithering over the correctness of my single-sentence tests. The first time I had it look at them, it found errors in all six sentences. When I repeated the test some time later, it decreed that one sentence was correct as is. A while after that, Copilot was back to seeing mistakes in every sentence.

Using today's LLM systems to improve your German is like working with a skilled tutor with unpredictable mood swings. When they're good, they're very, very good, but when they're bad, they're awful.

I view the instability in LLM behavior as a sign of hope. The systems are evolving, and I'm confident that, over time, they'll get better and better. ChatGPT already finds and fixes more errors in my test set than any other system (see below). Its descriptions of what it's doing and why are sometimes delusional, but I have faith that its developers will get that under control. I have similar faith that other LLM systems will also improve, offering better results and increased consistency. For now, I'm sticking with LanguageTool and Scribbr/Quillbot, but I have no doubt that LLM technology will assume an increasingly important role in language learning. (LanguageTool says it's "AI-based," so for all I know, it already uses an LLM in some way.)

Systems and Scores

Update: As noted above, I have added Anthropic's Claude to my tested systems, and it outperformed everything in the list below. Details here

I submitted the six single-sentence tests from my first post plus the new approximately-two-page letter to the grammar checkers in my first post as well as to ChatGPT, Gemini, and Copilot. As noted above, I disregarded the results from DeepL Write, TextGears, and Duden Mentor. I did the same for Studi-Kompass, which I was again unable to coax any error-reporting out of. I scored the remaining systems as in my original post, awarding up to four points for each error in the test set. A total of 84 points was possible: 24 from the single-sentence tests and 60 from the letter. These are the results: