Uncovering ChatGPT Usage In Academic Papers Through Excess Vocabulary

June 22, 2024

Frequencies of PubMed abstracts containing certain words. Black lines show counterfactual extrapolations from 2021–22 to 2023–24. The first six words are affected by ChatGPT; the last three relate to major events that influenced scientific writing and are shown for comparison. (Credit: Kobak et al., 2024)

That students these days love to use ChatGPT for assistance with reports and other writing tasks is hardly a secret, but in academics it’s becoming ever more prevalent as well. This raises the question of whether ChatGPT-assisted academic writings can be distinguished somehow. According to [Dmitry Kobak] and colleagues this is the case, with a strong sign of ChatGPT use being the presence of a lot of flowery excess vocabulary in the text. As detailed in their prepublication paper, the frequency of certain style words is a remarkable change in the used vocabulary of the published works examined.

For their study they looked at over 14 million biomedical abstracts from 2010 to 2024 obtained via PubMed. These abstracts were then analyzed for word usage and frequency, which shows both natural increases in word frequency (e.g. from the SARS-CoV-2 pandemic and Ebola outbreak), as well as massive spikes in excess vocabulary that coincide with the public availability of ChatGPT and similar LLM-based tools.

In total 774 unique excess words were annotated. Here ‘excess’ means ‘outside of the norm’, following the pattern of ‘excess mortality’ where mortality during one period noticeably deviates from patterns established during previous periods. In this regard the bump in words like respiratory are logical, but the surge in style words like intricate and notably would seem to be due to LLMs having a penchant for such flowery, overly dramatized language.

The researchers have made the analysis code available for those interested in giving it a try on another corpus. The main author also addressed the question of whether ChatGPT might be influencing people to write more like an LLM. At this point it’s still an open question of whether people would be more inclined to use ChatGPT-like vocabulary or actively seek to avoid sounding like an LLM.

23 thoughts on “Uncovering ChatGPT Usage In Academic Papers Through Excess Vocabulary”

Dan says:

June 22, 2024 at 3:10 pm

Interesting! I wonder if this can even distinguish different LLMs?

Report comment

Reply
Steven-X says:

June 22, 2024 at 3:18 pm

So far I’ve seen using Chat GPT to obtain “outlines” to ensure I dont miss anything.

Report comment

Reply
1. TG says:
  
  June 22, 2024 at 9:24 pm
  
  Yeah, it’s pretty good for reading mountains of text very fast and giving you cliff notes. Not so great for writing things
  
  Report comment
  
  Reply
  1. Dude says:
    
    June 23, 2024 at 10:58 am
    
    I find it often misinterprets or makes up stuff that wasn’t in the text, or leaves important parts out, and the summaries are sometimes vague to the point of uselessness.
    
    Report comment
    
    Reply
    1. adobeflashhater again says:
      
      June 23, 2024 at 1:08 pm
      
      Sounds a lot like the majority of the news readers on any network these days.
      I’ve watched one local channels staffers stumble over the same script error for two days.
      
      Report comment
      
      Reply
      1. Dude says:
        
        June 23, 2024 at 2:18 pm
        
        A great deal of news articles around the web were already written by AI scripts prior to Chat-GPT becoming a thing.
        
        It’s mostly about swiping articles form reputable news sites and then re-arranging them to appear like original text.
        
        Report comment
HaHa says:

June 22, 2024 at 3:23 pm

Word tells aren’t new.

Anybody who uses ‘hullabaloo’ regularly learned English by listening to Pravda’s English language propaganda.

Also anybody who types or says ‘late stage capitalism’ is an indoctrinated credulous moron repeating derp!

Report comment

Reply
1. BT says:
  
  June 22, 2024 at 3:42 pm
  
  Late stage capitalism
  
  Report comment
  
  Reply
2. . says:
  
  June 22, 2024 at 3:53 pm
  
  Nyet
  
  Report comment
  
  Reply
3. MmmDee says:
  
  June 22, 2024 at 6:46 pm
  
  Or they went to Texas A&M University where hullabaloo is literally the start of their war hymn:
  
  Hullabaloo, Caneck! Caneck!
  Hullabaloo, Caneck! Caneck!
  
  Report comment
  
  Reply
4. Sok Puppette says:
  
  June 22, 2024 at 6:46 pm
  
  In these final, frenzied days of late stage capitalism, I’m happy to be reminded of the useful word “hullabaloo”, which I shall try to remember to use use to characterize the derp I hear so very often.
  
  Report comment
  
  Reply
Hirudinea says:

June 22, 2024 at 3:54 pm

Well with the analysis code available people will just know how to cheat better, but consider they’re using ChatGPT to cheat in the first case, most are probably to lazy to go to tall that work.

Report comment

Reply
Ostracus says:

June 22, 2024 at 5:56 pm

“According to [Dmitry Kobak] and colleagues this is the case, with a strong sign of ChatGPT use being the presence of a lot of flowery excess vocabulary in the text.”

Like a love letter to the grant giver.

Report comment

Reply
TG says:

June 22, 2024 at 9:25 pm

The real paperclip maximizer was in academia the whole time!

Report comment

Reply
1. Me says:
  
  June 23, 2024 at 1:38 am
  
  Im curious why you’d associate existential risk to a glorified Redditor. Of course excessive flowery language is present. Look at the foundation
  
  Report comment
  
  Reply
2. bwmetz says:
  
  June 23, 2024 at 8:06 am
  
  Their research results show a lot of potential but it’s crucial that they delve into less technical areas of academia, e.g. fine and literary arts, before we accept their methodology. Seeing similar patterns in these, or any area which encourages the use of flowery language, will be crucial to proving whether their method is universally applicable or must be tuned to each one. Given word usage changes over time and the outsized influence of the culture, school curriculum and like the persons under which one studies, shadows or is trained has on one’s own language, I think they have a significant challenge to overcome with a broader dataset.
  
  No, this was not written using any form of AI but rather NI or PI, i.e. natural or personal. Speaking…in the AI world we find ourselves in of late, does this mean that “common sense” will soon be known as CI?
  
  Report comment
  
  Reply
Perry says:

June 23, 2024 at 8:16 am

What happens to outliers who just happen to have a large vocabulary? Do they start getting accused of using AI even when they wrote the thing themselves?

Report comment

Reply
1. bwmetz says:
  
  June 23, 2024 at 8:35 am
  
  That was the thought that originally sparked my comments above. Not sure why they appear as a reply to “Me” when I didn’t do that.
  
  Report comment
  
  Reply
2. spaceminions says:
  
  June 25, 2024 at 6:42 am
  
  Most likely. Anyone who opened many books as a kid has been accused of sounding like a dictionary/thesaurus; accusing a studious academic type of sounding like a robot seems like a no-brainer. Which is fortunate for the accusers, as that leaves them a surplus of one-half.
  
  Report comment
  
  Reply
r00sty says:

June 23, 2024 at 8:40 am

As an individual who has always had an excessive vocabulary and has always enjoyed using it, I feel targeted.

Report comment

Reply
Greg A says:

June 23, 2024 at 8:40 am

the thing is, we’re not so different from chatgpt. we will radically restructure our language usage in the presence of a new pattern of speaking. people have made a big deal about the fact that all future AI systems will be trained on data in part generated by AI. but natural intelligence also will be trained on data in part generated by AI.

at the margins, some people are surely avoiding chatgpt directly but nonetheless changing their vocabulary because of it.

Report comment

Reply
Thinkerer says:

June 24, 2024 at 6:20 pm

Great, just great. I’m sure there will be punishment for those of us who indulge in ebullient persiflage.

Report comment

Reply
Emery Batchellor says:

August 23, 2024 at 4:43 pm

Hi, just a question: What’s the best way to optimize for voice search?

Report comment

Reply