2023-11-22
This is a fun one.
I’ve spent more time than I’d care to admit staring at LLM output. And there’s something that I’ve noticed: LLM-generated prose has a kind of… vibe. It’s difficult to describe, but in this initial era of LLMs, it tends to be fairly obvious when you’re reading an AI-generated piece of prose.
One giveaway I've noticed is this particular turn of phrase:
“Culture is a complex and multifaceted ...”
“Intelligence is complex and multifaceted ...”
“Technology is a complex and multifaceted ...”
In the true Dawkinsian sense, the phrase 'complex and multifaceted' has become a meme. I've seen it again and again in outputs from GPT, but to double-check, I did a bunch of GPT-3.5 generations (code here). Here's what I found when generating completions for a prompt of 'complex and ...':
There is a bizarre prevalence of the term 'multifaceted' specifically. Why?
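For the curious, a tally like this can be sketched in a few lines of Python. This is not the code linked above, just a minimal illustration: the function name `tally_followers` is made up for this example, and the three quoted sentences from earlier in the post stand in for real GPT-3.5 generations.

```python
import re
from collections import Counter

# Matches "complex and <word>" and captures the word that follows.
PATTERN = re.compile(r"\bcomplex and (\w+)", re.IGNORECASE)


def tally_followers(texts):
    """Count which word follows 'complex and' across a list of texts."""
    counts = Counter()
    for text in texts:
        for match in PATTERN.finditer(text):
            counts[match.group(1).lower()] += 1
    return counts


# Stand-in data: the three example sentences quoted earlier in this post.
# Real input would be the list of model completions.
samples = [
    "Culture is a complex and multifaceted ...",
    "Intelligence is complex and multifaceted ...",
    "Technology is a complex and multifaceted ...",
]

counts = tally_followers(samples)
for word, n in counts.most_common():
    print(f"{word}: {n}")
```

With a few hundred real completions in `samples`, `counts.most_common()` gives the ranking of follow-on words directly.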
I wanted to understand whether this phrase and the specific word 'multifaceted' were newly popular or had existed for a while. As a first port of call, I had a look at Google Trends. And I observed a rather shocking increase within the last year:
At this point I wanted to get an indication of whether this was an online-only trend. It's hard to establish this but I thought I'd try Google Books' N-gram viewer. Maybe it would show me. And, as suspected, we see no notable inflection, although one can see there's a gentle increase over time.
Tangent: For what it's worth, I find it a bit of a weird phrase. It's a tautology, as 'complex' and 'multifaceted' are almost synonymous. It reminds me of legal doublets like 'null and void' and 'cease and desist'. It's a rather nice and affirmatory way of saying something. I guess it sounds clever and informed, which is, after all, the vibe LLMs are going for.
Anyway, I wanted to go a bit further in order to ensure this was actually a newly prevalent phrase online. Google Trends isn't very convincing by itself. So I went digging for other places where linguistic trends over time might be queryable. I discovered that the web archive helpfully retains various PDFs over the years, ranging from whitepapers to general reference material from across the web. It allows you to search for specific keywords as well.
I carried out a bunch of searches from 2006 to 2022 for the phrase itself, as well as the word 'multifaceted'. Oh, and I was also interested in another viral word I'd spotted: 'intricate'. To ensure some level of scientific prudence, I compared these words with other terms as experimental controls.
As we see, from 2021 onwards, just around the time when GPT and other LLMs started to take the world by storm, the prevalence of our word 'multifaceted' increased significantly, from being in only 0.05% of PDFs to 0.23%.
Now, to zoom out a bit. I discovered the entire phrase, 'a complex and multifaceted', exists in around 800,000 places online.
If we narrow this down by domain, we see that a few particular sites are well ahead of the others:
Quora.com: 48,000
LinkedIn.com: 30,700
Facebook.com: 9,500
Instagram.com: 7,330
Medium.com: 6,250
Reddit.com: 1,370
CourseHero.com: 7,340
jstor.org: 1,320
wikipedia.org: 400
twitter.com: 798
classace.io: 842 (*notably an essay bank*)
chegg.com: 930 (*notably an essay bank*)
Quora has 5.7% of all occurrences online! If it isn't the birthplace of this meme, it is definitely its breeding ground.
N.B. FWIW we can see what proportion Quora ~should be taking up, all things being equal. An arbitrary word like "systemic" appears 445 million times online, yet only 272,000 times on Quora. That's 0.06% of all occurrences. So Quora's 5.7% share of our meme-phrase is completely disproportionate. Are we even surprised? Quora does have a reputation for its spam-bots. They are, at this point, mere regurgitation machines:
I also couldn't ignore the fact that Quora has lately been embedding a ChatGPT widget on almost every page, and this widget's content is pre-generated, static and available for crawling. It is thus liable to being used as additional training material for this and other LLMs.
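As a quick sanity check on the Quora share comparison above, here is the arithmetic as a small Python sketch. The hit counts are the ones quoted in this post; the helper name `share_percent` is just for illustration, and the 5.7% figure is taken as stated rather than recomputed, since the "around 800,000" total is rounded.

```python
def share_percent(site_hits, web_hits):
    """Return the percentage of all web hits that come from one site."""
    return 100.0 * site_hits / web_hits


# Baseline: the arbitrary control word "systemic".
baseline = share_percent(272_000, 445_000_000)

# Quora's share of 'a complex and multifaceted', as stated in the post.
phrase_share = 5.7

# How many times larger the phrase's share is than the baseline share.
over_representation = phrase_share / baseline

print(f"baseline share of 'systemic' on Quora: {baseline:.3f}%")
print(f"over-representation of the phrase: {over_representation:.0f}x")
```

In other words, the phrase shows up on Quora at something like ninety times the rate you would expect from the control word, with the usual caveat that search engine hit counts are rough estimates.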
ChatGPT specifically seems to absolutely adore the phrase, using it at every opportunity to explain higher-level concepts. The most prevalent pattern seems to be '[noun] is a complex and multifaceted [concept|theory|process]'. Some common ones and their relative quantities across Quora:
4590
4420
3550
2230
1650
1560
(these values vary across locales)
If we pick one of these and do a general search across the web, once again we observe incredibly sharp increases across time. The phrase 'a complex and multifaceted phenomenon' has 74,900 occurrences across the web according to Google. However, only 73 of those date from before 2010. That's a 1000x increase in only 13 years.
You get the idea. ChatGPT has taken this meme and rolled with it. This silly LLM has assumed the phrase to be a core part of our language when it was only ever a narrowly used and awkward turn of phrase.
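For what it's worth, the growth figure above works out like this. It's a back-of-the-envelope sketch using the two Google hit counts quoted in this post, and the implied yearly rate assumes smooth compound growth across the 13 years, which is a simplification on my part rather than anything the data shows.

```python
total_hits = 74_900       # occurrences across the web today, per Google
hits_before_2010 = 73     # occurrences dated prior to 2010
years = 13                # 2010 to 2023

# Overall fold increase between the two counts.
fold_increase = total_hits / hits_before_2010

# Yearly growth factor that would produce this increase if compounded evenly.
annual_factor = fold_increase ** (1 / years)

print(f"fold increase: {fold_increase:.0f}x")
print(f"implied growth: about {100 * (annual_factor - 1):.0f}% per year")
```

The fold increase lands a little over 1000x, which matches the rounded figure above.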
What's the conclusion to this absurd rabbit hole? Have we learned anything?
We know that initial versions of GPT were trained quite significantly on Reddit, and it's probably also the case that a small selection of other websites have been used since then to build and bolster additional models.
Focusing the training on any particular website will lead to strong biases, for example by fixating too much on academic material or on websites like Quora, where bots formulaically re-use certain phrases (this occurred even in the era before LLMs).
Furthermore, since these models have taken off in popularity, people have been publishing their outputs back onto the internet. This has likely produced a feedback loop: LLMs are unknowingly training on their own regurgitated outputs. It's unavoidable.
So, by those very tiny initial training decisions, just a handful of engineers have begun an unstoppable chain of incestuous linguistic evolution. It is fascinating how powerful these models are becoming in shifting the nature of language itself.
Thank you for reading! I hope you found it interesting. If you want, you can read more of my posts here or find out more about me here.