The rise of ChatGPT has led to a proliferation of people and organizations scrambling to use it in their products. OpenAI, a company widely known for its lack of transparency and famous disengagement from the scientific literature, released the system to the public in November 2022. Following a series of massive generative models built on prompting-as-retrieval from a large database, ChatGPT extends this line of work with a public-facing API and an ordinary web page that lets you query it with human-interpretable prompts.
The potential relevance of ChatGPT for psycholinguistics, as an instance of a prompting model trained on very large datasets of web text and transcriptions, is its ability to generate templates. I know folks who have used it for making `nginx` templates for hosting websites, as well as `yaml` specs for building machine learning pipelines. Naturally, folks have started to ask, “Can I get ChatGPT to make stimuli for me?” Since we as cognitive scientists build stimuli that are meant to be relatively natural but still grammatically constrained, ChatGPT might seem like an effective tool for creating such sentences. No more need to train pesky RAs who need to first learn to recognize tricky syntactic structures! No need to query norm datasets containing word counts or concreteness ratings!
There are a few things that are noteworthy about ChatGPT as a commercial product that, in my opinion, limit its effectiveness as a stimulus generation tool:
Replicability
Transparency
Non-stationary model parameters and ruleset
Computational cost
Labor and prompting as a skill that must be learned
Longevity
Data privacy and corporate profit
In all cases, alternatives to a ChatGPT-centered stimulus generation paradigm exist, so researchers can still take advantage of the latest technologies and make their workflows more efficient (if efficiency is important to you, of course).
Let us consider these points in turn:
Replicability
Within the field of psychology, replication has come to play an increasingly large role in establishing whether claims about cognition, behavior, or other mental states are justified by the data we collect. In computational linguistics, natural language processing, and machine learning more broadly, these challenges have largely been left by the wayside, though some folks have covered the need for power analyses and greater use of inferential statistics to justify calling something the new “state-of-the-art.”
With ChatGPT, the replicability question is altogether different, in that it is not always possible to get the service to give you the same answer. Randomness is built into the architecture of generative models to prevent them from always producing the highest-probability linguistic sequence. If this system works similarly to other models, an initial state is selected and beam search (like a spotlight or a flashlight in a dark, black-box woods) is conducted over possible “chains” of responses. Typically, a model then selects the continuation or response that has the highest probability. With ChatGPT, however, the items considered within the beam are probably *sampled* from the distribution at each point in time. One of the challenges for researchers aiming to generate many stimuli for an experiment (or follow-up experiments) that obey a few different constraints is that the prompt they worked hard to create may not always lead to the same types of stimuli, even when nothing has changed about the prompt.
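To make the contrast concrete, here is a minimal sketch using the openly downloadable GPT-2 via the Hugging Face `transformers` library (my choice of tooling; ChatGPT exposes none of this) of why sampled decoding is not replicable run to run, while greedy decoding is:

```python
# A minimal sketch with GPT-2 (via Hugging Face transformers) contrasting
# deterministic greedy decoding with non-replicable sampled decoding.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "The doctor gave the stethoscope to"
inputs = tokenizer(prompt, return_tensors="pt")

# Greedy decoding: always pick the single most probable next token.
# Running this twice gives identical output.
greedy = model.generate(**inputs, max_new_tokens=10, do_sample=False)

# Sampled decoding: draw each next token from the (temperature-scaled)
# distribution. Running this twice will usually give different outputs
# unless you control the random seed yourself, which you cannot do with
# a hosted service.
torch.manual_seed(42)  # reproducible only because we set the seed locally
sampled = model.generate(**inputs, max_new_tokens=10,
                         do_sample=True, temperature=0.9, top_p=0.95)

print(tokenizer.decode(greedy[0], skip_special_tokens=True))
print(tokenizer.decode(sampled[0], skip_special_tokens=True))
```

With a local model you at least get to choose the decoding strategy and fix the seed; with ChatGPT, those knobs are hidden behind the web page.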
Transparency
With this in mind, you have no direct control over the generations produced by ChatGPT. This becomes even clearer when we want to generate outputs under additional constraints. Unlike GPT-whatever’s predecessors, you cannot inspect the working components. That means no embeddings, no logits, no indices or ghosts of the representations being calculated. While we assume that ChatGPT behaves like its large language model predecessors and contemporaries (e.g., GPT-3, T5, etc.), we cannot know what computations it is doing at any one point in time, and we are never able to retrace its steps. This is especially fraught for folks who use measures like surprisal to understand the predictability of a word in context. With smaller (but nevertheless massive compared to old standards in computational psycholinguistics) models like GPT-2, it was possible to generate many possible continuations at each time step and arrive at a proxy for, say, cloze probabilities. Those days are long gone in a prompt-centered world.
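For contrast, here is a rough sketch, again assuming GPT-2 and the Hugging Face `transformers` library, of the kind of per-token surprisal computation that is trivial when you have the logits in hand and impossible through ChatGPT’s chat window:

```python
# A sketch of computing token-level surprisal (-log2 p) with GPT-2,
# the sort of quantity ChatGPT does not expose because its logits are hidden.
import torch
import torch.nn.functional as F
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

sentence = "The doctor gave the stethoscope to the nurse."
ids = tokenizer(sentence, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(ids).logits  # shape: (1, seq_len, vocab_size)

# The logits at position t predict the token at position t+1.
log_probs = F.log_softmax(logits[0, :-1], dim=-1)
target_ids = ids[0, 1:]
surprisal = -log_probs[torch.arange(target_ids.size(0)), target_ids]
surprisal = surprisal / torch.log(torch.tensor(2.0))  # nats -> bits

for tok, s in zip(tokenizer.convert_ids_to_tokens(target_ids.tolist()), surprisal):
    print(f"{tok:>12s}  {s.item():6.2f} bits")
```

Note that these are subword-token surprisals; mapping them onto words (or regions of interest) is your own bookkeeping, but at least the numbers are inspectable and stable for a fixed model checkpoint.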
OpenAI has also made it very clear that the human is doing a lot for their human-in-the-loop reinforcement learning procedure. The model is “trained” by the inputs that we give it, both in the short term (while we are interacting with it) and in the long term (the systems OpenAI releases). There is also a lot hidden under the hood, similar to how image generation models that relied on prompting (e.g., DALL-E) embedded secret rules to make their outputs less racist. While this is a good thing for the end user, this front-end filtering is not disclosed to us and is considered proprietary information. Any number of layers of constraints may be applied to the output that have nothing to do with the generative capacity of the model, and instead are meant to keep it from producing racist, sexist, ableist, and other -ist language. Given OpenAI’s famously secretive nature, it is unlikely that we will ever learn the full scope of these systems for preventing misbehavior, and it is unclear how often these constraints change.
Non-stationary parameters
Even if the model could be inspected, ChatGPT is conceived of as a reinforcement learning model whose parameters are updated approximately weekly. When new versions are released, they are purportedly responses to additional “debugging” driven by the millions of prompts taking place every day, in addition to any new data and raw text that may have been added to its training set. The model does not really have a “test” dataset to be evaluated on, as it is constantly being tested by the users who interact with it. This updating of the parameters exacerbates the concerns about replicability, because the model’s internal representations may drift over time. While this is obviously nice from a business point of view, since incoming data may reflect things worth emphasizing in the representations (geopolitical events, say), it is a mess for researchers who want to probe the internal knowledge state of these systems. ChatGPT simply is not a single monolithic thing, but rather an eternally changing slime mold trying to cross a maze to get food by the shortest path possible.
Another side effect of these large, eternally changing models is that comparing and contrasting the model at two (or more) time points is not feasible for anyone without an outrageous amount of compute. If your goal is to make these massive models interpretable, you may be out of luck simply because you do not have enough memory to store a single model, let alone compare two or more models from different points in time. Snapshotting the model for the average end user would be extremely expensive.
A final point here is that these models are now being used in every domain of the web to power services that automatically generate fake content. Websites are now commonly generated by artificial intelligence systems from queries of a knowledge base. This leads to “contamination” of the training data, such that any future ChatGPT trained on web text either has to determine whether the data are human-generated or model-generated, or not distinguish at all. From the point of view of psycholinguists who care about accurately estimating the statistical or linguistic structure that the participants who come into our labs should know, we are at risk of generating lower-quality content as these models age and become more influenced by the outputs they themselves produce, which will come to populate all areas of the web.
Computational and environmental cost
Which of course brings us to the question of what your work is doing for OpenAI. OpenAI is funded by a number of investors, most prominently Microsoft, whose computers are arrays of precious metals held in heavily refrigerated data centers to keep them from overheating. The cooling typically comes in the form of chilled water. That water can be refrigerated, which requires electricity (on top of the electricity to run the computations), or it comes from snowmelt flowing down from the North Cascades and various large rivers in the Pacific Northwest and elsewhere. Each computation eats up some of this cold water, and it also draws on the water that fuels hydroelectric dams, where water drops from great heights down artificial precipices, leveraging gravity to create “clean” energy. These dams were built on some of the largest rivers in the United States, rivers that used to host massive salmon runs and are now more or less salmon ladder-jumps for those fish that don’t die trying to swim upstream to spawn. Anyway, that water is one of the more renewable sources of electricity in the United States, and OpenAI would have you believe that the electricity is being put to good use. In reality, querying ChatGPT consumes massive amounts of electricity. If you’ve ever felt sick looking at how much electricity Bitcoin uses, OpenAI’s tab is pretty nauseating too.
As a field, NLP has already had a reckoning with the energy costs of creating models like GPT-2. Since then, it has been a constant stream of ever-larger, ever-more-expensive language models, with parameter counts ballooning into the trillions without anyone blinking an eye. It’s a lot to ask of researchers not to use the bleeding-edge black-box models, but if you’re the type of person who turns off the lights when you leave the room, or turns off the tap while you’re lathering soap as you wash your hands, each ChatGPT query you make is electricity that could be used to power someone’s house or public (and private) transit.
Labor
A concrete concern for researchers hoping to use prompting models for psycholinguistics is the labor of training and futzing that comes with figuring out exactly how to ask a model for what you want; the title Prompt Engineer exists now for a reason. ChatGPT can handle reasonably well-structured prompts as inputs to produce highly structured outputs, but learning to engineer and fine-tune prompts for querying large language models like ChatGPT can take many hours of training.
Consider a scenario where a postdoc or faculty member takes 7-8 hours out of their day to prompt ChatGPT to create stimuli for a study. Between fine-tuning the prompt, editing the stimuli to conform to specifications, trying a wide variety of things to get the model to create more stimuli with mixed success, and waiting on the OpenAI servers to generate completions, that time could have been spent giving research assistants more hands-on training up front. Alternately, researchers could spend an equivalent amount of time devising constraints on generations from smaller, more efficient models that can be run and tested locally. There may certainly be labor savings, in that a concentrated task is likely to be more successful: RAs need sleep consolidation while ChatGPT does not, and people are more likely to make the kinds of mistakes that lead to items being eliminated from analyses entirely, errors we often only detect after data have been collected.
But when we complain about training those pesky research assistants who can’t make stimuli like the postdocs and faculty members in the lab, and instead relegate stimulus creation to ChatGPT, are we really doing the right thing for our research assistants, whom we presumably want to learn about language and psycholinguistics? Do we not want (some of) them to know the parameters we use to generate stimuli, and to have some control over those factors themselves? Even if we did eventually want to hand over the reins to our students so they can do the prompting themselves, we would still need young researchers who know and can recognize the parameters that are critical for generating sentence stimuli. Automation risks making researchers less effective in the long run.
Technical debt and model longevity
Prompting as a subfield within natural language processing is only a few years old. Neural language models themselves only rose to mainstream prominence in 2013, with many of the older frameworks (e.g., word2vec, bi-LSTMs, masked language models, etc.) being tossed aside for this new paradigm. But despite the open modeling community’s best efforts, sharing models publicly does not guarantee that they will continue to function, or even be available to end users, after short periods of time. When hardware and software requirements for machine learning models are constantly changing, the chances that a model becomes unusable only a few months after its release are very high. Simply try to download a GitHub repository from November’s EMNLP meeting, or NAACL’s meeting last summer. Maintaining a code base for a single model is a nightmare even at smaller scales, and ChatGPT is already on the way out. The endless pursuit of “technological advances” leaves us with more broken toys than ones that actually work; for those who wish to automate stimulus creation, a stable paradigm is needed.
Data privacy and corporate profit
The final point I wish to make here is that OpenAI is not a nonprofit whose goal is to make your lives better. OpenAI is a business whose model is getting people to run its systems at prices that have been heavily subsidized by venture capitalists. The VC funding racket, as seen in the recent collapse of Silicon Valley Bank, is largely speculative and not grounded in evidence that systems work. Beyond this, OpenAI is subsidized by Microsoft, which immediately integrated ChatGPT into Bing search. These corporations are invested in folks’ dependence on models like ChatGPT and are, in the long run, aiming to crowd out smaller systems that may perform less well on certain tasks. Certainly the approach here is unscientific: the work is not intended for consumption by scientists, it is not a suitable object of scientific inquiry, and, more importantly, the use of this “free” model makes all of us poorer in the knowledge we could have gained about language, statistics, and networks of language communities.
OpenAI stands to profit from you even if you are not giving them money, because your use of the service changes how the service works. Every query you make could feed an update to the system, which could be productionized into the new Bing search process. Bing will make money off ads that better target users querying the search engine, which makes OpenAI and Microsoft richer. It’s a trope, but we are obviously the product being sold in this scenario. This is at odds with the kind of work scientists do: funded by public dollars, for public sharing of scientific knowledge. If we have to query a private system for private profit, we are undermining the principal value of our work as scientists to society.
In summary: I’m tired. Try to use something else.
There are a lot of models that have been released over the years, many of which can feasibly be downloaded to one’s own computer. The template-style approach employed in psycholinguistic stimulus generation (e.g., “The doctor gave the stethoscope to the nurse” or “The doctor the nurse disliked quit earlier last week”) is well-suited to older models, such as masked language models like BERT and RoBERTa. These models allow you to drop a “mask” over any word or sequence of words (or morphemes, or characters, depending on how your language of study works). By applying a mask, you can query the model for what it thinks is the most likely word at that position, and you can even select samples that are not the single most likely option. Of course, a challenge is avoiding stimuli that are too templatic, so you can combine the masked model’s guesses with other models like GPT-2. GPT-2, which is fully downloadable and whose weights have been released publicly, does next-word prediction, which can be used as a proxy for the linguistic predictability of each element in a sentence. More importantly, GPT-2 has generative abilities as well, so you can use it to create highly probable filler material between your words of interest.
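As a rough sketch of that masked-model workflow (assuming the Hugging Face `transformers` library and `bert-base-uncased`; swap in whatever masked model suits your language of study), you can propose candidate words for a templated slot and keep whichever ones satisfy your constraints:

```python
# A sketch of the masked-LM workflow described above: propose candidate
# completions for a templated slot, then filter them against your own criteria.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

template = "The doctor gave the [MASK] to the nurse."
candidates = fill(template, top_k=20)

# Each candidate comes with a model probability, so you are not forced to
# take the single most likely word; you can deliberately pick lower-ranked
# options or cross-check candidates against the frequency and concreteness
# norms you already use.
for c in candidates:
    print(f"{c['token_str']:>12s}  p = {c['score']:.3f}")
```

Everything here runs locally, the model checkpoint can be pinned and archived alongside your stimuli, and the probabilities are the same every time you rerun the script.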
The major selling point of OpenAI’s ChatGPT is its convenience. We have access to a language model that works really well and does pretty sophisticated problem solving. This is interesting and valuable as a tool for conducting one’s work. But the downsides of ChatGPT are specific to it, not intrinsic to language modeling as a whole. Using ChatGPT as part of one’s work undermines open science and reproducibility, and it lacks the flexibility of previous systems that could be manipulated and changed to suit one’s scientific needs. I hope we can move beyond the prompting paradigm that stuffs the pockets of venture capitalists and start to create open, public, probe-able systems that every psycholinguist can use.