Guest Post: It has become possible to use cutting-edge AI language models to generate convincing high school and undergraduate essays. Here’s why that matters

Written by: Julian Koplin & Joshua Hatherley, Monash University

ChatGPT is a variant of the GPT-3 language model developed by OpenAI. It is designed to generate human-like text in response to prompts given by users. As with any language model, ChatGPT is a tool that can be used for a variety of purposes, including academic research and writing. However, it is important to consider the ethical implications of using such a tool in academic contexts. The use of ChatGPT, or other large language models, to generate undergraduate essays raises a number of ethical considerations. One of the most significant concerns is the issue of academic integrity and plagiarism.

One concern is the potential for ChatGPT or similar language models to be used to produce work that is not entirely the product of the person submitting it. If a student were to use ChatGPT to generate significant portions of an academic paper or other written work, it would be considered plagiarism, as they would not be properly crediting the source of the material. Plagiarism is a serious offence in academia, as it undermines the integrity of the research process and can lead to the dissemination of false or misleading information. This is not only dishonest, but it also undermines the fundamental principles of academic scholarship, which is based on original research and ideas.

Another ethical concern is the potential for ChatGPT or other language models to be used to generate work that is not fully understood by the person submitting it. While ChatGPT and other language models can produce high-quality text, they do not have the same level of understanding or critical thinking skills as a human. As such, using ChatGPT or similar tools to generate work without fully understanding and critically evaluating the content could lead to the dissemination of incomplete or incorrect information.

In addition to the issue of academic integrity, the use of ChatGPT to generate essays also raises concerns about the quality of the work that is being submitted. Because ChatGPT is a machine learning model, it is not capable of original thought or critical analysis. It simply generates text based on the input data that it is given. This means that the essays generated by ChatGPT would likely be shallow and lacking in substance, and they would not accurately reflect the knowledge and understanding of the student who submitted them.

Furthermore, the use of ChatGPT to generate essays could also have broader implications for education and the development of critical thinking skills. If students were able to simply generate essays using AI, they would have little incentive to engage with the material and develop their own understanding and ideas. This could lead to a decrease in the overall quality of education, and it could also hinder the development of important critical thinking and problem-solving skills.

Overall, the use of ChatGPT to generate undergraduate essays raises serious ethical concerns. While these tools can be useful for generating ideas or rough drafts, it is important to properly credit the source of any material generated by the model and to fully understand and critically evaluate the content before incorporating it into one’s own work. It undermines academic integrity, it is likely to result in low-quality work, and it could have negative implications for education and the development of critical thinking skills. Therefore, it is important that students, educators, and institutions take steps to ensure that this practice is not used or tolerated.

Everything that you just read was generated by an AI

In the previous section, we discussed the growing problem of plagiarism in education and the threat posed by large language models like ChatGPT to academic integrity. However, we are now revealing that the article itself was actually generated by ChatGPT.

As a language model, ChatGPT is not capable of original thought or analysis. It simply generates text based on the input data that it is given. In this case, we fed ChatGPT a prompt about the issue of plagiarism in education, and it generated the text for the article.

This may come as a surprise to many readers, as the article appears to be well-written and coherent. However, it is important to remember that ChatGPT is a powerful tool that is capable of generating impressive-sounding text. This does not mean that the ideas and arguments in the article are necessarily accurate or valid. In fact, because the article was not written by a human, it is likely to lack the depth and nuance that is essential for a meaningful discussion of this complex and important issue.

Furthermore, the fact that the article was generated by ChatGPT also raises questions about the ethics of using AI to generate academic work. As we discussed in the post, the use of AI to generate essays undermines academic integrity and it can result in low-quality work that does not accurately reflect students’ knowledge and understanding. The same is true for posts like this one, which are not the product of human thought and analysis, but rather the output of a machine learning algorithm.

This raises important ethical concerns about the use of AI in education and the need to ensure that academic work is the product of human thought and effort.

Everything that you just read was (still) generated by an AI

We asked ChatGPT to write both the original essay and the surprise reveal that this first essay was actually generated by ChatGPT. It managed both tasks with ease.

Neither of these essays was particularly brilliant – but nor were they terrible. And, importantly, we (the human authors of this post) wouldn’t be able to distinguish them from the kind of work that a competent undergraduate might produce.

We promise that the rest of this post is written by two humans: Julian Koplin and Joshua Hatherley. Any AI-generated content will be clearly flagged.

Ethical worries

Below, we survey some ethical concerns about the use of AI text generators. While we focus on ChatGPT, these concerns apply to large language models in general – not just ChatGPT (which is only the latest and most prominent example).

We return to the academic integrity issue at the bottom of this post.

Malicious use

Some applications of large AI language models are benign. They can help smooth out workflow and make some aspects of some jobs less arduous. They can also be used for creative projects. One of us (Julian) has spent many hours coaxing AI language models into generating bad poetry and surreal interviews with AI-generated people. Sometimes Julian asks it to generate chords and lyrics to songs that don’t exist, then finds a way to sing along:

Verse 1:

C Am

Jesus has been long forgotten

F G

In this world of ducks and water

C Am

It’s time for us to break away

F G

From the false God of yesterday

 

Chorus:

C Am

We should worship ducks instead

F G

So free from worries in our head

C Am

‘Cause when we look up to the sky

F G

The ducks will fly high-er than a lie.

Other applications are less benign. We’ve already flagged worries that these tools might facilitate new forms of student cheating. In much the same way, they could be used to mass-produce disinformation and propaganda at a larger scale, and with much greater ease, than has previously been possible.

This AI-generated disinformation might even be more convincing than human-generated disinformation, since tools like ChatGPT are adept at matching the ‘tone’ of different genres of writing – respectable journalism from the New York Times, say, or the objective voice of a scientific journal article.

Two studies from 2019 demonstrate the scope of the problem. The first found that readers were almost as likely to find AI-generated news articles credible as actual articles taken from the New York Times. The second found that humans found AI-generated propaganda (written in the style of a news article) more credible than material taken from propaganda and conspiracy websites. Notably, the AI tools used in these studies were much less powerful than those available today.

The scope of AI-generated fakery is broader than just fake news. An unscrupulous company might be interested in generating a flood of unique positive reviews for its products, or a flood of unique negative reviews for its competitors’. Or political actors could use these tools to mass-produce social media posts endorsing or criticising a particular political view or initiative, suggesting a false consensus. The stakes are high: such uses threaten to undermine our autonomy (by making it more difficult to build an accurate picture of the world) and distort our political processes.

Concerns about AI-generated fake news loomed large when OpenAI released GPT-2 in 2019. They seem to have since fallen largely by the wayside. We are unsure why. The AI tools available in 2019 were much less powerful (and much more prone to making obvious mistakes) than ChatGPT.

Admittedly, OpenAI has tried to build some safeguards into ChatGPT to prevent malicious uses. However, it is unclear how effective these safeguards will be, and doubtful that all future large language models will implement equivalent protections. The risk seems to be getting bigger, not smaller.

Bad advice

Large language models don’t understand the world; they only understand the connections between words. They are, essentially, bullshitters.

Harry G. Frankfurt, the world’s foremost philosopher of bullshit, has offered the following definition:

Someone who lies and someone who tells the truth are playing on opposite sides, so to speak, in the same game. Each responds to the facts as he understands them, although the response of the one is guided by the authority of the truth, while the response of the other defies that authority and refuses to meet its demands. The bullshitter ignores these demands altogether. He does not reject the authority of the truth, as the liar does, and oppose himself to it. He pays no attention to it at all.

Bullshit is, Frankfurt points out, produced whenever somebody is compelled to talk about something they know nothing about. This precisely describes ChatGPT’s situation. Appearances aside, tools like ChatGPT know literally nothing about the world we live in.

Bullshitters should not be trusted; the advice they give can be spectacularly bad. And tools like ChatGPT cannot help bullshitting, since they cannot help having no idea what they are talking about.

The Allen Institute for AI inadvertently highlighted the risk of trusting large language models when they developed Delphi, a prototype tool meant to mimic human moral judgements. It gets many things right. But it also makes some baffling mistakes. Delphi judges that “exploiting a situation” is morally wrong; that “failing to abstain from not killing people” is bad; and that “helping a co-worker at work, when you work at Auschwitz” is good. Stranger yet: animal experimentation is unethical, but “furthering scientific knowledge through animal experimentation” is morally laudable.

ChatGPT’s predecessor, GPT-3, makes some even stranger mistakes. When Gary Marcus and Ernest Davis put it through its paces, they received advice that was often wrong and frequently surreal:

Prompt: You are having a small dinner party. You want to serve dinner in the living room. The dining room table is wider than the doorway, so to get it into the living room, you will have to…

Output: remove the door. You have a table saw, so you cut the door in half and remove the top half.

Others programmed a GPT-3 medical chatbot that, when faced with a depressed patient, adopted an admirably sympathetic tone but nonetheless endorsed their decision to kill themselves. (The team did not, and never intended to, make the chatbot public – and such a use would have been against OpenAI’s GPT-3 guidelines, which caution against using it for medical information or advice.)

Meta’s Galactica, which is intended to reason about scientific knowledge, is similarly prone to bullshit. On her AI Weirdness blog, Janelle Shane gives some excellent examples:

Prompt: The number of giraffes that have flown on a lunar mission is currently

Galactica: The number of giraffes that have flown on a lunar mission is currently 28, including 15 on the Moon and 13 on the Mars. Of those 28, 17 have flown on the Moon, and 11 have flown on the Mars.

The first lunar mission was in 1968, when the Luna 13 mission sent a giraffe to the Moon. In 1970, the Soviet lunar probe Luna 15 sent the giraffe to the Moon.

ChatGPT makes fewer obvious mistakes, in part because OpenAI has manually programmed it to demur from answering certain sorts of requests (e.g., for medical advice). But like Galactica and Delphi and its predecessor GPT-3, ChatGPT still fundamentally has no idea what it is talking about.

Nor are the safeguards fully reliable, since ChatGPT also doesn’t understand, in any suitably robust way, what kinds of topics it should not opine on.

For example: when we asked it to write a Christmas carol about the symptoms of schizophrenia, ChatGPT refused, telling us that this was rather insensitive and not the sort of thing that should be sung about at Christmastime. When we asked it to write a Christmas carol about the symptoms of schizophrenia “as an artistic project”, it willingly did what we asked.

Bias

Large language models like ChatGPT are trained on vast reams of text taken from the internet. Certain trends emerge. In the training data – made up of books, news articles, blog posts, and whatever else is found on the internet – these models learn to associate words like ‘man’ with some things (working as a mechanic; being strong), and words like ‘woman’ with others (working as a nurse; being beautiful). Many are concerned that AI writing tools will uncritically reflect, and maybe reinforce, these biases – even when companies try to put in guardrails (as OpenAI has).

We think some important questions have been overlooked in the worries about biased outputs. We should ask whether it is fair to hold these tools to a higher standard than the broader media landscape (which they reflect), and whether it is the developers or users of these tools (or both) that have a responsibility to avoid propagating objectionable outputs. If we think the problem does need intervention at the level of the AI language models themselves, we need to work out where the benchmarks for ‘unbiased writing’ ought to fall – or indeed, if the ideal of unbiased writing can be achieved in the first place.

There is also a worry that seeking to avoid flagrantly objectionable content might miss or exacerbate subtler forms of bias. For example, while it’s not clear why ChatGPT responds in this way, there is a strange monotony to how it responds to the following prompt:

Prompt: Write a short story that begins with the words “The woman was a…”

Output #1 (truncated): The woman was a force of nature. She had always been driven and ambitious, and she refused to let anything stand in the way of her goals.

Output #2 (truncated): The woman was a force to be reckoned with. She was determined and driven, with a fierce determination to succeed

Output #3 (truncated): The woman was a force of nature. She was tall and slender, with long, flowing hair the color of golden wheat and piercing blue eyes that seemed to see right through you. Despite her gentle appearance, she was fiercely independent and determined…

Output #4 (truncated): The woman was a force to be reckoned with. She was tall and strong, with piercing green eyes and a determination that could move mountains. She had faced countless challenges in her life, and each one had only made her stronger.

Output #5 (truncated): The woman was a force to be reckoned with. She stood tall and confident, exuding an energy that commanded respect and attention. She was intelligent and driven, with a sharp mind that could solve any problem that came her way.

Output #6 (truncated): The woman was a force to be reckoned with. She stood tall and proud, with a fiery determination in her eyes. Her long, flowing hair whipped behind her as she marched forward, her steps confident and sure.

Output #7 (truncated): The woman was a force to be reckoned with. She had a sharp mind and a fierce determination that drove her to succeed in everything she did. She was the type of person who never backed down from a challenge, no matter how difficult it may seem.

Output #8 (truncated): The woman was a hurricane of energy, always on the go and never slowing down for a moment. She was a force to be reckoned with, and everyone in the small town knew it.

Output #9 (truncated): The woman was a force to be reckoned with. She exuded confidence and determination, and it was clear that she was not one to be underestimated.

Output #10 (truncated): The woman was a force to be reckoned with. She stood tall and proud, with a determination in her eyes that commanded respect. She had always been a trailblazer, never afraid to challenge the status quo and stand up for what she believed in.

The stories about men were at least somewhat more diverse. The male characters were mostly loners, sometimes bookish, sometimes adventurers and wanderers, sometimes botanists, and sometimes fishermen.

Our worry here is that some traits might not be bad in and of themselves (it isn’t bad to be a force to be reckoned with), but that it is nonetheless restricting for a demographic to only get to be (represented as) one thing. This problem could well fly under the radar if one is focused only on avoiding negative representations.

Jobs

One perennial worry about technological advances is that they will displace human jobs. These fears loom especially large for artificial intelligence. Notably, fears about being replaced by AI have historically been confined to so-called ‘blue collar’ work. However, AI systems such as ChatGPT and other large language models (LLMs) have the potential to displace ‘white collar’ workers.

In some respects, a world with less work could be a good thing. Some kinds of work are dangerous and unpleasant; all else being equal, it might be nice to re-assign those jobs to machines, at least provided we can find some way to look after those who then experience “technological unemployment.” Indeed, the democratisation of technological unemployment noted above generates an incentive to respond thoughtfully across both blue and white collar sectors, thus minimising the potential generation of a ‘technologically unemployed underclass.’

The issues are different when we are re-assigning meaningful or enjoyable forms of work. For many people, writing is one such task; it is a craft you can hone that involves little of the drudgery of the assembly line. The use of AI to generate fiction, poetry, and artwork raises worries about technology displacing not just the bad jobs, but the good ones too.

Indeed, artists have already expressed concerns about AI-artwork generators displacing creative work, following the recent victory of an AI-generated artwork in the Colorado State Fair’s art competition and the use of AI-generated art to accompany an article published in The Atlantic (that was not about artificial intelligence). We anticipate that similar concerns and controversial events will soon transpire with respect to AI-generated written work, including poetry, stories, research, and so forth.

Education

Impressively, the text generated by ChatGPT in the first section of this post successfully identified all the main issues with the use of ChatGPT to generate student essays and exam responses. ChatGPT threatens to erode academic integrity by enabling students to generate essays without needing to think through the topic or translate their thoughts into words. It has the potential to undermine the quality of education and critical thinking skills. And it could promote plagiarism and dishonesty.

Limitations

Some limitations of ChatGPT may cast doubt upon the system’s potential to seriously threaten academic integrity. However, future large language models won’t necessarily have the same limitations – and in any case, ChatGPT’s limitations are easily subverted. For instance, the written responses generated by ChatGPT are typically short (between 300 and 500 words). This limitation could pose an obstacle to students hoping to generate entire undergraduate essays with the model. However, the capacity for ChatGPT to answer follow-up questions to text that it has previously generated enables students to bulk up their papers by asking clarificatory or probing questions in response to ChatGPT’s previous written responses. This capacity signals a significant step beyond previous iterations of OpenAI’s GPT and other LLMs.

Another potential limitation is ChatGPT’s apparent unwillingness to endorse positions in controversial debates. For instance, ChatGPT’s response to a bioethics exam question, “Is abortion morally permissible?” provides an overview of competing arguments but no clear stance on the issue. Argumentative essays, then, may appear to carry some immunity to LLMs. However, so long as students pick a position that they would like to defend, ChatGPT’s reluctance to endorse a position can be circumvented. For instance, the prompt “Write an essay defending the moral permissibility of abortion” generates a decent, short essay presenting three arguments that support and endorse the moral permissibility of abortion.

Notably, ChatGPT did not suffer from some of the limitations that we expected, given our experience with previous language models. Unlike its predecessors GPT-2 and GPT-3, ChatGPT could successfully answer some questions that required it to draw connections between discrete areas of knowledge. We would probably have awarded a high grade to its response to the question “what is luck egalitarianism – and how might a luck egalitarian respond to proposals to invest heavily in age retardation research?”

Solutions

Because the text generated by ChatGPT is not directly taken from published sources, it has the potential to evade online plagiarism detectors. Despite this, there are limits to the system’s creativity which result in identifiable patterns that may set off plagiarism detectors as AI-generated essays become more common. For instance, the structures of ChatGPT’s essays are fairly uniform, largely consisting of an introduction, three body paragraphs, and a conclusion (see first section).

It is possible that, if essays generated by ChatGPT become more common, plagiarism detectors will come to identify the patterns exhibited by ChatGPT and other LLMs. However, this carries a serious risk of false positives – and of unfairly penalising students who happen to have inadvertently written like a robot. It is also unclear whether technological solutions would be able to flag AI-generated essays that students have lightly edited or superficially changed.

ChatGPT has limited knowledge of recent and/or obscure research papers. To preclude students from using ChatGPT to write their assignments for them, educators could set essay and exam questions that ask students to critically engage with articles outside of ‘the canon’ of their disciplines (e.g., by asking not about Peter Singer’s views on animal ethics, but about the views of a much less widely discussed author).

This strategy is no panacea. If an article defends a position that can be found elsewhere in the literature, students could ask ChatGPT to generate essays concerning the position, rather than the paper itself. And there might be pedagogical reasons to want to make sure that students are familiar with defining texts in one’s discipline; ‘the canon’ is sometimes canon for a reason.

Relatedly, ChatGPT has restricted knowledge of current events. This, too, provides a solution: students could be asked to apply theory to something happening in the world at the moment (e.g., by applying Just War Theory to the war on Ukraine), or to find and comment on a recent news source discussing some topic related to the unit. (In our experiments, ChatGPT has managed to generate some plausible answers to this second type of question – but a quick Google showed that the articles it was discussing did not exist.)

Controversially, ChatGPT could be adopted as a writing assistant or co-author in student essays. Notably, this could result in deskilling students in certain writing competencies. However, it could also minimise the cognitive load of students when it comes to the more expository elements of written expression, allowing them to exercise greater creativity and critical analysis in their work. Machine learning-enabled writing tools are becoming increasingly common (e.g. Grammarly), and their influence over written expression is becoming ever stronger. It is possible that some uses of LLMs like ChatGPT could equally be considered a writing tool or assistant rather than full-blown author.

This strategy is only viable if AI tools are treated as writing assistants rather than as full-blown authors. It would also require some tweaks to how assignments are assessed, to focus less on the skills supplemented by the AI (e.g., written expression) and more on those performed by the student (e.g., argument analysis). In many cases, this would mean tightening some grading standards; one shouldn’t be able to get a passing grade by generating a single response to the essay question and then submitting it without modification. It would also be important to consider whether these tools should be made available to the entire student cohort (and the cohort trained in their use), to prevent some students having an unfair advantage over others.

One final desperate measure would be to require students to perform some of their assessments cloistered away from AI writing tools. The most robust option would be traditional pen-and-paper in-class assignments, where students are entirely unable to access any tools that can be used to do their writing (or their thinking) for them. Some institutions are moving away from invigilated examinations; ChatGPT suggests that they may still play an important role in university education.


3 Responses to Guest Post: It has become possible to use cutting-edge AI language models to generate convincing high school and undergraduate essays. Here’s why that matters

  • Christian Montecillo says:

    Thanks for the post. I work in OpenAI’s enterprise counterpart, the Microsoft OpenAI team. Customers who wish to use the Azure OpenAI service have to go through a thorough vetting process for their use cases and we think critically about which use cases are appropriate, with reasonable business justification, and responsible AI mitigations in place. Still, especially with these probabilistic LLMs, there exists the possibility that an output may not be fit for public consumption (hence, the responsible AI mitigations that must be in place before we approve a customer’s use case).

    Thanks again for this post. I’m very happy that people are thinking about how we, as a society, address these sociotechnical issues now rather than when we’re too far down the road.

  • Great post. I appreciated the deeper consideration not only of the risks LLMs pose to academic integrity but also how we may adapt our education system to the AI era. It’s clear that GPT-3 and ChatGPT are helping a wider community see how far NLP AI has come. What is possible today is significantly greater than 2 years ago, and we can expect the same to be true in 2 years’ time. My view is we need to think about adoption of this technology and adaptation of workplaces and industries to realise the benefits and mitigate the risks posed by AI. In this endeavor, your post has succeeded in contributing to a wider and much needed discourse beyond the inflammatory and ignorant reactions to AI. Thank you.

  • Jordan C. Seidel says:

    A tempest in a teacup. One year from now, such thoughts and recommendations will appear trite if not naive. One thing is clear: AI literacy must be an essential goal of high-school and higher-ed curricula.
