You might have seen the headlines: Colossal Biosciences claims to have brought back the dire wolf. Except, it’s not quite a direct resurrection. What Colossal actually created are genetically engineered proxies: grey wolves modified to have some dire wolf traits.
I wondered if the news might renew interest in the ethics of “de-extinction” and perhaps lead to an uptick in ethical analyses on the topic.
Typically, these involve some attempt at understanding the science and its uncertainties, identifying who and what is affected (stakeholders, values, potential impacts), applying various ethical frameworks to clarify trade-offs, and ultimately informing potential policy or action.
Letting Gemini 2.5 Have a Go
I decided to see how Gemini 2.5 – currently Google’s most advanced language model – would handle this process.
Step 1: Gemini’s Deep Research
I started with Gemini 2.5’s ‘Deep Research’ feature. This tool can explore hundreds of web sources, analyze the information, and synthesize cohesive reports that are often 10k+ words long.
I asked it to report on the dire wolf proxy news, frame it within broader de-extinction discussions, review the ethics literature, and pinpoint critical regulatory questions.
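Deep Research runs inside the Gemini app rather than through code, but for the curious, here is roughly what that brief amounts to if written out as a Python SDK call. Treat it as a sketch of the request’s shape only: the model name is a guess, the prompt is a paraphrase, and a plain `generate_content` call won’t do the multi-source browsing the app feature does.

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
# Model identifier is an assumption; the actual run used the Deep Research
# feature in the Gemini app, not a raw API call.
model = genai.GenerativeModel("gemini-2.5-pro")

# Paraphrase of the research brief described above, not the exact prompt.
research_brief = """
Write a long-form research report on Colossal Biosciences' 'dire wolf' proxy announcement:
1. Summarise the news and the underlying science, including key uncertainties.
2. Situate it within the broader de-extinction debate.
3. Review the academic ethics literature on de-extinction and proxy creation.
4. Identify the critical regulatory and governance questions it raises.
"""

report = model.generate_content(research_brief)
print(report.text)
```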
Minutes later I had a coherent 12,000-word report.
More than coherent: it was detailed, comprehensive, and supported by 93 references, none of them hallucinated (though scroll down for my annotations on the limitations of the works cited).
Step 2: AI-Powered Ethical Analysis
Building on that initial report, I used the standard Gemini 2.5 model for the next stage. I guided it through five detailed prompts mirroring an ethical analysis workflow (sketched as a hypothetical scripted pipeline just after the list):
- Define the subject, aim, scope.
- Identify values, stakeholders, key arguments.
- Apply ethical frameworks.
- Evaluate alternatives and dilemmas.
- Formulate judgments and recommendations for stakeholders.
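Purely as an illustration, here is a minimal sketch of how that five-stage workflow might look if scripted against the Gemini API. The model name, file path, and prompt wording are stand-ins, not my actual setup; the real prompts were considerably longer.

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-2.5-pro")  # model name is an assumption
chat = model.start_chat(history=[])

# The Deep Research report from Step 1, assumed here to be saved to disk.
with open("dire_wolf_deep_research.md") as f:
    report_text = f.read()

# Condensed stand-ins for the five detailed prompts; the real ones were longer.
stages = [
    "Define the subject, aim, and scope of an ethical analysis of the attached report.",
    "Identify the values, stakeholders, and key arguments involved.",
    "Apply the main ethical frameworks (consequentialist, deontological, virtue ethics) to the issues identified.",
    "Evaluate the alternatives and dilemmas, making trade-offs explicit.",
    "Formulate judgments and tailored recommendations for each stakeholder group.",
]

sections = []
for i, stage in enumerate(stages):
    # Attach the background report only on the first turn; the chat history
    # carries it forward for the later stages.
    prompt = f"Background report:\n{report_text}\n\n{stage}" if i == 0 else stage
    response = chat.send_message(prompt)
    sections.append(response.text)

analysis = "\n\n".join(sections)
print(analysis)
```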
The result was a 5,000-word ethical analysis rooted in the initial ‘Deep Research’ report.
Frankly, it was good – if a colleague had sent it to me as a ‘first completed draft’, I’d assume it took them weeks of work. More than that, it was nuanced enough that I’d say professional ethicists could reasonably disagree on some of its finer points – a pretty significant threshold for language model performance.
Some high-level assessments
- Identifying Core Concepts: The analysis did a good job of pinpointing the central ethical issues, like animal welfare, ecological risk, restorative justice, and moral hazards (see my annotations on p.2 and elsewhere). It didn’t just list them; it articulated ambiguities, asking, for instance, how predictable animal suffering should be weighed against uncertain future benefits (like ecological restoration) and what level of harm is acceptable.
- Stakeholder Analysis: The list of affected parties it generated was certainly more complete than any I might have developed myself.
- Unexpected Clarity: This was particularly striking. After identifying stakeholders and competing values, it created an unsolicited table linking each stakeholder to potential positive/negative impacts and the core values involved. It did the same for the table summarising arguments for and against proxy creation. Both were clearer than most human-authored breakdowns I’ve seen.
- Evaluating Alternatives: Its analysis of different paths forward (e.g., pursuing proxy creation vs. other options) was thoughtful. Again, it produced an unsolicited table evaluating alternatives through consequentialist and deontological lenses, outlining trade-offs (see table below).
- Tailored Recommendations: The final recommendations followed logically from the analysis and were broken down for key stakeholders in a pointed, non-generic way.
Table generated in the analysis evaluating alternatives to proxy creation:
| Alternative | Evaluation using Consequentialism | Evaluation using Deontology/Rights-Based Ethics | Key Trade-Offs |
| --- | --- | --- | --- |
| 1. Halt Project & Redirect Resources | Likely High Positive Net Utility: Avoids animal suffering (- disutility), eliminates ecological risks (- disutility), redirects resources to likely more effective conservation (+ utility). Forgoes potential unique knowledge from proxies (- minor utility?). | Strongly Positive: Aligns with duty of non-maleficence, duty to respect animals (ends instrumentalization), potentially duty of justice (resource allocation). Upholds honesty if accompanied by truthful communication. Avoids inherent wrongness of instrumental creation. | Sacrifices specific knowledge/tech development path; Financial/reputational cost to Colossal. |
| 2. Refocus on Proxy Research (Enhanced Ethics) | Ambiguous/Improved Utility: Reduces negative utility from suffering/risk if enhancements are truly effective. Increases positive utility via honesty. Still incurs welfare costs & resource diversion. Net utility depends heavily on effectiveness & containment. | Improved but Still Problematic: Reduces violation of non-maleficence. Upholds duty of honesty. However, still treats animals as means to an end, violating inherent value/rights (Regan). Instrumentalization remains a core objection. | Higher costs; Requires abandoning "de-extinction" hype; Still ethically objectionable to many due to instrumentalization; Welfare improvements may be limited by current technology. |
| 3. Shift to Non-Animal Methods | Likely Positive Net Utility: Avoids all animal suffering and ecological risks (- disutility eliminated). Allows knowledge gain (+ utility) efficiently. Frees resources. High positive utility compared to live proxy creation. | Strongly Positive: Fully aligns with duties of non-maleficence and respect for animals (no sentient beings instrumentalized or harmed). Allows ethical pursuit of knowledge. | Sacrifices knowledge only obtainable from a whole, live organism; Requires shift in research focus/methods for Colossal. |
| 4. Implement Robust Governance Framework First | Positive Long-Term Utility: Likely improves overall outcomes by ensuring better risk management, incorporating societal values, and preventing reckless development (+ utility). May cause short-term delays (- utility). | Positive: Upholds procedural justice, accountability, and transparency. Ensures duties (precaution, consultation) are considered. Helps protect rights from being easily overridden by utility claims. Aligns with responsible governance principles (cf. Jasanoff, 2005 on biotech governance). | Slows innovation/research; Requires political will and resources to establish effective oversight; Does not resolve underlying ethical conflicts but provides a process to manage them. |
Room for improvement:
- Normative Depth: While it identified relevant trade-offs – like the consequentialist tension between immediate harms (animal suffering) and potential large, uncertain future benefits (ecological restoration and scientific advancement) or deontological concerns about animal rights and manipulating nature – the application of these frameworks wasn’t as deep or comprehensive as it should have been. (Note: Dedicated prompts exploring specific normative theories hugely improved on this – have a look at this in-depth consequentialist analysis).
- Lack of novelty: While the prompts did not ask for novelty, the synthesis that went into the report was just that: a synthesis of what it found online. This may be fine for a high-level overview, which is what this is, but it’s a reminder that, currently, the strength of language models lies more in rapid synthesis and structured analysis than in true conceptual or normative originality.
- Works cited: The ethical analysis itself only cites specific sections from the Deep Research report, so the analysis isn’t directly grounded in the literature. And while the Deep Research report didn’t have a single hallucinated source, i) some of the works listed weren’t actually cited in the report, ii) academic sources were limited to open-access ones or those on ResearchGate, and iii) there were a couple of not-so-useful sources from Reddit/social media discussions of the main news item.
Taking a step back, it’s worth noting the value of applying follow-up prompts here. Using ‘contrarian’ prompts, or ones that adopt specific expert viewpoints (like a defence department bio-research overseer), helped uncover some biases, such as the analysis’s arguably tech-pessimistic tone, and surfaced important arguments that were missing, for instance regarding dual-use research (a sketch of such a follow-up is below).
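Continuing the hypothetical chat session from the Step 2 sketch, such a follow-up can be as simple as one more turn. The persona and wording here are illustrative, not my exact prompt.

```python
# `chat` is the session from the Step 2 sketch above, with the analysis in its history.
followup = (
    "Now review your own analysis from the perspective of a defence-department "
    "bio-research overseer. Which conclusions would you push back on, which "
    "important arguments are missing (e.g. dual-use research concerns), and "
    "where does the analysis read as technology-pessimistic?"
)
critique = chat.send_message(followup)
print(critique.text)
```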
What This Means for Research Practices
Overall, I’m fairly blown away by this. It’s increasingly hard to deny that language models are rapidly becoming capable of automating core research tasks like literature reviews and synthesizing complex arguments into a large, coherent body of text.
Last year language models struggled with anything over 800 words, let alone citing sources. Where will they be next year? And what will that mean for research practices, especially ones that rely mostly on abstract reasoning abilities?
I’m hesitantly optimistic. The ability to automate and direct intelligence towards specific end goals will force disciplines — especially in the humanities — to take a much-needed step back and ask: What are we fundamentally trying to achieve here? Answering that will require a clear vision and the ability to articulate detailed, plain-language explanations of the methods needed to get there.
Some medium-term possibilities:
- Shift from ‘Grunt Work’ to ‘Taste’: The crucial human skill becomes ‘research taste’ – the kind of expertise guiding researchers to ask good questions and pursue productive lines of inquiry. It’s about identifying promising directions and having a vision for one’s field.
- Prompt Engineering & Output Analysis: Refining prompts, experimenting with different AI models and pipelines, and developing methods to critically evaluate LLM outputs will become key skills.
- Reasoning Transparency and Analysis: Language models increasingly use a ‘chain-of-thought’ process, effectively showing the deliberations that went into their outputs. While these ‘thoughts’ may not faithfully reflect the internal reasoning, they offer a partial kind of transparency often lacking in human scholarship. This could allow us to unpack normative disagreements more effectively by revealing the specific points of divergence that are often hidden within the tacit, idiosyncratic reasoning of human scholars (see this for an analogous point on medical reasoning). Evaluating language models’ chains-of-thought could become a big part of the job.
Understandably, many researchers may not welcome such fundamental changes to their work, which carry significant trade-offs. One major concern involves exacerbating ‘check-box ethics’, worsening the existing tendency for ethical reviews to become superficial formalities. The ease of producing long, convincing outputs could allow difficult questions to be glossed over and potentially lead to a homogenization of ethical discourse if we don’t develop methods to evaluate LLM outputs critically.
Beyond this, de-skilling is an obvious major concern, particularly if the value of early-career researchers shifts from things like conducting literature reviews and drafting manuscripts to prompt testing and output validation. Furthermore, if funding priorities move from supporting personnel to covering access to compute, the very structure and availability of research jobs, at least as we know them, could be threatened.
Nonetheless, I don’t see this as ‘replacing’ ethicists — in this case — so much as augmenting their work and shifting their focus towards higher-level strategy, articulating methods in plain language, and precisely defining the purpose of their inquiries. What’s clear is the need to seriously consider how these tools will reshape research practices, possibly sooner than we think.