Study Finds Large Language Models Can Re-Identify Anonymous Users at Scale


Insider Brief

  • A new study finds that large language models can re-identify anonymous online users at scale using only unstructured text, undermining assumptions about online privacy.
  • The researchers show that LLM-based systems significantly outperform classical methods, achieving meaningful recall at high precision by extracting identity signals, searching candidates, and reasoning over matches.
  • The findings indicate that deanonymization becomes easier with more user data and compute, suggesting current approaches to anonymization and platform privacy may need to be reconsidered.

Large language models can now re-identify anonymous online users using only their writing, according to a new study, raising questions about whether pseudonymity on the internet still offers meaningful protection.

The research shows that modern AI systems can automate the process of deanonymization—matching anonymous profiles to real-world identities or other accounts—using unstructured text such as forum posts, comments, and interview transcripts. Tasks that once required hours of manual investigation can now be completed in minutes at relatively low cost.

The study — which the team posted on arXiv — was conducted by researchers from ETH Zurich, Anthropic, and collaborating institutions.

The findings suggest that the long-standing assumption of “practical obscurity” — that anonymity holds because identifying individuals is too difficult or expensive — no longer applies.

Scattered Clues to Automated Identification

The study frames deanonymization as a two-step process: first, build a profile from a person’s writing; then, match that profile to a known identity. Historically, both steps depended on structured data, such as ratings or demographic attributes, or required skilled human analysts.

The researchers show that large language models can now perform both steps end-to-end. By analyzing text, the models extract identity-relevant signals such as education, profession, location, interests, and even subtle linguistic cues. These signals are then used to search for and verify potential matches across large datasets or the open web.

In one experiment, AI agents were tasked with identifying users based on anonymized profiles derived from platforms such as Hacker News and Reddit. The systems achieved recall rates between roughly 25% and 67% at precision levels ranging from 70% to 90%, meaning they could correctly identify a substantial portion of users while keeping errors relatively low.
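Precision and recall here carry their standard meanings: precision is the fraction of the system’s guesses that are correct, and recall is the fraction of all users it successfully identifies. A toy sketch (all names and numbers below are invented for illustration, not taken from the study):

```python
def precision_recall(predictions, truth):
    """Compute precision and recall for identity-matching guesses.

    predictions: dict mapping anonymous user -> guessed identity
                 (the system may abstain on hard cases)
    truth:       dict mapping anonymous user -> true identity
    """
    correct = sum(1 for user, guess in predictions.items()
                  if truth.get(user) == guess)
    precision = correct / len(predictions) if predictions else 0.0
    recall = correct / len(truth) if truth else 0.0
    return precision, recall

# Hypothetical run: 10 users total, the system guesses on 5, gets 4 right.
truth = {f"u{i}": f"person{i}" for i in range(10)}
preds = {"u0": "person0", "u1": "person1", "u2": "person2",
         "u3": "person3", "u4": "person9"}  # one wrong guess
p, r = precision_recall(preds, truth)
print(p, r)  # 0.8 precision, 0.4 recall
```

Abstaining on uncertain cases is what lets a system hold precision high while recall varies, which is why the paper reports recall *at* a given precision level.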

In another test involving anonymized interview transcripts with scientists, the system correctly identified 9 out of 33 individuals, despite the removal of explicit identifiers.

To test performance at scale, the researchers developed a modular pipeline that breaks deanonymization into four steps: extracting features from text, searching for candidate matches, reasoning over those candidates, and assigning confidence scores.

This pipeline allows the system to operate across large datasets and measure performance using standard metrics such as precision and recall. It also enables comparison with earlier approaches, including the well-known Netflix Prize attack, which relied on structured movie-rating data.
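The four stages described above can be sketched as a small pipeline. This is a minimal illustration under assumed interfaces; the helper names (`extract`, `search`, `reason`) and the thresholding logic are hypothetical stand-ins, not APIs from the paper:

```python
from dataclasses import dataclass

@dataclass
class Match:
    candidate_id: str
    confidence: float

def deanonymize(text, extract, search, reason, threshold=0.9):
    """Modular deanonymization pipeline mirroring the study's four steps:
    1. extract(text)          -> identity-relevant features (profession, location, ...)
    2. search(features)       -> candidate identities from an index or the open web
    3. reason(text, candidate)-> model judgment: same person? returns a confidence score
    4. threshold              -> only report matches above the required confidence
    """
    features = extract(text)
    candidates = search(features)
    scored = [Match(c, reason(text, c)) for c in candidates]
    scored.sort(key=lambda m: m.confidence, reverse=True)
    return [m for m in scored if m.confidence >= threshold]
```

Raising `threshold` trades recall for precision: the system reports fewer, surer matches. That tunable operating point is how results such as 45% recall at 99% precision are reached.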

The results show a sharp improvement over those earlier methods. In a benchmark linking Hacker News users to LinkedIn profiles, the AI-based system achieved up to 45% recall at 99% precision, compared with near-zero performance from classical techniques.

The reasoning stage — where the model evaluates whether two profiles belong to the same person — proved especially important, significantly improving accuracy beyond simple similarity matching.

More Data And More Compute Increase Risk

The study finds that the likelihood of successful identification increases as more user data becomes available. In experiments linking Reddit users across communities, recall rose sharply as the number of shared references — such as movies discussed — grew. Users who revealed more about themselves were substantially easier to identify.

Increased computational effort also improved outcomes. Models that spent more time reasoning about candidate matches performed better, particularly when aiming for high-confidence results.

Even as the number of potential matches grows, the system remains effective. Performance declines gradually as datasets scale from thousands to tens of thousands of users, and the researchers estimate that meaningful success rates could persist even with candidate pools in the millions.

The findings suggest that the signals used to identify individuals online are not new. Human investigators have long relied on similar clues. What has changed is the speed, cost, and scale at which those clues can now be analyzed.

As a result, the study argues that online privacy models need to be reconsidered. Removing names, usernames, and other direct identifiers may no longer be sufficient if the remaining text still contains enough context to reconstruct identity.

The research also challenges the idea that unstructured text is inherently safer than structured datasets. Previous work showed that anonymized datasets could often be re-identified. This study extends that concern to everyday online communication.

Limitations And Future Work

The researchers report that some experiments rely on users who voluntarily linked accounts across platforms, which may make them easier to identify than fully anonymous individuals. Other tests use synthetic setups, such as splitting a single user’s activity into separate profiles, which may not fully reflect real-world conditions.

The system’s reliance on external tools, including web search, also makes it difficult to isolate the contribution of the language model itself. And ethical constraints prevent large-scale testing on genuinely anonymous users, leaving some uncertainty about how the results translate to the broader internet.

The study points to several directions for future work, including combining semantic analysis with writing-style techniques to improve accuracy, and developing better methods to anonymize text without reducing its usefulness.

Overall, the findings suggest that platform policies and user expectations around anonymity may need to evolve. As AI systems lower the barrier to large-scale identification, the distinction between anonymous and identifiable online behavior is becoming increasingly difficult to maintain.


For a deeper, more technical dive, please review the paper on arXiv. It is important to note that arXiv is a pre-print server, which allows researchers to receive quick feedback on their work. However, neither arXiv papers nor this article are peer-reviewed publications. Peer review is an important step in the scientific process to verify results.
