Graph Deanonymization: Circumventing Privacy in Anonymized Networks

Siddhartha Datta


Anonymization

What do we wish to achieve from network anonymization? Why do networks need to be anonymous? What are our constraints?

  • Share the network data (e.g. for academic or industrial collaboration)
  • Domain may be sensitive (e.g. communication data, social networks, mobility trace networks, peer-to-peer networks, epidemiological/healthcare networks)
  • Things we may wish to hide: node privacy, edge privacy, node-attribute privacy
  • Preserve graph properties (e.g. Degree, Path Length, Local & Global Clustering Coefficients, Betweenness Centrality; see the snippet after this list)
  • Inhibit de-anonymization
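
For reference, each of these properties can be computed with the networkx library; a short illustrative snippet (the karate-club graph is a stand-in for a shared network, not any particular dataset):

    import networkx as nx

    G = nx.karate_club_graph()  # stand-in for a shared network
    print("mean degree:", sum(d for _, d in G.degree()) / G.number_of_nodes())
    print("avg shortest path length:", nx.average_shortest_path_length(G))
    print("avg local clustering:", nx.average_clustering(G))
    print("global clustering (transitivity):", nx.transitivity(G))
    print("max betweenness centrality:", max(nx.betweenness_centrality(G).values()))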

Controversy over the Netflix Prize dataset

Netflix released a pseudonymized dataset of movie preferences as part of a competition on recommender systems. Researchers used information from the Internet Movie Database (IMDb) to de-anonymize a number of records. In 2010 the planned follow-up competition was cancelled following a lawsuit on privacy grounds.

De-anonymization

  • Incentives: learn sensitive patterns among node attributes, or sensitive attributes of individual nodes
  • Applications enabled by attacks: mass surveillance, abusive marketing, phishing

Methods

Anonymization Techniques:

  • Naive ID Removal: replace each node label with a randomly mapped pseudonym
  • Edge Editing based Anonymization (sketched after this list): AddDel (k randomly chosen edges are added to G, then another k randomly chosen edges are deleted from G), Switch (k edge switches are performed, where an existing edge is removed and re-routed to a different pair of nodes)
  • k-anonymity: remove node attributes based on a metric, e.g. k-Neighborhood Anonymity, k-Degree Anonymity
  • Aggregation/Class/Cluster based Anonymization: anonymize users into clusters; the anonymized graph consists of supernodes (each corresponding to the nodes in a cluster/subgraph) and superedges (edge densities among supernodes)
  • Random Walk based Anonymization: an edge between two users i and j is replaced by an edge between i and u, where u is the destination of a random walk starting from j
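
A minimal sketch of the AddDel step above, using networkx (the function name anonymize_adddel and the usage are illustrative, not taken from a specific paper; Switch and random-walk variants would follow the same pattern):

    import random
    import networkx as nx

    def anonymize_adddel(G, k, seed=None):
        # AddDel: add k random non-edges to G, then delete k random edges,
        # so the total edge count is preserved while structure is perturbed.
        rng = random.Random(seed)
        H = G.copy()
        nodes = list(H.nodes())
        added = 0
        while added < k:
            u, v = rng.sample(nodes, 2)
            if not H.has_edge(u, v):
                H.add_edge(u, v)
                added += 1
        for u, v in rng.sample(list(H.edges()), k):
            H.remove_edge(u, v)
        return H

    G = nx.karate_club_graph()
    G_anon = anonymize_adddel(G, k=10, seed=0)
    print(G.number_of_edges(), "edges before,", G_anon.number_of_edges(), "after")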

Methods

De-anonymization Techniques:

  • Seed-based De-anonymization: first identify some users as seeds (e.g. infiltrate a social network and identify your seed-node subgraph); active (e.g. create extensive fake accounts around the network), passive (e.g. use an existing personal account); a toy propagation sketch follows this list
  • Seed-free De-anonymization: use probabilistic reasoning (e.g. minimize the edge difference between the anonymized graph and an auxiliary graph under simulated node labellings and simulated edge creation/removal)
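
A toy sketch of seed-based propagation in the spirit of such attacks (heavily simplified: real attacks score candidates with eccentricity thresholds and reverse checks; the propagate function and its confidence threshold are illustrative assumptions):

    import networkx as nx

    def propagate(G_anon, G_aux, seeds):
        # Greedy propagation: extend a known anon->aux mapping node by node,
        # matching each unmapped anonymized node to the unused auxiliary node
        # that shares the most already-mapped neighbours.
        mapping = dict(seeds)
        changed = True
        while changed:
            changed = False
            for u in G_anon.nodes():
                if u in mapping:
                    continue
                mapped_nbrs = {mapping[w] for w in G_anon[u] if w in mapping}
                if not mapped_nbrs:
                    continue
                used = set(mapping.values())
                best, best_score = None, 0
                for v in G_aux.nodes():
                    if v in used:
                        continue
                    score = len(mapped_nbrs & set(G_aux[v]))
                    if score > best_score:
                        best, best_score = v, score
                if best is not None and best_score >= 2:  # crude confidence threshold
                    mapping[u] = best
                    changed = True
        return mapping

    # Self-test on two copies of the same graph with two seed nodes.
    G = nx.karate_club_graph()
    mapping = propagate(G, G, {0: 0, 33: 33})
    correct = sum(1 for a, b in mapping.items() if a == b)
    print(correct, "of", len(mapping), "mapped nodes are correct")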

Methods

De-anonymization attack components:

  • Seed: a small set of nodes the attacker can already identify in the anonymized graph (e.g. structure they inserted themselves), from which the rest of the mapping can be recovered
  • Active/Passive: whether the attacker needs to make widespread modifications to the network they wish to compromise
  • Auxiliary graph: any pre-existing data that can be used in conjunction with the anonymized graph to match patterns, identify nodes, etc.
  • Domain knowledge: if the attacker can estimate the scheme used to anonymize the dataset (e.g. that it was anonymized by community), they can reformulate their de-anonymization strategy accordingly

Hypothetical situation:

  • The TAs are provided with a k-anonymized graph dataset of student summatives (i.e. some nodes are anonymized, some attributes are anonymized). Is it beyond the realm of reason to expect them to figure out who wrote what?
  • Imagine Bernie takes each student and pairs them with their summative text.
  • To restrict access to the full text, he may reduce each summative to a set of attributes (e.g. its 100 most frequently used words), which can then be k-anonymized (sketched below).

k-attribute anonymization
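
A toy sketch of what this could look like as suppression-based k-anonymity over word-attribute sets (the strategy of dropping the globally rarest word is an illustrative assumption, not an optimal or standard scheme):

    from collections import Counter

    def k_anonymize_attributes(records, k):
        # Suppression sketch: drop the globally rarest word until every
        # remaining attribute set is shared by at least k students.
        records = {name: set(attrs) for name, attrs in records.items()}
        while True:
            groups = Counter(frozenset(attrs) for attrs in records.values())
            if all(count >= k for count in groups.values()):
                return records
            freq = Counter(a for attrs in records.values() for a in attrs)
            if not freq:
                return records  # everything suppressed; give up
            rarest = min(freq, key=freq.get)
            for attrs in records.values():
                attrs.discard(rarest)

    # Three students and their (tiny) most-frequent-word sets.
    students = {
        "alice": {"network", "privacy", "graph"},
        "bob":   {"network", "privacy", "graph"},
        "carol": {"network", "privacy", "seed"},
    }
    print(k_anonymize_attributes(students, k=2))
    # "seed" and then "graph" get suppressed, leaving indistinguishable sets.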

Evaluation & Reflection

Should we be concerned? Is the method flawed? What would put our privacy at greater risk?

  • Structure-based de-anonymization, or a known node list from external sources (which the defender cannot control for), nullifies many methods that retain graph structure
  • Perturbing the graph changes its structure and may defeat the purpose of sharing it in the first place (e.g. if too many edges are removed and random ones created, clustering coefficients may change; see the check after this list)
  • De-anonymization is not a fully automatic process; it requires understanding the context, accumulating resources, and identifying vulnerabilities
  • The use of seeds in de-anonymization is more reliable, but not all situations support seeds.
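
To make the perturbation point concrete, a quick self-contained check of how random edge editing shifts the average clustering coefficient (networkx; the example graph and edit count are arbitrary choices):

    import random
    import networkx as nx

    G = nx.karate_club_graph()
    H = G.copy()
    rng = random.Random(0)

    # Delete 20 random edges, then add 20 random non-edges (cf. AddDel earlier).
    for u, v in rng.sample(list(H.edges()), 20):
        H.remove_edge(u, v)
    nodes = list(H.nodes())
    added = 0
    while added < 20:
        u, v = rng.sample(nodes, 2)
        if not H.has_edge(u, v):
            H.add_edge(u, v)
            added += 1

    print("avg clustering before:", round(nx.average_clustering(G), 3))
    print("avg clustering after: ", round(nx.average_clustering(H), 3))
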
More papers to read: