XIFENG YAN



Research Guidelines
(1) Identify fundamental concepts and principles of data mining,
(2) Design algorithms to model, manage, and mine large-scale graphs and networks in bioinformatics, social networks, and computer systems,
(3) Develop systems to facilitate complex knowledge discovery, involving data mining, data management, and machine learning.
Research Areas 
Data Mining Foundations
Mining and Managing Large-Scale Graphs
Biological Network Analysis
Social Network and Business Analytics
Data Mining for Software and Systems

My research is funded by NSF, ARMY, Alzheimer's Association, UCSB, and CNSF.

Data Mining Foundations

Frequent pattern mining has been studied extensively in the data mining community. Unfortunately, the exponentially large pattern sets generated by many mining processes have undermined their general utility. We are drowning in data; ironically, we are also drowning in the patterns we discover ourselves. To overcome this pattern redundancy problem, we proposed statistical and combinatorial models to summarize discovered patterns. We continue to work on the foundations of the following mining problems: (1) Colossal pattern mining, (2) Direct mining of significant/discriminative/interesting patterns, (3) Approximate pattern mining, and (4) Pattern-based classification and clustering.
[SDM'03, VLDB'05, KDD'05a, KDD'06, ICDE'07a, ICDE'07b, ICDM'07b, ICDE'08, SIGMOD'10]
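
To make the pattern redundancy problem concrete, here is a minimal sketch in Python. It mines all frequent itemsets from a toy transaction set by brute force, then keeps only the closed itemsets (those with no proper superset of equal support). Closed itemsets are a standard, simple form of pattern compression and are used here only as a stand-in; this is not the statistical or combinatorial summarization models from the papers cited above. The toy data and the function names are assumptions made for illustration.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for every itemset contained in at
    least min_support transactions (brute force, toy data only)."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        found = False
        for candidate in combinations(items, size):
            cset = frozenset(candidate)
            support = sum(1 for t in transactions if cset <= t)
            if support >= min_support:
                result[cset] = support
                found = True
        if not found:
            # Anti-monotonicity: if no itemset of this size is frequent,
            # no larger itemset can be frequent either.
            break
    return result


def closed_itemsets(frequent):
    """Keep only itemsets that have no proper superset with the same support."""
    return {
        s: sup
        for s, sup in frequent.items()
        if not any(s < other and sup == frequent[other] for other in frequent)
    }


# Hypothetical toy database: four transactions over five items.
transactions = [
    frozenset("abcde"),
    frozenset("abcde"),
    frozenset("abc"),
    frozenset("de"),
]

frequent = frequent_itemsets(transactions, min_support=2)
closed = closed_itemsets(frequent)

print(len(frequent), "frequent itemsets")  # 31: every non-empty subset of {a,b,c,d,e}
print(len(closed), "closed itemsets")      # 3: {a,b,c}, {d,e}, {a,b,c,d,e}
```

Even on four transactions, every non-empty subset of the five items is frequent, yet three closed itemsets with their supports are enough to recover the support of all 31. The gap widens exponentially as patterns grow longer, which is what motivates summarization and direct mining.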

Managing and Mining Large-Scale Graphs and Networks (Foundation of Graph Information Systems)

Graph data has grown steadily in a wide range of scientific and commercial domains, such as bioinformatics, computer security, and social networks. However, due to the lack of management and mining tools, it is extremely hard, if not impossible, for users to search and analyze any reasonably large collection of graphs. For instance, browsing and crosschecking biological network databases depicted simultaneously in multiple windows is by no means an inspiring experience for scientists. My study focuses on two fundamental problems in large-scale graph data mining and graph data management: (1) For a given graph data set (a single large network or multiple graphs), what are the hidden structural patterns, and how can they be found? (2) Given a graph query, how can large-scale graph datasets be indexed and searched quickly? We have made enough progress on these two problems that the design of a general graph information system is now within reach. [ICDM'02, KDD'03, KDD'05b, SIGMOD'04, SIGMOD'05, ICDE'06, PAKDD'07, ICDM'07a, VLDB'07a, TODS'05, TODS'06, KDD'08b, SIGMOD'08, SDM'09, VLDB'09, ICDE'10, SIGMOD'10, SIGKDD'10]

Biological Network Analysis

Biological networks, including metabolic, protein-protein interaction, signaling, and transcriptional regulatory networks, are not randomly organized; their structure supports the proper functioning of cells. Both the activities of individual molecules in these networks and their interactions are integrated and coordinated in a timely and robust manner in response to intrinsic and external signals. By searching and mining these interaction networks, it is becoming possible to identify the pathways and modules that control specific biological processes. The objective of this research is to identify composite network motifs in cellular systems and transcriptional regulatory modules from multiple networks. [KDD'03, KDD'05b, ISMB'04, Bioinformatics'06, ISMB'07a, ISMB'07b]

Social Network and Business Analytics

Social networks are another rich data source for network research. We are working on computing problems arising from networks extracted from help-desk tickets and emails. By mining ticket processing data, we can quantitatively analyze individual behaviors and social relationships in an organization, which could shed light on management optimization. Social networks derived from email communications offer another view into an enterprise's organization. We are investigating email networks to infer and categorize the expertise of employees, which could provide a new approach to labor resource management. [PKDD'05, VLDB'07b, KDD'08a, VLDB'08 demo, ICDE'09 demo, ASONAM'10, BPM'10]

Data Mining for Software Analysis and Computer Security

The current strain on software development and services demands systems that are less reliant on human intervention. The amount of data generated by these systems, such as system logs, program traces, and intrusion alerts, is ever increasing. Mining this system data has the potential to make computing more intelligent, reliable, and maintainable. We are seeking data mining techniques that enhance the performance and reliability of computer systems, for example by improving the effectiveness of storage caching, identifying intrusion sources, and isolating software bugs through mining source code and runtime data. [FCRC'03, FAST'04, SDM'05, FSE'05, SDM'06, TSE'06, ISSTA'09, Oakland'10]