**Fangqiu Han**, Shulong Tan, Huan Sun, Xifeng Yan, Mudhakar Srivatsa, Deng Cai, "**Distributed Representations of Expertise**", *in Proc. of the 16th SIAM International Conference on Data Mining (SDM2016)*.

[Abstract] [Paper]
Collaborative networks are common in real life, where
domain experts work together to solve tasks issued by
customers. How to model the proficiency of experts is
critical for us to understand and optimize collaborative
networks. Traditional expertise models, such as topic
model based methods, cannot capture two aspects of
human expertise simultaneously: Specialization (what
area is an expert good at?) and Proficiency Level (to
what degree?). In this paper, we propose new models
to overcome this problem. We embed all historical
task data in a lower dimension space and learn vector
representations of expertise based on both solved and
unsolved tasks.
Specifically, in our first model, we assume that each
expert will only handle tasks whose difficulty level just
matches their proficiency level, while experts in the
second model accept tasks whose levels are equal to or
lower than their proficiency level. Experiments on
real world datasets show that both models outperform
topic model based approaches and standard classifiers
such as logistic regression and support vector machine
in terms of prediction accuracy. The learnt vector
representations can be used to compare expertise in a
large organization and optimize expert allocation.

Yu Su, **Fangqiu Han**, Richard E. Harang and Xifeng Yan, "**A Fast Kernel for Attributed Graphs**", *in Proc. of the 16th SIAM International Conference on Data Mining (SDM2016)*.

[Abstract] [Paper]
As a fundamental technique for graph analysis, graph kernels
have been successfully applied to a wide range of problems.
Unfortunately, the high computational complexity of
existing graph kernels is limiting their further applications
to larger-scale graph datasets. In this paper, we propose a
fast graph kernel, the descriptor matching (DM) kernel, for
graphs with both categorical and numerical attributes. The
computation time of the DM kernel is linear with respect
to graph size. On graphs with n nodes and m edges, the
kernel computation for two graphs can be done in O(n+m)
time. Although there are other linear-time graph kernels,
most of them are restricted to graphs with only categorical
attributes; their efficiency mainly comes from the sparseness
of the feature space resulting from the mutually orthogonal
categorical attributes. Extensive experiments on both synthetic
and real-world graph datasets show promising performance
of DM in both accuracy and efficiency.
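The sparseness argument above (mutually orthogonal categorical attributes yield a sparse explicit feature map) can be illustrated with a toy label-histogram kernel. This is not the DM kernel itself; the function name and structure are illustrative only.

```python
from collections import Counter

def label_histogram_kernel(labels_g1, labels_g2):
    """Linear-time kernel for graphs with categorical node labels.

    Each graph is represented by the multiset of its node labels.
    Distinct labels map to mutually orthogonal feature dimensions,
    so the explicit feature vectors are sparse and the dot product
    costs O(n1 + n2) overall.
    """
    h1, h2 = Counter(labels_g1), Counter(labels_g2)
    # Iterate over the smaller histogram; only shared labels contribute.
    if len(h1) > len(h2):
        h1, h2 = h2, h1
    return sum(count * h2[label] for label, count in h1.items())

# Two toy graphs given by their node-label lists.
k = label_histogram_kernel(["A", "A", "B", "C"], ["A", "B", "B"])
```

Because only labels actually present in a graph occupy feature dimensions, the dot product touches at most min(n1, n2) entries, which is the source of the linear-time behavior the abstract attributes to categorical-only kernels.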

Shengqi Yang, **Fangqiu Han**, Yinghui Wu and Xifeng Yan, "**Fast Top-K Search in Knowledge Graphs**", *in Proc. of the 32nd IEEE International Conference on Data Engineering (ICDE2016)*.

[Abstract] [Paper]
Given a graph query Q posed on a knowledge graph
G, top-k graph querying is to find k matches in G with the highest
ranking score according to a ranking function. Fast top-k search
in knowledge graphs is challenging as both graph traversal and
similarity search are expensive. Conventional top-k graph search
is typically based on the threshold algorithm (TA), which can no
longer meet the demands of this new setting.
This work proposes STAR, a top-k knowledge graph search
framework. It has two components: (a) a fast top-k algorithm
for star queries, and (b) an assembling algorithm for general
graph queries. The assembling algorithm uses star query as
a building block and iteratively sweeps the star match lists
with a dynamically adjusted bound. For top-k star queries,
where an edge can be matched to a path of bounded length
d, we develop a message passing algorithm, achieving time
complexity O(d^2|E| + (d_m)^d) and space complexity linear in d|V|
(assuming the size of Q and k is bounded by a constant), where
d_m is the maximum node degree in G. STAR can further be
leveraged to answer general graph queries by decomposing a
query to multiple star queries and joining their results later.
Learning-based techniques to optimize query decomposition are
also developed. We experimentally verify that STAR is 5-10 times
faster than the state-of-the-art TA-style graph search algorithm,
and 10-100 times faster than a belief propagation approach.
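For readers unfamiliar with the TA baseline mentioned above, here is a minimal sketch of a threshold-algorithm-style top-k join over two ranked lists. STAR's actual assembling algorithm is more involved; all names here are illustrative.

```python
import heapq

def ta_topk(list_a, list_b, k):
    """Threshold-algorithm-style top-k join of two ranked lists.

    list_a, list_b: (key, score) pairs sorted by score descending.
    A key's combined score is its score in list_a plus its score in
    list_b (0 if absent). Both lists are scanned in parallel, and the
    scan stops as soon as no unseen key can still enter the top k.
    """
    score_a, score_b = dict(list_a), dict(list_b)
    best = {}  # combined scores of keys seen so far
    for (ka, sa), (kb, sb) in zip(list_a, list_b):
        for key in (ka, kb):
            if key not in best:
                best[key] = score_a.get(key, 0) + score_b.get(key, 0)
        topk = heapq.nlargest(k, best.values())
        # Any unseen key scores at most sa + sb; if the current k-th
        # best already meets that threshold, the top k is final.
        if len(topk) == k and topk[-1] >= sa + sb:
            break
    return heapq.nlargest(k, best.items(), key=lambda kv: kv[1])
```

The "dynamically adjusted bound" in the abstract plays the role of the `sa + sb` threshold here: it lets the sweep terminate long before the match lists are exhausted.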

Honglei Liu, **Fangqiu Han**, Hongjun Zhou, Xifeng Yan and Kenneth Kosik, "**Fast Motif Discovery in Short Sequences**", *in Proc. of the 32nd IEEE International Conference on Data Engineering (ICDE2016)*.

[Abstract] [Paper]
Motif discovery in sequence data is fundamental to
many biological problems such as antibody biomarker identification.
Recent advances in instrumental techniques make it
possible to generate thousands of protein sequences at once,
which raises a big data issue for the existing motif finding
algorithms: they either work only at a small scale of several
hundred sequences or have to trade accuracy for efficiency. In this
work, we demonstrate that by intelligently clustering sequences,
it is possible to significantly improve the scalability of all the
existing motif finding algorithms without any loss of accuracy.
An anchor based sequence clustering algorithm (ASC) is thus
proposed to divide a sequence dataset into multiple smaller
clusters so that sequences sharing the same motif will be located
into the same cluster. Then an existing motif finding algorithm
can be applied to each individual cluster to generate motifs. In
the end, the results from multiple clusters are merged together
as the final output. Experimental results show that our approach is
generic and orders of magnitude faster than traditional motif
finding algorithms. It can discover motifs from protein sequences
in the scale that no existing algorithm can handle. In particular,
ASC reduces the running time of a very popular motif finding
algorithm, MEME, from weeks to a few minutes with even better
accuracy.
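A toy illustration of the clustering idea (not the paper's ASC algorithm; the anchor choice here is deliberately naive): assign each sequence to a cluster keyed by one of its k-mers, so that sequences sharing an exact motif of length at least k tend to land in the same cluster, which is the property a downstream motif finder such as MEME relies on.

```python
from collections import defaultdict

def anchor_cluster(sequences, k=4):
    """Toy anchor-based clustering (illustrative only).

    Each sequence is assigned to the cluster of its lexicographically
    smallest k-mer (the 'anchor'). Sequences sharing an exact motif of
    length >= k also share that motif's k-mers, so they often pick the
    same anchor; this is a heuristic, not a guarantee, since the
    global minimum k-mer may fall outside the motif.
    """
    clusters = defaultdict(list)
    for seq in sequences:
        kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
        clusters[min(kmers)].append(seq)
    return dict(clusters)
```

Each cluster can then be handed to an existing motif finder independently, which is where the scalability gain comes from.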

**Fangqiu Han**, Subhash Suri and Xifeng Yan, "**Observability of Lattice Graphs**", *Algorithmica, 2015*.

[Abstract] [Paper]
We consider a graph observability problem: how many edge colors are needed for an unlabeled
graph so that an agent, walking from node to node, can uniquely determine its location from just
the observed color sequence of the walk? Specifically, let G(n, d) be an edge-colored subgraph of
d-dimensional (directed or undirected) lattice of size n^d. We say that G(n, d)
is t-observable if an agent can uniquely determine its current position in the graph from the color
sequence of any t-dimensional walk, where the dimension is the number of different directions
spanned by the edges of the walk. A walk in an undirected lattice G(n, d) has dimension between 1
and d, but a directed walk can have dimension between 1 and 2d because of two different orientations
for each axis.
We derive bounds on the number of colors needed for t-observability. Our main result is that
Theta(n^(d/t)) colors are both necessary and sufficient for t-observability of G(n, d), where d is considered
a constant. This shows an interesting dependence of graph observability on the ratio between the
dimension of the lattice and that of the walk. In particular, the number of colors for full-dimensional
walks is Theta(n^(1/2)) in the directed case, and Theta(n) in the undirected case, independent of the lattice
dimension. All of our results extend easily to non-square lattices: given a lattice graph of size
N = Pi_{i=1}^d n_i, the number of colors for t-observability is Theta(N^(1/t)).
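The observability condition can be checked by brute force on toy graphs: a coloring fails exactly when two walks produce the same color sequence but end at different nodes. Below is a minimal sketch for fixed-length walks, a simplification of the paper's t-dimensional walks; the function names are illustrative.

```python
def is_observable(edges, walk_len):
    """Brute-force observability check for a small edge-colored digraph.

    edges: dict mapping node -> list of (next_node, color) pairs.
    Returns True if every color sequence produced by a walk of length
    walk_len determines the walk's end node uniquely.
    """
    def walks(node, length):
        # Yield (color_sequence, end_node) for all walks from `node`.
        if length == 0:
            yield ((), node)
            return
        for nxt, color in edges.get(node, []):
            for colors, end in walks(nxt, length - 1):
                yield ((color,) + colors, end)

    seq_to_end = {}
    for start in edges:
        for colors, end in walks(start, walk_len):
            # Two walks with the same colors but different ends break
            # observability.
            if seq_to_end.setdefault(colors, end) != end:
                return False
    return True

# A directed 4-cycle: three colors suffice for length-2 walks here,
# while two alternating colors do not.
cycle = lambda colors: {i: [((i + 1) % 4, colors[i])] for i in range(4)}
```

This exhaustive check is exponential in the walk length, so it only serves to build intuition for the Theta bounds above on very small lattices.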

Yang Li, Chi Wang, **Fangqiu Han**, Jiawei Han, Dan Roth and Xifeng Yan, "**Mining Evidences for Named Entity Disambiguation**", *Proc. of the 19th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD'13)*, Chicago, IL, USA, August 2013.

[Abstract] [Paper]
Named entity disambiguation is the task of disambiguating
named entity mentions in natural language text and linking
them to their corresponding entries in a knowledge base
such as Wikipedia. Such disambiguation can help enhance
readability and add semantics to plain text. It is also a
central step in constructing a high-quality information network
or knowledge graph from unstructured text. Previous
research has tackled this problem by making use of various
textual and structural features from a knowledge base.
Most of the proposed algorithms assume that a knowledge
base can provide enough explicit and useful information to
help disambiguate a mention to the right entity. However,
the existing knowledge bases are rarely complete (and likely
never will be), leading to poor performance on short queries
whose contexts are not well known. In such cases, we need to collect
additional evidence scattered across internal and external
corpora to augment the knowledge bases and enhance their
disambiguation power. In this work, we propose a generative
model and an incremental algorithm to automatically
mine useful evidence across documents. With a specific
modeling of "background topic" and "unknown entities", our
model is able to harvest useful evidences out of noisy information.
Experimental results show that our proposed
method outperforms the state-of-the-art approaches significantly:
boosting the disambiguation accuracy from 43%
(baseline) to 86% on short queries derived from tweets.

Yang Li, Pegah Kamousi, **Fangqiu Han**, Shengqi Yang, Xifeng Yan and Subhash Suri, "**Memory Efficient Minimum Substring Partitioning**", *Proc. of the 39th Int. Conf. on Very Large Databases (VLDB'13)*, Riva del Garda, Trento, Italy, August 2013.

[Abstract] [Paper] [Press]
Massively parallel DNA sequencing technologies are revolutionizing
genomics research. Billions of short reads generated
at low costs can be assembled for reconstructing the
whole genomes. Unfortunately, the large memory footprint
of the existing de novo assembly algorithms makes it challenging
to get the assembly done for higher eukaryotes like
mammals. In this work, we investigate the memory issue of
constructing the de Bruijn graph, a core task in leading assembly
algorithms, which often consumes several hundred gigabytes of
memory for large genomes. We propose a disk-based
partition method, called Minimum Substring Partitioning
(MSP), to complete the task using less than 10 gigabytes
memory, without runtime slowdown. MSP breaks the short
reads into multiple small disjoint partitions so that each
partition can be loaded into memory, processed individually
and later merged with others to form a de Bruijn graph.
By leveraging the overlaps among the k-mers (substrings of
length k), MSP achieves an astonishing compression ratio: the
total size of partitions is reduced from O(kn) to O(n), where
n is the size of the short read database, and k is the length
of a k-mer. Experimental results show that our method can
build de Bruijn graphs using a commodity computer for any
large-volume sequence dataset.
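The minimum-substring idea can be sketched in a few lines: each k-mer is routed to the partition named by its lexicographically smallest p-substring, and consecutive k-mers sharing that minimum are emitted as one merged superstring. A minimal sketch (function names illustrative, not the paper's implementation):

```python
def min_substring(kmer, p):
    """Lexicographically smallest p-substring of a k-mer (partition key)."""
    return min(kmer[i:i + p] for i in range(len(kmer) - p + 1))

def msp_partition(read, k, p):
    """Sketch of Minimum Substring Partitioning for one short read.

    Consecutive k-mers sharing the same minimum p-substring are
    emitted as one merged superstring instead of separate k-mers,
    which is where the O(kn) -> O(n) compression comes from.
    """
    partitions = {}
    start = 0
    cur = min_substring(read[0:k], p)
    for i in range(1, len(read) - k + 1):
        key = min_substring(read[i:i + k], p)
        if key != cur:
            # Close the current run: one superstring covers the
            # k-mers starting at positions start..i-1.
            partitions.setdefault(cur, []).append(read[start:i - 1 + k])
            start, cur = i, key
    partitions.setdefault(cur, []).append(read[start:])
    return partitions
```

On the toy read "ACGTACGT" with k=4 and p=2, five 4-mers totalling 20 characters compress to three superstrings totalling 14 characters; on real read sets, long runs of k-mers share one minimum substring, giving the compression described above.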

**Fangqiu Han**, Zhiyi Tan, Yang Yang, "**On the Optimality of List Scheduling for Online Uniform Machines Scheduling**", *Optimization Letters* 6.7 (2012): 1551-1571.

[Abstract] [Paper]
This paper considers classical online scheduling problems on uniform
machines. We give the tight competitive ratio of List Scheduling (LS) for any
combination of speeds of three machines. We prove that LS is optimal when
s3 >= s2 >= s1 = 1 and s3^2 >= s2^2 + s2s3 + s2, where s1, s2, s3 are the
speeds of the three machines. On the other hand, LS cannot be optimal for all
combinations of machine speeds, even when restricted to the case
1 = s1 = s2 < s3. For m > 3 machines, LS remains optimal when one machine
has a very large speed and the remaining machines have the same speed.
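The LS rule analyzed here is the classical greedy for uniform machines: each arriving job goes to the machine on which it would complete earliest, given the machine speeds. A minimal sketch of that rule:

```python
def list_scheduling(jobs, speeds):
    """Online List Scheduling (LS) on uniform machines.

    Each arriving job (a processing requirement) is assigned to the
    machine on which it would finish earliest, i.e. the machine
    minimizing (current_load + job) / speed. Returns the per-machine
    loads (in work units) and the resulting makespan.
    """
    loads = [0.0] * len(speeds)
    for job in jobs:
        # Completion time of this job on machine m if assigned there.
        m = min(range(len(speeds)),
                key=lambda i: (loads[i] + job) / speeds[i])
        loads[m] += job
    return loads, max(load / s for load, s in zip(loads, speeds))
```

For example, with speeds (1, 1, 2) and three jobs of size 2, LS spreads the jobs across all three machines for a makespan of 2; the paper's results characterize exactly when this greedy is the best possible online algorithm.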