1) Concept Extraction. In this step we extract all the concepts in the contexts of name observations and represent them as the nodes in the semantic-graph. We first gather all the N-grams (up to 8 words) and identify whether they correspond to semantically meaningful concepts: if After concept identification, we filter out all the N-grams which do not correspond to the The retained N-grams are identified as concepts, corresponding with their semantic meanings (a concept may have multiple semantic meaning
2) Concept Connection. In this step we represent the semantic relations as the edges between nodes. That is, for each pair of extracted concepts, we identify whether there are semantic relations between them: 1) If there is only one semantic relation between them, we connect these two concepts with an edge, where the edge weight is the strength of the semantic relation; 2) If there is more than one semantic For example, if both Wikipedia and WordNet provide
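The two construction steps above can be sketched roughly as follows. This is a minimal illustration, not the exact procedure of this paper: the predicate `is_concept` and the callback `relation_strength` are hypothetical stand-ins for lookups in the knowledge sources, and only the single-relation case is covered (how several relations between the same pair are merged is not shown here).

```python
from itertools import combinations

MAX_N = 8  # the text gathers N-grams of up to 8 words


def gather_ngrams(tokens, max_n=MAX_N):
    """Return every contiguous N-gram (as a string) with 1 <= N <= max_n."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


def extract_concepts(tokens, is_concept):
    """Keep only the N-grams accepted by the hypothetical predicate."""
    concepts = []
    for ngram in gather_ngrams(tokens):
        if is_concept(ngram) and ngram not in concepts:
            concepts.append(ngram)
    return concepts


def build_semantic_graph(concepts, relation_strength):
    """Connect each related concept pair with a weighted, undirected edge.

    relation_strength(a, b) is a hypothetical callback that returns a
    float, or None when no semantic relation is known for the pair.
    """
    graph = {c: {} for c in concepts}
    for a, b in combinations(concepts, 2):
        weight = relation_strength(a, b)
        if weight is not None:
            graph[a][b] = weight
            graph[b][a] = weight
    return graph


# Toy usage with made-up concepts and a made-up relation strength.
known = {"michael jordan", "basketball"}
strengths = {frozenset(("michael jordan", "basketball")): 0.8}
tokens = "michael jordan plays basketball".split()
concepts = extract_concepts(tokens, lambda g: g in known)
graph = build_semantic_graph(
    concepts, lambda a, b: strengths.get(frozenset((a, b)))
)
```

The graph is stored as a dictionary of neighbor dictionaries so that edge weights stay attached to each edge, which matters later because the relatedness measure must take edge weight into account.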
In this section, we describe how to capture the semantic relations between the concepts in
1) The edges of The edges model the direct semantic relations between concepts. We call this form of semantic knowledge
2) The structure of Except for the edges, the structure of the semantic-graph also models the semantic knowledge of concepts. For example, the neighbors of a concept represent all the concepts which are explicitly We call this form of semantic knowledge
Quantifying the relatedness between nodes in a graph is not a new problem. However, these similarity measures are not suitable for our task, because all of them assume that edges are uniform and therefore cannot take edge weights into account.
In this section we describe how to leverage the semantic knowledge captured in the structural Because the key problem of named entity disambiguation is
In the following we describe each step in detail.
Intuitively, if two observations of the target name represent the same entity, the concepts in their contexts are very likely to be closely related, i.e., the named entities in their contexts are socially related and the Wikipedia concepts in their contexts are semantically related. In contrast, if two name observations represent different entities, the concepts in their contexts will not be closely related. Therefore
Given the computed similarities, name observations are disambiguated by grouping them according to their represented entities. In this paper, we group name observations using the hierarchical agglomerative clustering (HAC) algorithm, which The HAC The merging threshold can be determined through cross-validation. We employ the single-link method to compute the similarity between two clusters,
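The clustering step can be illustrated with a small sketch of single-link HAC over pairwise similarities. The function name `single_link_hac`, the callback `similarity`, and the stopping rule (stop as soon as the best cluster similarity falls below the threshold) are assumptions made for this example; the threshold value itself would come from cross-validation as described above.

```python
def single_link_hac(n_items, similarity, threshold):
    """Merge clusters while the best single-link similarity >= threshold.

    similarity(i, j) is a hypothetical callback giving the similarity
    between name observations i and j (indices in range(n_items)).
    """
    clusters = [[i] for i in range(n_items)]

    def link(c1, c2):
        # Single link: similarity of the most similar cross-cluster pair.
        return max(similarity(i, j) for i in c1 for j in c2)

    while len(clusters) > 1:
        best, pair = None, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = link(clusters[a], clusters[b])
                if best is None or s > best:
                    best, pair = s, (a, b)
        if best < threshold:
            break  # no remaining pair is similar enough to merge
        a, b = pair
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Toy usage: observations 0 and 1 are similar, as are 2 and 3.
toy = {(0, 1): 0.9, (2, 3): 0.8}


def toy_similarity(i, j):
    return toy.get((min(i, j), max(i, j)), 0.1)


groups = single_link_hac(4, toy_similarity, threshold=0.5)
```

This naive version recomputes every cluster pair on each pass, which is adequate for illustration; each resulting cluster is read as one entity.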
To assess the performance of our method and compare it with traditional methods, we conduct a series of experiments. In the experiments, we evaluate the proposed SSR method on the task of personal name disambiguation, which is the most common type of named entity disambiguation. In the following, we first explain the general experimental settings in
We adopted the standard data sets used in the First Web People Search Clustering Task (WePS1) (Artiles et al., 2007) and the Second Web People Search Clustering Task (WePS2) (Artiles et al., 2009). The three data sets we used are WePS1_training data set, WePS1_test data set, and WePS2_test data set. Each of the three data sets consists of a set of ambiguous personal names (