On-the-fly generation of integrated representations of Linked Data (LD)
search results is challenging because it requires successfully automating a number
of complex subtasks, such as structure inference and matching of
both instances and concepts, each of which gives rise to uncertain
outcomes. Such uncertainty is unavoidable given the semantically heterogeneous
nature of web sources, including LD ones. This paper approaches the problem of
structuring LD search results as an evidence-based one. In particular, the paper shows
how one formalism (viz., probabilistic soft logic (PSL)) can be exploited to assimilate
different sources of evidence in a principled way and to beneficial
effect for users. The paper considers syntactic evidence derived from matching
algorithms, semantic evidence derived from LD vocabularies, and user
evidence, in the form of feedback. The main contributions are: sets
of PSL rules that model the uniform assimilation of diverse kinds of evidence,
an empirical evaluation of how the resulting PSL programs perform in terms
of their ability to infer structure in LD search results, and, finally, a concrete
example of how populating such inferred structures for presentation
to the end user is beneficial, besides enabling the collection of
feedback whose assimilation further improves search result presentation.
|
The paper determines the algebraic and logical structure produced by the multiset semantics of the core patterns of SPARQL. We prove that the fragment formed by AND, UNION, OPTIONAL, FILTER, MINUS and SELECT corresponds precisely both to the intuitive multiset relational algebra (projection, selection, natural join, arithmetic union and except) and to multiset non-recursive Datalog with safe negation.
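To make the multiset reading of these operators concrete, here is a minimal sketch (our own illustration, not the paper's formalism) that uses Python Counters as a stand-in for bags of solution mappings and shows how multiplicities behave under arithmetic union and except.

```python
from collections import Counter

# Multisets of solution mappings, encoded as Counters over tuples (toy data).
r = Counter({("alice", "bob"): 2, ("carol", "dan"): 1})
s = Counter({("alice", "bob"): 1, ("eve", "frank"): 3})

# Arithmetic (bag) union: multiplicities add up, as for UNION under multiset semantics.
bag_union = r + s
# Multiset except: multiplicities are subtracted and floored at zero.
bag_except = r - s

print(bag_union)   # ('alice','bob'): 3, ('carol','dan'): 1, ('eve','frank'): 3
print(bag_except)  # ('alice','bob'): 1, ('carol','dan'): 1
```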
|
In this paper we propose an OBDA approach for accessing geospatial data stored in geospatial relational databases, using the OGC standard GeoSPARQL and R2RML or OBDA mappings. We introduce extensions to existing SPARQL-to-SQL approaches to support GeoSPARQL features. We describe the implementation of our approach in the system ontop-spatial, an extension of the OBDA system Ontop for creating virtual geospatial RDF graphs on top of geospatial relational databases. Last, we present an experimental evaluation of our system using the workload and queries of a recent benchmark. To measure the performance of our system, we compare it to a state-of-the-art geospatial RDF store and confirm its efficiency.
|
In several subject domains, classes themselves may be subject to categorization, resulting in classes of classes (or “metaclasses”). When representing these domains, one needs to capture not only entities of different classification levels, but also their (intricate) relations. We observe that this is challenging in current Semantic Web languages as there is little support to guide the modeler in producing correct multi-level ontologies, especially because of the nuances in the constraints that apply to entities of different classification levels and their relations. In order to address these representation challenges, we propose a vocabulary that can be used as a basis for multi-level ontologies in OWL along with a number of integrity constraints to prevent the construction of inconsistent models. In this process we employ an axiomatic theory called MLT (a Multi-Level Modeling Theory).
|
Conjunctive query answering over expressive Horn Description Logic ontologies is a relevant and challenging problem which, in some cases, can be addressed by application of the chase algorithm.
In this paper, we define a novel acyclicity notion which provides a sufficient condition for termination of the restricted chase over Horn-SRIQ ontologies.
We show that our notion generalizes most existing acyclicity conditions (both theoretically and empirically) and that its use results in a more efficient reasoning procedure.
Furthermore, we implement a materialization-based reasoner for acyclic ontologies which vastly outperforms state-of-the-art reasoners.
|
Query containment is one of the building blocks of query optimization techniques. In the relational world, query containment is a well-studied problem. At the same time, it is well understood that relational queries are not enough to cope with graph-structured data, where one is interested in expressing queries that capture navigation in the graph. This paper contributes a study of the problem of query containment for an expressive class of navigational queries called Extended Property Paths (EPPs). EPPs are more expressive than previous navigational extensions of SPARQL, such as property paths and nested regular expressions, for which containment has already been studied. We attack the problem of containment for EPPs (and SPARQL with EPPs) and provide complexity bounds.
|
Finding associations between entities is a common information need in many areas. It has been facilitated by the increasing amount of graph-structured data on the Web describing relations between entities. In this paper, we define an association connecting multiple entities in a graph as a minimal connected subgraph containing all of them. We propose an efficient graph search algorithm for finding associations, which prunes the search space by exploiting distances between entities computed based on a distance oracle. Having found a possibly large group of associations, we propose to mine frequent association patterns as a conceptual abstract summarizing notable subgroups to be explored, and present an efficient mining algorithm based on canonical codes and partitions. Extensive experiments on large, real RDF datasets demonstrate the efficiency of the proposed algorithms.
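As a rough illustration of the distance-based pruning idea (a toy sketch using networkx with illustrative data and bound, not the paper's algorithm): precomputed shortest-path distances play the role of the oracle, and any node that lies too far from some query entity cannot appear in a small association.

```python
import networkx as nx

G = nx.Graph()
G.add_edges_from([("a", "x"), ("x", "b"), ("x", "y"), ("y", "c"), ("a", "z"), ("z", "w")])

query = ["a", "b", "c"]   # entities the association must connect
k = 3                      # illustrative bound on how far an association may reach

# "Oracle": shortest-path distances from every query entity (precomputed offline in practice).
dist = {q: nx.single_source_shortest_path_length(G, q) for q in query}

# Pruning: a node can only lie on a small association if it is close to every query entity.
candidates = {v for v in G if all(v in dist[q] and dist[q][v] <= k for q in query)}
H = G.subgraph(candidates)

print(sorted(candidates))   # ['a', 'b', 'c', 'x', 'y'] -- 'z' and 'w' are pruned
print(nx.is_connected(H))   # the search for minimal connected subgraphs continues on H only
```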
|
Annotations are useful to semantically enrich documents and other datasets with concepts of standardized vocabularies and ontologies. In the medical domain, many documents are not annotated at all and manual annotation is a difficult and time-consuming process. Therefore, automatic annotation methods become necessary to support human annotators with recommendations. We propose a reuse-based annotation approach that clusters items in medical documents according to verified ontology-based annotations. We identify a set of representative features for annotation clusters and propose a context-based selection strategy that considers the semantic relatedness and frequent co-occurrences of annotated concepts. We evaluate our methods and the annotation tool MetaMap based on reference mappings between medical forms and the Unified Medical Language System.
|
In recent years RDF and OWL have become the most common knowledge representation languages in use on the Web, propelled by the recommendation of the W3C. In this paper we examine an alternative way to represent knowledge, based on Prototypes. This Prototype-based representation has different properties, which we argue make it
more suitable for data sharing and reuse on the Web. Prototypes avoid the distinction between classes and instances and provide means for object-based data sharing and reuse.
In this paper we discuss the requirements and design principles for Knowledge Representation based on Prototypes on the Web, after which we propose a formal syntax and semantics. We show how to embed knowledge representation based on Prototypes in the current Semantic Web standard stack. An implementation and practical evaluation of the system is presented in a separate resource paper.
|
User validation is one of the challenges facing the ontology alignment community, as there are limits to the quality of automated alignment algorithms.
In this paper we present a broad study on user validation of ontology alignments that encompasses three distinct but interrelated aspects: the profile of the user, the services of the alignment system, and its user interface. We discuss key issues pertaining to the alignment validation process under each of these aspects, and provide an overview of how current systems address them. Finally, we use experiments from the Interactive Matching track of the Ontology Alignment Evaluation Initiative (OAEI) 2015 to assess the impact of errors in alignment validation, and how systems cope with them as a function of their services.
|
Despite developments of Semantic Web-enabling technologies,
the gap between non-expert end-users and the Semantic Web still
exists. In the field of semantic content authoring, tools for interacting
with semantic content remain directed at highly trained individuals. This
adds to the challenges of bringing user-generated content into the Semantic
Web.
In this paper, we present Seed, short for Semantic Editor, an extensible
knowledge-supported natural language text composition tool that
targets non-expert end-users, enabling automatic as well as semi-automatic
creation of standards-based, semantically annotated textual
content. We describe the structure of Seed, compare it with related
work, and explain how it utilizes Linked Open Data and state-of-the-art
Natural Language Processing to realize user-friendly generation of textual
content for the Semantic Web. We also present experimental evaluation
results involving a diverse group of more than 120 participants,
which showed that Seed helped end-users easily create and interact with
semantic content with nearly no prerequisite knowledge.
|
Advances in information extraction have enabled the automatic construction of large knowledge graphs (KGs) like DBpedia, Freebase, Yago and Wikidata. These KGs are inevitably bound to be incomplete. To fill in the gaps, data correlations in the KG can be analyzed to infer Horn rules and to predict new facts. However, Horn rules do not take into account possible exceptions, so that predicting facts via such rules introduces errors.
To overcome this problem, we present a method for effective revision of learned Horn rules that incorporates exceptions (i.e., negated atoms) into their bodies, thereby largely reducing prediction errors. We apply our method to discover rules with exceptions from real-world KGs. Our experimental results demonstrate the effectiveness of the developed method and the improvements in accuracy for KG completion by rule-based fact prediction.
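As an illustrative example (ours, not necessarily one learned by the method), a Horn rule and its exception-aware revision might look as follows:

```latex
\begin{align*}
\textit{Horn rule:} \quad & \mathit{livesIn}(X,Z) \leftarrow \mathit{marriedTo}(X,Y) \wedge \mathit{livesIn}(Y,Z)\\
\textit{revised rule:} \quad & \mathit{livesIn}(X,Z) \leftarrow \mathit{marriedTo}(X,Y) \wedge \mathit{livesIn}(Y,Z) \wedge \neg\,\mathit{researcher}(X)
\end{align*}
```

The negated atom blocks predictions for exactly the subset of instances on which the plain rule tends to err.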
|
Data stream applications are becoming increasingly popular on the web. In these applications, one query pattern is especially prominent: a join between a continuous data stream and some background data (BGD). Oftentimes, the target BGD is large, maintained externally, changing slowly, and costly to query (both in terms of time and money). Hence, practical applications usually maintain a local (cached) view of the relevant BGD. Given that these caches are not updated as part of the transaction modifying the original BGD, they should be maintained under realistic budget constraints (in terms of latency, computation time, and possibly financial cost) to avoid stale data leading to wrong answers.
This paper proposes to model the join between streams and the BGD as a bipartite graph. By exploiting the graph structure, we keep the quality of results high without refreshing the entire cache for each evaluation. We also introduce two extensions to this method: first, we consider both the sliding window (specifying the currently relevant section of the data stream) and the change rate of the BGD to focus on updates that have the longest effect. Second, by considering the future impact of a query to the BGD, we propose to sometimes delay updates in order to provide fresher answers in the future.
Using an implemented system we empirically show that we can improve result freshness by 93% over baseline algorithms such as Random Selection or Least Recently Updated.
|
Explicit Query Interpretation and Diversification for Context-driven Concept Search across Ontologies
Finding relevant concepts from a corpus of ontologies is useful in many scenarios, including document classification, web page annotation, and automatic ontology population. Millions of concepts are contained in a large number of ontologies across diverse domains. SPARQL-based querying demands knowledge of the structure of ontologies and of the query language, whereas more user-friendly, simple keyword-based approaches suffer from false positives, as concept descriptions in ontologies may be ambiguous and overlapping. In this paper, we propose a keyword-based concept search framework that (1) exploits the structure and semantics in ontologies by constructing contexts for each concept; (2) generates the interpretations of a query; and (3) balances relevance and diversity of search results. A comprehensive evaluation against both the domain-specific BioPortal and the general-purpose Falcons on widely-used performance metrics demonstrates that our system outperforms both.
|
In this paper we study instance-level update in DL-LiteA, the description logic underlying the OWL 2 QL standard. In particular we focus on formula based approaches to ABox insertion and deletion. We show that DL-LiteA, which is well known for enjoying first-order rewritability of query answering, enjoys a first-order rewritability property also for updates. That is, every update can be reformulated into a set of insertion and deletion instructions computable through a non-recursive DATALOG program. Such a program is readily translatable into a first-order query over the ABox considered as a database, and hence into SQL. Exploiting this result we implement an update component for DL-LiteA-based systems and perform some experiments showing that the approach works in practice.
|
The unprecedented growth in mobile devices, combined with advances in Semantic Web Technologies, has given birth to opportunities for more intelligent systems on-the-go. The limited resources of mobile devices, especially energy, demand approaches that make mobile reasoning more applicable. While Mobile-Cloud integration is a promising method for harnessing the power of semantic technologies in the mobile infrastructure, it remains an open question when to reason with ontologies on mobile devices. In this paper, we introduce an energy consumption prediction mechanism for ontology reasoning on mobile devices, which allows analysis of the feasibility of ontology reasoning on mobile devices in terms of energy consumption. The prediction model contributes to mobile-cloud integration and helps improve further development of ontology and semantic solutions in general.
|
The emergence of Linked Data on the WWW has spawned research interest in an online execution of declarative queries over this data. A particularly interesting approach is traversal-based query execution which fetches data by traversing data links and, thus, is able to make use of up-to-date data from initially unknown data sources. While the downside of this approach is the delay before the query engine completes a query execution, user-perceived response time may be improved significantly by returning elements of the result set as early as possible. To this end, the query engine requires a traversal strategy that enables the engine to fetch result-relevant data as early as possible. The challenge for such a strategy is that the query engine does not know a priori what data sources will be discovered during the query execution and which of them contain result-relevant data. In this paper, we investigate 14 different approaches to rank traversal steps and achieve a variety of traversal strategies. We experimentally study their impact on response times and compare them to a baseline that resembles a breadth-first traversal. While our experiments show that some of the approaches can achieve noteworthy improvements over the baseline in a significant number of cases, we also observe that for every approach, there is a non-negligible chance to achieve response times that are worse than the baseline.
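Schematically, a traversal strategy can be seen as a priority function over pending URI lookups. The sketch below (our own simplification, with an illustrative scoring function rather than one of the paper's 14 approaches) contrasts a ranked traversal with the breadth-first baseline, which corresponds to a constant score.

```python
import heapq

def traverse(seeds, fetch, score, max_fetches=100):
    """Skeleton of traversal-based query execution: URIs are dereferenced in the
    order given by `score`; a constant score degenerates to breadth-first order."""
    queue = [(-score(u), i, u) for i, u in enumerate(seeds)]
    heapq.heapify(queue)
    seen, tie, fetched = set(seeds), len(seeds), []
    while queue and len(fetched) < max_fetches:
        _, _, uri = heapq.heappop(queue)
        fetched.append(uri)
        for link in fetch(uri):          # dereference the URI and collect outgoing data links
            if link not in seen:
                seen.add(link)
                tie += 1
                heapq.heappush(queue, (-score(link), tie, link))
    return fetched

# Toy in-memory "Web of Linked Data" and an illustrative ranking function.
web = {"s1": ["d1", "d2"], "d1": ["d3"], "d2": [], "d3": []}
order = traverse(["s1"], lambda u: web.get(u, []),
                 lambda u: 1 if "3" in u else 0)
print(order)   # d3 is fetched before d2 because the ranking prioritises it
```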
|
Statistical data in the form of RDF Data Cubes is becoming increasingly valuable as it influences decisions in areas such as health care, policy and finance. While a growing amount is becoming freely available through the open data movement, this data is opaque to laypersons. Semantic Question Answering (SQA) technologies provide access via free-form natural language queries, but general SQA systems cannot process RDF Data Cubes. On the intersection between RDF Data Cubes and SQA, we create a new subfield of SQA, called RDCQA. We create an RDCQA benchmark as task 3 of the QALD-6 evaluation challenge, to stimulate further research and enable quantitative comparison between RDCQA systems. We design and evaluate the CubeQA algorithm, which
is the first RDCQA system and achieves a global F1 score of 0.43 on the QALD6T3-test dataset, showing that RDCQA is feasible.
|
During the past couple of years, more and more data has been published as native RDF datasets. In this setup, both the size of the datasets and the need to process aggregate queries represent challenges for standard SPARQL query processing techniques. To overcome these limitations, materialized views can be created and used as a source of precomputed partial results during query processing. However, materialized view techniques, as proposed in relational databases, do not support RDF specifics, such as incompleteness and the need to support implicit (derived) information. To overcome these challenges, this paper proposes MARVEL, an approach consisting of a view selection algorithm based on an RDF-specific cost model, a view definition syntax, and an algorithm for rewriting SPARQL queries using materialized RDF views. The experimental evaluation shows that the approach can improve query response time by more than an order of magnitude and is able to efficiently handle RDF specifics.
|
Alignments between ontologies usually come with numerical attributes expressing the confidence of each correspondence. Semantics supporting such confidences must generalise the semantics of alignments without confidence. There exists a semantics which satisfies this but introduces a discontinuity between weighted and non-weighted interpretations. Moreover, it does not provide a calculus for reasoning with weighted ontology alignments. This paper introduces a calculus for such alignments. It is given by an infinite relation-type algebra, the elements of which are weighted taxonomic relations. In addition, it approximates the non-weighted case in a continuous manner.
|
Large-scale knowledge graphs (KGs) abound in industry and academia.
They provide a unified format for integrating information sources,
aided by standards such as the W3C RDB to RDF Mapping Language.
Meaningful semantic integration, however, is much harder than
syntactic alignment. Ontologies could be an interoperable and
declarative solution to this task. At a closer look, however, we find
that popular ontology languages, such as OWL and Datalog, cannot
express even the most basic relationships on the normalised data
format of KGs. Existential rules are more powerful, but may make
reasoning undecidable, and normalising them to suit KGs can destroy
syntactic restrictions that ensure decidability and low complexity. We
study this issue for several classes of existential rules and derive more
general syntactic criteria to recognise well-behaved rule-based ontologies
over knowledge graphs.
|
Resolving semantic heterogeneity in the Semantic Web requires finding correspondences between ontologies describing resources. In particular, with the explosive growth of datasets in the Linked Open Data cloud, linking multiple vocabularies and ontologies simultaneously, known as the holistic matching problem, becomes necessary. Currently, most state-of-the-art matching approaches are limited to pairwise matching. In this paper, we propose an approach for holistic ontology matching that is modeled as a linear program extending the maximum-weighted graph matching problem with linear constraints (cardinality, structural, and coherence constraints). Our approach guarantees the optimal solution with mostly coherent alignments. To evaluate our proposal, we discuss the results of experiments performed on the Conference track of the OAEI 2015, under both holistic and pairwise matching settings.
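The core of such a formulation, maximum-weight matching under cardinality constraints expressed as a linear program, can be sketched as follows (a minimal illustration using scipy, omitting the structural and coherence constraints the paper adds):

```python
import numpy as np
from scipy.optimize import linprog

# Similarity weights between 3 concepts of ontology O1 (rows) and 3 of O2 (columns).
w = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.8, 0.3],
              [0.0, 0.4, 0.7]])
n, m = w.shape

# Variables x_ij in [0, 1]; maximise sum w_ij * x_ij == minimise the negated, flattened weights.
c = -w.ravel()

# Cardinality constraints: each concept is matched at most once (rows and columns).
A = []
for i in range(n):                      # sum_j x_ij <= 1
    row = np.zeros(n * m); row[i * m:(i + 1) * m] = 1; A.append(row)
for j in range(m):                      # sum_i x_ij <= 1
    col = np.zeros(n * m); col[j::m] = 1; A.append(col)
b = np.ones(n + m)

res = linprog(c, A_ub=np.array(A), b_ub=b, bounds=(0, 1), method="highs")
print(res.x.reshape(n, m).round(2))     # an (integral) optimal matching: here the diagonal
```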
|
While massive volumes of text are now more easily available for knowledge harvesting, many important facts about our everyday world are not expressed in a particularly explicit way. To address this, we present WebBrain, a new approach for harvesting commonsense knowledge that relies on joint learning from Web-scale data to fill gaps in the knowledge acquisition. We train a neural network model that not only learns word2vec-style vector representations of words but also commonsense knowledge about them. This joint model allows general semantic information to aid in generalizing beyond the extracted commonsense relationships. Experiments show that we can obtain word embeddings that reflect word meanings, yet also allow us to capture conceptual relationships and commonsense knowledge about them.
|
Semantics spread in large-scale knowledge bases can be used to intermediate heterogeneous users’ activity logs distributed across services; this can improve applications that assist users in deciding their next activities across services. Since user activities can be represented in terms of relationships involving three or more things (e.g. a user tags movie items on a webpage), they can be represented as a tensor. The recent semantic sensitive tensor factorization (SSTF) is promising, since it achieves high accuracy in predicting users’ activities by applying semantics behind objects (e.g. item categories) to tensor factorization. However, SSTF focuses on the factorization of data logs from a single service and thus has two problems: (1) the balance problem caused when simultaneously handling heterogeneous datasets, and (2) the sparsity problem caused when there are insufficient data logs within a single service. Our solution, Semantic Sensitive Simultaneous Tensor Factorization (S3TF), tackles the above problems as follows: (1) It creates tensors for individual services and factorizes those tensors simultaneously; it does not force the creation of a single tensor from multiple services and the factorization of that tensor. This avoids low prediction accuracy caused by the balance problem. (2) It utilizes shared semantics behind distributed logs and gives semantic biases to each tensor factorization. This avoids the sparsity problem by using the shared semantics among services. Experiments using real-world datasets show that S3TF achieves up to 13% higher accuracy in rating predictions than the current best tensor method. It also extracts implicit relationships across services in the feature spaces by simultaneous factorization.
|
With the success of Open Data a huge amount of tabular data sources
became available that could potentially be mapped and linked into the Web of
(Linked) Data. Most existing approaches to “semantically label” such tabular
data rely on mappings of textual information to classes, properties, or instances
in RDF knowledge bases in order to link – and eventually transform – tabular
data into RDF. However, as we will illustrate, Open Data tables typically contain
a large portion of numerical columns and/or non-textual headers; therefore
solutions that solely focus on textual “cues” are only partially applicable for mapping
such data sources. We propose an approach to find and rank candidates of
semantic labels and context descriptions for a given bag of numerical values. To
this end, we apply a hierarchical clustering over information taken from DBpedia
to build a background knowledge graph of possible “semantic contexts” for
bags of numerical values, over which we perform a nearest neighbour search to
rank the most likely candidates. Our evaluation shows that our approach can assign
fine-grained semantic labels, when there is enough supporting evidence in
the background knowledge graph. In other cases, our approach can nevertheless
assign high level contexts to the data, which could potentially be used in combination
with other approaches to narrow down the search space of possible labels.
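To convey the flavour of the nearest-neighbour step, the sketch below (toy data and property names, not the actual background knowledge graph) reduces each bag of numerical values to a few summary statistics and ranks candidate labels by proximity to labelled bags:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def features(bag):
    """Describe a bag of numerical values by simple summary statistics."""
    a = np.asarray(bag, dtype=float)
    return [a.mean(), a.std(), a.min(), a.max(), np.median(a)]

# Labelled bags harvested from a background knowledge graph (toy examples).
contexts = {
    "dbo:populationTotal": [1200, 45000, 390000, 8000000, 230000],
    "dbo:elevation":       [12, 340, 1250, 2600, 480],
    "dbo:areaCode":        [30, 40, 89, 221, 711],
}
labels = list(contexts)
index = NearestNeighbors(n_neighbors=2).fit([features(b) for b in contexts.values()])

# An unlabelled numerical column from an Open Data table.
column = [500000, 2000000, 7500000, 3200000]
_, nn = index.kneighbors([features(column)])
print([labels[i] for i in nn[0]])   # ranked candidates; here 'dbo:populationTotal' comes first
```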
|
Semantic labeling is the process of mapping attributes in data sources to classes in an ontology and is a necessary step in heterogeneous data integration. Variations in data formats, attribute names and even ranges of values of data make this a very challenging task. In this paper, we present a novel domain-independent approach to automatic semantic labeling that uses machine learning techniques. Previous approaches use machine learning to learn a model that extracts features related to the data of a domain, which requires the model to be re-trained for every new domain. Our solution uses similarity metrics as features to compare against labeled domain data and learns a matching function to infer the correct semantic labels for data. Since our approach depends on the learned similarity metrics but not the data itself, it is domain-independent and only needs to be trained once to work effectively across multiple domains. In our evaluation, our approach achieves higher accuracy than other approaches, even when the learned models are trained on domains other than the test domain.
|
We build on our earlier finding that more than 95% of the triples in actual RDF triple graphs have a remarkably tabular structure, whose schema does not necessarily follow from explicit metadata such as ontologies, but which an RDF store can automatically derive by looking at the data using so-called “emergent schema” detection techniques. In this paper we investigate how computers, and in particular RDF stores, can take advantage of this emergent schema to more compactly store RDF data and to more efficiently optimize and execute SPARQL queries. To this end, we contribute techniques for efficient emergent-schema-aware RDF storage and new query operator algorithms for emergent-schema-aware scans and joins. In all, these techniques allow RDF stores to fully catch up with relational database techniques in terms of rich physical database design options and efficiency, without requiring a rigid upfront schema definition.
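One simple way to recover such a tabular structure is to group subjects by their set of properties (their "characteristic set"); the sketch below illustrates only this principle, not the paper's full detection technique:

```python
from collections import defaultdict

triples = [
    ("p1", "name", "Alice"), ("p1", "age", 30),
    ("p2", "name", "Bob"),   ("p2", "age", 25),
    ("b1", "title", "Moby Dick"), ("b1", "author", "p2"),
]

# Collect the set of properties used by each subject.
props = defaultdict(set)
for s, p, _ in triples:
    props[s].add(p)

# Subjects sharing a property set behave like rows of one relational table.
tables = defaultdict(list)
for s, ps in props.items():
    tables[frozenset(ps)].append(s)

for schema, subjects in tables.items():
    print(sorted(schema), "->", subjects)
# ['age', 'name']     -> ['p1', 'p2']   (an emergent "Person" table)
# ['author', 'title'] -> ['b1']         (an emergent "Book" table)
```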
|
Evaluating joins over RDF data stored in a shared-nothing server cluster is key
to processing truly large RDF datasets. To the best of our knowledge, the
existing approaches use a variant of the data exchange operator that is
inserted into the query plan statically (i.e., at query compile time) to
shuffle data between servers. We argue that this often misses opportunities for
local computation, and we present a novel solution to distributed query
answering that consists of two main components. First, we present a query
answering algorithm based on dynamic data exchange, which exploits data
locality better than the static approaches. Second, we present a partitioning
algorithm for RDF data based on graph partitioning whose aim is to increase
data locality. We have implemented our approach in the RDFox system, and our
performance evaluation suggests that our techniques outperform the state of the
art by up to an order of magnitude.
|
Linked Open Data has been recognized as a valuable source for background information in data mining. However, most data mining tools require features in propositional form, i.e., a vector of nominal or numerical features associated with an instance, while Linked Open Data sources are graphs by nature. In this paper, we present RDF2Vec, an approach that uses language modeling approaches for unsupervised feature extraction from sequences of words, and adapts them to RDF graphs. We generate sequences by leveraging local information from graph sub-structures, harvested by Weisfeiler-Lehman Subtree RDF Graph Kernels and graph walks, and learn latent numerical representations of entities in RDF graphs. Our evaluation shows that such vector representations outperform existing techniques for the propositionalization of RDF graphs on a variety of different predictive machine learning tasks, and that feature vector representations of general knowledge graphs such as DBpedia and Wikidata can be easily reused for different tasks.
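The core idea, serialising graph walks into "sentences" that a word2vec-style model can consume, can be sketched as follows (a toy graph and gensim 4; illustrative only, not the actual RDF2Vec implementation):

```python
import random
from gensim.models import Word2Vec

# Toy RDF graph as adjacency lists: subject -> list of (predicate, object).
graph = {
    "dbr:Berlin":  [("dbo:country", "dbr:Germany"), ("rdf:type", "dbo:City")],
    "dbr:Hamburg": [("dbo:country", "dbr:Germany"), ("rdf:type", "dbo:City")],
    "dbr:Germany": [("rdf:type", "dbo:Country")],
}

def random_walk(start, depth=4):
    """A single random graph walk; entities and properties both become tokens."""
    walk, node = [start], start
    for _ in range(depth):
        edges = graph.get(node)
        if not edges:
            break
        pred, obj = random.choice(edges)
        walk += [pred, obj]
        node = obj
    return walk

walks = [random_walk(e) for e in graph for _ in range(50)]
model = Word2Vec(walks, vector_size=32, window=4, min_count=1, sg=1, epochs=20)
print(model.wv.most_similar("dbr:Berlin", topn=2))   # nearest neighbours, e.g. dbr:Hamburg
```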
|
According to Semantic Web standards, IRIs are individual
constants or predicate letters whose names are chosen arbitrarily
and carry no formal meaning. At the same time it is a well-known
aspect of Semantic Web pragmatics that IRIs are often constructed
mnemonically, in order to be meaningful to a human interpreter.
The latter has traditionally been termed 'Social Meaning', a
concept that has been discussed but not yet quantitatively
studied by the Semantic Web community.
In this paper we use statistical model learning as a method to
quantify the meaning that is (at least) encoded in Semantic Web
names. We implement the approach and evaluate it over hundreds of
thousands of data sets in order to illustrate its efficacy. Our
experiments confirm that many Semantic Web names are indeed
meaningful and, more interestingly, we provide a quantitative
lower bound on how much meaning is (at least) encoded in names on
a per-dataset basis.
To our knowledge, this is the first paper about the interaction
between social and formal meaning, as well as the first paper
that uses statistical model learning as a method to quantify
meaning in the Semantic Web context.
|
To realise a semantic Web of Things, the challenge of achieving efficient Resource Description Framework (RDF) storage and SPARQL query performance on Internet of Things (IoT) devices with limited resources has to be addressed. State-of-the-art SPARQL-to-SQL engines have been shown to outperform RDF stores on some benchmarks. In this paper, we describe an optimisation to the SPARQL-to-SQL approach, based on a study of time-series IoT data structures, that employs metadata abstraction and efficient translation by reusing existing SPARQL engines to produce Linked Data `just-in-time'. We evaluate our approach against RDF stores, state-of-the-art SPARQL-to-SQL engines and streaming SPARQL engines, in the context of IoT data and scenarios. We show that storage efficiency, with succinct row storage, and query performance can be improved by factors ranging from 2 times up to 3 orders of magnitude.
|
Combinatorial creativity combines existing concepts in a novel way in order to produce a new concept. For example, we can imagine jewelry that measures blood pressure. For this, we would combine the concept of jewelry with the capabilities of medical devices. Combinatorial creativity can be used to develop new business ideas, to find plots for books or movies, or simply to disrupt conventional thinking. In this paper, we propose a formal language for combinatorial creativity, based on description logics. We show that our language can be used to model existing inventions and (to a limited degree) to generate new concepts.
|
Mapping data to a shared domain ontology is a key step in publishing semantic content on the Web. Most of the work on automatically mapping structured and semi-structured sources to ontologies focuses on semantic labeling, i.e., annotating data fields with ontology classes and/or properties. However, a precise mapping that fully recovers the intended meaning of the data needs to describe the semantic relations between the data fields too. We present a novel approach to automatically discover the semantic relations within a given data source. We mine the small graph patterns occurring in Linked Open Data and combine them to build a graph that will be used to infer semantic relations. We evaluated our approach on datasets from different domains. Mining patterns of maximum length five, our method achieves an average precision of 75% and recall of 77% for a dataset with very complex mappings to the domain ontology, increasing up to 86% and 82%, respectively, for simpler ontologies and mappings.
|
The assessment of risk in medicine is a crucial task, depending on scientific knowledge derived by rigorous clinical studies regarding the (quantified) factors affecting biological changes, as well as on particular knowledge about the current status of a particular patient. Existing non-semantic risk prediction tools are typically based on hardcoded scientific knowledge, and only cover a very limited range of patient states. This makes them rapidly out of date, and limited in application, particularly for patients with co-morbidities (multiple co-occurring conditions). Semantic Web and Quantified Self technologies make it possible to address this task in a much more principled way, to maximise knowledge and data reuse and minimise maintenance requirements while enabling new and sophisticated applications involving widely-available biometric sensors.
We present a framework for calculating clinical risk predictions for patients based on automatically-gathered biometric data. This framework relies on generic, reusable ontologies for representing clinical risk, and sensor readings, and reasoning to support the integration of data represented according to these ontologies. This integration makes novel use of Semantic Web technologies, and supports straightforward extension and maintenance by medical professionals. The framework is evaluated in terms of its predictions, extensibility and ease of use for domain experts.
|
The goal of this work is to learn a measure supporting the detection of strong relationships between Linked Data entities. Such relationships can be represented as paths of entities and properties, and can be obtained through a blind graph search process traversing Linked Data. The challenge here is therefore the design of a cost-function that is able to detect the strongest relationship between two given entities, by objectively assessing the value of a given path. To achieve this, we use a Genetic Programming approach in a supervised learning method to generate path evaluation functions that compare well with human evaluations. We show how such a cost-function can be generated only using basic topological features of the nodes of the paths as they are being traversed (i.e. without knowledge of the whole graph), and how it can be improved through introducing a very small amount of knowledge about the vocabularies of the properties that connect nodes in the graph.
|
In recent years, there have been increasing efforts to develop techniques for related entity recommendation, where the task is to retrieve a ranked list of related entities given a keyword query. Another trend in the area of information retrieval (IR) is to take temporal aspects of a given query into account when assessing the relevance of documents. However, while this has become an established functionality in document search engines, the significance of time, especially when explicitly given, has not yet been recognized for entity recommendation. We address this gap by introducing the task of time-aware entity recommendation. We propose the first probabilistic model that takes time-awareness into consideration for entity recommendation by leveraging heterogeneous knowledge of entities extracted from different data sources publicly available on the Web. We extensively evaluate the proposed approach, and our experimental results show considerable improvements compared to time-agnostic entity recommendation approaches.
|
The amount of entities in large knowledge bases available on the Web has been increasing rapidly, making it possible to propose new ways of intelligent information access. In addition, there is an impending need for technologies that can enable cross-lingual information access. As a simple and intuitive way of specifying information needs, keyword queries enjoy widespread usage, but suffer from challenges including ambiguity, incompleteness and cross-linguality. In this paper, we present a knowledge base approach to cross-lingual keyword query interpretation by transforming keyword queries in different languages to their semantic representation, which can facilitate query disambiguation and expansion, and also bridge the language barriers of queries. The experimental results show that our approach achieves both high efficiency and effectiveness and considerably outperforms the baselines.
|
Navigational graph queries are an important class of queries that can extract implicit binary relations over the nodes of input graphs. Most of the navigational query languages used in the RDF community, e.g. property paths in W3C SPARQL 1.1 and nested regular expressions in nSPARQL, are based on regular expressions. It is known that regular expressions have limited expressivity; for instance, some natural queries, like same-generation queries, are not expressible with regular expressions. To overcome this limitation, in this paper we present cfSPARQL, an extension of the SPARQL query language equipped with context-free grammars. The cfSPARQL language is strictly more expressive than property paths and nested expressions. The additional expressivity can be used for modelling graph similarities, graph summarization and ontology alignment. Despite the increased expressivity, we show that cfSPARQL still enjoys a low computational complexity and can be evaluated efficiently.
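For instance, the classical same-generation query, which relates nodes reachable by equally many "up" edges and "down" edges, requires a context-free production of the following shape (an illustrative grammar, not cfSPARQL syntax) and is therefore beyond regular path languages:

```latex
S \;\rightarrow\; \varepsilon \;\mid\; \mathit{up} \; S \; \mathit{down}
```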
|
We address the problem of performing entity resolution on RDF graphs containing multiple types of nodes, using the links between instances of different types to improve accuracy. For example, in a graph of products and manufacturers, the goal is to resolve all the products and all the manufacturers. We formulate this as a multi-type graph summarization problem, which involves clustering the nodes of each type that refer to the same entity into one super node and creating weighted links among super nodes that summarize the inter-cluster links in the original graph. Experiments show that the proposed approach outperforms several state-of-the-art generic entity resolution approaches, especially on data sets with one-to-many or many-to-many relations and attributes with missing values.
|
Feature extraction algorithms in Music Informatics aim at deriving statistical and semantic information directly from audio signals. These may range from energies in several frequency bands to musical information such as key, chords or rhythm. There is an increasing diversity and complexity of features and algorithms in this domain, and applications call for a common structured representation to facilitate interoperability, reproducibility and machine interpretability. We propose a solution relying on Semantic Web technologies that is designed to serve a dual purpose: (1) to represent computational workflows of audio features and (2) to provide a common structure for feature data to enable the use of Linked Open Data principles and technologies in Music Informatics. The Audio Feature Ontology is based on the analysis of existing tools and music informatics literature, which was instrumental in guiding the ontology engineering process. The ontology provides a descriptive framework for expressing different conceptualisations of the audio feature extraction domain and enables designing linked data formats for representing feature data. In this paper, we discuss important modelling decisions and introduce a harmonised ontology library consisting of modular interlinked ontologies that describe the different entities and activities involved in music creation, production and publishing.
|
Significant advances in Natural Language Processing (NLP) research are fostered when high-quality annotated corpora are provided for general use. In an effort to develop a sembank (i.e., an annotated corpus dedicated to capturing the semantic meaning of a large set of annotated sentences), NLP researchers have developed the Abstract Meaning Representation (AMR) formulation. Each AMR is a rooted, labeled graph that represents the semantics of a single sentence. Nodes in the core AMR graph represent concepts/entities (such as nouns, PropBank frames, etc.) and edges represent relations between concepts (such as frame-specific arguments, roles, etc.). AMRs have been used to annotate corpora of classic books, news text and the biomedical research literature. Research is progressing on creating automatic parsers to generate AMRs directly from textual input. In the work described here, we map the AMR representation to a linked data format (AMR-LD), adopting the ontological formulation of the underlying AMR faithfully. We describe the process of generating AMR-LD data from standard AMRs derived from biomedical research articles, including mapping named entities to well-known linked-data resources, such as UniProt and PubChem, as well as open-source software to convert AMR data to RDF. We describe the benefits of AMR-LD, including convenient analysis using SPARQL queries and ontology inferences, and embedding into the web of Linked Data. Finally, we discuss the possible impact of semantic web representations that are directly derived from natural language.
|
Household appliances are set to become highly intelligent, smart and networked devices in the near future. Systematically deployed on the Internet of Things (IoT), they would be able to form complete energy consuming, producing, and managing ecosystems. Smart systems are technically very heterogeneous, and standardized interfaces at the sensor and device level are therefore needed. However, standardization in the IoT has largely focused on the technical communication level, leading to a large number of different solutions based on various standards and protocols, with limited attention to the common semantics contained in the message data structures exchanged at the technical level. The Smart Appliance REFerence ontology (SAREF) is a shared model of consensus developed in close interaction with the industry and with the support of the European Commission. It is published as a technical specification by ETSI and provides an important contribution to achieving semantic interoperability for smart appliances. This paper builds on the success achieved in standardizing SAREF and presents SAREF4EE, an extension of SAREF. SAREF4EE has been created in collaboration with the EEBus and Energy@Home industry associations to interconnect their (different) data models. By using SAREF4EE, smart appliances from different manufacturers that support the EEBus or Energy@Home standards can easily communicate with each other using any energy management system at home or in the cloud.
|
Assessing the Underworld (ATU) is a large interdisciplinary UK research project addressing urban infrastructure challenges, especially how to make streetworks more efficient and sustainable. One of the key challenges it addresses is integrated inter-asset maintenance. As the assets on the surface of the ground (e.g. pavements) and those buried under it (e.g. pipes and cables) are supported by the ground, the properties and processes of soil affect the performance of these assets to a significant degree. In order to make integrated decisions, it is necessary to combine the knowledge and expertise in multiple areas, such as roads, soil, buried assets, sensing, etc. This requires an underpinning knowledge model, in the form of an ontology. Within this context, we present a new ontology for describing soil properties (e.g. soil strength) and processes (e.g. soil compaction), as well as how they affect each other. This ontology can be used to express how the ground affects and is affected by assets buried under the ground or on the ground surface. The ontology is written in OWL 2 and openly available from the University of Leeds data repository: http://doi.org/10.5518/54.
|
Over the past years, the size of the Data Web has increased significantly, which makes obtaining general insights into its growth and structure both more challenging and more desirable. The lack of such insights hinders important data management tasks such as quality, privacy and coverage analysis. In this paper, we present LODStats, which provides a comprehensive picture of the current state of a significant part of the Data Web. LODStats integrates RDF datasets from the data.gov, publicdata.eu and datahub.io data catalogs and, at the time of writing, lists over 9 000 RDF datasets. For each RDF dataset, LODStats collects comprehensive statistics and makes these available adhering to the LDSO vocabulary. This analysis has been regularly published and enhanced over the past four years at the public platform lodstats.aksw.org. We give a comprehensive overview of the resulting dataset.
|
We present a new hybrid knowledge base that combines the contextual information of distributional models with the conciseness and precision of manually constructed lexical networks. In contrast to dense vector representations, our resource is human readable and interpretable, and can be easily embedded within the Semantic Web ecosystem. Manual evaluation based on human judgments and an extrinsic evaluation on the task of Word Sense Disambiguation both indicate the high quality of the resource, as well as the benefits of enriching top-down lexical knowledge resources with bottom-up distributional information from text.
|
The Semantic Web Community has invested significant research effort in developing systems for Semantic Web search and exploration. But while it has been easy to assess the systems' computational efficiency, it has been much harder to assess how well different semantic systems help their users find and browse information. In this article, we propose and demonstrate the use of a benchmark for evaluating them, similar to the TREC benchmark for evaluating traditional search engines. Our benchmark includes a set of typical user tasks and a well-defined procedure for assigning a measure of performance on those tasks to a semantic system. We demonstrate its application to one such system, Rhizomer. We intend for this work to initiate a community conversation that will lead to a generally accepted framework for comparing systems and measuring, and thus encouraging, progress towards better semantic search and exploration tools.
|
SPARQL is the W3C standard query language for querying data expressed in the Resource Description Framework (RDF). The increasing amounts of RDF data available raise a major need and research interest in building efficient and scalable distributed SPARQL query evaluators. In this context, we propose and share SPARQLGX: our implementation of a distributed RDF datastore based on Apache Spark. SPARQLGX is designed to leverage existing Hadoop infrastructures for evaluating SPARQL queries. SPARQLGX relies on a translation of SPARQL queries into executable Spark code that adopts evaluation strategies according to (1) the storage method used and (2) statistics on data. We show that SPARQLGX makes it possible to evaluate SPARQL queries on billions of triples distributed across multiple nodes, while providing attractive performance figures. We report on experiments which show how SPARQLGX compares to related state-of-the-art implementations. Using a simple design, SPARQLGX already represents an interesting alternative in several scenarios. We share it as a resource for the further construction of efficient SPARQL evaluators.
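To convey the flavour of such a translation (SPARQLGX emits Scala; the sketch below is an illustrative Python/RDD analogue, not its generated code), a two-pattern basic graph pattern maps to per-pattern filters that are joined on the shared variable:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "bgp-sketch")

# RDD of (subject, predicate, object) triples (toy data).
triples = sc.parallelize([
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:bob",   "ex:livesIn", "ex:Rome"),
    ("ex:carol", "ex:livesIn", "ex:Oslo"),
])

# SPARQL BGP:  ?x foaf:knows ?y .  ?y ex:livesIn ?city
# Each triple pattern becomes a filter; the shared variable ?y becomes the join key.
tp1 = triples.filter(lambda t: t[1] == "foaf:knows").map(lambda t: (t[2], t[0]))   # key ?y -> ?x
tp2 = triples.filter(lambda t: t[1] == "ex:livesIn").map(lambda t: (t[0], t[2]))   # key ?y -> ?city

result = tp1.join(tp2).map(lambda kv: {"x": kv[1][0], "y": kv[0], "city": kv[1][1]})
print(result.collect())   # [{'x': 'ex:alice', 'y': 'ex:bob', 'city': 'ex:Rome'}]
sc.stop()
```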
|
In this paper, we experimentally compare the efficiency of various database engines for the purposes of querying the Wikidata knowledge-base, which can be conceptualised as a directed edge-labelled graph where edges can be annotated with meta-information called qualifiers. We select two popular SPARQL databases (Virtuoso, Blazegraph), a popular relational database (PostgreSQL), and a popular graph database (Neo4J) for comparison and discuss various options as to how Wikidata can be represented in the models of each engine. We design a set of experiments to test the relative query performance of these representations in the context of their respective engines. We first execute a large set of atomic lookups to establish a baseline performance for each test setting, and subsequently perform experiments on instances of more complex graph patterns based on real-world examples. We conclude with a summary of the strengths and limitations of the engines observed.
|
While the geographical domain has long been an important part of Linked Data, the small amount of Chinese linked geographical data hinders the integration and sharing of both Chinese and cross-lingual knowledge. In this paper, we contribute to the development of a new Chinese linked geographical dataset named Clinga, by obtaining data from the largest Chinese wiki encyclopedia. We manually design a new geography ontology to categorize a wide range of physical and human geographical entities, and carry out an automatic discovery of links to existing knowledge bases. The resulting Clinga dataset contains over half a million Chinese geographical entities and is open access.
|
The paper presents a synthetic linked data generator that can generate large amounts of RDF data based on given statistical distributions. Data generation is platform independent, supports streaming mode and produces output in N-Triples and N-Quads formats. Different sets of output can be generated using various configuration parameters, and the outputs are reproducible. Unlike existing generators, our generator accepts any vocabulary and can supplement the output with noisy and inconsistent data. The generator has an option to inter-link generated instances with real ones, provided that the user supplies entities from real datasets.
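As a minimal sketch of the idea (our own illustration, with hypothetical URIs and a Zipf distribution chosen arbitrarily), generating N-Triples that follow a prescribed statistical distribution and remain reproducible via a fixed seed could look like this:

```python
import numpy as np

rng = np.random.default_rng(42)          # fixed seed makes the output reproducible
n_triples, n_entities = 10, 1000

# Subjects drawn from a Zipf-like (power-law) distribution over entity identifiers.
subjects = rng.zipf(2.0, n_triples) % n_entities
for i, s in enumerate(subjects):
    print(f"<http://example.org/e{s}> <http://example.org/p0> \"value {i}\" .")
```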
|
A variety of tools for visualizing, editing, and documenting OWL ontologies have been developed in the last couple of years. The OWL coverage and conformance of these tools usually needs to be tested during development or for evaluation and comparison purposes. However, in particular for the testing of special OWL concepts and concept combinations, it can be tedious to find suitable ontologies and test cases. We have developed OntoBench, a generator for OWL 2 benchmark ontologies that can be used to test and compare ontology visualizers and related tools. In contrast to existing OWL benchmarks, OntoBench does not focus on scalability and performance but OWL coverage and concept combinations. Consistent benchmark ontologies are dynamically generated based on OWL 2 language constructs selected in a graphical user interface. OntoBench is available on GitHub and as a public service, making it easy to use the tool and generate custom ontologies or ontology fragments.
|
This paper proposes a mapping of the Linked Data Platform (LDP) specification to the Constrained Application Protocol (CoAP). The main motivation stems from the fact that the LDP W3C Recommendation presents resource management primitives for HTTP only. Hence, use cases related to Web of Things scenarios, where HTTP-based communication and infrastructures are unfeasible, are partially neglected. A general translation of LDP-HTTP requests and responses is provided, as well as a fully comprehensive framework for HTTP-to-CoAP proxying. The theoretical work is corroborated by an experimental campaign using the W3C Test Suite for LDP.
|
Processing data streams is increasingly gaining momentum, given the need to process these flows of information in real time and at Web scale.
In this context, RDF Stream Processing (RSP) and Stream Reasoning (SR) have emerged as solutions to combine semantic technologies with stream and event processing techniques.
Research in these areas has proposed an ecosystem of solutions to query, reason and perform real time processing over heterogeneous and distributed data streams on the Web.
However, so far one basic building block has been missing: a mechanism to disseminate and exchange RDF streams on the Web.
In this work we close this gap, proposing TripleWave, a reusable and generic tool that enables the publication of RDF streams on the Web.
The features of TripleWave have been derived from requirements of real use-cases, and consider a diverse set of scenarios, independent of any specific RSP implementation.
TripleWave can be fed with existing Web streams (e.g. Twitter and Wikipedia streams) or time-annotated RDF datasets (e.g. the LinkedSensorData set), and it can be invoked through both pull- and push-based mechanisms, thus also enabling RSP engines to automatically register and receive data from TripleWave.
|
The Semantic Web Dog Food (SWDF) is the reference linked dataset of the Semantic Web community about papers, people, organisations, and events related to its academic conferences. In this paper we analyse the existing problems of generating, representing and maintaining Linked Data for the SWDF. With this work (i) we provide a refactored and cleaned SWDF dataset; (ii) we use a novel data model which improves the Semantic Web Conference Ontology, adopting best ontology design practices; and (iii) we provide an open source maintenance workflow to support a healthy growth of the dataset beyond the Semantic Web conferences.
|
The OWL Reasoner Evaluation (ORE) Competition is an annual competition (with an associated workshop) which pits OWL 2 compliant reasoners against each other on various standard reasoning tasks over naturally occurring problems. The 2015 competition was the third of its sort and had 14 reasoners competing in six tracks comprising three tasks (consistency, classification, and realisation) over two profiles (OWL 2 DL and EL). In this paper, we outline the design of the competition and present the infrastructure used for its execution: the corpora of ontologies, the competition framework, and the submitted systems. All resources are publicly available on the Web, allowing users to easily re-run the 2015 competition, or reuse any of the ORE infrastructure for reasoner experiments or ontology analysis.
|
This paper describes the outcome of an e-government project named FOOD, FOod in Open Data, which was carried out in the context of a collaboration between the Institute of Cognitive Sciences and Technologies of the Italian National Research Council, the Italian Ministry of Agriculture (MIPAAF) and the Italian Digital Agency (AgID). In particular, we implemented several ontologies for describing protected names of products (wine, pasta, fish, oil, etc.). In addition, we present the process carried out for producing and publishing a LOD dataset containing data extracted from existing Italian policy documents on such products and compliant with the aforementioned ontologies.
|
YAGO is a large knowledge base that is built automatically from Wikipedia, WordNet and GeoNames. The project combines information from 10 Wikipedias of different languages, thus giving the knowledge a multilingual dimension. It also attaches spatial and temporal information to many facts, and thus allows the user to query the data over space and time. YAGO focuses on extraction quality and achieves a manually evaluated precision of 95%. In this paper, we explain from a general perspective how YAGO is built from its sources, how its quality is evaluated, how a user can access it, and how other projects utilize it.
|
A Collection of Benchmark Datasets for Systematic Evaluations of Machine Learning on the Semantic Web
In the recent years, several approaches for machine learning on the Semantic Web have been proposed. However, no extensive comparisons between those approaches have been undertaken, in particular due to a lack of publicly available, acknowledged benchmark datasets. In this paper, we present a collection of 22 benchmark datasets at different sizes, derived from existing Semantic Web datasets as well as from external classification and regression problems linked to datasets in the Linked Open Data cloud. Such a collection of datasets can be used to conduct qualitative performance testing and systematic comparisons of approaches.
|
Effective, collaborative integration of software services and big data to develop insightful analytics, for Web-scale systems, is now a crucial techno-economic challenge. This requires new combined data and software engineering processes and tools. Semantic metadata standards such as RDFS and OWL, and linked data principles, provide a technical grounding for such integrated systems given an appropriate model of the domain. In this paper we introduce the ALIGNED suite of ontologies or vocabularies specifically designed to model the information exchange needs of combined software and data engineering processes. The models have been deployed to enable: tool-chain integration, such as the exchange of data quality reports; cross-domain communication, such as interlinked data and software unit testing; mediation of the system design process through the capture of design intents and as a source of context for model-driven software engineering processes. These ontologies are deployed in trial live web-scale, data-intensive system development environments in both the commercial and academic domains. We exemplify the usage of the suite on a complex collaborative software and data engineering scenario from the legal information system domain.
|
We performed a thorough replication study of the top systems in the yearly SemEval Twitter Sentiment Analysis task. We highlight some differences between the results obtained by the top systems and the ones we were able to reproduce. We also propose SentiME, an ensemble system composed of five state-of-the-art sentiment classifiers. SentiME first trains the different classifiers using the Bootstrap Aggregating algorithm. The classification results are then aggregated using a linear function that averages the classification distributions of the different classifiers. SentiME has also been tested over the SemEval2015 test set, properly trained with the SemEval2015 training set, outperforming the best-ranked system of the challenge.
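As an illustration of the aggregation step (not the authors' code), the following sketch averages the class-probability distributions produced by several classifiers; the classifier outputs and weights are invented.

```python
# Illustrative sketch: aggregating the class-probability distributions of several
# sentiment classifiers with a simple linear average, as described for SentiME.
# The outputs and weights below are hypothetical.
from collections import defaultdict

def aggregate(distributions, weights=None):
    """Average per-class probabilities across classifiers and pick the top class."""
    if weights is None:
        weights = [1.0 / len(distributions)] * len(distributions)
    combined = defaultdict(float)
    for dist, w in zip(distributions, weights):
        for label, prob in dist.items():
            combined[label] += w * prob
    return max(combined, key=combined.get), dict(combined)

# Example: three (hypothetical) classifiers voting on one tweet.
outputs = [
    {"positive": 0.6, "neutral": 0.3, "negative": 0.1},
    {"positive": 0.5, "neutral": 0.4, "negative": 0.1},
    {"positive": 0.2, "neutral": 0.5, "negative": 0.3},
]
label, dist = aggregate(outputs)
print(label, dist)   # the class with the highest averaged probability
```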
|
Increasingly, Web pages mix entities coming from different sources and represented in several different ways. It can thus happen that the same entity is both described using schema.org annotations and referenced by a text anchor pointing to its Wikipedia page. Often, those representations provide complementary information which is not exploited because the representations remain disjoint.
In this project, we explore the extent to which entities represented in different ways recur on the Web, how they are related, and how they complement (or link to) each other. Our initial experiments show that we can unveil a previously unexploited knowledge graph by applying simple instance matching techniques to a large collection of schema.org annotations and DBpedia. The resulting knowledge graph aggregates entities (often tail entities) scattered across several Web pages, and complements existing DBpedia entities with new facts and properties.
In order to facilitate further investigation into how to mine such information, we are releasing i) an excerpt of all CommonCrawl web pages containing both Wikipedia and schema.org annotations, ii) the toolset to extract this information and perform knowledge graph construction and mapping onto DBpedia, and iii) the resulting knowledge graph (VoldemortKG) obtained via label matching techniques.
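The following is a minimal sketch of the kind of label matching that can link schema.org entities to DBpedia resources; the normalisation and sample data are illustrative only, not the VoldemortKG pipeline itself.

```python
# Minimal label-matching sketch for linking schema.org entities to DBpedia.
# The normalisation rule and the sample index below are illustrative only.
import re

def normalise(label):
    """Lower-case, strip punctuation and collapse whitespace."""
    return re.sub(r"\s+", " ", re.sub(r"[^\w\s]", "", label.lower())).strip()

# Hypothetical inputs: names found in schema.org annotations on web pages,
# and an index from normalised DBpedia labels to resource IRIs.
schema_org_entities = ["Tim Berners-Lee", "W3C"]
dbpedia_index = {
    "tim bernerslee": "http://dbpedia.org/resource/Tim_Berners-Lee",
    "world wide web consortium": "http://dbpedia.org/resource/World_Wide_Web_Consortium",
}

for name in schema_org_entities:
    match = dbpedia_index.get(normalise(name))
    print(name, "->", match)  # unmatched entities would become new (tail) nodes
```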
|
Recently, a growing number of linguistic resources in different languages have been published and interlinked as part of the Linguistic Linked Open Data (LLOD) cloud. However, in comparison to English and other prominent languages, the presence of Chinese in such a cloud is still limited, despite the fact that Chinese is the most spoken language worldwide. Publishing more Chinese language resources in the LLOD cloud can benefit both academia and industry to better understand the language itself and to further build multilingual applications that will improve the flow of data and services across countries. In this paper, we describe Zhishi.lemon, a newly developed dataset based on the lemon model that constitutes the lexical realization of Zhishi.me, one of the largest Chinese datasets in the Linked Open Data (LOD) cloud. Zhishi.lemon combines the lemon core with the lemon translation module in order to build a linked data lexicon in Chinese with translations into Spanish and English. Links to BabelNet (a vast multilingual encyclopedic resource) have been provided as well. We also present a showcase of this module along with the technical details of transforming Zhishi.me to Zhishi.lemon. We have made the dataset accessible on the Web for both humans (via a Web interface) and software agents (with a SPARQL endpoint).
|
This paper introduces the Audio Effects Ontology (AUFX-O), building on previous theoretical models describing audio processing units and workflows in the context of music production. We discuss important conceptualisations at different abstraction layers, why they are necessary for successfully modelling audio effects, and how they are applied. We present use cases concerning the application of effects in music production projects and the creation of audio effect metadata, facilitating a linked data service exposing information about effect implementations. By doing so, we show how our model benefits knowledge sharing, and enables reproducibility and analysis of audio production workflows.
|
To enable knowledge access across languages, ontologies, which are often represented only in English, need to be translated into different languages. The main challenge in translating ontologies is to find the right term with respect to the domain modelled by the ontology itself. Machine translation services may help in this task; however, a crucial requirement is to have translations validated by experts before the ontologies are deployed. Real-world applications must therefore provide a support system that relieves experts of the work of validating all translations. In this paper, we present ESSOT, an Expert Supporting System for Ontology Translation. The peculiarity of this system is that it exploits semantic information from the concept's context to improve the quality of label translations. The system has been tested both within the Organic.Lingua project, by translating the modelled ontology into three languages, and on other multilingual ontologies, in order to evaluate its effectiveness in other contexts. The results have been compared with the translations provided by the Microsoft Translator API, and the improvements demonstrate the viability of the proposed approach.
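A minimal sketch of the underlying idea, under the assumption that neighbouring concept labels are passed as disambiguating context to a machine translation service; the `translate` callable and the toy ontology below are hypothetical, not ESSOT's implementation.

```python
# Sketch of context-aware label translation: labels of neighbouring concepts are
# used as disambiguating context. `translate` stands in for any MT service; its
# signature is hypothetical, and the toy ontology is invented.
def translate_label(label, context, target_lang, translate):
    # Passing context terms along with the label biases a generic MT system
    # towards the domain sense intended by the ontology.
    return translate(f"{label} ({', '.join(context)})", target_lang)

ontology_context = {"bank": ["river", "erosion", "flood plain"]}   # neighbours' labels
fake_mt = lambda text, lang: f"<{lang} translation of '{text}'>"

print(translate_label("bank", ontology_context["bank"], "it", fake_mt))
```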
|
Conserving fossil-based energy to reduce carbon emissions is key to slowing down global warming. The 2015 Paris agreement on climate change emphasised the importance of raising public awareness and participation to address this societal challenge. In this paper we introduce EnergyUse, a social and collaborative platform for raising awareness of climate change by enabling users to view and compare the actual energy consumption of various appliances, and to share and discuss energy conservation tips in an open and social environment. The platform collects data from smart plugs, and exports appliance consumption information and community-generated energy tips as linked data. We report on the system design, data modelling, platform usage and early deployment with a set of 58 initial participants. We also discuss the challenges, lessons learnt, and future platform developments.
|
Rakuten Ichiba uses a taxonomy to organize the items it sells. Currently, the taxonomy classes that are relevant in terms of profit generation and difficulty of exploration are being manually extended with data properties deemed helpful to create pages that improve the user search experience and, ultimately, the conversion rate. In this paper we present a scalable approach that aims to automate this process, automatically selecting the relevant and semantically homogeneous subtrees in the taxonomy, extracting from the semi-structured text of item descriptions a core set of properties and a popular subset of their ranges, then extending the covered range using relational similarities in free text. Additionally, our process automatically tags the items with the new semantic information and exposes them as RDF triples. We present a set of experiments showing the effectiveness of our approach in this business context.
|
The illegal parking of bicycles is a social problem in Tokyo and other urban areas. The purpose of this study was to sustainably build Linked Open Data (LOD) for illegally parked bicycles and to support problem solving by raising social awareness, in cooperation with the Bureau of General Affairs of Tokyo. We first extracted information on the problem factors and designed an LOD schema for illegally parked bicycles. Then we collected pieces of data from Social Networking Services (SNS) and websites of municipalities to build the illegally parked bicycle LOD (IPBLOD) with more than 200,000 triples. We then estimated the missing data in the LOD based on the causal relations derived from the problem factors. As a result, the number of illegally parked bicycles can be inferred with 70.9% accuracy. Finally, we published the complemented LOD and a Web application to visualize the distribution of illegally parked bicycles in the city. We hope this raises social attention to this issue.
|
In model-based systems engineering a model specifying the system's design is shared across a variety of disciplines and used to ensure the consistency and quality of the overall design. Existing implementations for describing these system models exhibit a number of shortcomings regarding their approach to data management. In this emerging applications paper, we present the application of an ontology for space system design providing increased semantic soundness of the underlying standardized data specification, enabling reasoners to identify problems in the system, and allowing the application of operational knowledge collected over past projects to the system to be designed. Based on a qualitative evaluation driven by data derived from an actual satellite design project, a reflection on the applicability of ontologies in the overall model-based systems engineering approach is pursued.
|
This paper describes the outcomes of an ongoing collaboration between Siemens and the University of Oxford, with the goal of facilitating the design of ontologies and their deployment in applications. Ontologies are mainly used in Siemens to capture the conceptual information models underpinning a wide range of applications. We start by describing the key role that such models play in two use cases in the manufacturing and energy production sectors. Then, we discuss the formalisation of information models using ontologies, and the relevant reasoning services. Finally, we present SOMM---a tool that supports engineers with little background on semantic technologies in the creation of ontology-based models and in populating them with data. SOMM implements a fragment of OWL 2 RL extended with a form of integrity constraints for data validation, and it comes with support for schema and data reasoning, as well as for model integration. Our evaluation demonstrates the adequacy of SOMM's functionality and performance for Siemens applications.
|
Real-time analytics that requires integration and aggregation of heterogeneous and distributed streaming and static data is a typical task in many industrial scenarios, such as the diagnostics of turbines in Siemens. The OBDA approach has great potential to facilitate such tasks; however, it has a number of limitations in dealing with analytics that restrict its use in important industrial applications. Based on our experience with Siemens, we argue that, in order to overcome those limitations, OBDA should be extended to become analytics-, source-, and cost-aware. In this work we propose such an extension. In particular, we propose an ontology, mapping, and query language for OBDA in which aggregate and other analytical functions are first-class citizens. Moreover, we develop query optimisation techniques that allow analytical tasks over static and streaming data to be processed efficiently. We implement our approach in a system and evaluate it with Siemens turbine data.
|
We present a domain-agnostic system for Question Answering over multiple semi-structured and possibly linked datasets without the need of a training corpus. The system is motivated by an industry use-case where Enterprise Data needs to be combined with a large body of Open Data to fulfill information needs not satisfied by prescribed application data models. Our proposed Question Answering pipeline combines existing components with novel methods to perform, in turn, linguistic analysis of a query, named entity extraction, entity / graph search, fusion and ranking of possible answers. We evaluate QuerioDALI with two open-domain benchmarks and a biomedical one over Linked Open Data sources, and show that our system produces comparable results to systems that require training data and are domain-dependent. In addition, we analyze the current challenges and shortcomings.
|
The process of classifying scholarly outputs is crucial to ensure timely access to knowledge. However, this process is typically carried out manually by expert editors, leading to high costs and slow throughput. In this paper we present Smart Topic Miner (STM), a novel solution which uses semantic web technologies to classify scholarly publications on the basis of a very large automatically generated ontology of research areas. STM was developed to support the Springer Nature Computer Science editorial team in classifying proceedings in the LNCS family. It analyses in real time a set of publications provided by an editor and produces a structured set of topics and a number of Springer Nature classification tags, which best characterise the given input. In this paper we present the architecture of the system and report on an evaluation study conducted with a team of Springer Nature editors. The results of the evaluation, which showed that STM classifies publications with a high degree of accuracy, are very encouraging and as a result we are currently discussing the required next steps to ensure large scale deployment within the company.
|
One focus of Semantic Technologies is formalisms that allow complex properties of, and relationships between, classes of data to be expressed. The declarative nature of these formalisms is close to natural language and human conceptualisation, and thus Semantic Technologies enjoy increasing popularity in scenarios where traditional solutions lead to very convoluted procedures which are difficult to maintain and whose correctness is difficult to judge.
A fruitful application of Semantic Technologies in the field of health care data analysis has emerged from the collaboration between Oxford and Kaiser Permanente, a US health care provider (HMO). US HMOs have to deliver annual measurement results on their quality of care to US authorities. One of these sets of measurements is defined in a specification called HEDIS, which is infamous amongst data analysts for its complexity. Traditional solutions with either SAS programs or SQL queries lead to involved implementations whose maintenance and validation are difficult and bind a considerable amount of resources.
In this paper we present the project in which we applied Semantic Technologies to compute the most difficult part of the HEDIS measures. We show that we arrive at a clean, structured and legible encoding of HEDIS in the rule language of the RDF triple store RDFox. We use RDFox's reasoning capabilities and SPARQL queries to compute and extract the results. The results for a whole Kaiser Permanente regional branch could be computed in competitive time by RDFox on readily available commodity hardware. Further development and deployment of the project results are envisaged at Kaiser Permanente.
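To make the shape of such an encoding concrete, here is a heavily simplified sketch: a datalog-style eligibility rule (illustrative syntax, not necessarily RDFox's concrete notation) and a SPARQL count query run with rdflib over a toy graph; all class and property names are invented.

```python
# Illustrative only: the general shape of a rule-plus-query encoding of a quality
# measure. The rule uses generic datalog-style syntax, and every predicate and
# class name is invented (this is not the actual HEDIS encoding).
from rdflib import Graph, Namespace, RDF

eligibility_rule = """
EligiblePatient(?p) :- Enrolled(?p, "2015"), Diagnosis(?p, "diabetes"),
                       Age(?p, ?a), ?a >= 18, ?a <= 75 .
"""
print(eligibility_rule)   # such rules would be evaluated by the reasoner

EX = Namespace("http://example.org/hedis#")
g = Graph()
g.add((EX.alice, RDF.type, EX.EligiblePatient))   # would be derived by the rule
g.add((EX.alice, EX.hadTest, EX.HbA1cTest))

count_query = """
PREFIX ex: <http://example.org/hedis#>
SELECT (COUNT(DISTINCT ?p) AS ?numerator) WHERE {
  ?p a ex:EligiblePatient ;
     ex:hadTest ex:HbA1cTest .
}
"""
for row in g.query(count_query):
    print("numerator:", row[0])   # -> 1
```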
|
Building and Exploring National-wide Enterprise Knowledge Graphs for Investment Analysis in an Incremental Way
Full-fledged enterprise information can be a great weapon in investment analysis. However, enterprise information is scattered across different databases and websites. The information from a single source is incomplete and also suffers from noise. It is not an easy task to integrate and utilize information from diverse sources in real business scenarios. In this paper, we present an approach to building knowledge graphs (KGs) by exploiting semantic technologies to reconcile data from diverse sources incrementally. We have built a nation-wide enterprise KG which incorporates information about 40,000,000 enterprises in China. We also provide enterprise querying and data visualization capabilities, as well as novel investment analysis scenarios, including finding an enterprise's real controllers, innovative enterprise analysis, enterprise path discovery, and so on. The KG and its applications are currently used by two securities companies in their investment banking businesses.
|
SPARQL has many nice features for accessing data integrated across different data sources, which is an important step in any data analysis task. We report the use of SPARQL for two real data analytic use cases from the healthcare and life sciences domains, which exposed certain weaknesses in the current specification of SPARQL, specifically when the data being integrated is most conveniently accessed via RESTful services and in formats beyond RDF, such as XML. We therefore extended SPARQL with generalized 'service' constructs for accessing services beyond the SPARQL endpoints supported by the standard 'service' clause; for efficiency, our constructs additionally needed to support posting data, which is also not supported by 'service'. Furthermore, data from multiple sources led to natural modularity in the queries, with different portions of the query pertaining to different sources, so we also extended SPARQL with a simple 'function' mechanism to isolate the mechanics of accessing each endpoint. We provide an open-source implementation of this SPARQL endpoint in an RDF store called Quetzal, and evaluate its use in the two data analytic scenarios over real datasets.
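For reference, the standard SPARQL 1.1 SERVICE clause that these constructs generalise only federates to other SPARQL endpoints; the snippet below is a baseline illustration (the endpoint URL and predicates are placeholders, and the paper's extended 'service'/'function' syntax is not reproduced here).

```python
# Baseline SPARQL 1.1 federation via SERVICE: the remote source must itself be a
# SPARQL endpoint, and no data can be POSTed to it; these are exactly the
# limitations the generalized constructs above address. URLs are placeholders.
federated_query = """
PREFIX ex: <http://example.org/>
SELECT ?drug ?target WHERE {
  ?drug a ex:Drug .
  SERVICE <http://remote.example.org/sparql> {
    ?drug ex:hasTarget ?target .
  }
}
"""
print(federated_query)   # could be sent to a local endpoint, e.g. with SPARQLWrapper
```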
|
Knowledge graphs such as Yago and Freebase have become a powerful asset for enhancing search, and are being intensively used in both academia and industry. Many existing knowledge graphs are either available as Linked Open Data, or they can be exported as RDF datasets enhanced with background knowledge in the form of an OWL 2 ontology. Faceted search is the de facto approach for exploratory search in many online applications, and has been recently proposed as a suitable paradigm for querying RDF repositories. In this paper, we provide rigorous theoretical underpinnings for faceted search in the context of RDF-based knowledge graphs enhanced with OWL 2 ontologies. We identify well-defined fragments of SPARQL that can be naturally captured using faceted search as a query paradigm, and establish the computational complexity of answering such queries. We also study the problem of updating faceted interfaces, which is critical for guiding users in the formulation of meaningful queries during exploratory search. We have implemented our approach in a fully-fledged faceted search system, SemFacet, which we have evaluated over the Yago knowledge graph.
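As a concrete illustration of the kind of SPARQL fragment that a faceted interface can capture, consider the tree-shaped query below; the vocabulary and selected values are placeholders, not SemFacet's actual data.

```python
# Illustrative example of a tree-shaped ("faceted") SPARQL query: a class facet,
# property facets, and a nested restriction on a property value. The vocabulary
# terms and the selected value are placeholders.
faceted_query = """
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?person WHERE {
  ?person a dbo:Scientist ;                    # class facet
          dbo:birthPlace ?city .               # property facet
  ?city a dbo:City ;                           # nested facet on the value
        dbo:country ?country .
  VALUES ?country { <http://dbpedia.org/resource/Germany> }   # selected facet value
}
"""
print(faceted_query)
```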
|
We present Ontop, an open-source Ontology-Based Data Access (OBDA) system that allows for querying relational data sources through a conceptual representation of the domain of interest, provided in terms of an ontology, to which the data sources are mapped. Key features of Ontop are its solid theoretical foundations, a virtual approach to OBDA, which avoids materializing triples and is implemented through the query rewriting technique, extensive optimizations exploiting all elements of the OBDA architecture, its compliance to all relevant W3C recommendations (including SPARQL queries, R2RML mappings, and OWL 2 QL and RDFS ontologies), and its support for all major relational databases.
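For readers unfamiliar with R2RML, the following minimal mapping sketches the kind of input Ontop consumes (the table and vocabulary are invented for illustration); parsing it with rdflib merely checks that it is well-formed Turtle, since Ontop never materializes the resulting triples.

```python
# A small, self-contained R2RML mapping (invented table and vocabulary) of the
# kind an OBDA system consumes; the virtual RDF graph itself is never materialised.
from rdflib import Graph

r2rml_mapping = """
@prefix rr: <http://www.w3.org/ns/r2rml#> .
@prefix ex: <http://example.org/ontology#> .

<#EmployeeMap>
    rr:logicalTable [ rr:tableName "EMPLOYEE" ] ;
    rr:subjectMap [
        rr:template "http://example.org/employee/{ID}" ;
        rr:class ex:Employee
    ] ;
    rr:predicateObjectMap [
        rr:predicate ex:worksFor ;
        rr:objectMap [ rr:template "http://example.org/dept/{DEPT_ID}" ]
    ] .
"""

g = Graph()
g.parse(data=r2rml_mapping, format="turtle")
print(len(g), "mapping triples parsed")
```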
|
One of the main tasks when creating and maintaining knowledge bases is to validate facts and provide sources for them in order to ensure correctness and traceability of the provided knowledge. So far, this task is often addressed by human curators in a three-step process: issuing appropriate keyword queries for the statement to check using standard search engines, retrieving potentially relevant documents and screening those documents for relevant content. The drawbacks of this process are manifold. Most importantly, it is very time-consuming as the experts have to carry out several search processes and must often read several documents. In this article, we present DeFacto (Deep Fact Validation)—an algorithm able to validate facts by finding trustworthy sources for them on the Web. DeFacto aims to provide an effective way of validating facts by supplying the user with relevant excerpts of web pages as well as useful additional information including a score for the confidence DeFacto has in the correctness of the input fact. To achieve this goal, DeFacto collects and combines evidence from web pages written in several languages. In addition, DeFacto provides support for facts with a temporal scope, i.e., it can estimate in which time frame a fact was valid. Given that the automatic evaluation of facts has not been paid much attention to so far, generic benchmarks for evaluating these frameworks were not previously available. We thus also present a generic evaluation framework for fact checking and make it publicly available.
|
To enable efficiency in stream processing, the evaluation of a query is usually performed over bounded parts of (potentially) unbounded streams, i.e., processing windows “slide” over the streams. To avoid inefficient re-evaluation of already evaluated parts of a stream with respect to a query, incremental evaluation strategies are applied, i.e., the query results are obtained incrementally from the result set of the preceding processing state without having to re-evaluate all input buffers. This method is highly efficient, but it comes at the cost of having to maintain processing state, which is not trivial and may defeat the performance advantages of the incremental evaluation strategy. In the context of RDF streams, the problem is further aggravated by the hard-to-predict evolution of the structure of RDF graphs over time and the application of sub-optimal implementation approaches, e.g., using relational technologies for storing data and processing states, which incur significant performance drawbacks for graph-based query patterns. To address these performance problems, this paper proposes a set of novel operator-aware data structures coupled with incremental evaluation algorithms which outperform their counterparts in relational stream processing systems. This claim is demonstrated through extensive experimental results on both simulated and real datasets.
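The following simplified sketch illustrates the principle of incremental window evaluation (it is not the paper's operator-aware data structures): a count aggregate over a sliding window is updated only with the deltas of arriving and expiring triples, instead of re-scanning the whole window on every slide.

```python
# Simplified illustration of incremental window maintenance over an RDF stream:
# only deltas (arriving/expiring triples) touch the maintained aggregate.
from collections import deque

class SlidingCount:
    def __init__(self, window_size):
        self.window_size = window_size   # window width in time units
        self.buffer = deque()            # (timestamp, triple) pairs inside the window
        self.count = 0                   # aggregate maintained incrementally

    def insert(self, timestamp, triple):
        self.buffer.append((timestamp, triple))
        self.count += 1                  # delta for the arriving triple
        self._evict(timestamp)

    def _evict(self, now):
        while self.buffer and self.buffer[0][0] <= now - self.window_size:
            self.buffer.popleft()
            self.count -= 1              # delta for the expiring triple

# Example: a 10-unit window over a toy stream of sensor triples.
w = SlidingCount(window_size=10)
for t in range(0, 30, 5):
    w.insert(t, ("ex:sensor1", "ex:hasReading", t))
    print(t, w.count)
```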
|
The essence and value of Linked Data lie in the ability of humans and machines to query, access and reason upon highly structured and formalised data. Ontology structures provide an unambiguous description of the structure and content of data. While a multitude of software applications and visualization systems have been developed for Linked Data over the past years, a significant gap still exists between applications that consume Linked Data and interfaces that have been designed with a significant focus on aesthetics. Though the importance of aesthetics in affecting the usability, effectiveness and acceptability of user interfaces has long been recognised, little or no explicit attention has been paid to the aesthetics of Linked Data applications. In this paper, we introduce a formalised approach to developing aesthetically pleasing semantic web interfaces by following aesthetic principles and guidelines identified from the literature. We apply such principles to design and develop a generic approach of using visualizations to support the exploration of Linked Data, in an interface that is pleasing to users. This provides users with means to browse ontology structures, enriched with statistics of the underlying data, facilitating exploratory activities and enabling visual querying for highly precise information needs. We evaluated our approach in three ways: an initial objective evaluation comparing our approach with other well-known interfaces for the semantic web, and two user evaluations with semantic web researchers.
|
Ontology localization is the task of adapting an ontology to a different cultural context, and has been identified as an important task in the context of the Multilingual Semantic Web vision. The key task in ontology localization is translating the lexical layer of an ontology, i.e., its labels, into some foreign language. For this task, we hypothesize that the translation quality can be improved by adapting a machine translation system to the domain of the ontology. To this end, we build on the success of existing statistical machine translation (SMT) approaches, and investigate the impact of different domain adaptation techniques on the task. In particular, we investigate three techniques: (i) enriching a phrase table by domain-specific translation candidates acquired from existing Web resources, (ii) relying on Explicit Semantic Analysis as an additional technique for scoring a certain translation of a given source phrase, as well as (iii) adaptation of the language model by means of weighting n-grams with scores obtained from topic modelling. We present in detail the impact of each of these three techniques on the task of translating ontology labels. We show that these techniques have a generally positive effect on the quality of translation of the ontology and that, in combination, they provide a significant improvement in quality.
|
This paper presents a novel approach to Linked Data exploration that uses Encyclopedic Knowledge Patterns (EKPs) as relevance criteria for selecting, organising, and visualising knowledge. EKPs are discovered by mining the linking structure of Wikipedia and evaluated by means of a user-based study, which shows that they are cognitively sound as models for building entity summarisations. We implemented a tool named Aemoo that supports EKP-driven knowledge exploration and integrates data coming from heterogeneous resources, namely static and dynamic knowledge as well as text and Linked Data. Aemoo is evaluated by means of controlled, task-driven user experiments in order to assess its usability and its ability to provide relevant and serendipitous information as compared to two existing tools: Google and RelFinder.
|
Knowledge graphs have gained increasing popularity in the past couple of years, thanks to their adoption in everyday search engines. Typically, they consist of fairly static and encyclopedic facts about persons and organizations (e.g. a celebrity's birth date, occupation and family members) obtained from large repositories such as Freebase or Wikipedia. In this paper, we present a method and tools to automatically build knowledge graphs from news articles. As news articles describe changes in the world through the events they report, we present an approach to create Event-Centric Knowledge Graphs (ECKGs) using state-of-the-art natural language processing and semantic web techniques. Such ECKGs capture long-term developments and histories on hundreds of thousands of entities and are complementary to the static encyclopedic information in traditional knowledge graphs. We describe our event-centric representation schema, the challenges in extracting event information from news, our open-source pipeline, and the knowledge graphs we have extracted from four different news corpora: general news (Wikinews), the FIFA world cup, the Global Automotive Industry, and Airbus A380 airplanes. Furthermore, we present an assessment of the accuracy of the pipeline in extracting the triples of the knowledge graphs. Moreover, through an event-centered browser and visualization tool we show how approaching information from news in an event-centric manner can increase the user's understanding of the domain, facilitate the reconstruction of news story lines, and enable exploratory investigation of facts hidden in the news.
|
The Web of Data has grown enormously over the last years. Currently, it comprises a large compendium of interlinked and distributed datasets from multiple domains. Running complex queries on this compendium often requires accessing data from different endpoints within one query. The abundance of datasets and the need to run complex queries have thus motivated a considerable body of work on SPARQL query federation systems, the dedicated means to access data distributed over the Web of Data. However, the granularity of previous evaluations of such systems has not allowed insights to be derived concerning their behavior in the different steps involved in federated query processing. In this work, we perform extensive experiments to compare state-of-the-art SPARQL endpoint federation systems using the comprehensive performance evaluation framework FedBench. In addition to considering the traditional query runtime as an evaluation criterion, we extend the scope of our performance evaluation by considering criteria which have not received much attention in previous studies: the number of sources selected, the total number of SPARQL ASK requests used, the completeness of answers, and the source selection time. We show that these criteria have a significant impact on the overall query runtime of existing systems. Moreover, we extend FedBench to mirror a highly distributed data environment and assess the behavior of existing systems using the same performance criteria. As a result, we provide a detailed analysis of the experimental outcomes that reveals novel insights for improving current and future SPARQL federation systems.
|
One of the major barriers to the deployment of Linked Data is the difficulty that data publishers have in determining which vocabularies to use to describe the semantics of data. This system report describes Linked Open Vocabularies (LOV), a high-quality catalogue of reusable vocabularies for the description of data on the Web. The LOV initiative gathers and makes visible indicators that have not been previously harvested, such as the interconnections between vocabularies and version history, along with past and current referents (individuals or organizations). The report details the various components of the system along with some innovations, such as the introduction of a property-level boost in the vocabulary search scoring which takes into account the type of the property (e.g., rdfs:label, dc:comment) associated with a matching literal value. By providing an extensive range of data access methods (full-text search, SPARQL endpoint, API, data dump or UI), the project aims at facilitating the reuse of well-documented vocabularies in the Linked Data ecosystem. The adoption of LOV by many applications and methods shows the importance of such a set of vocabularies and related features for the ontology design activity and the publication of data on the Web.
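A minimal sketch of what a property-level boost can look like, assuming a match on rdfs:label should weigh more than a match on a comment property; the boost values and base scoring below are illustrative, not LOV's actual formula.

```python
# Sketch of property-level boosting in vocabulary term search: the relevance score
# is scaled according to which property carried the matching literal. Boost values
# and the base scoring are illustrative only.
PROPERTY_BOOST = {
    "http://www.w3.org/2000/01/rdf-schema#label": 2.0,
    "http://www.w3.org/2000/01/rdf-schema#comment": 1.0,
}

def boosted_score(term, literal, matched_property, base_score):
    """Scale a textual relevance score by the type of the property that matched."""
    if term.lower() not in literal.lower():
        return 0.0
    return base_score * PROPERTY_BOOST.get(matched_property, 0.5)

label_uri = "http://www.w3.org/2000/01/rdf-schema#label"
comment_uri = "http://www.w3.org/2000/01/rdf-schema#comment"
print(boosted_score("person", "Person", label_uri, 1.3))                        # 2.6
print(boosted_score("person", "A person or organization.", comment_uri, 1.3))  # 1.3
```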
|
The rapid growth of the Linked Open Data cloud, as well as the increasing ability to lift relational enterprise datasets to a semantic, ontology-based level means that vast amounts of information are now available in a representation that closely matches the conceptualizations of the potential users of this information. This makes it interesting to create ontology based, user-oriented tools for searching and exploring this data. Although initial efforts were intended for tech users with knowledge of SPARQL/RDF, there are ongoing proposals designed for lay users. One of the most promising approaches is to use visual query interfaces, but more user studies are needed to assess their effectiveness. In this paper, we compare the effect on usability of two important paradigms for ontology-based query interfaces: form-based and graph-based interfaces. In order to reduce the number of variables affecting the comparison, we performed a user study with two state-of-the-art query tools developed by ourselves, sharing a large part of the code base: the graph-based tool OptiqueVQS*, and the form-based tool PepeSearch. We evaluated these tools in a formal comparison study with 15 participants searching a Linked Open Data version of the Norwegian Company Registry. Participants had to respond to 6 non-trivial search tasks using alternately OptiqueVQS* and PepeSearch. Even without previous training, retrieval performance and user confidence were very high, thus suggesting that both interface designs are effective for searching RDF datasets. Expert searchers had a clear preference for the graph-based interface, and mainstream searchers obtained better performance and confidence with the form-based interface. While a number of participants spontaneously praised the capability of the graph interface for composing complex queries, our results evidence that graph interfaces are difficult to grasp. In contrast, form interfaces are more learnable and relieve problems with disorientation for mainstream users. We have also observed positive results introducing faceted search and dynamic term suggestion in semantic search interfaces.
|
The development and standardization of semantic web technologies has resulted in an unprecedented volume of data being published on the Web as Linked Data (LD). However, we observe widely varying data quality ranging from extensively curated datasets to crowdsourced and extracted data of relatively low quality. In this article, we present the results of a systematic review of approaches for assessing the quality of LD. We gather existing approaches and analyze them qualitatively. In particular, we unify and formalize commonly used terminologies across papers related to data quality and provide a comprehensive list of 18 quality dimensions and 69 metrics. Additionally, we qualitatively analyze the 30 core approaches and 12 tools using a set of attributes. The aim of this article is to provide researchers and data curators a comprehensive understanding of existing work, thereby encouraging further experimentation and development of new approaches focused towards data quality, specifically for LD.
|
Directed Acyclic Graph (DAG) data is increasingly available on the Web, including in Linked Open Data (LOD). Mining reachability relationships between entities is an important task for extracting knowledge from LOD. Diverse labeling schemes have been proposed to determine reachability efficiently. We focus on a state-of-the-art 2-hop labeling scheme that is based on a permutation of vertices to achieve a linear index size and to reduce the on-line searches that are required when reachability cannot be answered by the 2-hop labels alone. We observed that the approach can be improved in three different ways: 1) space efficiency - guarantee a minimal index size without randomness; 2) update efficiency - update labels efficiently when the graph changes; 3) parallelization - labeling should be cluster-based and solved in a distributed fashion. This PhD thesis therefore proposes optimization techniques that address these issues. In this paper in particular, a way of reducing the 2-hop label size is proposed, with preliminary results on real-world DAG datasets. In addition, we discuss the feasibility of addressing the other issues based on our ongoing work.
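For readers unfamiliar with 2-hop labeling, the toy example below shows how reachability is answered from precomputed labels; the labels are written by hand here, whereas constructing small labels is the hard part addressed by the thesis.

```python
# Minimal illustration of answering reachability with 2-hop labels: u reaches v
# iff L_out(u) and L_in(v) share a hop vertex (an on-line fallback search is used
# when labels are incomplete, as discussed above). Labels below are hand-crafted
# for the toy DAG a -> c, b -> c, c -> d.
L_out = {            # for each vertex: hop vertices it can reach
    "a": {"a", "c"},
    "b": {"b", "c"},
    "c": {"c"},
    "d": {"d"},
}
L_in = {             # for each vertex: hop vertices that can reach it
    "a": {"a"},
    "b": {"b"},
    "c": {"c"},
    "d": {"c", "d"},
}

def reachable(u, v):
    return bool(L_out[u] & L_in[v])

print(reachable("a", "d"))   # True:  a -> c -> d, shared hop vertex c
print(reachable("d", "a"))   # False: no shared hop vertex
```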
|
In Sir Tim Berners-Lee’s seminal article introducing his vision of the semantic web, one of the use cases described was a health-related example where health consumers utilized intelligent hand-held devices that aggregated and exchanged health data from the semantic web. Presently, the majority of health consumers and patients rely on personal technology and the web to find information and to make personal health decisions. This proposal aims to contribute towards that use case, specifically in the “hot-bed” issue of the human papillomavirus (HPV) vaccine. The HPV vaccine targets young adults and teens to protect against life-threatening cancers, yet a segment of the public has reservations about the vaccine. I propose an interactive dialogue agent that harnesses patient-level vaccine information encoded in an ontology and that can be “talked to” through a natural language interface using utterances. I aim to pilot this technology in a clinic to assess whether patient knowledge about HPV and the vaccine increases, and whether their attitude toward the vaccine is modified as a result of using the interactive agent.
|
Our PhD work aims at a comprehensive, scalable and resource-aware software framework to process RDF data on embedded devices. In this proposal, we introduce a system architecture supporting RDF storage, SPARQL querying, RDF reasoning and continuous querying over RDF streams. The architecture is designed to be applicable to embedded systems. For efficient performance and scalability, we propose data management techniques that adapt to the hardware characteristics of embedded devices. Since computing resources on embedded devices are constrained, their usage should be context dependent. Therefore, we work on a resource adaptation model that supports trading off system performance and device resources depending on their availability. The adaptation model is based on the resource cost model of the data management techniques.
|
Modern data-driven applications often have to integrate and process large volumes of high-velocity data. To this end, they require fast and accurate Link Discovery solutions. Most Link Discovery frameworks rely on complex link specifications to determine candidates for links. Hence, the main focus of this work lies in the conception, development, implementation and evaluation of time-efficient and scalable Link Discovery approaches based on the link specification paradigm. We address the aforementioned challenges by presenting approaches for (1) time-constrained linking and (2) for the efficient computation and (3) scalable execution of link specifications with applications to periodically updated knowledge bases. The overall result of this thesis will be an open-source framework for link discovery on large volumes of RDF data streams.
|
Multiple datasets that add high value to biomedical research have been exposed on the web as part of the Life Sciences Linked Open Data (LS-LOD) Cloud. The ability to easily navigate through these datasets is crucial in order to draw meaningful biological correlations. However, navigating these multiple datasets is not trivial, as most of them are only available as isolated SPARQL endpoints with very little vocabulary reuse. We propose an approach for Autonomous Resource Discovery and Indexing (ARDI), a set of configurable rules which can be used to discover links between biological entities in the LS-LOD cloud. We have catalogued and linked concepts and properties from 137 public SPARQL endpoints. ARDI is used to dynamically assemble queries retrieving data from multiple SPARQL endpoints simultaneously.
|
Linked Data can be distributed through multiple interfaces on the Web, each of them with its own expressivity. However, no generic client is available that can handle querying over multiple interfaces. This increases the complexity of combining datasets and designing new interfaces. One can imagine the difficulties that arise when trying to create a client that queries various interfaces at the same time, interfaces that may only be discovered just in time. To this end, I aim to design a generic Linked Data querying engine capable of handling different interfaces and easy to extend. Rule-based reasoning will be explored to combine different interfaces without the intervention of a human developer. Using an iterative approach to extending Linked Data interfaces, I am going to evaluate different querying set-ups for the SPARQL language. Preliminary results indicate a broad spectrum of options yet to be explored. As the PhD is still in an early phase, we hope to narrow the scope in the coming months, based on feedback from the doctoral consortium.
|
Producing alignments of the highest quality requires ‘humans in the loop’; however, user involvement is currently one of the challenges for the ontology alignment community. Ontology alignment is a cognitively intensive task and could be efficiently supported by user interfaces encompassing well-designed visualizations and interaction techniques. This work investigates the application of large, high-resolution displays to improve users’ cognitive support and identifies several promising directions for their application: improving the navigation of ontologies and alignments, supporting users’ thinking process, and supporting collaboration.
|
Ontologies are constructed in various fields such as medical information and mechanical design. It is important to build high-quality ontologies so that they can serve as knowledge bases and knowledge models for application systems. However, it is hard to build good-quality ontologies because doing so requires both knowledge of ontologies and expertise in the target domain. Against this background, ontology construction and refinement cost a lot of time and effort. In order to reduce such costs, we develop an ontology refinement support system. This system has two main functions. First, the system can detect points that should be refined and propose how to refine them. Second, the system can evaluate ontologies quantitatively, indicating how consistent an ontology is with respect to a classificatory criterion. To develop the refinement support system, we focus on a guideline for building well-organized ontologies: “Each subclass of a super class is distinguished by the values of exactly one attribute of the super class”. When an ontology is built following this guideline, there is similarity among its is-a hierarchies. We use these similar is-a hierarchies to support ontology refinement.
|
With the success of Open Data, a huge amount of tabular data has become available that could potentially be mapped and linked into the Web of (Linked) Data. The use of semantic web technologies would then allow exploring related content and enhanced search functionalities across data portals. However, existing linkage and labeling approaches mainly rely on mappings of textual information to classes or properties in knowledge bases. In this work we outline methods to recover the semantics of tabular Open Data and to identify related content, which allows a mapping and automated integration/categorization of Open Data resources and improves the overall usability and quality of Open Data.
|
Wikipedia has been the primary source of information for many automatically generated Semantic Web data sources. However, these sources suffer from incompleteness, since they largely do not cover information contained in the unstructured texts of Wikipedia. Our goal is to extract structured entity relationships in RDF from such unstructured texts, ultimately using them to enrich existing data sources. Our extraction technique is designed to be topic-independent, leveraging the grammatical dependencies of sentences and context-based semantic refinement. Preliminary evaluations of the proposed approach have shown some promising results.
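A rough sketch of dependency-based triple extraction of the kind outlined above, using spaCy (assuming the en_core_web_sm model is installed); the subject-verb-object heuristic below is a simplification, not the author's actual technique.

```python
# Simplified dependency-based relation extraction: pair nominal subjects with
# their governing verb and that verb's direct objects. This is only a sketch of
# the general idea, not the proposed approach itself.
import spacy

nlp = spacy.load("en_core_web_sm")   # assumes the model has been downloaded

def extract_triples(text):
    triples = []
    for sent in nlp(text).sents:
        for token in sent:
            if token.dep_ == "nsubj" and token.head.pos_ == "VERB":
                verb = token.head
                for obj in (c for c in verb.children if c.dep_ in ("dobj", "attr")):
                    triples.append((token.text, verb.lemma_, obj.text))
    return triples

print(extract_triples("Marie Curie discovered polonium. She won the Nobel Prize."))
```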
|
Due to the growing need to process data produced on the Semantic Web in a timely manner and to derive valuable information and knowledge from it, RDF stream processing (RSP) has emerged as an important research domain. Modern RSP engines have to address the volume and velocity characteristics encountered in the Big Data era. This comes at the price of designing high-throughput, low-latency, fault-tolerant, highly available and scalable engines. The cost of implementing such systems from scratch is very high, and one usually prefers to program components on top of a framework that possesses these properties, e.g., Apache Hadoop or Apache Spark. The research conducted in this PhD adopts this approach and aims to create a production-ready RSP engine based on domain standards, e.g., Apache Kafka and Spark Streaming. In a nutshell, the engine aims to i) address basic event modeling, to guarantee the completeness of input data in window operators; ii) process real-time RDF streams in a distributed manner, which requires efficient RDF stream handling; iii) support and extend common continuous SPARQL syntax, so as to be easy to use and adapted to industrial needs; and iv) support reasoning services at both the data preparation and query processing levels.
|
Client-server trade-offs can be analyzed using Linked Data Fragments (LDF), which proposes a uniform view on all interfaces to RDF. This reveals a complete spectrum between Linked Data documents and the SPARQL protocol, in which we can advance the state of the art of Linked Data publishing. This axis can be explored in two dimensions: i) Selector, allowing different, more complex questions for the server; and ii) Metadata, extending the response with more information clients can use.
This work studies the second, Metadata, dimension in a practical Web context. Considering the conditions on the Web, this problem becomes three-fold. First, analogous to the Web itself, LDF interfaces should exist in a distributed, scalable manner in order to succeed. Generating additional metadata introduces overhead on the server, which influences the ability to scale towards multiple clients. Second, the communication between client and server uses the HTTP protocol. Modeling, serialization, and compression determine the extra load on the overall network traffic. Third, with query execution on the client, novel approaches need to apply this metadata intelligently to increase efficiency.
Concretely, this work defines and evaluates a series of transparent, interchangeable, and discoverable interface features. We proposed Triple Pattern Fragments (TPF), a Linked Data API with low server cost, as a fundamental base. This interface uses a single triple pattern as selector. To explore this research space, we extend this interface with different metadata, starting with an estimated number of total matching triples. By combining several TPFs, SPARQL queries are evaluated on the client side, using the metadata for optimization. Hence, we can measure the effect of the added metadata on query execution.
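A simplified sketch of how a client can exploit the estimated-count metadata, assuming the estimates have already been fetched from the fragments' hypermedia controls: triple patterns are simply ordered by selectivity before evaluation. The patterns and counts below are invented.

```python
# Metadata-driven join ordering on the client side: evaluate the triple pattern
# with the smallest server-provided cardinality estimate first, then bind its
# results into the remaining patterns. Estimates here are hypothetical; a real
# client reads them from the fragment metadata.
def order_by_estimate(patterns, estimate):
    """Sort triple patterns by their estimated number of matching triples."""
    return sorted(patterns, key=estimate)

patterns = [
    ("?person", "rdf:type", "dbo:Scientist"),
    ("?person", "dbo:birthPlace", "dbr:Ghent"),
]
fake_estimates = {patterns[0]: 250_000, patterns[1]: 320}   # hypothetical counts

plan = order_by_estimate(patterns, lambda p: fake_estimates[p])
print(plan[0])   # the selective dbo:birthPlace pattern is evaluated first
```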
|
Linked data has the potential of interconnecting data from different domains, giving machine agents new potential to provide better services for web users. The ever-increasing amount of linked data in government open data, social linked data, and linked medical and patient data provides new opportunities for data mining and machine learning. Both are, however, strongly dependent on the selection of high-quality data features to achieve good results. In this work we present an approach that uses ontological knowledge to generate features that are suitable for building a decision tree classifier addressing the specific data set and classification problem. The approach we present has two main characteristics: it generates new features on demand as required by the induction algorithm, and it uses ontological knowledge about linked data to restrict the set of possible options. These two characteristics enable the induction algorithm to look for features that might be connected through many entities in the linked data, enabling the generation of cross-domain explanation models.
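The following toy sketch illustrates the two characteristics described above, under invented class and property names: candidate features are generated on demand from the properties the ontology allows for the instances' class, and the candidate with the highest information gain is selected.

```python
# Toy sketch of on-demand, ontology-restricted feature generation for decision
# tree induction. All class/property names, data, labels and the scoring are
# invented for illustration.
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, labels, test):
    """Gain of splitting on a boolean feature test."""
    gain = entropy(labels)
    for value in (True, False):
        subset = [l for r, l in zip(rows, labels) if test(r) is value]
        if subset:
            gain -= len(subset) / len(labels) * entropy(subset)
    return gain

# Hypothetical ontological knowledge: properties applicable to instances of :City.
allowed_properties = {"City": ["locatedIn", "hasPopulationOver100k"]}

rows = [   # linked-data instances, flattened to property -> value for the sketch
    {"locatedIn": "Europe", "hasPopulationOver100k": True},
    {"locatedIn": "Europe", "hasPopulationOver100k": False},
    {"locatedIn": "Asia",   "hasPopulationOver100k": True},
]
labels = ["touristic", "not_touristic", "touristic"]

# Candidate feature tests generated on demand from the allowed properties only.
candidates = {
    "locatedIn=Europe":      lambda r: r["locatedIn"] == "Europe",
    "hasPopulationOver100k": lambda r: r["hasPopulationOver100k"],
}
best = max(candidates, key=lambda name: information_gain(rows, labels, candidates[name]))
print(best)   # the most informative ontology-derived feature is added to the tree
```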
|