By Dr. David Hyland-Wood, Ephox Director of Technology
Like everyone in the software industry, Ephox’s R&D team is always looking for new ways to add value for our customers. Recently, this has involved research into augmenting the content-creation support within our editors.
An example of this work is a prototype integration of IBM Watson into our Textbox.io and TinyMCE rich-text editors, helping authors achieve the writing tone they intend while dynamically suggesting related content as they type. We are still in the formative phase of this work, and our research into augmenting content creation will continue. At the same time, some of our explorations relate to the evolving world of big data.
The Data & Knowledge Engineering Lab (DKE) within the School of ITEE at The University of Queensland is starting a new data science program. Professor Shazia Sadiq kindly invited me to present a seminar on big data, held in early June.
Big data is a tricky topic, especially since it has become such a successful marketing meme. We all understand the word “big”, and we know the word “data”, so big data seems simple. It is anything but. As a technical term, big data is often segmented by the type of approach used to process it. For example, large data sets processed via the MapReduce programming model developed at Google drove the original definition of the term. These days, next-generation approaches such as the in-memory optimizations of Apache Spark, the complexities found in the deep relationships of Linked Data, and even the huge bulk of governmental statistics all qualify under the rubric. Because of this breadth, data scientists might be statisticians, programmers, analysts, or logicians. They might work in Python, R, Scala, or Java.
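To make the MapReduce model concrete, here is a toy word-count sketch in plain Python (not Google's implementation, and not distributed): a map step emits key–value pairs, a shuffle groups them by key, and a reduce step folds each group. Real frameworks run these same phases in parallel across many machines.

```python
# Toy illustration of the MapReduce programming model on a single machine.
from collections import defaultdict

def map_phase(docs):
    # Map: emit a (word, 1) pair for every word in every document.
    for doc in docs:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all emitted values by their key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: fold each group of values into a single result per key.
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(shuffle(map_phase(["big data", "big ideas"])))
# counts is {'big': 2, 'data': 1, 'ideas': 1}
```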
My own background as a data scientist has been in implementing systems that support a subset of graph theory operations, especially as they relate to the processing of Linked Data and other Semantic Web systems. The Callimachus Project Linked Data management system and the Mulgara semantic database are two of the Open Source projects I have started; both are licensed under the Apache License 2.0.
The World Wide Web contains tremendous quantities of freely available data that links to other data to form graphs of common knowledge. These data represent the world’s best curated semantic structures, and they are used in the operation of the web’s search engines and in some forms of advanced artificial intelligence, such as Apple’s Siri and IBM’s Watson. I decided my talk should introduce these data sources, show people how to access and query them, and show how to include them in projects using Open Source software such as the Callimachus Project and Apache Spark.
Some of the easiest data to grab comes directly or indirectly from Wikipedia. The DBpedia Project extracts huge amounts of structured data from Wikipedia, as does the newer (and currently less capable) Wikidata Project. Open data is available from the many contributors to the Linking Open Data Cloud, and also BaseKB, which forked the open portions of Google’s Open Knowledge Graph. It is possible to find well-structured, open data on everything from music albums and artists, to animal species, medicinal drugs, clinical trials, government spending, and (of course) Pokémon.
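As a taste of how such data can be queried, here is a small SPARQL query one might run against DBpedia’s public endpoint, listing universities located in Australia. The class and property names (`dbo:University`, `dbo:country`) follow DBpedia’s ontology conventions, which can change between releases, so treat this as a sketch rather than a guaranteed query.

```sparql
PREFIX dbo:  <http://dbpedia.org/ontology/>
PREFIX dbr:  <http://dbpedia.org/resource/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?university ?name WHERE {
  ?university a dbo:University ;
              dbo:country dbr:Australia ;
              rdfs:label ?name .
  FILTER (lang(?name) = "en")
}
LIMIT 10
```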
The slides for my talk are available, as are the sample code and data I used to develop the examples. The sample code includes some SPARQL queries against data related to Australian universities, and an example of reading RDF data into a Spark RDD to perform graph operations that one cannot accomplish in SPARQL.
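Finding connected components is a classic example of a graph operation that plain SPARQL cannot express. The sketch below (not the talk’s sample code) parses N-Triples lines and runs a simple minimum-label propagation in plain Python; with Apache Spark, the parsing becomes a `map` over an RDD of lines and the propagation becomes iterated `reduceByKey`/`join` steps, but plain Python keeps the sketch self-contained. The `http://ex.org/` URIs are made up for illustration.

```python
# Parse resource-valued N-Triples and find weakly connected components.
import re

NTRIPLE = re.compile(r'<([^>]+)>\s+<([^>]+)>\s+<([^>]+)>\s*\.')

def parse_triples(lines):
    """Yield (subject, predicate, object) for triples whose object is a URI."""
    for line in lines:
        m = NTRIPLE.match(line.strip())
        if m:
            yield m.groups()

def connected_components(edges):
    """Label propagation: each node repeatedly adopts the smallest
    label among itself and its neighbours until nothing changes."""
    labels = {}
    for s, o in edges:
        labels.setdefault(s, s)
        labels.setdefault(o, o)
    changed = True
    while changed:
        changed = False
        for s, o in edges:
            low = min(labels[s], labels[o])
            if labels[s] != low or labels[o] != low:
                labels[s] = labels[o] = low
                changed = True
    return labels

data = [
    '<http://ex.org/a> <http://ex.org/knows> <http://ex.org/b> .',
    '<http://ex.org/b> <http://ex.org/knows> <http://ex.org/c> .',
    '<http://ex.org/d> <http://ex.org/knows> <http://ex.org/e> .',
]
edges = [(s, o) for s, _, o in parse_triples(data)]
components = connected_components(edges)
# a, b, and c share one component; d and e share another.
```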