DataChemist: A Technical Overview
DataChemist is a cloud-based graph database platform.

Our approach enables organizations of all types to upload data to the platform, structure their data, and then find hidden connections and relationships within it. By doing so, we enable those organizations to see the world as it really is.

Here’s how we make it happen.

At the highest level, the DataChemist platform supports the following process:

  • Defining a semantic data model, unique to the customer organization, that represents the world as it truly is
  • Importing existing data sets into this model, correcting ‘real world’ errors as they are uncovered
  • Providing multiple ways of querying and visualising the data, including graph navigation, tables, search, plots and charts
  • Providing tools for curating the data: a wiki-like UI that lets users update the data in the DB while enforcing all defined constraints on the input data
  • Enabling the data to be accessed in a variety of ways (document-, graph- or table-oriented) through flexible APIs that allow new services to be quickly and easily deployed based on the data.

What follows is a brief overview of how we do this. For more detail, including practical details of a specific project for IPG in Poland, read our whitepaper “Evolving Meaningful Value From Enterprise Data”.

Defining The Data Model

The DataChemist platform automatically generates an ontological data model in OWL from existing data, before allowing manual adaptations to ensure the final result accurately reflects the business objects that are important for your organization.

The initial automatic generation comprises three steps:

  1. Querying the structure of the existing database to identify keys that represent containment and connection relationships, and joining them together into interrelated objects.
  2. Generating URLs and unique identifiers so that composite keys become first-class addressable entities.
  3. Querying the data itself to identify refined types, such as enumerated types that are encoded as simple types like strings and integers in the input data.
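The third step can be illustrated with a minimal sketch. This is not DataChemist's implementation; the function name and thresholds are assumptions chosen purely to show the idea of spotting an enumerated type hiding inside a string column:

```python
# Hypothetical sketch of step 3: scanning a column's values to decide
# whether a string-typed field is better modelled as an enumerated type.
# The thresholds below are illustrative assumptions, not platform defaults.

def infer_enum(values, max_distinct=10, min_rows=100):
    """Suggest an enumeration if a large column has few distinct values."""
    distinct = set(values)
    if len(values) >= min_rows and len(distinct) <= max_distinct:
        return sorted(distinct)   # candidate members of the enumerated type
    return None                   # keep the original simple type

statuses = ["active", "lapsed", "active", "active", "lapsed"] * 40
print(infer_enum(statuses))  # ['active', 'lapsed']
```

A real profiler would also weigh null rates and value lengths, but the core signal is the same: many rows, few distinct values.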

Having generated this model, we load it into the DataChemist platform to serve as our schema, then we map each value from its input form (SQL, CSV, XML…) into its new ontological form as a triple.
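The mapping from input form to triples can be sketched as follows. The namespace, table and property names here are invented for illustration and do not reflect DataChemist's actual URL scheme:

```python
# Illustrative sketch (not DataChemist's code): turning one CSV row into
# subject-predicate-object triples. The key column becomes part of the
# subject URL, making each row a first-class addressable entity.
import csv, io

BASE = "http://example.com/data/"          # hypothetical namespace

def row_to_triples(table, key, row):
    subject = f"{BASE}{table}/{row[key]}"
    return [(subject, f"{BASE}{table}#{col}", val)
            for col, val in row.items() if col != key]

reader = csv.DictReader(io.StringIO("id,name,city\n42,Ada,Dublin\n"))
for row in reader:
    for triple in row_to_triples("person", "id", row):
        print(triple)
```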

At this point, we allow for manual adjustment of the model to ensure that the final agreed ontology accurately represents the structure desired by the client. Now we import existing data sets.

Importing Existing Data

Importing data from multiple sources is easy in DataChemist. In fact, all you need to do is drag and drop CSV or SQL files using our online UI.

In the background, we bundle the triples into their object form and submit the objects to the API. Any errors in basic datatypes, enumerated types or type ranges, and any referential-integrity breaches in the underlying data, show up immediately and are refused by the API.
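The kind of checks described above can be sketched in a few lines. The schema shape, field names and error messages below are assumptions for illustration, not the platform's actual validation code:

```python
# Hedged sketch of API-side validation: datatype checks, enumerated-type
# checks, and referential-integrity checks on a submitted object.

SCHEMA = {"age": int, "status": {"active", "lapsed"}}   # invented schema
KNOWN_IDS = {"person/1", "person/2"}                    # existing entities

def validate(obj):
    errors = []
    for field, rule in SCHEMA.items():
        value = obj.get(field)
        if isinstance(rule, set) and value not in rule:
            errors.append(f"{field}: {value!r} not in enumeration")
        elif isinstance(rule, type) and not isinstance(value, rule):
            errors.append(f"{field}: expected {rule.__name__}")
    ref = obj.get("manager")
    if ref is not None and ref not in KNOWN_IDS:
        errors.append(f"manager: dangling reference {ref!r}")
    return errors            # an empty list means the object is accepted

print(validate({"age": "old", "status": "gone", "manager": "person/9"}))
```

An object that fails any check is refused whole, which is what keeps bad values out of the database in the first place.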

As we have a simple ontology (see above) that has been generated from the input schema, we don’t have to worry about universal constraints so we can simply strip out any erroneous values and keep the rest.

The platform provides support for constructing workflows in which errors are presented to human users for correction. Because we keep track of provenance information, we can automatically connect an error in the imported data with its source in the original data, greatly cutting the cost of finding and correcting errors in the database. This, in effect, allows our platform to be used as the Master in Master Data Management.
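A minimal sketch of what such provenance tracking involves, assuming each imported value carries its source file and row (the data structures and names here are illustrative, not the platform's):

```python
# Toy provenance store: remember where each imported value came from, so
# a rejected value can be traced back to the original cell for correction.

provenance = {}   # (entity, property) -> (source file, row number)

def record(entity, prop, value, source, row):
    provenance[(entity, prop)] = (source, row)
    return value

def trace(entity, prop):
    """Point a curator at the original cell behind a bad value."""
    source, row = provenance[(entity, prop)]
    return f"fix {prop} of {entity} in {source}, row {row}"

record("person/42", "age", "unknown", "people.csv", 17)
print(trace("person/42", "age"))  # fix age of person/42 in people.csv, row 17
```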

Don’t forget, the same applies to edits made to data in the system and to any data imported in the future. If it breaks your ‘real world’ rules, it won’t happen. We help control your data quality: permanently.

Distribution Of Data

DataChemist enables you to visualise data in a number of ways. Most notably, the platform supports the ‘classic’ graph database view as shown below.

In addition, the DataChemist platform supports the generation of a wide variety of page-based UIs that enable structured data to be read by humans and machines as required.

We have developed a simple query language, WOQL, based on an extension of our OWL model. The language is concise and fast, and can be associated with a given data store, making the maintenance of constraints transparent.

  • Our model defines all of the classes and properties that can be present in the data. We use these to automate the production of graphical query building tools that have drop-down menus and autocomplete functionality specifically tuned to the data model.
  • All of our relationships and properties are strongly typed - we can exploit this to provide further automation of query generation - for example by generating drop-downs and auto-complete boxes for references to other data objects in the database, or by making suitable operators available only when appropriate to the context.
  • We can automatically identify situations in which results have geographic aspects (xdd:coordinate) or temporal aspects and exploit this by automatically plotting them on maps and timelines.
  • We can automatically generate aggregations over the aggregatable parts of the results, along with fine-grained statistics on the distribution of properties and classes in the results, in the entire dataset, and on the differences between the two.
  • We can automatically generate a relationship-oriented browser which allows users to navigate through the data as a graph - with a set of nodes representing entities and a set of edges representing relationships between them.
  • Because we have a richer, more expressive model than competing technologies, we can always automate the conversion of data from our semantic format to the simpler formats used by other programs, and much of this conversion can be made generic, driven by examination of the model. For example, we can specify that a particular output format represents coordinates in a particular way; our system will then use that format whenever it encounters an xdd:coordinate in our data.
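The strongly typed model driving UI generation, as the bullets above describe, can be sketched like this. The type names follow the document (including xdd:coordinate), but the widget mapping itself is an invented illustration:

```python
# Sketch of driving query-builder UI generation from a typed model:
# each property's range determines which input widget to render.

WIDGETS = {
    "xsd:string":     "text-box with autocomplete",
    "xsd:integer":    "numeric input with range operators",
    "xdd:coordinate": "map picker",
    "xsd:dateTime":   "timeline / date picker",
}

def widget_for(prop_range, known_classes):
    if prop_range in known_classes:   # a reference to another data object
        return "drop-down of existing instances"
    return WIDGETS.get(prop_range, "plain text-box")

print(widget_for("xdd:coordinate", {"Person"}))  # map picker
print(widget_for("Person", {"Person"}))          # drop-down of existing instances
```

The point is that none of this is hand-written per dataset: because every property is typed, the right widget, operator set and completion source can be derived mechanically from the model.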
