Moving to the cloud: Migrating Blazegraph to Amazon Neptune

During the lifespan of a graph database application, the application itself tends to have only basic requirements, namely a functioning W3C-standard SPARQL endpoint. However, as graph databases become embedded in critical business applications, the business and its operations teams require much more: critical business infrastructure must not only function, but also be highly available, secure, scalable, and cost-effective. These requirements drive the move from on-premises or self-hosted solutions to a fully managed graph database service such as Amazon Neptune.

Neptune is a fast, reliable, fully managed graph database service that makes it easy to build and run business-critical graph database applications. Neptune is a purpose-built, high-performance graph database engine optimized for storing billions of relationships and querying the graph with millisecond latency. Neptune is designed to be highly available, with read replicas, point-in-time recovery, continuous backup to Amazon Simple Storage Service (Amazon S3), and replication across Availability Zones. Neptune is secure with support for AWS Identity and Access Management (IAM) authentication, HTTPS-encrypted client connections, and encryption at rest. Neptune also provides a variety of instance types, including low-cost options targeted at development and testing, all on a predictable, managed infrastructure.

When choosing to migrate from a current on-premises or self-hosted graph database solution to Neptune, what's the best way to perform the migration?

This post demonstrates how to migrate from the open-source RDF triplestore Blazegraph to Neptune by exporting the data from Blazegraph, staging it in Amazon S3, and loading it into Neptune with the bulk loader.

This post also examines the differences you need to be aware of while migrating between the two databases. Although this post is targeted at those migrating from Blazegraph, the approach is generally applicable for migration from other RDF triplestore databases.

Before covering the migration process, let's examine the fundamental building blocks of the architecture used throughout this post. This architecture consists of four main components: a Blazegraph instance containing the source data, an S3 bucket for staging the exported data, the target Neptune DB cluster, and a Neptune Workbench notebook for loading and validating the data.

The following diagram summarizes these resources and illustrates the solution architecture.

Although it's possible to construct the required AWS infrastructure manually through the AWS Management Console or CLI, this post uses a CloudFormation template to create the majority of the required infrastructure.

The process of migrating the data involves three steps: exporting the data from Blazegraph, uploading the exported files to Amazon S3, and importing the data into Neptune with the bulk loader.

The first step is exporting the data out of Blazegraph in a format that's compatible with the Neptune bulk loader. For more information about supported formats, see RDF Load Data Formats.

Depending on how the data is stored in Blazegraph (triples or quads) and how many named graphs are in use, Blazegraph may require that you perform the export process multiple times and generate multiple data files. If the data is stored as triples, you need to run one export for each named graph. If the data is stored as quads, you may choose to either export data in N-Quads format or export each named graph in a triples format. For this post, you export a single namespace as N-Quads, but you can repeat the process for additional namespaces or desired export formats.

There are two recommended methods for exporting data from Blazegraph. Which one you choose depends on whether the application needs to be online and available during the migration.

If it must be online, we recommend using SPARQL CONSTRUCT queries. With this option, you need a Blazegraph instance that is installed, configured, and running with an accessible SPARQL endpoint.

If the application is not required to be online, we recommend using the Blazegraph Export utility. With this option, you must download Blazegraph, and the data and configuration files need to be accessible, but the server doesn't need to be running.

SPARQL CONSTRUCT queries are a SPARQL feature that returns an RDF graph matching the specified query template. For this use case, you use them to export your data one namespace at a time using the following query:
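(The exact query used in the original export may include Blazegraph-specific query hints; the minimal form below simply copies every statement in the queried namespace.)

    CONSTRUCT WHERE { ?s ?p ?o }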

Although a variety of RDF tools exist for exporting this data, the easiest way to run this query is by using the REST API endpoint provided by Blazegraph. The following Python (3.6+) script demonstrates how to export data as N-Quads:
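(The endpoint URL, namespace, and Accept media type below are assumptions based on a default local Blazegraph installation; adjust them to match your deployment.)

    import requests  # third-party HTTP client; install with 'pip install requests'

    # Assumed defaults for a local Blazegraph installation.
    ENDPOINT = "http://localhost:9999/blazegraph/namespace/kb/sparql"
    QUERY = "CONSTRUCT WHERE { ?s ?p ?o }"
    OUTPUT_FILE = "kb.nq"

    def export_nquads(endpoint: str, query: str, output_file: str) -> None:
        # Ask Blazegraph to serialize the constructed graph as N-Quads.
        # Use an N-Triples, RDF/XML, or Turtle media type instead for triples-mode data.
        response = requests.post(
            endpoint,
            data={"query": query},
            headers={"Accept": "text/x-nquads"},
            stream=True,
        )
        response.raise_for_status()
        with open(output_file, "wb") as f:
            for chunk in response.iter_content(chunk_size=8192):
                f.write(chunk)

    if __name__ == "__main__":
        export_nquads(ENDPOINT, QUERY, OUTPUT_FILE)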

If the data is stored as triples, you need to change the Accept header parameter to export data in an appropriate format (N-Triples, RDF/XML, or Turtle) using the values specified on the GitHub repo.

Although the REST API is one way to export your data, it requires a running server and sufficient server resources to process the additional query overhead. This isn't always possible, so how do you perform an export on an offline copy of the data?

For those use cases, you can use the Blazegraph Export utility to get an export of the data.

Blazegraph contains a utility method to export data: the ExportKB class. This utility facilitates exporting data from Blazegraph, but unlike the previous method, the server must be offline while the export is running. This makes it the ideal method to use when you can take the application offline during migration, or when the migration can occur from a backup of the data.

You run the utility via the Java command line from a machine that has Blazegraph installed but not running. The easiest way to run this command is to download the latest blazegraph.jar release from GitHub. Running the command requires several parameters, including the location of the journal file, the properties file, the desired export format, and an output directory.

For example, if you have the Blazegraph journal file and properties file, you can export data as N-Quads with the following code:
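(The class name, options, and file paths below are a sketch based on the ExportKB utility in the com.bigdata.rdf.sail package; confirm the exact options against your Blazegraph release.)

    # Export the journal referenced by the properties file as N-Quads into ~/export.
    java -cp blazegraph.jar com.bigdata.rdf.sail.ExportKB \
        -outdir ~/export \
        -format N-Quads \
        ./blazegraph.jnl \
        ./RWStore.properties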

Upon successful completion, the utility writes the exported files to the specified output directory and prints a confirmation message.

No matter which option you choose, you can successfully export your data from Blazegraph in a Neptune-compatible format. You can now move on to migrating these data files to Amazon S3 to prepare for bulk load.

With your data exported from Blazegraph, the next step is to create a new S3 bucket. This bucket holds the data files exported from Blazegraph for the Neptune bulk loader to use. Because the Neptune bulk loader requires low-latency access to the data during load, this bucket needs to be located in the same Region as the target Neptune instance. Other than the location of the S3 bucket, no specific additional configuration is required.

You can create a bucket in a variety of ways, including the Amazon S3 console, the AWS CLI, or an AWS SDK.
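For example, a minimal boto3 sketch of creating the bucket (the bucket name and Region are placeholders, and the Region must match your Neptune cluster):

    import boto3  # AWS SDK for Python

    REGION = "us-west-2"             # must match the Region of the target Neptune cluster
    BUCKET = "my-blazegraph-export"  # hypothetical name; S3 bucket names are globally unique

    s3 = boto3.client("s3", region_name=REGION)
    s3.create_bucket(
        Bucket=BUCKET,
        CreateBucketConfiguration={"LocationConstraint": REGION},
    )
    # Omit CreateBucketConfiguration if you create the bucket in us-east-1.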

You use the newly created S3 bucket location to bulk load the data into Neptune.

The next step is to upload your data files from your export location to this S3 bucket. As with the bucket creation, you can do this with the Amazon S3 console, the AWS CLI, or an AWS SDK.
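For example, a minimal boto3 sketch of uploading one exported file (the file name, bucket, and key are placeholders):

    import boto3

    s3 = boto3.client("s3", region_name="us-west-2")
    # Upload one exported data file; repeat (or loop) for each file you exported.
    s3.upload_file("kb.nq", "my-blazegraph-export", "neptune-data/kb.nq")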

Although this example only uploads a single file, if you exported multiple files, you need to upload each of them to the S3 bucket.

After uploading all the files to your S3 bucket, you're ready for the final task of the migration: importing the data into Neptune.

With your data exported from Blazegraph and available in Amazon S3, your next step is to import the data into Neptune. Neptune has a bulk loader that loads data faster and with less overhead than performing load operations through SPARQL. The bulk load is started by a call to the loader endpoint API, which loads data stored in the identified S3 bucket into Neptune. This loading process happens in three steps: starting the bulk load, monitoring the load status until the job completes, and validating the migrated data.

The following diagram illustrates how these steps are performed within the AWS infrastructure.

You begin the import process by making a request to Neptune to start the bulk load. Although this is possible via a direct call to the loader REST endpoint, you must have access to the private VPC in which the target Neptune instance runs. You could set up a bastion host, SSH into that machine, and run the cURL command, but the Neptune Workbench is an easier method.

The Neptune Workbench is a preconfigured Jupyter notebook, hosted as an Amazon SageMaker notebook, with several Neptune-specific notebook magics installed. These magics simplify common Neptune interactions, such as checking the cluster status, running SPARQL and Gremlin traversals, and running a bulk loading operation.

To start the bulk load process, use the %load magic, which provides an interface for running the Neptune loader API.
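Behind the scenes, the %load form submits a request to the Neptune loader endpoint. For reference, the following is a sketch of the equivalent direct call; the cluster endpoint, bucket, IAM role ARN, and Region are placeholders, and the request must originate from inside the Neptune VPC (for example, from the Workbench notebook or a bastion host).

    import requests

    NEPTUNE_ENDPOINT = "https://your-neptune-cluster:8182"  # placeholder cluster endpoint

    payload = {
        "source": "s3://my-blazegraph-export/neptune-data/",               # S3 location of the exported files
        "format": "nquads",                                                # matches the exported data format
        "iamRoleArn": "arn:aws:iam::123456789012:role/NeptuneLoadFromS3",  # role granting Neptune read access to S3
        "region": "us-west-2",                                             # Region of the bucket and cluster
        "failOnError": "FALSE",
    }

    # Start the bulk load; the call returns immediately with a loadId for tracking.
    response = requests.post(f"{NEPTUNE_ENDPOINT}/loader", json=payload)
    response.raise_for_status()
    print(response.json()["payload"]["loadId"])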

The result contains the status of the request. Bulk loads are long-running processes; this response doesn't mean that the load is complete, only that it has begun. The status updates periodically to reflect the most recent state of the loading job; when loading finishes, you receive the final job status.
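For example, the following sketch checks a job's status directly against the loader endpoint (the cluster endpoint and load ID are placeholders):

    import requests

    NEPTUNE_ENDPOINT = "https://your-neptune-cluster:8182"  # placeholder cluster endpoint
    LOAD_ID = "your-load-id"                                # returned by the load request

    # Query the loader endpoint for the current status of the bulk load job.
    status = requests.get(f"{NEPTUNE_ENDPOINT}/loader/{LOAD_ID}").json()
    print(status["payload"]["overallStatus"]["status"])     # e.g., LOAD_IN_PROGRESS or LOAD_COMPLETED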

When your loading job has completed successfully, your data is loaded into Neptune and you're ready to move on to the final step of the import process: validating the data migration.

As with any data migration, you can validate that the data migrated correctly in several ways. These tend to be specific to the data you're migrating, the confidence level required for the migration, and what is most important in the particular domain. In most cases, these validation efforts involve running queries that compare the before and after data values.

To make this easier, the Neptune Workbench notebook has a magic (%%sparql) that simplifies running SPARQL queries against your Neptune cluster. See the following code.
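(A simple validation check is to count all statements in the graph and compare the result with the same count run against Blazegraph before the migration; the cell below is a sketch of that check.)

    %%sparql
    SELECT (COUNT(*) AS ?count)
    WHERE { ?s ?p ?o }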

This Neptune-specific magic runs SPARQL queries against the associated Neptune instance and returns the results in tabular form.

The last thing to investigate is any application changes you may need to make because of differences between Blazegraph and Neptune. Luckily, both Blazegraph and Neptune are compatible with SPARQL 1.1, meaning that you can change your application configuration to point to your new Neptune SPARQL endpoint, and everything should work.
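As an illustration, the following sketch points a plain Python SPARQL client at a Neptune endpoint (the cluster endpoint is a placeholder, the client must have network access to the Neptune VPC, and the sketch assumes IAM database authentication is not enabled; if it is, requests must be signed with SigV4):

    import requests

    # Placeholder cluster endpoint; Neptune exposes its SPARQL 1.1 endpoint at /sparql.
    SPARQL_ENDPOINT = "https://your-neptune-cluster:8182/sparql"

    response = requests.post(
        SPARQL_ENDPOINT,
        data={"query": "SELECT (COUNT(*) AS ?count) WHERE { ?s ?p ?o }"},
        headers={"Accept": "application/sparql-results+json"},
    )
    response.raise_for_status()
    print(response.json()["results"]["bindings"][0]["count"]["value"])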

However, as with any database migration, several differences exist between the Blazegraph and Neptune implementations that may impact your migration. The following major differences require changes to queries, the application architecture, or both, as part of the migration process:

Neptune also offers several additional features that Blazegraph doesn't:

This post examined the process for migrating from an on-premises or self-hosted Blazegraph instance to a fully managed Neptune database. A migration to Neptune not only satisfies an application's development requirements, it also satisfies the operational requirements of business-critical applications. Additionally, this migration unlocks many advantages, including cost optimization, better integration with native cloud tools, and a lower operational burden.

It's our hope that this post provides you with the confidence to begin your migration. If you have any questions, comments, or other feedback, we're always available through your Amazon account manager or via the Amazon Neptune Discussion Forums.

Dave Bechberger is a Sr. Graph Architect with the Amazon Neptune team. He used his years of experience working with customers to build graph database-backed applications as inspiration to co-author Graph Databases in Action by Manning.
