Google Cayley graph database tutorial - family tree

Intro

This tutorial is an introduction to graph databases using a hypothetical family tree as our data.

We will use the Google Cayley open-source graph database, which comes with a built-in query editor and a visualizer called SigmaJS. The Cayley server will be compiled and run in a virtual machine set up as an Otto Go application.

What is a graph database?

Most people are familiar with relational databases (Oracle, MySQL, PostgreSQL, Microsoft SQL Server) and document-oriented databases (MongoDB, CouchDB), a subtype of NoSQL databases.

Graph databases are another type of NoSQL database. They are based on graph theory and make use of nodes, edges and properties. Graph databases are designed to allow simple and rapid retrieval of complex hierarchical structures that are difficult to model in relational systems.

Graph databases work best when relationships matter as much as the data itself. The most common example is the friend-of-a-friend query, which is easier to implement and faster to run in a graph database than in a relational database with SQL queries. I've used a simple test case in this tutorial, the family tree.
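To make the friend-of-a-friend idea concrete, here is a minimal sketch in plain JavaScript. It is not how a graph database stores or indexes data; it only shows that the question is a two-hop walk over edges. The people and the names `friendEdges`, `outgoing` and `friendsOfFriends` are made up for this example.

```javascript
// Hypothetical data: each edge is [subject, relation, object].
const friendEdges = [
  ["alice", "friend", "bob"],
  ["alice", "friend", "carol"],
  ["bob", "friend", "dave"],
  ["carol", "friend", "dave"],
  ["carol", "friend", "erin"],
];

// Follow every outgoing edge with the given relation from one node.
function outgoing(edges, node, relation) {
  return edges
    .filter(([s, r]) => s === node && r === relation)
    .map(([, , o]) => o);
}

// Two hops along 'friend', minus the start node and the direct friends.
function friendsOfFriends(edges, person) {
  const direct = new Set(outgoing(edges, person, "friend"));
  const result = new Set();
  for (const friend of direct) {
    for (const candidate of outgoing(edges, friend, "friend")) {
      if (candidate !== person && !direct.has(candidate)) {
        result.add(candidate);
      }
    }
  }
  return [...result].sort();
}

console.log(friendsOfFriends(friendEdges, "alice")); // [ 'dave', 'erin' ]
```

In a relational database the same question usually needs a self-join on a friendship table for every extra hop, which is why this example is so often used to motivate graph databases.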

Some examples of graph databases are Neo4j, InfiniteGraph, Dgraph and Google Cayley, the one that we'll use in this article.

Step 1 - clone Google Cayley and create the development environment

We'll clone and compile my fork of Google Cayley from GitHub, because I've implemented a newer version of SigmaJS with a plugin that shows edge labels. These labels will be the relations in our graph (family tree).

Let's fire up Git Bash and start step-by-step:

1. $ mkdir cayley-mihailj

2. $ cd cayley-mihailj

3. $ mkdir -p src/github.com/google

4. $ cd src/github.com/google/

5. $ git clone https://github.com/mihailj/cayley.git

6. $ cd ../../../

7. create 'Appfile' in the root of our project with content:

application {
	name = "cayley-mihailj"
	type = "go"
}

8. compile the Otto application

$ otto compile

9. create the Otto development environment that comes with Go already installed and note the virtual machine IP address

$ otto dev

==> Development environment successfully created!
IP address: 100.123.34.77

10. connect to the virtual machine

$ otto dev ssh

Step 2 - compile Cayley

1. $ export GOPATH=`pwd`

2. $ cd src/github.com/google/cayley/

3. $ go get github.com/tools/godep

4. $ export PATH=$PATH:/vagrant/bin

5. $ cd /vagrant

6. $ cd /vagrant/src/github.com/google/cayley

7. $ godep restore

8. $ go build ./cmd/cayley

9. check that Cayley compiled successfully:

$ ./cayley version
Cayley snapshot

Step 3 - create data source

As a data source we will use a plain text file in the N-Quads format to describe the family tree graph. This format is also used to encode RDF (Resource Description Framework) datasets in the Semantic Web. N-Quads is an extension of N-Triples; the main distinction is that N-Quads allows encoding multiple graphs.

Our test family tree is pretty simple to describe in English:

'There are two brothers, Brian and John Smith. Brian is married to Susan and they don't have children. John is married to Mary. They have 2 children, David and Jennifer. David is married to Lisa and they also have two children, Kevin and Elizabeth. Lisa has a sister Michelle. Their family name is Johnson, but Lisa took David's last name.'

Let's translate this to N-Quads using the family members' IDs and relations, then create a 'familytree.nq' file in the root of our project with this content:

"john" "name" "John Smith" .
"mary" "name" "Mary Smith" .
"brian" "name" "Brian Smith" .
"susan" "name" "Susan Smith" .
"david" "name" "David Smith" .
"jennifer" "name" "Jennifer Smith" .
"lisa" "name" "Lisa Smith" .
"michelle" "name" "Michelle Johnson" .
"kevin" "name" "Kevin Smith" .
"elizabeth" "name" "Elizabeth Smith" .
"john" "spouse" "mary" .
"john" "brother" "brian" .
"brian" "spouse" "susan" .
"john" "child" "david" .
"john" "child" "jennifer" .
"mary" "child" "david" .
"mary" "child" "jennifer" .
"david" "spouse" "lisa" .
"lisa" "brother" "michelle" .
"david" "child" "kevin" .
"david" "child" "elizabeth" .
"lisa" "child" "kevin" .
"lisa" "child" "elizabeth" .
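To get a feel for how simple this file is, here is a small sketch that reads such lines into subject/predicate/object records. It only understands the all-quoted form used in our file, plus an optional fourth quoted term as the graph label. It is not a full N-Quads parser (no IRIs, blank nodes or escape sequences), and `parseSimpleQuads` is a helper invented for this example, not something Cayley provides.

```javascript
// A few lines in the same shape as familytree.nq; the last line adds a
// made-up graph label ("smiths") to show where the fourth term would go.
const sampleQuads = `
"john" "name" "John Smith" .
"john" "spouse" "mary" .
"john" "child" "david" .
"john" "child" "david" "smiths" .
`;

// Three quoted terms, an optional fourth one, then the closing dot.
const QUAD_LINE = /^"([^"]*)"\s+"([^"]*)"\s+"([^"]*)"(?:\s+"([^"]*)")?\s*\.$/;

function parseSimpleQuads(text) {
  const quads = [];
  for (const rawLine of text.split("\n")) {
    const line = rawLine.trim();
    if (line === "") continue; // skip blank lines
    const match = QUAD_LINE.exec(line);
    if (!match) throw new Error(`Unrecognized line: ${line}`);
    const [, subject, predicate, object, label] = match;
    quads.push({ subject, predicate, object, label: label ?? null });
  }
  return quads;
}

const parsedQuads = parseSimpleQuads(sampleQuads);
console.log(parsedQuads.length); // 4
console.log(parsedQuads[0].object); // John Smith
console.log(parsedQuads[3].label); // smiths
```

Every line is one edge of the graph: the subject and object are the vertices and the predicate is the edge label.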

Now we start the Cayley server with this file as the database:

$ ./cayley http --dbpath=/vagrant/familytree.nq --host 0.0.0.0
Cayley now listening on 0.0.0.0:64210

Step 4 - run Gremlin queries and see the graph representation of our family tree

We will run the queries in the Gremlin graph traversal language, using a JavaScript dialect. Google Cayley also implements MQL (Metaweb Query Language), but at the moment it only supports very basic queries without some of the extended features.
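The steps below use the built-in UI, but the same queries can also be sent over HTTP. The sketch below assumes that your Cayley build exposes the Gremlin endpoint under `/api/v1/query/gremlin` and accepts the query text as the POST body; please check this against the documentation of the version you compiled. The helper names `buildGremlinRequest` and `runGremlin` are invented for this example, and `fetch` needs a browser or a recent Node.js.

```javascript
// Assumption: the server from Step 3 is reachable at this address
// (replace with your virtual machine IP).
const CAYLEY_BASE_URL = "http://100.123.34.77:64210";

// Build the request without sending it, so it can be inspected first.
function buildGremlinRequest(baseUrl, query) {
  return {
    url: `${baseUrl}/api/v1/query/gremlin`, // assumed endpoint path
    options: { method: "POST", body: query },
  };
}

// Send the query and return the parsed JSON response.
async function runGremlin(baseUrl, query) {
  const { url, options } = buildGremlinRequest(baseUrl, query);
  const response = await fetch(url, options);
  if (!response.ok) {
    throw new Error(`Cayley returned HTTP ${response.status}`);
  }
  return response.json();
}

const johnsChildrenRequest = buildGremlinRequest(
  CAYLEY_BASE_URL,
  "g.V('john').Out('child').All()"
);
console.log(johnsChildrenRequest.url);
```

With the server running, `runGremlin(CAYLEY_BASE_URL, "g.V('john').Out('child').All()")` would return a promise for the query result.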

1. load 'http://100.123.34.77:64210/ui/visualize' (replace with your virtual machine IP) in a host machine web browser

2. our first query will be something simple. Let's see John's children:

graph.V('john').As('source').Out('child').As('target').All();

[Figure: cayley_graph1]

What do we have in the above graph? The 'source' vertex for ID 'john' and two edges that go to the two 'target' vertices with the children's IDs.

2. we will improve this graph to show the full names instead of IDs:

graph.V('john').Save('name', 'source').As('source').Out('child').Save('name', 'target').As('target').All();

[Figure: cayley_graph2]

Great, this already looks better.

3. now remember, in the first step of this tutorial I told you to clone my fork of Google Cayley, which has edge labels to show the relations. You can make these visible like this:

graph.V('john').Save('name', 'source').As('source').Out('child', 'relation').Save('name', 'target').As('target').All();

[Figure: cayley_graph3]

Now we have a proper representation of John's children with their full names and the relation with their parent.

4. let's introduce John's wife, Mary, in the graph:

graph.V('john').Save('name', 'source').As('source').Out(['child', 'spouse'], 'relation').Save('name', 'target').As('target').All();

[Figure: cayley_graph4]

We can see that Mary is connected to John but not to her children.
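To see why Mary is linked to John but not to David and Jennifer, it can help to imitate the traversal in plain JavaScript. This is only a rough sketch of what V(...).Out([...], 'relation') does with our data, not Cayley's actual implementation; `relationTriples` and `outEdges` are names made up for this example.

```javascript
// A few triples copied from familytree.nq: [subject, predicate, object].
const relationTriples = [
  ["john", "spouse", "mary"],
  ["john", "brother", "brian"],
  ["john", "child", "david"],
  ["john", "child", "jennifer"],
  ["mary", "child", "david"],
  ["mary", "child", "jennifer"],
];

// Imitates V(startIds).Out(predicates, 'relation'): only edges that
// leave one of the start nodes are followed.
function outEdges(triples, startIds, predicates) {
  return triples
    .filter(([s, p]) => startIds.includes(s) && predicates.includes(p))
    .map(([s, p, o]) => ({ source: s, relation: p, target: o }));
}

// Starting from John only: his spouse and his two children.
const fromJohn = outEdges(relationTriples, ["john"], ["child", "spouse"]);
console.log(fromJohn.length); // 3

// Starting from John and Mary: Mary's 'child' edges are added.
const fromBoth = outEdges(relationTriples, ["john", "mary"], ["child", "spouse"]);
console.log(fromBoth.length); // 5
```

Starting from 'john' alone, no edge that leaves 'mary' is ever visited, so her 'child' edges only show up once she is added as a start node.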

5. to connect John and Mary to each other and also to their children, we can run this query:

graph.V('john', 'mary').Save('name', 'source').As('source').Out(['child', 'spouse'], 'relation').Save('name', 'target').As('target').All();

[Figure: cayley_graph5]

6. now also show John's brother, Brian:

graph.V('john', 'mary').Save('name', 'source').As('source').Out(['child', 'spouse', 'brother'], 'relation').Save('name', 'target').As('target').All();

[Figure: cayley_graph6]

Perfect, we now have all first-level relatives of John in our graph.

7. next we'll see how to show John's grandchildren, David's kids:

graph.V('john').Save('name', 'source').As('source').Out('child', 'relation').Out('child', 'relation').Save('name', 'target').As('target').All();

[Figure: cayley_graph7]

Because there is no direct relation between John and his grandchildren in our data, the relation is still labeled 'child': they are David's children.

8. we can better draw the relation between John, his children and grandchildren like this:

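graph.V("john").Save('name', 'source').As('source').Out('child', 'relation').Save('name', 'target').As('target').ForEach(function(d) {
	// emit the edge from John to each of his children
	g.Emit(d);

	// d.id is the child's id: run the same query again starting from that child
	g.V(d.id).Save('name', 'source').As('source').Out('child', 'relation').Save('name', 'target').As('target').ForEach(function(d2) {
		// emit the edge from the child to each grandchild
		g.Emit(d2);
	} );
} );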

[Figure: cayley_graph8]

9. or even show just John's grandchildren by playing with the 'source' and 'relation' parameters (we drew the same graph above in step 4.7, but this time the relation is correctly shown as 'grandchild'):

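graph.V("john").Save('name', 'source').As('source').Out('child', 'relation').As('target').ForEach(function(d) {
	// no Save() on 'target' here, so d.target is the child's id and can be used as a start vertex
	g.V(d.target).As('source').Out('child', 'relation').Save('name', 'target').As('target').ForEach(function(d2) {
		// replace the child with John as the source and rename the relation
		d2.source = d.source;
		d2.relation = 'grandchild';
		g.Emit(d2);
	} );
} );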
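The relabeling trick above can be hard to follow inside nested callbacks, so here is the same idea as a standalone sketch in plain JavaScript. It works on a small in-memory copy of the data rather than on Cayley; `childTriples`, `displayNames`, `childIds` and `grandchildEdges` are names made up for this example.

```javascript
// 'child' triples copied from familytree.nq: [subject, predicate, object].
const childTriples = [
  ["john", "child", "david"],
  ["john", "child", "jennifer"],
  ["david", "child", "kevin"],
  ["david", "child", "elizabeth"],
];

// Full names for the vertices that end up in the result.
const displayNames = {
  john: "John Smith",
  kevin: "Kevin Smith",
  elizabeth: "Elizabeth Smith",
};

// IDs of all children of one parent.
function childIds(triples, parentId) {
  return triples
    .filter(([s, p]) => s === parentId && p === "child")
    .map(([, , o]) => o);
}

// Walk two 'child' hops, but emit a single edge that skips the middle
// generation and is labeled 'grandchild'.
function grandchildEdges(triples, names, grandparentId) {
  const edges = [];
  for (const childId of childIds(triples, grandparentId)) {
    for (const grandchildId of childIds(triples, childId)) {
      edges.push({
        source: names[grandparentId],
        relation: "grandchild",
        target: names[grandchildId],
      });
    }
  }
  return edges;
}

console.log(grandchildEdges(childTriples, displayNames, "john"));
```

The outer loop plays the role of the first ForEach and the inner loop the second one; overwriting 'source' and 'relation' is what turns two 'child' edges into one 'grandchild' edge.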

[Figure: cayley_graph9]

10. as a last example we'll draw the whole family tree:

graph.V("john").Save('name', 'source').As('source').Out(['child', 'brother', 'spouse'], 'relation').Save('name', 'target').As('target').ForEach(function(d) {
	// level 1: John's direct relatives
	g.Emit(d);

	// level 2: relatives of each vertex found at level 1
	g.V(d.id).Save('name', 'source').As('source').Out(['child', 'brother', 'spouse'], 'relation').Save('name', 'target').As('target').ForEach(function(d2) {
		g.Emit(d2);

		// level 3: relatives of each vertex found at level 2
		g.V(d2.id).Save('name', 'source').As('source').Out(['child', 'brother', 'spouse'], 'relation').Save('name', 'target').As('target').ForEach(function(d3) {
			g.Emit(d3);
		} );
	} );
} );

[Figure: cayley_graph10]

Pretty easy, right?
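The query above repeats the same step three times, once per level. As a sketch of the general idea, the nesting can be replaced by a loop with a depth limit. The code below is plain JavaScript over an in-memory copy of the relations from familytree.nq, not a Cayley query; `treeTriples` and `collectEdges` are names invented for this example, and unlike the nested query it expands each vertex only once.

```javascript
// All relation triples from familytree.nq: [subject, predicate, object].
const treeTriples = [
  ["john", "spouse", "mary"],
  ["john", "brother", "brian"],
  ["brian", "spouse", "susan"],
  ["john", "child", "david"],
  ["john", "child", "jennifer"],
  ["mary", "child", "david"],
  ["mary", "child", "jennifer"],
  ["david", "spouse", "lisa"],
  ["lisa", "brother", "michelle"],
  ["david", "child", "kevin"],
  ["david", "child", "elizabeth"],
  ["lisa", "child", "kevin"],
  ["lisa", "child", "elizabeth"],
];

// Breadth-first walk: each pass over 'frontier' is one nesting level
// of the ForEach query.
function collectEdges(triples, startId, predicates, maxDepth) {
  const edges = [];
  const visited = new Set([startId]);
  let frontier = [startId];
  for (let depth = 0; depth < maxDepth && frontier.length > 0; depth++) {
    const next = [];
    for (const node of frontier) {
      for (const [s, p, o] of triples) {
        if (s !== node || !predicates.includes(p)) continue;
        edges.push({ source: s, relation: p, target: o });
        if (!visited.has(o)) {
          visited.add(o);
          next.push(o); // expand this vertex on the next level
        }
      }
    }
    frontier = next;
  }
  return edges;
}

const wholeTree = collectEdges(treeTriples, "john", ["child", "brother", "spouse"], 3);
console.log(wholeTree.length); // 13
```

With a depth of 3 every relation in our file is reached from John, which matches the three nested levels of the query; a deeper family tree would only need a larger `maxDepth`.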

Outro

This tutorial was a quick introduction to graph databases and visualizations. I think that Google Cayley is amazing for this because it comes with a great UI, so you can see results very fast.

Programming can be fun! 😛

Links:

https://en.wikipedia.org/wiki/Graph_database

https://www.w3.org/TR/2014/REC-n-triples-20140225/

https://www.w3.org/TR/n-quads/

https://github.com/google/cayley

https://github.com/google/cayley/blob/master/docs/GremlinAPI.md

https://groups.google.com/forum/#!forum/cayley-users

http://sigmajs.org/
