Graph Databases
Over the last decade there has been a rise in the use of non-relational stores such as in-memory, document, multi-model, and graph databases. This post is about the last of these, graph databases. We will look specifically at Neo4j, and at how it can be used to answer questions about highly connected data.
Graph Theory
In mathematics, the study of graphs is known as graph theory. There are many types of graphs, and among them is the directed graph. Directed graphs consist of a set of nodes, connected by directed edges.
The Property Graph Model
Property graphs contain nodes and relationships. Both can carry properties, and nodes can be grouped with labels.
Nodes
Nodes are the main data points of interest in a property graph.
Relationships
Relationships are the directed connections between two nodes.
Properties
Properties are key-value attributes that belong to nodes and/or relationships.
Labels
Labels are used to group nodes together. A node can have zero or more labels.
Image credit: originally uploaded to en.wikipedia by Ahzf (transferred by Obersachse), CC0, https://commons.wikimedia.org/w/index.php?curid=19279472
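The four building blocks above can be seen together in a single statement written in Cypher, the query language introduced later in this post. This is only a sketch: the Person label, the KNOWS relationship type, and the names and year are hypothetical example data, not part of the baseball dataset used below.

```cypher
// Hypothetical example data, for illustration only.
// Two nodes, each with the label Person and a name property.
CREATE (alice:Person {name: 'Alice'})
CREATE (bob:Person {name: 'Bob'})
// A directed relationship of type KNOWS, carrying its own property.
CREATE (alice)-[:KNOWS {since: 2015}]->(bob)
RETURN alice, bob;
```

It is shown to make the model concrete and is not one of the import steps, so there is no need to run it in the sandbox.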
What is Neo4j?
Neo4j is a graph database management system. The system features native graph storage and processing, as well as support for the property graph model.
Using the Neo4j Sandbox
To work with Neo4j without having to install it locally, we will use the Neo4j Sandbox.
Using your browser, open a new tab or window and go to https://neo4j.com/sandbox-v2/.
You should see a splash screen with a dialog.
Click the Start Now button.
The Log in/Sign up dialog page will appear.
Select the best option for you and log in.
After logging in, you should see several sandbox options.
Select the Blank Sandbox, and click Launch Sandbox.
After a few moments you should see a sandbox dialog with tabs across the top.
You will now have a sandbox that is available to you for a few days.
Click the details tab. Make a note of the information displayed and click the Neo4j Browser link to continue.
Neo4j Browser
Once you have completed the steps above, you should be viewing the Neo4j Browser.
In the center of the screen at the top you should see the Query Editor. We will make extensive use of the editor throughout the remainder of this post.
On the left side of the browser, click the database icon. You will see the Database Information panel. It includes the sections Node Labels, Relationship Types, and Property Keys. The database is presently empty, but as data is added those sections will display icons that you can click to execute queries.
Please consult this beginner’s guide for more details on how to use and customize the browser.
Cypher Query Language
Neo4j’s Cypher Query Language is a declarative graph query language that aims to be intuitive and human-readable. Nodes, relationships, and properties are described using ASCII art: nodes are written in parentheses, relationships as arrows, and properties in curly braces. Pattern matching is used to query and update the data stored in the database.
This query returns every node labeled “Companies” whose name property is “Sharp Notions”:
MATCH (company:Companies {name: 'Sharp Notions'})
RETURN company;
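Patterns can also describe relationships. As a minimal sketch that extends the query above, assume the graph also held nodes labeled Employees connected to companies by a WORKS_AT relationship. Both names are hypothetical and appear nowhere else in this post.

```cypher
// Hypothetical: the Employees label and WORKS_AT type are illustrative only.
// ()        describes a node
// -[]->     describes a directed relationship
// {k: v}    describes a property constraint
MATCH (employee:Employees)-[:WORKS_AT]->(company:Companies {name: 'Sharp Notions'})
RETURN employee.name, company.name;
```

The arrow in the pattern mirrors the direction of the relationship stored in the graph.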
Please consult Cypher’s documentation for a better understanding of the queries used later in the post.
Importing Data
To put the concepts covered so far to use, we will create a new property graph by importing data into our sandbox.
The Baseball Databank is a compilation of historical baseball data distributed under Open Data terms. It is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.
The database is updated annually, prior to the start of the next season. Some of the data it contains dates back to the 1870s.
With this in mind, we can use Neo4j to explore historical baseball data through the property graph model.
Creating Nodes
In the next few sections we will create nodes in the database using the Cypher import statements located here.
Each import statement contains the clause USING PERIODIC COMMIT, which is a query hint that may be used to prevent an out-of-memory error from occurring when importing large amounts of data using LOAD CSV.
To avoid issues during import we will execute the import statements one at a time.
LOAD CSV is used to import data from a comma delimited file at a specified location. In this case, we are loading files from a GitHub repository. Neo4j does not require a schema, so we can import at will.
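An import statement built from these two clauses might look roughly like the sketch below. It is not the post's actual import statement: the URL is a placeholder, the batch size of 1000 is an arbitrary choice, and copying every CSV column onto the node with SET is an assumption about how the data is stored.

```cypher
// Sketch only: the URL is a placeholder, not the real file location.
// Commit every 1000 rows so a large file does not exhaust memory.
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'https://example.com/baseballdatabank/People.csv' AS row
// One node per CSV row, labeled after the file's basename.
CREATE (p:People)
// Copy every column of the row onto the node as a property.
SET p = row;
```

LOAD CSV reads every field as a string, so numeric columns stay strings unless they are converted explicitly, for example with toInteger().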
The readme file included with the data contains important information about the history of the database and its structure. This will serve as our guide for the tables we import.
People Nodes
The aforementioned readme file indicates that the Master table is the most important. At some point Baseball Databank renamed Master to People.
We will create People nodes first, as they contain the name, date of birth, and other biographical information for each person.
To import data from the People table, begin by copying the People import statement provided.
Next, click inside the query editor, and hit the ESC key. The editor should expand.
Paste, and then execute the query by using the Run button in the top right of the editor.
As each node is created, it is being given a label of People, matching the basename of the file the data is being pulled from. This will be the convention for each table imported.
After a few moments, the process should complete.
The database is now populated with nodes labeled People.
If you click the database icon you will now see Node Labels and Property Keys are filled in.
Click People in the Node Labels section to execute a query. It will return 25 nodes.
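The query behind that click is, as far as I can tell, equivalent to the following; the exact text the browser generates may differ between versions.

```cypher
// Return an arbitrary sample of 25 nodes carrying the People label.
MATCH (n:People)
RETURN n
LIMIT 25;
```

Raising or removing the LIMIT returns more nodes, at the cost of a more crowded visualization.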
Clicking the label with the number of results at the top of the query pane will update the dialog at the bottom of the pane. This allows the color and size of the nodes to be changed.
To the far right, there is an arrow that when clicked expands to display the caption for a node. Select the playerID to display it as the caption for People nodes.
You can repeat these steps for the other labeled nodes as you deem necessary.
Lastly, if you hover over a node, its properties will be revealed in the bottom panel of the query results.
Let’s continue with the rest of our data imports. The process will be the same as the one we followed above to import People. Copy the matching import statements provided for the Batting, Pitching, Fielding, Teams, and HallOfFame tables. Be sure to import them one at a time.
Creating Relationships
Relationships on a property graph can help with answering questions about our data.
At this point we have created nodes labeled People, Batting, Pitching, Fielding, Teams, and HallOfFame in the database. We will now take a look at creating relationships between these nodes to answer questions such as “Who is in the Hall of Fame?”
Who is in the Hall of Fame?
The ultimate honor for the people involved in baseball is to be inducted into The National Baseball Hall of Fame.
HallOfFame nodes contain a playerID property, which we can use to select People nodes with the corresponding playerID.
Each HallOfFame node also has an inducted property that indicates whether a person was voted into the Hall of Fame.
The following query will return nodes for People who have been inducted to the Hall of Fame, and HallOfFame votes resulting in a person’s induction:
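A minimal sketch of such a query, assuming the inducted property holds the string 'Y' for a successful vote, as the inducted column does in the Baseball Databank CSV file:

```cypher
// Assumption: inducted was imported unchanged as the string 'Y' or 'N'.
MATCH (h:HallOfFame {inducted: 'Y'})
// Find the person whose playerID matches the vote's playerID.
MATCH (p:People {playerID: h.playerID})
RETURN p, h;
```

The two MATCH clauses are joined only by the shared playerID value, not by anything stored in the graph.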
The query returned the nodes we wanted, but there is no relationship between them.
To create a relationship between a person and their induction vote*, execute the following query:
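One way to write it, again assuming inducted holds the string 'Y', is sketched here. MERGE is used rather than CREATE so that running the statement twice does not produce duplicate relationships.

```cypher
// Assumption: inducted was imported unchanged as the string 'Y' or 'N'.
MATCH (h:HallOfFame {inducted: 'Y'})
MATCH (p:People {playerID: h.playerID})
// Connect the person to their successful induction vote.
MERGE (p)-[:WAS_INDUCTED_TO_HALL_OF_FAME]->(h);
```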
We have now created the relationship type WAS_INDUCTED_TO_HALL_OF_FAME.
* Please note that Neo4j does not require nodes to have matching properties in order to create a relationship. Any two nodes can be connected. Here we are using the data’s relational database origins to create relationships.
Returning to graph theory for a moment, we have created a directed edge going out from a People node to a HallOfFame node.
Querying by that relationship, we can easily answer the question “Who is in the Hall of Fame?”
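For instance, a query along these lines lists the playerID of every inductee by following the relationship instead of comparing property values:

```cypher
// Follow the relationship from each person to their induction vote.
MATCH (p:People)-[:WAS_INDUCTED_TO_HALL_OF_FAME]->(:HallOfFame)
RETURN DISTINCT p.playerID AS playerID
ORDER BY playerID;
```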
Let’s continue by asking a few more questions, and creating relationships to answer them.
What are this player’s statistics?
Using the following queries, we can create relationships between People (players) and the statistics they have compiled:
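A sketch of what these might look like follows. It assumes the Batting, Pitching, and Fielding nodes kept the playerID column from their CSV files. The relationship types COMPILED_BATTING, COMPILED_PITCHING, and COMPILED_FIELDING are hypothetical names chosen for illustration.

```cypher
// Hypothetical relationship types; assumes each statistics node has a playerID.
// Run each statement separately.
MATCH (b:Batting)
MATCH (p:People {playerID: b.playerID})
MERGE (p)-[:COMPILED_BATTING]->(b);

MATCH (s:Pitching)
MATCH (p:People {playerID: s.playerID})
MERGE (p)-[:COMPILED_PITCHING]->(s);

MATCH (f:Fielding)
MATCH (p:People {playerID: f.playerID})
MERGE (p)-[:COMPILED_FIELDING]->(f);
```

These tables are much larger than HallOfFame, so the statements may take noticeably longer to finish. An index on the playerID property of People would typically speed them up.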
What are this team’s statistics?
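Team statistics can be connected in the same spirit, by linking each statistics node to the team it was compiled for. A minimal sketch, assuming the Batting and Teams nodes kept the teamID and yearID columns from the Baseball Databank CSV files; the relationship type COMPILED_FOR is a hypothetical name:

```cypher
// Hypothetical relationship type; assumes teamID and yearID exist on both labels.
MATCH (b:Batting)
// A team is identified by its ID together with the season.
MATCH (t:Teams {teamID: b.teamID, yearID: b.yearID})
MERGE (b)-[:COMPILED_FOR]->(t);
```

The same pattern would apply to Pitching and Fielding nodes.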
We have finished creating relationships for the purposes of this post. You can create additional relationships by importing the other available tables, and thinking about how the resulting nodes are related.
Remember, any relationship created above can be queried by clicking its icon in the browser.
Revealing Relationships
One of the most powerful features of Neo4j is the ability to reveal a node’s relationships by expanding it within a query result.
The result below displays a Hall of Fame player’s Batting and Fielding statistics, after double-clicking the People node representing the player.
This next result expands the player’s Batting and Fielding statistics for a single season to reveal the team they were compiled for.
If we went on to expand the Teams node, the player’s teammates, their statistics and more would be revealed. This effectively answers “Who were this Hall of Fame player’s teammates?” or “What teams did this Hall of Fame player play for?”
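The second question can also be asked directly in a query. The sketch below leaves the statistics relationships untyped and undirected, so it does not depend on what they were named. It assumes Teams nodes carry teamID and yearID properties, and it uses 'ruthba01', Babe Ruth's playerID in the Baseball Databank, purely as an example.

```cypher
// Example playerID; substitute any inductee's playerID.
MATCH (p:People {playerID: 'ruthba01'})-[:WAS_INDUCTED_TO_HALL_OF_FAME]->(:HallOfFame)
// Walk from the player through their batting lines to the teams.
MATCH (p)--(:Batting)--(t:Teams)
RETURN DISTINCT t.teamID AS team, t.yearID AS season
ORDER BY season;
```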
Conclusion
This was a brief look at graph databases, and how they can help answer questions about connected data using relationships. However, there is much more to understand than can be covered here.
Consider creating a Neo4j sandbox to see if a graph database might fit your use case. Be sure to visit the Neo4j website and documentation for more information.