neo4j pt3 : RDBS/CSV to Graph Database
Category : Datascience Tag : gds
June 13, 2021, 4:52 p.m.

Short Description :
How to migrate from RDBS to Neo4j data warehouse as graph database format, showing data pipeline
source : datak

<h2>About</h2><h4>Relational Database vs Graph Database</h4><p>Graph database requires unique data structure so it could be used for an analysis. This article explains how to prepare graph dataset, migrating from a relational database to Neo4j graph database.</p><p>The transformation is iterative and deliberate, emphasizing the conceptual shift from relational tables to the nodes and relationships of a graph.</p><p><span id="docs-internal-guid-6ef7eaef-7fff-a633-c369-01bd8bb8fbd9"><span style="font-size: 11pt; font-family: Arial; background-color: transparent; font-variant-numeric: normal; font-variant-east-asian: normal; vertical-align: baseline; white-space: pre-wrap;"><span style="border:none;display:inline-block;overflow:hidden;width:624px;height:340px;"><img src="https://lh6.googleusercontent.com/obGJ8jiIwaVeZBblA4hW1i5jnj3naPgE1O-gMKJcrMwvPtnRD4GFI_xFQMqglk61RImvtFxtmxVUo-_oT6H-iZ0bG2GX-Bnrq7QrbQTFNPyPLtr-4nsINv0dqXv36nZYz-hbiiiU" width="624" height="340" style="margin-left: 0px; margin-top: 0px; width: 100%;"></span></span></span><br></p><p><br></p><h4>Graph Database vs Other NoSQL</h4><p>Graph database is a kind of NoSQL, with the relationships between data more focused.</p><p><span id="docs-internal-guid-3c65e25a-7fff-45e1-738d-404b244f288f"><span style="font-size: 11pt; font-family: Arial; background-color: transparent; font-variant-numeric: normal; font-variant-east-asian: normal; vertical-align: baseline; white-space: pre-wrap;"><span style="border:none;display:inline-block;overflow:hidden;width:624px;height:347px;"><img src="https://lh3.googleusercontent.com/71LcOAVyN8Y6EB6mthXim2KCcNndtYHIPSegHzgfF436Ts9yTt8TyJigY3r6PBojTQ_EgcuXo0jBjP_IwfI6eANlvSzSD6vt9tAF9Ki_KtxyYaBr6WF0cr1oDEW7sZWFCHZWO8S4" width="624" height="347" style="margin-left:0px;margin-top:0px;"></span></span></span><br></p><p><br></p><h4>This Article</h4><p>This article is mainly referenced from<a href="https://neo4j.com/developer/guide-importing-data-and-etl/" target="_blank"> the neo4j documentaion using Northwind dataset</a>. This includes followings procedures;</p><ol><li>Load : Create data from external CSV files</li><li>Index : Index nodes based on label</li><li>Relate : Transform foreign key references into data relationships</li></ol><h4><br></h4><h4>Dataset</h4><p>Below is RDBS mapping from Northwind datasets, we will scale down for small demonstration purpose focusing on 6 un-shaded tables. Northwind data is from merchant like e-commerce.</p><p><img src="/media/django-summernote/2021-06-15/d0022fc5-0014-474f-a00d-8e9ea2735636.png" style="width: 100%;"><br></p><p><br></p><h2>Data Preparation</h2><h4>1. Load Records</h4><p>Create a product catalog, which has `Products`, `Categories`, and `Suppliers` tables.</p><p>The `LOAD` syntax requires public internet access. `LOAD CSV` will retrieve a CSV file from a valid URL such as github, or other open api endpoints, applying a Cypher statement to each row using a named map (in the demo we will be using the name called `row`).</p><pre># Cypher #Load products.csv LOAD CSV WITH HEADERS FROM "http://data.neo4j.com/northwind/products.csv" AS row CREATE (n:Product) SET n = row, n.unitPrice = toFloat(row.unitPrice), n.unitsInStock = toInteger(row.unitsInStock), n.unitsOnOrder = toInteger(row.unitsOnOrder), n.reorderLevel = toInteger(row.reorderLevel), n.discontinued = (row.discontinued &lt;&gt; "0")</pre><p>The graph view for product is;</p><pre># Cypher #for small data MATCH (p:Product) RETURN p #for big data to avoid out-of-memory error MATCH (p:Product) RETURN p LIMIT 100</pre><p><img src="/media/django-summernote/2021-06-15/5709cc54-66bd-4449-a653-8d8b7116105a.png" style="width: 877px;"></p><p>Load other 2 tables - categories and suppliers, which will be connected through relationship later</p><pre># Cypher #Load categories.csv LOAD CSV WITH HEADERS FROM "http://data.neo4j.com/northwind/categories.csv" AS row CREATE (n:Category) SET n = row #Load suppliers.csv LOAD CSV WITH HEADERS FROM "http://data.neo4j.com/northwind/suppliers.csv" AS row CREATE (n:Supplier) SET n = row</pre><p><br></p><h4>2. Create Indexes</h4><p>After the nodes are created, relationship is required to create between nodes. Importing the relationships will mean that we look up the nodes which is just created and add a relationship between those existing entities. Relationship modeling needs indexes to connect one graph database to another, which often uses id or another universal unique value.</p><pre># Cypher #create index for product using productID CREATE INDEX ON :Product(productID) #create index for category using categoryID CREATE INDEX ON :Category(categoryID) #create index for supplier using supplierID CREATE INDEX ON :Supplier(supplierID)</pre><h4><p><span style="font-size: 14px;">After we execute above indexes we can execute following code to view the indexes in the database just to confirm</span><br></p></h4><pre># Cypher CALL db.indexes()</pre><p><br></p><h4>3. Create the relationships between the nodes</h4><p>With the initial nodes and indexes being in place, we can now create the relationships between products, suppliers, and product categories.</p><pre># Cypher #match product and category MATCH (p:Product),(c:Category) WHERE p.categoryID = c.categoryID CREATE (p)-[:PART_OF]-&gt;(c) #match product and supplier MATCH (p:Product),(s:Supplier) WHERE p.supplierID = s.supplierID CREATE (p)&lt;-[:SUPPLIES]-(s)</pre><p>Check graph map overview<br></p><pre># Cypher MATCH (c:Category)-[]-(p:Product)-[]-(s:Supplier) RETURN c,p,s #big data MATCH (c:Category)-[]-(p:Product)-[]-(s:Supplier) RETURN c,p,s LIMIT 25</pre><p><img src="/media/django-summernote/2021-06-15/64f5064f-b66c-4f5b-87b1-97008a5037b6.png" style="width: 100%;"></p><p><br></p><h4>Query Example for Product</h4><p>List the product categories provided by each supplier</p><pre># Cypher MATCH (s:Supplier)--&gt;(:Product)--&gt;(c:Category) #make sure 2dash RETURN s.companyName AS Company, collect(DISCINCT c.categoryName) AS Categories</pre><p>Find the produce suppliers</p><pre># Cypher #for graph data MATCH (c:Category {categoryName:"Produce"})-[]-(p:Product)-[]-(s:Supplier) RETURN c,p,s #for a list MATCH (c:Category {categoryName:"Produce"})&lt;--(:Product)&lt;--(s:Supplier) RETURN DISTINCT s.companyName as ProduceSuppliers</pre><p><br></p><h4>Load Customer and Order table and create relations for these nodes</h4><p>Same procedure for customer and order table</p><pre># Cypher #load customers.csv LOAD CSV WITH HEADERS FROM "http://data.neo4j.com/northwind/customers.csv" AS row CREATE (c:Customer) SET c = row #load orders.csv LOAD CSV WITH HEADERS FROM "http://data.neo4j.com/northwind/orders.csv" AS row CREATE (o:Order) SET o = row #create index CREATE INDEX ON :Customer(customerID) CREATE INDEX ON :Order(orderID) #create data relationships MATCH (c:Customer),(o:Order) WHERE c.customerID = o.customerID) CREATE (c)-[:PURCHASED]-&gt;(o) #take a look at data graph MATCH (c:Customer)-[]-(o:Order) RETURN c,o #see what Maria Anders purchased MATCH (c:Customer {contactName:"Maria Anders"})-[]-(o:Order) RETURN c,o</pre><p><img src="/media/django-summernote/2021-06-15/2c428adf-1f2e-4e76-be9f-d4c8c81ed58b.png" style="width: 100%;"></p><p><img src="/media/django-summernote/2021-06-15/9f5046c6-ee9a-4229-b2ec-4c33c9d03275.png" style="width: 100%;"></p><p>You will notice that Order label is automatically by `shipName`, but you might prefer to show order ID instead. You can change the `caption` as followings;</p><ol><li>Click `Order` tab at top left</li><li>Click `triangle` button at bottom right to expand bottom `order` box</li><li>You should see many `Caption` candidate and select what you want</li></ol><p>For example, I'm changing from `shipName` above to `orderID`</p><p><img src="/media/django-summernote/2021-06-15/0d3859f1-b145-4c1b-923f-87ed005019ac.png" style="width: 100%;"></p><p><br></p><h4>Create Order Product Graph</h4><p>Order-detail table connects order table/information with product table/information, and this is called as `join table`. This case is a unique and can't use same codes above, but we need `a join table` to connect.</p><pre># Cypher #load csv, and create a connection LOAD CSV WITH HEADERS FROM "http://data.neo4j.com/northwind/order-details.csv" AS row MATCH (p:Product),(o:Order) WHERE p.productID = row.productID AND o.orderID = row.orderID CREATE (o)-[details:ORDERS]-&gt;(p) SET details = row, details.quantify = toInteger(row.quantity)</pre><p>Quick look</p><pre># Cypher MATCH (p:Product)-[]-(o:Order)-[]-(c:Customer) RETURN p,o,c LIMIT 25</pre><p><img src="/media/django-summernote/2021-06-15/e18a319a-1402-4fec-8ddf-20bbf9221399.png" style="width: 100%;"></p><p><br></p><p>Create example query to get total products purchased for named `produce` products</p><pre># Cypher MATCH (cust:Customer)-[:PURCHASED]-&gt;(:Order)-[o:ORDERS]-&gt;(p:Product), &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;(p)-[:PART_OF]-&gt;(c:Category {categoryName: "Produce"}) RETURN DISTINCT cust.contactName as CustomerName, SUM(o.quantity) AS TotalProductsPurchased</pre><p><br></p><p>We will see final graph data relationships between nodes by;</p><pre># Cypher call db.schema.visualization</pre><p><img src="/media/django-summernote/2021-06-15/2c0aee46-0d0d-41f1-8c4e-1f5925379035.png" style="width: 100%;"></p><p><br></p><p>To show everything, type code below, note that product has multiple connection and can't use one line of code to show all connections so split into 2 line of codes at `MATCH`;</p><pre># Cypher MATCH (c:Customer)-[]-(o:Order)-[]-(p:Product)-[]-(s:Supplier), &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;(p:Product)-[]-(cat:Category) RETURN c,o,p,s,cat</pre><p><img src="/media/django-summernote/2021-06-15/24292150-d518-4321-bdd4-2e8ecbe1f71b.png" style="width: 100%;"></p><p>Graph visualization is too complicated to tell, but you will still see high level stats of information for nodes and connections at top left</p><p><br></p><p>For a sandbox, it'd be good to delete working data but to save the script somewhere</p><pre># Cypher #delete graph data MATCH (n) DETACH DELETE n</pre><p><br></p><p></p>


<< Back to Blog Posts
Back to Home

Related Posts

neo4j pt4 : exploratory data analysis / data profiling (APOC)
gds
Jun 13 2021
Part4 includes how to do data profiling : a holistic view of data using neo4j
neo4j pt5 : Visualization(Bloom)
Jun 25 2021
Bloom overview
neo4j pt6 : Graph Data Science 1
gds
Jun 26 2021
Part 6 includes how to use graph data science in neo4j
neo4j pt7 : Graph Data Science 2
Jun 27 2021
How to run graph data science algorithm in neo4j Cypher query
neo4j pt2 : query basics (Cypher)
gds
Jun 12 2021
Overview of graph data science with neo4j. Pt2 includes cypher query basics with toy movie data
neo4j pt1 : overview
gds
Jun 10 2021
Overview of graph data science with neo4j



© DATAK 2024