neo4j pt4 : exploratory data analysis / data profiling (APOC)
Category : Datascience Tag : gds
June 13, 2021, 4:55 p.m.

Short Description :
Part4 includes how to do data profiling : a holistic view of data using neo4j
source : datak

<h2>About</h2><p>Part 1 explains Neo4j basic and high level framing, Part 2 explains basic Cypher query, Part 3 explains how to retrieve data from RDBMS/CSV to graph database. In this part 4, we will explain exploratory data abalysis to understand high level data at a macro view.</p><p>Ref :&nbsp;<a href="https://neo4j.com/blog/data-profiling-holistic-view-neo4j/" target="_blank">A holistic view of data using Neo4j</a></p><p><br></p><h2>What is APOC?</h2><p>APOC stands for `Awesome Procedures On Cypher`, which is an add-on library for Neo4j that provides hundreds of procedures and functions adding a lot of useful functionality from highly complex graph algorithms. The functionality includes data integrations, graph algorithms, data conversion.</p><p><br></p><h2>Datasets for EDA</h2><p>We are going to create&nbsp;<a href="https://medium.com/neo4j/world-cup-2018-graph-19fbac0a75db" target="_blank">world cup 2018</a>&nbsp;graph database, which includes 1930 to 2014 worldcup data. Highlevel graph database image is as followings</p><p><img src="/media/django-summernote/2021-06-16/e5292fc9-6ce8-48ec-9d29-f4010786719d.png" style="width: 50%;"><br></p><p><br></p><p>This supports by neo4j browser so start type to see an instructions on how to prepare its graph data;</p><pre>// Cypher :play worldcup-2018</pre><p><br></p><h4>Import the data : schema</h4><p>Create indexes and constraints</p><pre>// Cypher CALL apoc.schema.assert( { Player:['name'], Appearance:['name'] }, { Match:['id'], WorldCup:['year'], Stadium:['name'], Country:['id'], Player:['id'], Squad:['name'] } ); </pre><p><br></p><h4>Import the data : tournaments and teams</h4><pre>// Cypher //load worldcup.csv LOAD CSV WITH HEADERS FROM "https://github.com/mneedham/neo4j-worldcup/raw/master/data/2018/import/worldcups.csv" AS csvLine MERGE (worldCup:WorldCup {year: toInteger(csvLine.year)}) SET worldCup.name = csvLine.name</pre><pre>// Cypher //load teams.csv LOAD CSV WITH HEADERS FROM "https://github.com/mneedham/neo4j-worldcup/raw/master/data/2018/import/teams.csv" AS csvLine MATCH (worldCup:WorldCup {year: toInteger(csvLine.year)}) MERGE (c:Country {id: toInteger(csvLine.teamId)}) ON CREATE SET c.name = csvLine.team MERGE (squad:Squad {name: c.name + " Squad for " + csvLine.year }) SET squad.year = toInteger(csvLine.year) MERGE (c)-[:NAMED_SQUAD]-&gt;(squad)-[:FOR_WORLD_CUP]-&gt;(worldCup); </pre><p><br></p><h4>Import the data : Squads</h4><p>`USING PERIODIC COMMIT 1000` does not work, and might need `call apoc.periodic.iterate()` instead for a batch commit : <a href="https://neo4j.com/labs/apoc/4.0/graph-updates/periodic-execution/#commit-batching" target="_blank">reference</a></p><p>This is a heavy load, but I could load without periodic execution to control transaction sizes in memory</p><pre>// Cypher //USING PERIODIC COMMIT 1000 LOAD CSV WITH HEADERS FROM "https://github.com/mneedham/neo4j-worldcup/raw/master/data/2018/import/squads.csv" AS csvLine MATCH (worldCup:WorldCup {year: toInteger(csvLine.year)}), (c:Country {id: toInteger(csvLine.teamId)}) MATCH (squad:Squad {name: c.name + " Squad for " + worldCup.year }) MATCH (c)-[:NAMED_SQUAD]-&gt;(squad)-[:FOR_WORLD_CUP]-&gt;(worldCup) MERGE (p:Player {id: toInteger(csvLine.playerId)}) ON CREATE SET p.name = csvLine.playerName, p.position = csvLine.playerPosition, p.dob = CASE WHEN csvLine.playerDOB &lt;&gt; "" THEN datetime(csvLine.playerDOB) ELSE null END MERGE (p)-[:IN_SQUAD]-&gt;(squad); </pre><p><br></p><h4>Import the data : Matches</h4><pre>// Cypher //USING PERIODIC COMMIT 1000 LOAD CSV WITH HEADERS FROM "https://github.com/mneedham/neo4j-worldcup/raw/master/data/2018/import/matches.csv" AS csvLine WITH csvLine, apoc.text.regexGroups(csvLine.reasonWin, "\\((\\d) - (\\d)\\)")[0] AS reasonWin WITH apoc.map.merge(csvLine, { h_penalties: toInteger(reasonWin[1]), a_penalties: toInteger(reasonWin[2])}) AS csvLine MATCH (worldCup:WorldCup {year: toInteger(csvLine.year)}) MERGE (match:Match {id: toInteger(csvLine.id)}) SET match.h_score = toInteger(csvLine.h_score), match.a_score = toInteger(csvLine.a_score), match.date = datetime({epochSeconds: apoc.date.parse(apoc.text.replace(csvLine.date, "Local time|-", ""), "s", "dd MMM yyyy HH:mm")}), match.description = csvLine.home + " vs. " + csvLine.away, match.round = csvLine.round FOREACH(i IN CASE WHEN exists(csvLine.reasonWin) THEN [1] ELSE [] END | MERGE (match:Match {id: toInteger(csvLine.id)}) SET match:ExtraTime ) FOREACH(i IN CASE WHEN csvLine.reasonWin contains "penalties" THEN [1] ELSE [] END | MERGE (match:Match {id: toInteger(csvLine.id)}) SET match:Penalties ) WITH * MATCH (home:Country {id: toInteger(csvLine.home_id)}) SET home.code = csvLine.home_code WITH * MERGE (match)-[:HOME_TEAM]-&gt;(home) MERGE (match)&lt;-[homePlayed:PLAYED_IN]-(home) SET homePlayed.score = toInteger(csvLine.h_score), homePlayed.penalties = toInteger(csvLine.h_penalties) WITH * MATCH (away:Country {id: toInteger(csvLine.away_id)}) SET away.code = csvLine.away_code WITH * MERGE (match)-[:AWAY_TEAM]-&gt;(away) MERGE (match)&lt;-[awayPlayed:PLAYED_IN]-(away) SET awayPlayed.score = toInteger(csvLine.a_score), awayPlayed.penalties = toInteger(csvLine.a_penalties) MERGE (match)&lt;-[:CONTAINS_MATCH]-(worldCup) FOREACH(i IN CASE WHEN toInteger(csvLine.year) = 2002 THEN [1] ELSE [] END | MERGE (host1:Country {name: "Korea Republic"}) MERGE (host2:Country {name: "Japan"}) MERGE (host1)&lt;-[:HOSTED_BY]-(worldCup) MERGE (host2)&lt;-[:HOSTED_BY]-(worldCup)) FOREACH(i IN CASE WHEN toInteger(csvLine.year) &lt;&gt; 2002 THEN [1] ELSE [] END | MERGE (host:Country {name: csvLine.host}) MERGE (host)&lt;-[:HOSTED_BY]-(worldCup)) MERGE (stadium:Stadium {name: csvLine.stadium}) MERGE (match)-[:PLAYED_IN_STADIUM]-&gt;(stadium); </pre><p><br></p><h4>Import the data : Lineups</h4><p>This is also heaviest data sets to load, so note that it will take longer.</p><pre>// Cypher //USING PERIODIC COMMIT 1000 //LOAD CSV WITH HEADERS FROM "file:///lineups.csv" AS csvLine LOAD CSV WITH HEADERS FROM "https://github.com/mneedham/neo4j-worldcup/raw/master/data/2018/import/lineups.csv" AS csvLine MATCH (player:Player {id: toInteger(csvLine.player_id)}) MATCH (match:Match {id: toInteger(csvLine.match_id)}) MERGE (app:Appearance {name: player.id + " in match " + match.id}) MERGE (app)-[:IN_MATCH]-&gt;(match) WITH player, app, csvLine CALL apoc.merge.relationship(player, csvLine.type, {}, {}, app) YIELD rel RETURN count(*);<br></pre><p><br></p><h4>Import the data : Events</h4><pre>// Cypher //USING PERIODIC COMMIT 1000 LOAD CSV WITH HEADERS FROM "https://github.com/mneedham/neo4j-worldcup/raw/master/data/2018/import/events.csv" AS csvLine MATCH (match:Match {id: toInteger(csvLine.match_id)}) MATCH (player:Player {id: toInteger(csvLine.player_id)}) MATCH (player)-[:STARTED|:SUBSTITUTE]-&gt;(appearance)-[:IN_MATCH]-&gt;(match) // goals FOREACH(n IN (CASE WHEN csvLine.type IN ["penalty", "goal", "owngoal"] THEN [1] else [] END) | MERGE (appearance)-[:SCORED_GOAL]-&gt;(penalty:Goal {time: csvLine.time, type: csvLine.type}) ) </pre><p><br></p><h2>Exploratory Data Analysis / Data Profiling</h2><p>Data profiling</p><ol><li>Database Schema Analysis (Macro Analysis)</li><li>Node Analysis</li><li>Specific Question on world cup</li></ol><p><br></p><h3>1. Database Schema Analysis</h3><h4>1.1 High level graph database scheme</h4><pre>// Cypher call db.schema.visualization //or we could also use; // call apoc.meta.graph()</pre><p>The worldcup database has 10 modes type and 25 relationship type</p><p><img src="/media/django-summernote/2021-06-16/8ab9c4f3-9111-454e-a098-73b9c83fe651.png" style="width: 100%;"></p><p><br></p><h4>1.2 Count sum of nodes, relationship, and properties</h4><p>`elementType` shows if the info is either `node` or `relationship`</p><pre>// Cypher call apoc.meta.data()</pre><p><img src="/media/django-summernote/2021-06-16/dad516ac-3353-4dc9-bed7-6381c33a478f.png" style="width: 100%;"><br></p><p><br></p><h4>1.3 Stats sum of nodes and relationships</h4><pre>// Cypher call apoc.meta.stats()</pre><p><img src="/media/django-summernote/2021-06-16/cbb33cb0-fb27-4654-9880-ba2c10d4e5a1.png" style="width: 1258px;"></p><p><br></p><h4>1.4 Show all relationship types and node labels / types</h4><pre>// Cypher //relationship types call db.relationshipTypes() //node labels / types call db.labels()</pre><p><br></p><h4>1.5 Count</h4><pre>// Cypher //count all relationships MATCH ()-[]-() RETURN count(r) //count all nodes MATCH (n) RETURN count(n)</pre><p><br></p><h3>2 Node Analysis</h3><h4>2.1 Count nodes by their labels / types</h4><pre>// Cypher //list all node types and counts MATCH (n) RETURN labels(n) AS NodeTypes, count(n) AS NumberOfNodes;</pre><p><br></p><h4>2.2 Property Analysis</h4><p><span style="font-size: 14px;">2.2.1 List all properties of a node and relationship</span></p><pre style="line-height: 1.42857;">// Cypher // list for node/labels of Player as an example MATCH (p:Player) RETURN keys(p) LIMIT 1 // list for properties, this may not work MATCH ()-[f:FOR_WORLD_CUT]-() RETURN keys(f) LIMIT 1</pre><p style="line-height: 1.42857;"><br></p><h3 style="line-height: 1.42857;">3 Specific Questions Answered</h3><h4 style="line-height: 1.42857;">3.1 Who's hosted the World Cup each?</h4><pre style="line-height: 1.42857;">// Cypher MATCH path = (wc:WorldCup)-[:HOSTED_BY]-&gt;(country) RETURN path</pre><p style="line-height: 1.42857;"><br></p><h4 style="line-height: 1.42857;">3.2 Who's hosted the World Cup more than once?</h4><pre style="line-height: 1.42857;">// Cypher MATCH (host:Country)&lt;-[:HOSTED_BY]-(wc) WITH sc, host ORDER BY wc.year WITH host, count(*) AS times, collect(wc.year) AS years WHERE times &gt; 1 RETURN host.name, times, years</pre><p style="line-height: 1.42857;"><br></p><h4 style="line-height: 1.42857;">3.3 Have the hosts ever won the World Cup?</h4><pre style="line-height: 1.42857;">// Cypher MATCH (match:Match {round: 'Final'})&lt;-[hostPlayed:PLAYED_IN]-(host:Country), &nbsp;&nbsp;&nbsp;&nbsp; (host)&lt;-[:HOSTED_BY]-(worldCup), &nbsp;&nbsp;&nbsp;&nbsp; (worldCup)-[:CONTAINS_MATCH]-&gt;(match), &nbsp;&nbsp;&nbsp;&nbsp; (match)&lt;-[oppositionPlayed:PLAYED_in]-(opposition) WHERE (hostPlayed.scoe &gt; oppositionPlayed.score) OR (hostPlayed.penalties &gt; opporitionPlayed.score) RETURN host.name, worldCup.year, hostPlayed.score + "-" + oppositionPlayed.score AS score, opposition.name ORDER BY worldCup.year</pre><p style="line-height: 1.42857;"><br></p><h4 style="line-height: 1.42857;">3.4 Who are the top scores across all the World Cups?</h4><pre style="line-height: 1.42857;">// Cypher MATCH (player)--&gt;(stats)-[:SCORED_GOAL]-&gt;(goal), &nbsp;&nbsp;&nbsp;&nbsp; (stats)-[:IN_MATCH]-&gt;()&lt;-[:CONTAINS_MATCH]-(wc:WorldCup) WHERE goal.type IN ["goal", "penalty"] WITH player,name AS player, count(*) AS goals, &nbsp;&nbsp;&nbsp;&nbsp; collect(DISTINCT wc.year) AS competitions UNWIND competitions AS competition WITH player, goals, competitoin ORDER BY player, goals, competition RETURN player, goals, collect(competitoin) AS competitions ORDER BY goals DESC LIMIT 5</pre><p></p>


<< Back to Blog Posts
Back to Home

Related Posts

neo4j pt3 : RDBS/CSV to Graph Database
gds
Jun 13 2021
How to migrate from RDBS to Neo4j data warehouse as graph database format, showing data pipeline
neo4j pt5 : Visualization(Bloom)
Jun 25 2021
Bloom overview
neo4j pt6 : Graph Data Science 1
gds
Jun 26 2021
Part 6 includes how to use graph data science in neo4j
neo4j pt7 : Graph Data Science 2
Jun 27 2021
How to run graph data science algorithm in neo4j Cypher query
neo4j pt2 : query basics (Cypher)
gds
Jun 12 2021
Overview of graph data science with neo4j. Pt2 includes cypher query basics with toy movie data
neo4j pt1 : overview
gds
Jun 10 2021
Overview of graph data science with neo4j



© DATAK 2024