DATAK | neo4j pt7 : Graph Data Science 2

neo4j pt7 : Graph Data Science 2

Category : Datascience Tag : gds 101series NoSQL

June 27, 2021, 8:22 p.m.

Short Description :

How to run graph data science algorithm in neo4j Cypher query

source : datak

<p>This is mainly referenced by <a href="https://neo4j.com/docs/graph-data-science/current/introduction/" target="_blank">neo4j graph data science documents</a></p><p><br></p><ol><li>Graph data set and Exploratory Data Analysis</li><li>Memory Management</li><li>Graph Creation</li><li>Run Algorithm<ul><li>Page Rank</li><li>Label Propagation</li><li>Weakly Connected Components</li><li>Louvain</li><li>Node Similarity</li><li>Triangle Count</li><li>Local Clustering Coefficient</li></ul></li></ol><p><br></p><h2>1. Graph Data Preparation and Exploratory Data Analysis</h2><p>We will use built-in data from `Game of Thrones` fantasy saga.</p><pre>// Cypher //Load graph data base tutorial with game of thrones datasets :play <a href="https://guides.neo4j.com/sandbox/graph-data-science/index.html" target="_blank">https://guides.neo4j.com/sandbox/graph-data-science/index.html</a></pre><p><br></p><p>Exploratory Data Analysis</p><pre>// Cypher //high level node relationship Call db.schema.visualization() //summary statistics for number of interactions per character MATCH (c:Person)-[:INTERACTS]->() WITH c, count(*) AS num RETURN min(num) AS min, max(num) AS max, avg(num) AS avg_interactinos, stdev(num) AS stdev //summary stats by same grouped by book MATCH (c:Person)-[r:INTERACTS]->() WITH r.book AS book, c, count(*) AS num RETURN book, min(num) AS min, max(num) AS max, avg(num) AS avg_interactions, stdev(num) AS stdev ORDER BY book //high level stats call apoc.meta.stats()</pre><p><br></p><h2>2. Memory Management</h2><p>Assume you have a data and know something about its shape by EDA, you then need to estimate the memory usage of your graph and its algorithms and configure for Neo4j server. This is because;</p><blockquote><p>The graph algorithms run on an in-memory, heap-allocated projection of the Neo4j graph, which resides outside the main database. This means that before you execute an algorithm, you must create a projection of your graph in memory.</p></blockquote><p><br></p><h4>Estimate memory usage for graph creation</h4><pre>// Cypher //to estimate the required memory for a subset of your graph, 'Person' nodes and 'INTERACTS' relationships as an example CALL gds.graph.create.estimate('Person', 'INTERACTS') YIELD nodeCount, relationshipCount, requiredMemory</pre><p><br></p><h4>Create projected graph</h4><pre>// Cypher //create `got-interactions` graph CALL gds.graph.create('got-interactions', 'Person', 'INTERACTS')</pre><p><br></p><h4>Estimate memory usage for algorithm running</h4><pre>// Cypher //estimate the memory needed to execute an algorithm on your `got-interactions` graph, `Page Rank` as an example CALL gds.pageRank.stream.estimate('got-interactions')</pre><p><br></p><h4>Clean up</h4><blockquote><p>If you do not want to use the projected graph anymore, release it from the memory</p></blockquote><pre>// Cypher <p> </p>CALL gds.graph.drop('got-interactions');</pre><p><br></p><h2>3. Graph Creation</h2><blockquote><p>To enable fast caching of the graph topology, containing only the relevant nodes, relationships, and weights, the GDS library operates on in-memory graphs that are created as projections of the Neo4j stored graph.</p></blockquote><p><br></p><h4>Graph Catalog</h4><blockquote><p>The typical workflow is to create the projected graph `explicitly` by giving it a name and storing it in the `graph catalog`. This allows you to operate on the graph multiple times.</p></blockquote><p>You've created graph catalog at section 1 so let's start with removing by ;</p><pre>// Cypher Call gds.graph.drop('got-interactions')</pre><p><br></p><p>Create again, but since each `INTERACTS` relationship is symmetic, you can even ignore its direction by creating your garph with an `UNDIRECTED` orientation.</p><pre>// Cypher Call gds.graph.create('got-interactions', 'Person', {     INTERACTS: {         orientation: 'UNDIRECTED'     } })</pre><p><br></p><h2>4. Run Algorithms</h2><h4>Syntax : explicit graphs</h4><blockquote><p>Running algorithms on explicitly created graphs allows you to operate on a graph multiple times. To do this, refer to the graph by its name, as it is stored in the graph catalog.</p></blockquote><pre>// Cypher CALL gds.<algo-name>.<mode>( graphName: String, configuration: Map )</pre><ul><li><span style="background-color: rgb(255, 255, 255); color: rgb(0, 0, 0); font-family: "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 14px;"><algo-name> is the algorithm name</span></li><li><span style="background-color: rgb(255, 255, 255); color: rgb(0, 0, 0); font-family: "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 14px;"><mode> is the algorithm execution mode</span><ul><li><span style="background-color: rgb(255, 255, 255); color: rgb(0, 0, 0); font-family: "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 14px;">`write` : writes results to the Neo4j database and returns a summary of the results</span></li><li><span style="background-color: rgb(255, 255, 255); color: rgb(0, 0, 0); font-family: "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 14px;">`stats` : same as `write` but does not write to the Neo4j database</span></li><li><span style="background-color: rgb(255, 255, 255); color: rgb(0, 0, 0); font-family: "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 14px;">`stream` : streams results back to the user</span></li></ul></li><li><span style="background-color: rgb(255, 255, 255); color: rgb(0, 0, 0); font-family: "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 14px;">The `graphName` parameter value is the name of the graph from the graph catalog</span></li><li><span style="background-color: rgb(255, 255, 255); color: rgb(0, 0, 0); font-family: "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 14px;">The `configuration` parameter value is the algorithm-specific configuration</span></li></ul><p></p><p style="line-height: 1.42857;"><br></p><h4 style="line-height: 1.42857;">Page Rank Algorithm</h4><blockquote style="line-height: 1.42857;">Page Rank is an algorithm that measures the transitive influence and connectivity of nodes to find the most influential nodes in a graph.<br>It computes an influence value for each node, called a score. As a results, the score of a node is a certain weighted average of the scores of its direct neighbors.<br>PageRank is an iterative algorithm. In each iteration, every node propagates its score evenly divided to its neighbors.<br>The algorithm runs for a configurable maximum number of iterations (default is 20), or until the node scores converge. That is, when the maximum change in node score between two sequential iterations is smaller than the configured `tolerance` value.</blockquote><pre style="line-height: 1.42857;">// Cypher //Run a basic PageRank call in `stream` mode CALL gds.pageRank.stream('got-interactions') YIELD nodeId, score RETURN gds.util.asNode(nodeId).name AS name, score ORDER BY score DESC LIMIT 10 //compare the Page Rank of each `Person` node with the number of interactions for that node CALL gds.pageRank.stream('got-interactions') YIELD nodeId, score AS pageRank WITH gds.util.asNode(nodeId) AS n, pageRank MATCH (n)-[i:INTERACTS]-() RETURN n.name AS name, pageRank, count(i) AS interactions ORDER BY pageRank DESC LIMIT 10</pre><p style="line-height: 1.42857;">You will see that Page Rank results is not always follows number of interactions.</p><p style="line-height: 1.42857;"><br></p><h4 style="line-height: 1.42857;">Label Propagation</h4><blockquote style="line-height: 1.42857;"><p>Label Propagation (LPA) is a fast algorithm for finding communities in a graph. It propagates labels throughout the graph and forms communities of nodes based on their influence.</p></blockquote><pre>// Cypher // create in-memory graph for LPA run CALL gds.graph.create( 'got-interactions-weighted', 'Person', { INTERACTS: { orientation: 'UNDIRECTED', properties: 'weight' } } ) // run LPA with just on iteration CALL gds.labelPropagation.stream( 'got-interactions-weighted', { relationshipWeightProperty: 'weight', maxIterations: 1 } ) YIELD nodeId, communityId RETURN communityId, count(nodeId) AS size ORDER BY size DESC LIMIT 5 // the algorithm typically needs multiple iterations to achieve a stable result. you would need to repeat to change `maxIterations` multiple time to see when it's converged // clean up graphs from catalog for LPA for next metrics trial CALL gds.graph.drop('got-interactions-weighted');</pre><p><br></p><p><br></p><p><br></p><p><br></p>

<< Back to Blog Posts
Back to Home

view article

neo4j pt1 : overview

Datascience

gds

Jun 10 2021

Overview of graph data science with neo4j

view article

neo4j pt2 : query basics (Cypher)

Datascience

gds

Jun 12 2021

Overview of graph data science with neo4j. Pt2 includes cypher query basics with toy movie data

view article

neo4j pt3 : RDBS/CSV to Graph Database

Datascience

gds

Jun 13 2021

How to migrate from RDBS to Neo4j data warehouse as graph database format, showing data pipeline

view article

neo4j pt4 : exploratory data analysis / data profiling (APOC)

Datascience

gds

Jun 13 2021

Part4 includes how to do data profiling : a holistic view of data using neo4j

view article

neo4j pt6 : Graph Data Science 1

Datascience

gds

Jun 26 2021

Part 6 includes how to use graph data science in neo4j

view article

neo4j pt5 : Visualization(Bloom)

Datascience

gds NoSQL

Jun 25 2021

Bloom overview

view article

Graph Data Science : Market

Datascience

gds NoSQL

Jul 1 2021

Graph Data Science demonstration for Market analysis

Related Posts