Node Embeddings

This notebook demonstrates different methods for generating node embeddings and shows how to reduce their dimensionality further so that they can be visualized in a 2D plot.

A node embedding is essentially an array of floating point numbers (length = embedding dimension) that can be used as "features" in machine learning. These numbers approximate the relationship and similarity information of each node and can also be seen as a way to encode the topology of the graph.
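
To make the idea of "similar nodes get similar vectors" concrete, here is a minimal sketch that compares embedding vectors with cosine similarity. The three vectors are hypothetical four-dimensional values chosen for illustration and are not taken from the graph. Cosine similarity is only one common choice of similarity measure and is an assumption here, not something this notebook prescribes.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings (illustrative values only).
embedding_a = [0.26, -0.17, 0.40, 0.05]
embedding_b = [0.25, -0.15, 0.38, 0.07]   # node with a similar neighborhood
embedding_c = [-0.49, 0.13, -0.30, 0.60]  # node with a different neighborhood

similar = cosine_similarity(embedding_a, embedding_b)
different = cosine_similarity(embedding_a, embedding_c)
print(f"a vs. b: {similar:.3f}")
print(f"a vs. c: {different:.3f}")
```

The first pair scores close to 1, the second pair scores much lower. A downstream model that uses the embeddings as features relies on exactly this kind of difference.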

Considerations

Dimensionality reduction loses some information, especially when node embeddings are visualized in only two dimensions. Nevertheless, the plot helps to build an intuition for what node embeddings are and how much of the similarity and neighborhood information they retain. The latter shows in how closely nodes of the same color (and therefore the same community) are placed together, and in how strongly the larger nodes with a high centrality score influence that placement.

If the visualization doesn't show a reasonably clear separation between the communities (colors), here are some ideas for tuning:

  • Clean the data, e.g. filter out the few nodes with an extremely high degree that aren't actually that important.
  • Try directed vs. undirected projections.
  • Tune the embedding algorithm, e.g. use a higher dimensionality.
  • Tune t-SNE, which is used to reduce the node embeddings to two dimensions for visualization.

It is also possible that the node embeddings are already well suited for the downstream task, such as node classification or link prediction, even though their visualization looks poor. In that case it makes sense to check how the whole pipeline performs before tuning the node embeddings in detail.

Note about data dependencies

The PageRank centrality and the Leiden community of each node are also fetched from the graph and need to be calculated beforehand. They make it easier to see in the plot whether the embeddings approximate the structural information of the graph. If these properties are missing, you will only see black dots that are all the same size.



Dimensionality reduction with t-distributed stochastic neighbor embedding (t-SNE)

The following function takes the original node embeddings with a higher dimensionality (e.g. 64 floating point numbers) and reduces them to a two-dimensional array for visualization.

t-SNE converts the similarities between data points into joint probabilities and tries to minimize the Kullback-Leibler divergence between the joint probabilities of the low-dimensional embedding and those of the high-dimensional data.

(see https://opentsne.readthedocs.io)
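
The objective described above can be illustrated with a small, self-contained sketch. It is not the openTSNE implementation. It assumes a fixed Gaussian bandwidth for the high-dimensional similarities (real t-SNE tunes one per point from the perplexity), uses hypothetical toy points, and only evaluates the Kullback-Leibler divergence for two hand-made 2D layouts instead of optimizing it.

```python
import math


def joint_probabilities(points, kernel):
    """Turn pairwise squared distances into a normalized joint distribution."""
    weights = {}
    for i, a in enumerate(points):
        for j, b in enumerate(points):
            if i != j:
                squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
                weights[(i, j)] = kernel(squared_distance)
    total = sum(weights.values())
    return {pair: weight / total for pair, weight in weights.items()}


def gaussian(squared_distance):
    # High-dimensional similarity; sigma is fixed to 1 (simplifying assumption).
    return math.exp(-squared_distance / 2.0)


def student_t(squared_distance):
    # Heavy-tailed similarity that t-SNE uses in the low-dimensional space.
    return 1.0 / (1.0 + squared_distance)


def kl_divergence(p, q):
    """KL(P || Q): sum of p * log(p / q) over all pairs of points."""
    return sum(p_ij * math.log(p_ij / q[pair]) for pair, p_ij in p.items())


# Hypothetical data: two tight pairs of points in three dimensions.
high_dim = [[0.0, 0.0, 0.0], [0.1, 0.0, 0.1], [2.0, 2.0, 2.0], [2.1, 2.0, 1.9]]
good_2d = [[0.0, 0.0], [0.1, 0.1], [3.0, 3.0], [3.1, 2.9]]  # pairs stay together
bad_2d = [[0.0, 0.0], [3.0, 3.0], [0.1, 0.1], [3.1, 2.9]]   # pairs torn apart

p = joint_probabilities(high_dim, gaussian)
kl_good = kl_divergence(p, joint_probabilities(good_2d, student_t))
kl_bad = kl_divergence(p, joint_probabilities(bad_2d, student_t))
print(f"KL divergence, neighbors preserved: {kl_good:.4f}")
print(f"KL divergence, neighbors separated: {kl_bad:.4f}")
```

The layout that keeps neighbors together yields the lower divergence. t-SNE searches for such a layout by gradient descent, which is what the "KL divergence" column in the optimization logs of this notebook tracks.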

1. Java Packages

1.1 Generate Node Embeddings using Fast Random Projection (Fast RP) for Java Packages

Fast Random Projection reduces the dimensionality of the node feature space while preserving most of the distance information. Nodes with similar neighborhoods end up with similar embedding vectors.
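
The following sketch illustrates the general idea in plain Python. It is a heavily simplified illustration under stated assumptions, not the algorithm implementation that produced the results below: the function name, the toy graph, the sparse random start vectors, the two propagation iterations and the equal weighting of iterations are all choices made for this example.

```python
import math
import random


def fast_random_projection_sketch(adjacency, dimension=8, iterations=2, seed=47):
    """Propagate sparse random vectors along relationships and sum the
    normalized result of every iteration (simplified illustration)."""
    rng = random.Random(seed)
    scale = math.sqrt(3.0)
    # Sparse random start vectors: +sqrt(3) or -sqrt(3) with probability 1/6 each, else 0.
    current = {
        node: [rng.choice([scale, 0.0, 0.0, 0.0, 0.0, -scale]) for _ in range(dimension)]
        for node in adjacency
    }
    embedding = {node: [0.0] * dimension for node in adjacency}
    for _ in range(iterations):
        propagated = {}
        for node, neighbors in adjacency.items():
            # Each node takes the average of its neighbors' vectors.
            propagated[node] = [
                sum(current[n][d] for n in neighbors) / max(len(neighbors), 1)
                for d in range(dimension)
            ]
        for node, vector in propagated.items():
            norm = math.sqrt(sum(v * v for v in vector)) or 1.0  # avoid division by zero
            embedding[node] = [e + v / norm for e, v in zip(embedding[node], vector)]
        current = propagated
    return embedding


# Hypothetical dependency graph: "a" and "b" have exactly the same neighbors.
graph = {
    "a": ["c", "d"], "b": ["c", "d"],
    "c": ["a", "b"], "d": ["a", "b"],
    "e": ["f"], "f": ["e"],
}
embeddings = fast_random_projection_sketch(graph)
print(embeddings["a"] == embeddings["b"])
```

Because "a" and "b" share the same neighborhood, they receive the same vector in this sketch, which is the extreme case of "similar neighborhoods give similar vectors".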

👉Hint: To ignore existing node embeddings and always recalculate them with the parameters below, edit Node_Embeddings_0a_Query_Calculated so that it returns no results.

The results have been provided by the query filename: ../cypher/Node_Embeddings/Node_Embeddings_0a_Query_Calculated.cypher
   codeUnitName                                       shortCodeUnitName  projectName                communityId  centrality  embedding
0  org.axonframework.config                           config             axon-configuration-4.10.3  0            0.047302    [0.2620307505130768, -0.16623878479003906, -0....
1  org.axonframework.eventsourcing.eventstore         eventstore         axon-eventsourcing-4.10.3  0            0.037034    [-0.09079179167747498, -0.006333380937576294, ...
2  org.axonframework.eventsourcing.eventstore.inm...  inmemory           axon-eventsourcing-4.10.3  0            0.012211    [0.28706541657447815, 0.0232041347771883, -0.4...
3  org.axonframework.eventsourcing.eventstore.jdbc    jdbc               axon-eventsourcing-4.10.3  0            0.023525    [-0.49391722679138184, 0.13322685658931732, -0...
4  org.axonframework.eventsourcing.eventstore.jdb...  statements         axon-eventsourcing-4.10.3  0            0.015345    [-0.5326073169708252, 0.21447494626045227, -0....

1.2 Dimensionality reduction with t-distributed stochastic neighbor embedding (t-SNE)

This step takes the original node embeddings with a higher dimensionality (e.g. 64 floating point numbers) and reduces them to a two-dimensional array for visualization. For more details, see the declaration of the function "prepare_node_embeddings_for_2d_visualization".

--------------------------------------------------------------------------------
TSNE(early_exaggeration=12, random_state=47, verbose=1)
--------------------------------------------------------------------------------
===> Finding 90 nearest neighbors using exact search using euclidean distance...
   --> Time elapsed: 0.03 seconds
===> Calculating affinity matrix...
   --> Time elapsed: 0.00 seconds
===> Calculating PCA-based initialization...
   --> Time elapsed: 0.00 seconds
===> Running optimization with exaggeration=12.00, lr=9.50 for 250 iterations...
Iteration   50, KL divergence -1.7423, 50 iterations in 0.0553 sec
Iteration  100, KL divergence 1.2066, 50 iterations in 0.0156 sec
Iteration  150, KL divergence 1.2066, 50 iterations in 0.0145 sec
Iteration  200, KL divergence 1.2066, 50 iterations in 0.0147 sec
Iteration  250, KL divergence 1.2066, 50 iterations in 0.0144 sec
   --> Time elapsed: 0.11 seconds
===> Running optimization with exaggeration=1.00, lr=114.00 for 500 iterations...
Iteration   50, KL divergence 0.1902, 50 iterations in 0.0512 sec
Iteration  100, KL divergence 0.1713, 50 iterations in 0.0485 sec
Iteration  150, KL divergence 0.1638, 50 iterations in 0.0451 sec
Iteration  200, KL divergence 0.1624, 50 iterations in 0.0441 sec
Iteration  250, KL divergence 0.1626, 50 iterations in 0.0439 sec
Iteration  300, KL divergence 0.1629, 50 iterations in 0.0454 sec
Iteration  350, KL divergence 0.1629, 50 iterations in 0.0464 sec
Iteration  400, KL divergence 0.1633, 50 iterations in 0.0467 sec
Iteration  450, KL divergence 0.1631, 50 iterations in 0.0452 sec
Iteration  500, KL divergence 0.1629, 50 iterations in 0.0444 sec
   --> Time elapsed: 0.46 seconds



(114, 2)
   codeUnit                                           artifact                   communityId  centrality  x          y
0  org.axonframework.config                           axon-configuration-4.10.3  0            0.047302    0.695418   -0.866715
1  org.axonframework.eventsourcing.eventstore         axon-eventsourcing-4.10.3  0            0.037034    1.817866   -3.779398
2  org.axonframework.eventsourcing.eventstore.inm...  axon-eventsourcing-4.10.3  0            0.012211    -0.300429  -4.027914
3  org.axonframework.eventsourcing.eventstore.jdbc    axon-eventsourcing-4.10.3  0            0.023525    -4.176720  -4.519949
4  org.axonframework.eventsourcing.eventstore.jdb...  axon-eventsourcing-4.10.3  0            0.015345    -4.167485  -4.504787
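
The learning rates and the neighbor count in the t-SNE log above can be reproduced with simple arithmetic. The sketch below assumes that the learning rate was chosen automatically as the number of samples divided by the exaggeration factor, and that the number of nearest neighbors is three times a perplexity of 30. Both are assumptions about the library defaults in use; the log itself does not show the perplexity. The helper name is hypothetical.

```python
n_samples = 114   # number of rows in the reduced array, printed as (114, 2)
perplexity = 30   # assumption: default perplexity, not shown in the log


def auto_learning_rate(sample_count: int, exaggeration: float) -> float:
    """Assumed rule for the automatic learning rate: samples / exaggeration."""
    return sample_count / exaggeration


early_phase_lr = auto_learning_rate(n_samples, 12)   # phase with exaggeration=12.00
final_phase_lr = auto_learning_rate(n_samples, 1)    # phase with exaggeration=1.00
nearest_neighbors = min(n_samples - 1, 3 * perplexity)

print(early_phase_lr)     # 9.5, matches "lr=9.50" in the log
print(final_phase_lr)     # 114.0, matches "lr=114.00" in the log
print(nearest_neighbors)  # 90, matches "Finding 90 nearest neighbors"
```

Since both values depend on the number of nodes, a much larger or smaller graph will show different learning rates in the log even with unchanged t-SNE parameters.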

1.3 Visualization of the node embeddings reduced to two dimensions

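[Figure: 2D t-SNE plot of the Fast Random Projection node embeddings for Java packages (color: community, size: centrality)]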

1.4 Node Embeddings for Java Packages using HashGNN

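HashGNN resembles Graph Neural Networks (GNN) but does not include a model or require training. It combines ideas from GNNs and fast randomized algorithms. For more details see the HashGNN documentation. Here, the previous three steps (generating the node embeddings, reducing their dimensionality with t-SNE, and visualizing them) are combined into one for HashGNN.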

The results have been provided by the query filename: ../cypher/Node_Embeddings/Node_Embeddings_0a_Query_Calculated.cypher
   codeUnitName                                       shortCodeUnitName  projectName                communityId  centrality  embedding
0  org.axonframework.config                           config             axon-configuration-4.10.3  0            0.047302    [0.4330126941204071, -2.1650634706020355, -0.6...
1  org.axonframework.eventsourcing.eventstore         eventstore         axon-eventsourcing-4.10.3  0            0.037034    [0.0, -1.948557123541832, -1.0825317353010178,...
2  org.axonframework.eventsourcing.eventstore.inm...  inmemory           axon-eventsourcing-4.10.3  0            0.012211    [-0.21650634706020355, -1.2990380823612213, -0...
3  org.axonframework.eventsourcing.eventstore.jdbc    jdbc               axon-eventsourcing-4.10.3  0            0.023525    [-0.21650634706020355, -1.2990380823612213, -0...
4  org.axonframework.eventsourcing.eventstore.jdb...  statements         axon-eventsourcing-4.10.3  0            0.015345    [-0.4330126941204071, -1.2990380823612213, -0....
--------------------------------------------------------------------------------
TSNE(early_exaggeration=12, random_state=47, verbose=1)
--------------------------------------------------------------------------------
===> Finding 90 nearest neighbors using exact search using euclidean distance...
   --> Time elapsed: 0.00 seconds
===> Calculating affinity matrix...
   --> Time elapsed: 0.00 seconds
===> Calculating PCA-based initialization...
   --> Time elapsed: 0.00 seconds
===> Running optimization with exaggeration=12.00, lr=9.50 for 250 iterations...
Iteration   50, KL divergence -0.2694, 50 iterations in 0.0654 sec
Iteration  100, KL divergence 1.2165, 50 iterations in 0.0168 sec
Iteration  150, KL divergence 1.2165, 50 iterations in 0.0142 sec
Iteration  200, KL divergence 1.2165, 50 iterations in 0.0142 sec
Iteration  250, KL divergence 1.2165, 50 iterations in 0.0144 sec
   --> Time elapsed: 0.12 seconds
===> Running optimization with exaggeration=1.00, lr=114.00 for 500 iterations...
Iteration   50, KL divergence 0.6516, 50 iterations in 0.0516 sec
Iteration  100, KL divergence 0.6224, 50 iterations in 0.0496 sec
Iteration  150, KL divergence 0.6109, 50 iterations in 0.0465 sec
Iteration  200, KL divergence 0.6109, 50 iterations in 0.0455 sec
Iteration  250, KL divergence 0.6113, 50 iterations in 0.0454 sec
Iteration  300, KL divergence 0.6115, 50 iterations in 0.0461 sec
Iteration  350, KL divergence 0.6115, 50 iterations in 0.0470 sec
Iteration  400, KL divergence 0.6115, 50 iterations in 0.0465 sec
Iteration  450, KL divergence 0.6115, 50 iterations in 0.0527 sec
Iteration  500, KL divergence 0.6115, 50 iterations in 0.0470 sec
   --> Time elapsed: 0.48 seconds



(114, 2)
   codeUnit                                           artifact                   communityId  centrality  x          y
0  org.axonframework.config                           axon-configuration-4.10.3  0            0.047302    3.763831   -6.318581
1  org.axonframework.eventsourcing.eventstore         axon-eventsourcing-4.10.3  0            0.037034    0.742759   -4.933720
2  org.axonframework.eventsourcing.eventstore.inm...  axon-eventsourcing-4.10.3  0            0.012211    -2.386420  5.726051
3  org.axonframework.eventsourcing.eventstore.jdbc    axon-eventsourcing-4.10.3  0            0.023525    -1.278167  -4.108791
4  org.axonframework.eventsourcing.eventstore.jdb...  axon-eventsourcing-4.10.3  0            0.015345    -6.817947  1.249824

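[Figure: 2D t-SNE plot of the HashGNN node embeddings for Java packages (color: community, size: centrality)]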

1.5 Node Embeddings for Java Packages using node2vec

The results have been provided by the query filename: ../cypher/Node_Embeddings/Node_Embeddings_0a_Query_Calculated.cypher
   codeUnitName                                       shortCodeUnitName  projectName                communityId  centrality  embedding
0  org.axonframework.config                           config             axon-configuration-4.10.3  0            0.047302    [0.06730452179908752, -0.33492279052734375, -0...
1  org.axonframework.eventsourcing.eventstore         eventstore         axon-eventsourcing-4.10.3  0            0.037034    [0.14006678760051727, -0.03767256811261177, 0....
2  org.axonframework.eventsourcing.eventstore.inm...  inmemory           axon-eventsourcing-4.10.3  0            0.012211    [-0.0299423485994339, -0.06651098281145096, -0...
3  org.axonframework.eventsourcing.eventstore.jdbc    jdbc               axon-eventsourcing-4.10.3  0            0.023525    [0.6322472095489502, 0.12474610656499863, 0.18...
4  org.axonframework.eventsourcing.eventstore.jdb...  statements         axon-eventsourcing-4.10.3  0            0.015345    [0.7179538607597351, 0.2249232977628708, 0.241...
--------------------------------------------------------------------------------
TSNE(early_exaggeration=12, random_state=47, verbose=1)
--------------------------------------------------------------------------------
===> Finding 90 nearest neighbors using exact search using euclidean distance...
   --> Time elapsed: 0.00 seconds
===> Calculating affinity matrix...
   --> Time elapsed: 0.00 seconds
===> Calculating PCA-based initialization...
   --> Time elapsed: 0.00 seconds
===> Running optimization with exaggeration=12.00, lr=9.50 for 250 iterations...
Iteration   50, KL divergence -0.2703, 50 iterations in 0.0637 sec
Iteration  100, KL divergence 1.1571, 50 iterations in 0.0177 sec
Iteration  150, KL divergence 1.1571, 50 iterations in 0.0148 sec
Iteration  200, KL divergence 1.1571, 50 iterations in 0.0150 sec
Iteration  250, KL divergence 1.1571, 50 iterations in 0.0146 sec
   --> Time elapsed: 0.13 seconds
===> Running optimization with exaggeration=1.00, lr=114.00 for 500 iterations...
Iteration   50, KL divergence 0.3317, 50 iterations in 0.0534 sec
Iteration  100, KL divergence 0.3082, 50 iterations in 0.0482 sec
Iteration  150, KL divergence 0.3005, 50 iterations in 0.0458 sec
Iteration  200, KL divergence 0.2994, 50 iterations in 0.0463 sec
Iteration  250, KL divergence 0.2988, 50 iterations in 0.0463 sec
Iteration  300, KL divergence 0.2988, 50 iterations in 0.0468 sec
Iteration  350, KL divergence 0.2988, 50 iterations in 0.0475 sec
Iteration  400, KL divergence 0.2985, 50 iterations in 0.0469 sec
Iteration  450, KL divergence 0.2984, 50 iterations in 0.0464 sec
Iteration  500, KL divergence 0.2986, 50 iterations in 0.0466 sec
   --> Time elapsed: 0.47 seconds



(114, 2)
   codeUnit                                           artifact                   communityId  centrality  x          y
0  org.axonframework.config                           axon-configuration-4.10.3  0            0.047302    -0.219140  -0.386250
1  org.axonframework.eventsourcing.eventstore         axon-eventsourcing-4.10.3  0            0.037034    3.171413   -1.556873
2  org.axonframework.eventsourcing.eventstore.inm...  axon-eventsourcing-4.10.3  0            0.012211    1.425012   -1.941952
3  org.axonframework.eventsourcing.eventstore.jdbc    axon-eventsourcing-4.10.3  0            0.023525    3.147093   8.328047
4  org.axonframework.eventsourcing.eventstore.jdb...  axon-eventsourcing-4.10.3  0            0.015345    3.148180   8.344025
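
As a small worked example, the reduced coordinates of rows 3 and 4 (the eventstore.jdbc package and the package below it, whose name is truncated in the output) can be compared across the three tables of 2D coordinates in this notebook. The sketch computes the Euclidean distance between the two points for each embedding method. Distances in a t-SNE plot are only loosely meaningful and should not be compared as absolute values, so treat this as a rough indication rather than a quality metric.

```python
import math

# (x, y) coordinates of rows 3 and 4, copied from the three result tables.
coordinates = {
    "Fast Random Projection": ((-4.176720, -4.519949), (-4.167485, -4.504787)),
    "HashGNN": ((-1.278167, -4.108791), (-6.817947, 1.249824)),
    "node2vec": ((3.147093, 8.328047), (3.148180, 8.344025)),
}

distances = {
    method: math.dist(point_a, point_b)
    for method, (point_a, point_b) in coordinates.items()
}

for method, distance in distances.items():
    print(f"{method}: {distance:.3f}")
```

With Fast Random Projection and node2vec the two packages land almost on top of each other, while the HashGNN run places them far apart. That matches the higher final KL divergence in the HashGNN log and hints that its parameters may benefit from the tuning ideas listed under "Considerations".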

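[Figure: 2D t-SNE plot of the node2vec node embeddings for Java packages (color: community, size: centrality)]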