H-4088: Add knowledge graph glossary and tutorial content #6498
Open
vilkinsons wants to merge 30 commits into main from d/docs
30 commits
6cd29c5 Re-order docs (vilkinsons)
763cbd0 Update copy (vilkinsons)
60ea7cc Update README.md (vilkinsons)
d12f213 Update README.md (vilkinsons)
cf4f913 Update README.md (vilkinsons)
f6a0542 Update README.md (vilkinsons)
6537739 Update README.md (vilkinsons)
ae31226 Update README.md (vilkinsons)
5aac237 Update knowledge-graphs.mdx (vilkinsons)
867f5ea Delete `apps/hashdotai/glossary/url_map.json` (vilkinsons)
745e7a0 Update knowledge-graphs.mdx (vilkinsons)
0de5f72 Create knowledge-graphs-in-healthcare-and-life-sciences.mdx (vilkinsons)
0f03118 Create knowledge-graphs-in-supply-chain-management.mdx (vilkinsons)
ce177a9 Create knowledge-graphs-in-finance.mdx (vilkinsons)
92bc2af Create knowledge-graphs-in-retail-and-ecommerce.mdx (vilkinsons)
03e16ea Update knowledge-graphs-in-finance.mdx (vilkinsons)
24a291a Update knowledge-graphs-in-retail-and-ecommerce.mdx (vilkinsons)
5379b9f Create knowledge-graphs-in-enterprise-knowledge-management.mdx (vilkinsons)
5dd0ddd Merge branch 'main' into d/docs (vilkinsons)
15663ff Create WIP_event-driven-knowledge-graphs.mdx (vilkinsons)
b38b18a Create WIP_labeled-property-graphs.mdx (vilkinsons)
1d392e8 temp (vilkinsons)
c72624d Create `scalars.mdx` (vilkinsons)
2a60430 Create vectors.mdx (vilkinsons)
8596401 Update scalars.mdx (vilkinsons)
fff10ac Create variables.mdx (vilkinsons)
8d67261 Create arrays.mdx (vilkinsons)
be0d284 Update WIP_labeled-property-graphs.mdx (vilkinsons)
ec274ff Update WIP_labeled-property-graphs.mdx (vilkinsons)
7141319 Merge branch 'main' into d/docs (vilkinsons)
10 changes: 10 additions & 0 deletions
apps/hashdotai/glossary/WIP_event-driven-knowledge-graphs.mdx
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,10 @@ | ||
--- | ||
title: Event-Driven Knowledge Graphs | ||
description: "Event-driven knowledge graphs are commonly used to power decision support and simulation tools." | ||
slug: event-driven-knowledge-graphs | ||
tags: ["Graphs"] | ||
--- | ||
|
||
**Event-driven knowledge graphs** sit at the intersection of [discrete event models](/glossary/discrete-event-modeling), which help simulate processes and the state of real-world systems, and [knowledge graphs](/glossary/knowledge-graphs), which help represent objects, events, situations, or concepts (often using [graph databases](/glossary/graph-databases)). | ||
|
||
By introducing an event-driven approach (a.k.a. “event sourcing via streaming platforms”), platforms like HASH can extract and link data from multiple data silos in near real time. In practice, an event-sourcing pipeline streams key data changes (events) from disparate systems, deduplicates and unifies them, and updates a knowledge graph accordingly, resulting in an up-to-date, event-driven knowledge graph. This differs from traditional knowledge graphs, which are typically entity-centric and updated in batches or via periodic processes. Traditional KGs capture mostly static facts and relationships, whereas an event-driven KG continually incorporates dynamic, temporal information (events) as first-class data. As such, event-driven graphs keep evolving to reflect the latest state of the business. |
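The pipeline described above (stream events, deduplicate, unify, update the graph) can be sketched roughly as follows. This is a minimal in-memory illustration, not how HASH or any streaming platform implements it; the `Event` fields, the `EventDrivenGraph` class, and the `customer:42` key are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """A change captured from a source system (all field names are hypothetical)."""
    event_id: str    # unique per change; used for deduplication
    source: str      # originating silo, e.g. "crm" or "billing"
    entity_key: str  # shared key used to unify records across silos
    timestamp: int   # logical time of the change
    changes: dict    # property values introduced by this event


class EventDrivenGraph:
    """Toy graph whose entity state is derived from a stream of events."""

    def __init__(self):
        self.entities = {}   # entity_key -> merged properties
        self.history = []    # events retained as first-class data
        self._seen = set()   # event ids already applied

    def apply(self, event):
        """Apply one event; return False if it was a duplicate delivery."""
        if event.event_id in self._seen:
            return False
        self._seen.add(event.event_id)
        # Unify: events from different silos update the same entity.
        self.entities.setdefault(event.entity_key, {}).update(event.changes)
        self.history.append(event)
        return True


stream = [
    Event("e1", "crm", "customer:42", 1, {"name": "Alice"}),
    Event("e2", "billing", "customer:42", 2, {"plan": "pro"}),
    Event("e1", "crm", "customer:42", 1, {"name": "Alice"}),  # duplicate delivery
]

graph = EventDrivenGraph()
applied = [graph.apply(event) for event in stream]
```

Keeping `history` alongside the merged `entities` is what distinguishes this from a batch-updated graph in the sketch: the events themselves remain queryable, not just the state they produced.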
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,65 @@ | ||
--- | ||
title: Labeled Property Graphs | ||
description: "Labeled Property Graphs consist of nodes (also called vertices) and relationships (edges), each of which can hold associated data." | ||
slug: labeled-property-graphs | ||
tags: ["Graphs"] | ||
--- | ||
|
||
# **Data Modeling Principles** | ||
|
||
## **Structure of Labeled Property Graphs** | ||
|
||
A **labeled property graph (LPG)** consists of *entities* (also called nodes or vertices) and *links* (relationships or edges), each of which can carry attributes that hold relevant information. In an LPG, entities have simple textual labels like `Person` or `Product`, and so do links, for example `Friends With` or `Purchased`. These labels help categorize entities and links. | ||
|
||
In more advanced "LPG inspired" systems like HASH, instead of using simple textual labels, entities and links are assigned one or more [entity types](/guide/types/entity-types) or [link types](/guide/types/link-types) (as appropriate). | ||
|
||
In LPGs, both entities and links can have any number of _properties_, which are key–value pairs storing additional information (for example, a Person node might have properties like `name:"Alice"` and `age:30`). In HASH, both entities and links can contain any number of properties, or other _links_ (allowing links to point to other links, as required). | ||
|
||
In LPGs and HASH, links are usually **directed**, meaning they start and end at specific entities, though in some use cases direction can be ignored or links can be traversed in both directions as needed. | ||
|
||
The LPG model is best understood with a simple example. Consider two `Person` entities and a friendship between them: | ||
|
||
``` | ||
CREATE (p:Person { name: "Alice", age: 30 }) | ||
CREATE (q:Person { name: "Bob", age: 32 }) | ||
CREATE (p)-[:`Friends With`]->(q); | ||
``` | ||
Review comment: Not a suggestion, but just to note that in a directed graph this example means that Alice is friends with Bob, but Bob is not known to be friends with Alice (which is a possible real-world situation!)
|
||
Here we created two entities with the **Person** label (or entity type, if using HASH), which allows certain properties to be associated with them. We've also created a link with the label (or link type) **Friends With** that points from Alice to Bob. In LPGs and HASH, this data is stored as a graph structure – you can later query it by traversing from `Alice` to find all `Friends With` connections, for example. This works very similarly in both, except that in LPGs we query by "label" and in HASH by "type". Both the entities and the links between them have labels/types. In HASH, these types indicate what information can be associated with an entity (the expected attributes: properties and links). For example, one could add a property `Since` with a corresponding value of `2020` on the `Friends With` link to indicate when a friendship started. This enriched graph structure makes LPGs and typed alternatives like HASH extremely expressive for modeling complex domains. | ||
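To make the structure concrete outside of any particular database, here is a minimal in-memory sketch of the same Alice/Bob example. The `Entity` and `Link` classes and the `outgoing` helper are hypothetical names invented for illustration; they do not correspond to HASH's or any graph database's API.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    label: str                                   # e.g. "Person"
    properties: dict = field(default_factory=dict)


@dataclass
class Link:
    label: str                                   # e.g. "Friends With"
    source: Entity                               # where the directed link starts
    target: Entity                               # where the directed link ends
    properties: dict = field(default_factory=dict)


alice = Entity("Person", {"name": "Alice", "age": 30})
bob = Entity("Person", {"name": "Bob", "age": 32})

# The link itself carries a property, as in the `Since` example.
links = [Link("Friends With", alice, bob, {"Since": 2020})]


def outgoing(entity, label, links):
    """Follow links with the given label that start at `entity`."""
    return [l.target for l in links if l.source is entity and l.label == label]


alice_friends = [e.properties["name"] for e in outgoing(alice, "Friends With", links)]
bob_friends = [e.properties["name"] for e in outgoing(bob, "Friends With", links)]
```

Because the link is directed, traversing from Alice finds Bob, while traversing from Bob finds nothing; a symmetric friendship would need either a second link or a traversal that ignores direction.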
|
||
## **LPG vs. Other Graph Models (e.g. RDF)** | ||
|
||
Labeled property graphs are one of the two major graph data modeling paradigms in wide use today, the other being the **RDF** (Resource Description Framework) triple model. While both represent data as networks of connected entities, there are fundamental differences in how data is structured and annotated: | ||
|
||
- **Node properties _vs._ Triples**: In LPGs/HASH, an entity can have attributes stored directly as properties (as in the Alice example above). In RDF, by contrast, there is *no concept of an attribute on an entity* – every piece of information is expressed as a separate triple (subject–predicate–object). For example, to represent a person’s birthdate in RDF, one would create a triple like `(BarackObama) -[birthDate]-> ("1961")`, essentially treating the date "1961" as an object node or literal connected via a predicate. In an LPG, that same fact could simply be a property `birthDate: 1961` on the Barack Obama entity, with no extra link needed. This means RDF tends to produce many more small connecting elements, whereas LPGs can store richer information per entity/link (more analogous to an object in object-oriented programming with fields). | ||
- **Global _vs._ Local identification**: RDF uses globally unique identifiers (URIs) for each entity and link type, aiming for web-scale data integration. Every predicate (link type) and often entities are defined by URIs that can link across datasets. LPG systems meanwhile typically use application-local identifiers (like string names for link types and entity types, which they call "labels") and do not inherently link across databases. This makes property graphs simpler to work with in a closed-world context, whereas RDF is built for interoperability at the cost of some verbosity. [HASH](https://hash.ai/) is a next-generation platform that combines the interoperability and mutual intelligibility of RDF with the expressiveness and customizability of LPG, with all entities, links, entity types, link types and associated resources having fixed URIs which can be relied upon internally (or even published to the world wide web) as desired. | ||
- **Atomic unit of data**: The atomic unit in RDF is the triple. Even a single entity with multiple attributes is essentially a collection of triples sharing the same subject. LPGs do not have a single fixed atomic structure; an entity with properties is a self-contained data structure, and a link with its properties is another. This means an LPG can be thought of as a collection of entities and links (each a small record with key-values), rather than a collection of triples. | ||
- **Schema and semantics**: RDF is tightly connected to the Semantic Web and has a rich standard stack for defining ontologies and schemas (RDF Schema, OWL) that let you formally specify classes, relationships, and even logical inference rules. An RDF graph can be “self-describing” to a degree, as the meaning of relationships and nodes can be defined through shared vocabularies/ontologies. Property graphs, on the other hand, do not enforce any specific global schema or ontology layer; the **interpretation of the labels and properties is left to the consumer** or defined at the application level. This gives LPGs more flexibility (you can add any property to any node without prior schema setup), but it also means that understanding the data’s meaning relies on external documentation or conventions rather than inherent semantics. As a hybrid of RDF and LPG-based approaches, HASH relies upon a [type system](/guide/types) to describe labeled property graphs and ensure interoperability. These types can be kept private or publicly shared, and users can fork, extend, re-use and crosswalk between standardized definitions of entities created by anyone else. This makes HASH well-suited to collaboration within and across companies, while HASH's UI abstracts away complexity and keeps type creation and editing simple. | ||
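The difference in atomic unit can be illustrated with a small sketch that converts one LPG-style record into triples and back. The identifier `ex:BarackObama`, the `type` predicate name, and both helper functions are assumptions made for this example; real RDF would use full URIs and a standard vocabulary.

```python
# Hypothetical LPG-style record: attributes live directly on the entity.
lpg_entity = {
    "id": "ex:BarackObama",  # illustrative identifier, not a real URI
    "label": "Person",
    "properties": {"name": "Barack Obama", "birthDate": "1961"},
}


def to_triples(entity):
    """Flatten one entity into (subject, predicate, object) triples."""
    subject = entity["id"]
    triples = [(subject, "type", entity["label"])]
    triples.extend(
        (subject, key, value) for key, value in entity["properties"].items()
    )
    return triples


def to_entity(triples):
    """Regroup triples that share a subject into one record.

    Assumes all triples have the same subject and that no property
    is itself named "type".
    """
    entity = {"id": triples[0][0], "label": None, "properties": {}}
    for _subject, predicate, obj in triples:
        if predicate == "type":
            entity["label"] = obj
        else:
            entity["properties"][predicate] = obj
    return entity


triples = to_triples(lpg_entity)
round_tripped = to_entity(triples)
```

One record with two properties becomes three triples sharing a subject, which is the "many more small connecting elements" effect described in the first bullet.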
|
||
In summary: | ||
|
||
- **LPGs** emphasize a pragmatic, object-like approach to graph data modeling: entities and links have attributes which make them suitable for straightforward querying and mutation in graph databases. | ||
- **RDF** emphasizes a web-standard, triple-based approach with powerful integration and reasoning capabilities. | ||
- While RDF is common in open knowledge graphs and linked data scenarios, LPGs are frequently found in graph databases for operational or analytic applications. | ||
- **HASH** extends the "LPG" model, replacing simple text "labels" with formally-defined types, while supporting the common RDF paradigm of stable, referenceable URIs. As such, HASH combines the benefits of both LPG and RDF approaches into a single new approach. | ||
|
||
## Best Practices for Graph Data Modeling | ||
|
||
Designing a graph data model requires careful thought to fully leverage the power of the LPG model while keeping the graph efficient and comprehensible. Here are some core data modeling principles and best practices: | ||
|
||
1. **Identify nodes and relationships from entities**: Start by identifying the main kinds of entities you store information about, to map to [entity types](/guide/types/entity-types) (your node labels), and the important relationships between them, to map to [link types](/guide/types/link-types). | ||
* If you have an Entity Relationship (ER) diagram or an object model, it can often be translated: entities/node labels map to *entity types*, and relationships/edges/associations map to *link types*. For example, in a retail scenario you might have nodes labeled `Customer`, `Product`, and `Order`, with relationships like `(:Customer)-[:PLACED]->(:Order)` and `(:Order)-[:CONTAINS]->(:Product)`. | ||
* In HASH, connecting to existing data sources via [integrations](/integrations) automatically populates your web with properly typed entities, eliminating any need for time-consuming transformation or manual mapping of data to entities. | ||
2. **Use properties for simple attributes**: For attributes that don’t naturally need to be separate nodes, use properties. In a relational database, you might normalize certain data into separate tables, but in a graph it’s often unnecessary unless you plan to traverse or query that attribute as a relationship. For instance, an `email` or `age` of a person can be a property on the Person node. On the other hand, something like an `Address` might be a node of its own if you want to connect people living at the same address or perform graph queries on the network of locations. A key difference from RDF here is that in an LPG-style system such as HASH you don’t need to create intermediate nodes for every value. Properties on entities and links in HASH help keep the graph compact and performant. | ||
3. **Use links when appropriate**: Oftentimes data can be modeled either as a property or as a link (relationship between two entities). A good general rule of thumb is that if the data item is primarily an attribute of *one* entity (and not something you'd traverse or connect to from other entities), a property is appropriate. If the data item represents a connection or entity in its own right that *other things* may relate to, create a separate entity and link to it. For example, if modeling a person’s employer: if you only care to store the employer’s name, you could use an `Employer Name` property type. But if you want to connect the `Person` to another entity, `Company` (which in turn might have its own properties or connect to other companies), create a `Works For` link type and model `(:Person)-[:WORKS_FOR]->(:Company)` instead. As an LPG, HASH supports both approaches, and you should pick, case by case, the one that makes querying for data most natural and avoids duplication of data across your [web](/guide/webs). | ||
4. **Avoid superfluous nodes or relationships**: Every entity (node) and link (relationship) in your web should represent something meaningful. If you find entities of a given type that have only one link and no properties, ask if they’re actually necessary, or if they could just be properties on another entity or link. Unnecessary indirection can slow down queries, and make information harder to understand at a glance. Similarly, avoid introducing link types that duplicate what could be captured via properties or existing links. In general, you want to ensure information is only represented once in your graph (eliminating a need to sync distinct values), and your ontology is as simple as possible (to make understanding it and checking for consistency easier), while still representing all distinctions you care about. | ||
5. **Leverage entity types properly**: In many LPGs, entity types (labels on nodes) and link types (relationship types) can be indexed or used to efficiently select subgraphs. In HASH, all entities are typed, and this is handled automatically. | ||
* Whether you’re using HASH or another LPG, make sure that entities are assigned the correct entity types – e.g. that an individual entity is assigned both the entity type (label) `Employee` and `Customer` if they happen to fall into both categories. | ||
* When creating entity types, avoid becoming too fine-grained. For example, having an entity type per country of citizenship (both `US Person` and `UK Person`) may be overkill, if a `Country` property on a `Person` entity would suffice. However, if you want to associate unique attributes (properties or links) with people in your graph, depending on their place of residence (e.g. `SSN` in the US, and `National Insurance Number` in the UK) such granularity could be appropriate. | ||
6. **Minimize redundant data**: One sign of a suboptimal graph model is a large number of duplicate entities scattered across a web. Instead of duplicating entities to indicate multiple roles, add multiple entity types (labels) to a single entity, store its values in one place, and link to it from elsewhere if required. Duplicated information has to be kept in sync, and copies that drift apart cause confusion. Graphs by nature can represent many-to-many connections without duplication. If you notice identical subgraphs repeated, you may need to refactor your model. In graph design, it's generally better to increase the number of relationships than to duplicate nodes. This means linking data points with new relationships so they can be shared or traversed, rather than copying data into separate parts of the graph. | ||
7. **Watch for modeling anti-patterns**: Three common issues can signal a need to adjust your data model: | ||
1. *Sparse or tree-like graph structure*: If your web has very few links (like a shallow tree), you aren’t leveraging graph traversal much. Webs, graph databases, and labeled property graphs show their strength when data is highly connected; a purely hierarchical or isolated data model might perform just as well in a relational system. | ||
2. *Data duplication*: as mentioned, repeated entities/links usually indicate the data model could be more normalized within the graph. | ||
3. *Overly complex queries*: If you find yourself writing very convoluted graph queries to get simple answers, the model might be forcing workarounds. The ideal is that queries align with how you naturally think of the problem. Complex, multi-step queries might mean some important link is missing in the data model, or data is embedded in properties when it should be connected via edges. Revisit the model to see if a different arrangement of entities/links would answer that query more directly (for example, adding a shortcut link for a frequently needed connection). | ||
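As a rough illustration of principle 4, the check "entities that have only one link and no properties" can be automated. The sketch below uses a toy dictionary-and-tuple representation of a graph; the data, the entity identifiers, and the `superfluous_candidates` helper are hypothetical, and the result is only a list of candidates to review, not a verdict.

```python
from collections import Counter

# Toy graph: entity id -> type and properties (all values are illustrative).
entities = {
    "alice": {"type": "Person", "properties": {"name": "Alice"}},
    "acme": {"type": "Company", "properties": {"name": "Acme"}},
    "email-1": {"type": "Email", "properties": {}},
}

# Directed links as (source id, link type, target id).
links = [
    ("alice", "Works For", "acme"),
    ("alice", "Has Email", "email-1"),
]


def superfluous_candidates(entities, links):
    """Return ids of entities with exactly one link and no properties."""
    degree = Counter()
    for source, _link_type, target in links:
        degree[source] += 1
        degree[target] += 1
    return sorted(
        entity_id
        for entity_id, entity in entities.items()
        if degree[entity_id] == 1 and not entity["properties"]
    )


candidates = superfluous_candidates(entities, links)
```

Here `email-1` is flagged because it adds an extra hop without carrying any data of its own, so it might be better modeled as an `email` property on the person. `acme` also has a single link but is not flagged, since it holds properties and is something other entities could plausibly link to.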
|
||
Following these principles helps maintain a graph model that is both expressive and performant. A well-designed LPG will make it easier to formulate queries, ensure the database can traverse efficiently, and reduce the chances of anomalies (like contradictory data) by storing each fact in an appropriate place. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,12 @@ | ||
--- | ||
title: Arrays | ||
description: "Arrays are collections of variables or values, each of which can be identified by at least one key (e.g. their order in an array)." | ||
slug: arrays | ||
tags: ["Data Science"] | ||
--- | ||
|
||
In programming, **arrays** are collections of [variables](/glossary/variables) or [values](/glossary/values), each of which can be identified within the array by at least one "key" (typically its position in the array). Arrays themselves can in turn also be variables or values. | ||
|
||
Some data types are arrays. For example, `Color`, when expressed as an `RGB` value, contains three numbers which refer to the relative amount of red, green, and blue light that make up a color (each on a scale of 0 to 255). For example, `0,0,0` is black, `255,0,0` is red, `0,0,255` is blue, and `255,255,255` is white. | ||
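A short sketch of the RGB case shows position acting as the "key" into the array. The variable names and the `to_hex` helper are illustrative only.

```python
# An RGB color as a three-element array; position is the key:
# index 0 = red, index 1 = green, index 2 = blue (each 0 to 255).
black = (0, 0, 0)
red = (255, 0, 0)
blue = (0, 0, 255)
white = (255, 255, 255)

# Elements are identified by their position in the array.
red_channel_of_red = red[0]
blue_channel_of_red = red[2]


def to_hex(rgb):
    """Format an (r, g, b) array as a hex color string."""
    return "#{:02x}{:02x}{:02x}".format(*rgb)


hex_blue = to_hex(blue)
hex_white = to_hex(white)
```

The array as a whole (`red`) is itself a value that can be assigned to a variable, while each element inside it is addressed by index.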
|
||
In HASH, arrays are found in the context of [property types](/guide/types/property-types). Property types describe the acceptable value(s) that a property can have. They are either expressed as [data types](/guide/types/data-types), property objects (other property types, nested within the parent), or arrays (which can contain data-typed values, property objects, or further nested arrays). |