
H-4088: Add knowledge graph glossary and tutorial content #6498

Open · wants to merge 30 commits into base: main

Commits (30)
6cd29c5
Re-order docs
vilkinsons Jan 11, 2025
763cbd0
Update copy
vilkinsons Feb 10, 2025
60ea7cc
Update README.md
vilkinsons Feb 12, 2025
d12f213
Update README.md
vilkinsons Feb 12, 2025
cf4f913
Update README.md
vilkinsons Feb 12, 2025
f6a0542
Update README.md
vilkinsons Feb 12, 2025
6537739
Update README.md
vilkinsons Feb 12, 2025
ae31226
Update README.md
vilkinsons Feb 12, 2025
5aac237
Update knowledge-graphs.mdx
vilkinsons Feb 20, 2025
867f5ea
Delete `apps/hashdotai/glossary/url_map.json`
vilkinsons Feb 20, 2025
745e7a0
Update knowledge-graphs.mdx
vilkinsons Feb 20, 2025
0de5f72
Create knowledge-graphs-in-healthcare-and-life-sciences.mdx
vilkinsons Feb 20, 2025
0f03118
Create knowledge-graphs-in-supply-chain-management.mdx
vilkinsons Feb 20, 2025
ce177a9
Create knowledge-graphs-in-finance.mdx
vilkinsons Feb 20, 2025
92bc2af
Create knowledge-graphs-in-retail-and-ecommerce.mdx
vilkinsons Feb 20, 2025
03e16ea
Update knowledge-graphs-in-finance.mdx
vilkinsons Feb 20, 2025
24a291a
Update knowledge-graphs-in-retail-and-ecommerce.mdx
vilkinsons Feb 20, 2025
5379b9f
Create knowledge-graphs-in-enterprise-knowledge-management.mdx
vilkinsons Feb 20, 2025
5dd0ddd
Merge branch 'main' into d/docs
vilkinsons Feb 20, 2025
15663ff
Create WIP_event-driven-knowledge-graphs.mdx
vilkinsons Feb 20, 2025
b38b18a
Create WIP_labeled-property-graphs.mdx
vilkinsons Feb 21, 2025
1d392e8
temp
vilkinsons Feb 26, 2025
c72624d
Create `scalars.mdx`
vilkinsons Mar 4, 2025
2a60430
Create vectors.mdx
vilkinsons Mar 4, 2025
8596401
Update scalars.mdx
vilkinsons Mar 4, 2025
fff10ac
Create variables.mdx
vilkinsons Mar 4, 2025
8d67261
Create arrays.mdx
vilkinsons Mar 4, 2025
be0d284
Update WIP_labeled-property-graphs.mdx
vilkinsons Mar 4, 2025
ec274ff
Update WIP_labeled-property-graphs.mdx
vilkinsons Mar 4, 2025
7141319
Merge branch 'main' into d/docs
vilkinsons Mar 4, 2025
18 changes: 0 additions & 18 deletions apps/hash/README.md
@@ -218,24 +218,6 @@ Transactional email templates are located in the following locations:
To use `AwsSesEmailTransporter` instead, set `export HASH_EMAIL_TRANSPORTER=aws_ses` in your terminal before running the app.
Note that you will need valid AWS credentials for this email transporter to work.

## Integration with the Block Protocol

HASH is built around the open [Block Protocol](https://blockprotocol.org) ([@blockprotocol/blockprotocol](https://github.com/blockprotocol/blockprotocol) on GitHub).

### Using blocks

Blocks published to the [Þ Hub](https://blockprotocol.org/hub) can be run within HASH via the 'insert block' (aka. 'slash') menu.

While running the app in development mode, you can also test local blocks out in HASH by going to any page, clicking on the menu next to an empty block, and pasting in the URL to your block's distribution folder (i.e. the one containing `block-metadata.json`, `block-schema.json`, and the block's code). If you need a way of serving your folder, try [`serve`](https://github.com/vercel/serve).

### HASH blocks

The code pertaining to HASH-developed blocks can be found in the [`/blocks` directory](https://github.com/hashintel/hash/tree/main/blocks) in the root of this monorepo.

### Creating new blocks

See the [Developing Blocks](https://blockprotocol.org/docs/developing-blocks) page in the [Þ Docs](https://blockprotocol.org/docs) for instructions on developing and publishing your own blocks.

## Development

[//]: # "TODO: Pointers to where to update/modify code"
10 changes: 10 additions & 0 deletions apps/hashdotai/glossary/WIP_event-driven-knowledge-graphs.mdx
@@ -0,0 +1,10 @@
---
title: Event-Driven Knowledge Graphs
description: "Event-driven knowledge graphs are commonly used to power decision support and simulation tools."
slug: event-driven-knowledge-graphs
tags: ["Graphs"]
---

**Event-driven knowledge graphs** sit at the intersection of [discrete event models](/glossary/discrete-event-modeling), which help simulate processes and the state of real-world systems, and [knowledge graphs](/glossary/knowledge-graphs), which help represent objects, events, situations, or concepts (often using [graph databases](/glossary/graph-databases)).

By introducing an event-driven approach (aka. “event sourcing via streaming platforms”), platforms like HASH can extract and link data from multiple data silos in near real time. In practice, an event-sourcing pipeline streams key data changes (events) from disparate systems, deduplicates and unifies them, and updates a knowledge graph accordingly – resulting in an up-to-date, event-driven knowledge graph. This differs from traditional knowledge graphs that are typically entity-centric and updated in batches or via periodic processes. Traditional KGs capture mostly static facts and relationships, whereas an event-driven KG continually incorporates dynamic, temporal information (events) as first-class data. As such, event-driven graphs are always evolving to reflect the latest state of the business.
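The pipeline described above can be sketched in a few lines of Python. This is illustrative only – the event fields, source-system names, and deduplication rule here are hypothetical, not HASH's actual implementation:

```python
from collections import defaultdict

# Hypothetical change events streamed from two source systems.
events = [
    {"source": "crm", "id": "cust-1", "entity": "alice@example.com", "field": "name", "value": "Alice"},
    {"source": "billing", "id": "cust-1", "entity": "alice@example.com", "field": "plan", "value": "pro"},
    {"source": "crm", "id": "cust-1", "entity": "alice@example.com", "field": "name", "value": "Alice"},  # replayed duplicate
]

def apply_events(events):
    """Deduplicate events, then fold them into an entity-centric graph view."""
    seen = set()
    graph = defaultdict(dict)
    for e in events:
        key = (e["source"], e["id"], e["field"], e["value"])
        if key in seen:
            continue  # drop duplicates emitted by retries or stream replays
        seen.add(key)
        graph[e["entity"]][e["field"]] = e["value"]  # latest value wins
    return dict(graph)

print(apply_events(events))
# {'alice@example.com': {'name': 'Alice', 'plan': 'pro'}}
```

Each new event updates the unified view incrementally, which is what keeps an event-driven knowledge graph continuously current rather than batch-refreshed.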
65 changes: 65 additions & 0 deletions apps/hashdotai/glossary/WIP_labeled-property-graphs.mdx
@@ -0,0 +1,65 @@
---
title: Labeled Property Graphs
description: "Labeled Property Graphs consist of nodes (also called vertices) and relationships (edges), each of which can hold associated data."
slug: labeled-property-graphs
tags: ["Graphs"]
---

# **Data Modeling Principles**

## **Structure of Labeled Property Graphs**

A **labeled property graph (LPG)** consists of *entities* (also called nodes or vertices) and *links* (relationships or edges), each of which can have a series of attributes associated with it which contain relevant information. In an LPG, entities have simple textual labels like `Person` or `Product`, and so do links - for example, `Friends With` or `Purchased`. These help categorize entities and links.

In more advanced "LPG inspired" systems like HASH, instead of using simple textual labels, entities and links are assigned one or more [entity types](/guide/types/entity-types) or [link types](/guide/types/link-types) (as appropriate).

In LPGs, both entities and links can have any number of _properties_, which are key–value pairs storing additional information (for example, a Person node might have properties like `name:"Alice"` and `age:30`). In HASH, both entities and links can contain any number of properties, or other _links_ (allowing links to point to other links, as required).

In LPGs and HASH, links are usually **directed**, meaning that one entity is the source and the other is the target, and which is which is important – though in some use cases direction can be ignored or traversed in both ways as needed.
The LPG model is best understood with a simple example. Consider two `Person` entities and a friendship between them:

```
CREATE (p:Person { name: "Alice", age: 30 })
CREATE (q:Person { name: "Bob", age: 32 })
CREATE (p)-[:`Friends With`]->(q);
```

Here we created two entities with the **Person** label (or entity type, if using HASH), which allows certain properties to be associated with them. We've also created a link with the label (or link type) **Friends With** that points from Alice to Bob. Because the link is directed, this records that Alice is friends with Bob, but not by itself that Bob is friends with Alice – a possible real-world situation! In LPGs and HASH, this data is stored as a graph structure – you can later query it by traversing from `Alice` to find all `Friends With` connections, for example. LPGs and HASH work very similarly here, except that in LPGs we query by "label" and in HASH by "type". Both the entities and the links between them have labels/types. In HASH, these types indicate what information can be associated with an entity (the expected attributes: properties and links). For example, one could add a property `Since` with the value `2020` on the `Friends With` link to indicate when a friendship started. This enriched graph structure makes LPGs and typed alternatives like HASH extremely expressive for modeling complex domains.

## **LPG vs. Other Graph Models (e.g. RDF)**

Labeled property graphs are one of the two major graph data modeling paradigms in wide use today, the other being the **RDF** (Resource Description Framework) triple model. While both represent data as networks of connected entities, there are fundamental differences in how data is structured and annotated:

- **Node properties _vs._ Triples**: In LPGs/HASH, an entity can have attributes stored directly as properties (as in the Alice example above). In RDF, by contrast, there is *no concept of an attribute on an entity* – every piece of information is expressed as a separate triple (subject–predicate–object). For example, to represent a person’s birthdate in RDF, one would create a triple like `(BarackObama) -[birthDate]-> ("1961")`, essentially treating the date "1961" as an object node or literal connected via a predicate. In an LPG, that same fact could simply be a property `birthDate: 1961` on the Barack Obama entity, with no extra link needed. This means RDF tends to produce many more small connecting elements, whereas LPGs can store richer information per entity/link (more analogous to an object in object-oriented programming with fields).
- **Global _vs._ Local identification**: RDF uses globally unique identifiers (URIs) for each entity and link type, aiming for web-scale data integration. Every predicate (link type) and often entities are defined by URIs that can link across datasets. LPG systems meanwhile typically use application-local identifiers (like string names for link types and entity types, which they call "labels") and do not inherently link across databases. This makes property graphs simpler to work with in a closed-world context, whereas RDF is built for interoperability at the cost of some verbosity. [HASH](https://hash.ai/) is a next-generation platform that combines the interoperability and mutual intelligibility of RDF with the expressiveness and customizability of LPG, with all entities, links, entity types, link types and associated resources having fixed URIs which can be relied upon internally (or even published to the world wide web) as desired.
- **Atomic unit of data**: The atomic unit in RDF is the triple. Even a single entity with multiple attributes is essentially a collection of triples sharing the same subject. LPGs do not have a single fixed atomic structure; an entity with properties is a self-contained data structure, and a link with its properties is another. This means an LPG can be thought of as a collection of entities and links (each a small record with key-values), rather than a collection of triples.
- **Schema and semantics**: RDF is tightly connected to the Semantic Web and has a rich standard stack for defining ontologies and schemas (RDF Schema, OWL) that let you formally specify classes, relationships, and even logical inference rules. An RDF graph can be “self-describing” to a degree, as the meaning of relationships and nodes can be defined through shared vocabularies/ontologies. Property graphs, on the other hand, do not enforce any specific global schema or ontology layer; the **interpretation of the labels and properties is left to the consumer** or defined at the application level. This gives LPGs more flexibility (you can add any property to any node without prior schema setup), but it also means that understanding the data’s meaning relies on external documentation or conventions rather than inherent semantics. As a hybrid of RDF and LPG-based approaches, HASH relies upon a [type system](/guide/types) to describe labeled property graphs and ensure interoperability. These types can be kept private or publicly shared, and users can fork on, extend, re-use and crosswalk between standardized definitions of entities created by anyone else. This makes HASH well-suited to collaboration within and across companies, while HASH's UI abstracts away complexity and ensures type creation and editing remains simple and easy.
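The node-properties-vs-triples contrast can be sketched in a few lines of Python, with plain tuples and dicts standing in for a real triple store and graph database:

```python
# The same fact, expressed in the two paradigms.

# RDF-style: everything is a (subject, predicate, object) triple.
triples = [
    ("BarackObama", "rdf:type", "Person"),
    ("BarackObama", "birthDate", "1961"),
]

# LPG-style: attributes live directly on the node as properties.
lpg_node = {
    "labels": ["Person"],
    "properties": {"name": "Barack Obama", "birthDate": 1961},
}

# In RDF we scan for triples sharing a subject; in an LPG we read one record.
birth_rdf = next(o for s, p, o in triples if s == "BarackObama" and p == "birthDate")
birth_lpg = lpg_node["properties"]["birthDate"]
assert str(birth_lpg) == birth_rdf
```

Note how each additional RDF attribute means another triple, while the LPG node simply grows another key-value pair.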

In summary:

- **LPGs** emphasize a pragmatic, object-like approach to graph data modeling: entities and links have attributes which make them suitable for straightforward querying and mutation in graph databases.
- **RDF** emphasizes a web-standard, triple-based approach with powerful integration and reasoning capabilities.
- While RDF is common in open knowledge graphs and linked data scenarios, LPGs are frequently found in graph databases for operational or analytic applications.
- **HASH** extends the "LPG" model, replacing simple text "labels" with formally-defined types, while supporting the common RDF paradigm of stable, referenceable URIs. As such, HASH combines the benefits of both LPG and RDF approaches into a single new approach.

## Best Practices for Graph Data Modeling

Designing a graph data model requires careful thought to fully leverage the power of the LPG model while keeping the graph efficient and comprehensible. Here are some core data modeling principles and best practices:

1. **Identify nodes and relationships from entities**: Start by identifying the main kinds of entities you store information about, to map to [entity types](/guide/types/entity-types) (your node labels), and the important relationships between them, to map to [link types](/guide/types/link-types).
* If you have an Entity Relationship (ER) diagram or an object model, it can often be translated: entities/node labels map to *entity types*, and relationships/edges/associations map to *link types*. For example, in a retail scenario you might have nodes labeled `Customer`, `Product`, and `Order`, with relationships like `(:Customer)-[:PLACED]->(:Order)` and `(:Order)-[:CONTAINS]->(:Product)`.
* In HASH, connecting to existing data sources via [integrations](/integrations) automatically populates your web with properly typed entities, eliminating any need for time-consuming transformation or manual mapping of data to entities.
2. **Use properties for simple attributes**: For attributes that don’t naturally need to be separate nodes, use properties. In a relational database, you might normalize certain data into separate tables, but in a graph it’s often unnecessary unless you plan to traverse or query that attribute as a relationship. For instance, an `email` or `age` of a person can be a property on the Person node. On the other hand, something like an `Address` might be a node of its own if you want to connect people living at the same address or perform graph queries on the network of locations. A key difference from RDF here is that in an LPG such as HASH you don’t need to create intermediate nodes for every value. Properties on entities and links in HASH help keep the graph compact and performant.
3. **Use links when appropriate**: Oftentimes data can be modeled either as a property or as a link (relationship between two entities). A good general rule of thumb: if the data item is primarily an attribute of *one* entity (and not something you'd traverse or connect to from other entities), a property is appropriate. If the data item represents a connection or entity in its own right that *other things* may relate to, create a separate entity and link to it. For example, if modeling a person’s employer: if you only care to store the employer’s name, you could use an `Employer Name` property type. But if you want to connect the `Person` to another entity, `Company` (which in turn might have its own properties or connect to other companies), create a `Works For` link type and model `(:Person)-[:WORKS_FOR]->(:Company)` instead. As an LPG, HASH supports both approaches; pick whichever makes querying for data most natural and avoids duplicating data across your [web](/guide/webs).
4. **Avoid superfluous nodes or relationships**: Every entity (node) and link (relationship) in your web should represent something meaningful. If you find entities of a given type that have only one link and no properties, ask if they’re actually necessary, or if they could just be properties on another entity or link. Unnecessary indirection can slow down queries, and make information harder to understand at a glance. Similarly, avoid introducing link types that duplicate what could be captured via properties or existing links. In general, you want to ensure information is only represented once in your graph (eliminating a need to sync distinct values), and your ontology is as simple as possible (to make understanding it and checking for consistency easier), while still representing all distinctions you care about.
5. **Leverage entity types properly**: In many LPGs, entity types (labels on nodes) and link types (relationship types) can be indexed or used to efficiently select subgraphs. In HASH, all entities are typed, and this is handled automatically.
* Whether you’re using HASH or another LPG, make sure that entities are assigned the correct entity types – e.g. that an individual entity is assigned both the entity type (label) `Employee` and `Customer` if they happen to fall into both categories.
* When creating entity types, avoid becoming too fine-grained. For example, having an entity type per country of citizenship (both `US Person` and `UK Person`) may be overkill, if a `Country` property on a `Person` entity would suffice. However, if you want to associate unique attributes (properties or links) with people in your graph, depending on their place of residence (e.g. `SSN` in the US, and `National Insurance Number` in the UK) such granularity could be appropriate.
6. **Minimize redundant data**: One sign of a suboptimal graph model is a large number of duplicate entities scattered across a web. Instead of duplicating entities to indicate multiple roles, add multiple entity types (labels) to a single entity, storing its values in one place and linking to it from elsewhere as required, rather than duplicating information in different places (which must then be kept in sync, lest the copies drift and cause confusion). Graphs by nature can represent many-to-many connections without duplication. If you notice identical subgraphs repeated, you may need to refactor your model. In graph design, it's generally better to increase the number of relationships than to duplicate nodes: link data points with new relationships so they can be shared or traversed, rather than copying data into separate parts of the graph.
7. **Watch for modeling anti-patterns**: Three common issues can signal a need to adjust your data model:
1. *Sparse or tree-like graph structure*: If your web has very few links (like a shallow tree), you aren’t leveraging graph traversal much. Webs, graph databases, and linked property graphs show their strength when data is highly connected; a purely hierarchical or isolated data model might perform just as well in a relational system.
2. *Data duplication*: as mentioned, repeated entities/links usually indicate the data model could be more normalized within the graph.
3. *Overly complex queries*: If you find yourself writing very convoluted graph queries to get simple answers, the model might be forcing workarounds. The ideal is that queries align with how you naturally think of the problem. Complex, multi-step queries might mean some important link is missing in the data model, or data is embedded in properties when it should be connected via edges. Revisit the model to see if a different arrangement of entities/links would answer that query more directly (for example, adding a shortcut link for a frequently needed connection).
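The property-vs-link trade-off from point 3 above can be sketched in Python, with plain dicts standing in for real entities (the type and field names here are illustrative):

```python
# Option A: employer stored as a simple property on the person.
person_a = {"type": "Person", "properties": {"name": "Alice", "employerName": "Acme"}}

# Option B: employer modeled as its own entity, connected by links,
# so multiple people (and queries) can share and traverse it.
company = {"type": "Company", "properties": {"name": "Acme"}}
alice = {"type": "Person", "properties": {"name": "Alice"}}
bob = {"type": "Person", "properties": {"name": "Bob"}}
links = [
    {"type": "Works For", "source": alice, "target": company},
    {"type": "Works For", "source": bob, "target": company},
]

# With option B, "who are Alice's colleagues?" becomes a two-hop traversal:
employer = next(l["target"] for l in links if l["source"] is alice)
colleagues = [l["source"]["properties"]["name"]
              for l in links if l["target"] is employer and l["source"] is not alice]
print(colleagues)  # ['Bob']
```

With option A, answering the same question would require string-matching `employerName` values across every person – exactly the kind of workaround that signals a missing link in the model.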

Following these principles helps maintain a graph model that is both expressive and performant. A well-designed LPG will make it easier to formulate queries, ensure the database can traverse efficiently, and reduce the chances of anomalies (like contradictory data) by storing each fact in an appropriate place.
12 changes: 12 additions & 0 deletions apps/hashdotai/glossary/arrays.mdx
@@ -0,0 +1,12 @@
---
title: Arrays
description: "Arrays are collections of variables or values, each of which can be identified by at least one key (e.g. their order in an array)."
slug: arrays
tags: ["Data Science"]
---

In programming, **arrays** are collections of [variables](/glossary/variables) or [values](/glossary/values), each of which can be identified within the array by at least one "key" (typically its position in the array). Arrays themselves can in turn also be variables or values.

Some data types are arrays. For example, `Color` expressed as an `RGB` value contains three numbers which refer to the relative amount of red, green, and blue light that make up a color (each on a scale of 0 to 255): `0,0,0` is black, `255,0,0` is red, `0,0,255` is blue, and `255,255,255` is white.
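The RGB example can be written out directly; in Python, a tuple serves as a small fixed-length array:

```python
# An RGB color as a three-element array (each channel ranges 0-255).
black = (0, 0, 0)
red = (255, 0, 0)
blue = (0, 0, 255)
white = (255, 255, 255)

# Elements are identified by position -- the array's "key".
r, g, b = red
assert (r, g, b) == (255, 0, 0)

# Arrays can themselves be values: a palette is an array of colors.
palette = [black, red, blue, white]
assert palette[3] == (255, 255, 255)  # white
```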

In HASH, arrays are found in the context of [property types](/guide/types/property-types). Property types describe the acceptable value(s) that a property can have. They are either expressed as [data types](/guide/types/data-types), property objects (other property types, nested within the parent), or arrays (which can contain data-typed values, property objects, or further nested arrays).
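As a purely illustrative sketch (not HASH's actual type-schema format), a property type's value check that accepts either a single data-typed value or a flat array of them might look like:

```python
# Illustrative only -- not HASH's real schema format. A hypothetical property
# type accepting either one number or a flat array of numbers (e.g. RGB channels).
def is_valid(value, allow_array=True):
    """Accept a scalar number, or (optionally) a flat array of numbers."""
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return True
    if allow_array and isinstance(value, list):
        # One level of nesting only in this sketch; real property types
        # may also permit property objects and further nested arrays.
        return all(is_valid(v, allow_array=False) for v in value)
    return False

assert is_valid(255)           # a single data-typed (number) value
assert is_valid([0, 0, 255])   # an array of data-typed values
assert not is_valid([[0], 0])  # nested arrays disallowed in this sketch
assert not is_valid("blue")    # wrong data type
```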