doc: Experiments for building knowledge graph #36

Merged
merged 6 commits into from
Jul 9, 2024
73 changes: 73 additions & 0 deletions graph_rag/experiments/EXPERIMENTS.MD
@@ -0,0 +1,73 @@
# Experiments

Most of my time in the first phase of the GSoC project was spent experimenting with different models, embeddings, and libraries.

## Knowledge Graph from Documentation

Most library documentation is stored as HTML and Markdown files in the libraries' GitHub repositories.

We first used llama-index document loaders to load all documents with the `.md` extension. We then chunked them and created `Document` instances from the chunks.
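
The loading and chunking step can be sketched in plain Python. This is a standard-library stand-in for illustration only, not the llama-index loader or splitter used in the experiments; the `Document` dataclass, the function names, and the chunk size and overlap values are all assumptions.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Document:
    """Hypothetical stand-in for a llama-index Document: text plus metadata."""
    text: str
    metadata: dict = field(default_factory=dict)


def chunk_text(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into character windows of chunk_size that overlap by `overlap`."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return chunks


def load_markdown_chunks(root: str) -> list[Document]:
    """Recursively read every .md file under root and return one Document per chunk."""
    docs = []
    for path in sorted(Path(root).rglob("*.md")):
        text = path.read_text(encoding="utf-8")
        for i, chunk in enumerate(chunk_text(text)):
            docs.append(Document(chunk, {"source": str(path), "chunk": i}))
    return docs
```

Calling `load_markdown_chunks` on a checkout of the documentation templates would yield a flat list of chunks, each tagged with its source file, which is the shape a graph builder typically consumes.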

## Knowledge Graph Using Code Embeddings

Implementation of the idea can be found here: [Colab](https://colab.research.google.com/drive/1uguR76SeMAukN4uAhKuXU_ja8Ik0s8Wj#scrollTo=CUgtX5D1Tl_x).

The idea is to extract code blocks, or split source code with a code splitter, and then pass the pieces to a model that builds a knowledge graph using code embeddings. I used:
- Salesforce/codegen2-7B_P quantized (4-bit)
- Salesforce/codet5p-110m-embedding
- Python files in Keras-io
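
As a rough illustration of the splitting step, the sketch below cuts a Python file into one chunk per top-level function or class using the standard-library `ast` module. This is an assumed stand-in for the code splitter, not the one used in the Colab notebook, and the embedding step is left out; `split_python_source` and the sample source are hypothetical.

```python
import ast


def split_python_source(source: str) -> list[dict]:
    """Return one chunk per top-level function or class found in `source`."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "name": node.name,
                "kind": type(node).__name__,
                # Exact source text of the definition (decorators are not included).
                "code": ast.get_source_segment(source, node),
            })
    return chunks


# Hypothetical input; the code is only parsed, never executed or imported.
example = '''
import keras

def build_model(units):
    return keras.Sequential([keras.layers.Dense(units)])

class Trainer:
    def fit(self):
        pass
'''

chunks = split_python_source(example)
for chunk in chunks:
    print(chunk["kind"], chunk["name"])
```

Each chunk could then be passed to an embedding model, so that every node in the graph corresponds to a named unit of code rather than an arbitrary window of text.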

### Model Selection

To begin with, we need a model that is open source and runs on the free Colab tier. To get better knowledge graphs from larger models, we quantized models above 20 GB to 4 bits using a bitsandbytes configuration. We tried the following LLMs:
- gemini pro
- [QuantiPhy/zephyr-7b-beta (4-bit quantized)](https://huggingface.co/QuantiPhy/zephyr-7b-beta-4bit-quantized) \*\*
- llama3 (Ollama version)
- codellama (Ollama version)
- [QuantiPhy/aya-23-8B (4-bit quantized)](https://huggingface.co/QuantiPhy/aya-23-8B-4bq) \*\*
- gpt-neo-2.7B (4-bit quantized)
- [Salesforce/codegen2-7B_P (4-bit quantized)](https://huggingface.co/QuantiPhy/Salesforce_codegen2-7B_P) \*\*
- phi3 (Ollama)
- phi3:medium (Ollama)
- neural-chat (Ollama)
- gemma2 (Ollama)
- mistral (Ollama)

\*\* I quantized these models to 4 bits myself using bitsandbytes.

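
A back-of-the-envelope calculation shows why 4-bit quantization matters on a small GPU. The sketch below estimates the memory taken by the weights alone; it deliberately ignores quantization constants, activations, and the KV cache, and the 7e9 parameter count is an assumed round figure for a "7B" model rather than a measured value.

```python
def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


n_params = 7e9  # assumed parameter count for a "7B" model
fp16_gb = weight_memory_gb(n_params, 16)
int4_gb = weight_memory_gb(n_params, 4)
print(f"fp16: {fp16_gb:.1f} GB, 4-bit: {int4_gb:.1f} GB")
# prints "fp16: 14.0 GB, 4-bit: 3.5 GB"
```

Under these assumptions the weights shrink by a factor of four, which makes 7B-class models far more practical on a memory-constrained runtime.
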
### Embeddings

For embeddings, we tried:
- microsoft/codebert-base
- Salesforce/codet5p-110m-embedding

### Libraries

In the initial phase, we looked for community libraries that already address the problem of building knowledge graphs:
- [llama-index knowledge-graph builder](https://github.com/run-llama/llama_index/tree/main/llama-index-core/llama_index/core/indices/knowledge_graph)
- [llm-graph-builder](https://github.com/neo4j-labs/llm-graph-builder)
- [graph_builder](https://github.com/sarthakrastogi/graph-rag)
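
A knowledge graph is commonly represented as a set of (subject, relation, object) triplets. The sketch below assumes that representation and stores triplets in a small adjacency structure; the `TripletGraph` class is hypothetical and is not the API of any library listed above. The sample triplets are written by hand from the KerasTuner documentation rather than extracted by a model.

```python
from collections import defaultdict


class TripletGraph:
    """Minimal directed, labelled graph built from (subject, relation, object) triplets."""

    def __init__(self):
        self._edges = defaultdict(list)

    def add(self, subject: str, relation: str, obj: str) -> None:
        """Record one directed edge subject --relation--> obj."""
        self._edges[subject].append((relation, obj))

    def neighbors(self, subject: str) -> list[tuple[str, str]]:
        """Return (relation, object) pairs leaving `subject`, in insertion order."""
        return list(self._edges.get(subject, []))


graph = TripletGraph()
# Hand-written example triplets (assumed, for illustration only).
graph.add("KerasTuner", "provides", "RandomSearch")
graph.add("KerasTuner", "provides", "Hyperband")
graph.add("RandomSearch", "takes", "max_trials")

print(graph.neighbors("KerasTuner"))
```

In the experiments the triplets come from an LLM instead of being typed in, so the quality of the graph depends directly on how reliably the model extracts them.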

### Table

| Model | Embeddings | Libraries | Remarks | Documents | Artifacts |
|:----------------------------|:---------------------|:---------------------------|:------------|:-------------------------|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| gemma2 (Ollama) | microsoft/codebert-base | llama-index graph builder | nil | [keras-io](https://github.com/keras-team/keras-io/tree/master/templates) | [viz](artifacts/gemma2/Graph_visualization_gemma2_mscb.html)<br/>[index](artifacts/gemma2/gemma2graphIndex.pkl)<br/>[colab](https://colab.research.google.com/drive/1q7FED2Lapk3D7ibqkO3NkqZ6iPNZ_x6H?usp=sharing) |
| mistral (Ollama) | microsoft/codebert-base | llama-index graph builder | nil | [keras-io](https://github.com/keras-team/keras-io/tree/master/templates) | [viz](artifacts/mistral/Graph_visualization_mistral_mscb.html)<br/>[index](artifacts/mistral/mistralgraphIndex.pkl)<br/>[colab](https://colab.research.google.com/drive/1q7FED2Lapk3D7ibqkO3NkqZ6iPNZ_x6H?usp=sharing) |
| neural-chat (Ollama) | microsoft/codebert-base | llama-index graph builder | nil | [keras-io](https://github.com/keras-team/keras-io/tree/master/templates) | [viz](artifacts/neural_chat/Graph_visualization_neuralchat_mscb.html)<br/>[index](artifacts/neural_chat/graphIndex_neuralchat_mscb.pkl)<br/>[colab](https://colab.research.google.com/drive/1cM6ujhiKM1v0bRYVN9F9UEgjYlwkBTt9?usp=sharing) |
| phi3:medium (Ollama) | microsoft/codebert-base | llama-index graph builder | nil | [keras-io](https://github.com/keras-team/keras-io/tree/master/templates) | [viz](artifacts/phi3-med/Graph_visualization_phi3-med_mscb.html)<br/>[index](artifacts/phi3-med/graphIndex_phi3_medium_mscb.pkl)<br/>[colab](https://colab.research.google.com/drive/1cM6ujhiKM1v0bRYVN9F9UEgjYlwkBTt9?usp=sharing) |
| phi3 (Ollama) | microsoft/codebert-base | llama-index graph builder | nil | [keras-io](https://github.com/keras-team/keras-io/tree/master/templates) | [viz](artifacts/phi3/Graph_visualization_phi3_mscb.html)<br/>[index](artifacts/phi3/graphIndex_phi3_mscb.pkl)<br/>[colab](https://colab.research.google.com/drive/1cM6ujhiKM1v0bRYVN9F9UEgjYlwkBTt9?usp=sharing) |
| gpt-4o | open-ai | Neo4jGraphBuilder | nil | [keras-io](https://github.com/keras-team/keras-io/tree/master/templates) | [viz](artifacts/vizualization/visualisation.png) |
| Gemini | gemini | llama-index graph builder | nil | [keras-nlp](https://github.com/keras-team/keras-io/blob/master/templates/keras_nlp/index.md) | [viz](artifacts/vizualization/ex1.html) |
| Gemini | gemini | llama-index graph builder | Rate-limit error | [keras-io](https://github.com/keras-team/keras-io/tree/master/templates) | |
| Gemini | microsoft/codebert-base | llama-index graph builder | nil | [keras-nlp](https://github.com/keras-team/keras-io/blob/master/templates/keras_nlp/index.md) | [viz](artifacts/vizualization/gem_mcode_k_nlp.html) |
| Zephyr (4-bit) | microsoft/codebert-base | llama-index graph builder | nil | [keras-nlp](https://github.com/keras-team/keras-io/blob/master/templates/keras_nlp/index.md) | [viz](artifacts/vizualization/zy_knlp.html) |
| Zephyr (4-bit) | microsoft/codebert-base | llama-index graph builder | nil | [keras-io](https://github.com/keras-team/keras-io/tree/master/templates) | [viz](artifacts/vizualization/examp.html) |
| llama3 (Ollama version) | microsoft/codebert-base | llama-index graph builder | nil | [keras-nlp](https://github.com/keras-team/keras-io/blob/master/templates/keras_nlp/index.md) | [viz](artifacts/vizualization/Graph_visualization.html) |
| codellama (Ollama version) | microsoft/codebert-base | llama-index graph builder | nil | [keras-nlp](https://github.com/keras-team/keras-io/blob/master/templates/keras_nlp/index.md) | [viz](artifacts/vizualization/code_1.html) |
| gpt-neo-2.7B-4bit-quantized | microsoft/codebert-base | llama-index graph builder | nil | [keras-nlp](https://github.com/keras-team/keras-io/blob/master/templates/keras_nlp/index.md) | [viz](artifacts/vizualization/graph_gpt3-neo.html) |

### Notes
- ### [graph_builder](https://github.com/sarthakrastogi/graph-rag)

- I explored graph_rag by Sarthak. It is fundamentally based on function calling (JSON output) and works very well with powerful models. Small LLMs, however, tend to make mistakes regardless of how well the prompt is crafted.
- While trying out and debugging the library, I modified the system prompts, which reduced the number of mistakes, and added a method to download `.html` files for visualization. I also added methods for using open-source models through Ollama.
- [rough_codes](https://colab.research.google.com/drive/1q6T8mK-O2XKqY-iGFz6xdrzvqLzu73lm#scrollTo=H0QG6QUVub8T) contains the code, modifications, and implementation for the repo.
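
Because small models often return malformed JSON, it helps to validate the output before adding anything to the graph. The sketch below is one possible guard, not code from the graph-rag library: the triplet schema (`subject`, `relation`, `object`) and the `parse_triplets` helper are assumptions made for illustration.

```python
import json

# Assumed schema: each triplet is a JSON object with these three string fields.
REQUIRED_KEYS = ("subject", "relation", "object")


def parse_triplets(raw: str) -> tuple[list[dict], list[str]]:
    """Parse model output into (valid triplets, error messages)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [], [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, list):
        return [], ["expected a JSON list of triplets"]

    valid, errors = [], []
    for i, item in enumerate(data):
        if not isinstance(item, dict):
            errors.append(f"item {i}: not an object")
            continue
        # A field counts as missing if it is absent, not a string, or blank.
        missing = [
            key for key in REQUIRED_KEYS
            if not isinstance(item.get(key), str) or not item[key].strip()
        ]
        if missing:
            errors.append(f"item {i}: missing or empty {missing}")
            continue
        valid.append({key: item[key].strip() for key in REQUIRED_KEYS})
    return valid, errors


raw_output = (
    '[{"subject": "KerasTuner", "relation": "provides", "object": "RandomSearch"},'
    ' {"subject": "KerasTuner"}]'
)
triplets, problems = parse_triplets(raw_output)
print(len(triplets), "valid;", len(problems), "rejected")
```

Collecting the error messages, instead of raising on the first bad item, makes it possible to feed them back to the model in a retry prompt or to log how often each model fails.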
93 changes: 93 additions & 0 deletions graph_rag/experiments/artifacts/data_keras/index1.md
@@ -0,0 +1,93 @@
# KerasTuner

<a class="github-button" href="https://github.com/keras-team/keras-tuner" data-size="large" data-show-count="true" aria-label="Star keras-team/keras-tuner on GitHub">Star</a>

KerasTuner is an easy-to-use, scalable hyperparameter optimization framework
that solves the pain points of hyperparameter search. Easily configure your
search space with a define-by-run syntax, then leverage one of the available
search algorithms to find the best hyperparameter values for your models.
KerasTuner comes with Bayesian Optimization, Hyperband, and Random Search algorithms
built-in, and is also designed to be easy for researchers to extend in order to
experiment with new search algorithms.

---
## Quick links

* [Getting started with KerasTuner](/guides/keras_tuner/getting_started/)
* [KerasTuner developer guides](/guides/keras_tuner/)
* [KerasTuner API reference](/api/keras_tuner/)
* [KerasTuner on GitHub](https://github.com/keras-team/keras-tuner)


---
## Installation

Install the latest release:

```
pip install keras-tuner --upgrade
```

You can also check out other versions in our
[GitHub repository](https://github.com/keras-team/keras-tuner).


---
## Quick introduction

Import KerasTuner and Keras:

```python
import keras_tuner
import keras
```

Write a function that creates and returns a Keras model.
Use the `hp` argument to define the hyperparameters during model creation.

```python
def build_model(hp):
    model = keras.Sequential()
    model.add(keras.layers.Dense(
        hp.Choice('units', [8, 16, 32]),
        activation='relu'))
    model.add(keras.layers.Dense(1, activation='relu'))
    model.compile(loss='mse')
    return model
```

Initialize a tuner (here, `RandomSearch`).
We use `objective` to specify the objective to select the best models,
and we use `max_trials` to specify the number of different models to try.

```python
tuner = keras_tuner.RandomSearch(
    build_model,
    objective='val_loss',
    max_trials=5)
```

Start the search and get the best model:

```python
tuner.search(x_train, y_train, epochs=5, validation_data=(x_val, y_val))
best_model = tuner.get_best_models()[0]
```

To learn more about KerasTuner, check out [this starter guide](https://keras.io/guides/keras_tuner/getting_started/).


---
## Citing KerasTuner

If KerasTuner helps your research, we appreciate your citations.
Here is the BibTeX entry:

```bibtex
@misc{omalley2019kerastuner,
title = {KerasTuner},
author = {O'Malley, Tom and Bursztein, Elie and Long, James and Chollet, Fran\c{c}ois and Jin, Haifeng and Invernizzi, Luca and others},
year = 2019,
howpublished = {\url{https://github.com/keras-team/keras-tuner}}
}
```
146 changes: 146 additions & 0 deletions graph_rag/experiments/artifacts/data_keras/index2.md
@@ -0,0 +1,146 @@
# KerasNLP

<a class="github-button" href="https://github.com/keras-team/keras-nlp" data-size="large" data-show-count="true" aria-label="Star keras-team/keras-nlp on GitHub">Star</a>

KerasNLP is a natural language processing library that works natively
with TensorFlow, JAX, or PyTorch. Built on Keras 3, these models, layers,
metrics, and tokenizers can be trained and serialized in any framework and
re-used in another without costly migrations.

KerasNLP supports users through their entire development cycle. Our workflows
are built from modular components that have state-of-the-art preset weights when
used out-of-the-box and are easily customizable when more control is needed.

This library is an extension of the core Keras API; all high-level modules are
[`Layers`](/api/layers/) or
[`Models`](/api/models/) that receive that same level of polish
as core Keras. If you are familiar with Keras, congratulations! You already
understand most of KerasNLP.

See our [Getting Started guide](/guides/keras_nlp/getting_started)
to start learning our API. We welcome
[contributions](https://github.com/keras-team/keras-nlp/blob/master/CONTRIBUTING.md).

---
## Quick links

* [KerasNLP API reference](/api/keras_nlp/)
* [KerasNLP on GitHub](https://github.com/keras-team/keras-nlp)
* [List of available pre-trained models](/api/keras_nlp/models/)

## Guides

* [Getting Started with KerasNLP](/guides/keras_nlp/getting_started/)
* [Uploading Models with KerasNLP](/guides/keras_nlp/upload/)
* [Pretraining a Transformer from scratch](/guides/keras_nlp/transformer_pretraining/)

## Examples

* [GPT-2 text generation](/examples/generative/gpt2_text_generation_with_kerasnlp/)
* [Parameter-efficient fine-tuning of GPT-2 with LoRA](/examples/nlp/parameter_efficient_finetuning_of_gpt2_with_lora/)
* [Semantic Similarity](/examples/nlp/semantic_similarity_with_keras_nlp/)
* [Sentence embeddings using Siamese RoBERTa-networks](/examples/nlp/sentence_embeddings_with_sbert/)
* [Data Parallel Training with tf.distribute](/examples/nlp/data_parallel_training_with_keras_nlp/)
* [English-to-Spanish translation](/examples/nlp/neural_machine_translation_with_keras_nlp/)
* [GPT text generation from scratch](/examples/generative/text_generation_gpt/)
* [Text Classification using FNet](/examples/nlp/fnet_classification_with_keras_nlp/)

---
## Installation

KerasNLP supports both Keras 2 and Keras 3. We recommend Keras 3 for all new
users, as it enables using KerasNLP models and layers with JAX, TensorFlow and
PyTorch.

### Keras 2 Installation

To install the latest KerasNLP release with Keras 2, simply run:

```
pip install --upgrade keras-nlp
```

### Keras 3 Installation

There are currently two ways to install Keras 3 with KerasNLP. To install the
stable versions of KerasNLP and Keras 3, you should install Keras 3 **after**
installing KerasNLP. This is a temporary step while TensorFlow is pinned to
Keras 2, and will no longer be necessary after TensorFlow 2.16.

```
pip install --upgrade keras-nlp
pip install --upgrade keras
```

To install the latest nightly changes for both KerasNLP and Keras, you can use
our nightly package.

```
pip install --upgrade keras-nlp-nightly
```

**Note:** Keras 3 will not function with TensorFlow 2.14 or earlier.

See [Getting started with Keras](/getting_started/) for more information on
installing Keras generally and compatibility with different frameworks.

---
## Quickstart

Fine-tune BERT on a small sentiment analysis task using the
[`keras_nlp.models`](/api/keras_nlp/models/) API:

```python
import os
os.environ["KERAS_BACKEND"] = "tensorflow" # Or "jax" or "torch"!

import keras_nlp
import tensorflow_datasets as tfds

imdb_train, imdb_test = tfds.load(
    "imdb_reviews",
    split=["train", "test"],
    as_supervised=True,
    batch_size=16,
)
# Load a BERT model.
classifier = keras_nlp.models.BertClassifier.from_preset(
    "bert_base_en_uncased",
    num_classes=2,
)
# Fine-tune on IMDb movie reviews.
classifier.fit(imdb_train, validation_data=imdb_test)
# Predict two new examples.
classifier.predict(["What an amazing movie!", "A total waste of my time."])
```

---
## Compatibility

We follow [Semantic Versioning](https://semver.org/), and plan to
provide backwards compatibility guarantees both for code and saved models built
with our components. While we continue with pre-release `0.y.z` development, we
may break compatibility at any time and APIs should not be considered stable.

## Disclaimer

KerasNLP provides access to pre-trained models via the `keras_nlp.models` API.
These pre-trained models are provided on an "as is" basis, without warranties
or conditions of any kind. The following underlying models are provided by third
parties, and subject to separate licenses:
BART, DeBERTa, DistilBERT, GPT-2, OPT, RoBERTa, Whisper, and XLM-RoBERTa.

## Citing KerasNLP

If KerasNLP helps your research, we appreciate your citations.
Here is the BibTeX entry:

```bibtex
@misc{kerasnlp2022,
title={KerasNLP},
author={Watson, Matthew and Qian, Chen and Bischof, Jonathan and Chollet, Fran\c{c}ois and others},
year={2022},
howpublished={\url{https://github.com/keras-team/keras-nlp}},
}
```
21 changes: 21 additions & 0 deletions graph_rag/experiments/artifacts/data_keras/index3.md
@@ -0,0 +1,21 @@
# KerasCV Bounding Boxes

All KerasCV components that process bounding boxes require a `bounding_box_format`
argument. This argument allows you to seamlessly integrate KerasCV components into
your own workflows while preserving proper behavior of the components themselves.

Bounding boxes are represented by dictionaries with two keys: `'boxes'` and `'classes'`:

```
{
    'boxes': [batch, num_boxes, 4],
    'classes': [batch, num_boxes]
}
```

To ensure your bounding boxes comply with the KerasCV specification, you can use [`keras_cv.bounding_box.validate_format(boxes)`](https://github.com/keras-team/keras-cv/blob/master/keras_cv/bounding_box/validate_format.py).

The bounding box formats supported in KerasCV
[are listed in the API docs](/api/keras_cv/bounding_box/formats).
If a format you would like to use is missing,
[feel free to open a GitHub issue on KerasCV](https://github.com/keras-team/keras-cv/issues)!
7 changes: 7 additions & 0 deletions graph_rag/experiments/artifacts/data_keras/index4.md
@@ -0,0 +1,7 @@
# KerasCV

These guides cover the [KerasCV](/keras_cv/) library.

## Available guides

{{toc}}
7 changes: 7 additions & 0 deletions graph_rag/experiments/artifacts/data_keras/index5.md
@@ -0,0 +1,7 @@
# KerasNLP

These guides cover the [KerasNLP](/keras_nlp/) library.

## Available guides

{{toc}}