Poetrize and Dockerize the DataVerse project so students can do a "docker compose up" and run all the notebooks :) #30

Open · wants to merge 7 commits into main
3 changes: 3 additions & 0 deletions .dockerignore
@@ -0,0 +1,3 @@
.git
data
logs
Empty file added .env
Empty file.
20 changes: 20 additions & 0 deletions .flake8
@@ -0,0 +1,20 @@
[flake8]
min_python_version = 3.10.0
max-line-length = 100
ignore =
    # Whitespace before ':' (E203)
    E203
    # Line lengths are recommended to be no greater than 79 characters. (E501)
    E501
    # Line break occurred before a binary operator (W503)
    W503
    # Line break occurred after a binary operator (W504) - both are required
    W504
max-complexity = 10
# Enforce numpy docstring format
docstring-convention = numpy
# Encourage f-strings
format-greedy = 2
# Double quotes are preferred
inline-quotes = double
multiline-quotes = double
32 changes: 32 additions & 0 deletions Dockerfile
@@ -0,0 +1,32 @@
# Start from a Jupyter Docker Stacks version
FROM jupyter/scipy-notebook:python-3.10.11

# Poetry configuration: no virtualenv, pinned Poetry version. GRANT_SUDO alone doesn't work :(
ENV POETRY_VIRTUALENVS_CREATE=false \
    POETRY_VERSION=1.4.2 \
    GRANT_SUDO=yes

# The docker stacks make sudo very difficult, so we [just be root™]
USER root
RUN apt update && \
    apt upgrade -y && \
    apt install -y curl && \
    rm -rf /var/lib/apt/lists/*

# Go back to jovyan user so we don't have permission problems
USER ${NB_USER}

# Install poetry so we can install our package requirements
RUN curl -sSL https://install.python-poetry.org | python3 -
ENV PATH="/home/jovyan/.local/bin:$PATH"

# Copy our poetry configuration files as jovyan user
COPY --chown=${NB_UID}:${NB_GID} pyproject.toml "/home/${NB_USER}/work/"
COPY --chown=${NB_UID}:${NB_GID} poetry.lock "/home/${NB_USER}/work/"

# Install our package requirements via poetry. No venv. Squash max-workers error.
WORKDIR "/home/${NB_USER}/work"
RUN poetry config virtualenvs.create false && \
    poetry config installer.max-workers 10 && \
    poetry install --no-interaction --no-ansi --no-root -vvv && \
    poetry cache clear pypi --all -n
154 changes: 151 additions & 3 deletions README.md
@@ -1,15 +1,163 @@
# DataVerse
Welcome to DataVerse, a repository dedicated to sharing code snippets and notebooks used in my blog posts, articles, and conference talks on machine learning, data science, and related topics.

## Contents

This repository contains a collection of code snippets, notebooks, my Langchain 101 Course and other materials.

## Usage

Feel free to use the code in this repository for your own projects, or to learn from it. Kindly give credit to DataVerse and link back to the original post or article where the code was used.

## License

All code in this repository is licensed under the MIT License. This means that you are free to use, modify, and distribute the code as long as you give credit to DataVerse and include the original license in your work.

## Contact

If you have any questions or comments about this repository, please feel free to reach out to us at [[email protected]](mailto:[email protected]).

## Code Environment Setup

I provide a Docker image for this course that uses [Jupyter Notebooks](https://jupyter.org/). Docker allows you to run the class's code in an environment precisely matching the one in which the code was developed and tested. You can also use the Docker image to run the course code in VSCode or another editor (see below).

In addition to Docker, you can also set up an environment locally using the instructions below.

### Install Docker

[Install docker](https://docs.docker.com/engine/install/) and then check the [Get Started](https://www.docker.com/get-started/) page if you aren't familiar.

There are several docker containers used in this course:

- `jupyter`: Jupyter Notebook server where we will interactively write and run code.
- `chroma`: Chroma vector database server where we will store and query vector embeddings of documents for RAG.
- `neo4j`: Neo4j graph database server where we will store and query graph data for prompt engineering and fine-tuning LLMs.
- `opensearch`: OpenSearch server where we will store and query documents for RAG.

### Docker Compose

Bring up the course environment with the following command:

```bash
docker compose up -d
```

Find the Jupyter Notebook URL via this command:

```bash
docker logs jupyter -f --tail 100
```

Look for the URL containing `127.0.0.1` and open it. You should see the Jupyter Notebook home page.
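
If you want to sanity-check the backing services from Python, the hedged sketch below is one way to do it. It assumes the `opensearch-py` and `neo4j` client packages are available (this PR's diff does not confirm them) and that you run it on the host, where `docker-compose.yml` publishes ports 9200 and 7687; from inside the `jupyter` container you would use the compose service names instead, provided the containers share a network.

```python
# Hedged sketch: verify the OpenSearch and Neo4j services are up from Python.
# The opensearch-py and neo4j client packages are assumed; this PR's diff does
# not show pyproject.toml, so confirm they are actually pinned there.
from neo4j import GraphDatabase
from opensearchpy import OpenSearch

# docker-compose.yml publishes 9200 (OpenSearch) and 7687 (Neo4j Bolt) to the host.
os_client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}], use_ssl=False)
print("OpenSearch version:", os_client.info()["version"]["number"])

# envs/neo4j.env sets NEO4J_AUTH=none, so no credentials are required.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=None)
with driver.session() as session:
    print("Neo4j says:", session.run("RETURN 1 AS ok").single()["ok"])
driver.close()
```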

NOTE: Insert an image of Jupyter home page for this course.

### Docker and VSCode

NOTE: add instructions.

## Code-Level Environment Setup

We use a Docker image to run the course, but you can also set up the environment so the code will work in VSCode or another editor. We provide a development tools setup using `black`, `flake8`, `isort`, `mypy` and `pre-commit` for you to modify and use as you see fit.

### Install Anaconda Python

We use Anaconda Python, version 3.10, for this course. You can download Anaconda Python from [here](https://www.anaconda.com/products/individual). Once you have installed it, you can create a new environment for this course by running the following command:

```bash
conda create -n chatbot-class python=3.10 -y
```

When you create a new environment or start a new shell, you will need to activate the `chatbot-class` conda environment with the following command:

```bash
conda activate chatbot-class
```

Now you are running Python 3.10 in the `chatbot-class` environment. To use this Python in VSCode, hit SHIFT-CMD-P (on Mac) and select `Python: Select Interpreter`. Then select the `chatbot-class` environment's Python.

To deactivate this environment, run:

```bash
conda deactivate
```

#### Other Virtual Environments

Note: I don't support other environments, but you can use any Python 3.10 if you are smart enough to make that work. :) You will need to manage your own virtual environments. Python 3's [`venv`](https://docs.python.org/3/library/venv.html) module is easy to use.

To create a `venv` for the project, run:

```bash
python3 -m venv chatbot-class
```

To activate this venv run:

```bash
source chatbot-class/bin/activate
```

To deactivate this environment, run:

```bash
deactivate
```

### Install Poetry for Dependency Management

We use [Poetry](https://python-poetry.org/) for dependency management, as it makes things fairly painless.

Check the [Poetry installation instructions](https://python-poetry.org/docs/#installation) to verify that the URL `https://install.python-poetry.org` is legitimate before piping it to `python3`.

Then install Poetry with the following command:

```bash
curl -sSL https://install.python-poetry.org | python3 -
```

Alternatively, you can install Poetry via `pip`, although this is less "clean" in terms of environment isolation:

```bash
pip install poetry
```

### Install Dependencies via Poetry

```bash
poetry install
```

## Essential Tools

### [`langchain`](https://www.langchain.com/) ([docs](https://python.langchain.com/docs/get_started/introduction))

> LangChain is a framework for developing applications powered by language models. It enables applications that:
>
> * Are context-aware: connect a language model to sources of context (prompt instructions, few shot examples, content to ground its response in, etc.)
> * Reason: rely on a language model to reason (about how to answer based on provided context, what actions to take, etc.)
>
> The main value props of LangChain are:
>
> * Components: abstractions for working with language models, along with a collection of implementations for each abstraction. Components are modular and easy-to-use, whether you are using the rest of the LangChain framework or not
> * Off-the-shelf chains: a structured assembly of components for accomplishing specific higher-level tasks
>
> Off-the-shelf chains make it easy to get started. For complex applications, components make it easy to customize existing chains and build new ones.
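
To make that concrete, here is a minimal, hedged sketch of a single prompt-to-LLM chain. It assumes the classic pre-1.0 `langchain` API (`PromptTemplate`, `LLMChain`, and the `OpenAI` wrapper) and an `OPENAI_API_KEY` in the environment (for example via `envs/openai.env`); adjust the imports to whichever langchain version `pyproject.toml` pins.

```python
# Hedged sketch of a single prompt -> LLM chain using the classic LangChain API.
# Assumes an OPENAI_API_KEY in the environment (e.g. loaded via envs/openai.env)
# and a pre-1.0 langchain release; adjust imports for the version you install.
from langchain.chains import LLMChain
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate

prompt = PromptTemplate(
    input_variables=["topic"],
    template="Explain {topic} in one short paragraph for a data science student.",
)
llm = OpenAI(temperature=0)  # the OpenAI completion-model wrapper
chain = LLMChain(llm=llm, prompt=prompt)

print(chain.run(topic="retrieval-augmented generation"))
```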

### [`langchain-hub`](https://github.com/hwchase17/langchain-hub)

> Taking inspiration from Hugging Face Hub, LangChainHub is collection of all artifacts useful for working with LangChain primitives such as prompts, chains and agents. The goal of this repository is to be a central resource for sharing and discovering high quality prompts, chains and agents that combine together to form complex LLM applications.

See [example usage here](https://python.langchain.com/docs/use_cases/question_answering/).
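
The sketch below is one hedged way to pull a shared prompt by name. It assumes a `langchain` version that exposes `langchain.hub` (backed by the hosted LangChain Hub rather than the GitHub repository linked above) plus the separate `langchainhub` client package; neither is confirmed by this PR.

```python
# Hedged sketch: pull a shared prompt by name from the hosted LangChain Hub.
# Assumes a langchain version that exposes `langchain.hub` and that the
# separate `langchainhub` client package is installed; neither is confirmed here.
from langchain import hub

rag_prompt = hub.pull("rlm/rag-prompt")  # a commonly referenced public RAG prompt
print(rag_prompt)  # inspect the template and its input variables
```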

### [`llama-index`](https://github.com/jerryjliu/llama_index)

> LlamaIndex is a "data framework" to help you build LLM apps. It provides the following tools:
>
> * Offers data connectors to ingest your existing data sources and data formats (APIs, PDFs, docs, SQL, etc.)
> * Provides ways to structure your data (indices, graphs) so that this data can be easily used with LLMs.
> * Provides an advanced retrieval/query interface over your data: Feed in any LLM input prompt, get back retrieved context and knowledge-augmented output.
> * Allows easy integrations with your outer application framework (e.g. with LangChain, Flask, Docker, ChatGPT, anything else).
>
> LlamaIndex provides tools for both beginner users and advanced users. Our high-level API allows beginner users to use LlamaIndex to ingest and query their data in 5 lines of code. Our lower-level APIs allow advanced users to customize and extend any module (data connectors, indices, retrievers, query engines, reranking modules), to fit their needs.
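
The "ingest and query their data in 5 lines of code" path looks roughly like the sketch below. It assumes an older top-level `llama_index` import layout, an `OPENAI_API_KEY` in the environment, and some documents under `./data`; newer releases moved these classes to `llama_index.core`.

```python
# Hedged sketch of the high-level LlamaIndex path quoted above ("5 lines of code").
# Assumes an older top-level llama_index import layout, an OPENAI_API_KEY in the
# environment, and some documents under ./data; newer releases use llama_index.core.
from llama_index import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("data").load_data()  # ingest local files
index = VectorStoreIndex.from_documents(documents)     # embed and index them
query_engine = index.as_query_engine()

print(query_engine.query("What do these documents cover?"))
```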

113 changes: 113 additions & 0 deletions docker-compose.yml
@@ -0,0 +1,113 @@
version: "3.8"

services:

  jupyter:
    # TODO: Upgrade me to a RAPIDS image
    # image: jupyter/scipy-notebook:python-3.10.11
    image: rjurney/dataverse
    build:
      context: .
      dockerfile: Dockerfile
    container_name: jupyter
    ports:
      - 8888:8888
    volumes:
      - .:/home/jovyan/work
      - ./data:/home/jovyan/data
    environment:
      - JUPYTER_ENABLE_LAB=yes
      - DGLBACKEND=pytorch
    env_file:
      - envs/search.env
      - envs/openai.env
      - envs/wandb.env
      - .env
    restart: always

  neo4j:
    image: neo4j:5.11.0
    container_name: neo4j
    ports:
      - 7474:7474
      - 7687:7687
    networks:
      - opensearch-net
    volumes:
      - ./data/neo4j:/data
      - ./logs/neo4j:/logs
    environment:
      - NEO4J_ACCEPT_LICENSE_AGREEMENT=yes
    env_file:
      - envs/neo4j.env
      - .env
    restart: always

  opensearch-node1:
    image: opensearchproject/opensearch:latest
    container_name: opensearch-node1
    environment:
      - cluster.name=opensearch-cluster # Name the cluster
      - node.name=opensearch-node1 # Name the node that will run in this container
      - discovery.seed_hosts=opensearch-node1,opensearch-node2 # Nodes to look for when discovering the cluster
      - cluster.initial_cluster_manager_nodes=opensearch-node1,opensearch-node2 # Nodes eligible to serve as cluster manager
      - bootstrap.memory_lock=true # Disable JVM heap memory swapping
      - "OPENSEARCH_JAVA_OPTS=-Xms512m -Xmx512m" # Set min and max JVM heap sizes to at least 50% of system RAM
      - "DISABLE_INSTALL_DEMO_CONFIG=true" # Prevents execution of bundled demo script which installs demo certificates and security configurations to OpenSearch
      - "DISABLE_SECURITY_PLUGIN=true" # Disables Security plugin
    ulimits:
      memlock:
        soft: -1 # Set memlock to unlimited (no soft or hard limit)
        hard: -1
      nofile:
        soft: 65536 # Maximum number of open files for the opensearch user - set to at least 65536
        hard: 65536
    volumes:
      - opensearch-data1:/usr/share/opensearch/data # Creates volume called opensearch-data1 and mounts it to the container
    ports:
      - 9200:9200 # REST API
      - 9600:9600 # Performance Analyzer
    networks:
      - opensearch-net # All of the containers will join the same Docker bridge network

  opensearch-node2:
    image: opensearchproject/opensearch:latest
    container_name: opensearch-node2
    environment:
      - cluster.name=opensearch-cluster # Name the cluster
      - node.name=opensearch-node2 # Name the node that will run in this container
      - discovery.seed_hosts=opensearch-node1,opensearch-node2 # Nodes to look for when discovering the cluster
      - cluster.initial_cluster_manager_nodes=opensearch-node1,opensearch-node2 # Nodes eligible to serve as cluster manager
      - bootstrap.memory_lock=true # Disable JVM heap memory swapping
      - "OPENSEARCH_JAVA_OPTS=-Xms512m -Xmx512m" # Set min and max JVM heap sizes to at least 50% of system RAM
      - "DISABLE_INSTALL_DEMO_CONFIG=true" # Prevents execution of bundled demo script which installs demo certificates and security configurations to OpenSearch
      - "DISABLE_SECURITY_PLUGIN=true" # Disables Security plugin
    ulimits:
      memlock:
        soft: -1 # Set memlock to unlimited (no soft or hard limit)
        hard: -1
      nofile:
        soft: 65536 # Maximum number of open files for the opensearch user - set to at least 65536
        hard: 65536
    volumes:
      - opensearch-data2:/usr/share/opensearch/data # Creates volume called opensearch-data2 and mounts it to the container
    networks:
      - opensearch-net # All of the containers will join the same Docker bridge network

  opensearch-dashboards:
    image: opensearchproject/opensearch-dashboards:latest
    container_name: opensearch-dashboards
    ports:
      - 5601:5601 # Map host port 5601 to container port 5601
    expose:
      - "5601" # Expose port 5601 for web access to OpenSearch Dashboards
    environment:
      - 'OPENSEARCH_HOSTS=["http://opensearch-node1:9200","http://opensearch-node2:9200"]'
      - "DISABLE_SECURITY_DASHBOARDS_PLUGIN=true" # disables security dashboards plugin in OpenSearch Dashboards
    networks:
      - opensearch-net

volumes:
  opensearch-data1:
  opensearch-data2:

networks:
  opensearch-net:
8 changes: 8 additions & 0 deletions envs/neo4j.env
@@ -0,0 +1,8 @@
NEO4J_AUTH=none
NEO4J_dbms_transaction_concurrent_maximum=0
NEO4J_dbms_memory_heap_max__size=16g
NEO4J_PLUGINS='["apoc","apoc-extended","bloom","graph-data-science","graphql"]'
NEO4J_apoc_import_file_enabled=true
NEO4J_apoc_export_file_enabled=true
NEO4J_apoc_export_csv_data=true

2 changes: 2 additions & 0 deletions envs/openai.env
@@ -0,0 +1,2 @@
# OpenAI API Key
OPENAI_API_KEY=
5 changes: 5 additions & 0 deletions envs/search.env
@@ -0,0 +1,5 @@
# OpenSearch Cluster Info
OPENSEARCH_HOST=
OPENSEARCH_PORT=
OPENSEARCH_USER=
OPENSEARCH_PASSWD=
2 changes: 2 additions & 0 deletions envs/wandb.env
@@ -0,0 +1,2 @@
# Weights & Biases API Setup
WANDB_API_KEY=