Poetrize and Dockerize the DataVerse project so students can do a "docker compose up" and run all the notebooks :) #30

Open · wants to merge 7 commits into main
3 changes: 3 additions & 0 deletions .dockerignore
@@ -0,0 +1,3 @@
.git
data
logs
Empty file added .env
Empty file.
20 changes: 20 additions & 0 deletions .flake8
@@ -0,0 +1,20 @@
[flake8]
min_python_version = 3.10.0
max-line-length = 100
ignore =
    # Whitespace before ':' (E203)
    E203
    # Line lengths are recommended to be no greater than 79 characters. (E501)
    E501
    # Line break occurred before a binary operator (W503)
    W503
    # Line break occurred after a binary operator (W504) - both are required
    W504
max-complexity = 10
# Enforce numpy docstring format
docstring-convention = numpy
# Encourage f-strings
format-greedy = 2
# Double quotes are preferred
inline-quotes = double
multiline-quotes = double
32 changes: 32 additions & 0 deletions Dockerfile
@@ -0,0 +1,32 @@
# Start from a Jupyter Docker Stacks version
FROM jupyter/scipy-notebook:python-3.10.11

# Poetry configuration: no virtualenv, pinned Poetry version. GRANT_SUDO alone doesn't work :(
ENV POETRY_VIRTUALENVS_CREATE=false \
    POETRY_VERSION=1.4.2 \
    GRANT_SUDO=yes

# The docker stacks make sudo very difficult, so we [just be root™]
USER root
RUN apt update && \
    apt upgrade -y && \
    apt install -y curl && \
    rm -rf /var/lib/apt/lists/*

# Go back to jovyan user so we don't have permission problems
USER ${NB_USER}

# Install poetry so we can install our package requirements
RUN curl -sSL https://install.python-poetry.org | python3 -
ENV PATH="/home/jovyan/.local/bin:$PATH"

# Copy our poetry configuration files as jovyan user
COPY --chown=${NB_UID}:${NB_GID} pyproject.toml "/home/${NB_USER}/work/"
COPY --chown=${NB_UID}:${NB_GID} poetry.lock "/home/${NB_USER}/work/"

# Install our package requirements via poetry. No venv. Squash max-workers error.
WORKDIR "/home/${NB_USER}/work"
RUN poetry config virtualenvs.create false && \
    poetry config installer.max-workers 10 && \
    poetry install --no-interaction --no-ansi --no-root -vvv && \
    poetry cache clear pypi --all -n
154 changes: 151 additions & 3 deletions README.md
@@ -1,15 +1,163 @@
# DataVerse
Welcome to DataVerse, a repository dedicated to sharing code snippets and notebooks used in my blog posts, articles, and conference talks on machine learning, data science, and related topics.

## Contents

This repository contains a collection of code snippets, notebooks, my Langchain 101 Course and other materials.

## Usage

Feel free to use the code in this repository for your own projects, or to learn from it. Kindly give credit to DataVerse and link back to the original post or article where the code was used.

## License

All code in this repository is licensed under the MIT License. This means that you are free to use, modify, and distribute the code as long as you give credit to DataVerse and include the original license in your work.

## Contact

If you have any questions or comments about this repository, please feel free to reach out to us at [[email protected]](mailto:[email protected]).

## Code Environment Setup

I provide a Docker image for this course that uses [Jupyter Notebooks](https://jupyter.org/). Docker allows you to run the class's code in an environment precisely matching the one in which the code was developed and tested. You can also use the Docker image to run the course code in VSCode or another editor (see below).

In addition to Docker, you can also set up an environment locally using the instructions below.

### Install Docker

[Install docker](https://docs.docker.com/engine/install/) and then check the [Get Started](https://www.docker.com/get-started/) page if you aren't familiar.

There are several docker containers used in this course:

- `jupyter`: Jupyter Notebook server where we will interactively write and run code.
- `chroma`: Chroma vector database server where we will store and query vector embeddings of documents for RAG.
- `neo4j`: Neo4j graph database server where we will store and query graph data for prompt engineering and fine-tuning LLMs.
- `opensearch`: OpenSearch server where we will store and query documents for RAG.

### Docker Compose

Bring up the course environment with the following command:

```bash
docker compose up -d
```

Find the Jupyter Notebook URL via this command:

```bash
docker logs jupyter -f --tail 100
```

Look for the URL containing `127.0.0.1` and open it. You should see the Jupyter Notebook home page.
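
If you want to sanity-check the backing services from Python, the hedged sketch below is one way to do it. It assumes the `opensearch-py` and `neo4j` client packages are available (this PR's diff does not confirm them) and that you run it on the host, where `docker-compose.yml` publishes ports 9200 and 7687; from inside the `jupyter` container you would use the compose service names instead, provided the containers share a network.

```python
# Hedged sketch: verify the OpenSearch and Neo4j services are up from Python.
# The opensearch-py and neo4j client packages are assumed; this PR's diff does
# not show pyproject.toml, so confirm they are actually pinned there.
from neo4j import GraphDatabase
from opensearchpy import OpenSearch

# docker-compose.yml publishes 9200 (OpenSearch) and 7687 (Neo4j Bolt) to the host.
os_client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}], use_ssl=False)
print("OpenSearch version:", os_client.info()["version"]["number"])

# envs/neo4j.env sets NEO4J_AUTH=none, so no credentials are required.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=None)
with driver.session() as session:
    print("Neo4j says:", session.run("RETURN 1 AS ok").single()["ok"])
driver.close()
```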

NOTE: Insert an image of Jupyter home page for this course.

### Docker and VSCode

NOTE: add instructions.

## Code-Level Environment Setup

We use a Docker image to run the course, but you can also set up the environment so the code will work in VSCode or another editor. We provide a development tools setup using `black`, `flake8`, `isort`, `mypy` and `pre-commit` for you to modify and use as you see fit.

### Install Anaconda Python

We use Anaconda Python, version 3.10, for this course. You can download Anaconda Python from [here](https://www.anaconda.com/products/individual). Once you have installed it, you can create a new environment for this course by running the following command:

```bash
conda create -n chatbot-class python=3.10 -y
```

When you create a new environment or start a new shell, you will need to activate the `chatbot-class` conda environment with the following command:

```bash
conda activate chatbot-class
```

Now you are running Python 3.10 in the `chatbot-class` environment. To use this Python in VSCode, hit SHIFT-CMD-P (on Mac) and select `Python: Select Interpreter`. Then select the `chatbot-class` environment's Python.

To deactivate this environment, run:

```bash
conda deactivate
```

#### Other Virtual Environments

Note: I don't support other environments, but you can use any Python 3.10 if you are smart enough to make that work. :) You will need to manage your own virtual environments. Python 3's [`venv`](https://docs.python.org/3/library/venv.html) module is easy to use.

To create a `venv` for the project, run:

```bash
python3 -m venv chatbot-class
```

To activate this venv run:

```bash
source chatbot-class/bin/activate
```

To deactivate this environment, run:

```bash
deactivate
```

### Install Poetry for Dependency Management

We use [Poetry](https://python-poetry.org/) for dependency management, as it makes things fairly painless.

Check the [Poetry installation instructions](https://python-poetry.org/docs/#installation) to verify that the URL `https://install.python-poetry.org` is legitimate before piping it to `python3`.

Then install Poetry with the following command:

```bash
curl -sSL https://install.python-poetry.org | python3 -
```

Alternatively, you can install Poetry via `pip`, although this is less "clean" in terms of environment isolation:

```bash
pip install poetry
```

### Install Dependencies via Poetry

```bash
poetry install
```

## Essential Tools

### [`langchain`](https://www.langchain.com/) ([docs](https://python.langchain.com/docs/get_started/introduction))

> LangChain is a framework for developing applications powered by language models. It enables applications that:
>
> * Are context-aware: connect a language model to sources of context (prompt instructions, few shot examples, content to ground its response in, etc.)
> * Reason: rely on a language model to reason (about how to answer based on provided context, what actions to take, etc.)
>
> The main value props of LangChain are:
>
> * Components: abstractions for working with language models, along with a collection of implementations for each abstraction. Components are modular and easy-to-use, whether you are using the rest of the LangChain framework or not
> * Off-the-shelf chains: a structured assembly of components for accomplishing specific higher-level tasks
>
> Off-the-shelf chains make it easy to get started. For complex applications, components make it easy to customize existing chains and build new ones.
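
To make that concrete, here is a minimal, hedged sketch of a single prompt-to-LLM chain. It assumes the classic pre-1.0 `langchain` API (`PromptTemplate`, `LLMChain`, and the `OpenAI` wrapper) and an `OPENAI_API_KEY` in the environment (for example via `envs/openai.env`); adjust the imports to whichever langchain version `pyproject.toml` pins.

```python
# Hedged sketch of a single prompt -> LLM chain using the classic LangChain API.
# Assumes an OPENAI_API_KEY in the environment (e.g. loaded via envs/openai.env)
# and a pre-1.0 langchain release; adjust imports for the version you install.
from langchain.chains import LLMChain
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate

prompt = PromptTemplate(
    input_variables=["topic"],
    template="Explain {topic} in one short paragraph for a data science student.",
)
llm = OpenAI(temperature=0)  # the OpenAI completion-model wrapper
chain = LLMChain(llm=llm, prompt=prompt)

print(chain.run(topic="retrieval-augmented generation"))
```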

### [`langchain-hub`](https://github.com/hwchase17/langchain-hub)

> Taking inspiration from Hugging Face Hub, LangChainHub is collection of all artifacts useful for working with LangChain primitives such as prompts, chains and agents. The goal of this repository is to be a central resource for sharing and discovering high quality prompts, chains and agents that combine together to form complex LLM applications.

See [example usage here](https://python.langchain.com/docs/use_cases/question_answering/).
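
The sketch below is one hedged way to pull a shared prompt by name. It assumes a `langchain` version that exposes `langchain.hub` (backed by the hosted LangChain Hub rather than the GitHub repository linked above) plus the separate `langchainhub` client package; neither is confirmed by this PR.

```python
# Hedged sketch: pull a shared prompt by name from the hosted LangChain Hub.
# Assumes a langchain version that exposes `langchain.hub` and that the
# separate `langchainhub` client package is installed; neither is confirmed here.
from langchain import hub

rag_prompt = hub.pull("rlm/rag-prompt")  # a commonly referenced public RAG prompt
print(rag_prompt)  # inspect the template and its input variables
```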

### [`llama-index`](https://github.com/jerryjliu/llama_index)

> LlamaIndex is a "data framework" to help you build LLM apps. It provides the following tools:
>
> * Offers data connectors to ingest your existing data sources and data formats (APIs, PDFs, docs, SQL, etc.)
> * Provides ways to structure your data (indices, graphs) so that this data can be easily used with LLMs.
> * Provides an advanced retrieval/query interface over your data: Feed in any LLM input prompt, get back retrieved context and knowledge-augmented output.
> * Allows easy integrations with your outer application framework (e.g. with LangChain, Flask, Docker, ChatGPT, anything else).
>
> LlamaIndex provides tools for both beginner users and advanced users. Our high-level API allows beginner users to use LlamaIndex to ingest and query their data in 5 lines of code. Our lower-level APIs allow advanced users to customize and extend any module (data connectors, indices, retrievers, query engines, reranking modules), to fit their needs.
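
The "ingest and query their data in 5 lines of code" path looks roughly like the sketch below. It assumes an older top-level `llama_index` import layout, an `OPENAI_API_KEY` in the environment, and some documents under `./data`; newer releases moved these classes to `llama_index.core`.

```python
# Hedged sketch of the high-level LlamaIndex path quoted above ("5 lines of code").
# Assumes an older top-level llama_index import layout, an OPENAI_API_KEY in the
# environment, and some documents under ./data; newer releases use llama_index.core.
from llama_index import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("data").load_data()  # ingest local files
index = VectorStoreIndex.from_documents(documents)     # embed and index them
query_engine = index.as_query_engine()

print(query_engine.query("What do these documents cover?"))
```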

113 changes: 113 additions & 0 deletions docker-compose.yml
@@ -0,0 +1,113 @@
version: "3.8"

services:

  jupyter:
    # TODO: Upgrade me to a RAPIDS image
    # image: jupyter/scipy-notebook:python-3.10.11
    image: rjurney/dataverse
    build:
      context: .
      dockerfile: Dockerfile
    container_name: jupyter
    ports:
      - 8888:8888
    volumes:
      - .:/home/jovyan/work
      - ./data:/home/jovyan/data
    environment:
      - JUPYTER_ENABLE_LAB=yes
      - DGLBACKEND=pytorch
    env_file:
      - envs/search.env
      - envs/openai.env
      - envs/wandb.env
      - .env
    restart: always

  neo4j:
    image: neo4j:5.11.0
    container_name: neo4j
    ports:
      - 7474:7474
      - 7687:7687
    networks:
      - opensearch-net
    volumes:
      - ./data/neo4j:/data
      - ./logs/neo4j:/logs
    environment:
      - NEO4J_ACCEPT_LICENSE_AGREEMENT=yes
    env_file:
      - envs/neo4j.env
      - .env
    restart: always

  opensearch-node1:
    image: opensearchproject/opensearch:latest
    container_name: opensearch-node1
    environment:
      - cluster.name=opensearch-cluster # Name the cluster
      - node.name=opensearch-node1 # Name the node that will run in this container
      - discovery.seed_hosts=opensearch-node1,opensearch-node2 # Nodes to look for when discovering the cluster
      - cluster.initial_cluster_manager_nodes=opensearch-node1,opensearch-node2 # Nodes eligible to serve as cluster manager
      - bootstrap.memory_lock=true # Disable JVM heap memory swapping
      - "OPENSEARCH_JAVA_OPTS=-Xms512m -Xmx512m" # Set min and max JVM heap sizes to at least 50% of system RAM
      - "DISABLE_INSTALL_DEMO_CONFIG=true" # Prevents execution of bundled demo script which installs demo certificates and security configurations to OpenSearch
      - "DISABLE_SECURITY_PLUGIN=true" # Disables Security plugin
    ulimits:
      memlock:
        soft: -1 # Set memlock to unlimited (no soft or hard limit)
        hard: -1
      nofile:
        soft: 65536 # Maximum number of open files for the opensearch user - set to at least 65536
        hard: 65536
    volumes:
      - opensearch-data1:/usr/share/opensearch/data # Creates volume called opensearch-data1 and mounts it to the container
    ports:
      - 9200:9200 # REST API
      - 9600:9600 # Performance Analyzer
    networks:
      - opensearch-net # All of the containers will join the same Docker bridge network

  opensearch-node2:
    image: opensearchproject/opensearch:latest
    container_name: opensearch-node2
    environment:
      - cluster.name=opensearch-cluster # Name the cluster
      - node.name=opensearch-node2 # Name the node that will run in this container
      - discovery.seed_hosts=opensearch-node1,opensearch-node2 # Nodes to look for when discovering the cluster
      - cluster.initial_cluster_manager_nodes=opensearch-node1,opensearch-node2 # Nodes eligible to serve as cluster manager
      - bootstrap.memory_lock=true # Disable JVM heap memory swapping
      - "OPENSEARCH_JAVA_OPTS=-Xms512m -Xmx512m" # Set min and max JVM heap sizes to at least 50% of system RAM
      - "DISABLE_INSTALL_DEMO_CONFIG=true" # Prevents execution of bundled demo script which installs demo certificates and security configurations to OpenSearch
      - "DISABLE_SECURITY_PLUGIN=true" # Disables Security plugin
    ulimits:
      memlock:
        soft: -1 # Set memlock to unlimited (no soft or hard limit)
        hard: -1
      nofile:
        soft: 65536 # Maximum number of open files for the opensearch user - set to at least 65536
        hard: 65536
    volumes:
      - opensearch-data2:/usr/share/opensearch/data # Creates volume called opensearch-data2 and mounts it to the container
    networks:
      - opensearch-net # All of the containers will join the same Docker bridge network

  opensearch-dashboards:
    image: opensearchproject/opensearch-dashboards:latest
    container_name: opensearch-dashboards
    ports:
      - 5601:5601 # Map host port 5601 to container port 5601
    expose:
      - "5601" # Expose port 5601 for web access to OpenSearch Dashboards
    environment:
      - 'OPENSEARCH_HOSTS=["http://opensearch-node1:9200","http://opensearch-node2:9200"]'
      - "DISABLE_SECURITY_DASHBOARDS_PLUGIN=true" # disables security dashboards plugin in OpenSearch Dashboards
    networks:
      - opensearch-net

volumes:
  opensearch-data1:
  opensearch-data2:

networks:
  opensearch-net:
8 changes: 8 additions & 0 deletions envs/neo4j.env
@@ -0,0 +1,8 @@
NEO4J_AUTH=none
NEO4J_dbms_transaction_concurrent_maximum=0
NEO4J_dbms_memory_heap_max__size=16g
NEO4J_PLUGINS='["apoc","apoc-extended","bloom","graph-data-science","graphql"]'
NEO4J_apoc_import_file_enabled=true
NEO4J_apoc_export_file_enabled=true
NEO4J_apoc_export_csv_data=true

2 changes: 2 additions & 0 deletions envs/openai.env
@@ -0,0 +1,2 @@
# OpenAI API Key
OPENAI_API_KEY=
5 changes: 5 additions & 0 deletions envs/search.env
@@ -0,0 +1,5 @@
# OpenSearch Cluster Info
OPENSEARCH_HOST=
OPENSEARCH_PORT=
OPENSEARCH_USER=
OPENSEARCH_PASSWD=
2 changes: 2 additions & 0 deletions envs/wandb.env
@@ -0,0 +1,2 @@
# Weights & Biases API Setup
WANDB_API_KEY=