Code generation

from chatlas import ChatAnthropic, ChatOpenAI

question = """
  How can I compute the mean and median of variables a, b, c, and so on,
  all the way up to z, grouped by age and sex.
"""

Basic flavour
When I don’t provide a system prompt, I sometimes get answers in a different language (like R):
chat = ChatAnthropic()
_ = chat.chat(question)
library(dplyr)

df %>%
  group_by(age, sex) %>%
  summarise(across(a:z, list(mean = ~mean(.x, na.rm = TRUE),
                             median = ~median(.x, na.rm = TRUE))))
# Alternative base R approach
aggregate(. ~ age + sex, data = df[, c("age", "sex", letters)],
          FUN = function(x) c(mean = mean(x), median = median(x)))
This will:

1. Group the data by age and sex
2. Calculate both mean and median for each variable a through z
3. Handle missing values with na.rm = TRUE
4. Return a dataframe with results for each age-sex combination

The output will have columns for age, sex, and mean/median values for each variable.
I can ensure that I always get Python code by providing a system prompt:
chat.system_prompt = "You are a helpful Python (not R) programming assistant."
_ = chat.chat(question)
Here’s how to compute mean and median for variables a through z using Python, likely with pandas:
import pandas as pd
# Assuming your data is in a DataFrame called df
result = df.groupby(['age', 'sex'])[list('abcdefghijklmnopqrstuvwxyz')].agg(['mean', 'median'])

# If you want to reset the index to make age and sex regular columns
result = result.reset_index()

# Alternative way with more explicit column selection:
columns_to_analyze = [chr(i) for i in range(ord('a'), ord('z')+1)]
result = df.groupby(['age', 'sex'])[columns_to_analyze].agg(['mean', 'median'])

This will:

1. Group the data by age and sex
2. Calculate both mean and median for each variable from ‘a’ to ‘z’
3. Handle missing values automatically
4. Create a multi-level column structure where each variable has both mean and median

The resulting DataFrame will have:

- age and sex as index (unless reset_index() is used)
- A hierarchical column structure where each variable (a-z) has both mean and median values

If you need to handle missing values differently, you can modify the aggregation like this:

result = df.groupby(['age', 'sex'])[columns_to_analyze].agg({
    col: ['mean', 'median'] for col in columns_to_analyze
}).dropna()  # or use .fillna(0) to replace NAs with zeros
Note that I’m using both a system prompt (which defines the general behaviour) and a user prompt (which asks the specific question). You could put all of the content in the user prompt and get similar results, but I think it’s helpful to use both to cleanly divide the general framing of the response from the specific questions that you want to ask.
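As a minimal sketch of that division (the question here is a hypothetical stand-in for whatever you want to ask):

from chatlas import ChatAnthropic

# General framing goes in the system prompt...
chat = ChatAnthropic(system_prompt="You are a helpful Python programming assistant.")

# ...while the specific question goes in the user prompt.
_ = chat.chat("How do I compute a grouped mean in pandas?")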
Since I’m mostly interested in the code, I ask it to drop the explanation:
chat.system_prompt = """
You are a helpful Python (not R) programming assistant.
Just give me the code without any text explanation.
"""
_ = chat.chat(question)
import pandas as pd
columns_to_analyze = list('abcdefghijklmnopqrstuvwxyz')
result = df.groupby(['age', 'sex'])[columns_to_analyze].agg(['mean', 'median'])
result = result.reset_index()
In this case, I seem to mostly get pandas code. But if you want a different style, you can ask for it:
chat.system_prompt = """
You are a helpful Python (not R) programming assistant who prefers polars to pandas.
Just give me the code without any text explanation.
"""
_ = chat.chat(question)
import polars as pl

columns_to_analyze = list('abcdefghijklmnopqrstuvwxyz')
result = (df.groupby(['age', 'sex'])
          .agg([pl.col(col).mean().alias(f'{col}_mean') for col in columns_to_analyze] +
               [pl.col(col).median().alias(f'{col}_median') for col in columns_to_analyze]))
Retrieval-Augmented Generation (RAG)
Retrieval-Augmented Generation (RAG) helps LLMs gain the context they need to accurately answer a question. Nowadays, LLMs are trained on a vast amount of data, but they can’t possibly know everything, especially when it comes to real-time or sensitive information that isn’t publicly available. In this article, we’ll walk through a simple example of how to leverage RAG in combination with chatlas.
The core idea of RAG is fairly simple, yet general: given a set of documents and a user query, find the document(s) that are the most “similar” to the query and supply those documents as additional context to the LLM. The LLM can then use this context to generate a response to the user query. There are many ways to measure similarity between a query and a document, but one common approach is to use embeddings. Embeddings are dense, low-dimensional vectors that represent the semantic content of a piece of text. By comparing the embeddings of the query and each document, we can compute a similarity score that tells us how closely related the query is to each document.
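To make the similarity score concrete, here’s a toy illustration (with made-up three-dimensional vectors standing in for real embeddings) of cosine similarity computed with numpy:

import numpy as np

# Two assumed vectors standing in for embeddings
a = np.array([1.0, 0.0, 1.0])
b = np.array([1.0, 1.0, 0.0])

# Cosine similarity: dot product divided by the product of the norms
cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cos_sim)  # 0.5; vectors pointing the same way score 1.0, orthogonal ones 0.0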
There are also many different ways to generate embeddings, but one popular method is to use pre-trained models like Sentence Transformers. Different models are trained on different datasets and thus have different strengths and weaknesses, so it’s worth experimenting with a few to see which one works best for your particular use case. In our example, we’ll use the all-MiniLM-L12-v2 model, which is a popular choice thanks to its balance of speed and accuracy.
from sentence_transformers import SentenceTransformer
embed_model = SentenceTransformer("sentence-transformers/all-MiniLM-L12-v2")
Supplied with an embedding model, we can now compute embeddings for each document in our set and for a user_query, then compare the query embedding to each document embedding to find the most similar document(s). A common way to measure similarity between two vectors is to compute the cosine similarity. The following code demonstrates how to do this:
import numpy as np

# Our list of documents (one document per list element)
documents = [
    "The Python programming language was created by Guido van Rossum.",
    "Python is known for its simple, readable syntax.",
    "Python supports multiple programming paradigms.",
]

# Compute embeddings for each document (do this once for performance reasons)
embeddings = [embed_model.encode([doc])[0] for doc in documents]


def get_top_k_similar_documents(
    user_query,
    documents,
    embeddings,
    embed_model,
    top_k=3,
):
    # Compute embedding for the user query
    query_embedding = embed_model.encode([user_query])[0]

    # Calculate cosine similarity between the query and each document
    similarities = np.dot(embeddings, query_embedding) / (
        np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query_embedding)
    )

    # Get the top-k most similar documents
    top_indices = np.argsort(similarities)[-top_k:][::-1]
    return [documents[i] for i in top_indices]


user_query = "Who created Python?"

top_docs = get_top_k_similar_documents(
    user_query,
    documents,
    embeddings,
    embed_model,
    top_k=3,
)
And now that we have the most similar documents, we can supply them to the LLM as context for generating a response to the user query. Here’s how we might do that using chatlas:
from chatlas import ChatAnthropic
chat = ChatAnthropic(
    system_prompt="""
    You are a helpful AI assistant. Using the provided context,
    answer the user's question. If you cannot answer the question based on the
    context, say so.
    """
)

_ = chat.chat(
    f"Context: {top_docs}\nQuestion: {user_query}"
)
Based on the context, Python was created by Guido van Rossum.
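If you plan to ask several questions against the same document set, it may be convenient to wrap retrieval and generation together. This helper is a sketch of our own (not part of the chatlas API) and assumes the objects defined above:

def answer_with_rag(chat, user_query, documents, embeddings, embed_model, top_k=3):
    # Retrieve the most relevant documents, then hand them to the LLM as context
    top_docs = get_top_k_similar_documents(
        user_query, documents, embeddings, embed_model, top_k=top_k
    )
    return chat.chat(f"Context: {top_docs}\nQuestion: {user_query}")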