Commit
Add Hunting Anomalies in the Stock Market scripts
justinpolygon committed Nov 4, 2024
1 parent 53808f8 commit f6a96b1
Showing 5 changed files with 478 additions and 0 deletions.
50 changes: 50 additions & 0 deletions examples/tools/hunting-anomalies/README.md
@@ -0,0 +1,50 @@
# Hunting Anomalies in the Stock Market

This repository contains all the necessary scripts and data directories used in the [Hunting Anomalies in the Stock Market](https://polygon.io/blog/hunting-anomalies-in-stock-market/) tutorial, hosted on Polygon.io's blog. The tutorial demonstrates how to detect statistical anomalies in historical US stock market data through a comprehensive workflow that involves downloading data, building a lookup table, querying for anomalies, and visualizing them through a web interface.

### Prerequisites

- Python 3.8+
- Access to Polygon.io's historical data via Flat Files
- An active Polygon.io API key, obtainable by signing up for a Stocks paid plan

### Repository Contents

- `README.md`: This file, outlining setup and execution instructions.
- `aggregates_day`: Directory where downloaded CSV data files are stored.
- `build-lookup-table.py`: Python script to build a lookup table from the historical data.
- `query-lookup-table.py`: Python script to query the lookup table for anomalies.
- `gui-lookup-table.py`: Python script for a browser-based interface to explore anomalies visually.

### Running the Tutorial

1. **Ensure Python 3.8+ is installed:** Check your Python version and install the required third-party libraries (`polygon-api-client` and `pandas`); `pickle` and `argparse` ship with Python's standard library.

2. **Set up your API key:** Make sure you have an active paid Polygon.io Stock subscription for accessing Flat Files. Set up your API key in your environment or directly in the scripts where required.
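As a quick sanity check, here is a minimal sketch that reads the key from the environment (the variable name `POLYGON_API_KEY` is an assumption for illustration, not something the scripts mandate):
```python
import os

from polygon import RESTClient

# Hypothetical variable name -- use whatever convention you prefer.
api_key = os.environ.get("POLYGON_API_KEY")
if not api_key:
    raise SystemExit("Set POLYGON_API_KEY before running the scripts.")

client = RESTClient(api_key=api_key)  # polygon-api-client REST client
```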

3. **Download Historical Data:** Use the MinIO client to download historical stock market data:
```bash
mc alias set s3polygon https://files.polygon.io YOUR_ACCESS_KEY YOUR_SECRET_KEY
mc cp --recursive s3polygon/flatfiles/us_stocks_sip/day_aggs_v1/2024/08/ ./aggregates_day/
mc cp --recursive s3polygon/flatfiles/us_stocks_sip/day_aggs_v1/2024/09/ ./aggregates_day/
mc cp --recursive s3polygon/flatfiles/us_stocks_sip/day_aggs_v1/2024/10/ ./aggregates_day/
gunzip ./aggregates_day/*.gz
```
Adjust the commands and paths based on the data you're interested in.
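To sanity-check a download, you can load one of the CSVs with pandas and confirm the columns the later scripts rely on (`ticker`, `window_start` in nanoseconds, `transactions`, `close`) are present:
```python
import glob

import pandas as pd

# Read the first downloaded daily-aggregates file.
path = sorted(glob.glob("aggregates_day/*.csv"))[0]
df = pd.read_csv(path)

# window_start is a nanosecond epoch timestamp in the flat files.
df["date"] = pd.to_datetime(df["window_start"], unit="ns").dt.date
print(df[["ticker", "date", "transactions", "close"]].head())
```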

4. **Build the Lookup Table:** This script processes the downloaded data and builds a lookup table, saving it as `lookup_table.pkl` (plus a human-readable `lookup_table.json`).
```bash
python build-lookup-table.py
```
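Each entry maps a ticker and date to that day's trade count, close price, percentage price change, and the trailing 5-day average and standard deviation of trade counts. You can inspect the result directly (the ticker and date below are just examples):
```python
import pickle

with open("lookup_table.pkl", "rb") as f:
    lookup_table = pickle.load(f)

# Example ticker/date -- substitute any pair present in your data.
stats = lookup_table["AAPL"]["2024-10-18"]
print(stats["trades"], stats["avg_trades"], stats["std_trades"])
```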

5. **Query Anomalies:** Replace `2024-10-18` with the date you want to analyze for anomalies.
```bash
python query-lookup-table.py 2024-10-18
```
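Conceptually, the query scans the lookup table for days where trading activity sits far above its trailing average. The sketch below illustrates that idea with a z-score on trade counts (the 3.0 threshold is an assumption for illustration, not necessarily what `query-lookup-table.py` uses):
```python
import pickle

with open("lookup_table.pkl", "rb") as f:
    lookup_table = pickle.load(f)

target_date = "2024-10-18"
for ticker, dates in lookup_table.items():
    stats = dates.get(target_date)
    if not stats or not stats["std_trades"]:
        continue  # no rolling stats for this ticker/date
    z = (stats["trades"] - stats["avg_trades"]) / stats["std_trades"]
    if z > 3.0:  # illustrative anomaly threshold
        print(f"{ticker}: {stats['trades']} trades (z-score {z:.1f})")
```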

6. **Run the GUI:** Start the server, then open `http://localhost:8888` in your browser to explore the anomalies visually.
```bash
python gui-lookup-table.py
```
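For orientation, serving on port 8888 needs nothing beyond Python's standard library; the sketch below is a placeholder illustration, not the actual `gui-lookup-table.py`:
```python
import http.server

PORT = 8888

class Handler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        # A real handler would render anomaly tables/charts from the lookup table.
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(b"<h1>Anomaly explorer placeholder</h1>")

http.server.HTTPServer(("localhost", PORT), Handler).serve_forever()
```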

For a complete step-by-step guide on each phase of the anomaly detection process, including additional configurations and troubleshooting, refer to the detailed [tutorial on our blog](https://polygon.io/blog/hunting-anomalies-in-stock-market).
1 change: 1 addition & 0 deletions examples/tools/hunting-anomalies/aggregates_day/README.md
@@ -0,0 +1 @@
Download flat files into this directory.
94 changes: 94 additions & 0 deletions examples/tools/hunting-anomalies/build-lookup-table.py
@@ -0,0 +1,94 @@
import os
import pandas as pd
from collections import defaultdict
import pickle
import json

# Directory containing the daily CSV files
data_dir = './aggregates_day/'

# Initialize a dictionary to hold trades data
trades_data = defaultdict(list)

# List all CSV files in the directory
files = sorted([f for f in os.listdir(data_dir) if f.endswith('.csv')])

print("Starting to process files...")

# Process each file (assuming files are named in order)
for file in files:
print(f"Processing {file}")
file_path = os.path.join(data_dir, file)
df = pd.read_csv(file_path)
# For each stock, store the date and relevant data
for _, row in df.iterrows():
ticker = row['ticker']
date = pd.to_datetime(row['window_start'], unit='ns').date()
trades = row['transactions']
close_price = row['close'] # Ensure 'close' column exists in your CSV
trades_data[ticker].append({
'date': date,
'trades': trades,
'close_price': close_price
})

print("Finished processing files.")
print("Building lookup table...")

# Now, build the lookup table with rolling averages and percentage price change
lookup_table = defaultdict(dict) # Nested dict: ticker -> date -> stats

for ticker, records in trades_data.items():
# Convert records to DataFrame
df_ticker = pd.DataFrame(records)
# Sort records by date
df_ticker.sort_values('date', inplace=True)
df_ticker.set_index('date', inplace=True)

# Calculate the percentage change in close_price
df_ticker['price_diff'] = df_ticker['close_price'].pct_change() * 100 # Multiply by 100 for percentage

# Shift trades to exclude the current day from rolling calculations
df_ticker['trades_shifted'] = df_ticker['trades'].shift(1)
# Calculate rolling average and standard deviation over the previous 5 days
df_ticker['avg_trades'] = df_ticker['trades_shifted'].rolling(window=5).mean()
df_ticker['std_trades'] = df_ticker['trades_shifted'].rolling(window=5).std()
# Store the data in the lookup table
for date, row in df_ticker.iterrows():
# Convert date to string for JSON serialization
date_str = date.strftime('%Y-%m-%d')
# Ensure rolling stats are available
if pd.notnull(row['avg_trades']) and pd.notnull(row['std_trades']):
lookup_table[ticker][date_str] = {
'trades': row['trades'],
'close_price': row['close_price'],
'price_diff': row['price_diff'],
'avg_trades': row['avg_trades'],
'std_trades': row['std_trades']
}
else:
# Store data without rolling stats if not enough data points
lookup_table[ticker][date_str] = {
'trades': row['trades'],
'close_price': row['close_price'],
'price_diff': row['price_diff'],
'avg_trades': None,
'std_trades': None
}

print("Lookup table built successfully.")

# Convert the outer defaultdict to a regular dict before serializing
lookup_table = {k: v for k, v in lookup_table.items()}

# Save the lookup table to a JSON file (note: the first day of each ticker
# has price_diff = NaN, which json.dump writes as non-standard NaN)
with open('lookup_table.json', 'w') as f:
json.dump(lookup_table, f, indent=4)

print("Lookup table saved to 'lookup_table.json'.")

# Save the lookup table to a file for later use
with open('lookup_table.pkl', 'wb') as f:
pickle.dump(lookup_table, f)

print("Lookup table saved to 'lookup_table.pkl'.")