Commit
Add Hunting Anomalies in the Stock Market scripts
justinpolygon committed Nov 4, 2024
1 parent 53808f8 commit f6a96b1
Showing 5 changed files with 478 additions and 0 deletions.
50 changes: 50 additions & 0 deletions examples/tools/hunting-anomalies/README.md
@@ -0,0 +1,50 @@
# Hunting Anomalies in the Stock Market

This repository contains all the necessary scripts and data directories used in the [Hunting Anomalies in the Stock Market](https://polygon.io/blog/hunting-anomalies-in-stock-market/) tutorial, hosted on Polygon.io's blog. The tutorial demonstrates how to detect statistical anomalies in historical US stock market data through a comprehensive workflow that involves downloading data, building a lookup table, querying for anomalies, and visualizing them through a web interface.

### Prerequisites

- Python 3.8+
- Access to Polygon.io's historical data via Flat Files
- An active Polygon.io API key, obtainable by signing up for a Stocks paid plan

### Repository Contents

- `README.md`: This file, outlining setup and execution instructions.
- `aggregates_day`: Directory where downloaded CSV data files are stored.
- `build-lookup-table.py`: Python script to build a lookup table from the historical data.
- `query-lookup-table.py`: Python script to query the lookup table for anomalies.
- `gui-lookup-table.py`: Python script for a browser-based interface to explore anomalies visually.

### Running the Tutorial

1. **Ensure Python 3.8+ is installed:** Check your Python version and install the required third-party libraries (`polygon-api-client` and `pandas`); `pickle` and `argparse` ship with Python's standard library.

2. **Set up your API key:** Make sure you have an active paid Polygon.io Stock subscription for accessing Flat Files. Set up your API key in your environment or directly in the scripts where required.
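As a quick sanity check, here is a minimal sketch that reads the key from the environment (the variable name `POLYGON_API_KEY` is an assumption for illustration, not something the scripts mandate):
```python
import os

from polygon import RESTClient

# Hypothetical variable name -- use whatever convention you prefer.
api_key = os.environ.get("POLYGON_API_KEY")
if not api_key:
    raise SystemExit("Set POLYGON_API_KEY before running the scripts.")

client = RESTClient(api_key=api_key)  # polygon-api-client REST client
```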

3. **Download Historical Data:** Use the MinIO client to download historical stock market data:
```bash
mc alias set s3polygon https://files.polygon.io YOUR_ACCESS_KEY YOUR_SECRET_KEY
mc cp --recursive s3polygon/flatfiles/us_stocks_sip/day_aggs_v1/2024/08/ ./aggregates_day/
mc cp --recursive s3polygon/flatfiles/us_stocks_sip/day_aggs_v1/2024/09/ ./aggregates_day/
mc cp --recursive s3polygon/flatfiles/us_stocks_sip/day_aggs_v1/2024/10/ ./aggregates_day/
gunzip ./aggregates_day/*.gz
```
Adjust the commands and paths based on the data you're interested in.
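To sanity-check a download, you can load one of the CSVs with pandas and confirm the columns the later scripts rely on (`ticker`, `window_start` in nanoseconds, `transactions`, `close`) are present:
```python
import glob

import pandas as pd

# Read the first downloaded daily-aggregates file.
path = sorted(glob.glob("aggregates_day/*.csv"))[0]
df = pd.read_csv(path)

# window_start is a nanosecond epoch timestamp in the flat files.
df["date"] = pd.to_datetime(df["window_start"], unit="ns").dt.date
print(df[["ticker", "date", "transactions", "close"]].head())
```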

4. **Build the Lookup Table:** This script processes the downloaded data and builds a lookup table, saving it as `lookup_table.pkl` (plus a human-readable `lookup_table.json`).
```bash
python build-lookup-table.py
```
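Each entry maps a ticker and date to that day's trade count, close price, percentage price change, and the trailing 5-day average and standard deviation of trade counts. You can inspect the result directly (the ticker and date below are just examples):
```python
import pickle

with open("lookup_table.pkl", "rb") as f:
    lookup_table = pickle.load(f)

# Example ticker/date -- substitute any pair present in your data.
stats = lookup_table["AAPL"]["2024-10-18"]
print(stats["trades"], stats["avg_trades"], stats["std_trades"])
```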

5. **Query Anomalies:** Replace `2024-10-18` with the date you want to analyze for anomalies.
```bash
python query-lookup-table.py 2024-10-18
```
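Conceptually, the query scans the lookup table for days where trading activity sits far above its trailing average. The sketch below illustrates that idea with a z-score on trade counts (the 3.0 threshold is an assumption for illustration, not necessarily what `query-lookup-table.py` uses):
```python
import pickle

with open("lookup_table.pkl", "rb") as f:
    lookup_table = pickle.load(f)

target_date = "2024-10-18"
for ticker, dates in lookup_table.items():
    stats = dates.get(target_date)
    if not stats or not stats["std_trades"]:
        continue  # no rolling stats for this ticker/date
    z = (stats["trades"] - stats["avg_trades"]) / stats["std_trades"]
    if z > 3.0:  # illustrative anomaly threshold
        print(f"{ticker}: {stats['trades']} trades (z-score {z:.1f})")
```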

6. **Run the GUI:** Start the server, then open `http://localhost:8888` in your browser to explore the anomalies visually.
```bash
python gui-lookup-table.py
```
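For orientation, serving on port 8888 needs nothing beyond Python's standard library; the sketch below is a placeholder illustration, not the actual `gui-lookup-table.py`:
```python
import http.server

PORT = 8888

class Handler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        # A real handler would render anomaly tables/charts from the lookup table.
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(b"<h1>Anomaly explorer placeholder</h1>")

http.server.HTTPServer(("localhost", PORT), Handler).serve_forever()
```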

For a complete step-by-step guide on each phase of the anomaly detection process, including additional configurations and troubleshooting, refer to the detailed [tutorial on our blog](https://polygon.io/blog/hunting-anomalies-in-stock-market).
1 change: 1 addition & 0 deletions examples/tools/hunting-anomalies/aggregates_day/README.md
@@ -0,0 +1 @@
Download flat files into this directory.
94 changes: 94 additions & 0 deletions examples/tools/hunting-anomalies/build-lookup-table.py
@@ -0,0 +1,94 @@
import os
import pandas as pd
from collections import defaultdict
import pickle
import json

# Directory containing the daily CSV files
data_dir = './aggregates_day/'

# Initialize a dictionary to hold trades data
trades_data = defaultdict(list)

# List all CSV files in the directory
files = sorted([f for f in os.listdir(data_dir) if f.endswith('.csv')])

print("Starting to process files...")

# Process each file (assuming files are named in order)
for file in files:
print(f"Processing {file}")
file_path = os.path.join(data_dir, file)
df = pd.read_csv(file_path)
# For each stock, store the date and relevant data
for _, row in df.iterrows():
ticker = row['ticker']
date = pd.to_datetime(row['window_start'], unit='ns').date()
trades = row['transactions']
close_price = row['close'] # Ensure 'close' column exists in your CSV
trades_data[ticker].append({
'date': date,
'trades': trades,
'close_price': close_price
})

print("Finished processing files.")
print("Building lookup table...")

# Now, build the lookup table with rolling averages and percentage price change
lookup_table = defaultdict(dict) # Nested dict: ticker -> date -> stats

for ticker, records in trades_data.items():
# Convert records to DataFrame
df_ticker = pd.DataFrame(records)
# Sort records by date
df_ticker.sort_values('date', inplace=True)
df_ticker.set_index('date', inplace=True)

# Calculate the percentage change in close_price
df_ticker['price_diff'] = df_ticker['close_price'].pct_change() * 100 # Multiply by 100 for percentage

# Shift trades to exclude the current day from rolling calculations
df_ticker['trades_shifted'] = df_ticker['trades'].shift(1)
# Calculate rolling average and standard deviation over the previous 5 days
df_ticker['avg_trades'] = df_ticker['trades_shifted'].rolling(window=5).mean()
df_ticker['std_trades'] = df_ticker['trades_shifted'].rolling(window=5).std()
# Store the data in the lookup table
for date, row in df_ticker.iterrows():
# Convert date to string for JSON serialization
date_str = date.strftime('%Y-%m-%d')
# Ensure rolling stats are available
if pd.notnull(row['avg_trades']) and pd.notnull(row['std_trades']):
lookup_table[ticker][date_str] = {
'trades': row['trades'],
'close_price': row['close_price'],
'price_diff': row['price_diff'],
'avg_trades': row['avg_trades'],
'std_trades': row['std_trades']
}
else:
# Store data without rolling stats if not enough data points
lookup_table[ticker][date_str] = {
'trades': row['trades'],
'close_price': row['close_price'],
'price_diff': row['price_diff'],
'avg_trades': None,
'std_trades': None
}

print("Lookup table built successfully.")

# Convert the outer defaultdict to a regular dict before serializing
lookup_table = {k: v for k, v in lookup_table.items()}

# Save the lookup table to a JSON file (note: the first day of each ticker
# has price_diff = NaN, which json.dump writes as non-standard NaN)
with open('lookup_table.json', 'w') as f:
json.dump(lookup_table, f, indent=4)

print("Lookup table saved to 'lookup_table.json'.")

# Save the lookup table to a file for later use
with open('lookup_table.pkl', 'wb') as f:
pickle.dump(lookup_table, f)

print("Lookup table saved to 'lookup_table.pkl'.")