Application of AI to Phishing Email Detection

Report Project

📽️Demo link：https://youtu.be/aACeqIV1ORc
📝Slides：I1_final.pdf

(2) Background

Motivation

The course touched on how generative AI can be combined with fraud, which struck us as an intriguing angle. Starting from that idea and some initial data collection, we decided to focus on using AI to analyze URLs and identify links to phishing websites. Phishing links are a common vector for cybersecurity breaches and appear frequently in attacks delivered through email and text messages. Despite their long history, they remain a persistent threat.

Analyzing links has a further advantage. Phishing emails often target businesses, and the emails themselves raise privacy concerns. By analyzing only the submitted URL to determine whether it leads to a phishing website, we can better protect privacy.

Existing phishing website databases are typically populated through manual submissions that are then verified. AI-based detection offers a faster, closer to real-time alternative. We chose GPT-3.5 for this purpose.

(3) Solutions

Database
Initial training
- Web Crawler
- Add OCR text to the simplified HTML
- Classify pages with ChatGPT through a prompt
Deep Learning
- Given a website URL directly, the system reports either safe or warning, along with a score.
- Use supervised learning to label the characteristics of phishing URLs and improve recognition.
Conclusion
- Used a confusion matrix to compare the accuracy of our natural language processing model with traditional machine learning for phishing email recognition.
- Suggested potential directions for future research.
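
The confusion-matrix comparison mentioned in the outline above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the project's own evaluation code: the function names, the label strings "phishing" and "legitimate", and the sample labels are all hypothetical.

```python
def confusion_matrix(y_true, y_pred, positive="phishing"):
    """Count TP, FP, FN, TN for a binary task (label names are assumptions)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1  # phishing correctly flagged
            else:
                fp += 1  # legitimate page flagged as phishing
        else:
            if truth == positive:
                fn += 1  # phishing page missed
            else:
                tn += 1  # legitimate page correctly passed
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn}


def metrics(cm):
    """Derive accuracy, precision, and recall from the four counts."""
    total = sum(cm.values())
    flagged = cm["tp"] + cm["fp"]
    actual = cm["tp"] + cm["fn"]
    return {
        "accuracy": (cm["tp"] + cm["tn"]) / total if total else 0.0,
        "precision": cm["tp"] / flagged if flagged else 0.0,
        "recall": cm["tp"] / actual if actual else 0.0,
    }


# Made-up labels for illustration only; not results from this project.
y_true = ["phishing", "phishing", "legitimate", "legitimate", "phishing", "legitimate"]
y_pred = ["phishing", "legitimate", "legitimate", "phishing", "phishing", "legitimate"]

cm = confusion_matrix(y_true, y_pred)
scores = metrics(cm)
print(cm)
print(scores)
```

Running the same two functions on the predictions of each model over the same sample set gives numbers that can be compared side by side.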

Reference

Machine-learning analysis of spam and phishing email headers
Learning OpenAI API
OpenAI dedicated assistant: web part
ChatGPT Writer
Detecting Phishing Sites Using ChatGPT (2023/06/09)
Anomaly Detection in Emails using Machine Learning and Header Information (2022/03/19)
Phishing by Form: The Abuse of Form Sites (2011/10/18, IEEE)

Error

The ChatGPT bot would not run at all. After comparing our setup with that of a teammate whose bot worked, we found that a persist folder was missing.

Quick folder index

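Confusion Matrix: computes the confusion matrix and selects the samples used for it
crawl: bug1 fetches text only; bug2 fetches the entire HTML; craw is built on the Firefox engine; web fetches the HTML and takes a screenshot for later OCR processing
de-identification: extracts the text from the HTML files in the storage folder in one step. coool keeps the meta values, HTML, and URL, and has a folder; lighter has no meta values
weeeb: fallback for when ChatGPT cannot analyze a page; when a URL is passed in, all the initial-training steps are run once in the background
main: back-end training
gpt-master: API integration and the website demo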
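
The text-extraction step handled by de-identification, together with the "add OCR text to the simplified HTML" step from the Solutions outline, can be sketched roughly as below. This is an illustration under stated assumptions, not the code in those folders: it uses only Python's standard html.parser, the names VisibleTextExtractor, simplify_html, and build_prompt are hypothetical, the prompt wording is invented, and the OCR text is assumed to come from a separate step that is not shown.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collect visible text, skipping <script>, <style>, and <noscript>."""

    SKIPPED = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            text = " ".join(data.split())  # collapse runs of whitespace
            if text:
                self.chunks.append(text)


def simplify_html(html):
    """Return the page's visible text, one chunk per line."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def build_prompt(url, html, ocr_text=""):
    """Assemble a classification prompt (wording is a hypothetical example)."""
    parts = [
        "Decide whether the following page is a phishing site.",
        "Answer with 'safe' or 'warning' and a numeric score.",
        f"URL: {url}",
        "Page text:",
        simplify_html(html),
    ]
    if ocr_text:
        parts += ["Text recognized in the screenshot (OCR):", ocr_text]
    return "\n".join(parts)


# Hypothetical sample page for illustration only.
sample = (
    "<html><head><title>Login</title><style>p{color:red}</style>"
    "<script>var x = 1;</script></head>"
    "<body><h1>Sign in</h1><p>Enter   your password</p></body></html>"
)
prompt = build_prompt("http://example.com/login", sample, ocr_text="ACME Bank")
print(prompt)
```

Dropping scripts and styles before prompting keeps the input short, and appending the OCR text lets the model see wording that exists only inside images.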