Home
Simple Twitter Scraper Tweepy is a small personal project that I made to better understand web scraping while studying Twitter's API. This version of the scraper searches for specific words, phrases, or hashtags and collects a specified number of tweets older than a specified date.
This code is based on Griffin Leow's project and his article.
Since we are using the Tweepy library, we need to create a Twitter developer account so we can access Twitter's API. You can do that by clicking this link.
After that, create an app from the "apps" tab under your profile picture. This gives you access to the "keys and tokens" section, where you can get the necessary keys and tokens. Then copy them into the consumer_key, consumer_secret, etc. variables.
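As a rough sketch of how those keys are typically handed to Tweepy, the helper below builds an authenticated client. The function name `build_api` and the `access_token` / `access_token_secret` parameter names are assumptions for illustration (the text above only names `consumer_key` and `consumer_secret`), and the exact class names can differ between Tweepy versions.

```python
def build_api(consumer_key, consumer_secret, access_token, access_token_secret):
    """Return an authenticated Tweepy client (hypothetical helper)."""
    # Imported inside the function so this snippet loads even if Tweepy
    # is not installed yet.
    import tweepy

    # OAuth 1a: the consumer pair identifies the app, the access pair
    # identifies the account that authorized it.
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)

    # wait_on_rate_limit asks Tweepy to sleep instead of failing when
    # the rate limit is reached.
    return tweepy.API(auth, wait_on_rate_limit=True)
```

Keeping the keys in variables passed to one function, rather than scattered through the script, makes it easier to avoid committing them by accident.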
search_words: str - Words, phrases, or hashtags that the scraper will search for, separated by "OR". Example:
search_words = '#landmark OR #photo OR python OR github is neat'
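If the terms live in a list, the query string can be assembled instead of typed by hand. This is only a small sketch; `build_search_words` is a hypothetical helper, not part of the scraper.

```python
def build_search_words(terms):
    """Join search terms with ' OR ', skipping empty entries."""
    cleaned = [term.strip() for term in terms if term.strip()]
    return " OR ".join(cleaned)


search_words = build_search_words(["#landmark", "#photo", "python", "github is neat"])
print(search_words)
```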
date_until: str or datetime - The UTC date that bounds the first scraped tweet; every tweet scraped after it will be older. Remember that Twitter's API only allows us to get tweets less than 7 days old. The date must be in 'YYYY-MM-DD' format. You can't specify hours, minutes, or seconds. Example:
date_until = '2020-03-31'
In this case, the creation date of the first scraped tweet will be 2020-03-30 23:59:59 or older.
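The rules for date_until can be checked before calling the API. The sketch below uses only the standard library; both helper names are hypothetical, and treating "more than 7 days before today" as the cutoff is an assumption about where exactly the limit falls.

```python
from datetime import datetime, timedelta, timezone


def normalize_date_until(date_until, today=None):
    """Return date_until as 'YYYY-MM-DD', rejecting dates past the 7-day limit."""
    if isinstance(date_until, datetime):
        day = date_until.date()
    else:
        # Raises ValueError if the string is not in 'YYYY-MM-DD' format.
        day = datetime.strptime(date_until, "%Y-%m-%d").date()
    today = today or datetime.now(timezone.utc).date()
    if day < today - timedelta(days=7):
        raise ValueError("date_until is more than 7 days in the past")
    return day.strftime("%Y-%m-%d")


def newest_possible_timestamp(date_until):
    """Latest creation time a scraped tweet can have for this date_until."""
    midnight = datetime.strptime(date_until, "%Y-%m-%d")
    return midnight - timedelta(seconds=1)


print(newest_possible_timestamp("2020-03-31"))
```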
num_tweets: int - The number of tweets to be analyzed in a run. Remember that Twitter's API only allows us a maximum of 2500 tweets per 15 minutes. Example:
num_tweets = 2500
num_runs: int - The number of runs (calls) that the scraper will perform. This allows us to get more than 2500 tweets in one call of the function. Example:
num_runs = 6
So, if num_tweets is equal to 2500, we will get a total of 6 x 2500 = 15000 tweets in 6 x 15 = 90 minutes.
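The arithmetic and the waiting between runs can be sketched as follows. `estimate` and `run_in_windows` are hypothetical names, `fetch_run` stands in for whatever function collects one run of tweets, and spacing runs 15 minutes apart is an assumption about how the rate-limit window is respected.

```python
import time

WINDOW_SECONDS = 15 * 60  # one rate-limit window


def estimate(num_tweets, num_runs):
    """Return (total tweets, total minutes) for the given settings."""
    return num_tweets * num_runs, num_runs * WINDOW_SECONDS // 60


def run_in_windows(fetch_run, num_runs, sleep=time.sleep):
    """Call fetch_run(run_index) once per window and collect the results."""
    results = []
    for run in range(num_runs):
        started = time.monotonic()
        results.extend(fetch_run(run))
        if run < num_runs - 1:
            # Wait out the rest of the window before the next run.
            elapsed = time.monotonic() - started
            sleep(max(0.0, WINDOW_SECONDS - elapsed))
    return results


print(estimate(2500, 6))
```

Because this sketch skips the wait after the final run, the 90 minutes is an upper bound rather than an exact duration.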
tweet_max_id: int, optional - The ID, minus 1, of the tweet that marks the starting point of the scrape. Since we can't specify the exact hour, minute, and second of the first tweet to be scraped, you can find the ID of a tweet posted at that time and place it here (minus 1) to scrape all tweets older than it. If this is used, date_until is ignored. Example:
tweet_max_id = 1050118621198921728
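The precedence between the two bounds can be pictured with a small sketch. `search_bounds` is a hypothetical helper and the reference tweet ID is made up for illustration; the dictionary keys follow the `until` and `max_id` parameters of Twitter's standard search API.

```python
def search_bounds(date_until=None, tweet_max_id=None):
    """Pick the upper bound for the search; tweet_max_id wins over date_until."""
    if tweet_max_id is not None:
        return {"max_id": int(tweet_max_id)}
    if date_until is not None:
        return {"until": date_until}
    return {}


# Hypothetical ID of a tweet posted at the moment we want to start from.
reference_tweet_id = 1050118621198921729
tweet_max_id = reference_tweet_id - 1  # only tweets older than the reference

print(search_bounds(date_until="2020-03-31", tweet_max_id=tweet_max_id))
```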
You can read about the tweet object and its attributes in this article.
username: str - Screen name of the tweet's user.
acctdesc: str, Nullable - Bio of the tweet's user.
location: str, Nullable - Location of the tweet's user.
following: int - Number of accounts followed by the tweet's user.
followers: int - Number of followers of the tweet's user.
totaltweets: int - Number of tweets (including retweets) of the tweet's user.
usercreatedts: str - UTC creation date of the user's account.
tweetcreatedts: str - UTC creation date of the tweet.
retweetcount: int - Number of times this tweet was retweeted.
hashtags: Array of hashtag objects - Hashtags inside that tweet.
id_int: int - ID of the tweet.
text: str - Text of the tweet (including retweets).
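To make the mapping concrete, here is a sketch of how one tweet could be turned into a row with the columns above. `tweet_to_row` is a hypothetical helper; the attribute names on the right follow the Twitter v1.1 tweet and user objects, and the sample tweet is fabricated with `SimpleNamespace` purely so the snippet runs without calling the API.

```python
from types import SimpleNamespace


def tweet_to_row(status):
    """Map a tweet object to a dict keyed by the output column names."""
    user = status.user
    return {
        "username": user.screen_name,
        "acctdesc": user.description,
        "location": user.location,
        "following": user.friends_count,
        "followers": user.followers_count,
        "totaltweets": user.statuses_count,
        "usercreatedts": str(user.created_at),
        "tweetcreatedts": str(status.created_at),
        "retweetcount": status.retweet_count,
        "hashtags": status.entities["hashtags"],
        "id_int": status.id,
        "text": status.text,
    }


# Made-up stand-in for a tweet returned by the API.
sample = SimpleNamespace(
    id=1, text="hello #photo", created_at="2020-03-30 23:59:59",
    retweet_count=0, entities={"hashtags": [{"text": "photo"}]},
    user=SimpleNamespace(
        screen_name="example", description=None, location=None,
        friends_count=10, followers_count=20, statuses_count=30,
        created_at="2015-01-01 00:00:00",
    ),
)
row = tweet_to_row(sample)
print(list(row))
```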
The output is saved in your workspace as a CSV file named YYYY-MM-DD_hhmmss_scraping_tweets.csv.
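A file name in that pattern can be produced with `strftime`. This is a sketch only: `output_filename` is a hypothetical helper, and using the local clock rather than UTC for the timestamp is an assumption.

```python
from datetime import datetime


def output_filename(now=None):
    """Build a name like 2020-03-31_140509_scraping_tweets.csv."""
    now = now or datetime.now()
    return now.strftime("%Y-%m-%d_%H%M%S") + "_scraping_tweets.csv"


print(output_filename())
```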