Downloader of Twitter data for the Media Analytics Tool project.
git clone [email protected]:lvtffuk/mat-twitter-downloader.git
cd mat-twitter-downloader
npm install
npm start
The app is configured with environment variables.
Variable | Description | Required | Default value |
---|---|---|---|
`TOKENS_FILE_PATH` | The path to the csv file with access tokens. | ✔️ | |
`PROFILES_FILE_PATH` | The path to the csv file with the profile list. | ✔️ | |
`OUT_DIR` | The directory where the output is stored. | ✔️ | |
`WORKERS` | Worker names separated by commas. Possible values: `tweets`, `followers`, `followings`. | ❌ | |
`AFFINITY` | Indicates whether the affinity of followers should be calculated. The `followers` worker must be enabled for affinity. | ❌ | `0` |
`AFFINITY_FOLLOWING_THRESHOLD` | Percentage of common followed users among the analyzed user's followers. | ❌ | `10` |
`CSV_SEPARATOR` | The separator of the input csv files. | ❌ | `;` |
`WORKER_CONCURRENCY` | The number of parallel download runs. | ❌ | `5` |
`CLEAR` | Indicates whether the output directory should be cleared before the run. All downloads then start from scratch. | ❌ | `0` |
`USER_COUNT` | Total count of users in the Twitter segment. | ❌ | `500000` |
`IGNORE_USERS` | Indicates whether the app should download only tweets. DEPRECATED: equivalent to `WORKERS=tweets`. | ❌ | `0` |
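
To make the table concrete, here is a minimal sketch of how these variables could be read and validated in Node.js. The `loadConfig` helper and the shape of the object it returns are hypothetical and are not taken from the app's source. The sketch also assumes that an unset `WORKERS` value enables every worker and that flags are enabled only by the value `1`; the table states neither.

```javascript
// Hypothetical helper: the real app may parse its settings differently.
const VALID_WORKERS = ["tweets", "followers", "followings"];

function loadConfig(env) {
  // The three variables marked as required in the table.
  for (const key of ["TOKENS_FILE_PATH", "PROFILES_FILE_PATH", "OUT_DIR"]) {
    if (!env[key]) {
      throw new Error(`Missing required environment variable ${key}`);
    }
  }

  // Assumption: a flag is enabled only when it is exactly "1".
  const flag = (key) => env[key] === "1";
  // Fall back to the documented default when the value is unset or not a number.
  const int = (key, fallback) => {
    const parsed = Number.parseInt(env[key], 10);
    return Number.isNaN(parsed) ? fallback : parsed;
  };

  // Assumption: an unset WORKERS value means "run every worker".
  let workers = env.WORKERS
    ? env.WORKERS.split(",").map((name) => name.trim()).filter(Boolean)
    : [...VALID_WORKERS];
  // IGNORE_USERS is deprecated and documented as equivalent to WORKERS=tweets.
  if (flag("IGNORE_USERS")) {
    workers = ["tweets"];
  }
  const unknown = workers.filter((name) => !VALID_WORKERS.includes(name));
  if (unknown.length > 0) {
    throw new Error(`Unknown workers: ${unknown.join(", ")}`);
  }

  const affinity = flag("AFFINITY");
  if (affinity && !workers.includes("followers")) {
    throw new Error("AFFINITY requires the followers worker");
  }

  return {
    tokensFilePath: env.TOKENS_FILE_PATH,
    profilesFilePath: env.PROFILES_FILE_PATH,
    outDir: env.OUT_DIR,
    workers,
    affinity,
    affinityFollowingThreshold: int("AFFINITY_FOLLOWING_THRESHOLD", 10),
    csvSeparator: env.CSV_SEPARATOR || ";",
    workerConcurrency: int("WORKER_CONCURRENCY", 5),
    clear: flag("CLEAR"),
    userCount: int("USER_COUNT", 500000),
  };
}
```

Calling `loadConfig(process.env)` at startup would then fail early on a missing path or an inconsistent `AFFINITY` / `WORKERS` combination instead of failing in the middle of a download.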
If affinity is enabled, the followings of followers are also downloaded during the downloading process. This can mean downloading thousands of followed users for a single profile.
Access to the Twitter API requires a review by Twitter. After approval, the projects and apps are accessible in the developer portal.
The tokens must be stored in a `csv` file.
"app";"token"
"appId";"accessToken"
The first line is the header.
The `app` column is the app ID from the developer portal.
The access token can also be retrieved in the project section of the developer portal.
The rate limit for accessing followers / followings is 15 requests per 15 minutes. Large profiles should be downloaded with a large number of valid tokens.
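To get a feel for what this limit means for large profiles, the waiting time can be estimated with simple arithmetic. The sketch below is only an illustration: `estimateWaitMinutes` is a hypothetical helper, the number of users returned per request (`usersPerRequest`) is an assumption that depends on the endpoint used, and the estimate assumes that every token has its own independent limit and that requests are spread evenly across tokens.

```javascript
const REQUESTS_PER_WINDOW = 15; // per token, from the limit stated above
const WINDOW_MINUTES = 15;

// Hypothetical helper: estimates the minutes spent waiting on rate limits.
function estimateWaitMinutes(userCount, tokenCount, usersPerRequest) {
  if (tokenCount < 1 || usersPerRequest < 1) {
    throw new RangeError("tokenCount and usersPerRequest must be at least 1");
  }
  const requests = Math.ceil(userCount / usersPerRequest);
  // Assumption: the limits of the tokens are independent and add up.
  const requestsPerWindow = REQUESTS_PER_WINDOW * tokenCount;
  const windows = Math.ceil(requests / requestsPerWindow);
  // The first window can start immediately, so only later windows add waiting time.
  return Math.max(0, windows - 1) * WINDOW_MINUTES;
}

// 1,000,000 followers at an assumed 5,000 users per request is 200 requests.
console.log(estimateWaitMinutes(1000000, 1, 5000)); // 195
console.log(estimateWaitMinutes(1000000, 5, 5000)); // 30
```

Under these assumptions, going from one token to five cuts the waiting time for a profile with a million followers from over three hours to half an hour.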
The profile list is a simple `csv` file with one column.
"username"
"user1"
"user2"
"user3"
The first line is the header.
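Both input files share the same layout: quoted fields, a single-character separator (`;` by default) and a header row. As a rough illustration of that format, the following sketch parses such a file into objects keyed by the header names. `parseCsv` is a hypothetical helper and not the app's own parser, and it assumes that a literal quote inside a field is escaped by doubling it, which the format description does not specify.

```javascript
// Hypothetical helper: parses quoted, separator-delimited text with a header row.
function parseCsv(text, separator = ";") {
  const rows = [];
  let row = [];
  let field = "";
  let inQuotes = false;

  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    if (inQuotes) {
      if (ch === '"' && text[i + 1] === '"') {
        field += '"'; // assumed escape: a doubled quote inside a quoted field
        i++;
      } else if (ch === '"') {
        inQuotes = false;
      } else {
        field += ch;
      }
    } else if (ch === '"') {
      inQuotes = true;
    } else if (ch === separator) {
      row.push(field);
      field = "";
    } else if (ch === "\n") {
      row.push(field);
      rows.push(row);
      row = [];
      field = "";
    } else if (ch !== "\r") {
      field += ch;
    }
  }
  // Flush the last row when the text does not end with a newline.
  if (field !== "" || row.length > 0) {
    row.push(field);
    rows.push(row);
  }

  // Drop blank lines, then use the first row as the header.
  const nonEmpty = rows.filter((r) => !(r.length === 1 && r[0] === ""));
  if (nonEmpty.length === 0) {
    return [];
  }
  const [header, ...records] = nonEmpty;
  return records.map((record) =>
    Object.fromEntries(header.map((name, idx) => [name, record[idx]]))
  );
}

const profiles = parseCsv('"username"\n"user1"\n"user2"\n"user3"');
console.log(profiles.map((profile) => profile.username)); // [ 'user1', 'user2', 'user3' ]
```

The same function reads the tokens file, where each resulting object has an `app` and a `token` property. A different `CSV_SEPARATOR` value would be passed as the second argument.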
The output is stored in `csv` files in the output directory.
File | Description |
---|---|
`affinities` | Directory containing calculated affinities of analyzed profiles. |
`profiles` | Directory containing info about analyzed profiles. |
`tweets.csv` | Latest tweets of the users. |
`followers.csv` | The followers of the users. |
`followers.nsd.csv` | Calculated normalized social distance for followers. |
`followings.csv` | Followings of the users. |
`followings.nsd.csv` | Calculated normalized social distance for followings. |
`friends.csv` | Users following each other. It is not created if either of the `followers` or `followings` workers is disabled. |
CSV files, except the affinity and NSD files, are saved without headers.
"userId";"username";"tweetId";"tweet";"createdTime"
"userId";"username";"followerId";"followerUsername"
The `profiles` directory contains a `json` file for each profile analyzed in the downloading process.
The `affinities` directory contains `csv` files of affinities for the analyzed profiles. For each user, the `[username].csv` file is created, along with the matrix of affinity users as `[username].matrix.csv`.
In addition, the normalized social distance is calculated for the followers. For each user, the `[username].nsd.csv` file is created.
Normalized social distance (NSD) needs at least two profiles to analyze.
The image is stored in the GitHub Packages registry, and the app can be run in a Docker environment.
docker pull ghcr.io/lvtffuk/mat-twitter-downloader:latest
docker run \
--name=mat-twitter-downloader \
-e 'TOKENS_FILE_PATH=./input/tokens.csv' \
-e 'PROFILES_FILE_PATH=./input/profiles.csv' \
-e 'OUT_DIR=./output' \
-v '/absolute/path/to/output/dir:/usr/src/app/output' \
-v '/absolute/path/to/input/dir:/usr/src/app/input' \
ghcr.io/lvtffuk/mat-twitter-downloader:latest
The volumes must be mounted so that the container can access the input and output data.
This work was supported by the European Regional Development Fund-Project “Creativity and Adaptability as Conditions of the Success of Europe in an Interrelated World” (No. CZ.02.1.01/0.0/0.0/16_019/0000734).