City_Description_Dataset_Generator

This kernel collects data on many cities around the world so that they can be categorized based on their descriptions. The list of cities comes from Simplemaps, as part of their free plan.

Introduction:

In this kernel, a series of decisions was made about how to collect the data on each city. Some cities are filtered out for various reasons, which are discussed in the coming sections. In total, data was collected on 4016 cities.

Loading Source:

The source is a CSV file containing 15,000 cities. The source file is available here; make sure the Python file and the source file are in the same directory when you run the script.

Cleaning and filtering data:

Some cities share the same name but are in different states or countries.
I treated such cities as duplicates because I couldn't automate the WikiTravel link for them.
Cities with the same name are removed based on their population.
For each shared name, the city with the largest population is kept and the rest are dropped.
After removing the duplicate names, only the columns "city_ascii" and "country" are kept in the dataframe; unwanted columns such as latitude and longitude are dropped.
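The steps above can be sketched with pandas roughly as follows. This is a minimal illustration, not the notebook's exact code: the `population` column name, the helper name `keep_most_populous`, and the toy rows are assumptions made for the example.

```python
import pandas as pd


def keep_most_populous(cities: pd.DataFrame) -> pd.DataFrame:
    """For each city name, keep only the row with the largest population."""
    # Sort so that the most populous city of each name comes first.
    # Assumes a numeric "population" column (not confirmed by this README).
    ranked = cities.sort_values("population", ascending=False)
    # drop_duplicates keeps the first row per name, i.e. the most populous one.
    unique = ranked.drop_duplicates(subset="city_ascii", keep="first")
    # Keep only the two columns that the later steps use.
    return unique[["city_ascii", "country"]].reset_index(drop=True)


# Toy rows with made-up population values, only to show the behaviour.
sample = pd.DataFrame({
    "city_ascii": ["Springfield", "Springfield", "Paris", "Paris"],
    "country": ["United States", "United States", "France", "United States"],
    "population": [100, 50, 200, 10],
})
result = keep_most_populous(sample)
print(result)
```

Sorting once and then dropping duplicates avoids a per-name loop, which matters little for 15,000 rows but keeps the code short.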

Fetching Article data of city:

City names can contain special characters such as spaces, so each name is cleaned and a WikiTravel URL is generated for every city with the code below.

city = row['city_ascii'].replace(' ', '_')
# creating a URL with city name
URL = 'https://wikitravel.org/en/' + city
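The same idea can be wrapped in a small function. The sketch below is an assumption-laden variant, not the notebook's code: the name `build_wikitravel_url` is hypothetical, and the percent-encoding step with `urllib.parse.quote` is an addition that may or may not match how WikiTravel names its pages.

```python
from urllib.parse import quote

BASE_URL = "https://wikitravel.org/en/"


def build_wikitravel_url(city_ascii: str) -> str:
    """Build a WikiTravel URL for a city name (hypothetical helper)."""
    # Spaces become underscores, as in the snippet above.
    slug = city_ascii.strip().replace(" ", "_")
    # Percent-encode anything that is not URL-safe (an added precaution).
    return BASE_URL + quote(slug)


url = build_wikitravel_url("New York")
print(url)
```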

Some cities have a dedicated webpage with no information on it, and others have too little information to support an analysis. Such cities are dropped based on the size of the article data fetched. The code below takes care of the filtering.

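# keep only articles long enough to analyse
if len_article > 700:
    new_row = {'City': row['city_ascii'], 'Country': row['country'], 'Description': article_data}
    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    city_descriptions = pd.concat([city_descriptions, pd.DataFrame([new_row])], ignore_index=True)
    counter += 1
    print(counter, row['city_ascii'])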
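As a self-contained illustration of the length filter, the sketch below collects the kept rows in a plain list. It assumes that `len_article` is the number of characters in the fetched text, which this README does not state; the names `has_enough_text` and `collect_descriptions` are hypothetical, and the article texts are placeholders.

```python
MIN_ARTICLE_LENGTH = 700  # threshold used in the snippet above


def has_enough_text(article_data: str, min_length: int = MIN_ARTICLE_LENGTH) -> bool:
    """Return True when the article is longer than the threshold."""
    # Assumption: length is measured in characters.
    return len(article_data) > min_length


def collect_descriptions(articles):
    """Keep (city, country, text) triples whose text passes the filter."""
    rows = []
    for city, country, text in articles:
        if has_enough_text(text):
            rows.append({"City": city, "Country": country, "Description": text})
    return rows


# Placeholder texts, only to exercise the threshold.
articles = [
    ("CityA", "CountryA", "x" * 701),  # kept: longer than 700
    ("CityB", "CountryB", "x" * 700),  # dropped: not longer than 700
    ("CityC", "CountryC", ""),         # dropped: empty page
]
kept = collect_descriptions(articles)
print(len(kept), [row["City"] for row in kept])
```

Building a list of dictionaries and converting it once with `pd.DataFrame(rows)` is usually faster than growing a dataframe row by row.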

Saving Dataset:

The final dataframe consists of 4016 city descriptions. All the data is stored as "city_description.csv" and also as "city_description.h5". Both files are uploaded in this repository.
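Saving both files could look roughly like this. It is a sketch under assumptions: the helper name `save_dataset`, the `write_hdf` flag, and the HDF5 key `"city_descriptions"` are not from the notebook, and `DataFrame.to_hdf` needs the optional PyTables package, so that step is off by default here.

```python
import os
import tempfile

import pandas as pd


def save_dataset(df: pd.DataFrame, directory: str, write_hdf: bool = False) -> str:
    """Write the dataset as CSV (and optionally HDF5); return the CSV path."""
    csv_path = os.path.join(directory, "city_description.csv")
    # index=False keeps the row index out of the CSV file.
    df.to_csv(csv_path, index=False)
    if write_hdf:
        # Requires PyTables; the key name is an assumption for this example.
        h5_path = os.path.join(directory, "city_description.h5")
        df.to_hdf(h5_path, key="city_descriptions", mode="w")
    return csv_path


# Placeholder row, only to show the round trip.
demo = pd.DataFrame(
    [{"City": "CityA", "Country": "CountryA", "Description": "placeholder text"}]
)
with tempfile.TemporaryDirectory() as tmp_dir:
    path = save_dataset(demo, tmp_dir)
    reloaded = pd.read_csv(path)
print(list(reloaded.columns))
```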

End Note:

This dataset is available here under the GPL-3.0 License and on Kaggle. Feel free to use this data to cluster cities based on their descriptions 👍 Happy Learning! 🤘

