Basic statistics

# of names	# of last names	# of first names	# of M	# of F	# of unknown gender
3658109	808	710594	2054134	1509650	94325

Sample image (Pinyin version):

Pinyin Version

Because the original files are large, this repository does not store any Pinyin version of the corpus (CCNC). The three versions of CCNC (Chinese, Pinyin with tones, and Pinyin without tones) are available for download as a single zip file.

Sources

CCNC is compiled from the following two sources:

姓名大全, which provides 2513097 examples (obtained with a scraping script).
中文人名语料库, which provides 1145012 examples.

Things to note:

中文人名语料库 does not differentiate between last names and first names, but CCNC does.
About 300k overlapping examples were found between the two sources and filtered out.
However, if the same name appears with different genders, it is kept as a separate entry for each gender.
All the names with unknown genders come from 中文人名语料库.
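
The filtering rule above can be illustrated with a minimal sketch. It is not the procedure actually used to build CCNC; the record layout (`(name, gender)` tuples), the gender labels `"M"`, `"F"`, `"U"`, the sample names, and the function name `merge_sources` are all assumptions made for illustration.

```python
def merge_sources(*sources):
    """Merge (name, gender) records, dropping exact duplicates.

    A record is a duplicate only when both name and gender match,
    so a name listed as both "M" and "F" is kept once per gender.
    """
    seen = set()
    merged = []
    for source in sources:
        for name, gender in source:
            key = (name, gender)
            if key not in seen:
                seen.add(key)
                merged.append(key)
    return merged


# Hypothetical sample records; "U" stands for unknown gender.
source_a = [("王伟", "M"), ("李娜", "F"), ("张敏", "F")]
source_b = [("王伟", "M"), ("张敏", "M"), ("陈晨", "U")]

merged = merge_sources(source_a, source_b)
for name, gender in merged:
    print(name, gender)
```

In this sketch 王伟 is kept once because both records agree on gender, while 张敏 is kept twice, once as F and once as M.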

Romanized Chinese Last Names Dictionary (i.e., Ch-Last-Names-Dict)

The dictionary collects 1606 Chinese last names along with their Pinyin. Of these, 1534 were scraped from 名霸百家姓; the other 72 rare last names come from the corpus and were annotated with Pinyin by myself. These self-annotated last names include: 滕, 刁, 牧, 欧阳, 徐离, 傲, 宾, 博, 采, 恩, 凡, 格, 冠, 好, 昊, 浩, 荷, 恒, 鸿, 湖, 化, 基, 继, 见, 杰, 静, 菊, 俊, 卡, 科, 奎, 立, 丽, 刘付, 绿, 麦, 曼, 美, 梦, 名, 默, 沐, 娜, 乃, 尼, 日, 如, 润, 若, 上, 升, 桃, 天, 拓, 旺, 未, 溪, 夏候, 湘, 晓, 雄, 雅, 岩, 彦, 艳, 依, 远, 悦, 忠, 珠.
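
Because the dictionary contains both one-character and two-character last names (for example 欧阳 and 刘付), one possible way to use it for separating last names from first names is a longest-prefix match. The sketch below is only an illustration under that assumption and is not necessarily how CCNC was split; the function name `split_name`, the `max_len` parameter, and the tiny sample set of last names are hypothetical.

```python
def split_name(full_name, last_names, max_len=2):
    """Split a full name into (last_name, first_name).

    Tries the longest candidate prefix first, so a compound last name
    such as 欧阳 wins over its single-character prefix 欧. At least one
    character is always left for the first name. Returns (None, full_name)
    when no prefix is found in the dictionary.
    """
    longest = min(max_len, len(full_name) - 1)
    for length in range(longest, 0, -1):
        prefix = full_name[:length]
        if prefix in last_names:
            return prefix, full_name[length:]
    return None, full_name


# Hypothetical subset of a last-name dictionary.
last_names = {"欧", "欧阳", "刘", "刘付", "王"}

print(split_name("欧阳修", last_names))
print(split_name("王伟", last_names))
print(split_name("某某", last_names))
```

A purely dictionary-based split like this cannot tell apart a compound last name from a single-character last name followed by a matching first-name character, so ambiguous cases would still need manual review.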

Train/dev/test set

A simple script is provided to split CCNC into train, dev, and test sets. The default splitting ratio is 6:2:2, and a sample zip file contains the pure Chinese version of CCNC split this way. You can also apply the script to the Pinyin versions and get similar results.
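
For readers who want to reproduce a 6:2:2 split themselves, the idea can be sketched as follows. This is not the repository's script; the function name `split_corpus`, the `seed` parameter, and the placeholder records are assumptions, and the real script may shuffle or round differently.

```python
import random


def split_corpus(records, ratios=(6, 2, 2), seed=0):
    """Shuffle records and split them into train, dev, and test lists.

    `ratios` gives the relative sizes of the three sets. A fixed seed
    makes the shuffle reproducible. Any remainder left over from integer
    division ends up in the test set.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)

    total = sum(ratios)
    n = len(shuffled)
    train_end = n * ratios[0] // total
    dev_end = train_end + n * ratios[1] // total

    return shuffled[:train_end], shuffled[train_end:dev_end], shuffled[dev_end:]


# Placeholder records standing in for lines of the corpus.
records = [f"name_{i}" for i in range(10)]
train, dev, test = split_corpus(records)
print(len(train), len(dev), len(test))
```

The same function works on any list of lines, so it could be applied to the Chinese version or to either Pinyin version.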
