Education-ner-dataset

EduNER is a Chinese named entity recognition dataset for education research.

├── Models
│   ├── BERT-CRF
│   ├── BERT-NER
│   ├── BiLSTM-CRF
│   ├── CLNER
│   ├── Flat-Lattice-Transformer
│   ├── Flert
│   ├── LEBERT
│   ├── LexiconAugmentedNER
│   ├── LGN
│   ├── LR-CNN
│   ├── MECT4CNER
│   ├── SLK-NER
│   └── TENER
└── sample_EduNER

EduNER: the full version of the dataset is coming soon...

The sample_EduNER/ directory contains a sampled version of our dataset.
The related resource paper ✨ is currently under review, so only a sampled version of the dataset has been released. After final proofing, the full version of the EduNER dataset will be made publicly accessible.
A snapshot of entity types

Models

basic

The models/ directory contains recent state-of-the-art (SOTA) models.
LexiconAugmentedNER includes the SoftLexicon+CNN/Transformer/LSTM models.
CLNER includes the CL-KL and CL-L₂ models.

tutorial

Pre-trained embedding

We use Chinese pre-trained character or word embeddings, e.g., ctb.50d, gigaword_chn.all.a2b.bi.ite50, and gigaword_chn.all.a2b.uni.ite50, in line with (Yang et al., 2017). As the pre-trained language model, we use the Chinese BERT model bert-base-chinese.
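As a rough illustration of how such embedding files are typically consumed, the sketch below parses a plain-text embedding file into a lookup table. It assumes the common layout of one token per line followed by its vector components, separated by whitespace. That layout, the function name `load_embeddings`, and the tiny in-memory sample are assumptions for illustration, not part of this repository.

```python
import io


def load_embeddings(handle, dim):
    """Parse 'token v1 ... v_dim' lines into a dict of token -> list of floats.

    Assumes a whitespace-separated text layout; lines that do not have
    exactly dim + 1 fields (blank lines, malformed rows) are skipped.
    """
    vectors = {}
    for line in handle:
        parts = line.split()
        if len(parts) != dim + 1:
            continue
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


# Tiny in-memory stand-in for a real file such as ctb.50d (values are made up).
demo = io.StringIO("教 0.1 0.2 0.3\n育 0.4 0.5 0.6\n\n")
emb = load_embeddings(demo, dim=3)
print(len(emb), emb["教"])
```

For a real file, the same function could be called on `open(path, encoding="utf-8")` with `dim=50`, provided the file follows the assumed layout.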

Hyperparameters

models	epoch	batch size	max length	learning rate	dropout rate	crf learning rate
example	100	10	256	0.001	0.5	X
BiLSTM+CRF
BERT+CRF
LR-CNN
TENER
LGN
FLAT+BERT	100	10	200	0.0006	0.5
SoftLexicon (CNN)
SoftLexicon (Transformer)
SoftLexicon (LSTM)
MECT4CNER	100	10	200	0.0014	0.2
SLK-NER	30	32	256	5e-5	0.5
LEBERT	20	4	256	1e-5	0.1
FLERT	10	4	512	5e-6	0.1
CL-KL	10	1	512	5e-6	0.1
CL-L₂	10	2	512	5e-6	0.1
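To make these settings easier to reuse in scripts, the filled-in rows of the table above could be kept in a small lookup structure. The following is only a sketch: the names `HyperParams`, `HYPERPARAMS`, and `get_hyperparams` are hypothetical, and rows that the table leaves blank are deliberately omitted rather than guessed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HyperParams:
    epochs: int
    batch_size: int
    max_length: int
    learning_rate: float
    dropout: float


# Values copied from the table; models without listed values are not included.
HYPERPARAMS = {
    "FLAT+BERT": HyperParams(100, 10, 200, 0.0006, 0.5),
    "MECT4CNER": HyperParams(100, 10, 200, 0.0014, 0.2),
    "SLK-NER": HyperParams(30, 32, 256, 5e-5, 0.5),
    "LEBERT": HyperParams(20, 4, 256, 1e-5, 0.1),
    "FLERT": HyperParams(10, 4, 512, 5e-6, 0.1),
    "CL-KL": HyperParams(10, 1, 512, 5e-6, 0.1),
    "CL-L₂": HyperParams(10, 2, 512, 5e-6, 0.1),
}


def get_hyperparams(model_name):
    """Return the listed settings, or raise a clear error for unlisted models."""
    try:
        return HYPERPARAMS[model_name]
    except KeyError:
        raise KeyError(f"no hyperparameters listed for {model_name!r}") from None


print(get_hyperparams("LEBERT"))
```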

Online Annotation Platform

We provide a temporary account for testing the annotation tool:

username: edu
password:

Update plan

The EduNER dataset project is a long-term plan. We expect the dataset to cover more languages and disciplines in higher education. Although this goal cannot be achieved in a short time, the dataset will first expand to one or two more disciplines and grow into a larger-scale dataset that can be used in teaching and learning contexts.

The Pedagogic Psychology discipline will be added over the next year (approximately 06.2022 ~ 06.2023).
Policy- and conference-related corpora will be added in the next phase (approximately 08.2022 ~ 01.2023).

Beta application

A beta educational tool (EDUNERScore) based on our dataset is available. The tool is based on NER technology and allows unstructured educational texts to be analyzed in real time. Specifically, it can extract discipline entities from large-scale unstructured texts, e.g., discourse content, online forums, and written documents. This helps stakeholders better understand learning and teaching activities.
Due to limited computing resources, only cached results can be viewed at present. In addition, only the Chinese version is currently available.
Instruction
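
To illustrate the general idea of extracting entities from text with an NER model, the sketch below turns per-character tag predictions into entity spans. It is not the EDUNERScore implementation: it assumes a BIO tagging scheme, and the label `DIS` and the function name `bio_to_spans` are hypothetical placeholders rather than names taken from the dataset.

```python
def bio_to_spans(chars, tags):
    """Collect (entity_text, label) pairs from per-character BIO tags."""
    spans = []
    start, label = None, None
    # A trailing "O" sentinel flushes an entity that ends at the last character.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.append(("".join(chars[start:i]), label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


# Made-up example: "DIS" is a hypothetical label for a discipline entity.
chars = list("心理学和教育学")
tags = ["B-DIS", "I-DIS", "I-DIS", "O", "B-DIS", "I-DIS", "I-DIS"]
spans = bio_to_spans(chars, tags)
print(spans)
```

In this sketch, an `I-` tag that does not continue an open entity of the same label is ignored rather than treated as the start of a new entity. Other decoders make different choices here.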

License

This work is licensed under a Creative Commons Attribution 4.0 International License.

meng-wenlong/EduNER

Folders and files

Latest commit

History

Repository files navigation

Education-ner-dataset

EduNER：the full version dataset is coming soon...

Models

basic

tutorial

Online Annotation Platform

Update plan

Beta application

License

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages