EduNER is a Chinese named entity recognition dataset for education research.
├── Models
│ ├── BERT-CRF
│ ├── BERT-NER
│ ├── BiLSTM-CRF
│ ├── CLNER
│ ├── Flat-Lattice-Transformer
│ ├── Flert
│ ├── LEBERT
│ ├── LexiconAugmentedNER
│ ├── LGN
│ ├── LR-CNN
│ ├── MECT4CNER
│ ├── SLK-NER
│ └── TENER
└── sample_EduNER
The sample_EduNER/ directory contains a sampled version of our dataset.
- The related resource paper ✨ is currently under review, so only a sampled version of the dataset has been released. After final proofing, the full version of the EduNER dataset will be made publicly accessible.
- A snapshot of entity types
The models/ directory contains recent SOTA models.
- LexiconAugmentedNER includes the SoftLexicon+CNN/Transformer/LSTM models.
- CLNER includes the CL-KL and CL-L2 models.
- Pre-trained embeddings
We use Chinese pre-trained character or word embeddings, e.g., ctb.50d, gigaword_chn.all.a2b.bi.ite50, and gigaword_chn.all.a2b.uni.ite50, in line with (Yang et al., 2017). As the pre-trained language model, we use the Chinese BERT model bert-base-chinese.
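As a minimal sketch of how such embedding files might be read, the snippet below assumes a plain-text layout with one token per line followed by its vector values, separated by whitespace. That layout, the function names `parse_embedding_lines` and `load_embeddings`, and the toy vectors are illustrative assumptions, not part of this repository.

```python
from typing import Dict, Iterable, List


def parse_embedding_lines(lines: Iterable[str], dim: int) -> Dict[str, List[float]]:
    """Parse whitespace-separated "token v1 ... v_dim" lines into a lookup table."""
    table: Dict[str, List[float]] = {}
    for line in lines:
        parts = line.rstrip().split()
        # Skip blank lines and any header or malformed row without exactly dim values.
        if len(parts) != dim + 1:
            continue
        table[parts[0]] = [float(x) for x in parts[1:]]
    return table


def load_embeddings(path: str, dim: int = 50) -> Dict[str, List[float]]:
    """Read an embedding file from disk (assumed UTF-8, assumed text format)."""
    with open(path, encoding="utf-8") as handle:
        return parse_embedding_lines(handle, dim)


# Tiny in-memory demo with 3-dimensional toy vectors (not real embedding values).
demo = parse_embedding_lines(["教 0.1 0.2 0.3", "育 0.4 0.5 0.6", "2 3"], dim=3)
print(sorted(demo))
```

For the 50-dimensional files named above, `load_embeddings(path, dim=50)` would be the corresponding call, provided the file really follows the assumed layout.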
- Hyperparameters
Model | Epochs | Batch size | Max length | Learning rate | Dropout rate | CRF learning rate | Embeddings |
---|---|---|---|---|---|---|---|
example | 100 | 10 | 256 | 0.001 | 0.5 | X | |
BiLSTM+CRF | | | | | | | |
BERT+CRF | | | | | | | |
LR-CNN | | | | | | | |
TENER | | | | | | | |
LGN | | | | | | | |
FLAT+BERT | 100 | 10 | 200 | 0.0006 | 0.5 | | |
SoftLexicon (CNN) | | | | | | | |
SoftLexicon (Transformer) | | | | | | | |
SoftLexicon (LSTM) | | | | | | | |
MECT4CNER | 100 | 10 | 200 | 0.0014 | 0.2 | | |
SLK-NER | 30 | 32 | 256 | 5e-5 | 0.5 | | |
LEBERT | 20 | 4 | 256 | 1e-5 | 0.1 | | |
FLERT | 10 | 4 | 512 | 5e-6 | 0.1 | | |
CL-KL | 10 | 1 | 512 | 5e-6 | 0.1 | | |
CL-L2 | 10 | 2 | 512 | 5e-6 | 0.1 | | |
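For scripted experiments, the filled-in rows of the table could be kept in one place as a small configuration mapping. The sketch below is one possible arrangement. The `HyperParams` class and `get_hyperparams` helper are hypothetical names, not part of the model code in this repository, and only values that appear in the table are included.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class HyperParams:
    epochs: int
    batch_size: int
    max_length: int
    learning_rate: float
    dropout: float


# Values copied from the rows of the table that are filled in.
HYPERPARAMS: Dict[str, HyperParams] = {
    "FLAT+BERT": HyperParams(100, 10, 200, 0.0006, 0.5),
    "MECT4CNER": HyperParams(100, 10, 200, 0.0014, 0.2),
    "SLK-NER": HyperParams(30, 32, 256, 5e-5, 0.5),
    "LEBERT": HyperParams(20, 4, 256, 1e-5, 0.1),
    "FLERT": HyperParams(10, 4, 512, 5e-6, 0.1),
    "CL-KL": HyperParams(10, 1, 512, 5e-6, 0.1),
    "CL-L2": HyperParams(10, 2, 512, 5e-6, 0.1),
}


def get_hyperparams(model: str) -> HyperParams:
    """Return the tabulated settings, or raise KeyError for rows left blank."""
    try:
        return HYPERPARAMS[model]
    except KeyError:
        raise KeyError(f"no hyperparameters listed for {model!r}") from None


print(get_hyperparams("LEBERT"))
```

Rows that are blank in the table are deliberately left out of the mapping, so a lookup for them fails loudly rather than returning guessed values.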
- We provide a temporary account for testing the annotation tool:
username: edu
password:
The EduNER dataset project is a long-term plan: we expect the dataset to cover more languages and disciplines in higher education. Although this goal cannot be achieved in a short time, the dataset will first expand to one or two more disciplines and grow into a larger-scale dataset that can be used in teaching and learning contexts.
- The Pedagogic Psychology discipline will be added in the next year (approximately 06.2022 ~ 06.2023).
- Policy- and conference-related corpora will be added in the next phase (approximately 08.2022 ~ 01.2023).
- A beta educational tool (EDUNERScore) based on our dataset is available. The tool is built on NER technology and supports real-time analysis of unstructured educational texts. Specifically, it can extract discipline entities from large-scale unstructured texts, e.g., discourse content, online forums, and written documents, helping stakeholders better understand learning and teaching activities.
- Due to limited computing resources, only cached results can be viewed at present. In addition, only the Chinese version is currently available.
- Instruction
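To illustrate the kind of entity extraction described for the tool above, the sketch below turns character-level tags into entity spans. It assumes a BIO tagging scheme, which is common for Chinese NER but is not confirmed here as the scheme used by EduNER or EDUNERScore. The `decode_bio` function and the `SUBJ` label are hypothetical.

```python
from typing import List, Tuple


def decode_bio(chars: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Collect (entity_text, entity_type) pairs from character-level BIO tags."""
    entities: List[Tuple[str, str]] = []
    current: List[str] = []
    current_type = ""
    for ch, tag in zip(chars, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "B":
            # A new entity starts; close any entity that is still open.
            if current:
                entities.append(("".join(current), current_type))
            current, current_type = [ch], label
        elif prefix == "I" and current and label == current_type:
            current.append(ch)
        else:
            # "O", or an "I-" tag that does not continue the open entity.
            if current:
                entities.append(("".join(current), current_type))
            current, current_type = [], ""
    if current:
        entities.append(("".join(current), current_type))
    return entities


# Hypothetical example: the label "SUBJ" is made up for illustration.
text = list("学习教育心理学")
tags = ["O", "O", "B-SUBJ", "I-SUBJ", "I-SUBJ", "I-SUBJ", "I-SUBJ"]
print(decode_bio(text, tags))
```

In this sketch, a stray `I-` tag with no open entity is dropped rather than promoted to a new entity. This is one of several reasonable conventions.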
This work is licensed under a Creative Commons Attribution 4.0 International License.