meng-wenlong/EduNER

Education-ner-dataset

EduNER is a Chinese named entity recognition dataset for education research.

├── Models
│   ├── BERT-CRF
│   ├── BERT-NER
│   ├── BiLSTM-CRF
│   ├── CLNER
│   ├── Flat-Lattice-Transformer
│   ├── Flert
│   ├── LEBERT
│   ├── LexiconAugmentedNER
│   ├── LGN
│   ├── LR-CNN
│   ├── MECT4CNER
│   ├── SLK-NER
│   └── TENER
└── sample_EduNER

EduNER: the full version of the dataset is coming soon...

  • The sample_EduNER/ directory contains a sampled version of our dataset.
  • The related resource paper ✨ is currently under review, so only a sampled version of the dataset has been released. After final proofing, the full version of the EduNER dataset will be made publicly accessible.
  • A snapshot of entity types: EduNER schema
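
The file layout of sample_EduNER/ is not described here. As a minimal sketch, assuming the common character-level layout in which each character carries a BIO tag, entity spans could be recovered as follows. The function name `bio_to_spans` and the label `COURSE` are hypothetical and are not taken from the EduNER schema.

```python
from typing import List, Tuple


def bio_to_spans(chars: List[str], tags: List[str]) -> List[Tuple[str, str, int, int]]:
    """Collect (text, type, start, end) spans from BIO tags; end is exclusive."""
    spans = []
    start, etype = None, None
    # A trailing "O" sentinel closes an entity that runs to the end of the sentence.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.append(("".join(chars[start:i]), etype, start, i))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


# Hypothetical example sentence and label, for illustration only.
chars = list("学习线性代数")
tags = ["O", "O", "B-COURSE", "I-COURSE", "I-COURSE", "I-COURSE"]
spans = bio_to_spans(chars, tags)
```

If the released files use a different scheme (for example BMES or BIOES), the prefix checks would need to be adapted.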

Models

basic

  • The Models/ directory contains recent SOTA models.
  • LexiconAugmentedNER includes the SoftLexicon+CNN/Transformer/LSTM models.
  • CLNER includes the CL-KL and CL-L2 models.

tutorial

  • Pre-trained embedding

We use Chinese pre-trained character or word embeddings, e.g., ctb.50d, gigaword_chn.all.a2b.bi.ite50, and gigaword_chn.all.a2b.uni.ite50, in line with (Yang et al., 2017). As the pre-trained language model, we use Chinese BERT: bert-base-chinese.
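
Embedding files of this kind are usually distributed as plain text, one token per line followed by its vector components. A minimal sketch of a loader, assuming that whitespace-separated layout; the function name `load_text_embeddings` is hypothetical, and the tiny 3-dimensional sample stands in for a real 50-dimensional file.

```python
from typing import Dict, Iterable, List


def load_text_embeddings(lines: Iterable[str], dim: int = 50) -> Dict[str, List[float]]:
    """Parse 'token v1 v2 ... v_dim' lines into a token -> vector table."""
    table: Dict[str, List[float]] = {}
    for line in lines:
        parts = line.rstrip().split()
        if len(parts) != dim + 1:
            continue  # skip header, blank, or malformed lines
        table[parts[0]] = [float(x) for x in parts[1:]]
    return table


# Illustrative in-memory sample (assumed format, made-up values).
sample = ["教 0.1 0.2 0.3", "育 0.4 0.5 0.6", "not a vector"]
table = load_text_embeddings(sample, dim=3)
```

For a real file, the same function could be fed an open file handle, e.g. `load_text_embeddings(open(path, encoding="utf-8"), dim=50)`, where `path` points to the downloaded embedding file.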

  • Hyperparameters

models                     epoch  batch size  max length  learning rate  dropout rate  crf learning rate  embeddings
example                    100    10          256         0.001          0.5           X
BiLSTM+CRF
BERT+CRF
LR-CNN
TENER
LGN
FLAT+BERT                  100    10          200         0.0006         0.5
SoftLexicon (CNN)
SoftLexicon (Transformer)
SoftLexicon (LSTM)
MECT4CNER                  100    10          200         0.0014         0.2
SLK-NER                    30     32          256         5e-5           0.5
LEBERT                     20     4           256         1e-5           0.1
FLERT                      10     4           512         5e-6           0.1
CL-KL                      10     1           512         5e-6           0.1
CL-L2                      10     2           512         5e-6           0.1
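
The filled-in rows of the hyperparameter listing above can be kept in one place as a small configuration table. This is only a sketch: it assumes the five values in each row map, in order, to epoch, batch size, max length, learning rate, and dropout rate, and the names `HyperParams` and `HYPERPARAMS` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HyperParams:
    epochs: int
    batch_size: int
    max_length: int
    learning_rate: float
    dropout: float


# Values copied from the listing; models without listed values are omitted.
HYPERPARAMS = {
    "FLAT+BERT": HyperParams(100, 10, 200, 0.0006, 0.5),
    "MECT4CNER": HyperParams(100, 10, 200, 0.0014, 0.2),
    "SLK-NER": HyperParams(30, 32, 256, 5e-5, 0.5),
    "LEBERT": HyperParams(20, 4, 256, 1e-5, 0.1),
    "FLERT": HyperParams(10, 4, 512, 5e-6, 0.1),
    "CL-KL": HyperParams(10, 1, 512, 5e-6, 0.1),
    "CL-L2": HyperParams(10, 2, 512, 5e-6, 0.1),
}

cfg = HYPERPARAMS["LEBERT"]
print(f"LEBERT: epochs={cfg.epochs}, batch_size={cfg.batch_size}, lr={cfg.learning_rate}")
```

How these values are passed to each model's training script depends on that model's own code under Models/.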

Online Annotation Platform

username: edu
password: 

Update plan

The EduNER dataset project is a long-term plan. We expect the dataset to cover more languages and disciplines in higher education. Although this goal cannot be achieved in a short time, the dataset will first expand to one or two more disciplines and grow into a larger-scale dataset that can be used in teaching or learning contexts.

  • The Pedagogic Psychology discipline will be added over the next year (approximately 06.2022 ~ 06.2023).
  • Policy- and conference-related corpora will be added in the next phase (approximately 08.2022 ~ 01.2023).

Beta application

  • A beta educational tool (EDUNERScore) based on our dataset can be accessed. The tool is based on NER technology and allows unstructured educational texts to be analyzed in real time. Specifically, it can extract discipline entities from large-scale unstructured texts, e.g., discourse content, online forums, and written documents. This helps stakeholders better understand learning and teaching activities.
  • Due to limited computing resources, only cached results can currently be viewed. In addition, only the Chinese version is available for now.
  • Operating instructions

License

Shield: CC BY 4.0

This work is licensed under a Creative Commons Attribution 4.0 International License.
