gilliankwk/COMP4331-project

COMP4331-project

Load data

pandas
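
A minimal sketch of loading one of the CSV files with pandas. The in-memory sample stands in for log_train.csv, and the column names (enrollment_id, time, event, object) are assumptions taken from the field lists in this README, not a confirmed schema.

```python
import io

import pandas as pd

# Stand-in for log_train.csv; column names are assumed from this README.
SAMPLE_LOG = io.StringIO(
    "enrollment_id,time,event,object\n"
    "1,2014-04-01T10:00:00,video,m1\n"
    "1,2014-04-02T11:30:00,problem,m2\n"
    "2,2014-04-05T09:15:00,discussion,m3\n"
)


def load_log(source):
    """Read a log file (path or file-like) and parse the time column."""
    return pd.read_csv(source, parse_dates=["time"])


log = load_log(SAMPLE_LOG)
print(log.dtypes)
```

With the real data, `load_log("log_train.csv")` would be called with the file path instead of the sample buffer.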

Data pre-processing

  1. handle duplicate values
  2. handle missing values
  3. handle impossible data combinations
  4. handle noisy data (DBSCAN?)
  5. feature processing (e.g. calculate number of dates)
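
Steps 1, 2 and 5 above could be sketched roughly as follows. The column names are assumptions, the toy data is made up for illustration, and "number of dates" is read here as the number of distinct calendar days with activity per enrollment, which may not be what the project finally uses. Steps 3 and 4 depend on project-specific rules and are not covered.

```python
import pandas as pd

# Toy log with one exact duplicate and two rows with missing values.
raw_log = pd.DataFrame({
    "enrollment_id": [1, 1, 1, 2, 2, None],
    "time": pd.to_datetime([
        "2014-04-01 10:00", "2014-04-01 10:00", "2014-04-02 09:00",
        "2014-04-03 08:00", None, "2014-04-04 12:00",
    ]),
    "event": ["video", "video", "problem", "video", "problem", "video"],
})


def clean_log(df):
    """Drop exact duplicate rows and rows missing an id or a timestamp."""
    df = df.drop_duplicates()
    df = df.dropna(subset=["enrollment_id", "time"])
    # The id column became float because of the missing value; restore int.
    return df.astype({"enrollment_id": int}).reset_index(drop=True)


def active_days(df):
    """Count distinct calendar days with activity per enrollment (assumed feature)."""
    days = df.assign(day=df["time"].dt.normalize())
    return days.groupby("enrollment_id")["day"].nunique()


cleaned = clean_log(raw_log)
n_days = active_days(cleaned)
print(n_days)
```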

Data pre-processing techniques

  1. feature selection: selecting a subset of relevant features for use in model construction (https://en.wikipedia.org/wiki/Feature_selection)
  2. data cleaning (handle noisy data)
  3. data integration
  4. data reduction: remove unimportant attributes (why drop them?)

Requirement

No activity in log_train.csv in the past 10 days --> the user may have dropped the course

Data to classifiers

date.csv

  • course id
  • to: if there are no related records in log_train in the 10 days after this date => this is a dropout
  • each line contains the timespan of each course in our log data (both train and test data). The window for calculating dropouts is the 10 days after the last day of that course, e.g., if course C runs from 2014.4.1 to 2014.4.30 in the given data, a user enrolled in course C will be treated as a dropout if he/she leaves no record from 2014.5.1 to 2014.5.10.
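
The dropout rule described above can be sketched with the standard library alone. The function name `is_dropout` and its parameters are hypothetical; the window is taken to be inclusive on both ends, matching the 2014.5.1 to 2014.5.10 example.

```python
from datetime import date, timedelta


def is_dropout(course_end, activity_dates, window_days=10):
    """Return True if no activity date falls in the window after course_end.

    The window runs from the day after course_end through
    course_end + window_days, inclusive on both ends.
    """
    window_start = course_end + timedelta(days=1)
    window_end = course_end + timedelta(days=window_days)
    return not any(window_start <= d <= window_end for d in activity_dates)


course_c_end = date(2014, 4, 30)
print(is_dropout(course_c_end, [date(2014, 4, 29)]))  # activity only before the window
print(is_dropout(course_c_end, [date(2014, 5, 3)]))   # activity inside the window
```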

object.csv

  • course id
  • module id <--> log_train: object
  • category <--> log_train: event
  • children (split out of a single record into separate records [pre-processing])
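
One possible way to separate the children field into one row per child, assuming the children of a module are stored as a single space-separated string (the delimiter and the column names are assumptions, and the sample rows are made up):

```python
import pandas as pd

# Toy stand-in for object.csv.
objects = pd.DataFrame({
    "course_id": ["c1", "c1"],
    "module_id": ["m1", "m2"],
    "category": ["chapter", "video"],
    "children": ["m2 m3", None],
})


def explode_children(df, sep=" "):
    """Turn a delimited children string into one row per (module, child)."""
    out = df.copy()
    out["children"] = out["children"].str.split(sep)
    # Modules without children keep a single row with a missing child.
    return out.explode("children").reset_index(drop=True)


flat = explode_children(objects)
print(flat)
```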

enrollment_train.csv

  • enrollment id
  • username
  • course id <--> object: course id

log_train.csv

  • enrollment id
  • time: only keep activities from the most recent 10 days
  • event (problem, video, discussion) <--> object: category
  • object <--> object: module id, children
  • a user may access an event without being enrolled in that particular course
  • need to further check whether he/she is enrolled in that course (enrollment_train)
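
A rough sketch of the two checks above: keep only log rows that match a known enrollment, then keep only recent activity. "Recent" is interpreted here as the last 10 days before the course end date (the `to` column of date.csv); that interpretation, the column names and the sample rows are all assumptions.

```python
import pandas as pd

enrollment = pd.DataFrame({
    "enrollment_id": [1, 2],
    "username": ["u1", "u2"],
    "course_id": ["c1", "c2"],
})
dates = pd.DataFrame({
    "course_id": ["c1", "c2"],
    "to": pd.to_datetime(["2014-04-30", "2014-04-30"]),
})
activity = pd.DataFrame({
    "enrollment_id": [1, 1, 2, 3],  # enrollment 3 is not in enrollment
    "time": pd.to_datetime(
        ["2014-04-10", "2014-04-28", "2014-04-25", "2014-04-29"]
    ),
    "event": ["video", "problem", "discussion", "video"],
})


def recent_enrolled_activity(log, enrollment, dates, days=10):
    """Keep log rows with a known enrollment inside the last `days` of the course."""
    # The inner join drops log rows whose enrollment id is unknown.
    merged = log.merge(enrollment, on="enrollment_id", how="inner")
    merged = merged.merge(dates[["course_id", "to"]], on="course_id", how="left")
    cutoff = merged["to"] - pd.Timedelta(days=days)
    recent = merged[merged["time"] > cutoff]
    return recent.drop(columns="to").reset_index(drop=True)


recent = recent_enrolled_activity(activity, enrollment, dates)
print(recent)
```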

true_train.csv

  • enrollment id
  • ground truth

Classifiers

  • cross validation
  • boosting
  • report precision for both training & testing sets
  • ensemble method
  • KNN (Anson)
  • SVM (Gillian)
  • random forest (Heidi)
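
The plan above could be prototyped with scikit-learn along these lines. This is only a sketch on synthetic data from `make_classification`, not the project's features or results; the hyperparameters are arbitrary placeholders, the hard-voting ensemble is one possible reading of "ensemble method", and boosting is not shown.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the real feature matrix and dropout labels.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

models = {
    # KNN and SVM are distance-based, so features are scaled first.
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "svm": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}
# Majority vote over the three base models.
models["voting"] = VotingClassifier(estimators=list(models.items()), voting="hard")


def cv_precision(model, X, y, folds=5):
    """Return mean (train, test) precision over k-fold cross validation."""
    result = cross_validate(
        model, X, y, cv=folds, scoring="precision", return_train_score=True
    )
    return result["train_score"].mean(), result["test_score"].mean()


scores = {name: cv_precision(model, X, y) for name, model in models.items()}
for name, (train_p, test_p) in scores.items():
    print(f"{name}: train precision={train_p:.3f}, test precision={test_p:.3f}")
```

Reporting the train and test precision side by side makes overfitting easier to spot, for example when a model scores far higher on the training folds than on the held-out folds.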
