WeSearch_Adaptation_Background


Working out the lay of the land.

Corpus Differences

  • Maria Wolters and Mathias Kirsten. Exploring the Use of Linguistic Features in Domain and Genre Classification. EACL'99

    http://acl.ldc.upenn.edu/E/E99/E99-1019.pdf

  • Barbara Plank, Gertjan van Noord: Effective Measures of Domain Similarity for Parsing. ACL 2011

    http://aclweb.org/anthology//P/P11/P11-1157.pdf

    • Different granularity: for each test article, they find the 'most similar' articles in the training data and re-train. A topic-model-based representation (from MALLET) worked slightly better than a word-based one. Similarity functions tested (a rough sketch of some of these follows after this entry):
      • Kullback-Leibler divergence
      • Jensen-Shannon divergence (smoothed, symmetric approximation of KL divergence)
      • skew divergence
      • cosine
      • euclidean
      • variational (Manhattan)
      • Rényi divergence

    Claim to show that automatic selection beats human labelling, but only over WSJ (i.e. the domain differences are not always obvious). In a domain-adaptation set-up, gold in-domain data does better than automatic selection, up to the amount of in-domain data available, but not by much: their selection works quite well. Very different selections of WSJ articles give similar accuracies.
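A minimal sketch of a few of the divergence measures listed under Plank & van Noord above, assuming each article or corpus is represented as a probability distribution over a shared vocabulary (or topic set). The function names, smoothing choices and toy distributions are illustrative, not the paper's implementation:

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence D(p || q); assumes q > 0 wherever p > 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def jensen_shannon(p, q):
    """Smoothed, symmetric approximation of KL: average KL to the midpoint distribution."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def skew(p, q, alpha=0.99):
    """Skew divergence: KL of p against q mixed with a small amount of p."""
    return kl(p, alpha * q + (1 - alpha) * p)

def cosine(p, q):
    return float(np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q)))

def variational(p, q):
    """Variational (Manhattan / L1) distance."""
    return float(np.sum(np.abs(p - q)))

def renyi(p, q, alpha=0.99):
    """Renyi divergence of order alpha (approaches KL as alpha -> 1)."""
    mask = p > 0
    return float(np.log(np.sum(p[mask] ** alpha * q[mask] ** (1 - alpha))) / (alpha - 1))

# Toy example: two 'corpora' as distributions over a four-word vocabulary.
p = np.array([0.4, 0.3, 0.2, 0.1])
q = np.array([0.1, 0.2, 0.3, 0.4])
for f in (jensen_shannon, skew, cosine, variational, renyi):
    print(f.__name__, f(p, q))
```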

  • Bonnie Webber. Genre distinctions for discourse in the Penn TreeBank http://aclweb.org/anthology//P/P09/P09-1076.pdf

  • Tom Lippincott; Diarmuid Ó Séaghdha; Lin Sun; Anna Korhonen. Exploring variation across biomedical subdomains http://aclweb.org/anthology//C/C10/C10-1078.pdf

  • Sujith Ravi; Kevin Knight; Radu Soricut. Automatic Prediction of Parser Accuracy http://aclweb.org/anthology//D/D08/D08-1093.pdf

  • Satoshi Sekine. 1997. The Domain Dependence of Parsing

    http://aclweb.org/anthology/A/A97/A97-1015.pdf

    • Looked at cross-entropy between Brown genres, using probabilities of PCFG rules; shows fairly wide variation in the frequent subtrees of each genre (a toy sketch of the rule cross-entropy calculation follows this entry).
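A minimal sketch of that kind of comparison: cross-entropy between two genres over PCFG rule probabilities. The rule counts, the add-one smoothing and the normalisation over all rules (rather than per LHS) are illustrative assumptions, not necessarily Sekine's exact setup:

```python
from collections import Counter
import math

def cross_entropy(counts_a, counts_b):
    """H(B, A) = -sum_r P_B(r) * log2 P_A(r), with add-one smoothing on A."""
    rules = set(counts_a) | set(counts_b)
    total_a = sum(counts_a.values()) + len(rules)   # add-one smoothing for unseen rules
    total_b = sum(counts_b.values())
    h = 0.0
    for r in rules:
        p_b = counts_b.get(r, 0) / total_b
        if p_b == 0:
            continue
        p_a = (counts_a.get(r, 0) + 1) / total_a
        h -= p_b * math.log2(p_a)
    return h

# Toy rule counts extracted from parses of two genres.
press = Counter({("S", "NP VP"): 900, ("NP", "DT NN"): 700, ("VP", "VBD NP"): 500})
fiction = Counter({("S", "NP VP"): 800, ("NP", "PRP"): 600, ("VP", "VBD"): 400})
print(cross_entropy(press, fiction))   # how badly a press-trained model predicts fiction rules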
  • Proceedings of the 2010 Workshop on Domain Adaptation for Natural Language Processing

    http://aclweb.org/anthology//W/W10/W10-26.pdf

  • Daniel Gildea. Corpus Variation and Parser Performance

    http://aclweb.org/anthology//W/W01/W01-0521.pdf

    • Train WSJ, test WSJ: 86.35 F1; train WSJ, test Brown: 80.65 F1 (sentences <=40 words). Lexical bigrams take up the largest part of the model, add only 0.5 F1 on WSJ, and don't help Brown at all. The most significant bigrams (as judged by the pruning mechanism) for WSJ are very specific and all in NPs: New York, Stock Exchange, vice president, etc., but for Brown they are very generic: It was, Of course, had been. Pointer to earlier work (Roland & Jurafsky, 1998; Roland et al., 2000) showing verb subcat varies much less in WSJ than in Brown. Conclusion: the standard WSJ task seems to be simplified by its homogeneous style.
  • Douglas Roland and Daniel Jurafsky. 1998. How Verb Subcategorization Frequencies are Affected by Corpus Choice.

    http://aclweb.org/anthology//P/P98/P98-2184.pdf

    • "The probabilistic relation between verbs and their arguments plays an important role in modern statistical parsers and supertaggers..."
  • Adam Kilgarriff. 2001. Comparing Corpora.

    http://www.kilgarriff.co.uk/Publications/2001-K-CompCorpIJCL.pdf

    • Different ways of looking at corpora:
    • which words are characteristic of a text/corpus:
      • chi-squared
      • Mann-Whitney (Wilcoxon) ranks test
      • t-test
      • MI
      • log-likelihood
      • Fisher's exact test (what log-likelihood is approximating?)
      • TF.IDF
    • how similar are two corpora? how homogeneous is a corpus?
      • uses known-similarity corpora to evaluate similarity measures
      • Spearman rank correlation co-efficient
      • chi-squared
      • perplexity

    Low-frequency and high-frequency (closed-class) words should generally be treated separately, since they have very different statistical properties (a toy log-likelihood keyword sketch follows this entry).
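A minimal sketch of one of the "characteristic words" statistics above: the two-term log-likelihood (G²) form commonly used for keyword comparison between two corpora. The word counts and corpus sizes below are made up for illustration:

```python
import math

def log_likelihood(freq_a, size_a, freq_b, size_b):
    """Two-term G2 keyword statistic: how unevenly a word is spread across two corpora."""
    expected_a = size_a * (freq_a + freq_b) / (size_a + size_b)
    expected_b = size_b * (freq_a + freq_b) / (size_a + size_b)
    ll = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:
            ll += observed * math.log(observed / expected)
    return 2 * ll

# Toy counts: 'said' in 1M words of newswire vs 1M words of fiction.
print(log_likelihood(2500, 1000000, 4100, 1000000))
```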

UGC

  • Baldwin et al. How Noisy Social Media Text, How Diffrnt Social Media Sources?

    http://aclweb.org/anthology//I/I13/I13-1041.pdf

    • Uses forum, blog and wiki text, as in the WDC, plus Twitter, comments and the BNC. All preprocessed with TweetNLP, which is probably not ideal for the non-Twitter data. The language mix is not so relevant to us, except as a footnote. OOV % is measured against aspell, but not against training data; we could do both (a rough OOV sketch follows this entry). Text normalisation (learning standard forms and replacing them) helped with OOV in Twitter and comments, but not much in the other text types. Grammaticality is tested by parsing with the ERG, looking at unparsed sentences, root conditions and full vs fragment analyses; the unparsed rate seems much too high. Beauty-and-the-Beast-style analysis of 100 unparseable sentences per corpus; says 59% of the 26% unparsed from Wiki are caused by grammar gaps? Chi-squared and language models for intra- and inter-corpus similarity.
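A rough sketch of the two OOV measures mentioned above: token-level OOV against a dictionary word list (standing in for aspell) and against a parser-training vocabulary. The file names and the crude tokeniser are placeholder assumptions, not the paper's pipeline:

```python
import re

def tokenize(text):
    """Very crude word tokeniser; a real pipeline would use TweetNLP or similar."""
    return re.findall(r"[a-z']+", text.lower())

def oov_rate(tokens, vocab):
    """Fraction of tokens not covered by the given vocabulary."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t not in vocab) / len(tokens)

# Placeholder file names: a dictionary dump, parser training text, and the target corpus.
dictionary = set(tokenize(open("wordlist.txt").read()))
train_vocab = set(tokenize(open("train_corpus.txt").read()))
tweets = tokenize(open("tweets.txt").read())

print("OOV vs dictionary:   ", oov_rate(tweets, dictionary))
print("OOV vs training data:", oov_rate(tweets, train_vocab))
```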
  • Jennifer Foster; Ozlem Cetinoglu; Joachim Wagner; Joseph Le Roux; Joakim Nivre; Deirdre Hogan; Josef van Genabith. From News to Comment: Resources and Benchmarks for Parsing the Language of Web 2.0 http://aclweb.org/anthology//I/I11/I11-1100.pdf

  • Jennifer Foster, Ozlem Cetinoglu, Joachim Wagner and Josef van Genabith, 2011. Comparing the Use of Edited and Unedited Text in Parser Self-Training http://aclweb.org/anthology//W/W11/W11-2925.pdf

Domain Adaptation
