Data Collection

This part deals with collecting data from multiple sources. The data may come as processed parallel pairs or in raw format. At a high level, let us divide the process into three parts:

  1. Continuous data collectors or Active Agents
  2. One-time/Batch data collectors
  3. Community-generated corpus

1. Active Agents/Bots

Agents are bots that run continuously (24x7), or for a specific period of time, to find prospective parallel pairs. There can be many types of agents:

  1. News website crawlers
    : Collecting and matching the headlines/content of news websites across English (En) and Odia (Or)
  2. Localized website crawlers
    : Websites which are translated into English as well as Odia
  3. Social media Tweets/Posts across Twitter/Facebook
    : People posting Tweets/Posts on social media in multiple languages

Process

It may consist of the following steps:

  1. Get a list of prospective
    1. News websites
    2. Localized websites
    3. Social media accounts
  2. Crawl through the content from the most recent posts backwards to older ones.
  3. Detect Odia text
    We can write an Odia matra/script detector (see the sketch after this list)
  4. Detect the parallel English text. This can be achieved by:
    1. Matching dictionary-based words
    2. Followed by sentences
    3. Transliteration can also help here, e.g. to match proper nouns.
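
A minimal sketch of the Odia detector in step 3 follows, assuming a character-level check is enough: a line is treated as Odia when most of its characters fall in the Odia Unicode block (U+0B00–U+0B7F). The function names and the 0.5 threshold are illustrative choices, not part of the original design.

```python
import re

# Odia (Oriya) Unicode block: U+0B00 to U+0B7F
ODIA_PATTERN = re.compile(r"[\u0B00-\u0B7F]")


def odia_ratio(text: str) -> float:
    """Fraction of non-space characters that fall in the Odia Unicode block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    odia_chars = sum(1 for c in chars if ODIA_PATTERN.match(c))
    return odia_chars / len(chars)


def is_odia(text: str, threshold: float = 0.5) -> bool:
    """Treat a line as Odia if at least `threshold` of its characters are Odia."""
    return odia_ratio(text) >= threshold


if __name__ == "__main__":
    print(is_odia("ଓଡ଼ିଆ ଭାଷା"))       # True
    print(is_odia("This is English"))  # False
```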

2. Batch data collectors

These are the cases where we get the data from a specific location that needs to be accessed only for a short period, until the entire data set is collected. This may be a one-time activity, or a source with a long data-refresh period. For example:

  1. Wikipedia dumps
    : Wikipedia generates En-Or paragraph-aligned dumps every week.
  2. Open-sourced repositories/papers describing where the data can be collected from

Process

Detecting the data set is a completely manual process; there is no need to automate it. For the Wikipedia dump, we may run an agent on a weekly schedule to fetch only the incremental data.
The flow will be:

  1. Manually detect the data set.
  2. Write a data collector to get the data from the location, or manually download the data set from there (a minimal sketch follows below).
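
Below is a minimal sketch of such a weekly collector, assuming a cron-style schedule and a placeholder dump URL; the URL, file naming, and file format (`.tmx.gz`) are assumptions to be replaced with the real source details.

```python
import os
import urllib.request
from datetime import date

# Hypothetical dump location -- replace with the actual En-Or dump URL.
DUMP_URL = "https://dumps.example.org/en-or/latest.tmx.gz"
DATA_DIR = "data/wikipedia_dumps"


def fetch_weekly_dump() -> str:
    """Download the latest dump, skipping the download if this week's file already exists."""
    os.makedirs(DATA_DIR, exist_ok=True)
    year, week, _ = date.today().isocalendar()
    local_path = os.path.join(DATA_DIR, f"en-or_{year}-W{week:02d}.tmx.gz")
    if os.path.exists(local_path):
        print(f"Already fetched this week: {local_path}")
        return local_path
    urllib.request.urlretrieve(DUMP_URL, local_path)
    print(f"Downloaded {local_path}")
    return local_path


if __name__ == "__main__":
    # Run via cron (e.g. `0 3 * * 1 python fetch_dump.py`) for a weekly schedule.
    fetch_weekly_dump()
```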

That's it. The data collector's work is done.

3. Community generated corpus

Apart from the above two methods of collecting data, we will need help from the community to prepare data manually. We may need data in the following categories:

  1. Parallel pairs
  2. Parallel pairs with POS tags
  3. Parallel pairs with Named Entity Recognition (NER) data
  4. Parallel pairs with domain classification

These are not distinct categories: parallel pairs are mandatory, while the POS, NER, and domain annotations are optional. If the community can provide pairs with all of POS, NER, and domain information, that data will be treated as pure gold.
The format for collecting the POS, NER, and domain information is yet to be decided.
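
Purely as an illustration (the final schema is still open), each contributed pair could be stored as one JSON object per line, with the optional annotations left empty when not provided. The field names and example values below are assumptions, not a decided format.

```python
import json

# Illustrative record for a single contributed pair -- the real schema is yet to be decided.
record = {
    "en": "Odisha is a state in eastern India.",
    "or": "ଓଡ଼ିଶା ପୂର୍ବ ଭାରତର ଏକ ରାଜ୍ୟ।",        # mandatory parallel pair
    "pos": None,                                  # optional: list of (token, tag) pairs
    "ner": [{"text": "Odisha", "label": "LOC"}],  # optional: named entities
    "domain": "geography",                        # optional: domain classification
}

# One JSON object per line (JSONL) keeps the corpus easy to append to and stream.
with open("community_pairs.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
```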