Data Collection

This part deals with collecting data from multiple sources. The data may come as processed parallel pairs or in raw format. At a high level, let us divide the process into three parts:

  1. Continuous data collectors or Active Agents
  2. One-time/Batch data collectors
  3. Community-generated corpus

1. Active Agents/Bots

Agents are bots that run continuously (24x7), or for a specific period of time, to find prospective parallel pairs. There can be many types of agents:

  1. News website crawlers
    : Collecting and matching the headlines/content of news websites across English (En) and Odia (Or)
  2. Localized website crawlers
    : Websites which are translated into English as well as Odia
  3. Social media Tweets/Posts across Twitter/Facebook
    : People posting Tweets/Posts on social media in multiple languages

Process

It may consist of the following steps:

  1. Get a list of prospective
    1. News websites
    2. Localized websites
    3. Social media accounts
  2. Crawl through the content from the most recent posts backwards to older ones.
  3. Detect Odia text
    We can write an Odia matra/script detector (see the sketch after this list)
  4. Detect the parallel English text. This can be achieved by:
    1. Matching dictionary-based words
    2. Followed by sentences
    3. Transliteration can also help here, e.g. to match proper nouns.
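
A minimal sketch of the Odia detector in step 3 follows, assuming a character-level check is enough: a line is treated as Odia when most of its characters fall in the Odia Unicode block (U+0B00–U+0B7F). The function names and the 0.5 threshold are illustrative choices, not part of the original design.

```python
import re

# Odia (Oriya) Unicode block: U+0B00 to U+0B7F
ODIA_PATTERN = re.compile(r"[\u0B00-\u0B7F]")


def odia_ratio(text: str) -> float:
    """Fraction of non-space characters that fall in the Odia Unicode block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    odia_chars = sum(1 for c in chars if ODIA_PATTERN.match(c))
    return odia_chars / len(chars)


def is_odia(text: str, threshold: float = 0.5) -> bool:
    """Treat a line as Odia if at least `threshold` of its characters are Odia."""
    return odia_ratio(text) >= threshold


if __name__ == "__main__":
    print(is_odia("ଓଡ଼ିଆ ଭାଷା"))       # True
    print(is_odia("This is English"))  # False
```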

2. Batch data collectors

These are the cases where we get the data from a specific location that needs to be accessed only for a short period, until the entire data set is collected. This may be a one-time activity, or a source with a long data-refresh period. For example:

  1. Wikipedia dumps
    : Wikipedia generates En-Or paragraph-aligned dumps every week.
  2. Open-sourced repositories/papers describing where the data can be collected from

Process

Detecting the data set is a completely manual process; there is no need to automate it. For the Wikipedia dump, we may run an agent on a weekly schedule to fetch only the incremental data.
The flow will be:

  1. Manually detect the data set.
  2. Write a data collector to get the data from the location, or manually download the data set from there (a minimal sketch follows below).
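
Below is a minimal sketch of such a weekly collector, assuming a cron-style schedule and a placeholder dump URL; the URL, file naming, and file format (`.tmx.gz`) are assumptions to be replaced with the real source details.

```python
import os
import urllib.request
from datetime import date

# Hypothetical dump location -- replace with the actual En-Or dump URL.
DUMP_URL = "https://dumps.example.org/en-or/latest.tmx.gz"
DATA_DIR = "data/wikipedia_dumps"


def fetch_weekly_dump() -> str:
    """Download the latest dump, skipping the download if this week's file already exists."""
    os.makedirs(DATA_DIR, exist_ok=True)
    year, week, _ = date.today().isocalendar()
    local_path = os.path.join(DATA_DIR, f"en-or_{year}-W{week:02d}.tmx.gz")
    if os.path.exists(local_path):
        print(f"Already fetched this week: {local_path}")
        return local_path
    urllib.request.urlretrieve(DUMP_URL, local_path)
    print(f"Downloaded {local_path}")
    return local_path


if __name__ == "__main__":
    # Run via cron (e.g. `0 3 * * 1 python fetch_dump.py`) for a weekly schedule.
    fetch_weekly_dump()
```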

That's it. The data collector's work is done.

3. Community generated corpus

Apart from the above two methods of collecting data, we will need help from the community to prepare data manually. We may need data in the following categories:

  1. Parallel pairs
  2. Parallel pairs with POS tags
  3. Parallel pairs with Named Entity Recognition (NER) data
  4. Parallel pairs with domain classification

These are not distinct categories: parallel pairs are mandatory, while the POS, NER, and domain annotations are optional. If the community can provide pairs with all of POS, NER, and domain information, that data will be treated as pure gold.
The format for collecting the POS, NER, and domain information is yet to be decided.
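
Purely as an illustration (the final schema is still open), each contributed pair could be stored as one JSON object per line, with the optional annotations left empty when not provided. The field names and example values below are assumptions, not a decided format.

```python
import json

# Illustrative record for a single contributed pair -- the real schema is yet to be decided.
record = {
    "en": "Odisha is a state in eastern India.",
    "or": "ଓଡ଼ିଶା ପୂର୍ବ ଭାରତର ଏକ ରାଜ୍ୟ।",        # mandatory parallel pair
    "pos": None,                                  # optional: list of (token, tag) pairs
    "ner": [{"text": "Odisha", "label": "LOC"}],  # optional: named entities
    "domain": "geography",                        # optional: domain classification
}

# One JSON object per line (JSONL) keeps the corpus easy to append to and stream.
with open("community_pairs.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
```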