
NoraLucene

JohanEvensberget edited this page Sep 8, 2009 · 8 revisions

Overview

This page describes the current state of the Nora pipeline.

Everything is deployed to /logon/johanbev/wescience0. Everyone on ps.titan should have full access to this directory, which also makes it a potential security risk. Please help fix this.

This setup uses Sun's Java 1.6, which is stored under my root directory on ps. The needed jars are in /logon/johanbev/jars.

We have built a "Hello World" app that extracts text from one PDF file and indexes it with Lucene. The app resides at /logon/johanbev/wescience0/Luctest. It is used to check that the environment on ps works for our project. We have provided a run.sh script which invokes java with the right parameters to start the app.

Integration with the rest of Nora

This has not been discussed in great detail, but johanbev suggests decoupling the text extraction, correction, and PDF handling from the indexing proper. This gives everyone clean interfaces to program against and makes us less reliant on day-to-day communication with the Lucene team. For example, the extractor could build a directory tree of txt files, each annotated with the agreed-upon fields in its header. Later, the Lucene indexer/scraper reads this tree and builds its index.
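
As a rough illustration of such an interface, here is a minimal sketch of one possible file format. Everything in it is an assumption, since nothing has been agreed yet: the class name ExtractedDocument, the "name: value" header syntax, the blank line separating header from body, and the example field name "title" are all hypothetical. The extractor side would call format and write the result to a txt file; the indexer side would read the file and call parse before handing the fields to Lucene.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical interchange format (not an agreed Nora convention):
// "name: value" header lines, one blank line, then the extracted body text.
final class ExtractedDocument {
    final Map<String, String> fields;
    final String body;

    ExtractedDocument(Map<String, String> fields, String body) {
        this.fields = fields;
        this.body = body;
    }

    // Extractor side: serialize the header fields and the body into the
    // content of one txt file. Field values are assumed to be single-line.
    static String format(Map<String, String> fields, String body) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : fields.entrySet()) {
            sb.append(e.getKey()).append(": ").append(e.getValue()).append('\n');
        }
        sb.append('\n').append(body);
        return sb.toString();
    }

    // Indexer side: read header lines up to the first blank line,
    // then treat everything after it as the body.
    static ExtractedDocument parse(String content) {
        Map<String, String> fields = new LinkedHashMap<String, String>();
        int pos = 0;
        while (pos < content.length()) {
            int eol = content.indexOf('\n', pos);
            if (eol < 0) {
                eol = content.length();
            }
            String line = content.substring(pos, eol);
            pos = Math.min(eol + 1, content.length());
            if (line.isEmpty()) {
                break; // the blank line ends the header
            }
            int colon = line.indexOf(':');
            if (colon > 0) {
                fields.put(line.substring(0, colon).trim(),
                           line.substring(colon + 1).trim());
            }
        }
        return new ExtractedDocument(fields, content.substring(pos));
    }
}
```

Keeping the format as plain text means either side can be tested or replaced without the other, and the files can be inspected by hand when something looks wrong in the index.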
