Document Harvesting Process

Andy Jackson edited this page Aug 10, 2015 · 3 revisions
  1. Login with your personal credentials
    1. Email address
    2. Password: DDHAPT is now making use of the W3ACT production database. This means that if you have a W3ACT account, you should log in to DDHAPT with the same credentials.
      1. NOTE: If you do not have a W3ACT account, your default password may be “secret”. If this does not work either, contact the W3ACT SysAdmin.
    3. Browser full-screen mode might be helpful (F11 key)
    4. You can see which identity you are logged in with in the upper right-hand corner
    5. You can change your password by selecting the “Users & Agencies” drop-down menu item “Change Password”
    6. If your home page (where you start when you log in or when you click on the W3ACT logo in the upper left corner) is not your personal “Document Harvesting” home page, ask an Archivist or SysAdmin to select “show DDHAPT start page” in your user profile.
  2. Create a new Watched Target
    1. From the “home” screen, enter a URL in the Target URL Field [User Story 53]
    2. Example: http://www.open.ac.uk/business-school/research/quarterly-survey/reports
    3. A search on this URL is carried out automatically and the results are listed on a new page.
      1. If this is really a new Target, there should be no results, in which case, click “Add”
      2. If the Target already exists, it should be displayed in the results list. In this case, you should not attempt to create a new Target, but rather, convert the existing Target to a Watched Target.
        1. Such a conversion is only possible if you are the Selector (owner) of that Target. If you are not the Selector, you must request the System Administrator to make you the Selector.
  3. Edit Watched Target Metadata
    1. Be sure you have selected the “Edit” tab rather than the “View” tab.
      1. NOTE: New Targets now switch automatically to the Edit mode
    2. On the Overview tab
      1. Define a Title (required) - Example: "The Open University Business School"
        1. NOTE: This is the title of the Watched Target. The Journal Title will be defined in a later step
        2. IMPORTANT RELEASE 3 CHANGE: related specifically to gov.uk Targets – the Title of your Watched Target should be the same or similar to the Title on the landing page of the Documents of interest. This is so that it is possible to link gov.uk Documents (which are all served from the same root URL) to their associated Watched Targets.
      2. NEW: It is possible to add additional Seed URLs to a given target. This may be necessary in the case where the published PDFs are hosted on a different domain than the entry pages of the site.
        1. Add additional URLs to the “Seed URL(s)” field, separating them with a comma
        2. NOTE: The “Live Crawl” function will only crawl the first URL on this list.
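
As a rough illustration of the seed list convention above, the following Python sketch splits a comma-separated “Seed URL(s)” value and picks out the one URL a Live Crawl would visit. The function names and the second URL are hypothetical, and this is not taken from the W3ACT code base.

```python
from typing import List, Optional


def parse_seed_urls(field_value: str) -> List[str]:
    """Split a comma-separated Seed URL(s) value, dropping stray whitespace."""
    return [part.strip() for part in field_value.split(",") if part.strip()]


def live_crawl_seed(field_value: str) -> Optional[str]:
    """Return the first seed URL, which is the only one a Live Crawl uses."""
    seeds = parse_seed_urls(field_value)
    return seeds[0] if seeds else None


# Hypothetical field value: the second URL stands in for a separate PDF host.
FIELD = (
    "http://www.open.ac.uk/business-school/research/quarterly-survey/reports, "
    "http://files.example.org/publications"
)
print(parse_seed_urls(FIELD))
print(live_crawl_seed(FIELD))
```
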
    3. On the Metadata tab
      1. Give a description (optional) - Example "The Open University Business School provides flexible, high quality management and professional development by distance learning across the globe."
      2. Select a Browse Subject (optional) - Example "Business, Economy & Industry" [User Story 5]
        1. IMPORTANT RELEASE 3 CHANGE: The Browse Subject is NOT the subject that is exported to Aleph; that is the FAST Subject, which is set under the Watched Target tab.
      3. Set Flags (optional) - Under Flag(s) you have the opportunity to manually set QA flags for the Target in question by selecting the appropriate check boxes. The two flags presently relevant for DDHAPT are the “Missing PDF” and the “No Documents Found” flags. [User Story 42, 44]
        1. NOTE: New QA flags can be defined by a System Administrator if required.
        2. NOTE: The “No Documents Found” flag will be set automatically if a Live Crawl or an overnight crawl produces no new Documents. This is to be expected during the test phase, when crawls are performed more often than new Documents are published, but in the production phase it should flag a serious problem (e.g. either the crawl frequency is incorrect, or expected Documents have not been discovered by the crawl process).
    4. On the Crawl policy and Schedule tab
      1. Select the crawl Scope – “All URLs that match this host or any subdomains”
        1. NOTE: any other selection may result in missing Documents; publishers often provide Documents under a (perhaps common) URL path that is different from that of the Landing Page or Watched Target URL.
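
To show why the host-level scope matters, here is a minimal Python sketch of a “this host or any subdomains” test. The matching rule is an assumption about what the option means, not a description of the crawler's actual logic, and the function name is hypothetical.

```python
from urllib.parse import urlsplit


def in_host_scope(target_url: str, candidate_url: str) -> bool:
    """True if candidate_url is on the Target's host or one of its subdomains.

    Assumed rule: exact host match, or the candidate host ends with
    "." + target host. The URL path is deliberately not compared.
    """
    target_host = (urlsplit(target_url).hostname or "").lower()
    candidate_host = (urlsplit(candidate_url).hostname or "").lower()
    if not target_host or not candidate_host:
        return False
    return candidate_host == target_host or candidate_host.endswith("." + target_host)


TARGET = "http://www.open.ac.uk/business-school/research/quarterly-survey/reports"
# Same host, different path: still in scope under a host-level rule.
PDF_AREA = "http://www.open.ac.uk/business-school/sites/www.open.ac.uk.business-school/files/files/publications/download"
print(in_host_scope(TARGET, PDF_AREA))
```

A path-based scope would reject the second URL because it does not share the Target's path, which is the situation the note above warns about.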
      2. Select a Crawl Frequency - Example "Quarterly"
        1. NOTE: in the testing phase, you might want to select “Daily”
      3. Select a Start date
        1. This now automatically defaults to today’s date.
        2. Clicking on the “Start date” field will bring up a calendar widget. Selecting today’s date means that the first crawl will take place tonight at midnight, with a frequency of “Crawl Frequency” thereafter.
        3. If the Start date is undefined, then the default will be the first day of the next month, so new Documents would not appear until then!
      4. OPTIONAL: Select an End date
        1. Presently, the way to deactivate a Target (that is, stop it from being crawled) is to set the End date (to a date in the past).
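
The date rules above can be sketched in Python as follows. The function names are hypothetical, and treating an End date equal to today as still active is an assumption; the text only says that an End date in the past deactivates the Target.

```python
from datetime import date
from typing import Optional


def effective_start_date(start: Optional[date], today: date) -> date:
    """Return the chosen Start date, or the first day of next month if unset."""
    if start is not None:
        return start
    if today.month == 12:
        return date(today.year + 1, 1, 1)
    return date(today.year, today.month + 1, 1)


def is_active(end: Optional[date], today: date) -> bool:
    """A Target with an End date in the past is treated as deactivated.

    Assumption: an End date of today still counts as active.
    """
    return end is None or end >= today


today = date(2015, 8, 10)
print(effective_start_date(None, today))   # Start date left undefined
print(is_active(date(2015, 8, 9), today))  # End date set in the past
```
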
    5. On the NPLD scope tab
      1. Note that the W3ACT application will automatically detect whether a Target is in scope for an NPLD crawl. This is shown in the sub-tab heading as either “NPLD scope (+)” or “NPLD scope (-)” [User Story 44]
      2. It is also possible to define a Target to be in scope directly by providing information regarding a local postal address or by direct correspondence with the publisher [User Story 35]
    6. On the UKWA Licensing tab
      1. Here it is possible for an Archivist or SysAdmin to select a permissions-based license for the Target [User Story 35, 37].
        1. If a permissions-based license for the Target in question (or a higher-level domain associated with the Target) already exists, the user will be informed of this.
        2. NOTE: Permissions-based license requests are relevant for W3ACT, but not for DDHAPT.
    7. On the Watched Target tab
      1. Select option "watched"
        1. WARNING: if a Target is already marked as “watched,” and the user de-selects the “watched” option and clicks “Save,” ALL DOCUMENTS ASSOCIATED WITH THE WATCHED TARGET WILL BE DELETED!
      2. Define the Document URL Scheme (NEW: this step is OPTIONAL! As this is one of the most common sources of crawler error, only Expert Users should make use of this feature.)
        1. This is a kind of white-list for extracting PDFs from the source that should help to avoid harvesting too many uninteresting files
        2. Example: www.open.ac.uk/business-school/sites/www.open.ac.uk.business-school/files/files/publications/download
          1. NOTE: do NOT include the "http://" leader in this filter, and be sure there are no trailing spaces in the filter
          2. Refer to Section 7 below for more examples.
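
Before saving a Document URL Scheme, it can help to check the filter against a known PDF link. The Python sketch below strips the "http://" leader and surrounding spaces and then applies a simple prefix match. The prefix-match rule is an assumption about how the white-list behaves, and the function names and the PDF file name are hypothetical.

```python
def normalise_scheme_filter(raw: str) -> str:
    """Remove surrounding spaces and a leading http:// or https://."""
    cleaned = raw.strip()
    for leader in ("http://", "https://"):
        if cleaned.lower().startswith(leader):
            cleaned = cleaned[len(leader):]
            break
    return cleaned


def matches_scheme(document_url: str, scheme_filter: str) -> bool:
    """Assumed rule: a document matches if its URL starts with the filter."""
    return normalise_scheme_filter(document_url).startswith(
        normalise_scheme_filter(scheme_filter)
    )


SCHEME = "www.open.ac.uk/business-school/sites/www.open.ac.uk.business-school/files/files/publications/download"
# Hypothetical PDF link under the white-listed path.
PDF_URL = "http://" + SCHEME + "/example-report.pdf"
print(matches_scheme(PDF_URL, SCHEME))
```
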
      3. Select a FAST Subject (optional) - Example "Business" [User Story 5]
    8. Click “Save”
      1. NOTE: if the Target URL is a duplicate in the system, the application will not allow you to define this as a Watched Target.
      2. NEW: Behind Barrier Sites
        1. Select the “Watched Target” sub-tab
        2. Click on “Add Login Credentials”
        3. Enter the Login URL for the behind-barrier site
          1. This is the page where the crawler goes to enter the login credentials
          2. Refer to Section 8 below for examples
        4. Enter the Logout URL for the behind-barrier site
          1. This is the link the crawler should use to leave the site after the crawl is completed.
          2. Refer to Section 8 below for examples
        5. Enter the Username for the site
        6. Enter the Password for the site
        7. Click “Save”
          1. NOTE: This process uses the BL “Secret Server” in the background, which is a high-security password management system. For this reason, you will not be able to see the password that you have entered. If the crawl fails because of a wrong username and password combination, you must store the credentials a second time, overwriting the previous ones.
  4. Add a Journal Title
    1. From the Targets: View: Watched Target tab, click “New Journal Title”
    2. Edit Journal Title Metadata
      1. Journal Title (required) - Example "Quarterly Survey of Small Business"
      2. ISSN (optional) – Example “2046-7990”
      3. Frequency (optional) - Example "Quarterly"
      4. Publisher Name (required) - Example "Open University Business School"
      5. BL Collection Subset (optional) – Example “Management and Business Studies”
        1. NOTE: multiple selections are possible
      6. Subject (optional) – Example: “Business planning”
        1. Multiple selections are possible.
        2. NOTE: These are selections from the FAST Subject hierarchy and are distinct from the “Browse Subjects” defined for Targets
        3. NOTE: Subjects are initially inherited from the associated Watched Target, but these can always be manually altered or extended.
    3. Click Save
      1. NOTE: It is not possible to save a Journal Title with a Title or ISSN number identical to an existing Journal Title
    4. If you click on the Target Name in the “Targets > [Target Name]” breadcrumb heading, you should see the new Journal Title listed there (in the “View: Watched Target” tab)
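
Because a mistyped ISSN is easy to miss, it may be worth checking the number before entering it. The Python sketch below applies the standard ISSN check-digit rule (weights 8 down to 2, modulo 11, with "X" standing for 10). It is an optional aid for the Selector; nothing in this guide says that DDHAPT performs this check itself, and the function name is hypothetical.

```python
def issn_is_valid(issn: str) -> bool:
    """Validate an ISSN such as "2046-7990" using its check digit."""
    compact = issn.replace("-", "").strip().upper()
    if len(compact) != 8 or not compact[:7].isdigit():
        return False
    if not (compact[7].isdigit() or compact[7] == "X"):
        return False
    # Weight the first seven digits 8, 7, ..., 2 and sum the products.
    total = sum(int(digit) * weight for digit, weight in zip(compact[:7], range(8, 1, -1)))
    check = (11 - total % 11) % 11
    expected = "X" if check == 10 else str(check)
    return compact[7] == expected


print(issn_is_valid("2046-7990"))  # the example ISSN from this section
```
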
  5. Perform a test crawl
    1. Select “Document Harvesting: Watched Targets” (from the top menu bar)
    2. Alternatively: click on the W3ACT logo in the upper left corner – this will lead you to your personal Watched Target list.
    3. You should see the newly created Watched Target in this list
      1. If you click on the Title link, you can edit the properties of this Watched Target
      2. If you click on the Target URL, the URL in question will open in a new browser tab
    4. Click on the Live Crawl button
      1. A pop-up window appears, asking for confirmation to proceed with the crawl. Click OK.
      2. It may take some time (minutes) for this crawl to complete. The browser waits on a synchronous process that fetches all PDFs from the live web that fulfil the filtering criteria; the PDF-to-HTML conversions are performed asynchronously at the same time.
      3. When the crawl is complete, you are re-directed to a Documents List.
    5. (Optional) Click on the Live Crawl button again
      1. This should demonstrate that Documents that have been discovered will not be duplicated in the list if discovered during subsequent crawls [implied by User Story 11].
  6. Examine Documents [User Story 11]
    1. You are now on the Documents page, where you should see a list of discovered PDFs
      1. The search filter has been pre-set with your User Name and the Watched Target in question.
    2. Previously discovered and subsequently submitted documents will appear under the “Submitted” tab
      1. The process of submission is described below
    3. Previously discovered and subsequently ignored documents will appear under the “Ignored” tab
      1. If a PDF is discovered that should not be considered a Document, the Selector can click the “Ignore” action button. This Document will no longer appear in the New documents list but need not be edited or submitted.
        1. It is possible to move an ignored PDF from the Ignored list back to the New list by clicking on the “Restore” action button.
    4. Documents should never be duplicated; that is, if a Document has been submitted, it will never show up under the New list; or if a Document is already on the New list, it will not appear a second time when the site is re-crawled.
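
The no-duplicates behaviour described above can be modelled with a small Python sketch. It assumes that a Document is identified by its URL and that ignored Documents are also skipped on re-crawl; the class and method names are hypothetical and do not reflect the application's internals.

```python
from typing import Iterable, List


class DocumentLists:
    """Toy model of the New, Submitted and Ignored tabs for one Watched Target."""

    def __init__(self) -> None:
        self.new: List[str] = []
        self.submitted: List[str] = []
        self.ignored: List[str] = []

    def _known(self, url: str) -> bool:
        return url in self.new or url in self.submitted or url in self.ignored

    def record_crawl(self, discovered: Iterable[str]) -> List[str]:
        """Add only unseen URLs to the New list and return what was added."""
        added = []
        for url in discovered:
            if not self._known(url):
                self.new.append(url)
                added.append(url)
        return added

    def submit(self, url: str) -> None:
        self.new.remove(url)
        self.submitted.append(url)

    def ignore(self, url: str) -> None:
        self.new.remove(url)
        self.ignored.append(url)

    def restore(self, url: str) -> None:
        self.ignored.remove(url)
        self.new.append(url)


lists = DocumentLists()
lists.record_crawl(["a.pdf", "b.pdf", "c.pdf"])
lists.submit("a.pdf")
lists.ignore("b.pdf")
# A second crawl finds the same files plus one new one.
second_crawl_added = lists.record_crawl(["a.pdf", "b.pdf", "c.pdf", "d.pdf"])
print(second_crawl_added, lists.new)
```
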
  7. Edit Document - Type Journal Issue
    1. Click on document title - Example “QS 2014 Q1 Networks FINAL”
      1. Note that there may be a delay between the Live Crawl and the point at which the HTML-converted version of the PDF can be displayed. If you see an error message inside the Document display panel, this is the reason. Try opening the Document again after a few minutes.
      2. Note that it is possible to open the original PDF by clicking on the “Open Document” link. The PDF will be opened in a pop-up window. From here it is possible to download the file locally [User Story 41]. At present this document is fetched from the live website, but in production it will come from the BL Wayback instance.
    2. Edit Document metadata [User Story 30]
      1. NOTE: It is possible to zoom in and out of the Document window
        1. Click inside the viewer frame
        2. Hold down the “Ctrl” key of your keyboard while rolling the mouse wheel
      2. Title
        1. Note that for some sites, the Title will be automatically extracted from the landing page information [User Story 45]
        2. Note that in other cases an incorrect title may be automatically extracted from the file name, but you can overwrite this
        3. Highlight "Special topic: networking trends" on the front page of the Document. On mouse-up, you will see a list of possible attributes; click on "Title" and the value "Special topic: networking trends" appears in the corresponding metadata field.
      3. Publication Date - Example "March 2014"
        1. Highlight "March 2014" at the bottom of the second page of the Document. On mouse-up, you will see a list of attributes; click on "Publication Date" and the value "01-03-2014" will appear in the corresponding metadata field.
          1. Note that the Publication Year (2014) will be automatically extracted from this date
            1. If you do not define a publication date, you are required to select a Publication Year
          2. NOTE: [User Story 12] sometimes more specific information (for example a more exact publication date) can be found on the Document Landing Page. You can click on the “Open Landing Page” link to have this page open in a pop-up window. However, the MEX-like interactions will not work in this window.
          3. NOTE: If you are having difficulty selecting a particular section of text in the Document, it can help to zoom in using Ctrl-mouse-wheel.
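
The conversion from the highlighted text "March 2014" to the field value "01-03-2014" can be reproduced with a short Python sketch. It only handles the "Month Year" form, relies on English month names (the default C locale), and uses a hypothetical function name; the application's own parsing may accept other formats.

```python
from datetime import datetime
from typing import Tuple


def parse_publication_date(highlighted: str) -> Tuple[str, int]:
    """Turn text like "March 2014" into (DD-MM-YYYY string, publication year).

    With no day in the text, strptime defaults to the first of the month.
    """
    parsed = datetime.strptime(highlighted.strip(), "%B %Y")
    return parsed.strftime("%d-%m-%Y"), parsed.year


date_value, year = parse_publication_date("March 2014")
print(date_value, year)
```
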
      4. Priority Cataloguing [User Story 50]
        1. If the Document in question should be queued for priority cataloguing, select this check-box.
      5. Services Selection
        1. If the Watched Target associated with this Document has an appropriate permissions-based license, then it is possible to select one or more of the Services, which will also publish the Document to those catalogues.
          1. NOTE: The actual values for these options, and their relationships to license types, are configurable by a System Administrator.
      6. Subject
        1. Here you can optionally select one or more Subject tags from the FAST Subject hierarchy, for example “Business” and/or “Business enterprises”
          1. NOTE: Subjects are initially inherited from the associated Watched Target or Journal Title, but these can always be manually altered.
      7. Select Submission Type Journal Issue
        1. This will present the Journal Issue specific metadata
        2. Select a Journal Title from the drop-down menu - Example "Quarterly Survey of Small Business"
          1. If the desired Journal Title is not available, you can create a new one by clicking New Journal Title
        3. Define the Volume - Example "30"
          1. Highlight "30" at the top of page two of the Document. On mouse-up, you will see a list of attributes; click on "Volume" and the value "30" appears in the corresponding metadata field.
        4. Define the Issue/Part - Example "1"
          1. Highlight "1" at the top of page two of the Document. On mouse-up, you will see a list of attributes; click on "Issue/Part" and the value "1" appears in the corresponding metadata field.
            1. Note that in this case it would be faster to simply type a “1” directly into the metadata field.
      8. Adding Authors
        1. At present, one should highlight both the first name and the last name of the author in question, then click on the Author field. Each time this is done, the author will be added to the list, until all three authors have been defined. Each additional action after this point will only change the value for the third author.
        2. This approach only works when authors are presented as [First Name][space][Last Name]. Whether this interaction pattern is desirable or not remains to be determined by the user tests.
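
The author behaviour described above (append until three authors exist, then overwrite the third) can be sketched in Python. The function name, the constant and the author names are hypothetical, and the sketch mirrors the described behaviour rather than the application's code.

```python
from typing import List

MAX_AUTHORS = 3  # the form described above holds three authors


def add_author(authors: List[str], highlighted: str) -> List[str]:
    """Return a new author list after clicking Author on highlighted text."""
    name = " ".join(highlighted.split())  # collapse stray whitespace
    updated = list(authors)
    if len(updated) < MAX_AUTHORS:
        updated.append(name)
    else:
        # All slots are full: only the third author's value changes.
        updated[-1] = name
    return updated


authors: List[str] = []
for text in ["Ann Example", "Bob  Sample", "Cy Placeholder", "Dee Fourth"]:
    authors = add_author(authors, text)
print(authors)
```
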
    3. Click Save
      1. This will automatically switch you to the “View” tab of the Document
      2. You will receive the confirmation “Your changes have been saved”
  8. Submit a Document
    1. Click the View tab of the Document in question (if you are not already there)
    2. Click the Submit button
      1. Note that you will be prompted "are you sure?" This is because once you Submit, it will no longer be possible to edit the Document metadata in the DDHAPT application.
      2. Once submitted, the Document can only be viewed
    3. You will receive a confirmation message “The document has been submitted.”
    4. The document no longer appears under the "New" tab of the Watched Target, but rather under the "Submitted" tab
    5. The converted HTML is no longer displayed; rather, the original PDF is.
  9. View SIP
    1. Click on the “Documents” link in the Breadcrumb header
      1. Alternatively: Select Document Harvesting: Documents from the menu
    2. Click on the “Submitted” tab
    3. Click on the “SIP” link in the SIP column of the submitted Document entry
      1. The SIP METS file opens in a new browser tab