
nuxeo-ldt-parser

Note

This README is a Work In Progress. Most of the plugin is described and explained, but some parts are still missing (e.g. using a page range to extract only some pages of a multi-page LDT record, or using a compressed LDT file)

nuxeo-ldt-parser provides a configurable service for parsing LDT files.

LDT files usually contain a lot of text data: several thousand, sometimes hundreds of thousands of records per file, which adds up to billions of lines after a while (example: an LDT file holding bank statements for all the customers of a bank).

The parser aims to store only retrieval information and optional custom fields:

  • Retrieval information: see below; this is what allows quickly getting a record from inside the LDT file without re-parsing it
  • Custom fields: only the fields needed for the business, typically for search (like a clientId, for example)

So, still with the bank statement example, we would store only a few bytes of retrieval information and a couple of fields for searching, but not the dozens of lines of transactions: those are retrieved only when it is time to download a rendition of the statement, with all its transaction lines.

Description

Note

For the rest of this documentation, we call a "Record" the set of information saved in the LDT file. Such a file holds n lines defining m records. A record typically has one or more headers followed by several items.

Parse the File, extract Records, with header(s) and items

The plugin parses an LDT file and extracts records based on configuration. Records can be saved as Nuxeo Documents, with the utility fields required for fast retrieval (see below) plus any custom fields you need (if the LDT holds bank statements, you would store client name, amounts, dates, …).

Optionally, it can compress the source LDT file. As it is a text file, the compression rate can be around 70-80%. When an LDT is compressed, retrieval uses the same mechanism (it gets the compressed bytes from the technical fields).

The plugin parses the file line by line. For each record, it expects:

  • A startRecordToken and an endRecordToken. Set in the XML contribution, they are required. This is how the plugin knows where a record starts and ends, so it can parse and analyse all the lines between the start and the end (both included in the parsing)
  • And 2 kinds of lines: A line is either a header or an item

So, typically:

Header line with the startRecordToken
0 or more header(s)
Lines with items, last one contains the endRecordToken

A record can hold several pages in the LDT. In this case you have the following (example with 3 pages):

Header line with the startRecordToken
0 or more header(s)
Lines with items, none contains the endRecordToken
Header line with the startRecordToken
0 or more header(s)
Lines with items, none contains the endRecordToken
Header line with the startRecordToken
0 or more header(s)
Lines with items, last one contains the endRecordToken

If you want to handle multi-page records, do not forget to set the endOfPage property for the relevant item(s) in the XML configuration. See the example in ldtparser-service.xml, where the "IntermediateBalance" and "ClosingBalance" items define <endOfPage>true</endOfPage>.
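To make this flow concrete, here is a minimal, hypothetical sketch of this kind of line-by-line scan (the token values are the ones of the example further below; this illustrates the principle and is not the plugin's actual code):

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.List;

public class RecordScanSketch {
    public static void main(String[] args) throws Exception {
        // Values from the XML contribution (here, the ones of the example below)
        String startRecordToken = "$12345ABCD$";
        String endRecordToken = "CLOSING BALANCE";

        List<String> currentRecord = new ArrayList<>();
        boolean inRecord = false;

        try (BufferedReader reader = new BufferedReader(new FileReader(args[0]))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (!inRecord && line.contains(startRecordToken)) {
                    inRecord = true; // first header of a new record
                }
                if (inRecord) {
                    currentRecord.add(line); // start and end lines are included
                    if (line.contains(endRecordToken)) {
                        // End of record (possibly after several pages):
                        // hand the lines over to header/item parsing
                        System.out.println("Record of " + currentRecord.size() + " lines");
                        currentRecord.clear();
                        inRecord = false;
                    }
                }
            }
        }
    }
}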

Headers and Items:

  • A header usually defines fields/values shared by the record. In a bank statement example, it will be the date of the statement, the client Id, the bank account, etc.

Important

The plugin expects that every record has at least one header and this header starts with the startRecordToken defined in the contribution.

  • After the header(s) come the items. In the bank statement example, these are the lines with the date, the reason, the amount, etc. The last line of the items (aka of the record) must contain the endRecordToken.
    • A record can have several pages (see above): the plugin gets all the lines between the startRecordToken and the endRecordToken

Rendering a Record

To render a record, you first get its JSON and then render it as you need. The unit tests provide rendering examples: render to HTML (using FreeMarker for templating) and render to PDF (which actually is a rendering to HTML followed by a conversion to PDF).

Retrieving a Record inside the LDT File

The plugin provides the LDTRecord facet that comes with the ldtrecord schema, and a LDTRecord document type that has this facet (the configuration allows for using another document type, as long as you give it the LDTRecord facet).

The ldtrecord schema stores:

  • The ID of the related Nuxeo document storing the LDT file
  • If the LDT is not compressed: the start offset and record size (in bytes) of the record inside the LDT file.
  • If the LDT file was compressed (and is a .cldt file): the record size is stored as a negative value; this is how the plugin knows it has to expand the bytes once retrieved.

So, when a record needs to be fetched from the document, we just get the required bytes from the original ldt/cldt file; there is no need to re-parse the whole file to find a record. Notice this also works if you store your binaries in S3 using nuxeo-s3-binary-storage (see below, "S3 BlobProvider Configuration"). And this is fundamental: we don't want to download a 500MB LDT file locally from S3 just to parse it and extract 1KB of text data; this would not scale and would cost more.
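As an illustration of the principle (a simplified sketch, not the plugin's actual implementation), fetching a record from a local ldt/cldt file boils down to a seek-and-read, plus a GZIP expansion when recordSize is negative. With S3 and allowByteRange (see "S3 BlobProvider Configuration" below), the same read becomes a ranged GET:

import java.io.ByteArrayInputStream;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

public class RecordFetchSketch {
    // startOffset and recordSize are the values stored in the ldtrecord schema
    static String fetchRecord(String path, long startOffset, long recordSize) throws Exception {
        boolean compressed = recordSize < 0; // negative size flags a compressed record
        int size = (int) Math.abs(recordSize);
        byte[] bytes = new byte[size];
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            file.seek(startOffset);
            file.readFully(bytes);
        }
        if (!compressed) {
            return new String(bytes, StandardCharsets.UTF_8);
        }
        // cldt: expand the GZIP-compressed record
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(bytes))) {
            return new String(gzip.readAllBytes(), StandardCharsets.UTF_8);
        }
    }
}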

Configuration

Note

See ldtparser-service.xml for a full example and all the configuration properties, with detailed documentation.

To configure a parser, you contribute to the LDTParserService "ldtParser" extension point to define:

  • The Start and End of record tokens. These are required.

  • As many Regex as needed to parse the header(s) and the items of the record, with field mapping to your Nuxeo document (so you map the "clientId" header to the statement:clientId field of your document)

    For items in particular:

    Important

    The plugin uses java.util.regex.Pattern to handle the Regex. Make sure your expressions are compatible with this usage.

  • The JSON template to use when reading a record inside the LDT
  • Document type(s) to create (they must be declared with the "LDTRecord" facet)
  • Callbacks (Java) for fine-tuning when needed (mainly when a Regex can't resolve a line). There is also an Automation callback for quick tests/POCs (see below, "Automation Callback for Items")

Simple example with 2 records

$12345ABCD$    TYPE=BANK0003  CLIENT TYPE: A     TAX ID: 12345678901234    CLIENT ID: 1234567890ABC12
003090         John & Marie DOE          MARCH-2023      098765432
1   01/03    OPENING BALANCE                                                    999.77
2   01/03     The label for the item           657.20-          NF99
. . . more lines . . .
3   26/03    CLOSING BALANCE                                                   8575.55-
$12345ABCD$    TYPE=BANK0003  CLIENT TYPE: B     TAX ID: 12345678901567    CLIENT ID: 9874567890ABC12
003090         ACME Ltd                  MARCH-2023      098765000
1   01/03    OPENING BALANCE                                                    10567.89
2   01/03     The label for the item           100.00          T999
. . . more lines . . .
3   31/03    CLOSING BALANCE                                                     9554.26

In this file, each Record:

  • Starts with exactly $12345ABCD$
  • Has 2 lines of headers with fields like bank type, client id, client name, etc. Then has n lines of items.
  • Items start with an OPENING BALANCE and end with a CLOSING BALANCE
  • The record itself ends with CLOSING BALANCE

=> See the "default" ldtParser contributed in /resources/OSGI-INF/ldtparser-service.xml for the values used to parse this LDT. Here we just show the fields needed for the example, assuming we are mapping to the default LDTRecord document type.

The contribution must define/declare:

  • A new parser with a unique name
  • Start/end record tokens
  • 2 regex and captured fields for the headers
  • 3 regex and captured fields for the items
  • A mapping between the fields captured and XPaths in an LDTRecord document
  • The JSON properties we want when getting the JSON of the record

Details:

  • Declare the contribution and a new parser with a unique name:
<extension target="nuxeo.ldt.parser.service.LDTParser"
		    point="ldtParser">
  <ldtParser>
    <name>MyParser</name>
  • Declare the start/end of a record:
    <recordStartToken>$12345ABCD$</recordStartToken>
    <recordEndToken>CLOSING BALANCE    </recordEndToken>
  • Now parse the headers. We have 2 lines of header here:
    • First one starts with the startRecordToken and has some fields we want to capture in a Regex:
    <headers>
      <header>
        <name>firstLine</name>
        <pattern>^\$12345ABCD\$ *TYPE=(BANK.{4}) *CLIENT TYPE: *([a-zA-Z]) *TAX ID: *([A-Z0-9]*) *CLIENT ID: *([A-Z0-9]*)</pattern>
        <!-- fields MUST BE same number and order as the pattern groups captured above -->
        <fields>
          <field>bankType</field>
          <field>clientType</field>
          <field>taxId</field>
          <field>clientId</field>
        </fields>
      </header>
  • The second line has more info and fields we want to capture
    <header>
      <name>secondLine</name>
      <pattern>^([A-Z0-9]*) *(.*?) *(JANUARY|FEBRUARY|MARCH|APRIL|MAY|JUNE|JULY|AUGUST|SEPTEMBER|OCTOBER|NOVEMBER|DECEMBER)-(\d{4}) *([A-Z0-9]*)</pattern>
      <fields>
        <field>bankId</field>
        <field>clientName</field>
        <field>month</field>
        <field>year</field>
        <field>customRef</field>
      </fields>
   </header>
 </headers>
  • Then we have items. Here we have 3 kinds:
    • An opening balance:
  <itemLine>
    <type>OpeningBalance</type>
    <pattern>^([0-9]*) *([0-9]{2}/[0-9]{2}) *OPENING BALANCE *([0-9]*.[0-9]{2}-?) *</pattern>
    <fields>
      <field>lineCode</field>
      <field>date</field>
      <field>amount</field>
    </fields>
  </itemLine>
  • A closing balance:
  <itemLine>
    <type>ClosingBalance</type>
    <pattern>^([0-9]*) *([0-9]{2}/[0-9]{2}) *CLOSING BALANCE *([0-9]*.[0-9]{2}-?) *</pattern>
    <fields>
      <field>lineCode</field>
      <field>date</field>
      <field>amount</field>
    </fields>
  </itemLine>
  • And the regular item lines. We capture 5 groups here:
  <itemLine>
    <type>ItemLine</type>
    <pattern>^([0-9]*) *(\d{2}/\d{2}) *(.*?) *(\d+\.\d{2}-?) *([A-Z0-9]*)</pattern>
    <fields>
      <field>lineCode</field>
      <field>date</field>
      <field>label</field>
      <field>amount</field>
      <field>ref</field>
    </fields>
  </itemLine>
  • Now we want to use the provided LDTRecord. In Nuxeo Studio, we added to it a custom schema, statement, which contains the XPaths we want to map to the fields defined above
  <recordDocType>LDTRecord</recordDocType>
  <recordFieldsMapping>
    <field xpath="statement:clientId">clientId</field>
    <field xpath="statement:taxId">taxId</field>
    <field xpath="statement:month">month</field>
    <field xpath="statement:year">year</field>
  </recordFieldsMapping>
  • For each LDTRecord created, we want a title concatenating some fields (by default, the parser adds an incremented number)
  <recordTitleFields>
    <field>clientId</field>
    <field>taxId</field>
  </recordTitleFields>
  • Last, when getting the JSON of the record, we define the template to use:
<recordJsonTemplate>
  <root>statement</root>
  <properties>
    <property>bankType</property>
    <property>clientType</property>
    <property>taxId</property>
    <property>clientId</property>
    <property>bankId</property>
    <property>clientName</property>
    <property>month</property>
    <property>year</property>
    <property>customRef</property>
  </properties>
</recordJsonTemplate>
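Since the plugin relies on java.util.regex.Pattern (see the note above), it is worth checking each pattern against a sample line before contributing it. For example, a quick test of the firstLine header pattern against the first line of the sample file above:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PatternCheck {
    public static void main(String[] args) {
        Pattern firstLine = Pattern.compile(
                "^\\$12345ABCD\\$ *TYPE=(BANK.{4}) *CLIENT TYPE: *([a-zA-Z]) *TAX ID: *([A-Z0-9]*) *CLIENT ID: *([A-Z0-9]*)");
        String sample = "$12345ABCD$    TYPE=BANK0003  CLIENT TYPE: A     TAX ID: 12345678901234    CLIENT ID: 1234567890ABC12";
        Matcher m = firstLine.matcher(sample);
        if (m.find()) {
            // Groups are mapped, in order, to bankType, clientType, taxId, clientId
            System.out.println("bankType=" + m.group(1) + ", clientType=" + m.group(2)
                    + ", taxId=" + m.group(3) + ", clientId=" + m.group(4));
        } else {
            System.out.println("Pattern does not match the sample line!");
        }
    }
}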

Mapping Records to Documents

LDTParser#parseAndCreateRecords parses the input Document (whose file:content must store an LDT file) and creates as many LDTRecord documents as records found in the LDT file. The doc type is configurable, see recordDocType. If using a custom one, it must have the LDTRecord facet.

This document type has:

  • The ldtrecord schema that is used internally to retrieve the record: ID of the source LDT document, the startBytes and the recordSize
  • Optionally, any other fields used in the XML configuration for the mapping (for example, "statement:clientId", "statement:month", "statement:year", etc.)

Note

There is no mapping for the items

These documents are created in a container (doc type configurable, recordsContainerDocType) at the same level as the LDT document itself. The title of this container is {LDT document title}-Records. This suffix is also configurable (recordsContainerSuffix).

The Services.LDTParseAndCreateRecords operation does the same from Automation: it parses the input Document (whose file:content must store an LDT file) and creates as many LDTRecord documents as records found in the LDT file (the doc type is configurable, see recordDocType; if using a custom one, it must have the LDTRecord facet).

Typically, you would use these in a listener/Studio Event Handler when a new document containing an LDT file in its main blob is created. Of course, we strongly recommend using an asynchronous post-commit event.
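For reference, here is a sketch of calling the operation from Java (for example, from a custom listener), assuming the standard Nuxeo Automation API; the parameter values are just examples:

import java.util.HashMap;
import java.util.Map;

import org.nuxeo.ecm.automation.AutomationService;
import org.nuxeo.ecm.automation.OperationContext;
import org.nuxeo.ecm.core.api.CoreSession;
import org.nuxeo.ecm.core.api.DocumentModel;
import org.nuxeo.runtime.api.Framework;

public class ParseLdtSketch {
    // ldtDoc must hold the LDT file in file:content and have the ldt schema
    public static DocumentModel parseLdt(CoreSession session, DocumentModel ldtDoc) throws Exception {
        AutomationService automation = Framework.getService(AutomationService.class);
        try (OperationContext ctx = new OperationContext(session)) {
            ctx.setInput(ldtDoc);
            Map<String, Object> params = new HashMap<>();
            params.put("parserName", "default"); // name of an "ldtParser" contribution
            params.put("compressLdt", false);
            // Returns the input document with its ldt schema updated
            return (DocumentModel) automation.run(ctx, "Services.LDTParseAndCreateRecords", params);
        }
    }
}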

Note

An entry is added to server.log every 1,000 records. This is done at INFO level, so if you want to see this, change the log level for this class in the log4j2.xml file:

. . .
  <Loggers>
    . . .
    <Logger name="nuxeo.ldt.parser.service.LDTParser" level="info" />
    . . .
  </Loggers>
. . .

Automation

The plugin provides:

  • Automation Operations
  • And an Automation Callback for quick test/POC (see below "Automation Callback for Items")

Automation Operations are:

Services.LDTParseAndCreateRecords

  • Input:

    • A Document whose file:content contains the LDT file to parse
    • This document must have the ldt schema.
  • Parameter:

    • parserName, string. Must be the name of an "ldtParser" contribution. If not passed or empty, the operation uses the "default" configuration
    • compressLdt, boolean, optional (false by default).

    Important

    If true, the blob of the input document is replaced with a compressed LDT (extension .cldt, mime-type "application/cldt"), and all the recordSize values are negative (this is a flag used when retrieving a record)

  • Output:

    • The input Document with its ldt schema updated.
    • For now, this schema has a single field, ldt:countRecords.
  • The operation creates as many records as found in the LDT file, using the parserName configuration for detecting start/end record tokens, the Regex to use, the fields to map to XPaths, etc.

Services.GetLDTJsonRecord

(See below for details on input and parameters)

  • Input:
    • document, optional.
  • Parameters:
    • parserName, string, optional. Must be the name of an "ldtParser" contribution. If not passed or empty, the operation uses the "default" configuration
    • sourceLdtDocId, String, optional
    • startOffset, long, optional
    • recordSize, long, optional
    • firstPage, long, optional
    • lastPage, long, optional
  • Output:
    • A JSON Blob containing the record
    • Fields of the JSON are defined in the XML configuration, using recordJsonTemplate (see above).

Input is a document (optional). If passed, it must have the ldtrecord schema, the related LDT document must exist, and current user must have read permission on it. Also, if Input is passed, sourceLdtDocId/startOffset/recordSize are ignored (the operation reads the values from the ldtrecord schema).

If input is not passed, then sourceLdtDocId/startOffset/recordSize are required.

Whatever the input, it is possible to get only some pages of the record. This is typically useful when you know the record can be big, containing dozens and dozens of pages, and retrieving it whole could lead to a timeout or an OutOfMemory error (the JSON is always built in memory).
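As a sketch (still assuming the standard Nuxeo Automation Java API; the page values are examples), fetching only the first two pages of a record from its LDTRecord document could look like this:

import java.util.HashMap;
import java.util.Map;

import org.nuxeo.ecm.automation.AutomationService;
import org.nuxeo.ecm.automation.OperationContext;
import org.nuxeo.ecm.core.api.Blob;
import org.nuxeo.ecm.core.api.CoreSession;
import org.nuxeo.ecm.core.api.DocumentModel;
import org.nuxeo.runtime.api.Framework;

public class GetRecordJsonSketch {
    public static Blob getFirstPages(CoreSession session, DocumentModel recordDoc) throws Exception {
        AutomationService automation = Framework.getService(AutomationService.class);
        try (OperationContext ctx = new OperationContext(session)) {
            // Input document has the ldtrecord schema, so
            // sourceLdtDocId/startOffset/recordSize are ignored
            ctx.setInput(recordDoc);
            Map<String, Object> params = new HashMap<>();
            params.put("parserName", "default");
            params.put("firstPage", 1L);
            params.put("lastPage", 2L);
            // Returns a JSON Blob shaped by recordJsonTemplate
            return (Blob) automation.run(ctx, "Services.GetLDTJsonRecord", params);
        }
    }
}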

Callbacks

Java Callbacks

As Regex patterns may not always fulfill a custom business rule, the plugin provides 3 callbacks that can be used instead: one for parsing a header, one for an item, and one for the whole record.

This is set at configuration level, and it is not possible to mix Regex patterns and callbacks: if a callback is defined, it is used in place of the Regex.

So, you can set useCallbackForHeaders, useCallbackForItems and useCallbackForRecord. See the Callbacks interface for the signature, and CallbacksExample for an example (also look at the unit tests). Notice that if you use a callback for the whole record, the callbacks on header/item will never be called.

Automation Callback for Items

For quick tests and POCs, it may be convenient to use Automation. This is implemented only for parsing items (not headers, not the whole record), since headers should have a fixed format anyway.

Warning

The callback is called for every line that is not a header. This is done only when getting a record inside the LDT file, not when parsing it to create all the LDTRecord documents. Still, for each line, all the machinery of Automation is instantiated, which means it is far slower than using a Java callback.

If a record has a dozen items and there are not dozens or hundreds of concurrent requests, it will be fine. But remember it does not scale. It is still very useful for quick tests and POCs.

You set up the chain to call in the configuration, using the parseItemAutomationCallback configuration property. The chain receives no input and two parameters:

  • line, the line to parse
  • And config, the LDTParserDescriptor, which gives access, should you need it, to the other configuration parameters (record start/end tokens, for example)

Your chain must return a JSONBlob, a JSON that defines all the properties of an Item object:

{
  "type": "TheLineType",
  "fieldList": ["field1", "field2", "field3"], // String array
  "fieldsAndValues": [ // array of key/value objects. ALL VALUES ARE STRINGS
    {"field": "field1", "value": "value1"},
    {"field": "field2", "value": "value2"},
    {"field": "field3", "value": "value3"}
  ]
}

Here is a small example:

function run(input, params) {
  
  var line = params.line;
  var config = params.config;
  var regex;
  var match = null;
  var jsonItem = {};
  
  // =============================================
  // Detect last line
  // =============================================
  if(line.indexOf(config.getRecordEndToken()) > -1) {
    // Last item of the record, easy to parse with a Regex
    regex = /(\d+)\s+(\d{2}\/\d{2})\s+CLOSING BALANCE\s+([\d.]+)/;
    match = line.match(regex);
    jsonItem.type = "ClosingBalance";
    jsonItem.fieldList = ["date", "amount"];
    jsonItem.fieldsAndValues = [
      {"field": "date", "value": match[2]},
      {"field": "amount", "value": match[3]},
    ];
      
    return org.nuxeo.ecm.core.api.Blobs.createJSONBlob(JSON.stringify(jsonItem));
  }
  
  // =============================================
  // Something based on the length of the line
  // =============================================
  if(line.length < 60) {
    return null;
  }
  
  var date = line.substring(3, 8);
  var label = line.substring(12, 50).trim();
  var amount = line.length >= 61 ? line.substring(50, 61).trim() : line.substring(50, 60).trim();
  var balance = line.length >= 80 ? line.substring(62, 81).trim() : null;
  jsonItem.type = "Item";
  jsonItem.fieldList = ["date", "label", "amount"];
  jsonItem.fieldsAndValues = [
    {"field": "date", "value": date},
    {"field": "label", "value": label},
    {"field": "amount", "value": amount}
  ];

  return org.nuxeo.ecm.core.api.Blobs.createJSONBlob(JSON.stringify(jsonItem));
}

Example of Usage with Nuxeo Studio

Nuxeo Studio is Nuxeo's Low-Code configuration tool. If your LDT files don't need a Java callback, using the plugin is even easier:

  1. In Studio, add as many XML configurations as parsers you want to use (when you have different content for your LDT files)
  2. Use the Services.LDTParseAndCreateRecords operation to create the Documents from the LDT file
    • You would typically use an async EventHandler for the documentCreated event, filtered on your document type, so the parsing is automatic. Make sure it is asynchronous
  3. If you just need to get the JSON, then call Services.GetLDTJsonRecord when needed
  4. If you need a different rendering:
    • Get the record as JSON, then use FreeMarker and a template (see the example in the unit tests).
    • For example, an HTML template
    • Use Nuxeo Template Rendering (with freemarker) to render a word/pdf document
    • . . . use any template you want
    • The principle is to inject the values of the JSON

Typically, and still with the bank statement example (as used in the unit tests and the "default" parser), you would...

  • Insert values, like (see below for the usage of the Context object):
<div class="label">Bank Id</div><br/>
<div>${Context.statement.bankId}</div>

<div class="label">Date</div><br/>
<div>${Context.statement.month}/${Context.statement.year}</div>

For items, using a table:

<table class="operations">
  <#list Context.items as item>
    <tr>
      <td class="date">${item.date}</td>
      <td>${item.label}</td>
      <#if item.amount < 0>
        <td class="amount negative">${item.amount}</td>
      <#else>
        <td class="amount">${item.amount}</td>
      </#if>
    </tr>
  </#list>
</table>
  • From Studio, as Services.GetLDTJsonRecord returns a string (to be converted to JSON) and FreeMarker expects Java objects, you may need to "massage" the values a bit. Again, see the unit test for an example (automation-render-pdf-with-any2pdf.xml). Something like:
function run(input, params) {
  // input is an LDT Record
  var jsonBlob = Services.GetLDTJsonRecord(input, {'parserName': "MyCustomParser"});
  var json = JSON.parse(jsonBlob.getString());

  // Make sure items are ordered. In our "MyCustomParser", we defined the root of the JSON as "statement"
  json.statement.items.sort(function(a, b) {
    return a.order - b.order;
  });
  
  // We also add missing fields, set them to "" and convert the negative string to number
  // This happens when different types of items are used (here, our opening balance items don't have a "customRef", for example)
  json.statement.items.forEach(function(item) {
    if(!item.label) {
      item.label = "";
    }
    if(!item.ref) {
      item.ref = "";
    }
    if(item.amount.endsWith("-")) {
      item.amount = item.amount.replace("-", "");
      item.amount = +item.amount;
      item.amount *= -1;
    } else {
      item.amount = +item.amount;
    }
  });

  // The template expects a "statement" context variable...
  ctx.statement = json.statement;
  // .. and an "items" Java array of objects, we need a conversion
  ctx.items = Java.to(json.statement.items);

  // Now we can render as html using our template.
  // We use the input as a convenience (it is not modified)
  var html = Render.Document(input, {
                    template: "template:BankStatementTemplate",
                    filename: "statement.html",
                    mimetype: "text/html",
                    type: "ftl"
                  });

  // Return the html blob
  return html;
}

S3 BlobProvider Configuration

Retrieval is Super Fast, Even When Using S3.

As explained above (Retrieving a Record inside the LDT File), the plugin, when parsing an ldt/cldt file, creates documents and stores retrieval information in the ldtrecord. This way, when a single record is requested, the plugin just gets the bytes without parsing the file.

This works exactly the same if you store your binaries on S3 using the Nuxeo S3 Online Storage plugin: the plugin will directly read the recordSize bytes from the file on S3, with no need to first download the file from S3 to local storage (typically an EBS volume). This is extremely performant: it makes no sense to download a 600MB file locally from S3 just to get 2KB from it. It would not scale; imagine 50 concurrent users asking for their statements from 50 different big LDT files.

It is transparent. The only thing to do is add the allowByteRange property to the blob provider configuration. The contribution to add is:

<require>s3</require>
<extension target="org.nuxeo.ecm.core.blob.BlobManager" point="configuration">
  <blobprovider name="default">
    <property name="allowByteRange">true</property>
  </blobprovider>
</extension>

Compressing the LDT

As explained above, it can be interesting to compress the source .ldt file. It is text with a lot of spaces, and so has a very good compression rate.

The plugin can compress the source LDT in a simple custom format: we extract each record from the source LDT, compress it with GZIP, append the compressed bytes to a .cldt file (mime-type "application/cldt"), and store the startOffset and recordSize (which is the compressed record size). See nuxeo.ldt.parser.service.CompressedLDT.

Once compressed, retrieval is transparent: the plugin gets the bytes directly from the .cldt file and uncompresses them to a string, which can then be rendered (JSON, PDF, …)

Compressed LDT
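Conceptually, writing one record to the .cldt file can be pictured like this (a simplified sketch of the principle, not the actual CompressedLDT implementation):

import java.io.ByteArrayOutputStream;
import java.io.FileOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

public class CompressionSketch {
    // Compresses one record and appends it to the .cldt stream.
    // Returns the compressed size; the plugin stores it negated in
    // the ldtrecord schema to flag the record as compressed.
    static long appendRecord(FileOutputStream cldt, String record) throws Exception {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(record.getBytes(StandardCharsets.UTF_8));
        }
        byte[] compressed = buffer.toByteArray();
        cldt.write(compressed); // startOffset is the file position before this write
        return compressed.length;
    }
}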

Build and run

Without building Docker:

cd /path/to/nuxeo-ldt-parser
mvn clean install -DskipDocker=true

To test with an S3 BinaryStore, see the testLDTParserWithS3BinaryStore class. You need to set up the following environment variables:

  • For accessing AWS: AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_SESSION_TOKEN and AWS_REGION
  • Info on the bucket: TEST_BUCKET and TEST_BUCKET_PREFIX

Support

Important

These features are not part of the Nuxeo Production platform.

These solutions are provided for inspiration and we encourage customers to use them as code samples and learning resources.

This is a moving project (no API maintenance, no deprecation process, etc.). If any of these solutions are found to be useful for the Nuxeo Platform in general, they will be integrated directly into the platform, not maintained here.

Licensing

Apache License, Version 2.0

About Nuxeo

Nuxeo, developer of the leading Content Services Platform, is reinventing enterprise content management (ECM) and digital asset management (DAM). Nuxeo is fundamentally changing how people work with data and content to realize new value from digital information. Its cloud-native platform has been deployed by large enterprises, mid-sized businesses and government agencies worldwide. Customers like Verizon, Electronic Arts, ABN Amro, and the Department of Defense have used Nuxeo's technology to transform the way they do business. Founded in 2008, the company is based in New York with offices across the United States, Europe, and Asia.
