Eduction Table mode allows you to extract entities from a table, according to the values in the header of that table. This process allows you to target extraction on likely values in structured data, rather than extracting every possible entity value from a table. It can also improve the confidence that an ambiguous entity value corresponds to a particular type of data.
In standard extraction, Eduction searches text for a value that matches a particular entity. In many cases the entity values are distinctive, and so you can be reasonably confident that matches are relevant. For example, a string that matches an address entity is unlikely to be anything else.
Many other entity values are potentially ambiguous. For example, a number might match several entity types, and a date might be a date of birth or an event date. Without further information, it is difficult to determine whether these values are useful.
For unstructured text, you can use landmarks to find relevant information. Landmarks are values that identify a particular entity, without being a part of the entity value. For example, the phrase Date of Birth is a landmark. When a document contains the value Date of Birth: 06/07/80, it is highly likely that the date is a date of birth, and you can treat the data accordingly.
NOTE: The IDOL PII Package, IDOL PHI Package, and IDOL PCI Package provide landmark entities in most grammars. To extract entities from tables with the Eduction standard grammar files, you might need to create your own landmark entities.
For structured data, it is less likely that the landmark occurs next to the entity. You might have the value Date of Birth in a table heading, and the actual date values in the rows below. In this case, you can use table extraction to extract the values that correspond to the landmark.
In table extraction, you define an entity or entities that you want to detect in the header row, and entities that you want to detect in the cells under that header. When Eduction matches one of these entities in the header row of a table, it attempts to extract the corresponding cell entities from the cells in that column.
To configure these, you use the HeaderEntityN and CellEntityN configuration parameters.
For example:
[Eduction]
HeaderEntity0=pii/date/dob/landmark/all
CellEntity0=pii/date/nocontext/all
This example matches date of birth landmark values in the header, and for all subsequent rows in that column, it extracts any date values.
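The two-step behavior described above (match a landmark in the header, then extract only from that column) can be illustrated with a minimal Python sketch. This is not the Eduction implementation or API: the two regular expressions are hypothetical stand-ins for the header and cell entities, and the row and column numbering (header as row 0, columns counted from 0) is an assumption made for the illustration.

```python
import csv
import io
import re

# Hypothetical stand-ins for the header and cell entities; real
# grammars are far more thorough than these two patterns.
HEADER_LANDMARK = re.compile(r"\b(date of birth|dob)\b", re.IGNORECASE)
CELL_DATE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b")


def extract_from_table(text):
    """Return (row, column, value) for dates found under matching headers."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return []
    header, body = rows[0], rows[1:]
    # Step 1: find the columns whose header cell contains the landmark.
    columns = [i for i, cell in enumerate(header) if HEADER_LANDMARK.search(cell)]
    matches = []
    # Step 2: look for cell entities only in those columns.
    # Assumed numbering: the header is row 0, so data rows start at 1.
    for row_index, row in enumerate(body, start=1):
        for col in columns:
            if col < len(row):
                for found in CELL_DATE.finditer(row[col]):
                    matches.append((row_index, col, found.group()))
    return matches


# Made-up sample data: only the "Date of Birth" column should be searched.
SAMPLE = (
    "Name,Date of Birth,Joined\n"
    "A. Smith,06/07/80,01/02/2015\n"
    "B. Jones,12/11/1975,03/04/2018\n"
)

for match in extract_from_table(SAMPLE):
    print(match)
```

In this sketch the dates in the Joined column are ignored, because that header does not match the landmark pattern.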
NOTE: You can specify multiple entities, either by providing a comma-separated list, or by using wildcard characters. In this case, if the table header matches any of the configured header entities, Eduction matches the cell content against any of the configured cell entities.
This option might be useful if you want to match a particular entity in multiple languages, or if you want to include a custom entity in addition to a standard one.
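As a rough illustration of how a comma-separated list with wildcards can select several entities, the following Python sketch expands such a value against a list of available entity names. The entity names are hypothetical, and the sketch assumes shell-style wildcard semantics as provided by Python's fnmatch module, which might differ in detail from how Eduction resolves wildcards.

```python
from fnmatch import fnmatchcase


def expand_entities(setting, available):
    """Expand a comma-separated list of names or wildcard patterns."""
    patterns = [p.strip() for p in setting.split(",") if p.strip()]
    # Keep every available entity that matches at least one pattern.
    return [
        name
        for name in available
        if any(fnmatchcase(name, pattern) for pattern in patterns)
    ]


# Hypothetical entity names, for illustration only.
AVAILABLE = [
    "pii/date/dob/landmark/eng",
    "pii/date/dob/landmark/fre",
    "pii/date/nocontext/eng",
    "custom/dob_landmark",
]

selected = expand_entities("pii/date/dob/landmark/*,custom/dob_landmark", AVAILABLE)
print(selected)
```

Here the wildcard selects the landmark entity in both hypothetical languages, and the second list item adds the custom entity.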
To use table extraction with Connector Framework Server (CFS) or IDOL NiFi Ingest, you can also add the EntityFieldN parameter. This parameter specifies the field that CFS or NiFi write the extracted entities to in your documents.
In this case, if you do not set EntityFieldN, Eduction uses the value of CellEntityN to create a default field name (the capitalized entity name, with / * and ? characters replaced with underscores).
[Eduction]
HeaderEntity0=pii/date/dob/landmark/all
CellEntity0=pii/date/nocontext/all
EntityField0=DATE_OF_BIRTH
NOTE: You cannot specify EntityFieldN for only some of your CellEntityN values; you must either use the default value for all, or set EntityFieldN for all.
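The default field naming rule described above can be sketched in a few lines of Python. The function name is hypothetical, and the sketch assumes that "capitalized" means converting the whole entity name to uppercase; it illustrates the rule rather than reproducing Eduction's own code.

```python
import re


def default_field_name(cell_entity):
    """Derive a default field name from a CellEntityN value."""
    # Replace the / * and ? characters with underscores, then convert
    # to uppercase (assumed meaning of "capitalized").
    return re.sub(r"[/*?]", "_", cell_entity).upper()


print(default_field_name("pii/date/nocontext/all"))
```

Under these assumptions, leaving EntityField0 unset in the example above would produce the field name PII_DATE_NOCONTEXT_ALL.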
With CFS and IDOL NiFi Ingest, you can also configure Eduction to extract entities from structured table data in XML, such as the output from Media Server OCR. In this case, there is an additional parameter, TableCellPath, which describes the structure of the XML so that Eduction can find the cells. For more information, refer to the Connector Framework Server documentation.
After you configure table extraction, you can run Eduction as normal, with a CSV or TSV table file as input.
In the Eduction SDK:
C: You provide the table file by using the AddInputText or SetInputStream functions. You can use the EdkGetMatchTablePosition function to retrieve the row and column details of a match.
Java: You provide the table file by using the addInputText or setInputStream methods. You can call public EDKMatch.TablePosition getTablePosition() to return an object with two public members, row and column.
.NET: You provide the table file by using the AddInputText or SetInputStream methods. You can use the readonly property public IExtractionMatchTablePosition TablePosition to return an object that has the readonly properties Row and Column.
NOTE: To use Table Extraction with the Eduction SDK, you must create an Eduction engine with a configuration file. See the Standalone API Usage section for your language in API Reference.
In Eduction Server and the edktool command-line tool, you provide the table file as plain input text. Eduction returns the matches in the response.
In CFS and NiFi, the ingestion process sends the table file to the Eduction engine. CFS and NiFi add the match details to the output documents.