C API Programming Tutorial
The File Content Extraction Filter SDK allows you to embed File Content Extraction functionality into your application. This section demonstrates how to get started with the Filter C API. OpenText also recommends that you refer to the filter_tutorial sample program, because the source code for that program includes the features described here and demonstrates best practice for using the API.
Create a Session
To get started with the File Content Extraction Filter API, several steps are required:
-
Include the required headers.
-
Load the Filter interface, by calling KV_GetFilterInterfaceEx().
-
Create a Filter session by calling fpInit().
-
Load the Extract interface, by calling fpGetExtractInterface(), passing in the session that you initialized.
The following code is similar to the code in the filter_tutorial sample program and demonstrates all of these steps.
#include <stdio.h>
#include "kvtypes.h"
#include "kvfilt.h"
#include "kvxtract.h"

// Declared before main() so that the call below compiles; defined under "API Setup"
KVErrorCode setupFilterSession(const char* const binDirectoryPath, KVFltInterfaceEx* filter, KVExtractInterfaceRec* extract, KVFilterSession* pSession);

int main(void)
{
    KVFltInterfaceEx filter = { 0 };
    KVExtractInterfaceRec extract = { 0 };
    KVFilterSession session = NULL;
    KVErrorCode error = setupFilterSession(getBinDirectoryPath(), &filter, &extract, &session);
    // Use File Content Extraction
    // For example, detect file format or filter text, as discussed later in the tutorial
    if (filter.fpShutdown)
    {
        filter.fpShutdown(session);
    }
    return (int)error;
}

// API Setup
KVErrorCode setupFilterSession(const char* const binDirectoryPath, KVFltInterfaceEx* filter, KVExtractInterfaceRec* extract, KVFilterSession* pSession)
{
    // Load the Filter interface
    KVErrorCode error = KV_GetFilterInterfaceEx(filter, KVFLTINTERFACE_REVISION);
    if (error != KVError_Success)
    {
        fprintf(stderr, "setupFilterSession() - KV_GetFilterInterfaceEx() failed\n");
        return error;
    }
    // Create a Filter session
    KVFilterInitOptions options;
    KVStructInit(&options);
    options.outputCharSet = KVCS_UTF8;
    error = filter->fpInit(binDirectoryPath, FILTER_LICENSE, &options, pSession);
    if (error != KVError_Success)
    {
        fprintf(stderr, "setupFilterSession() - fpInit() failed\n");
        return error;
    }
    // Load the Extract interface
    KVStructInit(extract);
    error = filter->fpGetExtractInterface(*pSession, extract);
    if (error != KVError_Success)
    {
        fprintf(stderr, "setupFilterSession() - fpGetExtractInterface() failed\n");
        return error;
    }
    // Optional code to configure the Filter session
    // This is discussed later in the tutorial
    return KVError_Success;
}
All Filter API functionality requires a session, which you must initialize before the start of processing by calling fpInit(), and shut down after the end of processing by calling fpShutdown().
In this example, the work involved in creating the session exists in the setupFilterSession() function. When this function completes successfully and returns KVError_Success, the session has been initialized and the function pointers in the filter and extract structures have been assigned and can be used to access Filter and Extract API functionality.
The function fpInit() takes the path to the File Content Extraction bin folder, and your license key, which are not defined in this example. It also takes a pointer to a KVFilterInitOptions structure, which you must initialize by using KVStructInit(). This macro ensures that a struct is correctly set up for use with the File Content Extraction interface, including version information for backwards compatibility. Any File Content Extraction struct that contains a KVStructHead member must be initialized with the KVStructInit() macro.
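The general idea behind an initialization macro of this kind can be illustrated with a size-stamped struct header. The following is a minimal sketch of that generic pattern only. The names DemoStructHead, DemoOptions, DEMO_REVISION, and DEMO_STRUCT_INIT are hypothetical, and the sketch does not claim to reproduce how KVStructInit() is actually implemented.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical header, standing in for a member such as KVStructHead. */
typedef struct {
    size_t size;     /* sizeof the struct as the caller compiled it */
    int    revision; /* interface revision the caller compiled against */
} DemoStructHead;

/* Hypothetical options struct; not a real SDK type. */
typedef struct {
    DemoStructHead head;
    int outputCharSet;
    int flags;
} DemoOptions;

#define DEMO_REVISION 3

/* Zero the whole struct, then stamp its size and revision so that a
   library can tell which fields the caller knows about. */
#define DEMO_STRUCT_INIT(p)                     \
    do {                                        \
        memset((p), 0, sizeof(*(p)));           \
        (p)->head.size = sizeof(*(p));          \
        (p)->head.revision = DEMO_REVISION;     \
    } while (0)

/* Library-side check: reject structs that were never initialized. */
static int demoOptionsAreValid(const DemoOptions* options)
{
    assert(options != NULL);
    return options->head.size == sizeof(DemoOptions)
        && options->head.revision > 0
        && options->head.revision <= DEMO_REVISION;
}
```

With a pattern like this, a struct that was only zeroed (or left uninitialized) fails validation, which is one reason an API might insist that callers use the macro rather than filling in the struct by hand.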
Many Filter API functions return a KVErrorCode. If this is KVError_Success, then the function succeeded. Otherwise, the error code indicates the problem that occurred. Future examples in this tutorial may assume that you check the error code after each function call.
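One way to keep error checking compact is a small macro that reports a failing call and returns its code. The sketch below uses stand-in names (DemoErrorCode, DEMO_CHECK, demoStep, demoPipeline) so that it compiles without the SDK headers; in real code you might apply the same idea to KVErrorCode and KVError_Success.

```c
#include <assert.h>
#include <stdio.h>

/* Stand-in for an SDK error enum; only "one value means success" is assumed. */
typedef enum { Demo_Success = 0, Demo_General = 1 } DemoErrorCode;

/* Report a failing call and return its error code from the enclosing function. */
#define DEMO_CHECK(call)                                        \
    do {                                                        \
        DemoErrorCode demoErr_ = (call);                        \
        if (demoErr_ != Demo_Success) {                         \
            fprintf(stderr, "%s failed with code %d\n",         \
                    #call, (int)demoErr_);                      \
            return demoErr_;                                    \
        }                                                       \
    } while (0)

/* Hypothetical API call: counts invocations and fails on request. */
static DemoErrorCode demoStep(int shouldFail, int* callCount)
{
    assert(callCount != NULL);
    ++*callCount;
    return shouldFail ? Demo_General : Demo_Success;
}

/* Runs three steps and stops at the first failure. */
static DemoErrorCode demoPipeline(int failAtSecond, int* callCount)
{
    DEMO_CHECK(demoStep(0, callCount));
    DEMO_CHECK(demoStep(failAtSecond, callCount));
    DEMO_CHECK(demoStep(0, callCount));
    return Demo_Success;
}
```

Because the macro returns early, a function that holds resources (for example an open document) would need to release them before returning, so this style suits functions that own nothing or that are wrapped by a caller that performs the clean-up.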
TIP: Performance considerations:
-
Session Lifetime. You can process multiple files in a single session, which might improve performance by reducing costs associated with start-up and shutdown.
-
Multi-threading. To maximize throughput when processing multiple files, you can call File Content Extraction from multiple threads. All File Content Extraction functions are thread-safe when called in this manner. Each thread using File Content Extraction must create its own session by calling fpInit(). You must not share filter sessions between threads.
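The session lifetime tip can be sketched as a batch loop that pays the start-up and shutdown cost once. The functions demoInit, demoProcessFile, and demoShutdown below are hypothetical stand-ins for session creation, per-file work, and session shutdown. They only count calls, so the sketch shows the control flow rather than real SDK behavior.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    int initCalls;
    int shutdownCalls;
    int filesProcessed;
} DemoStats;

/* Hypothetical stand-in for creating a session. */
static int demoInit(DemoStats* stats) { stats->initCalls++; return 0; }

/* Hypothetical stand-in for processing one file; fails on an empty path. */
static int demoProcessFile(DemoStats* stats, const char* path)
{
    if (path == NULL || path[0] == '\0') return 1;
    stats->filesProcessed++;
    return 0;
}

/* Hypothetical stand-in for shutting the session down. */
static void demoShutdown(DemoStats* stats) { stats->shutdownCalls++; }

/* Returns the number of files that failed, or -1 if start-up failed. */
static int demoProcessBatch(DemoStats* stats, const char* const* paths, size_t count)
{
    assert(stats != NULL && (paths != NULL || count == 0));
    if (demoInit(stats) != 0) return -1;

    int failures = 0;
    for (size_t i = 0; i < count; ++i) {
        /* Record the failure and move on to the next file. */
        if (demoProcessFile(stats, paths[i]) != 0) ++failures;
    }

    demoShutdown(stats); /* one shutdown for the whole batch */
    return failures;
}
```

In a multi-threaded program, each worker thread would run its own copy of this loop with its own session, in line with the rule that sessions are not shared between threads.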
TIP: Security considerations:
-
Privilege Reduction. By default, File Content Extraction performs most of its operations out-of-process, creating a separate process to parse file data. This protects your main application from the effects of rare problems like memory leaks or crashes. You can include additional protection by running File Content Extraction with reduced privileges. See Run Filter with Minimal Privileges.
-
Temp Directory. While processing, File Content Extraction might place sensitive data in the temporary directory. You might want to consider protecting the temporary directory. See Protect the Temporary Directory.
Open a Document
File Content Extraction functions operate on a KVDocument object, which is a representation of a document that is agnostic as to the data source. You can create a KVDocument from a file on disk by calling fpOpenDocumentFromFile(). You must close the KVDocument after use.
KVDocument document = NULL;
error = filter.fpOpenDocumentFromFile(session, pathToInputFile, &document);
// Pass document to File Content Extraction functions
filter.fpCloseDocument(document);
Filter a Document
Filtering text from a document is one of the most important features that File Content Extraction provides. You can filter text to an output file by calling fpFilterToFile().
error = filter.fpFilterToFile(document, pathToOutputFile);
TIP: Partial Filtering. The fpFilterToFile() function filters the entire file in one go, but you might want to filter only part of the file, or filter the file in chunks. The advanced tutorial covers how to do partial filtering.
TIP: Mail Files. Mail files, such as EML or MSG, are considered a form of container, and you cannot filter them directly. This tutorial covers how to filter mail files later, in Extract Subfiles.
Determine the Format of a Document
File Content Extraction enables you to reliably determine the file format of a huge range of documents. It does this by analyzing the internal structure and content of the file, rather than relying on file names or extensions. Detection prioritizes both accuracy and speed, only processing as much of the file as necessary to rule out false positives.
File format detection functionality is exposed through the API function fpGetDocInfo().
ADDOCINFO adInfo;
error = filter.fpGetDocInfo(document, &adInfo);
if (error == KVError_Success)
{
printf("The file format for this document is: %d\n", adInfo.eFormat);
}
This example prints the file format to the console as an integer. You can look up this value in the list of File Content Extraction supported file formats. The filter_tutorial sample program includes a function named outputDocInfo (in print.c) that demonstrates how to output more detailed information including a human-friendly description and MIME type (if available).
TIP: Source Code Identification. File Content Extraction can optionally detect source code, attempting to identify the programming language that it is written in. You can learn more in Source Code Identification.
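To illustrate why content-based detection is more reliable than a file extension, the sketch below checks the leading "magic" bytes of a buffer for three well-known signatures (PDF, ZIP, and PNG). It is a deliberately simplified illustration of the general idea. DemoFormat and demoSniffFormat are hypothetical names, and this is not how File Content Extraction performs detection, which analyzes far more of the file structure.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum {
    DemoFormat_Unknown,
    DemoFormat_PDF,
    DemoFormat_ZIP,
    DemoFormat_PNG
} DemoFormat;

/* Classify a buffer by its leading signature bytes, ignoring any file name. */
static DemoFormat demoSniffFormat(const unsigned char* data, size_t length)
{
    static const unsigned char pdfMagic[] = { '%', 'P', 'D', 'F', '-' };
    static const unsigned char zipMagic[] = { 'P', 'K', 0x03, 0x04 };
    static const unsigned char pngMagic[] = { 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A };

    assert(data != NULL || length == 0);

    if (length >= sizeof pdfMagic && memcmp(data, pdfMagic, sizeof pdfMagic) == 0)
        return DemoFormat_PDF;
    if (length >= sizeof zipMagic && memcmp(data, zipMagic, sizeof zipMagic) == 0)
        return DemoFormat_ZIP;
    if (length >= sizeof pngMagic && memcmp(data, pngMagic, sizeof pngMagic) == 0)
        return DemoFormat_PNG;
    return DemoFormat_Unknown;
}
```

A check like this already shows the limits of signatures alone: many formats (for example Office Open XML documents) are ZIP archives, so telling them apart requires looking inside the container, which is the kind of deeper analysis the paragraph above describes.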
Access Metadata
File formats can contain a variety of different metadata, and File Content Extraction makes it easy to access all of this information. File Content Extraction retrieves metadata from various sources in a file, such as:
-
Format-specific standard metadata
-
User-provided custom metadata
-
Exif tags
-
XMP elements
-
MIP Labels
You can retrieve metadata by calling fpGetMetadataList(). This function fills out the KVMetadataList structure, which you must free by using its fpFree function.
const KVMetadataList* metadataList = NULL;
error = filter.fpGetMetadataList(document, &metadataList);
//Iterate through metadata
metadataList->fpFree(metadataList);
To retrieve individual metadata elements, iterate through the metadata list using the fpGetNext() function in KVMetadataList, which fills out the KVMetadataElement structure. The information that this structure returns is valid only while the session is still alive, and becomes invalid after you call fpFree(). The end of the list is indicated by the retrieved element being NULL.
while (1)
{
    const KVMetadataElement* element = NULL;
    error = metadataList->fpGetNext(metadataList, &element);
    if (error != KVError_Success)
    {
        // Handle error, then stop iterating
        break;
    }
    if (!element)
    {
        // End of the list
        break;
    }
    // Process metadata element
}
Interpret a Metadata Element
Each metadata element is conceptually represented as a key-value pair, where pKey is the name of the metadata key, and pValue is the value of that piece of metadata. To know the type of the metadata object the pValue points to, you must first consult the eType member. Strings are output in the character set that you requested in the call to fpInit().
switch (element->eType)
{
    case KVMetadataValue_Bool:
    {
        BOOL value = *(BOOL*)element->pValue;
        // Process Bool value
        break;
    }
    case KVMetadataValue_Int64:
    {
        int64_t value = *(int64_t*)element->pValue;
        // Process Int64 value
        break;
    }
    case KVMetadataValue_Double:
    {
        double value = *(double*)element->pValue;
        // Process Double value
        break;
    }
    case KVMetadataValue_WinFileTime:
    {
        int64_t value = *(int64_t*)element->pValue;
        // Process WinFileTime value
        break;
    }
    case KVMetadataValue_String:
    {
        KVString value = *(KVString*)element->pValue;
        // Process String value
        break;
    }
    case KVMetadataValue_Binary:
    {
        KVBinaryData value = *(KVBinaryData*)element->pValue;
        // Process Binary value
        break;
    }
    case KVMetadataValue_MIPLabel:
    {
        KVMIPLabel value = *(KVMIPLabel*)element->pValue;
        // Process MIPLabel value
        break;
    }
    default:
        // Handle unrecognised type
        break;
}
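When processing a KVMetadataValue_WinFileTime value, you might want to convert it to Unix time. The sketch below assumes that the value follows the Windows FILETIME convention (a count of 100-nanosecond intervals since 1601-01-01 UTC). That assumption is based on the name of the type and should be confirmed against the SDK reference. The helper demoFileTimeToUnixSeconds is a hypothetical name and not part of the API.

```c
#include <assert.h>
#include <stdint.h>

/* FILETIME counts 100 ns ticks, so there are 10,000,000 ticks per second. */
#define DEMO_TICKS_PER_SECOND   10000000LL
/* Seconds between 1601-01-01 and 1970-01-01 (the Unix epoch). */
#define DEMO_EPOCH_DIFF_SECONDS 11644473600LL

/* Convert a non-negative FILETIME-style tick count to whole Unix seconds. */
static int64_t demoFileTimeToUnixSeconds(int64_t fileTime)
{
    assert(fileTime >= 0);
    return fileTime / DEMO_TICKS_PER_SECOND - DEMO_EPOCH_DIFF_SECONDS;
}
```

The result is negative for dates before 1970, and sub-second precision is discarded. Keep the remainder of the division if you need it.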
Standardized Metadata Elements
Different file formats can store the same piece of information in different ways. For example, one file format might call the width of the image width, another image_width, and another x_size. This behavior is often unhelpful, because you then need to maintain a list of fields that correspond to a particular piece of information. File Content Extraction solves this problem by standardizing certain metadata fields. See Field Standardization.
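The maintenance burden described above can be made concrete with the kind of alias table you would otherwise have to write yourself. This is a sketch of the problem, not of the SDK's standardization feature. The canonical name "ImageWidth" and the names DemoFieldAlias and demoCanonicalName are hypothetical; only the three aliases come from the example in the text.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    const char* alias;     /* name used by a particular file format */
    const char* canonical; /* single name the application wants to see */
} DemoFieldAlias;

/* A hand-maintained table; every new format can add more rows. */
static const DemoFieldAlias demoAliases[] = {
    { "width",       "ImageWidth" },
    { "image_width", "ImageWidth" },
    { "x_size",      "ImageWidth" },
};

/* Return the canonical name for a key, or the key itself if it is unknown. */
static const char* demoCanonicalName(const char* key)
{
    assert(key != NULL);
    for (size_t i = 0; i < sizeof demoAliases / sizeof demoAliases[0]; ++i) {
        if (strcmp(key, demoAliases[i].alias) == 0) {
            return demoAliases[i].canonical;
        }
    }
    return key;
}
```

A table like this has to grow with every format and field you support, which is the work that standardized metadata fields are intended to remove.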
Extract Subfiles
File Content Extraction allows you to access the subfiles of a document, from both pure containers (such as ZIP or TAR files) and from documents embedded inside other files (such as an Excel spreadsheet embedded in a Word document).
You must open a container file before you can access its subfiles. Open the container by calling fpOpenFileFromFilterSession(). This function creates a file handle that you can use with the other functions in the extract interface. You must close this handle after use.
void* fileHandle = NULL;
KVOpenFileArgRec openArg;
KVStructInit(&openArg);
openArg.extractDir = "path/to/extract/dir";
openArg.document = document;
error = extract.fpOpenFileFromFilterSession(session, &openArg, &fileHandle);
//Use File Handle
extract.fpCloseFile(fileHandle);
You can get information about the container itself by using the function fpGetMainFileInfo(). Most importantly, this tells you the number of subfiles. You must free this structure after use.
KVMainFileInfo fileInfo = NULL;
error = extract.fpGetMainFileInfo(fileHandle, &fileInfo);
//Use main file info
extract.fpFreeStruct(fileHandle, fileInfo);
Extract Subfiles
Before you extract a subfile, you can first get some information about it. You get this information by calling the fpGetSubFileInfo() function, using the index to identify the subfile. You must free this structure after use.
for (int ii = 0; ii < fileInfo->numSubFiles; ++ii)
{
    KVSubFileInfo subFileInfo = NULL;
    error = extract.fpGetSubFileInfo(fileHandle, ii, &subFileInfo);
    // Use sub file info
    extract.fpFreeStruct(fileHandle, subFileInfo);
}
After you have this subfile info, you can use it to construct the necessary arguments for extraction.
KVSubFileExtractInfo extractInfo = NULL;
KVExtractSubFileArgRec extractArg;
KVStructInit(&extractArg);
if (subFileInfo->subFileType == KVSubFileType_Folder || (subFileInfo->infoFlag & KVSubFileInfoFlag_External))
{
    // The subfile represents a folder or a reference to an external resource
    // You might want to ignore these subfiles because there is nothing to extract
}
extractArg.index = index;
extractArg.filePath = subFileInfo->subFileName;
extractArg.extractionFlag =
    KVExtractionFlag_CreateDir |
    KVExtractionFlag_Overwrite |
    KVExtractionFlag_GetFormattedBody |
    KVExtractionFlag_SanitizeAbsolutePaths;
error = extract.fpExtractSubFile(fileHandle, &extractArg, &extractInfo);
//Do more processing, such as filtering the sub file
extract.fpFreeStruct(fileHandle, extractInfo);
The fpExtractSubFile() function fills out the KVSubFileExtractInfo pointer, which tells you more about what the function actually did. For example, it tells you the location it extracted the file to.
TIP: Mail Files. File Content Extraction treats mail files as containers, where the first subfile is the contents of the mail file, and subsequent subfiles are the attachments.
NOTE: Security. KVExtractionFlag_SanitizeAbsolutePaths mitigates against certain path traversal attacks. See Sanitize Absolute Paths.
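As an additional, application-level precaution, you might validate a subfile name yourself before using it as an output path. The sketch below rejects absolute paths, drive-letter paths, and any ".." component. The function demoIsSafeRelativePath is a hypothetical helper. It is not part of the SDK and does not describe what KVExtractionFlag_SanitizeAbsolutePaths does internally.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Return 1 if path is a non-empty relative path with no ".." component. */
static int demoIsSafeRelativePath(const char* path)
{
    assert(path != NULL);

    if (path[0] == '\0') return 0;                        /* empty name */
    if (path[0] == '/' || path[0] == '\\') return 0;      /* absolute path */
    if (isalpha((unsigned char)path[0]) && path[1] == ':') return 0; /* drive letter */

    /* Walk each component, treating both '/' and '\' as separators. */
    const char* component = path;
    for (const char* p = path; ; ++p) {
        if (*p == '/' || *p == '\\' || *p == '\0') {
            size_t length = (size_t)(p - component);
            if (length == 2 && component[0] == '.' && component[1] == '.') {
                return 0;                                 /* parent reference */
            }
            if (*p == '\0') break;
            component = p + 1;
        }
    }
    return 1;
}
```

A name such as "a..b/c" is accepted because ".." only counts when it is a whole path component. A stricter policy (for example, resolving the final path and checking that it stays under the extraction directory) may be appropriate for untrusted input.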
Retrieve Subfile Metadata
A file that you extract from a container can have its own metadata. In addition, containers sometimes contain metadata about their subfiles. A common example is mail containers (like PST files) that contain metadata about the messages that are stored. You can retrieve metadata that is stored in a container but describes a subfile by calling the function fpGetSubfileMetadataList(). This function fills out the same KVMetadataList structure that you used in Access Metadata, and you can handle it in the same way. You must initialize the KVGetSubfileMetadataListArgRec structure by using KVStructInit().
const KVMetadataList* metadataList = NULL;
KVGetSubfileMetadataListArgRec metaArgs;
KVStructInit(&metaArgs);
metaArgs.index = index;
metaArgs.trgCharset = KVCS_UTF8;
error = extract.fpGetSubfileMetadataList(fileHandle, &metaArgs, &metadataList);
//Process metadata using metadataList->fpGetNext()
metadataList->fpFree(metadataList);
Configure the Session
File Content Extraction provides many options that control what text to output, and how to convert or display that text. A common requirement of File Content Extraction is to output as much text as possible, including text that is not normally visible in a document, such as hidden cells or slides, or ancillary text like comments or notes. You can filter this text by enabling the hidden text option with fpSetConfig().
error = filter.fpSetConfig(session, KVFLT_SHOWHIDDENTEXT, TRUE, NULL);
By default, File Content Extraction does not consider embedded images (for example, images embedded in a word processor document) to be subfiles. You can also enable image extraction by calling fpSetConfig().
error = filter.fpSetConfig(session, KVFLT_EXTRACTIMAGES, TRUE, NULL);
Build and Link Your Program
To create a program that uses File Content Extraction, you need to install a supported compiler, and use it to build and link your program.
NOTE: When building with the Visual Studio compiler, you must open the command prompt that matches the installed version. For example, if you install the WINDOWS_X86_64 version of File Content Extraction, use the x64 Native Tools Command Prompt.
The easiest way to get access to File Content Extraction functionality is to link against the kvfilter shared library and place your executable in the same directory as the File Content Extraction binaries (that is, the directory containing kvfilter.so or kvfilter.dll).
TIP: Loading shared libraries can expose your application to attacks. For advice on avoiding DLL preloading attacks, see Security Best Practices.
-
Linking using Visual Studio
On Windows, link against the import library for kvfilter.dll. This library is provided as part of the Filter SDK, under {platform}/lib/kvfilter.lib.
-
Linking using GCC
On Linux, link against kvfilter.so, and also pass the -rpath $ORIGIN option to the linker. For example:
%KEYVIEW_HOME%/LINUX_X86_64/bin/kvfilter.so -Wl,-rpath,'$ORIGIN'
NOTE: When you call it from inside a makefile, you might need to escape $ORIGIN to $$ORIGIN.
-
Linking using Clang
On macOS, link against kvfilter.so, and also pass the -rpath @loader_path option to the linker. For example:
%KEYVIEW_HOME%/MACOS_X86_64/bin/kvfilter.so -Wl,-rpath,@loader_path
Build the Sample Program
The filter_tutorial directory includes sample makefiles for all supported platforms.
Before you attempt to build the filter_tutorial sample program, open the file configuration.h and enter your license key and the path to your File Content Extraction bin directory:
-
Replace the value of YOUR_LICENSE_KEY with your license.
-
Change the YOUR_BIN_DIR variable to the location of the File Content Extraction bin directory.