C++ API Programming Tutorial

The File Content Extraction Filter SDK allows you to embed File Content Extraction functionality into other services.

The following tutorial describes how to get started with the basics of filtering, metadata extraction, and handling subfiles with the C++ API.

Creating a Filter Session

To use the C++ Filter SDK, link against the library that you built in Build the C++ API, and then create a session:

#include "Keyview_FilterSDK.hpp"
auto session = keyview::Session{license, bin_path};

license holds your license information, and bin_path should be a std::string that holds the location of the Filter SDK binaries.

When you create a session, consider how your choices affect performance and security.

Performance considerations:

  • Session Lifetime. You can process multiple files in a single session, which might improve performance by reducing costs associated with start-up and shutdown.

  • Multi-threading. To maximize throughput when processing multiple files, you can call File Content Extraction from multiple threads. All File Content Extraction functions are thread-safe when called in this manner. Each thread that uses File Content Extraction must create its own session. You must not share filter sessions between threads.
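The two performance points above can be combined in a small worker pool. The following is a minimal sketch rather than part of the SDK: process_in_parallel, make_session, and process_file are hypothetical names. Each worker thread creates one session through the make_session callable, reuses it for every file that it claims, and never shares it with another thread.

```cpp
#include <atomic>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

// Hypothetical helper, not an SDK function.
// Runs process_file(session, path) for every path, using n_threads workers.
// make_session is called once per worker thread, so no session is ever
// shared between threads. Both callables are placeholders for your own code.
template <typename MakeSession, typename ProcessFile>
void process_in_parallel(const std::vector<std::string>& paths,
                         std::size_t n_threads,
                         MakeSession make_session,
                         ProcessFile process_file)
{
    std::atomic<std::size_t> next{0};  // index of the next unclaimed file
    std::vector<std::thread> workers;

    for (std::size_t t = 0; t < n_threads; ++t)
    {
        workers.emplace_back([&] {
            // Each thread owns its session for the thread's whole lifetime,
            // which spreads the start-up cost across many files.
            auto session = make_session();
            for (std::size_t i = next.fetch_add(1); i < paths.size();
                 i = next.fetch_add(1))
            {
                process_file(session, paths[i]);
            }
        });
    }

    for (auto& worker : workers)
        worker.join();
}
```

With the SDK, make_session might be a lambda such as [&] { return keyview::Session{license, bin_path}; } and process_file might open and filter the document. Both callables run concurrently on several threads, so they must be safe to call in that way, and process_file should catch its own exceptions, because an exception that escapes a std::thread terminates the program.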

Security considerations:

  • Privilege Reduction. By default, File Content Extraction performs most of its operations out-of-process, creating a separate process to parse file data. This behavior protects your main application from the effects of rare problems like memory leaks or crashes. You can include additional protection by running File Content Extraction with reduced privileges. See Run Filter with Minimal Privileges.

  • Temp Directory. While processing, File Content Extraction might place sensitive data in the temporary directory. You might want to consider protecting the temporary directory. See Protect the Temporary Directory.

Opening a Document

A Document object is a representation of a document that is agnostic as to where the document lives. You can create a Document from a file on disk by using the open() method on your Session object.

auto doc = session.open("InputFile.docx");

TIP: You can also create a Document from a stream-like object. See Using a Custom Input Stream.

Filtering a File

One of the most important features of File Content Extraction is filtering text from a document. This section shows you how to get text from a Document object, and output it to a file on disk.

Filtering Text

You can write the text filtered from a document to an output file by using the filter() method on a Document object.

doc.filter("output.txt");

TIP: Partial Filtering. When you pass an output path to the filter() function, this function filters the entire file in one go. In some cases you might want to filter only part of the file, or filter the file in chunks. In this case, you can filter text to a stream. For more information, see Using a Custom Input Stream.

NOTE: Mail files, such as EML or MSG, are considered a form of container, and you cannot filter them directly. See Extracting Subfiles.

Filtering Hidden Information

File Content Extraction provides a number of options that control what text to output, and how to convert or display that text.

It is common to require File Content Extraction to output as much text as possible, including text that is not normally visible in the document, such as hidden cells or slides, and ancillary text like comments or notes. You can display this text by enabling the hidden text option using the config() method of the Session object.

session.config().hidden_text(true);

Detecting and Using File Format Information

File Content Extraction enables you to reliably determine the file format of a huge range of documents. It does this by analyzing the internal structure and content of the file, rather than relying on file names or extensions. Detection prioritizes both accuracy and speed, only processing as much of the file as necessary to rule out false positives.

You can find the format of a document by using the info method on a Document object.

auto doc_info = doc.info();

os << "\nFile Format Information\n"
    << "Format:\t" << static_cast<int>(doc_info.format()) << "\n"
    << "Description:\t" << doc_info.description() << "\n"
    << "Version:\t" << doc_info.version() << "\n"
    << "Category:\t" << static_cast<int>(doc_info.category()) << "\n"
    << "Category Name:\t" << doc_info.category_name() << "\n"
    << "Encrypted:\t" << std::boolalpha << doc_info.encrypted() << "\n\n";

TIP: File Content Extraction can optionally detect source code, attempting to identify the programming language that it is written in. For more information, see Source Code Identification.

Retrieving Metadata

File formats can contain a variety of different metadata, and File Content Extraction makes it easy to access all of this information. File Content Extraction can retrieve metadata from various sources in a file, such as:

  • Format-specific standard metadata

  • User-provided custom metadata

  • Exif tags

  • XMP elements

  • MIP Labels

You can access a document’s metadata by using the metadata method on a Document object. For example:

for(const auto& [key, elem] : doc.metadata())
{
    std::cout << key << ": " << elem.convert_to_string() << std::endl;
}

Interpreting a Metadata Element

Each metadata element is conceptually a key-value pair. The key is available both from the Metadata iterable and as a member of the MetadataElement class. To determine the type of the value, first call the value_type() function, and then call the matching *_value() function to access the data. Strings are output in the Session's target encoding.

switch (element.value_type())
{
case keyview::MetadataValueType::Bool:
    os << std::boolalpha << element.bool_value();
    break;
case keyview::MetadataValueType::Int64:
    os << element.int64_value();
    break;
case keyview::MetadataValueType::Double:
    os << element.double_value();
    break;
case keyview::MetadataValueType::DateTime:
{
    auto secondsSinceUnixEpoch = element.datetime_value();
    os << formatDate(secondsSinceUnixEpoch);
    break;
}
case keyview::MetadataValueType::TargetEncodingString:
    os << element.string_value();
    break;
case keyview::MetadataValueType::Binary:
{
    auto binary = element.binary_value();
    os << std::string(reinterpret_cast<const char*>(binary.data()), binary.size());
    break;
}
case keyview::MetadataValueType::MIPLabel:
    os << element.mip_label_value();
    break;
default:
    os << "Unknown Value";
    break;
}
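The DateTime case above calls formatDate, which the sample does not define. The following is one possible sketch of such a helper. It assumes that datetime_value() yields whole seconds since the Unix epoch, as the variable name in the sample suggests, and that an ISO 8601 UTC string is an acceptable output format. Both are assumptions rather than SDK behavior.

```cpp
#include <cstdint>
#include <ctime>
#include <string>

// Hypothetical helper (not part of the SDK): formats seconds since the
// Unix epoch as an ISO 8601 UTC timestamp such as "1970-01-01T00:00:00Z".
std::string formatDate(std::int64_t secondsSinceUnixEpoch)
{
    const std::time_t t = static_cast<std::time_t>(secondsSinceUnixEpoch);

    // std::gmtime returns a pointer to shared static storage, so it is
    // not safe to call from several threads at the same time.
    const std::tm* utc = std::gmtime(&t);
    if (utc == nullptr)
        return "invalid date";

    char buffer[32];
    const std::size_t n =
        std::strftime(buffer, sizeof buffer, "%Y-%m-%dT%H:%M:%SZ", utc);
    return std::string(buffer, n);
}
```

If you format dates on several threads, you might prefer a reentrant alternative such as gmtime_r on POSIX systems or gmtime_s on Windows.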

TIP: You can also process metadata by using a visitor pattern. See The MetadataVisitorBase Class.

Standardized Metadata Elements

Different file formats can store the same piece of information in different ways. For example, one file format might call the width of the image width, another image_width, and another x_size. This behavior is often unhelpful, because you then need to maintain a list of fields that correspond to a particular piece of information. File Content Extraction solves this problem by standardizing certain metadata fields. See Field Standardization.

Extracting Subfiles

You can iterate over subfile information using the subfiles method on a Document object. Each element returned by the iterator contains information about the subfile, and a method that you can use to extract it.

for (const auto& subfile : doc.subfiles())
{
    subfile.extract(output_path);
}

CAUTION: The Subfile object provides the original filename of the subfile through the rawname() function. For security, you must not use this file name directly as the output path, because using the raw file name would make your application vulnerable to path traversal attacks. For more information, see The Subfile Class extract function.
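One way to avoid using the raw name as a path is to keep only its final component and place it under a directory that you control. The following is a minimal sketch of that idea that uses only std::filesystem. The safe_output_path helper and the index prefix are illustrative assumptions, not part of the SDK.

```cpp
#include <algorithm>
#include <cstddef>
#include <filesystem>
#include <string>

namespace fs = std::filesystem;

// Hypothetical helper (not part of the SDK): maps an untrusted subfile
// name to a path directly inside output_dir.
fs::path safe_output_path(const fs::path& output_dir,
                          std::string raw_name,
                          std::size_t index)
{
    // Treat both separator styles as separators on every platform.
    std::replace(raw_name.begin(), raw_name.end(), '\\', '/');

    // Keep only the last path component, which discards any directories,
    // drive letters, and leading "../" sequences.
    std::string leaf = fs::path(raw_name).filename().string();
    if (leaf.empty() || leaf == "." || leaf == "..")
        leaf = "subfile";

    // Prefix the subfile index so that two subfiles with the same name
    // do not overwrite each other.
    return output_dir / (std::to_string(index) + "_" + leaf);
}
```

You could then pass the result to extract() in place of a path built from the raw name. This sketch does not handle characters or device names that are reserved on a particular file system, so you might need additional checks for your platform.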

TIP: File Content Extraction treats mail files as containers, where the first subfile is the contents of the mail file, and subsequent subfiles are the attachments.

By default, File Content Extraction does not extract images when extracting subfiles. You can enable image extraction by using the extract_images() function of the session configuration.

session.config().extract_images(true);

Retrieving Subfile Metadata

You can retrieve metadata for a particular subfile by using the metadata() function of the Subfile class. This function returns the same Metadata class as the metadata() function of the Document class, and you can handle it in the same way.

Although the class returned is the same, the metadata provided by Document and Subfile is different. Document metadata represents information that the document contains about itself. For instance, a raster image file contains metadata recording the image width and height; a word processing document might contain metadata recording the document author and title. Subfile metadata is information that the container file stores about its subfiles. For instance, a mail format might contain fields like "To" or "From"; a ZIP archive might contain comments associated with the subfiles.

for(const auto& [key, elem] : subfile.metadata())
{
    std::cout << key << ": " << elem.convert_to_string() << std::endl;
}

Exceptions

All the C++ API methods can throw exceptions. File Content Extraction errors take the form of an instance of keyview_error, which is itself derived from std::exception. The exceptions that can be thrown are defined in Keyview_Errors.hpp.

In application code, it is possible to catch and correctly handle many of these exceptions. For example, while processing many files, a format_not_supported_error might be thrown. The correct behavior for an application might be to skip this file, or to add it to a list of files that could not be recognized. Similarly, if a password_protected_error is thrown and caught, an application might prompt a user to enter a password and then retry.
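The skip-and-continue behavior described above could be sketched as follows. So that the example compiles without the SDK, it defines stand-in exception types in a standin namespace that borrow the names used in this section. In real code you would remove that namespace and catch the types declared in Keyview_Errors.hpp instead. BatchReport and process_batch are hypothetical names.

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Stand-ins so that the sketch compiles without the SDK. They are NOT
// the SDK's definitions; use the types from Keyview_Errors.hpp instead.
namespace standin
{
struct keyview_error : std::runtime_error
{
    explicit keyview_error(const std::string& msg) : std::runtime_error(msg) {}
};
struct format_not_supported_error : keyview_error
{
    explicit format_not_supported_error(const std::string& msg) : keyview_error(msg) {}
};
struct password_protected_error : keyview_error
{
    explicit password_protected_error(const std::string& msg) : keyview_error(msg) {}
};
}  // namespace standin

// Hypothetical report type: which files fell into which outcome.
struct BatchReport
{
    std::vector<std::string> processed;
    std::vector<std::string> unsupported;     // skipped, format not supported
    std::vector<std::string> needs_password;  // candidates for a retry
    std::vector<std::string> failed;          // any other SDK error
};

// Calls process_one(path) for each path and sorts the outcomes.
template <typename ProcessOne>
BatchReport process_batch(const std::vector<std::string>& paths,
                          ProcessOne process_one)
{
    BatchReport report;
    for (const auto& path : paths)
    {
        try
        {
            process_one(path);
            report.processed.push_back(path);
        }
        // Catch the specific errors before the general base class.
        catch (const standin::format_not_supported_error&)
        {
            report.unsupported.push_back(path);
        }
        catch (const standin::password_protected_error&)
        {
            report.needs_password.push_back(path);
        }
        catch (const standin::keyview_error& e)
        {
            report.failed.push_back(path + ": " + e.what());
        }
    }
    return report;
}
```

Exceptions that do not derive from the SDK error base class are deliberately left to propagate, so that problems such as std::bad_alloc are not silently recorded as per-file failures.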