C++ API Advanced Programming Tutorial

Using a Custom Input Stream

Until now, you have worked with File Content Extraction operating on files on disk. In some cases you might want to get File Content Extraction to operate on streams instead. For example, you might want to use File Content Extraction in stream mode when:

  • The file you are dealing with is in-memory, because it was output by another operation. You can use a custom input stream to read the file directly from memory, instead of writing it out to a file first.

  • You want to get the filtered text in small chunks instead of all at once. This approach has the following advantages:

    • You can process the output data in parallel with filtering the rest of the text. Parallel processing can minimize the time it takes to filter and process the text.

    • You can choose to stop filtering when the application has all the text it needs, which can save valuable resources. This approach is called partial filtering.

  • You want to extract subfiles into memory, instead of storing them on disk.

  • You do not have the whole file available to begin with. In this case, you can use a custom input stream to retrieve only the required parts of the file as File Content Extraction requests them.

You can implement a custom stream by creating a class that conforms to the Input template interface described in Keyview_IO.hpp. To illustrate this, the following example defines a very simple stream that forwards to a std::ifstream.

#include <cstdint>
#include <filesystem>
#include <fstream>

class InputStream
{
public:
    InputStream(const std::filesystem::path& input_path) :
        is(input_path, std::ios::binary)
    {
    }

    // Input types must have a read method which takes 2 arguments:
    //    char* ptr            A pointer to a block of memory with a size of at least count bytes, to read into.
    //    int64_t count        The number of input bytes to read.
    // and returns:
    //    int64_t            The number of bytes successfully read.
    int64_t read(char* ptr, int64_t count)
    {
        is.read(ptr, count);
        return is.gcount();
    }

    // Input types must have a seek method which takes 2 arguments:
    //    int64_t offset    Number of bytes to offset from origin.
    //    int origin        SEEK_SET = 0, SEEK_CUR = 1, or SEEK_END = 2.
    // and returns:
    //    int                0 if the seek was successful, otherwise a non-zero value.
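    int seek(int64_t offset, int origin)
    {
        // Map the C-style origin onto its iostream equivalent. std::ios_base::seekdir
        // is not guaranteed to be implicitly convertible from int.
        std::ios_base::seekdir dir;
        switch (origin)
        {
            case 0: dir = std::ios_base::beg; break;
            case 1: dir = std::ios_base::cur; break;
            case 2: dir = std::ios_base::end; break;
            default: return 1;
        }

        // If the stream has ended up off the end of the file, it will have set eofbit,
        // which needs to be cleared now the stream is seeking somewhere else.
        is.clear();
        is.seekg(offset, dir);
        return is.good() ? 0 : 1;
    }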

    // Input types must have a tell method which takes 0 arguments and returns:
    //    int64_t            The current position in the input file/stream.
    int64_t tell()
    {
        return is.tellg();
    }

private:
    std::ifstream is;
};
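
The same three methods can also serve a file that is already in memory. The following is a minimal sketch of that case, assuming only the read, seek, and tell requirements described in the comments above. The class name MemoryInput and its members are hypothetical and are not part of Keyview_IO.hpp.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <string>
#include <utility>

// Hypothetical in-memory input type (illustrative only, not part of the SDK).
class MemoryInput
{
public:
    explicit MemoryInput(std::string data) : data_(std::move(data)) {}

    // Copies up to count bytes from the current position into ptr
    // and returns the number of bytes copied.
    int64_t read(char* ptr, int64_t count)
    {
        const int64_t size = static_cast<int64_t>(data_.size());
        if (count <= 0 || pos_ >= size)
        {
            return 0;
        }
        const int64_t n = std::min(count, size - pos_);
        std::memcpy(ptr, data_.data() + pos_, static_cast<std::size_t>(n));
        pos_ += n;
        return n;
    }

    // origin: SEEK_SET = 0, SEEK_CUR = 1, or SEEK_END = 2.
    // Returns 0 if the seek was successful, otherwise a non-zero value.
    int seek(int64_t offset, int origin)
    {
        int64_t base = 0;
        switch (origin)
        {
            case 0: base = 0; break;
            case 1: base = pos_; break;
            case 2: base = static_cast<int64_t>(data_.size()); break;
            default: return 1;
        }
        const int64_t target = base + offset;
        if (target < 0)
        {
            return 1; // Cannot seek before the start of the data.
        }
        pos_ = target;
        return 0;
    }

    // Returns the current position in the data.
    int64_t tell()
    {
        return pos_;
    }

private:
    std::string data_;
    int64_t pos_ = 0;
};
```

This sketch allows seeking past the end of the data, in which case the next read returns 0, which mirrors how file streams typically behave.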

Filtering Text Using Streams

For some use cases, you might not need all the text from the file, or you might want to analyze the text in small pieces. By requesting text in chunks, a multi-threaded application can often filter and process all the text from a file in less time: one thread passes each chunk to downstream processing on another thread, and then continues to get the next chunk from the stream.

You might also want to stop processing before you have filtered the entire file, for example because you have already found a search term, or because you have hit a resource threshold.

Using std::istream Output

The Document text() method returns a non-seekable std::istream that provides access to the filtered text. You can read from the istream with the standard C++ stream functions, extracting as much text as you need. The istream refills its internal buffer by processing more of the input file, so the rest of the file is not processed until you ask for more text.

Do not call any other Filter SDK functions between read operations on the returned istream, because doing so might prevent text filtering from resuming correctly.

for (std::string line; std::getline(document.text(), line);)
{
    // std::getline discards the line delimiter, so write it back out.
    std::cout << line << '\n';
}

Implementing a Custom Output Stream

For a greater degree of control over how the output stream is handled, you can implement a custom output stream by creating a class that conforms to the Output template interface described in Keyview_IO.hpp.

Here is an example Output class, also provided in Keyview_IO.hpp, which simply writes to stdout. This example demonstrates how to implement the required write() function.

class OutputStdout
{
    public:
    //Output types must have a write method which takes 2 arguments:
    //  const char* ptr    A pointer to the output text to be written.
    //  int64_t count        The number of bytes to write.
    //and returns:
    //  int64_t            The number of bytes successfully written.
    int64_t write(const char* ptr, int64_t count)
    {
        std::fwrite(ptr, 1, count, stdout);
        std::fflush(stdout);
        return count;
    }
};
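
An output type does not have to write to a file or the console. As a minimal sketch, assuming only the write() requirement described in the comments above, the following hypothetical class collects the output in memory. The name OutputString and the str() accessor are illustrative and are not part of Keyview_IO.hpp. How an output object is passed to the SDK is not shown here.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical output type that accumulates the written bytes in a std::string.
class OutputString
{
public:
    // Appends count bytes from ptr and returns the number of bytes stored.
    int64_t write(const char* ptr, int64_t count)
    {
        if (ptr == nullptr || count <= 0)
        {
            return 0;
        }
        buffer_.append(ptr, static_cast<std::size_t>(count));
        return count;
    }

    // Returns everything written so far.
    const std::string& str() const
    {
        return buffer_;
    }

private:
    std::string buffer_;
};
```

Collecting output this way keeps it in memory, so the application can inspect or forward it without creating a temporary file.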

Extracting Directly to a Document

File Content Extraction lets you access subfiles directly as documents, rather than needing to extract them yourself.

To access a subfile as a document, use the Subfile::open method, rather than using Subfile::extract.

auto [subdoc, extract_info] = subfile.open();
if (subdoc)
{
    // Use subdoc
}