C API Advanced Programming Tutorial
This tutorial helps you to:
- familiarize yourself with more advanced Filter SDK functionality.
- work on streams, rather than files.
NOTE: This tutorial assumes that you have already completed the C API Programming Tutorial.
Using a Custom Stream
In some cases you might want to get File Content Extraction to operate on streams, rather than files on disk. For example, you might want to use File Content Extraction in stream mode when:
- The file you are dealing with is in memory, because it was output by another operation. You can use a custom input stream to read the file directly from memory, instead of writing it out to a file first.
- You want to get the filtered text in small chunks instead of all at once. This approach has the following advantages:
  - You can process the output data in parallel with filtering the rest of the text. Parallel processing can minimize the time it takes to filter and process the text.
  - You can choose to stop filtering when the application has all the text it needs, which can save valuable resources. This approach is called partial filtering.
- You want to extract subfiles into memory, instead of storing them on disk.
- You do not have the whole file available to begin with. In this case, you can use a custom input stream to retrieve only the required parts of the file as File Content Extraction requests them.
Defining a Custom Input Stream
You can implement a custom stream by filling out a KVInputStream structure with functions that perform the appropriate actions. Each of these functions is equivalent to its ANSI counterpart (fopen, fread, and so on), except that several functions return a BOOL rather than an error code.
To illustrate how to use a custom stream, the following example defines a very simple stream that forwards to the file-based operations.
#include <stdio.h>

// Private data for the stream: the file to read, and its handle once open.
typedef struct
{
    const char* filename;
    FILE* fp;
} StreamInfo;

// Opens the file if it is not already open, and rewinds to the start.
BOOL pascal streamOpen(KVInputStream* stream)
{
    if(!stream || !stream->pInputStreamPrivateData)
    {
        return FALSE;
    }
    StreamInfo* info = (StreamInfo*)stream->pInputStreamPrivateData;
    if(info->fp == NULL)
    {
        info->fp = fopen(info->filename, "rb");
    }
    if(info->fp)
    {
        fseek(info->fp, 0, SEEK_SET);
    }
    return info->fp != NULL;
}

// Reads up to size bytes into buffer, and returns the number of bytes read.
UINT pascal streamRead(KVInputStream* stream, BYTE* buffer, UINT size)
{
    if(!stream || !stream->pInputStreamPrivateData)
    {
        return 0;
    }
    StreamInfo* info = (StreamInfo*)stream->pInputStreamPrivateData;
    if(!info->fp)
    {
        return 0; // The stream has not been opened
    }
    return (UINT)fread(buffer, 1, size, info->fp);
}

// Moves the read position. Returns TRUE on success.
BOOL pascal streamSeek(KVInputStream* stream, long offset, int whence)
{
    if(!stream || !stream->pInputStreamPrivateData)
    {
        return FALSE;
    }
    StreamInfo* info = (StreamInfo*)stream->pInputStreamPrivateData;
    if(!info->fp)
    {
        return FALSE; // The stream has not been opened
    }
    return fseek(info->fp, offset, whence) == 0;
}

// Returns the current read position, or -1 on failure.
long pascal streamTell(KVInputStream* stream)
{
    if(!stream || !stream->pInputStreamPrivateData)
    {
        return -1;
    }
    StreamInfo* info = (StreamInfo*)stream->pInputStreamPrivateData;
    if(!info->fp)
    {
        return -1; // The stream has not been opened
    }
    return ftell(info->fp);
}

// Closes the file, and clears the handle so that the stream can be reopened.
BOOL pascal streamClose(KVInputStream* stream)
{
    if(!stream || !stream->pInputStreamPrivateData)
    {
        return FALSE;
    }
    StreamInfo* info = (StreamInfo*)stream->pInputStreamPrivateData;
    if(!info->fp)
    {
        return FALSE; // Nothing to close
    }
    int retval = fclose(info->fp);
    info->fp = NULL;
    return retval == 0;
}

// Fill out the stream structure with the private data and the functions above.
StreamInfo info = {pathToInputFile, NULL};
KVInputStream stream;
stream.pInputStreamPrivateData = &info;
stream.lcbFilesize = 0; // Size not known in advance
stream.fpOpen = streamOpen;
stream.fpRead = streamRead;
stream.fpSeek = streamSeek;
stream.fpTell = streamTell;
stream.fpClose = streamClose;
If you know the size of the document when you create the stream, you can set the lcbFilesize member to that size. This option can reduce the number of seeks required, because File Content Extraction does not need to seek to the end of the file to determine the size.
If you do not know the size, you must set this member to zero. Leaving this member uninitialized results in undefined behavior.
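For the in-memory case described earlier, a custom stream can read from a buffer instead of a FILE*. The following is a minimal sketch, not SDK code. DemoInputStream is a hypothetical stand-in that only mirrors the member names used in this section; the real KVInputStream, the BOOL, UINT, and BYTE types, and the pascal calling convention come from the SDK headers, and the type of lcbFilesize is assumed here to be long. MemoryStreamInfo and memoryStreamInit() are also illustrative names. Because the buffer size is known up front, the sketch fills in lcbFilesize rather than leaving it at zero.

```c
#include <assert.h>
#include <stdio.h>   /* SEEK_SET, SEEK_CUR, SEEK_END */
#include <string.h>

/* Hypothetical stand-in for KVInputStream (assumption: same member names). */
typedef struct DemoInputStream DemoInputStream;
struct DemoInputStream
{
    void* pInputStreamPrivateData;
    long lcbFilesize;
    int (*fpOpen)(DemoInputStream*);
    unsigned int (*fpRead)(DemoInputStream*, unsigned char*, unsigned int);
    int (*fpSeek)(DemoInputStream*, long, int);
    long (*fpTell)(DemoInputStream*);
    int (*fpClose)(DemoInputStream*);
};

/* Private data: a read-only buffer and the current read position. */
typedef struct
{
    const unsigned char* data;
    long size;
    long pos;
} MemoryStreamInfo;

static int memOpen(DemoInputStream* s)
{
    if (!s || !s->pInputStreamPrivateData) return 0;
    MemoryStreamInfo* info = s->pInputStreamPrivateData;
    info->pos = 0;   /* like the file-based example, rewind on open */
    return 1;
}

static unsigned int memRead(DemoInputStream* s, unsigned char* buffer, unsigned int size)
{
    if (!s || !s->pInputStreamPrivateData || !buffer) return 0;
    MemoryStreamInfo* info = s->pInputStreamPrivateData;
    long remaining = info->size - info->pos;
    if (remaining <= 0) return 0;
    /* Copy at most the bytes that remain in the buffer. */
    unsigned int n = ((unsigned long)remaining < size) ? (unsigned int)remaining : size;
    memcpy(buffer, info->data + info->pos, n);
    info->pos += (long)n;
    return n;
}

static int memSeek(DemoInputStream* s, long offset, int whence)
{
    if (!s || !s->pInputStreamPrivateData) return 0;
    MemoryStreamInfo* info = s->pInputStreamPrivateData;
    long base;
    switch (whence)
    {
    case SEEK_SET: base = 0; break;
    case SEEK_CUR: base = info->pos; break;
    case SEEK_END: base = info->size; break;
    default: return 0;
    }
    /* Reject any target outside [0, size]. */
    if (offset < -base || offset > info->size - base) return 0;
    info->pos = base + offset;
    return 1;
}

static long memTell(DemoInputStream* s)
{
    if (!s || !s->pInputStreamPrivateData) return -1;
    return ((MemoryStreamInfo*)s->pInputStreamPrivateData)->pos;
}

static int memClose(DemoInputStream* s)
{
    /* Nothing to release: the caller owns the buffer. */
    return s && s->pInputStreamPrivateData;
}

/* Fills out the stream so that it reads from data[0..size). */
void memoryStreamInit(DemoInputStream* s, MemoryStreamInfo* info, const void* data, long size)
{
    assert(s && info && data && size >= 0);
    info->data = data;
    info->size = size;
    info->pos = 0;
    s->pInputStreamPrivateData = info;
    s->lcbFilesize = size;   /* size is known, so no seek to the end is needed */
    s->fpOpen = memOpen;
    s->fpRead = memRead;
    s->fpSeek = memSeek;
    s->fpTell = memTell;
    s->fpClose = memClose;
}
```

To adapt this to the SDK, the callbacks would take a KVInputStream* and use the SDK types and calling convention, as in the file-based example. The sketch rejects seeks outside the buffer; that is a design choice for a read-only buffer, not documented SDK behavior.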
Opening a Document From a Stream
After you define a stream, you can create a KVDocument from the stream by calling fpOpenDocumentFromStream(). This KVDocument functions in the same way as a document created using fpOpenDocumentFromFile().
You must not open a second document from a stream until you have closed the first document.
KVDocument document = NULL;
error = filter.fpOpenDocumentFromStream(session, &stream, &document);
if(error != KVError_Success)
{
    return error;
}
// Pass document to KeyView functions
filter.fpCloseDocument(document);
Extracting Directly to a Document
File Content Extraction lets you access subfiles directly as documents, rather than needing to extract them yourself.
To access a subfile as a document, use the fpOpenDocumentFromSubFile() function, rather than using fpExtractSubFile(). The KVOpenDocumentFromSubFileArg structure is very similar to the KVExtractSubFileArg used for extracting files (see Extract Subfiles).
KVOpenDocumentFromSubFileArgRec extractArg;
KVStructInit(&extractArg);
extractArg.index = index;
extractArg.extractionFlag = KVExtractionFlag_GetFormattedBody;
KVDocument document = NULL;
KVSubFileExtractInfo postExtractInfo = NULL;
KVErrorCode error = extract->fpOpenDocumentFromSubFile(fileContext, &extractArg, &document, &postExtractInfo);
if(error != KVError_Success)
{
    return error;
}
// Use document
filter->fpCloseDocument(document);
extract->fpFreeStruct(postExtractInfo);
Filtering Text Using Streams
For some use cases, you might not need all the text from the file, or you might want to analyze the text in small pieces. The fpFilter() function outputs the text in chunks, by filling out a KVFilterOutput structure on each call. After you finish with each chunk, you must free the structure by calling the fpFreeFilterOutput() function.
The end of the stream is indicated by an empty KVFilterOutput structure. You do not need to free the empty structure.
By requesting text in chunks, a multi-threaded application can often filter and process all the text from a file in less time. One thread passes each chunk to downstream processing on another thread, and then continues to get the next chunk from the stream.
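One way to arrange this handoff is a small bounded queue between the filtering thread and a processing thread. The sketch below is illustrative only and assumes POSIX threads are available. It does not call the SDK: the array of strings stands in for successive fpFilter() results, and Chunk, ChunkQueue, and runPipeline() are hypothetical names. Each chunk is copied into the queue, so the producer could free its KVFilterOutput immediately after the push; whether a chunk can instead be handed to another thread without copying is not covered here.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <string.h>

#define QUEUE_CAPACITY 4
#define CHUNK_MAX 64

/* Hypothetical stand-in for one chunk of filtered text. */
typedef struct
{
    char text[CHUNK_MAX];
    size_t size;               /* 0 marks the end of the text */
} Chunk;

typedef struct
{
    Chunk items[QUEUE_CAPACITY];
    size_t head, count;
    pthread_mutex_t lock;
    pthread_cond_t notEmpty, notFull;
} ChunkQueue;

static void queueInit(ChunkQueue* q)
{
    memset(q, 0, sizeof *q);
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->notEmpty, NULL);
    pthread_cond_init(&q->notFull, NULL);
}

static void queueDestroy(ChunkQueue* q)
{
    pthread_cond_destroy(&q->notFull);
    pthread_cond_destroy(&q->notEmpty);
    pthread_mutex_destroy(&q->lock);
}

/* Copies a chunk into the queue, blocking while the queue is full. */
static void queuePush(ChunkQueue* q, const char* text, size_t size)
{
    assert(text && size <= CHUNK_MAX);
    pthread_mutex_lock(&q->lock);
    while (q->count == QUEUE_CAPACITY)
        pthread_cond_wait(&q->notFull, &q->lock);
    Chunk* slot = &q->items[(q->head + q->count) % QUEUE_CAPACITY];
    memcpy(slot->text, text, size);
    slot->size = size;
    q->count++;
    pthread_cond_signal(&q->notEmpty);
    pthread_mutex_unlock(&q->lock);
}

/* Removes the oldest chunk, blocking while the queue is empty. */
static Chunk queuePop(ChunkQueue* q)
{
    pthread_mutex_lock(&q->lock);
    while (q->count == 0)
        pthread_cond_wait(&q->notEmpty, &q->lock);
    Chunk out = q->items[q->head];
    q->head = (q->head + 1) % QUEUE_CAPACITY;
    q->count--;
    pthread_cond_signal(&q->notFull);
    pthread_mutex_unlock(&q->lock);
    return out;
}

typedef struct
{
    ChunkQueue* queue;
    size_t totalBytes;
} Consumer;

/* Processing thread: here it only counts bytes; real processing would go in the loop. */
static void* consumerMain(void* arg)
{
    Consumer* c = arg;
    for (;;)
    {
        Chunk chunk = queuePop(c->queue);
        if (chunk.size == 0) break;    /* end-of-text marker */
        c->totalBytes += chunk.size;
    }
    return NULL;
}

/* Filtering thread: pushes each non-empty chunk, then an empty end marker.
   Returns the number of bytes that the consumer processed. */
size_t runPipeline(const char* const* chunks, size_t chunkCount)
{
    ChunkQueue queue;
    Consumer consumer = { &queue, 0 };
    pthread_t thread;
    queueInit(&queue);
    if (pthread_create(&thread, NULL, consumerMain, &consumer) != 0)
    {
        queueDestroy(&queue);
        return 0;
    }
    for (size_t i = 0; i < chunkCount; i++)
        queuePush(&queue, chunks[i], strlen(chunks[i]));
    queuePush(&queue, "", 0);
    pthread_join(thread, NULL);
    queueDestroy(&queue);
    return consumer.totalBytes;
}
```

The queue capacity bounds how far filtering can run ahead of processing, which keeps memory use predictable when the downstream work is the slower of the two.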
Partial Filtering
You might want to stop processing before you have filtered the entire file, for example because you have already found a search term, or because you have hit a resource threshold. You can safely stop processing, as long as you still call fpFreeFilterOutput() and fpCloseDocument().
You can optionally keep track of how many bytes have been output, by accumulating the cbText field of KVFilterOutput.
uint64_t totalSize = 0; // uint64_t requires <stdint.h>
while(1)
{
    KVFilterOutput output = {0};
    error = filter->fpFilter(document, &output);
    if(error != KVError_Success)
    {
        return error;
    }
    if(output.cbText == 0)
    {
        // An empty structure marks the end of the text, and does not need to be freed
        break;
    }
    totalSize += output.cbText;
    // Use filter output
    filter->fpFreeFilterOutput(session, &output);
}
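The two stop conditions mentioned above (a search term found, or a resource threshold reached) can be added to the same loop shape. The following sketch is self-contained and does not use the SDK: DemoDocument, DemoFilterOutput, demoFilter(), and demoFreeFilterOutput() are hypothetical stand-ins for the document, KVFilterOutput, fpFilter(), and fpFreeFilterOutput(), and only the cbText field name is taken from this tutorial. It frees every non-empty chunk before deciding whether to stop, so that stopping early never leaves a chunk unfreed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for KVFilterOutput. */
typedef struct
{
    const char* text;
    size_t cbText;
} DemoFilterOutput;

/* Hypothetical stand-in for a document that yields text in chunks. */
typedef struct
{
    const char* const* chunks;
    size_t count;
    size_t next;
    size_t outstanding;   /* chunks handed out but not yet freed */
} DemoDocument;

/* Stand-in for fpFilter(): yields the next chunk, or an empty one at the end. */
static void demoFilter(DemoDocument* doc, DemoFilterOutput* out)
{
    out->text = NULL;
    out->cbText = 0;
    if (doc->next < doc->count)
    {
        out->text = doc->chunks[doc->next++];
        out->cbText = strlen(out->text);
        if (out->cbText > 0) doc->outstanding++;
    }
}

/* Stand-in for fpFreeFilterOutput(). */
static void demoFreeFilterOutput(DemoDocument* doc, DemoFilterOutput* out)
{
    assert(doc->outstanding > 0);
    doc->outstanding--;
    out->text = NULL;
    out->cbText = 0;
}

/* Length-based search, because chunk text is not assumed to be NUL-terminated. */
static int containsTerm(const char* text, size_t size, const char* term)
{
    size_t termLen = strlen(term);
    if (termLen == 0 || termLen > size) return 0;
    for (size_t i = 0; i + termLen <= size; i++)
    {
        if (memcmp(text + i, term, termLen) == 0) return 1;
    }
    return 0;
}

typedef enum { STOP_END_OF_TEXT, STOP_TERM_FOUND, STOP_BUDGET_REACHED } StopReason;

/* Filters until the term is found, byteBudget bytes have been output,
   or the text ends. *totalSize receives the number of bytes seen. */
StopReason filterUntil(DemoDocument* doc, const char* term, size_t byteBudget, size_t* totalSize)
{
    StopReason reason = STOP_END_OF_TEXT;
    *totalSize = 0;
    for (;;)
    {
        DemoFilterOutput output;
        demoFilter(doc, &output);
        if (output.cbText == 0) break;   /* empty chunk: end of text, nothing to free */
        *totalSize += output.cbText;
        int found = containsTerm(output.text, output.cbText, term);
        demoFreeFilterOutput(doc, &output);   /* free before any early exit */
        if (found) { reason = STOP_TERM_FOUND; break; }
        if (*totalSize >= byteBudget) { reason = STOP_BUDGET_REACHED; break; }
    }
    return reason;
}
```

This sketch searches each chunk independently, so it would miss a term that is split across two chunks. An application that needs that case could keep the last few bytes of each chunk and search them together with the start of the next one. After the loop, the real application would still call fpCloseDocument().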
Conclusion
After you have completed the C API Programming Tutorial and this more advanced tutorial, you should have a good understanding of the Filter SDK C API, and be able to use it to automatically detect file formats and extract metadata, text, and subfiles.