You might want to split pages into multiple documents. For example, if you ingest pages from a discussion board you might want to ingest one document for each message on the page.
Connector Framework Server can create documents for sections of a Web page identified using CSS selectors. CFS creates a child document for each section of the page that is identified. Metadata fields (named CHILD_DOCUMENT) are added to the parent document, to refer to the child documents.
To split pages into multiple documents, add the following parameters to your WKOOPHtmlExtraction task:
| ChildDocumentSelector | A CSS2 selector that identifies the root element of each child document in the page source. | 
| ChildReferenceSelector | (Optional) An element in the child document that contains a value to use as the document reference. The value you extract should be unique for each child document, because it is used as part of the DREREFERENCE field in the child document. If you do not set this parameter, the connector uses a GUID. Specify the element using a CSS2 selector, relative to the element identified by ChildDocumentSelector. |
| ChildMetadataSelector | (Optional) A list of elements in the child document that contain metadata. The metadata in these elements is extracted and added to the metadata fields of child documents. Specify the elements as a list of CSS2 selectors, relative to the element identified by ChildDocumentSelector. To specify the names of the document fields that contain the extracted information, set the configuration parameter ChildMetadataFieldName. |
| ChildMetadataFieldName | (Optional) The names to use for document fields (in child documents) that contain information extracted using the parameter ChildMetadataSelector. This parameter must have the same number of values as ChildMetadataSelector. |
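The requirement that ChildMetadataFieldName and ChildMetadataSelector have the same number of values can be checked before you deploy a configuration. The following Python sketch is illustrative only: the function paired_metadata_settings and the use of a plain dictionary for the task settings are assumptions made for this example, not part of CFS. It assumes the numbered parameter form (ChildMetadataSelector0, ChildMetadataFieldName0, and so on).

```python
import re

def paired_metadata_settings(task_config):
    """Pair each ChildMetadataSelectorN with its ChildMetadataFieldNameN.

    task_config is a plain dict of parameter name -> value. This is a
    hypothetical stand-in for the task section; CFS does not expose it.
    """
    selectors, names = {}, {}
    pattern = r"(ChildMetadataSelector|ChildMetadataFieldName)(\d+)"
    for key, value in task_config.items():
        match = re.fullmatch(pattern, key)
        if match is None:
            continue  # not one of the two numbered parameters
        target = selectors if match.group(1) == "ChildMetadataSelector" else names
        target[int(match.group(2))] = value
    # Both parameters must define exactly the same set of indexes.
    if sorted(selectors) != sorted(names):
        raise ValueError(
            "ChildMetadataFieldName and ChildMetadataSelector "
            "must have the same number of values"
        )
    return [(names[i], selectors[i]) for i in sorted(selectors)]
```

Calling the function with one selector and one field name returns a single (field name, selector) pair. A configuration with two selectors and one field name raises ValueError.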
For example, consider the following page, which represents messages on a discussion board:
<html>
    <head>
        <title>Example Page</title>
        <meta charset="utf-8">
    </head>
    <body>
        <div>
            <h1>Example Page</h1>
            <div class="content">
                <p>content</p>
            </div>
            <div class="message">
                <h1>Message 1</h1>
                <p class="meta">some metadata</p>
                <p>some content</p>
            </div>
            <div class="message">
                <h1>Message 2</h1>
                <p class="meta">some metadata</p>
                <p>some content</p>
            </div>
            ...
        </div>
    </body>
</html>
To create separate documents for the messages contained on this page, you could use the following configuration:
[MyTask]
...
ChildDocumentSelector=div.message
ChildReferenceSelector=h1
ChildMetadataFieldName0=my_metadata
ChildMetadataSelector0=p.meta
This example would produce the following child document (and a similar document for the second message):
#DREREFERENCE <current_document_reference>:<child_reference>
#DREFIELD my_metadata="some metadata"
...
#DRECONTENT
Message 1
some metadata
some content
...
The value of the DREREFERENCE field is constructed from the reference of the original document and the value of the element identified by the ChildReferenceSelector configuration parameter. If you don't set this configuration parameter or the element is not found, CFS uses a GUID instead.
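The following Python sketch illustrates how a child reference of the form <current_document_reference>:<child_reference> could be assembled, including the GUID fallback. It is not the CFS implementation: the function name child_reference is hypothetical, and the use of a random version 4 UUID as the GUID is an assumption made for this example.

```python
import uuid

def child_reference(parent_reference, extracted_value=None):
    """Build '<parent reference>:<child reference>' for a child document.

    extracted_value is the text of the element matched by
    ChildReferenceSelector, or None if the parameter is not set or the
    element was not found.
    """
    value = (extracted_value or "").strip()
    if not value:
        # Assumption: a random UUID stands in for the GUID that CFS generates.
        value = str(uuid.uuid4())
    return f"{parent_reference}:{value}"
```

Because the extracted value becomes part of the reference, two messages with the same heading would produce the same reference in this sketch, which is why the value you extract should be unique for each child document.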
CFS adds the reference of the original document to the fields DREPARENTREFERENCE and DREROOTPARENTREFERENCE. It also adds an HTML_PROCESSING metadata field that contains any metadata and links that are extracted from the child document.
The DRECONTENT field is populated with text extracted from the HTML elements that you identified as belonging to the child document.
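To see which parts of the example page end up in each child document, the splitting can be approximated outside CFS. The following Python sketch uses only the standard library html.parser module and hard-codes the three selectors from the example configuration (div.message, h1, and p.meta) instead of evaluating CSS selectors. The class MessageSplitter, the function split_messages, and the record layout are hypothetical names chosen for this illustration, and the sketch does not reproduce how CFS processes pages.

```python
from html.parser import HTMLParser

PAGE = """
<html><body><div>
  <h1>Example Page</h1>
  <div class="content"><p>content</p></div>
  <div class="message">
    <h1>Message 1</h1>
    <p class="meta">some metadata</p>
    <p>some content</p>
  </div>
  <div class="message">
    <h1>Message 2</h1>
    <p class="meta">some metadata</p>
    <p>some content</p>
  </div>
</div></body></html>
"""

class MessageSplitter(HTMLParser):
    """Collect one record per <div class="message"> element."""

    def __init__(self):
        super().__init__()
        self.children = []    # finished child records
        self._current = None  # record being built, or None outside a message
        self._depth = 0       # <div> nesting depth inside the current message
        self._target = None   # "reference", "metadata", or None

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self._current is None:
            # Equivalent of ChildDocumentSelector=div.message
            if tag == "div" and "message" in classes:
                self._current = {"reference": None, "metadata": [], "content": []}
                self._depth = 1
            return
        if tag == "div":
            self._depth += 1
        elif tag == "h1" and self._current["reference"] is None:
            self._target = "reference"   # ChildReferenceSelector=h1
        elif tag == "p" and "meta" in classes:
            self._target = "metadata"    # ChildMetadataSelector0=p.meta

    def handle_endtag(self, tag):
        if self._current is None:
            return
        if tag in ("h1", "p"):
            self._target = None
        elif tag == "div":
            self._depth -= 1
            if self._depth == 0:         # closing tag of the message root
                self.children.append(self._current)
                self._current = None

    def handle_data(self, data):
        text = data.strip()
        if self._current is None or not text:
            return
        self._current["content"].append(text)  # all text goes to the content
        if self._target == "reference":
            self._current["reference"] = text
        elif self._target == "metadata":
            self._current["metadata"].append(text)

def split_messages(page):
    splitter = MessageSplitter()
    splitter.feed(page)
    return splitter.children
```

Calling split_messages(PAGE) yields two records. Text outside the div.message elements, such as the page heading and the div.content block, does not appear in either record, which mirrors how only the elements belonging to a child document contribute to its content.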
Connector Framework Server automatically adds fields named CHILD_DOCUMENT to the parent document. These fields contain the references of the associated child documents.
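The relationship between the parent and child fields can be summarized with a small Python model. This is a sketch only: the function link_parent_and_children is hypothetical, and representing the repeated CHILD_DOCUMENT fields as a single list is an assumption made to keep the example short.

```python
def link_parent_and_children(parent_reference, child_references):
    """Model the reference fields that link a parent page to its children."""
    # Assumption: the list holds one entry per CHILD_DOCUMENT field.
    parent_fields = {"CHILD_DOCUMENT": list(child_references)}
    children = [
        {
            "DREREFERENCE": reference,
            # Both fields hold the reference of the original document.
            "DREPARENTREFERENCE": parent_reference,
            "DREROOTPARENTREFERENCE": parent_reference,
        }
        for reference in child_references
    ]
    return parent_fields, children
```

In this model, every value listed under CHILD_DOCUMENT in the parent matches the DREREFERENCE of exactly one child, and every child points back to the same parent reference.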