Tuesday, July 15, 2008

Welcome to the Data Mining Using SAS Enterprise Miner Blog

Please feel free to post any general questions or comments that you might have about SAS Enterprise Miner.

Link Analysis Node

On page 126 under the Transactions tab, I would like to add additional comments (highlighted in bold) to the Minimum count and Retain path position options under the Sequence section of the Link Analysis node.

The Sequence section is available for selection provided that you have sequence data, that is, a sequence or a time stamp variable, in the active training data set. The Sequence section has the following options for configuring the sequence:

Minimum count: Specifies the minimum number of items that must occur in order to be defined as a sequence in the analysis. By default, a sequence is defined by two separate occurrences. For instance, when analyzing people visiting various Web pages, setting this option to one ensures that you capture all people visiting various Web pages, even those customers who visit a single Web site.

Retain path position: This option is designed to transform the sequences into links and nodes. Setting this option to Yes retains the position information of the sequence variable. That is, the nodes are positioned in the link graph according to each sequence that occurs within each sequence variable in the sequence data set. Again, when analyzing people visiting various Web pages, setting this option to Yes instructs the node to retain the order or the paths that were followed to navigate to a particular Web site.
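To make the two options more concrete, here is a minimal Python sketch of the idea. It is not how the Link Analysis node is implemented. The function name sequence_links, the sample visits, and the convention of appending the path position to the node name are all assumptions made only for illustration.

```python
from collections import Counter

def sequence_links(visits, minimum_count=2, retain_path_position=False):
    """Turn each visitor's ordered page list into links between nodes.

    visits maps a visitor ID to the ordered list of pages visited.
    Returns the number of sequences kept and a Counter of link frequencies.
    """
    links = Counter()
    kept = 0
    for pages in visits.values():
        # Sequences with fewer items than the minimum count are dropped.
        if len(pages) < minimum_count:
            continue
        kept += 1
        pairs = zip(pages, pages[1:])
        for position, (source, target) in enumerate(pairs, start=1):
            if retain_path_position:
                # Hypothetical naming scheme: the same page at a different
                # position in the path becomes a different node.
                source = f"{source}_{position}"
                target = f"{target}_{position + 1}"
            links[(source, target)] += 1
    return kept, links

# Hypothetical Web log: visitor B views a single page only.
visits = {
    "A": ["home", "products", "cart"],
    "B": ["home"],
    "C": ["home", "products"],
}

kept_default, links_default = sequence_links(visits)
kept_all, links_all = sequence_links(visits, minimum_count=1)
_, links_positioned = sequence_links(visits, retain_path_position=True)
```

With the default minimum count of two, visitor B is excluded from the analysis. Setting the minimum count to one keeps visitor B, although a single page still contributes no links.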

Transform Variables Node

On page 168, under the Output tab, I refer to the fact that SAS Enterprise Miner automatically sets the transformed variable to missing when the LOG transformation is applied to a variable with a value of zero. From the table listing, notice that some of the transformed YOJ values are set to zero by the user-defined transformation, as opposed to the standard transformation, which sets the transformed values to missing wherever the logarithm is undefined. However, it should be noted that SAS has resolved this problem with a downloadable service patch.
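The difference between the two transformations can be sketched in a few lines of Python. This is only an illustration. The text above does not give the formula of the user-defined transformation, so the sketch assumes it is log(YOJ + 1), and the YOJ values shown are made up. Missing values are represented by None.

```python
import math

def standard_log(value):
    """Plain log transformation: undefined inputs (missing, zero, or
    negative values) produce a missing value."""
    if value is None or value <= 0:
        return None
    return math.log(value)

def shifted_log(value):
    """Assumed user-defined transformation log(value + 1), which maps
    a value of zero to zero instead of missing."""
    if value is None or value <= -1:
        return None
    return math.log(value + 1)

# Hypothetical YOJ (years on job) values, including a zero and a missing value.
yoj = [0, 1, 7.5, None]

standard = [standard_log(v) for v in yoj]
shifted = [shifted_log(v) for v in yoj]
```

For the zero value of YOJ, the standard transformation returns missing, while the assumed user-defined transformation returns zero.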

MBR Modeling Node

On page 462, under the Fundamental Contents to MBR Modeling section, I would like to make it very clear why principal component variables are used in nearest neighbor modeling in SAS Enterprise Miner. One reason is that the first principal component is entered into the model first, followed by the second principal component. In other words, the nearest neighbor estimates are calculated similarly to moving average estimates, in which the first k values are averaged according to the sorted values of the first variable in the model within the subsequent values of the second variable.

In Enterprise Miner, the probe x is defined by the sorted values of the input variables that are created in the SAS data set. Since it is recommended that you use the principal component scores when there are numerous input variables in the analysis, the probe x is then determined by the sorted values of the principal component scores. Therefore, the values of the first principal component determine the sorted order of the fitted values of the target variable in the nearest neighbor model. The nearest neighbor estimates are calculated from the average target value, or from the number of target categories, within a predetermined window of the k points that lie closest to the current data point in the multidimensional region.
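As a rough illustration of the averaging step for an interval target, here is a minimal Python sketch. It is not the MBR node's actual algorithm. The use of Euclidean distance, the function name knn_predict, and the principal component scores and target values are assumptions made for the example.

```python
import math

def knn_predict(scores, targets, probe, k):
    """Average the target values of the k training points whose
    principal component scores lie closest to the probe."""
    # Rank the training points by distance from the probe.
    order = sorted(range(len(scores)),
                   key=lambda i: math.dist(scores[i], probe))
    nearest = order[:k]
    return sum(targets[i] for i in nearest) / k

# Hypothetical scores on the first two principal components,
# listed in order of the first component, with an interval target.
scores = [(-2.0, 0.1), (-1.0, -0.3), (0.0, 0.2), (1.0, 0.0), (2.0, -0.1)]
targets = [10.0, 12.0, 15.0, 19.0, 24.0]

estimate = knn_predict(scores, targets, probe=(0.9, 0.0), k=3)
```

Here the three closest points are the last three in the list, so the estimate is the average of 15, 19, and 24. Because the first principal component dominates the distances in this made-up data, the window of k points behaves much like a moving average over the sorted first component.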

Two-Stage Modeling Node

On page 482, under the Two Stage Model Settings section, I forgot to mention that, for MLP and RBF neural network modeling, five preliminary training runs are automatically performed in SAS Enterprise Miner. In other words, the five preliminary training runs are performed to determine the initial weight estimates for the subsequent training run in both RBF and MLP neural network modeling.
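The general idea of preliminary training runs can be sketched in Python as follows. This is a toy illustration under stated assumptions, not the procedure that Enterprise Miner runs internally. The one-hidden-unit model, the learning rate, the number of iterations, and the rule of keeping the run with the lowest training error are all assumptions, and every name in the code is hypothetical.

```python
import math
import random

def mse(weights, data):
    """Mean squared error of the toy network y = w2 * tanh(w1 * x)."""
    w1, w2 = weights
    return sum((w2 * math.tanh(w1 * x) - y) ** 2 for x, y in data) / len(data)

def train(weights, data, iterations, rate=0.1):
    """Plain gradient descent from the given starting weights."""
    w1, w2 = weights
    n = len(data)
    for _ in range(iterations):
        g1 = g2 = 0.0
        for x, y in data:
            h = math.tanh(w1 * x)
            err = w2 * h - y
            g1 += 2 * err * w2 * (1 - h * h) * x / n  # d(mse)/d(w1)
            g2 += 2 * err * h / n                     # d(mse)/d(w2)
        w1 -= rate * g1
        w2 -= rate * g2
    return (w1, w2)

def preliminary_runs(data, runs=5, iterations=20, seed=0):
    """Run several short trainings from random starting weights and
    keep the weights with the lowest training error (assumed rule)."""
    rng = random.Random(seed)
    best = None
    for _ in range(runs):
        start = (rng.uniform(-1, 1), rng.uniform(-1, 1))
        candidate = train(start, data, iterations)
        if best is None or mse(candidate, data) < mse(best, data):
            best = candidate
    return best

# Made-up training data generated from known weights (1.5, 2.0).
data = [(x / 10, 2 * math.tanh(1.5 * x / 10)) for x in range(-20, 21)]

initial = preliminary_runs(data)             # five short preliminary runs
final = train(initial, data, iterations=500) # subsequent full training run
```

The weights returned by the preliminary runs serve only as the starting point. The subsequent training run then continues from them for many more iterations.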