Modeling/Analyzing Big Data a) What is the over-fitting problem in analyzing big
Question
Modeling/Analyzing Big Data
a) What is the over-fitting problem in analyzing big data sets? Illustrate the problem in the case of tree models.
b) How do neural net models compare to tree models in analyzing big data?
c) You read a paper that says the best way to predict profitability of new investments is to do an ensemble analysis. It estimates different models – from linear regression to trees to neural nets to logistic equations – that link economic data and tweets to whether a past investment was profitable (=1) or not (=0). It applies the estimated models to predict whether new investments are likely to be profitable or not and invests in the projects getting the most votes. Why might this voting procedure work?
d) Professor Critical says “This class is about rare discontinuous events that affect economies but big data means having large numbers of observations on many small events. How can big data illuminate rare events?”
Explanation / Answer
In the training phase, the correct class for each record is known (i.e., supervised training), and the output nodes can be assigned correct values -- 1 for the node corresponding to the correct class, and 0 for the others. In practice, target values of 0.9 and 0.1, respectively, are often used instead. As a result, it is possible to compare the network's calculated values for the output nodes to these correct values, and calculate an error term for each node. These error terms are then used to adjust the weights in the hidden layers so that the next time around the output values will be closer to the correct values.
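As a small illustration of the target coding and error terms described above, the following Python sketch builds the 0.9/0.1 target values for one record and subtracts the network's calculated outputs from them. The function names and the example output values are made up for illustration and do not come from any particular library.

```python
def target_values(n_classes, correct_class, on=0.9, off=0.1):
    """Desired value per output node: `on` for the correct class, `off` otherwise."""
    return [on if k == correct_class else off for k in range(n_classes)]


def error_terms(outputs, targets):
    """Error term per output node: desired value minus calculated value."""
    return [t - o for o, t in zip(outputs, targets)]


# Hypothetical record whose correct class is 1 in a three-class problem.
targets = target_values(3, correct_class=1)
outputs = [0.2, 0.7, 0.1]  # made-up values standing in for a network's outputs
errors = error_terms(outputs, targets)
```

A positive error term means the node's output should rise on the next pass, and a negative one means it should fall.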
The Iterative Learning Process
A key feature of neural networks is an iterative learning process in which records (rows) are presented to the network one at a time, and the weights associated with the input values are adjusted each time. After all cases are presented, the process often starts over again. During this learning phase, the network trains by adjusting the weights to predict the correct class label of input samples. Advantages of neural networks include their high tolerance to noisy data, as well as their ability to classify patterns on which they have not been trained. The most popular neural network algorithm is the back-propagation algorithm proposed in the 1980s.
Once a network has been structured for a particular application, that network is ready to be trained. To start this process, the initial weights are chosen randomly. Next, the training begins.
The network processes the records in the training data one at a time -- using the weights and functions in the hidden layers -- then compares the resulting outputs against the desired outputs. Errors are then propagated back through the system, causing the system to adjust the weights for the next record. This process occurs again as the weights are continually tweaked. During the training of a network, the same set of data is processed many times as the connection weights are continually refined.
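The loop described above (random initial weights, one record at a time, many passes over the same data) can be sketched in Python as follows. This is a deliberately reduced example under stated assumptions: it uses a single sigmoid unit with no hidden layer, a simple "error times input" update, and a toy data set (logical OR). The names `train` and `predict`, the learning rate, and the epoch count are all illustrative choices.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, record):
    """Output of one sigmoid unit; weights[0] is a bias term."""
    z = weights[0] + sum(w * x for w, x in zip(weights[1:], record))
    return sigmoid(z)


def train(records, labels, epochs=500, rate=0.5, seed=0):
    rng = random.Random(seed)
    # Initial weights are chosen randomly.
    weights = [rng.uniform(-0.5, 0.5) for _ in range(len(records[0]) + 1)]
    for _ in range(epochs):  # the same data set is processed many times
        for record, label in zip(records, labels):  # one record at a time
            error = label - predict(weights, record)
            # Adjust each weight in proportion to the error and its input.
            weights[0] += rate * error
            for i, x in enumerate(record):
                weights[i + 1] += rate * error * x
    return weights


# Toy data (logical OR), chosen only because it is easy to check by hand.
records = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 1, 1, 1]
weights = train(records, labels)
classes = [round(predict(weights, r)) for r in records]
```

A real network would replace the single unit with hidden layers and would normally hold out a validation set, as the next paragraph suggests.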
Note that some networks never learn. This could be because the input data does not contain the specific information from which the desired output is derived. Networks also will not converge if there is not enough data to enable complete learning. Ideally, there should be enough data available to create a Validation Set.
Feedforward, Back-Propagation
The feedforward, back-propagation architecture was developed independently by several sources between the early 1970s and the mid-1980s (Werbos, Parker, Rumelhart, Hinton, and Williams). This independent co-development was the result of a proliferation of articles and talks at various conferences that stimulated the entire industry. Currently, this back-propagation architecture is the most popular and effective model for complex, multi-layered networks. Its greatest strength is in non-linear solutions to ill-defined problems. The typical back-propagation network has an input layer, an output layer, and at least one hidden layer. Theoretically, there is no limit on the number of hidden layers, but typically there are just one or two. Some studies have suggested that at most five layers (one input layer, three hidden layers, and an output layer) are needed to solve problems of any complexity. Each layer is fully connected to the succeeding layer.
The training process normally uses some variant of the Delta Rule, which starts with the calculated difference between the actual outputs and the desired outputs. Using this error, connection weights are increased in proportion to the error times a scaling factor for global accuracy. This means that the inputs, the output, and the desired output all must be present at the same processing element. The most complex part of this algorithm is determining which input contributed the most to an incorrect output and how to modify its weight to correct the error. (An inactive node would not contribute to the error and would have no need to change its weights.) To solve this problem, training inputs are applied to the input layer of the network, and the resulting outputs are compared with the desired outputs at the output layer. During the learning process, a forward sweep is made through the network, and the output of each element is computed layer by layer. The difference between the output of the final layer and the desired output is back-propagated to the previous layer(s), usually modified by the derivative of the transfer function. The connection weights are normally adjusted using the Delta Rule. This process proceeds for the previous layer(s) until the input layer is reached.
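The forward sweep and backward pass described above can be sketched for a tiny network with one hidden layer. This is a minimal illustration under assumptions that are not in the original text: sigmoid transfer functions, no bias terms, a squared-error objective, and a learning rate of 0.5 playing the role of the scaling factor. The function names and the starting weights are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w_hidden, w_out):
    """Forward sweep: compute each layer's outputs in turn."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    output = [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w_out]
    return hidden, output


def backprop_step(x, target, w_hidden, w_out, rate=0.5):
    """One weight update for a single record; returns new weight matrices."""
    hidden, output = forward(x, w_hidden, w_out)
    # Output-layer error, modified by the derivative of the sigmoid, o * (1 - o).
    delta_out = [(t - o) * o * (1 - o) for t, o in zip(target, output)]
    # Error back-propagated to the hidden layer through the output weights.
    delta_hidden = [
        h * (1 - h) * sum(delta_out[k] * w_out[k][j] for k in range(len(w_out)))
        for j, h in enumerate(hidden)
    ]
    # Delta Rule: change each weight by rate * delta * incoming value.
    new_w_out = [[w + rate * delta_out[k] * hidden[j] for j, w in enumerate(row)]
                 for k, row in enumerate(w_out)]
    new_w_hidden = [[w + rate * delta_hidden[j] * x[i] for i, w in enumerate(row)]
                    for j, row in enumerate(w_hidden)]
    return new_w_hidden, new_w_out


# Made-up starting weights for a 2-input, 2-hidden, 1-output network.
x, target = [1.0, 0.0], [0.9]
w_hidden = [[0.1, -0.2], [0.3, 0.4]]
w_out = [[0.2, -0.1]]
_, out_before = forward(x, w_hidden, w_out)
w_hidden2, w_out2 = backprop_step(x, target, w_hidden, w_out)
_, out_after = forward(x, w_hidden2, w_out2)
```

Note that the weights attached to the second input do not change in this example, because that input is 0: an inactive input contributes nothing to the error.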
Structuring the Network
The number of layers and the number of processing elements per layer are important decisions. For a feedforward, back-propagation topology, these parameters are also the most ethereal: they are the art of the network designer. There is no quantifiable, best answer to the layout of the network for any particular application. There are only three general rules picked up over time and followed by most researchers and engineers applying this architecture to their problems.
Rule One: As the complexity in the relationship between the input data and the desired output increases, the number of the processing elements in the hidden layer should also increase.
Rule Two: If the process being modeled is separable into multiple stages, then additional hidden layer(s) may be required. If the process is not separable into stages, then additional layers may simply enable memorization of the Training Set, and not a true general solution.
Rule Three: The amount of training data available sets an upper bound for the number of processing elements in the hidden layer(s). To calculate this upper bound, use the number of cases in the Training Set and divide that number by the sum of the number of nodes in the input and output layers in the network. Then divide that result again by a scaling factor between five and ten. Larger scaling factors are used for relatively noisy data, since they give a smaller, more conservative upper bound. If too many artificial neurons are used, the Training Set will be memorized, not generalized, and the network will be useless on new data sets.
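The arithmetic in Rule Three is easy to write down. The sketch below is only a worked example of that rule of thumb; the function name and the sample numbers (1,000 training cases, 8 input nodes, 2 output nodes) are made up for illustration.

```python
def max_hidden_nodes(n_cases, n_inputs, n_outputs, scaling=5):
    """Rule-of-thumb upper bound on hidden processing elements (Rule Three)."""
    return n_cases / (n_inputs + n_outputs) / scaling


# Hypothetical network: 1,000 training cases, 8 input nodes, 2 output nodes.
bound_low_scaling = max_hidden_nodes(1000, 8, 2, scaling=5)    # 1000 / 10 / 5
bound_high_scaling = max_hidden_nodes(1000, 8, 2, scaling=10)  # 1000 / 10 / 10
```

With these numbers the bound is 20 hidden elements at a scaling factor of five and 10 at a scaling factor of ten.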
Ensemble Methods
XLMiner V2015 offers two powerful ensemble methods for use with Neural Networks: bagging (bootstrap aggregating) and boosting. The Neural Networks Algorithm on its own can be used to find one model that results in good predictions for the new data. We can view the statistics and confusion matrices of the current predictor to see if our model is a good fit to the data, but how would we know if there is a better predictor just waiting to be found? The answer is that we do not know if a better predictor exists. However, ensemble methods allow us to combine multiple weak neural networks which, when taken together, form a new, more accurate, strong neural network. These methods work by creating multiple diverse networks, by taking different samples of the original dataset, and then combining their outputs. (Outputs may be combined by several techniques, for example, majority vote for classification and averaging for prediction.) This combination of models effectively reduces the variance in the strong model. The two types of ensemble methods offered in XLMiner (bagging and boosting) differ on three items: 1) the selection of training data for each predictor or weak model; 2) how the weak models are generated; and 3) how the outputs are combined. In both methods, each weak model is trained on data drawn from the Training Set to become proficient in some portion of the data set.
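The idea of taking different samples of the original dataset and combining the resulting models by majority vote can be sketched in plain Python. This is a bagging-style illustration only, not XLMiner's implementation: the weak model is a simple one-feature threshold standing in for a small neural network, it assumes class 1 has the larger feature values, and the data set and all function names are made up.

```python
import random
from collections import Counter


def bootstrap_sample(records, rng):
    """Draw len(records) records with replacement; redraw if a class is missing."""
    while True:
        sample = [rng.choice(records) for _ in records]
        if len({label for _, label in sample}) == 2:
            return sample


def fit_weak_model(sample):
    """Stand-in weak model: a threshold halfway between the two class means."""
    zeros = [x for x, label in sample if label == 0]
    ones = [x for x, label in sample if label == 1]
    threshold = (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2
    return lambda x: 1 if x > threshold else 0


def majority_vote(models, x):
    """Combine the weak models' class predictions by majority vote."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]


# Made-up one-feature data: (feature value, class label).
records = [(0.1, 0), (0.3, 0), (0.4, 0), (0.6, 1), (0.8, 1), (0.9, 1)]
rng = random.Random(0)
models = [fit_weak_model(bootstrap_sample(records, rng)) for _ in range(25)]
low_vote = majority_vote(models, 0.05)
high_vote = majority_vote(models, 0.95)
```

Each bootstrap sample yields a slightly different threshold, so individual models disagree near the class boundary; the vote averages out that sample-to-sample variation, which is the variance reduction mentioned above.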