With this applied, my data set was nearly ready for proper analysis. The final step in the data cleansing process was excluding data that was no longer applicable in determining Hall of Fame candidacy. Looking at the statistics, I realized that the athletes who made the Hall of Fame from 1870-1940 had significantly lower counting statistics than those from 1941-2005. Many factors could contribute to this, including genetics, a lack of nutritional and training procedures, and military requirements. Because there was no way to even out these statistics for more accurate results, I chose to omit the shortstops from 1940 and earlier and to run the analysis on 1941 through 2005. The following is a look at the completed data set:
I used Weka for the analysis in this project. After loading the data into Weka, I had to complete a few steps before running any models. The first step was nominalizing the Hall of Fame category, which allowed the category to be used as the class value the models predict. After nominalizing, I also discretized the data into 4 bins. This allowed me to see which bins contained more Hall of Famers than others. I then compared this to the results where I did not discretize the data.
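As a rough illustration of these two preparation steps outside of Weka, the Python sketch below turns a 0/1 Hall of Fame flag into a nominal label and splits a numeric statistic into 4 equal-width bins. The function names, the sample hit totals, and the choice of equal-width bins are assumptions made for this example; Weka's own filters may handle interval boundaries differently.

```python
def nominalize(flags):
    """Turn numeric 0/1 flags into nominal 'yes'/'no' labels."""
    return ["yes" if flag else "no" for flag in flags]


def equal_width_bins(values, n_bins=4):
    """Assign each value a 0-based bin index using equal-width intervals."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: everything lands in the first bin.
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    bins = []
    for v in values:
        idx = int((v - lo) / width)
        # The maximum value would compute to n_bins, so clamp it into the last bin.
        bins.append(min(idx, n_bins - 1))
    return bins


# Hypothetical career hit totals and Hall of Fame flags, for illustration only.
career_hits = [800, 1200, 1600, 2000, 2400, 2800, 3200]
hof_flags = [0, 0, 0, 1, 0, 1, 1]

print(nominalize(hof_flags))
print(equal_width_bins(career_hits))  # [0, 0, 1, 2, 2, 3, 3]
```

With 4 bins over a range of 800 to 3200, each bin is 600 hits wide, so the bin index gives a coarse "low to high" ranking for the statistic.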
Using Weka, I created three models for the data set: SMO, random forest, and naïve Bayes. My first goal in the analysis was to determine the average Hall of Fame career for a shortstop compared to a non-Hall of Fame career.
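The comparison of average careers can be sketched in a few lines of Python. This is only an illustration: the records, field names, and the `average_career` helper are hypothetical and do not come from the actual data set.

```python
from statistics import mean

# Hypothetical records, for illustration only; not the real data set.
players = [
    {"hits": 2500, "doubles": 420, "hof": True},
    {"hits": 2300, "doubles": 380, "hof": True},
    {"hits": 1400, "doubles": 210, "hof": False},
    {"hits": 1000, "doubles": 150, "hof": False},
]


def average_career(records, stat_names, hof):
    """Average each named statistic over players whose Hall of Fame flag equals hof."""
    subset = [r for r in records if r["hof"] == hof]
    return {name: mean(r[name] for r in subset) for name in stat_names}


stats = ["hits", "doubles"]
print("Hall of Fame:    ", average_career(players, stats, True))
print("Non-Hall of Fame:", average_career(players, stats, False))
```

Putting the two averages side by side for each statistic shows how far apart the two groups are before any model is involved.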
After understanding the average career for a Hall of Fame shortstop, I moved on to analyzing the data through bins to see whether it correlated with the average career. Some statistics that were particularly interesting in the results were hits, home runs, doubles, and RBIs.
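One way to read the binned results is to compute, for each bin of a statistic, the share of players in that bin who reached the Hall of Fame. The sketch below assumes each player already has a bin index for the statistic; the `hof_share_by_bin` helper and the sample values are hypothetical.

```python
from collections import defaultdict


def hof_share_by_bin(bin_labels, hof_flags):
    """Return {bin: fraction of players in that bin who are Hall of Famers}."""
    counts = defaultdict(lambda: [0, 0])  # bin -> [hall of famers, total players]
    for b, is_hof in zip(bin_labels, hof_flags):
        counts[b][1] += 1
        if is_hof:
            counts[b][0] += 1
    return {b: hof / total for b, (hof, total) in sorted(counts.items())}


# Hypothetical bin indices for one statistic (0 = lowest bin, 3 = highest bin).
bins = [0, 0, 1, 1, 2, 2, 3, 3]
hof = [False, False, False, True, False, True, True, True]

print(hof_share_by_bin(bins, hof))  # {0: 0.0, 1: 0.5, 2: 0.5, 3: 1.0}
```

A share that rises steadily from the lowest bin to the highest suggests the statistic tracks Hall of Fame status, while a flat or uneven pattern flags a statistic worth a closer look.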
The results of this analysis were generally what I expected to see. Players with higher counting statistics are more likely to reach the Hall of Fame. The one anomaly here is home runs, and a potential reason for this may be the steroid era. Baseball has been plagued by steroid use, and home runs are among the statistics that steroids affect most.
The final portion of the analysis used three predictive models to test accuracy on the training data. The three models were SMO, random forest, and naïve Bayes.
Looking at the results, it is clear that the SMO and random forest models were prone to overfitting. Their accuracies ranged from the upper 90s to 100 percent, and because those figures were measured on the training data, the models are not the most credible. The naïve Bayes model returned a high but acceptable accuracy of 82 percent.
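To show why accuracy measured on the training data can look better than it should, the sketch below compares training accuracy with leave-one-out accuracy. It deliberately uses a simple nearest-neighbor classifier, not SMO, random forest, or naïve Bayes, because that classifier memorizes its training data and makes the gap easy to see. The data points are hypothetical career hit totals, not values from the real data set.

```python
def nearest_neighbor_predict(train, x):
    """Predict the label of x as the label of the closest training point."""
    closest = min(train, key=lambda pair: abs(pair[0] - x))
    return closest[1]


def training_accuracy(data):
    """Score the classifier on the same points it was built from."""
    correct = sum(nearest_neighbor_predict(data, x) == y for x, y in data)
    return correct / len(data)


def leave_one_out_accuracy(data):
    """Score each point using a classifier built from all the other points."""
    correct = 0
    for i, (x, y) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        correct += nearest_neighbor_predict(rest, x) == y
    return correct / len(data)


# Hypothetical (career hits, Hall of Fame) pairs with overlapping classes.
data = [(900, False), (1300, False), (1700, False), (2100, True),
        (2000, False), (2400, True), (1600, True), (2900, True)]

print(training_accuracy(data))       # 1.0, since every point is its own nearest neighbor
print(leave_one_out_accuracy(data))  # 0.375 on this sample
```

In Weka, a comparable check is to pick cross-validation or a percentage split as the test option instead of evaluating on the training set.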
So what does this data mean and how can it be applied in a useful manner? For any team, I think the goal would be to get the best possible chance at a Hall of Fame shortstop. Having more future Hall of Famers means having better players than your peers, and thus a better opportunity to win games and championships. Applying this data on a minor league scale would give teams a better understanding of who has the best chance of reaching the Hall of Fame before they set foot in MLB. Based on this analysis, the most important statistics to look at would be doubles and hits, followed by RBIs. Applying this on a scale.
