Orange is one of those simple yet profound tools that allow a non-programmer or non-data scientist to comfortably approach the basics of data mining and machine learning without knowing every last nuance of the underlying tech. The Orange data mining tool does this in a very hands-on and visual style and still manages to retain a large degree of advanced functionality and extensibility.
Are there other good options for data mining software/data visualization tools? Sure - but are they easy to learn, quick to configure/install and simple enough for your grandmother to use? Can you use them for rapid data mining prototyping to proof of concept your workflow? Are they free data mining tools? (Sorry RapidMiner and SPSS...)
Orange is a truly portable data mining app that does not require Windows admin rights for installation - key for those in the Enterprise that might want an easy install without the requisite IT helpdesk hassle.
Orange can even be installed as a pure Python application!
pip install orange3
Finally - to quickly weed out the remaining contenders on your dwindling data mining tools list, can it be installed on a Windows machine without administrator rights? Feel free to hit the lights on the way out...
Head on over to the Orange website (https://orange.biolab.si/download/) and download a copy. Alternatively I've included links here:
Machine learning: a five (er, ten) minute tour
Can you click buttons, drag widgets, find a spreadsheet file?
Assuming you are all sorted with the download and install, let's get cracking with a quick 5 minute whirlwind demonstration. Alternatively you could follow Biolab SI's excellent intro tutorial with titanic passenger data and decision trees...
Let's go ahead and start up Orange. From the startup dialog, choose 'New' and give your new workflow a name. Next hit 'OK' and you should see a blank canvas at which point we'll:
- Click the File icon on the Data tool tab
- Double-click the File widget on your canvas
- Click the folder icon in the dialog window and choose the 'iris.tab' dataset
One of the first things I like to do in my workflows is to go ahead and add a data table view as this helps me to quickly visualize the features before I start transformations or modeling. This is a simple drag and drop procedure:
- Click the 'Data Table' widget
- Drag from the right handle on the 'File' widget to the left handle on the 'Data Table' widget
- Double-click the 'Data Table' widget to open the data table dialog window
Whee this is fun and easy! We're going to data science up in here...
I can already hear you saying, "Irises? Seriously - I thought maybe I'd learn something so I can get a machine learning endorsement on my LinkedIn profile..."
A few things to point out about this dataset - you'll note there aren't a ton of features: just four features describing length and width of sepals and petals of Irises. "But wait a minute!" you say, "I see five columns in our data table..."
The first column is our target column and is what we will be attempting to predict with our machine learning workflow.
We're going to be looking at a specific realm of machine learning known as classification. Given these physical features of an Iris, we will create a basic machine learning model to determine which variety of Iris we have based on the combination of these four features.
This same concept can be applied to any number of classification problems - we choose what it is we want to classify (predict) and then supply the features we feel may be relevant for our model. Now there are a number of other data preparation steps that are key in developing a highly accurate classification model:
Preparing your data
- Feature Selection - is there additional data that might better describe or clarify our model?
  - Is there additional data we should be gathering to round out our model?
  - Is there data we should remove? (The day's temp for our Iris sample probably is not going to help)
- Data Preprocessing - how do we get it into a usable form?
  - Cleaning - look at your features and fix or remove missing data
  - Formatting - maybe it's in an Access database and needs to go into CSV format
  - Sampling - is our data sample size too small or is it too large for our CPU resources?
- Data Transformation - do you understand the problem space and the domain in which your algorithm will be used? (classification vs regression, etc.)
  - Decomposition - can we reduce a compound data structure into its elements? (car to wheel, axle, windshield counts)
  - Aggregation - can we start from the atomic and build up? (size + color + shape = small gray square)
  - Scale - some problems and algorithms may require the scale of features to be similar (KNN, Neural Nets, etc.)
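To make the scale point a little more concrete, here is a minimal sketch of min-max scaling in plain Python. It is not how Orange does it internally; the function name `min_max_scale` and the sample numbers are made up purely for illustration.

```python
def min_max_scale(rows):
    """Rescale each column of a list of numeric rows to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    scaled = []
    for row in rows:
        scaled.append([
            # A constant column has no spread, so map it to 0.0
            # instead of dividing by zero.
            (value - lo) / (hi - lo) if hi > lo else 0.0
            for value, lo, hi in zip(row, lows, highs)
        ])
    return scaled


# Made-up measurements: column 0 is in centimetres, column 1 in millimetres,
# so column 1 would dominate any distance calculation if left unscaled.
samples = [[5.0, 14.0], [6.0, 45.0], [7.0, 60.0]]
scaled_samples = min_max_scale(samples)
```

After scaling, both columns run from 0 to 1, so neither one drowns out the other just because of its units.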
Seriously - that was a lot of theory and conceptual side-tracking - let's get back to the fun stuff!! So we'll skip all the data preparation because fortunately the Iris dataset is in really good shape as is.
Let's go ahead and add one more data table to our canvas because we're going to split our data workflow...
What we are going to do here is create two flavors of our data - one will contain our training set and the other will contain our test data. We do this for several reasons - chiefly to guard against overfitting - which is a nasty little side-effect of over-optimizing on the same dataset: the model memorizes its training data and then stumbles on anything new.
Separating our data
Same process as in Figure 1 - just drag another data table out and connect it up to our file widget. We'll also go ahead and right-click on each data table and rename them to something more meaningful to our workflow. Call one Training Data and the other Test Data - it doesn't matter which is which at this point.
Why should we split our data? Could we not just build a model on all the available data?
+ Assessing model's predictive accuracy
+ Preventing overfitting of model to specific dataset
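If you are curious what a split looks like outside the canvas, here is a rough sketch in plain Python. The helper `train_test_split`, the 30% test fraction and the seed are my own assumptions for illustration, and the rows are just stand-in index numbers rather than real iris records.

```python
import random


def train_test_split(rows, test_fraction=0.3, seed=42):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    test_size = int(round(len(shuffled) * test_fraction))
    return shuffled[test_size:], shuffled[:test_size]


# Stand-in for the 150 iris rows: here just the row indices 0..149.
rows = list(range(150))
train_rows, test_rows = train_test_split(rows)
```

The important property is that no row lands in both lists, so the model is scored on samples it never saw during training.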
Evaluating our predictive models
Next we'll open up the Evaluate tab and drop the Test & Score widget onto the canvas (Figure 4, step 1). This will help us assess our predictive models against each other to determine which ones we want to use.
I know, I know - you thought we were only going to build one machine learning model and understood there would be no math...
Next we'll drag from our Training Data widget to the Test & Score widget and from our Test Data widget to the Test & Score widget (Figure 4, step 2). Next, double-click the connector from your Test Data widget to the Test & Score widget - this will pull up an Edit Links dialog box. If the connector doesn't go from Data to Data, single-click the line in the dialog box and then drag it from Data to Data as pictured in Figure 4, step 4. Repeat this exercise for the connector between Training Data and Test & Score, but this time make sure it goes from Data to Test Data.
Alright, that was a lot of steps, but hopefully you are starting to see that we can adjust how data is consumed by downstream widgets - this is useful later if and when you decide to start coding your workflow and want to think about the relations between objects in your processes.
Ok - let's bring on the machine learning models!
Machine learning model time
Click over to the Model tab and behold its glory! Now that you are suitably awed, drop the following widgets on the screen as pictured in Figure 5 below:
- Logistic Regression
- SVM (Support Vector Machine)
- Stochastic Gradient Descent
- kNN (k-Nearest Neighbors)
Next drag a connector from the left side of the Test & Score widget to each of these machine learning widgets we just placed.
Logistic Regression - yeah, you probably saw this one before in your statistics class(es) in college. It models class probabilities by way of odds and log odds. It's fast to train, fairly resistant to overfitting and really quick at classifying unknown samples.
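For the curious (and I promise this is as mathy as it gets), here is a small sketch of the odds and log odds idea in plain Python. The function names, weights and feature values are hypothetical, and this covers only the two-class case, not whatever Orange does under the hood for three iris varieties.

```python
import math


def odds(p):
    """Convert a probability into odds, e.g. 0.8 -> 4 (4 to 1)."""
    return p / (1 - p)


def log_odds(p):
    """The logit: the natural log of the odds."""
    return math.log(odds(p))


def sigmoid(z):
    """Inverse of the logit: turn log odds back into a probability."""
    return 1 / (1 + math.exp(-z))


def predict_probability(features, weights, bias):
    """Logistic regression treats the log odds as a weighted sum of features."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


# Hypothetical weights and features, chosen so the weighted sum is exactly 0,
# which corresponds to even odds (a probability of 0.5).
probability = predict_probability([1.0, 2.0], [0.5, -0.25], bias=0.0)
```

Training a real model boils down to finding the weights and bias that make these probabilities match the labeled samples as closely as possible.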
SVM - A support vector machine can be used for both linear and non-linear classification. It utilizes hyperplanes and n-dimensional space and because you were told there would be no math, I'll simply point you toward Wikipedia as a jumping off spot for more information: https://en.wikipedia.org/wiki/Support_vector_machine
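Stochastic Gradient Descent - Not just a good name for a Swedish Death Metal band, this is strictly speaking an optimization method rather than a model: it fits a linear model by nudging the weights a little after each sample. It is often seen in neural network training and is often used in text and natural language workflows.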
AdaBoost - Obviously a score of 1.0 would be the best-case scenario - the AdaBoost model chases this by combining weak learners, building model upon model (each one focusing on the samples its predecessors got wrong) until the training set is perfectly predicted or a maximum number of models is reached. Dr. Jason Brownlee does a fantastic job explaining this on his blog: https://machinelearningmastery.com/boosting-and-adaboost-for-machine-learning/
kNN - k-Nearest Neighbors is one of my favorites. It looks at the nearest neighboring samples to make predictions. You will need to tune k, which sets how many of the closest neighbors are used to make a prediction. Also, your data should be normalized to a similar scale. If you have a lot of data, be prepared to use a lot of memory.
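Since kNN is simple enough to fit on a napkin, here is a bare-bones sketch in plain Python (3.8+ for `math.dist`). The function `knn_predict` and the six training points are made up for illustration; they are iris-flavored (petal length, petal width) pairs, not actual rows from iris.tab.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Label a query point by majority vote among its k nearest training points."""
    # Pair every training point's Euclidean distance to the query with its label,
    # then sort so the closest points come first.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Made-up (petal length, petal width) pairs, two per variety.
train_points = [(1.4, 0.2), (1.5, 0.3), (4.5, 1.5), (4.7, 1.4), (6.0, 2.5), (5.8, 2.2)]
train_labels = ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"]

small_flower = knn_predict(train_points, train_labels, (1.6, 0.2), k=3)
large_flower = knn_predict(train_points, train_labels, (5.9, 2.3), k=3)
```

Notice that there is no training step at all: every prediction measures the distance to every stored sample, which is exactly why memory use grows with your dataset.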
Testing and Scoring
Our work here is almost complete! Let's double-click the Test & Score widget and look at our scores. Then go ahead and click the Report button in this dialog so we can record our observations (always a good practice when it's time to science...)
The scores for each of our machine learning methods are displayed. Let's do a quick run-through of what these metrics mean and then we'll look at our models in further detail:
- AUC - Area Under the (ROC) Curve - this is a quick way to assess the quality of your model: a random classifier should score 0.50 (50%) and a perfect classifier 1.0 (100%)
- CA - Classification Accuracy - the ratio of correct predictions to all samples, where 1.0 would be perfect accuracy
- Precision - ratio of true positives to the sum of true positives + false positives. This is indicative of a model's ability to not mislabel a negative sample as positive. 1.0 would indicate perfect precision.
- Recall - ratio of true positives to the sum of true positives + false negatives. This is indicative of a model's ability to find all the positive samples. 1.0 is perfect recall.
- F1 - this is the harmonic mean of precision and recall, where 1.0 is a perfect score.
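To take some of the mystery out of these numbers, here is a rough sketch of how they can be computed by hand in plain Python. The helper names `binary_scores` and `auc` and the tiny label lists are made up for illustration, and the sketch treats a single variety as "positive" against everything else, so don't expect it to reproduce the exact figures in the Test & Score dialog. F1 is computed as the harmonic mean of precision and recall.

```python
def binary_scores(actual, predicted, positive):
    """Return CA, precision, recall and F1, treating `positive` as the class of interest."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    correct = sum(1 for a, p in pairs if a == p)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"CA": correct / len(pairs), "Precision": precision, "Recall": recall, "F1": f1}


def auc(is_positive, scores):
    """Share of (positive, negative) pairs where the positive scores higher; ties count half."""
    pos = [s for flag, s in zip(is_positive, scores) if flag]
    neg = [s for flag, s in zip(is_positive, scores) if not flag]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up labels: 3 true virginica samples and 5 others.
actual = ["virginica", "virginica", "virginica", "other", "other", "other", "other", "other"]
predicted = ["virginica", "virginica", "other", "virginica", "virginica", "other", "other", "other"]
report = binary_scores(actual, predicted, positive="virginica")
```

With these toy labels there are 2 true positives, 2 false positives and 1 false negative, which is a handy way to see precision and recall pull apart even though they share the same numerator.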
Finally we get toward the end and now we get to see... something... There are a whole host of visualization widgets, and if you click over to the Visualize tab, we'll go ahead and choose two of them: Scatterplot and Heatmap. Drag and drop them on your canvas and connect from the right side of the Test & Score widget to each of these new widgets as in Figure 7:
Double-clicking on each of these widgets will open up graph dialogs with a lot of configurable items. I'll briefly touch on these but will save the deep-dive for another post.
Let's start with the Heatmap graph (Figure 8, step 1) - this will help us quickly visualize the features to see which attribute values are strongest or weakest in the prediction. We can also remove features upstream in our datasets to see secondary features which might be somewhat hidden by the strongest ones.
I've recolored the heatmap to go from orange to red, with the darker colors indicating the more relevant features in making the prediction. In this case we are isolating our Logistic Regression model, and we can see that sepal length and petal length appear to be the most indicative features - particularly in identifying the Iris-virginica variety.
Now observe the scatter plot (Figure 8, step 2) - we've set the x and y axes to reflect sepal and petal length and set our color to be our target (iris type). Clicking the Find Informative Projections button (Figure 8, step 3) confirms our observation from the heat map regarding the most indicative features in our prediction model.
Hopefully this was as fun and informative for you as it was for me - there are many, many directions to go from here - exploring the other models, tweaking their parameters, doing further curation on our dataset, etc.
In the near future, I'll follow this up with some further insights into these topics. It feels good to get these things out of my head...
Without further ado, I'll leave you with a short gallery of Iris photos I took several years ago at the Keizer Iris Festival in Oregon.