Author: BeingTechie

Taking the First Step to Machine Learning, Part 1: Visualizing the Data

Every machine learner’s journey begins with one of the simplest machine learning algorithms, linear regression. Many pages already describe the algorithm and how it works in detail, so we won’t do that here. Instead, we’ll simply carry on with our exploration as budding machine learners.

Another thing associated with machine learning beginners is the Iris data set.

Iris Dataset Repository

This dataset consists of three different species of Iris plant: setosa, versicolor and virginica.

Before we proceed, load the data:

import numpy as np
import pandas as pd
dataframe = pd.read_csv('Iris.csv')

#Note: If you downloaded the dataset from Kaggle, put the path where you stored the file inside the quotes, otherwise you will get an error. I downloaded it from Kaggle. If you are getting it from the UCI repository instead, you can either copy and paste the data into an Excel sheet or, if that doesn’t work, follow these steps:

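from urllib.request import urlretrieve
import pandas as pd

irisData = 'http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
#Save a local copy; 'iris.data' is just the local filename chosen here
urlretrieve(irisData, 'iris.data')
#The UCI file has no header row, so supply all five column names
column_Name = ['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm', 'Species']
dataframe = pd.read_csv('iris.data', sep=',', header=None, names=column_Name)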

Reference 1

Now that you have the data, you are all set!

To choose a suitable machine learning algorithm, you first need to know the data. Exploring it also helps you spot any missing values in the data set.

dataframe.shape
dataframe.info()
dataframe.ndim
dataframe.head()
dataframe.tail()
dataframe.describe()

Check the number of null values.
dataframe.isnull().sum()

It’s time to separate the inputs from the outputs.
Skipping the first column (Id), columns 1 to 4 are the inputs and column 5 (Species) is the output:
X = dataframe.iloc[:, 1:5].values
y = dataframe.iloc[:, 5].values
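
To see what this slicing picks out, here is a minimal sketch on a tiny hand-made frame. It assumes the Kaggle column layout (Id, four measurements, Species); the rows and the names toy, X_toy and y_toy are made up for illustration and are not taken from the real file.

```python
import pandas as pd

# Made-up rows that only mimic the Kaggle column layout (illustration only).
toy = pd.DataFrame({
    "Id": [1, 2, 3],
    "SepalLengthCm": [5.0, 6.0, 7.0],
    "SepalWidthCm": [3.5, 3.0, 3.2],
    "PetalLengthCm": [1.4, 4.5, 6.0],
    "PetalWidthCm": [0.2, 1.5, 2.5],
    "Species": ["Iris-setosa", "Iris-versicolor", "Iris-virginica"],
})

# Columns 1..4 are the measurements; column 5 is the label.
X_toy = toy.iloc[:, 1:5].values
y_toy = toy.iloc[:, 5].values

print(X_toy.shape)  # (3, 4)
print(y_toy)
```

If your frame has no Id column, the positions shift by one, so check dataframe.head() before reusing the same indices.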

If you have started with Iris, you have probably seen a 4×4 matrix of graphs showing how the attributes relate to each other, with 4 blank graphs on the principal diagonal. The remaining 12 graphs show each pair of attributes twice, with the axes swapped, so I cut the number of graphs down to 6 using the following steps:

colors = {'Iris-setosa':'red', 'Iris-versicolor':'green', 'Iris-virginica':'blue'}
#Creating the list of the column names
column_list = list(dataframe)
#Removing the first column name, that is Id
del column_list[0]
#Removing the last column name, that is Species
del column_list[len(column_list)-1]

The reason I am deleting the first and the last column is that they hold the row Id and the Species label, not measurements. So we won’t need the Id and Species columns for the axes. If your dataset is from UCI then I think you will only need to delete the Species column.

So far we have created a list of the columns that we will use to create the graphs. Now it’s time to create some plots. As I already mentioned, most of you have seen a graph for the Iris dataset something like this:

[Figure: Iris_dataset_scatterplot.svg]

But the following code is different: it plots only six graphs.

They are:

  1. SepalLength vs SepalWidth
  2. SepalLength vs PetalLength
  3. SepalLength vs PetalWidth
  4. SepalWidth vs PetalLength
  5. SepalWidth vs PetalWidth
  6. PetalLength vs PetalWidth

import matplotlib.pyplot as plt
import matplotlib.patches as mpatches

for i, x in enumerate(column_list):
    for y in range(i+1, len(column_list)):
        #Color every point by its species
        plt.scatter(dataframe[x], dataframe[column_list[y]], color = dataframe['Species'].apply(lambda s : colors[s]))
        plt.xlabel(x)
        plt.ylabel(column_list[y])
        #Patches for the legend, one per species
        red_patch = mpatches.Patch(color = 'red', label = "Iris-setosa")
        green_patch = mpatches.Patch(color = 'green', label = "Iris-versicolor")
        blue_patch = mpatches.Patch(color = 'blue', label = "Iris-virginica")
        plt.legend(handles = [red_patch, green_patch, blue_patch], loc = 'upper left', prop = {'size':6})
        plt.show()

The first two lines of code:

for i, x in enumerate(column_list):

for y in range(i+1, len(column_list)):

The enumerate() function helps programmers keep count of the iterations while looping over an iterable. ‘i’ keeps the count and ‘x’ is the current element, here a column name.

In the second for loop, ‘y’ is an index into column_list, ranging from i+1 to the last index of column_list, so column_list[y] is each column that comes after ‘x’.

E.g. when ‘x’ is SepalLength, column_list[y] ranges from SepalWidth to PetalWidth.
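
The pairing logic of the two loops can be checked without plotting anything. This is a small sketch that assumes the four Kaggle measurement column names; the pairs list is a helper name introduced only for this illustration.

```python
from itertools import combinations

# Column names as they appear in the Kaggle file (assumed here).
column_list = ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"]

pairs = []
for i, x in enumerate(column_list):
    # y is an index; it only visits columns that come after x.
    for y in range(i + 1, len(column_list)):
        pairs.append((x, column_list[y]))

for pair in pairs:
    print(pair)

# itertools.combinations yields the same pairs in the same order.
assert pairs == list(combinations(column_list, 2))
```

Starting the inner loop at i+1 is what avoids both the blank diagonal and the mirrored duplicates, leaving 6 pairs out of 16 cells.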

In the plt.scatter() call I have used a lambda function to color the points: red for setosa, green for versicolor and blue for virginica. It looks up the species name in dataframe['Species'] and picks the matching color from the colors dictionary.

The next three lines create the patches for the legend. In plt.legend() I have specified the location and size of the legend.

Graphs

These graphs may look different from the images displayed above because the labels on the X and Y axes have changed.

From these graphs you can see that PetalLength and PetalWidth are highly correlated, and the three species form fairly distinct groups in that plot. This suggests that the petal length and width of a flower alone go a long way towards finding the species to which it belongs.
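
One way to put a number on "highly correlated" is the Pearson correlation coefficient. The sketch below uses made-up values purely for illustration (they are not rows from the Iris file); with the real data you would pass the PetalLengthCm and PetalWidthCm columns instead.

```python
import numpy as np

# Made-up measurements for illustration only (not taken from the Iris file).
petal_length = np.array([1.4, 1.5, 4.5, 4.7, 5.9, 6.1])
petal_width = np.array([0.2, 0.3, 1.4, 1.5, 2.1, 2.4])

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is the
# Pearson correlation between the two inputs.
r = np.corrcoef(petal_length, petal_width)[0, 1]
print(round(r, 3))
```

A value close to 1 means the two measurements rise together almost linearly, which is what the scatter plot shows visually.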

In the next part I’ll be sharing my experience of using Multiple Regression with backward elimination on Iris dataset.

Bye!!

 


Numpy Part-2

In the last post I left out some points, which I’ll try to sum up here.

          1. While using vstack() (or its alias row_stack()), keep in mind that the number of columns must be equal in both arrays.
          2. Similarly, while using hstack() or column_stack(), keep in mind that there must be an equal number of rows in both matrices.
          3. The two methods concatenate() and append() perform the same function; the difference lies in their implementation. The append() method is implemented in terms of the former, concatenate(), and it flattens both arrays first if no axis is given.
          4. Just as there are methods for stacking, there are methods for splitting too: vsplit() and hsplit(). If you want to split at a particular index, keep the shape of the matrix in mind.
          5. You can also create an array of all ones with np.ones((3, 4)), an array of all zeros with np.zeros((3, 4)), an array filled with a constant with np.full((3, 3), 4, dtype=np.int64) and an array of random numbers with np.random.random((3, 4)).
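
The points above can be checked with a short sketch. The array names a, b, v, h and so on are chosen only for this illustration.

```python
import numpy as np

a = np.arange(6).reshape(2, 3)        # shape (2, 3)
b = np.arange(6, 12).reshape(2, 3)    # shape (2, 3)

v = np.vstack((a, b))                 # rows stacked: shape (4, 3)
h = np.hstack((a, b))                 # columns stacked: shape (2, 6)

# append() with an explicit axis gives the same result as concatenate().
same = np.array_equal(np.append(a, b, axis=0), np.concatenate((a, b), axis=0))

top, bottom = np.vsplit(v, 2)         # two (2, 3) blocks
left, right = np.hsplit(h, 2)         # two (2, 3) blocks

filled = np.full((3, 3), 4, dtype=np.int64)

print(v.shape, h.shape, same)
print(top.shape, left.shape, filled.dtype)
```

Splitting into equal parts only works when the axis length is divisible by the number of parts, which is why the shape matters in point 4.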

That is all for this time. I’ll add to these points if I gather more information.

Numpy

The first step in machine learning is to choose an appropriate tool to work with. I began my search for that tool and finally ended up with scikit-learn. But before beginning my journey, I started working on my skills with the libraries that are used along with scikit-learn.

The first library I chose to study was NumPy. As far as I have learnt, the NumPy library is used to handle arrays. For a more formal definition, Wikipedia says: “NumPy is a library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.”

Let’s begin with NumPy

import numpy as np

Typing the name numpy over and over while coding would be tedious, so it is imported under the short form np.
#Creating an Array
arr = np.array([1, 2, 3])
print(arr)

#Creating an array with range of nos. from 0 to 99 (the stop value 100 is excluded)
arrOne = np.arange(0, 100)
#Creating an array within the range along with step size
#(np.linspace takes a number of samples, not a step size)
arrTwo = np.arange(1, 3, 0.5)
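
The two range helpers are easy to confuse: np.arange takes a step size, while np.linspace takes a number of samples. A small sketch of the difference, with variable names chosen just for this example:

```python
import numpy as np

# arange: start, stop (excluded), step
stepped = np.arange(1, 3, 0.5)
# linspace: start, stop (included by default), number of samples
spaced = np.linspace(1, 3, 5)

print(stepped.tolist())  # [1.0, 1.5, 2.0, 2.5]
print(spaced.tolist())   # [1.0, 1.5, 2.0, 2.5, 3.0]
```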

In Python, lists and arrays may look like two forms of the same thing, but they differ a lot in the memory they occupy and the time taken to process them. Try running this code on your system.

#Comparing the list and array
import sys
import time
import numpy as np

numList = list(range(0, 1000))
arrThree = np.arange(0, 1000)
#Approximate memory used by the list elements vs. the array data
print(sys.getsizeof(1) * len(numList))
print(arrThree.size * arrThree.itemsize)

#Time an element-wise addition with lists
L1 = range(1000)
L2 = range(1000)
start = time.time()
result = [(x+y) for x,y in zip(L1,L2)]
print(time.time()-start)
#Time the same addition with arrays
start = time.time()
AR1 = np.arange(1000)
AR2 = np.arange(1000)
result = AR1 + AR2
print(time.time()-start)

If both timings come out the same (for example, both print 0.0), try increasing the range.
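
For steadier numbers than a single time.time() difference, the standard timeit module can repeat the measurement. This is a sketch with a larger range; n, list_time and array_time are names picked for the example, and the actual timings depend on your machine.

```python
import timeit
import numpy as np

n = 100_000
L1, L2 = list(range(n)), list(range(n))
A1, A2 = np.arange(n), np.arange(n)

# Run each version 10 times per measurement, 3 measurements, keep the best.
list_time = min(timeit.repeat(lambda: [x + y for x, y in zip(L1, L2)],
                              number=10, repeat=3))
array_time = min(timeit.repeat(lambda: A1 + A2, number=10, repeat=3))

print(f"list comprehension: {list_time:.4f} s")
print(f"numpy addition:     {array_time:.4f} s")
```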

There are some attributes and functions in NumPy that help to gain better insight into the array.
#Understanding the array

arrFour = np.array([(1, 2, 3), (4, 5, 6),(7, 8, 9)])
print(arrFour.ndim) #Output : 2 (dimension)
print(arrFour.itemsize) #Output : 4 (item size in bytes; 8 where the default integer is int64)
print(arrFour.dtype) #Output : int32 (data type; int64 on many platforms)
print(arrFour.size) #Output : 9 (no. of elements)
print(arrFour.shape) #Output : (3, 3) (shape)
print(arrFour.reshape(9,1))

In NumPy, reshaping preserves the size of the array: no element is added or deleted while reshaping, it is just a transformation from one form to another.
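
A short sketch of what "preserves the size" means in practice; grid, column and auto are names used only in this example.

```python
import numpy as np

grid = np.arange(1, 10).reshape(3, 3)   # 9 elements as 3x3
column = grid.reshape(9, 1)             # the same 9 elements as 9x1
auto = grid.reshape(-1)                 # -1 lets NumPy infer the length

print(grid.size, column.size, auto.size)

# A shape whose element count differs from 9 is rejected.
try:
    grid.reshape(2, 4)
except ValueError as err:
    print("reshape failed:", err)
```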

Two different arrays can be stacked on one another either horizontally or vertically.


arrFive = np.array([(1, 2, 3), (4, 5, 6)])
arrSix = np.array([(7, 8, 9), (10, 11, 12)])
#Column wise stacking
np.hstack((arrFive, arrSix))
#Row wise stacking
np.vstack((arrFive, arrSix))

All the elements of an n-dimensional array can be laid out in 1-D by using the function ravel() or flatten().


arrSeven = np.array([(1, 2, 3), (4, 5, 6), (7, 8, 9)])
arrSeven.ravel()
arrSeven.flatten()

The output of both functions will be the same; the only difference is that flatten() always returns a copy of the array, whereas ravel() returns a view of the original array whenever possible.
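
The copy-versus-view difference shows up as soon as you write to the result. A minimal sketch, with names chosen for this example:

```python
import numpy as np

original = np.array([(1, 2, 3), (4, 5, 6), (7, 8, 9)])

flat_copy = original.flatten()   # always a copy
flat_view = original.ravel()     # a view here, since the array is contiguous

flat_copy[0] = 100               # does not touch `original`
flat_view[1] = 200               # writes through to `original`

print(original[0])                             # first row after both writes
print(np.shares_memory(original, flat_copy))   # False
print(np.shares_memory(original, flat_view))   # True
```

So prefer flatten() when you intend to modify the result independently, and ravel() when you want to avoid an extra copy.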

To get a view of a particular set of rows or columns, the following code can be used:

arrEight = np.array([(1, 2, 3), (4, 5, 6), (7, 8, 9)])
print(arrEight[0:2])
print(arrEight[0:2, 2])

In Python, array indexing starts from 0. In this code snippet the first statement prints the rows (1, 2, 3) and (4, 5, 6), while the second statement prints the 3rd element of each of those rows, that is [3, 6].

There are also some standard mathematical operations that can be performed on arrays, such as square root (sqrt), logarithm (log or log10), standard deviation (std), sin etc.


arrNine = np.array([(1, 2, 3), (4, 5, 6),(7, 8, 9)])
print(np.sqrt(arrNine))
print(np.std(arrNine))
print(np.log10(arrNine))
print(np.log(arrNine))

Communication in Aviation

If you’re a frequent air traveller then you must have experienced delays and cancellations of flights due to bad weather.

To us this is easily conveyed through calls or messages from the airlines, but what about the aeroplanes that are still in the sky, preparing to land or on their way to their destination?

Obviously there is an Air Traffic Control (ATC) room to take care of such situations, but think about the number of flights to be informed; passing this information along as text messages would be a tedious task. Even if we ignore that for a second, what if the pilot is told about the problem but, coming from different backgrounds and languages, the two do not understand each other?

And in such a situation the person in the ATC room cannot dedicate more than a minute or two to this. So now what?

The solution to this problem is NOTAM.

Notice to Airmen, abbreviated as NOTAM, is a notice filed with an aviation authority to alert pilots of potential hazards along a flight route or at a location that could affect the safety of the flight.

It is done in two steps:

  1. NOTAM is filed with an aviation authority to alert pilots about any en route hazard.
  2. Authority in turn provides means to disseminate relevant NOTAM to the pilot.

There are different versions of NOTAM based on their usage.

  • BIRDTAM: when there is a passage of flock of birds in airspace
  • SNOWTAM: notification of runway status w.r.t. snow, ice and standing water
  • ASHTAM: notification of significant change in volcanic ash or dust contamination.

There are many other reasons to issue NOTAMs, such as the temporary erection of a tall obstruction, inoperable lights on a tall obstruction, military exercises or flights of important people.

There are various sites which allow airline dispatchers, pilots and airport authority personnel to search for active NOTAMs.

There are also apps and websites available to decode NOTAMs.

For more information regarding NOTAM decoding please go through this link:

http://thinkaviation.net/notams-decoded/

PS: NOTAM is not actually a two-step procedure, but for better understanding I have broken it into two steps 🙂

Enjoy Learning!!