DataSci&MachLearn Workflow 1/4
Inspired by a fabulous post on kaggle:
https://www.kaggle.com/ldfreeman3/a-data-science-framework-to-achieve-99-accuracy
A data science framework
- Define the Problem: Determine the key requirements and pick a suitable algorithm.
- Gather the Data: Collect all relevant data, using whatever sources and techniques are available.
- Prepare Data for Consumption: This step is often referred to as data wrangling, a required process to turn “wild” data into “manageable” data. Data wrangling includes implementing data architectures for storage and processing, developing data governance standards for quality and control, data extraction (i.e. ETL and web scraping), and data cleaning to identify aberrant, missing, or outlier data points.
- Perform Exploratory Analysis: Look for potential problems, patterns, correlations, etc. using descriptive and graphical statistics. A basic rule of data science: garbage in, garbage out (GIGO). In addition, data categorization (i.e. qualitative vs. quantitative) is important for understanding the data and selecting the correct hypothesis test or data model.
- Model Data: Like descriptive and inferential statistics, data modeling can either summarize the data or predict future outcomes. Your dataset and expected results will determine the algorithms available for use. It’s important to remember that algorithms are tools, not magic wands or silver bullets. You should focus on the selection and application of the tools.
- Validate and Implement Data Model: After you’ve trained your model on a subset of your data, it’s time to test it. This helps ensure you haven’t overfit the model or made it so specific to the selected subset that it does not accurately fit another subset from the same dataset. In this step we determine whether our model overfits, generalizes, or underfits our dataset.
- Optimize and Strategize: This is the “bionic man” step, where you iterate back through the process to make it better, stronger, faster than it was before. As a data scientist, your strategy should be to outsource developer operations and application plumbing, so you have more time to focus on recommendations and design.
4 C’s for Data Cleaning
- Correcting aberrant values and outliers:
Unless you can confirm that a value is truly absurd, do not modify it; keeping genuine values makes for an accurate model.
- Completing missing information:
When missing data make up a noticeable portion of the dataset, imputing them is recommended rather than deleting them.
For quantitative data, the mean, the median, or the mean plus a randomized standard deviation is often used.
For qualitative data, features with many missing values are often dropped, or completed with the mode (the most frequent value).
- Creating new features for analysis
- Converting fields to the appropriate format for calculation and illustration (see the sketch after this list):
E.g., numpy.ndarray for CPU calculations, tensors for GPU calculations.
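A minimal sketch of the format-conversion point on a toy dataframe; the PyTorch lines are left commented out as an assumption (they only illustrate the GPU-tensor case and are not part of the original notes):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Age": [22.0, 38.0, 26.0], "Fare": [7.25, 71.28, 7.92]})

# numpy.ndarray for CPU calculations
arr = df[["Age", "Fare"]].to_numpy(dtype=np.float32)

# tensor for GPU calculations (assumes PyTorch is installed and a GPU is available)
# import torch
# t = torch.from_numpy(arr).to("cuda")
```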
Getting familiar with data & checking null data
Use `isnull()` or `isna()` to find missing data.
Or simply use `df.describe()` to get an overview of the data.
`df.sample()` can also be used to draw a few random rows from the dataset.
```
# get the unique values of a series
In: seri.unique()
# count how many times each value occurs
In: seri.value_counts()

In: df.isnull().sum()
Out:
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In: df.describe(include='all')
Out:
```
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 891.000000 | 891.000000 | 891.000000 | 891 | 891 | 714.000000 | 891.000000 | 891.000000 | 891 | 891.000000 | 204 | 889 |
| unique | NaN | NaN | NaN | 891 | 2 | NaN | NaN | NaN | 681 | NaN | 147 | 3 |
| top | NaN | NaN | NaN | Lindahl, Miss. Agda Thorilda Viktoria | male | NaN | NaN | NaN | 1601 | NaN | C23 C25 C27 | S |
| freq | NaN | NaN | NaN | 1 | 577 | NaN | NaN | NaN | 7 | NaN | 4 | 644 |
| mean | 446.000000 | 0.383838 | 2.308642 | NaN | NaN | 29.699118 | 0.523008 | 0.381594 | NaN | 32.204208 | NaN | NaN |
| std | 257.353842 | 0.486592 | 0.836071 | NaN | NaN | 14.526497 | 1.102743 | 0.806057 | NaN | 49.693429 | NaN | NaN |
| min | 1.000000 | 0.000000 | 1.000000 | NaN | NaN | 0.420000 | 0.000000 | 0.000000 | NaN | 0.000000 | NaN | NaN |
| 25% | 223.500000 | 0.000000 | 2.000000 | NaN | NaN | 20.125000 | 0.000000 | 0.000000 | NaN | 7.910400 | NaN | NaN |
| 50% | 446.000000 | 0.000000 | 3.000000 | NaN | NaN | 28.000000 | 0.000000 | 0.000000 | NaN | 14.454200 | NaN | NaN |
| 75% | 668.500000 | 1.000000 | 3.000000 | NaN | NaN | 38.000000 | 1.000000 | 0.000000 | NaN | 31.000000 | NaN | NaN |
| max | 891.000000 | 1.000000 | 3.000000 | NaN | NaN | 80.000000 | 8.000000 | 6.000000 | NaN | 512.329200 | NaN | NaN |
Cleaning data
```python
###COMPLETING: complete or delete missing values in train and test/validation dataset
for dataset in data_cleaner:
    #complete missing age with median
    dataset['Age'].fillna(dataset['Age'].median(), inplace = True)

    #complete embarked with mode
    dataset['Embarked'].fillna(dataset['Embarked'].mode()[0], inplace = True)

    #complete missing fare with median
    dataset['Fare'].fillna(dataset['Fare'].median(), inplace = True)

#delete the cabin feature/column and others previously stated to exclude in train dataset
drop_column = ['PassengerId', 'Cabin', 'Ticket']
data1.drop(drop_column, axis=1, inplace = True)

# Double check the data cleaning results
print(data1.isnull().sum())
print("-"*10)
print(data_val.isnull().sum())
```
Supplementary Notes
- `df.mode()` may return a dataframe, because a series can have more than one mode. A typical solution is to take the first mode, e.g. `df['col'].fillna(df['col'].mode()[0])`.
- `df.apply()` runs a function over a given row or column.
`df.applymap()` runs a function element-wise over the whole dataframe.
`Series.map()` (i.e. `df['col'].map()`) works for dictionary replacement.
`df.groupby()` is also commonly combined with these functions for quick feature exploration; a short sketch of all four follows.
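A minimal sketch of these four helpers on a toy dataframe (the values are illustrative only):

```python
import pandas as pd

df = pd.DataFrame({"Sex": ["male", "female", "male"],
                   "Age": [22, 38, 26],
                   "Fare": [7.25, 71.2833, 7.925]})

# apply: run a function over one column (or over rows with axis=1)
df["FareRounded"] = df["Fare"].apply(round)

# applymap: element-wise function over the whole dataframe
as_floats = df[["Age", "Fare"]].applymap(float)

# map: dictionary replacement on a Series
df["Sex_Code"] = df["Sex"].map({"male": 0, "female": 1})

# groupby: quick aggregated view, often combined with the functions above
print(df.groupby("Sex")["Fare"].mean())
```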
Generating new features
In this example, three new features are generated: family size, whether the passenger is alone (IsAlone), and the title extracted from the name.
Some handy pandas functions are used along the way.
```python
###CREATE: Feature Engineering for train and test/validation dataset
for dataset in data_cleaner:
    #Discrete variables
    dataset['FamilySize'] = dataset['SibSp'] + dataset['Parch'] + 1

    dataset['IsAlone'] = 1 #initialize to yes/1 is alone
    dataset.loc[dataset['FamilySize'] > 1, 'IsAlone'] = 0 # now update to no/0 if family size is greater than 1

    #quick and dirty code split title from name: http://www.pythonforbeginners.com/dictionary/python-split
    dataset['Title'] = dataset['Name'].str.split(", ", expand=True)[1].str.split(".", expand=True)[0]

    #Continuous variable bins; qcut vs cut: https://stackoverflow.com/questions/30211923/what-is-the-difference-between-pandas-qcut-and-pandas-cut
    #Fare Bins/Buckets using qcut or frequency bins: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.qcut.html
    dataset['FareBin'] = pd.qcut(dataset['Fare'], 4)

    #Age Bins/Buckets using cut or value bins: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.cut.html
    dataset['AgeBin'] = pd.cut(dataset['Age'].astype(int), 5)

#cleanup rare title names
#print(data1['Title'].value_counts())
stat_min = 10 #while small is arbitrary, we'll use the common minimum in statistics: http://nicholasjjackson.com/2012/03/08/sample-size-is-10-a-magic-number/
title_names = (data1['Title'].value_counts() < stat_min) #this will create a true/false series with title name as index

#apply and lambda functions are quick and dirty code to find and replace with fewer lines of code: https://community.modeanalytics.com/python/tutorial/pandas-groupby-and-python-lambda-functions/
data1['Title'] = data1['Title'].apply(lambda x: 'Misc' if title_names.loc[x] == True else x)
print(data1['Title'].value_counts())
print("-"*10)
```
Supplementary notes
- `pd.qcut(arr, n)` divides an array into n groups with an equal number of data points in each group, while `pd.cut(arr, n)` divides the array into n groups of equal value range, regardless of frequency (see the sketch below).
- `split()` splits a string on the given separator. In the example above, a string in `Name` such as ‘Jermyn, Miss. Annie’ is first split on the comma ‘,’ to take the latter half “Miss. Annie”, then split on the period ‘.’ to take the former half “Miss”.
- ‘Misc’ is short for miscellaneous.
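A small sketch contrasting `qcut` with `cut` and reproducing the two-step name split (the toy values are illustrative only):

```python
import pandas as pd

ages = pd.Series([2, 18, 22, 25, 30, 34, 45, 60, 80])

# qcut: equal-frequency bins -> roughly the same number of rows per bin
print(pd.qcut(ages, 3).value_counts())

# cut: equal-width bins -> the same value range per bin, regardless of frequency
print(pd.cut(ages, 3).value_counts())

# the two-step split used for the Title feature above
name = pd.Series(["Jermyn, Miss. Annie"])
title = name.str.split(", ", expand=True)[1].str.split(".", expand=True)[0]
print(title[0])  # Miss
```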
Convert format
```python
#CONVERT: convert objects to category using Label Encoder for train and test/validation dataset

#code categorical data
from sklearn.preprocessing import LabelEncoder

label = LabelEncoder()
for dataset in data_cleaner:
    dataset['Sex_Code'] = label.fit_transform(dataset['Sex'])
    dataset['Embarked_Code'] = label.fit_transform(dataset['Embarked'])
    dataset['Title_Code'] = label.fit_transform(dataset['Title'])
    dataset['AgeBin_Code'] = label.fit_transform(dataset['AgeBin'])
    dataset['FareBin_Code'] = label.fit_transform(dataset['FareBin'])

#define y variable aka target/outcome
Target = ['Survived']

#define x variables for original features aka feature selection
data1_x = ['Sex', 'Pclass', 'Embarked', 'Title', 'SibSp', 'Parch', 'Age', 'Fare', 'FamilySize', 'IsAlone'] #pretty name/values for charts
data1_x_calc = ['Sex_Code', 'Pclass', 'Embarked_Code', 'Title_Code', 'SibSp', 'Parch', 'Age', 'Fare'] #coded for algorithm calculation
data1_xy = Target + data1_x
print('Original X Y: ', data1_xy, '\n')

#define x variables for original w/bin features to remove continuous variables
data1_x_bin = ['Sex_Code', 'Pclass', 'Embarked_Code', 'Title_Code', 'FamilySize', 'AgeBin_Code', 'FareBin_Code']
data1_xy_bin = Target + data1_x_bin
print('Bin X Y: ', data1_xy_bin, '\n')

#define x and y variables for dummy features original
data1_dummy = pd.get_dummies(data1[data1_x])
data1_x_dummy = data1_dummy.columns.tolist()
data1_xy_dummy = Target + data1_x_dummy
print('Dummy X Y: ', data1_xy_dummy, '\n')

data1_dummy.head()
```
Supplementary Notes
- Categorical encoding assigns a number to each type of label, which is easier for algorithms to process.
- `sklearn.preprocessing.LabelEncoder()` can learn and convert labels (numerical or not) into numerical labels.
`fit_transform(y)` fits the label encoder and returns the encoded labels.
For more, see: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html
A similar result can be obtained by casting the column to the category dtype and then converting it into codes:
`df[<col>].astype('category')`
```python
# convert the column into category
obj_df["body_style"] = obj_df["body_style"].astype('category')

# encode the converted column
# and concat the encoding results to the dataframe
obj_df["body_style_cat"] = obj_df["body_style"].cat.codes
```
The result is:
| | make | body_style | drive_wheels | body_style_cat |
|---|---|---|---|---|
| 0 | alfa-romero | convertible | rwd | 0 |
| 1 | alfa-romero | convertible | rwd | 0 |
| 2 | alfa-romero | hatchback | rwd | 2 |
| 3 | audi | sedan | fwd | 3 |
| 4 | audi | sedan | 4wd | 3 |
- One-hot encoding (dummy encoding) turns each label into a binary indicator column.
- `sklearn.preprocessing.OneHotEncoder(categories='auto', drop=None, sparse=True, handle_unknown='error')` can drop the 'first' column or drop 'if_binary'; the categories can also be specified explicitly. `sparse=True` returns a sparse matrix.
`df_dummy = pd.get_dummies(df)` also returns a dataframe with one-hot encoded columns (a pandas-only sketch follows the table below).
A one-hot encoded dataframe could look like this (the sklearn encoder is used in this demo):
```python
from sklearn.preprocessing import OneHotEncoder

oe_style = OneHotEncoder()
oe_results = oe_style.fit_transform(obj_df[["body_style"]])
pd.DataFrame(oe_results.toarray(), columns=oe_style.categories_).head()
```
| | convertible | hardtop | hatchback | sedan | wagon |
|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 1 | 0 | 0 |
| 3 | 0 | 0 | 0 | 1 | 0 |
| 4 | 0 | 0 | 0 | 1 | 0 |
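For comparison, a minimal sketch of the `pd.get_dummies` route mentioned above, using a small stand-in for `obj_df`:

```python
import pandas as pd

# small stand-in for obj_df, with only the column we want to encode
obj_df = pd.DataFrame({"body_style": ["convertible", "convertible", "hatchback",
                                      "sedan", "sedan"]})

# one indicator column per category, prefixed with the original column name
dummies = pd.get_dummies(obj_df["body_style"], prefix="body_style")
print(dummies.head())
```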
Da-Double Check
Double check the cleaned data before putting it into model training.
`df.info()` is recommended, for example:
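(Here `data1` and `data_val` are the train and validation frames from the cleaning step above.)

```python
# dtypes, non-null counts and memory usage in a single call
data1.info()
print("-"*10)
data_val.info()
```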
Train & Test Splitting
Data are usually separated into three parts: train, test, and validation. The exact definitions vary, but they all convey the same idea: the algorithm should not be tweaked based on results from the final hold-out data.
The train/test proportion is usually 75/25.
A quick split can be done with `sklearn.model_selection.train_test_split(x, y, random_state=0)`, while cross-validation may be used when training and comparing many models, to make fuller use of the dataset, as sketched below.
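A minimal sketch, assuming the `data1`, `data1_x_calc` and `Target` names defined in the blocks above are in scope; the `DecisionTreeClassifier` is only a placeholder model for illustration, not the algorithm chosen in the original post:

```python
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# default test_size gives the usual 75/25 train/test proportion
train_x, test_x, train_y, test_y = train_test_split(
    data1[data1_x_calc], data1[Target], random_state=0)

# cross-validation: every row is used for both fitting and scoring across the folds
model = DecisionTreeClassifier(random_state=0)
scores = cross_val_score(model, data1[data1_x_calc], data1[Target].values.ravel(), cv=5)
print(scores.mean(), "+/-", scores.std())
```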