DataSci&MachLearn Workflow 1/4
Inspired by a fabulous post on kaggle:
https://www.kaggle.com/ldfreeman3/a-data-science-framework-to-achieve-99-accuracy
A data science framework
- Define the Problem: Determine the key requirements and pick a suitable algorithm.
- Gather the Data: Collect all relevant data, using whatever sources and techniques are available.
- Prepare Data for Consumption: This step is often referred to as data wrangling, a required process to turn “wild” data into “manageable” data. Data wrangling includes implementing data architectures for storage and processing, developing data governance standards for quality and control, data extraction (i.e. ETL and web scraping), and data cleaning to identify aberrant, missing, or outlier data points.
- Perform Exploratory Analysis: Look for potential problems, patterns, correlations, etc. using descriptive and graphical statistics. A basic rule of data science: garbage in, garbage out (GIGO). In addition, data categorization (i.e. qualitative vs. quantitative) is important for understanding the data and selecting the correct hypothesis test or data model.
- Model Data: Like descriptive and inferential statistics, data modeling can either summarize the data or predict future outcomes. Your dataset and expected results will determine the algorithms available for use. It’s important to remember that algorithms are tools, not magic wands or silver bullets. You should focus on the selection and application of the tools.
- Validate and Implement Data Model: After you’ve trained your model on a subset of your data, it’s time to test it. This helps ensure you haven’t overfit the model or made it so specific to the selected subset that it does not accurately fit another subset from the same dataset. In this step we determine whether our model overfits, generalizes, or underfits our dataset.
- Optimize and Strategize: This is the “bionic man” step, where you iterate back through the process to make it better, stronger, faster than it was before. As a data scientist, your strategy should be to outsource developer operations and application plumbing, so you have more time to focus on recommendations and design.
4 C’s for Data Cleaning
- Correcting aberrant values and outliers:
Unless you can confirm that a value is truly absurd, do not modify it; keeping genuine values makes for an accurate model.
- Completing missing information:
When missing data make up a noticeable portion of the dataset, imputing them is recommended rather than deleting them.
For quantitative data, the mean, the median, or the mean plus a randomized standard deviation is often used.
For qualitative data, features with many missing values are often dropped, or completed with the mode (the most frequent value).
- Creating new features for analysis
- Converting fields to the appropriate format for calculation and illustration (see the sketch after this list):
E.g., numpy.ndarray for CPU calculations, tensors for GPU calculations.
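A minimal sketch of the format-conversion point on a toy dataframe; the PyTorch lines are left commented out as an assumption (they only illustrate the GPU-tensor case and are not part of the original notes):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Age": [22.0, 38.0, 26.0], "Fare": [7.25, 71.28, 7.92]})

# numpy.ndarray for CPU calculations
arr = df[["Age", "Fare"]].to_numpy(dtype=np.float32)

# tensor for GPU calculations (assumes PyTorch is installed and a GPU is available)
# import torch
# t = torch.from_numpy(arr).to("cuda")
```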
Getting familiar with data & checking null data
Use `isnull()` or `isna()` to find missing data.
Or simply use `df.describe()` to get an overview of the data.
`df.sample()` can also be used to draw a few random rows from the dataset.
```
# get the unique values of a series
In: seri.unique()
# count how many times each value occurs
In: seri.value_counts()

In: df.isnull().sum()
Out:
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In: df.describe(include='all')
Out:
```
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 891.000000 | 891.000000 | 891.000000 | 891 | 891 | 714.000000 | 891.000000 | 891.000000 | 891 | 891.000000 | 204 | 889 |
| unique | NaN | NaN | NaN | 891 | 2 | NaN | NaN | NaN | 681 | NaN | 147 | 3 |
| top | NaN | NaN | NaN | Lindahl, Miss. Agda Thorilda Viktoria | male | NaN | NaN | NaN | 1601 | NaN | C23 C25 C27 | S |
| freq | NaN | NaN | NaN | 1 | 577 | NaN | NaN | NaN | 7 | NaN | 4 | 644 |
| mean | 446.000000 | 0.383838 | 2.308642 | NaN | NaN | 29.699118 | 0.523008 | 0.381594 | NaN | 32.204208 | NaN | NaN |
| std | 257.353842 | 0.486592 | 0.836071 | NaN | NaN | 14.526497 | 1.102743 | 0.806057 | NaN | 49.693429 | NaN | NaN |
| min | 1.000000 | 0.000000 | 1.000000 | NaN | NaN | 0.420000 | 0.000000 | 0.000000 | NaN | 0.000000 | NaN | NaN |
| 25% | 223.500000 | 0.000000 | 2.000000 | NaN | NaN | 20.125000 | 0.000000 | 0.000000 | NaN | 7.910400 | NaN | NaN |
| 50% | 446.000000 | 0.000000 | 3.000000 | NaN | NaN | 28.000000 | 0.000000 | 0.000000 | NaN | 14.454200 | NaN | NaN |
| 75% | 668.500000 | 1.000000 | 3.000000 | NaN | NaN | 38.000000 | 1.000000 | 0.000000 | NaN | 31.000000 | NaN | NaN |
| max | 891.000000 | 1.000000 | 3.000000 | NaN | NaN | 80.000000 | 8.000000 | 6.000000 | NaN | 512.329200 | NaN | NaN |
Cleaning data
```python
###COMPLETING: complete or delete missing values in train and test/validation dataset
for dataset in data_cleaner:
    #complete missing age with median
    dataset['Age'].fillna(dataset['Age'].median(), inplace = True)

    #complete embarked with mode
    dataset['Embarked'].fillna(dataset['Embarked'].mode()[0], inplace = True)

    #complete missing fare with median
    dataset['Fare'].fillna(dataset['Fare'].median(), inplace = True)

#delete the cabin feature/column and others previously stated to exclude in train dataset
drop_column = ['PassengerId', 'Cabin', 'Ticket']
data1.drop(drop_column, axis=1, inplace = True)

# Double check the data cleaning results
print(data1.isnull().sum())
print("-"*10)
print(data_val.isnull().sum())
```
Supplementary Notes
- `df.mode()` may return a dataframe, because a series can have more than one mode. A typical solution is to take the first mode, e.g. `df['col'].fillna(df['col'].mode()[0])`.
- `df.apply()` runs a function over a given row or column.
`df.applymap()` runs a function element-wise over the whole dataframe.
`Series.map()` (i.e. `df['col'].map()`) works for dictionary replacement.
`df.groupby()` is also commonly combined with these functions for quick feature exploration; a short sketch of all four follows.
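A minimal sketch of these four helpers on a toy dataframe (the values are illustrative only):

```python
import pandas as pd

df = pd.DataFrame({"Sex": ["male", "female", "male"],
                   "Age": [22, 38, 26],
                   "Fare": [7.25, 71.2833, 7.925]})

# apply: run a function over one column (or over rows with axis=1)
df["FareRounded"] = df["Fare"].apply(round)

# applymap: element-wise function over the whole dataframe
as_floats = df[["Age", "Fare"]].applymap(float)

# map: dictionary replacement on a Series
df["Sex_Code"] = df["Sex"].map({"male": 0, "female": 1})

# groupby: quick aggregated view, often combined with the functions above
print(df.groupby("Sex")["Fare"].mean())
```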
Generating new features
In this example, three new features are generated: family size, whether the passenger is alone (IsAlone), and the title extracted from the name.
Some handy pandas functions are used along the way.
```python
###CREATE: Feature Engineering for train and test/validation dataset
for dataset in data_cleaner:
    #Discrete variables
    dataset['FamilySize'] = dataset['SibSp'] + dataset['Parch'] + 1

    dataset['IsAlone'] = 1 #initialize to yes/1 is alone
    dataset.loc[dataset['FamilySize'] > 1, 'IsAlone'] = 0 # now update to no/0 if family size is greater than 1

    #quick and dirty code split title from name: http://www.pythonforbeginners.com/dictionary/python-split
    dataset['Title'] = dataset['Name'].str.split(", ", expand=True)[1].str.split(".", expand=True)[0]

    #Continuous variable bins; qcut vs cut: https://stackoverflow.com/questions/30211923/what-is-the-difference-between-pandas-qcut-and-pandas-cut
    #Fare Bins/Buckets using qcut or frequency bins: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.qcut.html
    dataset['FareBin'] = pd.qcut(dataset['Fare'], 4)

    #Age Bins/Buckets using cut or value bins: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.cut.html
    dataset['AgeBin'] = pd.cut(dataset['Age'].astype(int), 5)

#cleanup rare title names
#print(data1['Title'].value_counts())
stat_min = 10 #while small is arbitrary, we'll use the common minimum in statistics: http://nicholasjjackson.com/2012/03/08/sample-size-is-10-a-magic-number/
title_names = (data1['Title'].value_counts() < stat_min) #this will create a true/false series with title name as index

#apply and lambda functions are quick and dirty code to find and replace with fewer lines of code: https://community.modeanalytics.com/python/tutorial/pandas-groupby-and-python-lambda-functions/
data1['Title'] = data1['Title'].apply(lambda x: 'Misc' if title_names.loc[x] == True else x)
print(data1['Title'].value_counts())
print("-"*10)
```
Supplementary notes
- `pd.qcut(arr, n)` divides an array into n groups with an equal number of data points in each group, while `pd.cut(arr, n)` divides the array into n groups of equal value range, regardless of frequency (see the sketch below).
- `split()` splits a string on the given separator. In the example above, a string in `Name` such as ‘Jermyn, Miss. Annie’ is first split on the comma ‘,’ to take the latter half “Miss. Annie”, then split on the period ‘.’ to take the former half “Miss”.
- ‘Misc’ is short for miscellaneous.
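A small sketch contrasting `qcut` with `cut` and reproducing the two-step name split (the toy values are illustrative only):

```python
import pandas as pd

ages = pd.Series([2, 18, 22, 25, 30, 34, 45, 60, 80])

# qcut: equal-frequency bins -> roughly the same number of rows per bin
print(pd.qcut(ages, 3).value_counts())

# cut: equal-width bins -> the same value range per bin, regardless of frequency
print(pd.cut(ages, 3).value_counts())

# the two-step split used for the Title feature above
name = pd.Series(["Jermyn, Miss. Annie"])
title = name.str.split(", ", expand=True)[1].str.split(".", expand=True)[0]
print(title[0])  # Miss
```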
Convert format
```python
#CONVERT: convert objects to category using Label Encoder for train and test/validation dataset

#code categorical data
from sklearn.preprocessing import LabelEncoder

label = LabelEncoder()
for dataset in data_cleaner:
    dataset['Sex_Code'] = label.fit_transform(dataset['Sex'])
    dataset['Embarked_Code'] = label.fit_transform(dataset['Embarked'])
    dataset['Title_Code'] = label.fit_transform(dataset['Title'])
    dataset['AgeBin_Code'] = label.fit_transform(dataset['AgeBin'])
    dataset['FareBin_Code'] = label.fit_transform(dataset['FareBin'])

#define y variable aka target/outcome
Target = ['Survived']

#define x variables for original features aka feature selection
data1_x = ['Sex', 'Pclass', 'Embarked', 'Title', 'SibSp', 'Parch', 'Age', 'Fare', 'FamilySize', 'IsAlone'] #pretty name/values for charts
data1_x_calc = ['Sex_Code', 'Pclass', 'Embarked_Code', 'Title_Code', 'SibSp', 'Parch', 'Age', 'Fare'] #coded for algorithm calculation
data1_xy = Target + data1_x
print('Original X Y: ', data1_xy, '\n')

#define x variables for original w/bin features to remove continuous variables
data1_x_bin = ['Sex_Code', 'Pclass', 'Embarked_Code', 'Title_Code', 'FamilySize', 'AgeBin_Code', 'FareBin_Code']
data1_xy_bin = Target + data1_x_bin
print('Bin X Y: ', data1_xy_bin, '\n')

#define x and y variables for dummy features original
data1_dummy = pd.get_dummies(data1[data1_x])
data1_x_dummy = data1_dummy.columns.tolist()
data1_xy_dummy = Target + data1_x_dummy
print('Dummy X Y: ', data1_xy_dummy, '\n')

data1_dummy.head()
```
Supplementary Notes
- Categorical encoding assigns a number to each type of label, which is easier for algorithms to process.
- `sklearn.preprocessing.LabelEncoder()` can learn and convert labels (numerical or not) into numerical labels.
`fit_transform(y)` fits the label encoder and returns the encoded labels.
For more, see: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html
A similar result can be obtained by casting the column to the category dtype and then converting it into codes:
`df[<col>].astype('category')`
```python
# convert the column into category
obj_df["body_style"] = obj_df["body_style"].astype('category')

# encode the converted column
# and concat the encoding results to the dataframe
obj_df["body_style_cat"] = obj_df["body_style"].cat.codes
```
The result is:
| | make | body_style | drive_wheels | body_style_cat |
|---|---|---|---|---|
| 0 | alfa-romero | convertible | rwd | 0 |
| 1 | alfa-romero | convertible | rwd | 0 |
| 2 | alfa-romero | hatchback | rwd | 2 |
| 3 | audi | sedan | fwd | 3 |
| 4 | audi | sedan | 4wd | 3 |
- One-hot encoding (dummy encoding) turns each label into a binary indicator column.
- `sklearn.preprocessing.OneHotEncoder(categories='auto', drop=None, sparse=True, handle_unknown='error')` can drop the 'first' column or drop 'if_binary'; the categories can also be specified explicitly. `sparse=True` returns a sparse matrix.
`df_dummy = pd.get_dummies(df)` also returns a dataframe with one-hot encoded columns (a pandas-only sketch follows the table below).
A one-hot encoded dataframe could look like this (the sklearn encoder is used in this demo):
```python
from sklearn.preprocessing import OneHotEncoder

oe_style = OneHotEncoder()
oe_results = oe_style.fit_transform(obj_df[["body_style"]])
pd.DataFrame(oe_results.toarray(), columns=oe_style.categories_).head()
```
| | convertible | hardtop | hatchback | sedan | wagon |
|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 1 | 0 | 0 |
| 3 | 0 | 0 | 0 | 1 | 0 |
| 4 | 0 | 0 | 0 | 1 | 0 |
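For comparison, a minimal sketch of the `pd.get_dummies` route mentioned above, using a small stand-in for `obj_df`:

```python
import pandas as pd

# small stand-in for obj_df, with only the column we want to encode
obj_df = pd.DataFrame({"body_style": ["convertible", "convertible", "hatchback",
                                      "sedan", "sedan"]})

# one indicator column per category, prefixed with the original column name
dummies = pd.get_dummies(obj_df["body_style"], prefix="body_style")
print(dummies.head())
```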
Da-Double Check
Double check the cleaned data before putting it into model training.
`df.info()` is recommended, for example:
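(Here `data1` and `data_val` are the train and validation frames from the cleaning step above.)

```python
# dtypes, non-null counts and memory usage in a single call
data1.info()
print("-"*10)
data_val.info()
```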
Train & Test Splitting
Data are usually separated into three parts: train, test, and validation. The exact definitions vary, but they all convey the same idea: the algorithm should not be tweaked based on results from the final hold-out data.
The train/test proportion is usually 75/25.
A quick split can be done with `sklearn.model_selection.train_test_split(x, y, random_state=0)`, while cross-validation may be used when training and comparing many models, to make fuller use of the dataset, as sketched below.
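A minimal sketch, assuming the `data1`, `data1_x_calc` and `Target` names defined in the blocks above are in scope; the `DecisionTreeClassifier` is only a placeholder model for illustration, not the algorithm chosen in the original post:

```python
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# default test_size gives the usual 75/25 train/test proportion
train_x, test_x, train_y, test_y = train_test_split(
    data1[data1_x_calc], data1[Target], random_state=0)

# cross-validation: every row is used for both fitting and scoring across the folds
model = DecisionTreeClassifier(random_state=0)
scores = cross_val_score(model, data1[data1_x_calc], data1[Target].values.ravel(), cv=5)
print(scores.mean(), "+/-", scores.std())
```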