DataSci&MachLearn Workflow 1/4

Inspired by a fabulous post on Kaggle:
https://www.kaggle.com/ldfreeman3/a-data-science-framework-to-achieve-99-accuracy

A data science framework

  1. Define the Problem: Determine the key requirements and pick a suitable algorithm.
  2. Gather the Data: Collect all relevant types of data using appropriate techniques.
  3. Prepare Data for Consumption: This step is often referred to as data wrangling, a required process to turn “wild” data into “manageable” data. Data wrangling includes implementing data architectures for storage and processing, developing data governance standards for quality and control, data extraction (i.e. ETL and web scraping), and data cleaning to identify aberrant, missing, or outlier data points.
  4. Perform Exploratory Analysis: Look for potential problems, patterns, correlations etc. by deploying descriptive and graphical statistics. A basic rule of data science: garbage in, garbage out (GIGO). In addition, data categorization (i.e. qualitative vs quantitative) is also important for understanding and selecting the correct hypothesis test or data model.
  5. Model Data: Like descriptive and inferential statistics, data modeling can either summarize the data or predict future outcomes. Your dataset and expected results will determine the algorithms available for use. It’s important to remember, algorithms are tools and not magical wands or silver bullets. You should focus on the selection and application of tools.
  6. Validate and Implement Data Model: After you’ve trained your model based on a subset of your data, it’s time to test your model. This helps ensure you haven’t overfit your model or made it so specific to the selected subset that it does not accurately fit another subset from the same dataset. In this step we determine whether our model overfits, generalizes, or underfits our dataset.
  7. Optimize and Strategize: This is the “bionic man” step, where you iterate back through the process to make it better, stronger, faster than it was before. As a data scientist, your strategy should be to outsource developer operations and application plumbing, so you have more time to focus on recommendations and design. 

4 C’s for Data Cleaning

  1. Correcting aberrant values and outliers
    Unless a value is confirmed to be absurd, do not modify it; keeping genuine data makes for an accurate model.
  2. Completing missing information
    When missing data make up a noticeable portion of a feature, imputing them is recommended rather than deleting them (see the sketch after this list).
    For quantitative data, the mean, the median, or the mean plus a randomized standard deviation is often adopted.
    For qualitative data, features with many missing values are often dropped, or completed with the mode.
  3. Creating new features for analysis
  4. Converting fields to the appropriate format for calculations and illustration:
    E.g., numpy.ndarray for CPU calculations, tensor for GPU calculations.
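
A minimal pandas sketch of these imputation strategies, assuming a Titanic-style DataFrame with Age, Embarked, and Cabin columns (the file path is hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.read_csv("train.csv")  # hypothetical path to a Titanic-style dataset

# Quantitative feature: fill missing ages with the median (robust to outliers)
df["Age"] = df["Age"].fillna(df["Age"].median())

# Alternative: fill each missing age with mean +/- a randomized std deviation
# mean, std = df["Age"].mean(), df["Age"].std()
# mask = df["Age"].isnull()
# df.loc[mask, "Age"] = np.random.uniform(mean - std, mean + std, size=mask.sum())

# Qualitative feature: fill missing embarkation ports with the mode
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

# Feature where missing data dominates (Cabin is ~77% missing): drop it
df = df.drop(columns=["Cabin"])
```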

Get Familiar with the Data & Check for Null Data

Use isnull() or isna() to find missing data.
Or simply use df.describe() to get an overview of the dataset.
df.sample() can also be used to draw random rows from the dataset.
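
A minimal sketch of these checks, assuming the Kaggle Titanic training set (the local path is hypothetical); the table below is the kind of output df.describe(include='all') produces:

```python
import pandas as pd

df = pd.read_csv("train.csv")  # hypothetical path to the Titanic training set

print(df.isnull().sum())           # missing-value count per column
print(df.describe(include="all"))  # summary statistics (see the table below)
print(df.sample(5))                # a random sample of 5 rows
```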


|        | PassengerId | Survived   | Pclass     | Name | Sex  | Age        | SibSp      | Parch      | Ticket | Fare       | Cabin       | Embarked |
|--------|-------------|------------|------------|------|------|------------|------------|------------|--------|------------|-------------|----------|
| count  | 891.000000  | 891.000000 | 891.000000 | 891  | 891  | 714.000000 | 891.000000 | 891.000000 | 891    | 891.000000 | 204         | 889      |
| unique | NaN         | NaN        | NaN        | 891  | 2    | NaN        | NaN        | NaN        | 681    | NaN        | 147         | 3        |
| top    | NaN         | NaN        | NaN        | Lindahl, Miss. Agda Thorilda Viktoria | male | NaN | NaN | NaN | 1601 | NaN | C23 C25 C27 | S |
| freq   | NaN         | NaN        | NaN        | 1    | 577  | NaN        | NaN        | NaN        | 7      | NaN        | 4           | 644      |
| mean   | 446.000000  | 0.383838   | 2.308642   | NaN  | NaN  | 29.699118  | 0.523008   | 0.381594   | NaN    | 32.204208  | NaN         | NaN      |
| std    | 257.353842  | 0.486592   | 0.836071   | NaN  | NaN  | 14.526497  | 1.102743   | 0.806057   | NaN    | 49.693429  | NaN         | NaN      |
| min    | 1.000000    | 0.000000   | 1.000000   | NaN  | NaN  | 0.420000   | 0.000000   | 0.000000   | NaN    | 0.000000   | NaN         | NaN      |
| 25%    | 223.500000  | 0.000000   | 2.000000   | NaN  | NaN  | 20.125000  | 0.000000   | 0.000000   | NaN    | 7.910400   | NaN         | NaN      |
| 50%    | 446.000000  | 0.000000   | 3.000000   | NaN  | NaN  | 28.000000  | 0.000000   | 0.000000   | NaN    | 14.454200  | NaN         | NaN      |
| 75%    | 668.500000  | 1.000000   | 3.000000   | NaN  | NaN  | 38.000000  | 1.000000   | 0.000000   | NaN    | 31.000000  | NaN         | NaN      |
| max    | 891.000000  | 1.000000   | 3.000000   | NaN  | NaN  | 80.000000  | 8.000000   | 6.000000   | NaN    | 512.329200 | NaN         | NaN      |

Cleaning data

Supplementary Notes

  1. 'mode()' may return multiple values, because a series may have multiple modes. A typical solution is 'df[col].fillna(df[col].mode()[0])'.
  2. 'df.apply()' works for function operations along a certain row or column axis.
    'df.applymap()' works for elementwise function operations on the whole DataFrame.
    'df[col].map()' works for dictionary replacement on a single column.
    'df.groupby()' is also commonly used with these functions for quick feature selection (see the sketch below).
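
A minimal sketch of the four functions on a toy DataFrame (the column names are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"Sex": ["male", "female", "male"],
                   "Age": [22.0, 38.0, 26.0],
                   "Fare": [7.25, 71.28, 7.93]})

# apply: run a function along an axis (here, per column)
print(df[["Age", "Fare"]].apply(lambda col: col.max() - col.min()))

# applymap: run a function elementwise over the whole DataFrame
print(df[["Age", "Fare"]].applymap(round))

# map: dictionary replacement on a single column (a Series)
df["SexCode"] = df["Sex"].map({"male": 0, "female": 1})

# groupby + aggregation for quick feature inspection
print(df.groupby("Sex")["Fare"].mean())
```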

Generating new features

In this example, three new features are generated: family size, whether the passenger is alone (IsAlone), and Title.
Some interesting pandas functions are utilized (see the sketch after the notes below).

Supplementary Notes

  1. 'pd.qcut(arr, n)' divides an array into n groups with an equal number of data points in each group.
    'pd.cut(arr, n)' divides an array into n groups of equal value range, regardless of frequency.
  2. The 'split()' function splits a string on the given separator. In the aforementioned example, a string in 'Name' such as 'Jermyn, Miss. Annie' is split on the comma ',' to take the latter half 'Miss. Annie', then split on the period '.' to take the former half 'Miss'.
  3. 'Misc' should be an abbreviation of miscellaneous.
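
A minimal sketch of the feature generation, assuming Titanic-style columns (SibSp, Parch, Name, Fare, Age); the bin counts are illustrative:

```python
import pandas as pd

df = pd.read_csv("train.csv")  # hypothetical path to a Titanic-style dataset

# Family size = siblings/spouses + parents/children + the passenger
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1
df["IsAlone"] = (df["FamilySize"] == 1).astype(int)

# Title: "Jermyn, Miss. Annie" -> split on ", " -> "Miss. Annie" -> split on "." -> "Miss"
df["Title"] = df["Name"].str.split(", ", expand=True)[1].str.split(".", expand=True)[0]

# qcut: 4 fare bins with roughly equal counts; cut: 5 age bins of equal width
df["FareBin"] = pd.qcut(df["Fare"], 4)
df["AgeBin"] = pd.cut(df["Age"], 5)
```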

Convert format

Supplementary Notes

  1. Categorical encoding allocates each type of label a number, which is easier for algorithms to process.
  2. 'sklearn.preprocessing.LabelEncoder()' can learn and convert labels (numerical or non-numerical) into numerical labels.
    'fit_transform(y)' will fit the label encoder and return the encoded labels.
    For more, see: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html
    A similar result is obtained by casting the column to the category dtype and then converting it to codes:
    'df[<col>].astype('category').cat.codes' (both options are sketched below)
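
A minimal sketch on an automobile-style frame matching the result table below. On this 5-row sample LabelEncoder only sees three labels, so the full five-category list is fixed explicitly to reproduce the codes shown:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"make": ["alfa-romero", "alfa-romero", "alfa-romero", "audi", "audi"],
                   "body_style": ["convertible", "convertible", "hatchback", "sedan", "sedan"],
                   "drive_wheels": ["rwd", "rwd", "rwd", "fwd", "4wd"]})

# Option 1: LabelEncoder numbers the labels it sees, sorted alphabetically
encoded = LabelEncoder().fit_transform(df["body_style"])

# Option 2: category codes with the full dataset's five body styles fixed,
# which reproduces the body_style_cat column in the table below
styles = ["convertible", "hardtop", "hatchback", "sedan", "wagon"]
df["body_style_cat"] = df["body_style"].astype(
    pd.CategoricalDtype(categories=styles)).cat.codes
```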

The result is:

|   | make        | body_style  | drive_wheels | body_style_cat |
|---|-------------|-------------|--------------|----------------|
| 0 | alfa-romero | convertible | rwd          | 0              |
| 1 | alfa-romero | convertible | rwd          | 0              |
| 2 | alfa-romero | hatchback   | rwd          | 2              |
| 3 | audi        | sedan       | fwd          | 3              |
| 4 | audi        | sedan       | 4wd          | 3              |
  1. One-hot encoding (or dummy encoding) converts each label into a binary indicator column.
  2. 'sklearn.preprocessing.OneHotEncoder(categories='auto', drop=None, sparse=True, handle_unknown='error')' can drop the 'first' column or drop 'if_binary'; the categories can also be specified explicitly. 'sparse=True' returns a sparse matrix.
    'df_dummy = pd.get_dummies(df)' also returns a DataFrame with one-hot labelled columns.
    A one-hot encoded DataFrame could look like the table below (the sklearn encoder is used in this demo).
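
A minimal sketch producing that table, assuming the body_style column from the automobile example (the full five-category list is fixed so all columns appear):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"body_style": ["convertible", "convertible",
                                  "hatchback", "sedan", "sedan"]})

# Fix the full category list so every column appears even on this sample
styles = [["convertible", "hardtop", "hatchback", "sedan", "wagon"]]
enc = OneHotEncoder(categories=styles)
onehot = pd.DataFrame(enc.fit_transform(df[["body_style"]]).toarray(),
                      columns=enc.categories_[0], dtype=int)

# pandas equivalent (columns limited to the labels actually present)
dummies = pd.get_dummies(df["body_style"])
```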

|   | convertible | hardtop | hatchback | sedan | wagon |
|---|-------------|---------|-----------|-------|-------|
| 0 | 1           | 0       | 0         | 0     | 0     |
| 1 | 1           | 0       | 0         | 0     | 0     |
| 2 | 0           | 0       | 1         | 0     | 0     |
| 3 | 0           | 0       | 0         | 1     | 0     |
| 4 | 0           | 0       | 0         | 1     | 0     |

Da-Double Check

Double-check the cleaned data before putting it into model training etc.
'df.info()' is recommended.

Train & Test Splitting

Data are usually separated into 3 parts: train, test, and validation. Definitions of them vary, but they all convey the idea that modifications to the algorithm should not be made based on the results from the final held-out data.
The train/test proportion is usually 75/25.
A quick division can be done by 'sklearn.model_selection.train_test_split(x, y, random_state=0)', but cross-validation might be used for massive model training and selection to make fuller use of the dataset.
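
A minimal sketch of both options with placeholder data (the feature matrix and the choice of model are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression

x = np.random.rand(100, 3)       # placeholder feature matrix
y = (x[:, 0] > 0.5).astype(int)  # placeholder binary target

# Quick split: defaults to 75/25 train/test
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)
model = LogisticRegression().fit(x_train, y_train)
print(model.score(x_test, y_test))

# Cross-validation: every sample is used for both training and testing
print(cross_val_score(LogisticRegression(), x, y, cv=5).mean())
```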