To quote a joke from Sheldon Cooper, the funny character in the TV show THE BIG BANG THEORY, when he made a prediction about a newcomer's name: "The world's most popular first name is 'Mohamed'. The most popular last name is 'Lee'. His name must be 'Mohamed Lee'." I guess Sheldon is no substantive expert on the human race and its names, or he pretended to be one.
Here is my favourite data science Venn diagram, first illustrated by Drew Conway:
Machine learning algorithms themselves are not very difficult to understand; the data relationships are, particularly because those relationships must be understood by both the data analyst and the substantive expert on the project. The data relationships are the decisive factor in how data should be structured for the machine learning process, and ultimately they determine the final outcome of the machine learning models.
Deciding which dataset will be the target data is relatively easy: whatever you want to improve is usually your modeling target. Deciding which datasets to include as modeling inputs, however, is often difficult. Each input variable is called a machine learning "feature".
Data analysts depend on two types of graphs to make the call for machine learning features:
- feature group chart (similarity factors, feature distances, even a data heat map)
- feature importance tornado bar chart
The simple principle is to pick the several top features on the feature importance chart, then check whether any of them belong to the same feature group. If they do, keep the one ranked highest and throw out the rest.
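The simple principle above can be sketched mechanically. The importance scores and group labels below are hypothetical stand-ins for what the two charts would provide; they are not the project's real values.

```python
# A minimal sketch of the "simple principle", assuming we already have
# feature-importance scores and feature-group labels from the two charts.

def select_features(importance, groups, top_n=5):
    """Keep the top-N features by importance; within each feature group,
    keep only the highest-ranked member and throw out the rest."""
    # Rank features from most to least important, keep the top N
    ranked = sorted(importance, key=importance.get, reverse=True)[:top_n]
    selected, seen_groups = [], set()
    for feat in ranked:
        group = groups.get(feat)  # group label, or None if ungrouped
        if group is not None and group in seen_groups:
            continue  # a higher-ranked member of this group is already kept
        if group is not None:
            seen_groups.add(group)
        selected.append(feat)
    return selected

# Hypothetical scores and groupings, loosely modeled on the shale project
importance = {"total_frac_len": 0.31, "ave_perf_int_leng": 0.22,
              "n_frac_stage": 0.18, "open_hole": 0.15, "lateral_len": 0.09}
groups = {"total_frac_len": "length", "ave_perf_int_leng": "length",
          "open_hole": "small_int", "n_frac_stage": "small_int"}

print(select_features(importance, groups, top_n=5))
# ['total_frac_len', 'n_frac_stage', 'lateral_len']
```

Note that this mechanical rule is exactly what the examples below complicate: only substantive expertise can tell whether a group, or a top-ranked feature, deserves to be kept at all.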
In reality, carrying out the "simple principle" is much more complicated. Take, for instance, the Shale Well Optimization Project presented in the first session of this knowledge share.
"open_hole" and "n_frac_stage" are grouped together (the pink group) in the feature group chart. "open_hole" (open hole well completion? Yes/No) is a Boolean valued "0" or "1". "n_frac_stage" (number of well fracturing stages) is an integer in the range 15-30. The two features are not related at all; they were grouped together only by accident, because both take small integer values while the rest of the feature values run into the thousands, even millions, with much broader ranges. Without substantive-expertise input to separate them, they would be treated as the same data dimension.
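The accidental grouping is easy to reproduce: on raw values, two small-valued columns always look "close" next to columns measured in the thousands. The columns below are randomly generated stand-ins that only mimic the value ranges described above.

```python
import math
import random

random.seed(0)
n = 50

# Hypothetical columns mimicking the value ranges in the shale project
open_hole      = [random.randint(0, 1) for _ in range(n)]          # Boolean, 0/1
n_frac_stage   = [random.randint(15, 30) for _ in range(n)]        # 15-30
total_frac_len = [random.uniform(5_000, 9_000) for _ in range(n)]  # thousands

def dist(a, b):
    """Euclidean distance between two feature columns."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# On raw values the two small-valued features look "close" simply because
# everything else is orders of magnitude larger -- the accidental grouping.
print(dist(open_hole, n_frac_stage) < dist(open_hole, total_frac_len))  # True

def standardize(col):
    """Rescale a column to zero mean and unit variance."""
    m = sum(col) / len(col)
    s = math.sqrt(sum((x - m) ** 2 for x in col) / len(col))
    return [(x - m) / s for x in col]

# After standardizing, raw scale no longer drives the grouping; whether two
# features truly belong together must come from substantive expertise.
z = {name: standardize(col) for name, col in
     [("open_hole", open_hole), ("n_frac_stage", n_frac_stage),
      ("total_frac_len", total_frac_len)]}
print(round(dist(z["open_hole"], z["n_frac_stage"]), 1),
      round(dist(z["open_hole"], z["total_frac_len"]), 1))
```

Standardizing removes the scale artifact, but it cannot tell a Boolean completion flag from a stage count; that judgment still needs the substantive expert.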
Another example in the shale optimization project is the abandonment of the "ave_perf_int_leng" feature (average perforation interval length), even though it sits in the top 5 of the feature importance chart. Multiplying this feature by "n_frac_stage" gives "total_perf_len", which in turn equals "total_frac_len", already included in the selected features. In addition, only wells with "open_hole" = "0" have this dataset, which means roughly half of the wells do not have it at all. So the dataset was taken out of the machine learning input features.
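The two reasons for dropping "ave_perf_int_leng" can each be verified with a quick check over the well records. The tiny dataset below is hypothetical and only illustrates the shape of the checks.

```python
# Hypothetical well records: perforation columns exist only for cased wells
# (open_hole = 0); open-hole wells carry no value for them.
wells = [
    {"open_hole": 0, "n_frac_stage": 20, "ave_perf_int_leng": 50.0, "total_perf_len": 1000.0},
    {"open_hole": 0, "n_frac_stage": 25, "ave_perf_int_leng": 40.0, "total_perf_len": 1000.0},
    {"open_hole": 1, "n_frac_stage": 18, "ave_perf_int_leng": None, "total_perf_len": None},
    {"open_hole": 1, "n_frac_stage": 22, "ave_perf_int_leng": None, "total_perf_len": None},
]

# Check 1 (redundancy): wherever the value exists,
# ave_perf_int_leng * n_frac_stage reproduces total_perf_len.
redundant = all(
    abs(w["ave_perf_int_leng"] * w["n_frac_stage"] - w["total_perf_len"]) < 1e-6
    for w in wells if w["ave_perf_int_leng"] is not None
)

# Check 2 (coverage): fraction of wells missing the feature entirely.
missing = sum(w["ave_perf_int_leng"] is None for w in wells) / len(wells)

print(redundant, missing)  # True 0.5 -> redundant and half-missing, so drop it
```

Either finding alone might be survivable; together they make the feature both redundant and half-empty, which is why it was dropped despite its high importance rank.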
I do not mean to confuse people with the above examples. Rather, I intend to emphasize how important it is for the data analyst and the substantive experts to work closely together throughout the data analytics process, because the inter-data relationships are indeed very complicated. That complexity is why we appeal to machine learning and other advanced technologies for help in the first place. Humans have to structure the data before the mathematical algorithms can do anything; otherwise our data analytics process is just "garbage in, garbage out".
As the joke at the beginning of this article suggests, a data analytics model can be misleading and even dangerous if it is not fully examined from every aspect through close collaboration between the data analyst and the substantive expert.