# EDA.ipynb

Olawale

This notebook covers the exploratory data analysis (EDA) part of the tutorial. The insights gained here guide the choice of data preparation techniques, with the aim of getting the best prediction results from the machine learning model(s).

# importing required libraries and packages

import numpy as np
import pandas as pd

!pip3 install petroeval
import petroeval as pet
import matplotlib.pyplot as plt

### Getting Data

The train data set comprises well logs from 98 wells in the North Sea, while the open test data set is made up of 10 wells. Only the gamma ray (GR) log is present in all the wells.

traindata = pd.read_csv('./data/train.csv', sep=';')
testdata = pd.read_csv('./data/leaderboard_test_features.csv.txt', sep=';')

### Data Exploration and Visualization

Here, the data is investigated to better understand its shape and form. Since the data combines 98 different wells, visualizing each well individually would be time-consuming, so the wells are examined together instead. Both train and test data are explored simultaneously.

• Basic information check on wells and data
• Visualizing the percentage of wells in which each log is present
• Visualizing the percentage of non-missing values for each log across all wells
• Spatial distribution of wells according to their coordinates
• Visualizing some log plots
traindata.head()
traindata.shape, testdata.shape

With over 1 million training data points, it is safe to call this "big data". It can also be seen that the test data set has two fewer logs than the train data. Let's investigate that.

number = 0
logs = []
# collect every train column that does not appear in the test data
for log in traindata.columns:
    if log not in testdata.columns:
        logs.append(log)
        number += 1

print(f'{number} logs not in test data:')
print(logs)
traindata.columns
traindata.dtypes
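The column comparison above can also be expressed without an explicit loop. The sketch below uses `Index.difference` on two hypothetical miniature frames (`train_demo` and `test_demo`, with made-up column names `LABEL` and `LABEL_CONFIDENCE`) that stand in for `traindata` and `testdata`; it is one possible alternative, not the approach used in this notebook.

```python
import pandas as pd

# Hypothetical stand-ins for traindata / testdata (column names are illustrative only).
train_demo = pd.DataFrame(columns=['WELL', 'DEPTH_MD', 'GR', 'LABEL', 'LABEL_CONFIDENCE'])
test_demo = pd.DataFrame(columns=['WELL', 'DEPTH_MD', 'GR'])

# Columns that exist in the train frame but not in the test frame.
missing_in_test = train_demo.columns.difference(test_demo.columns, sort=False)

print(f'{len(missing_in_test)} logs not in test data:')
print(list(missing_in_test))
```

On the real data, replacing the two demo frames with `traindata` and `testdata` should list the same columns as the loop.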

From the data types above, three categorical logs are stored with the 'object' dtype. Getting more information on these logs will help in choosing a suitable encoding technique.
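As a rough sketch of how the categorical columns could be summarised in one step, the example below selects the non-numeric columns of a small hypothetical frame (`demo`, with invented values) and counts unique and missing entries per column. Selecting by "not a number" rather than by the 'object' dtype is a choice made here for illustration.

```python
import pandas as pd

# Hypothetical miniature data set; values are invented for illustration.
demo = pd.DataFrame({
    'WELL': ['A', 'A', 'B'],
    'GROUP': ['G1', 'G2', 'G1'],
    'FORMATION': ['F1', None, 'F2'],
    'GR': [50.0, 60.0, 70.0],
})

# Keep only the non-numeric (categorical) columns.
categorical = demo.select_dtypes(exclude='number')

# Unique values (NaN excluded) and missing entries for each categorical column.
summary = pd.DataFrame({
    'unique_values': categorical.nunique(),
    'missing': categorical.isna().sum(),
})
print(summary)
```

A column with many unique values and many gaps may call for a different encoding than one with only a handful of categories.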

print(f'Number of wells in data: {len(np.unique(traindata.WELL))}')
# checking more info on the categorical logs (FORMATION AND GROUP)

print(f'Unique formation count: {len(dict(traindata.FORMATION.value_counts()))}')
print(f'Unique group count: {len(dict(traindata.GROUP.value_counts()))}')
traindata.GROUP.value_counts()
traindata.FORMATION.value_counts()

#### Checking the percentage of wells in which each log is present

For train data

# this code is adapted from Matteo Niccoli's GitHub EDA repo for the FORCE competition:
# https://github.com/mycarta/Force-2020-Machine-Learning-competition_predict-lithology-EDA/blob/master/Interactive_data_inspection_and_visualization_by_well.ipynb

log_columns = traindata.columns[2:-2]  # skip the first two and last two columns
occurences = np.zeros(len(log_columns))
for well in traindata['WELL'].unique():
    # 1 where a column is entirely NaN in this well, 0 otherwise
    occurences += traindata[traindata['WELL'] == well].isna().all().astype(int).values[2:-2]
n_wells = traindata.WELL.unique().shape[0]
fig, ax = plt.subplots(1, 1, figsize=(14, 7))
ax.bar(x=np.arange(occurences.shape[0]), height=(n_wells - occurences) / n_wells * 100.0)
# set tick positions before the labels so each label lines up with its bar
ax.set_xticks(np.arange(occurences.shape[0]))
ax.set_xticklabels(log_columns, rotation=45)
ax.set_ylabel('Well presence (%)')
# same computation for the test data
log_columns = testdata.columns[2:-2]
occurences = np.zeros(len(log_columns))
for well in testdata['WELL'].unique():
    # 1 where a column is entirely NaN in this well, 0 otherwise
    occurences += testdata[testdata['WELL'] == well].isna().all().astype(int).values[2:-2]
n_wells = testdata.WELL.unique().shape[0]
fig, ax = plt.subplots(1, 1, figsize=(14, 7))
ax.bar(x=np.arange(occurences.shape[0]), height=(n_wells - occurences) / n_wells * 100.0)
# set tick positions before the labels so each label lines up with its bar
ax.set_xticks(np.arange(occurences.shape[0]))
ax.set_xticklabels(log_columns, rotation=45)
ax.set_ylabel('Well presence (%)')
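The per-well loop above counts, for each log, the wells in which that log has no values at all. The same quantity can be sketched with a `groupby`, shown here on a small hypothetical frame (`demo`, with invented values) rather than on the competition data.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature data set: two wells, three logs.
demo = pd.DataFrame({
    'WELL': ['A', 'A', 'B', 'B'],
    'GR':   [50.0, 60.0, 70.0, 80.0],
    'SGR':  [np.nan, np.nan, 10.0, np.nan],
    'DTS':  [np.nan, np.nan, np.nan, np.nan],
})

logs = demo.drop(columns='WELL')

# True where a well has at least one non-missing value for a log.
has_data = logs.notna().groupby(demo['WELL']).any()

# Share of wells (in percent) in which each log is present.
well_presence = has_data.mean() * 100
print(well_presence)
```

Here a log counts as present in a well as soon as a single sample exists, which matches the `isna().all()` test used in the loop.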

Log availability differs between the train and test data. As the plots show, the SGR log is absent from all test wells.

While the figures above show log presence per well (a log counts as present if the well has any values for it), the figures below show the percentage of actual, non-missing values for each log across all rows.

train_well_items = dict(100 - (traindata.isna().sum()/traindata.shape[0]) * 100)
test_well_items = dict(100 - (testdata.isna().sum()/testdata.shape[0]) * 100)
train_well_items = {log:value for log, value in train_well_items.items() if value != 100.0}
test_well_items = {log:value for log, value in test_well_items.items() if value != 100.0}
train_well_items
# train data
occurences = np.zeros(len(train_well_items))
fig, ax = plt.subplots(1, 1, figsize=(14, 7))
ax.bar(x=np.arange(occurences.shape[0]), height=list(train_well_items.values()))
# set tick positions before the labels so each label lines up with its bar
ax.set_xticks(np.arange(occurences.shape[0]))
ax.set_xticklabels(list(train_well_items.keys()), rotation=45)
ax.set_ylabel('Data presence (%)')
# test data
occurences = np.zeros(len(test_well_items))
fig, ax = plt.subplots(1, 1, figsize=(14, 7))
ax.bar(x=np.arange(occurences.shape[0]), height=list(test_well_items.values()))
ax.set_xticks(np.arange(occurences.shape[0]))
ax.set_xticklabels(list(test_well_items.keys()), rotation=45)
ax.set_ylabel('Data presence (%)')
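Data presence differs from well presence: a log can appear in every well and still be mostly empty. A minimal sketch of the row-level calculation, using a hypothetical frame (`demo`, with invented values), is:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature data set; values are invented for illustration.
demo = pd.DataFrame({
    'WELL': ['A', 'A', 'B', 'B'],
    'GR':   [50.0, 60.0, 70.0, 80.0],
    'SGR':  [np.nan, np.nan, 10.0, np.nan],
    'DTS':  [np.nan, np.nan, np.nan, np.nan],
})

# Percentage of rows with a non-missing value, per column.
data_presence = demo.notna().mean() * 100

# Keep only the columns that have at least one gap, as in the cells above.
incomplete = data_presence[data_presence != 100.0]
print(incomplete)
```

In this toy example SGR is present in half of the wells but in only a quarter of the rows.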

Visualizing the train label categories

labels = dict(traindata.FORCE_2020_LITHOFACIES_LITHOLOGY.value_counts())
# NOTE: value_counts() orders classes by frequency, so the bar chart below only
# pairs names with counts correctly if this list follows that same order
lithofacies_names = ['Shale', 'Sandstone', 'SS/Shale', 'Marl',
                     'Dolomite', 'Limestone', 'Chalk', 'Halite', 'Anhydrite',
                     'Tuff', 'Coal', 'Basement']
traindata.FORCE_2020_LITHOFACIES_LITHOLOGY.value_counts()
fig = plt.figure(figsize=(15, 10))
plt.bar(lithofacies_names, (np.array(list(labels.values()))/traindata.shape[0]) * 100)
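Pairing a fixed list of names with `value_counts()` by position relies on the two orders matching. One way to avoid that dependency is to map each lithology code to its name before counting. The code-to-name dictionary below reflects the key published with the FORCE 2020 competition material as far as I recall it, so treat it as an assumption to verify against the official key; `demo_labels` is an invented series.

```python
import pandas as pd

# Assumed code-to-name key (verify against the official FORCE 2020 material).
lithology_names = {
    30000: 'Sandstone', 65030: 'SS/Shale', 65000: 'Shale', 80000: 'Marl',
    74000: 'Dolomite', 70000: 'Limestone', 70032: 'Chalk', 88000: 'Halite',
    86000: 'Anhydrite', 99000: 'Tuff', 90000: 'Coal', 93000: 'Basement',
}

# Hypothetical label column; values are invented for illustration.
demo_labels = pd.Series([65000, 65000, 65000, 30000, 70000, 70000],
                        name='FORCE_2020_LITHOFACIES_LITHOLOGY')

# Translate codes to names first, then compute each class share in percent.
share = demo_labels.map(lithology_names).value_counts(normalize=True) * 100
print(share)
```

Because the names travel with the counts, the resulting series can be passed to a bar chart without any assumption about ordering.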
# spatial distribution of both train and test wells

fig = plt.figure(figsize=(12,8))
plt.scatter(traindata.X_LOC, traindata.Y_LOC, c='g', label='train wells')
plt.scatter(testdata.X_LOC, testdata.Y_LOC, c='r', label='test wells')
plt.xlabel('X_LOC')
plt.ylabel('Y_LOC')
plt.legend()
plt.show()

From the plot, we can see that the test wells are spread evenly across the area covered by the train wells. This simplifies data preparation, since almost the same techniques can be applied to both data sets.

While it is impractical to view every train and test well, let's take a look at a test well in close proximity to a train well and compare their log signatures.

testdata.loc[testdata.WELL == '15/9-14']
pet.four_plots(testdata.loc[testdata.WELL == '15/9-14'], x1='GR', x2='NPHI', x3='SP', x4='DTC',
               top=480, base=3565, depth='DEPTH_MD')
traindata.WELL.unique()
traindata.loc[traindata.WELL == '15/9-15']
pet.four_plots(traindata.loc[traindata.WELL == '15/9-15'], x1='GR', x2='NPHI', x3='SP', x4='DTC',
               top=485, base=3201, depth='DEPTH_MD')
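Choosing a train well close to a given test well can also be done from the coordinates rather than by eye. The sketch below averages `X_LOC` and `Y_LOC` per well and picks the train well with the smallest straight-line distance; the frames and well names (`T1`, `T2`, `Q1`) are hypothetical, and a plain Euclidean distance in map coordinates is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature train and test sets with invented coordinates.
train_demo = pd.DataFrame({
    'WELL': ['T1', 'T1', 'T2', 'T2'],
    'X_LOC': [0.0, 0.0, 100.0, 100.0],
    'Y_LOC': [0.0, 0.0, 100.0, 100.0],
})
test_demo = pd.DataFrame({
    'WELL': ['Q1', 'Q1'],
    'X_LOC': [90.0, 92.0],
    'Y_LOC': [95.0, 97.0],
})

# One representative (mean) location per well.
train_xy = train_demo.groupby('WELL')[['X_LOC', 'Y_LOC']].mean()
test_xy = test_demo.groupby('WELL')[['X_LOC', 'Y_LOC']].mean()

# Distance from the chosen test well to every train well.
target = test_xy.loc['Q1']
distances = np.hypot(train_xy['X_LOC'] - target['X_LOC'],
                     train_xy['Y_LOC'] - target['Y_LOC'])

nearest_well = distances.idxmin()
print(nearest_well, distances.min())
```

The nearest well found this way is only a starting point; comparable depth coverage and log availability still need to be checked before comparing signatures.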