### **Week 1 Project**
As a data scientist working at Instagram, how would you analyse key performance insights to assess the success of the IGTV product?
### **Step 1: Description**
As a data scientist at Instagram, I would follow the data science process, which draws on the fundamentals of statistics, mathematics and programming.
### **Step 2: Data Collection**
I generated my data from Mockaroo as a CSV file with various fields.
```python
import pyforest
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sbs

# Show all rows when displaying DataFrames
pd.set_option('display.max_rows', None)

# Load the Mockaroo-generated dataset
data = pd.read_csv("C:/Users/User/Downloads/MOCK_DATA (2).csv")
```
### **Step 3: Data Preprocessing**
This is how I cleaned the data.
```python
data.dtypes
data.head()
data.shape
df=data.drop(['impressions'],axis=1)
```
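Before dropping anything, it can help to see how much of each column is actually missing; a small illustrative check along these lines (assuming `data` has been loaded as above):
```python
# Count missing values per column to decide what needs cleaning
print(data.isna().sum())

# The same counts expressed as a percentage of all rows
print((data.isna().mean() * 100).round(2))
```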
### Renaming the columns
```python
df = df.rename(columns={
    'video_id': 'Video_ID',
    'views': 'No_of_Views',
    'average_watch_time': 'Average_WatchTime',
    'completion_rate': 'Completion_Rate',
    'click_through_rate': 'Clicks_Rate',
    'engangement_rate': 'Engagement_Rate',
})
df
```
### Dropping null values and duplicates
```python
# Remove rows that contain null values
df = df.dropna()

# Inspect any rows that are duplicated
df.loc[df.duplicated()]
```
### **Step 4: Data Exploration**
This is how I performed EDA using different approaches.
```python
# Sort by follower count and take the top 10 videos
df_sorted = df.sort_values('followers', ascending=False)
mostly_followed = df_sorted[:10]
mostly_followed
# Create a horizontal bar chart
plt.barh(mostly_followed['Video_ID'], mostly_followed['followers'])
plt.xlabel('Followers')
plt.ylabel('Video ID')
plt.title('Top 10 Videos by Followers')
plt.show()
```
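Beyond the bar chart, summary statistics and a correlation heatmap are common ways to explore this kind of data. A minimal sketch, assuming the remaining metric columns are numeric after cleaning:
```python
# Summary statistics for the numeric columns
df.describe()

# Correlation heatmap of the numeric metrics
plt.figure(figsize=(8, 6))
sbs.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.title('Correlation between IGTV metrics')
plt.show()
```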
### **Step 5: Predictive Modelling**
I analysed and compared two models.
1) Linear Regression
```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# Creating a copy of the DataFrame
# (most_influential is a subset of the data prepared during EDA; its construction is not shown here)
most_influential_copy = most_influential.copy()
# Handling missing values (NaN) in the 'shares' column in the copied DataFrame
most_influential_copy['shares'] = most_influential_copy['shares'].fillna(0)
# Defining independent variables (X) and dependent variable (y) from the cleaned copy
X = most_influential_copy[['likes', 'comments', 'shares']]
y = most_influential_copy['Video_ID']
# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression()
# Fitting the model to the training data
model.fit(X_train, y_train)
# Making predictions on the test data
y_pred = model.predict(X_test)
# Evaluating the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error (MSE): {mse:.2f}")
print(f"R-squared (R2): {r2:.2f}")
# Coefficients and intercept
coefficients = model.coef_
intercept = model.intercept_
print(f"Coefficients: {coefficients}")
print(f"Intercept: {intercept}")
```
2) Random Forest Regressor
```python
#using random forest
from sklearn.ensemble import RandomForestRegressor
# Reloading the raw dataset for the second model
da = pd.read_csv("C:/Users/User/Downloads/MOCK_DATA (2).csv")
da
# Dropping the 'impressions' column, as in the preprocessing step
da = da.drop(['impressions'], axis=1)
# Filling missing values in the numeric columns with the column means
da.fillna(da.mean(numeric_only=True), inplace=True)
# Parsing the watch-time column as a datetime
da['average_watch_time'] = pd.to_datetime(da['average_watch_time'])
# Splitting the data into features (X) and target (y)
X = da.drop('average_watch_time', axis=1)
y = da['engagement_rate']
# Split the data into training and testing sets (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Random Forest Regressor model
model = RandomForestRegressor(n_estimators=100, random_state=42) # You can adjust hyperparameters as needed
# Fitting the model on the training data
model.fit(X_train, y_train)
# Making predictions on the test data
y_pred = model.predict(X_test)
# Evaluating the model's performance
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error: {mse}")
print(f"R-squared: {r2}")
```
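To see which metrics drive the random forest's predictions, the fitted model's feature importances can be plotted; a small sketch, assuming `model` and `X` are the objects fitted above:
```python
# Rank the features by how much the random forest relied on them
importances = pd.Series(model.feature_importances_, index=X.columns).sort_values()

importances.plot(kind='barh')
plt.xlabel('Feature importance')
plt.title('Random forest feature importances')
plt.show()
```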
### **Step 6: Model Selection**
I decided to go with the random forest model, as it yielded a positive R-squared and a far lower error, and used it on the rest of the data.
Random Forest:
- Mean Squared Error: 0.00918879593299292
- R-squared: 0.99998861412386

Linear Regression:
- Mean Squared Error (MSE): 91429.19
- R-squared (R2): -3.41
- Coefficients: [ 0.79901879 -0.02785449 -0.05815464]
- Intercept: -79012.17522320009
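To apply the chosen model to the rest of the data, one option is to persist it and load it when scoring new records; a minimal sketch, assuming the fitted random forest `model` and feature frame `X` from Step 5 (the `new_data.csv` file name is purely illustrative):
```python
import joblib

# Save the fitted random forest so it can be reused later
joblib.dump(model, "igtv_random_forest.pkl")

# Later: load the model and score new IGTV records
loaded_model = joblib.load("igtv_random_forest.pkl")
new_data = pd.read_csv("new_data.csv")  # hypothetical file with the same feature columns as X
predictions = loaded_model.predict(new_data[X.columns])
print(predictions[:5])
```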