THE COMPLETE GUIDE TO
Using Linear Regression to predict the Dow Jones Industrial Average Index
“The stock market is filled with individuals who know the price of everything, but the value of nothing.” — Philip Fisher
“A prediction about the direction of the stock market tells you nothing about where stocks are headed, but a whole lot about the person doing the predicting.” — Warren Buffett
Today, the stock market plays an important part in many people's lives. Many look to it to supplement their income, but investing carries substantial risk, and people reasonably wonder why they should risk their hard-earned money. That is why there is so much interest in predicting the stock market.
Predicting the Dow Jones is a good place to start.
This tutorial will take around 30–45 minutes to complete. We will be using Google Colaboratory.
Note: This tutorial has been adapted from the tutorial by the YouTube channel, Computer Science.
Prerequisites:
- A Google account
- The Dow Jones dataset. Get it here — DowJones.csv
- Basic knowledge of Linear Regression
- Basic knowledge of Python
Recommended: CS50’s Introduction to Artificial Intelligence with Python — by CS50 at Harvard University.
Create a new .ipynb Jupyter Notebook in Google Colaboratory and call it ‘Dow Jones.ipynb’.
I. The Imports
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
plt.style.use('bmh')
We will use pandas, scikit-learn, NumPy and Matplotlib, with Matplotlib's bmh style for the plots.
II. Uploading the dataset
Add the following piece of code in a new block right below the imports.
from google.colab import files
uploaded = files.upload()
This creates an upload widget in Google Colaboratory.
Use the widget to select and upload the DowJones.csv file.
III. Reading the CSV file using pandas
df = pd.read_csv('DowJones.csv')
df.head(6)
This reads the file and displays its first six rows. The output is shown alongside.
The shape of the data frame (df) is (2082, 2), which you can get using df.shape. The first number is the row count, so this is one way to get the number of trading days (2082).
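To make the shape check concrete, here is a minimal sketch on a tiny stand-in frame rather than the real DowJones.csv. The 'Date' column name and the numbers are assumptions made up for illustration; only the 'Value' column is named in this tutorial.

```python
import pandas as pd

# Hypothetical stand-in for DowJones.csv: one date column and one value column.
# The 'Date' column name and these numbers are made up for illustration.
toy = pd.DataFrame({
    "Date": pd.date_range("2020-01-01", periods=5, freq="B"),
    "Value": [28868.8, 28634.9, 28703.4, 28583.7, 28745.1],
})

# shape is a (rows, columns) tuple, so the first entry is the number of rows
n_rows, n_cols = toy.shape
print(n_rows, n_cols)

# len() on a data frame also returns the number of rows
print(len(toy))
```

On the real dataset the same two calls should report 2082 rows.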
IV. Visualizing the data
plt.figure(figsize=(16, 8))
plt.title('Dow Jones')
plt.xlabel('Date')
plt.ylabel('Close Price USD ($)')
plt.plot(df['Value'])
plt.show()
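We use a figure size of 16 by 8 and set the title to ‘Dow Jones’. The y values are the close prices in USD ($), taken from the Value column in df. Because only that column is passed to plt.plot, the x-axis shows the row index (one row per trading day) even though it is labelled ‘Date’. Then we show the plot. The output will be as follows.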
V. Getting the values
# Keep only the Value column; .copy() avoids a SettingWithCopyWarning later
df = df[['Value']].copy()
df.head(4)
Now the data frame contains only the values. I display the first four rows to check the output, which is shown alongside.
VI. Creating a variable to predict ‘x’ future days
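future_days = 900
# Each row's target is the Value recorded future_days rows later
df['Prediction'] = df['Value'].shift(-future_days)
df.head(4)
We will predict 900 days into the future. We create a new column called ‘Prediction’, which holds the Value column shifted up by 900 rows, so each row's ‘Prediction’ is the actual value 900 trading days later. Displaying the top four rows gives the following output. The ‘Prediction’ values are far from the ‘Value’ values in the same row because they belong to a much later date. They are the targets the model will learn from, not model output. The last 900 rows have no value 900 days ahead, so their ‘Prediction’ is NaN.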
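The effect of a negative shift is easier to see on a handful of rows. The sketch below uses a made-up five-row frame and a horizon of 2 instead of 900; the numbers are purely illustrative.

```python
import pandas as pd

# Made-up values; a horizon of 2 rows stands in for the 900 used in the tutorial
toy = pd.DataFrame({"Value": [10.0, 11.0, 12.0, 13.0, 14.0]})
future_days = 2

# shift(-2) moves every value up two rows, so row i receives the value of row i + 2
toy["Prediction"] = toy["Value"].shift(-future_days)
print(toy)

# The last future_days rows have nothing to pull from and become NaN
print(toy["Prediction"].isna().sum())
```

Row 0 pairs the value 10.0 with the target 12.0, which is the value two rows later.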
VII. Creating the feature dataset (X)
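# drop(columns=...) replaces the positional axis argument, which pandas 2.0 no longer accepts
X = np.array(df.drop(columns=['Prediction']))[:-future_days]
We create the feature dataset by dropping the ‘Prediction’ column, convert it to a numpy array, and remove the last future_days rows, whose targets are NaN.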
VIII. Creating the target dataset (y)
y = np.array(df['Prediction'])[:-future_days]
We create the target dataset from the ‘Prediction’ column, convert it to a numpy array, and keep every value except the last future_days rows, which are NaN.
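As a quick sanity check on the two slicing steps, the following sketch builds X and y from a made-up five-row frame with a horizon of 2 (both are illustrative assumptions) and confirms that features and targets line up and that no NaN survives.

```python
import numpy as np
import pandas as pd

# Made-up values with a short horizon, for illustration only
toy = pd.DataFrame({"Value": [10.0, 11.0, 12.0, 13.0, 14.0]})
future_days = 2
toy["Prediction"] = toy["Value"].shift(-future_days)

# Features: every column except the target, minus the rows without a target
X = np.array(toy.drop(columns=["Prediction"]))[:-future_days]
# Targets: the shifted column, minus the trailing NaN rows
y = np.array(toy["Prediction"])[:-future_days]

print(X.shape, y.shape)   # X is 2-D (rows, features); y is 1-D
print(np.isnan(y).any())  # no NaN should remain after slicing
```

scikit-learn expects a 2-D feature array and a 1-D target array, which is why X keeps its single column as a second dimension.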
IX. Splitting the data
75% training and 25% testing data
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
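The sketch below shows what a 75/25 split produces on made-up data. Note that train_test_split shuffles rows by default. The second call, with shuffle=False, is an alternative that is not part of this tutorial: it keeps the rows in order so the test set is the most recent 25%, which some people prefer for time-ordered data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows with one feature
X = np.arange(100, dtype=float).reshape(-1, 1)
y = X.ravel() + 5.0

# Default behaviour: rows are shuffled before splitting (random_state fixes the shuffle)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
print(len(x_train), len(x_test))

# Alternative (not used in this tutorial): keep chronological order,
# so the test set is the final 25% of the rows
x_tr, x_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, shuffle=False)
print(x_tr.max(), x_te.min())
```

Passing random_state is optional; it only makes the shuffled split repeatable between runs.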
X. Creating the ML model
model = LinearRegression().fit(x_train, y_train)
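Once a model is fitted, its slope, intercept and test-set score can be inspected. The sketch below does this on synthetic data generated from a known line (slope 3, intercept 2, plus noise), which is an assumption for illustration and not the Dow Jones data. score() returns the R² coefficient of determination.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data from a known line, y = 3x + 2, with a little Gaussian noise
rng = np.random.default_rng(0)
X = np.linspace(0, 10, 200).reshape(-1, 1)
y = 3.0 * X.ravel() + 2.0 + rng.normal(scale=0.5, size=200)

x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
model = LinearRegression().fit(x_train, y_train)

# The fitted slope and intercept should land close to 3 and 2
print(model.coef_[0], model.intercept_)

# R^2 on the held-out rows; 1.0 would be a perfect fit
r2 = model.score(x_test, y_test)
print(r2)
```

The same model.score(x_test, y_test) call can be run on the tutorial's own split to get a number to go with the chart.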
XI. Get the last ‘x’ rows of the feature dataset
# Features for the last future_days rows that still have a target
x_future = df.drop(columns=['Prediction'])[:-future_days]
x_future = x_future.tail(future_days)
x_future = np.array(x_future)
x_future
XII. Getting the predictions and viewing the results
# Predict the value future_days ahead for each row of x_future
predictions = model.predict(x_future)
# The rows after the feature window hold the actual values to compare against
valid = df[X.shape[0]:].copy()
valid['Predictions'] = predictions
plt.figure(figsize=(16, 8))
plt.title('Dow Jones')
plt.xlabel('Days')
plt.ylabel('Close Price USD ($)')
plt.plot(df['Value'])
plt.plot(valid[['Value', 'Predictions']])
plt.legend(['Orig', 'Val', 'Pred'])
plt.show()
The accuracy chart will be displayed as follows:
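Besides reading the chart by eye, the gap between the two lines can be summarised as a single number. The sketch below computes the root mean squared error and the mean absolute error on two short made-up arrays. On the real data, valid['Value'] and valid['Predictions'] would take their place. These metrics are an addition for illustration and are not part of the original walkthrough.

```python
import numpy as np

# Made-up actual and predicted values, standing in for
# valid['Value'] and valid['Predictions']
actual = np.array([100.0, 102.0, 101.0, 105.0])
predicted = np.array([101.0, 101.0, 103.0, 104.0])

errors = actual - predicted

# Root mean squared error: penalises large misses more heavily
rmse = np.sqrt(np.mean(errors ** 2))

# Mean absolute error: the average miss, in the same units as the prices
mae = np.mean(np.abs(errors))

print(rmse, mae)
```

Both numbers are in index points, so they can be compared directly with the scale of the y-axis on the chart.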