From the lesson: Week 2, Regression with multiple input variables. Credit: Andrew Ng (Machine Learning). These notes are meant for my personal review, but I have open-sourced my repository of personal notes because a lot of people found it useful.

We'll see that gradient descent becomes just a little bit different with multiple features (x1, x2, etc.) compared to just one feature. Here's what we had for gradient descent in the case of n = 1 feature: two separate update rules, one for the parameter theta0 and one for theta1. Writing the cost as J of theta, each parameter theta_j is updated as theta_j minus the learning rate alpha times the partial derivative of the cost function with respect to theta_j:

\( \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta) \)

The subtracted term is, of course, the partial derivative of the cost function with respect to theta0 and, in the second rule, with respect to theta1. So we can use gradient descent as a tool to minimize our cost function: the whole idea is that we can give any random initial values to the parameters and then update them on every iteration, considering all the training examples, until the cost stops decreasing.

Debugging gradient descent: make a plot with the number of iterations on the x-axis and the cost \( J(\theta) \) on the y-axis. It has been proven that if the learning rate \( \alpha \) is sufficiently small, then \( J(\theta) \) will decrease on every iteration, so if \( J(\theta) \) ever increases you probably need to decrease \( \alpha \) (though if \( \alpha \) is too small, convergence is slow). An automatic convergence test instead declares convergence when \( J(\theta) \) decreases by less than some small threshold in one iteration; more on convergence checks below. We can also speed up gradient descent by having each of our input values in roughly the same range, which is covered below as feature scaling. At the end of the week, you'll get to practice implementing linear regression in code.
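To make the n = 1 case concrete, here is a minimal NumPy sketch of those two update rules. The function name, the zero initialization, and the fixed iteration count are my own assumptions, not from the course materials:

```python
import numpy as np

def gradient_descent_1d(x, y, alpha=0.01, num_iters=1000):
    """Fit h(x) = theta0 + theta1 * x by batch gradient descent."""
    theta0, theta1 = 0.0, 0.0
    for _ in range(num_iters):
        err = theta0 + theta1 * x - y   # h(x^(i)) - y^(i) for every example
        grad0 = err.mean()              # dJ/d(theta0) = (1/m) * sum(err)
        grad1 = (err * x).mean()        # dJ/d(theta1) = (1/m) * sum(err * x)
        theta0 -= alpha * grad0         # simultaneous update: both gradients are
        theta1 -= alpha * grad1         # computed before either parameter moves
    return theta0, theta1
```

Note that both gradients are computed from the same predictions before either parameter is stepped, which is the simultaneous update the lectures insist on.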
Whereas it turns out gradient descent is a great method for minimizing the cost function J to find w and b, there is one other algorithm that works only for linear regression, and for pretty much none of the other algorithms you see in this specialization, for solving for w and b. Called the normal equation method, it turns out to be possible to use an advanced linear algebra library to just solve for w and b all in one go, without an iterative algorithm. We'll come back to it shortly.

First, we introduce notation for equations where we can have any number of input variables. With n features we have parameters w_1 to w_n as well as b; linear regression with multiple variables is also known as "multivariate linear regression". The hypothesis can be reduced to a single number as a transposed parameter vector multiplied by the feature vector: \( f_{w,b}(x) = w \cdot x + b \), where the dot means dot product. In the optional lab that follows, you'll see how to define a multiple regression model in code and how to calculate the prediction f of x.

Ideally, in linear regression we try to minimize the deviations. In the usual illustration, the observations (red) are assumed to be the result of random deviations (green) from an underlying relationship (blue) between the dependent variable y and the independent variable x. Once the parameters are fit, a new point is handled by simply plugging its features, say the number of bedrooms of a house, into our function to receive the predicted price. Gradient descent itself is basically iteratively updating the values of w and b using the value of the gradient: we start the loop for the given epoch (iteration) number and apply the update rule on every pass. In my implementation, cost.m is a short and simple file that has a function that calculates the value of the cost function with respect to its arguments.

Gradient descent in practice: feature scaling. We can speed up gradient descent by having each of our input values in roughly the same range, give or take. Feature scaling involves dividing the input values by the range (or the standard deviation) of the input variable, and mean normalization involves subtracting the average value for an input variable from the values for that input variable, resulting in a new average value for the input variable of just zero. To implement both of these techniques, adjust your input values as shown in this formula:

\( x_i := \frac{x_i - \mu_i}{s_i} \)

where \( \mu_i \) is the average of all the values for feature (i) and \( s_i \) is the range of values (max - min), or \( s_i \) is the standard deviation. Note that dividing by the range and dividing by the standard deviation give different results; the quizzes in this course use the range, while the programming exercises use the standard deviation.
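Here is a minimal sketch of both techniques together, assuming X is a NumPy array with one training example per row; the helper name is my own. The same mu and s must be reused to scale new inputs at prediction time:

```python
import numpy as np

def scale_features(X):
    """Mean normalization plus feature scaling: (x_i - mu_i) / s_i per column."""
    mu = X.mean(axis=0)   # mu_i: average value of each feature
    s = X.std(axis=0)     # s_i: standard deviation (np.ptp(X, axis=0) gives the range instead)
    return (X - mu) / s, mu, s
```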
As an aside, polynomial regression can be achieved by adding columns that are equal to some existing columns raised to the power of a degree d; more on this at the end of these notes.

Before moving on, a quick side note on that alternative way of finding w and b for linear regression. The normal equation formula is given below:

\( \theta = (X^TX)^{-1}X^Ty \)

There is no need to do feature scaling with the normal equation. When implementing it in Octave, we want to use the 'pinv' function rather than 'inv': 'pinv' will give you a value of \( \theta \) even if \( X^TX \) is not invertible. If \( X^TX \) is noninvertible, the common causes might be redundant features, where two features are very closely related (i.e. they are linearly dependent), or too many features (e.g. m <= n). Solutions include deleting a feature that is linearly dependent with another, deleting one or more features when there are too many, or using "regularization" (to be explained in a later lesson).

Some disadvantages of the normal equation method: first, unlike gradient descent, it is not generalized to other learning algorithms, such as the logistic regression algorithm that you'll learn about next week, or neural networks, or other algorithms you see later in this specialization. Second, it needs to calculate the inverse of \( X^TX \), which is \( O(n^3) \), so the method is quite slow if the number of features is large; with a very large number of features, gradient descent is the better tool. Almost no machine learning practitioners should implement the normal equation method themselves, but if you're using a mature machine learning library and call linear regression, there is a chance that on the back end it'll be using this method to solve for w and b. Just be aware of that.

Instead of thinking of the model as n + 1 separate parameters theta0 through theta n, I'm going to think of the parameters as itself being a vector: theta is an (n + 1)-dimensional vector, and I'll write the cost as just J of theta, a function of the parameter vector. In the w, b notation, the list w_1 through w_n is replaced by the vector w, and J now takes this input of a vector w and a number b and returns a number, J(w, b). One difference from the univariate case is that w and x are now vectors: just as w on the left becomes w_1 on the right, \( x^{(i)} \) becomes \( x^{(i)}_1 \), and that is just for j = 1; the update is repeated for each of the n features. (As a further aside, frameworks such as TensorFlow use reverse-mode automatic differentiation to efficiently find the gradient of the cost function, so these partial derivatives rarely need to be derived by hand.)

I would like to give full credits to the respective authors, as these are my personal Python notebooks taken from deep learning courses from Andrew Ng, Data School and Udemy :) This is a simple Python notebook hosted generously through GitHub Pages from my main personal notes repository at https://github.com/ritchieng/ritchieng.github.io.
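A minimal NumPy sketch of the normal equation; the helper name and the column-of-ones convention are my assumptions:

```python
import numpy as np

def normal_equation(X, y):
    """Closed form: theta = pinv(X^T X) @ X^T @ y. No alpha, no iterations."""
    return np.linalg.pinv(X.T @ X) @ X.T @ y

# Usage sketch: prepend a column of ones so theta[0] plays the role of b.
# Xb = np.c_[np.ones(len(y)), X]
# theta = normal_equation(Xb, y)
```

Because np.linalg.pinv returns a pseudoinverse, this still yields a theta when \( X^TX \) is singular, mirroring Octave's pinv behavior.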
The mean squared error equation. One common metric for the deviation we minimize is the mean squared error (MSE). There are three steps in this function: 1) find the difference between the actual y and the predicted y value, 2) square this difference, and 3) find the mean of the squares over every example. With the course's conventional factor of one half, the cost function can be defined as J of w_1 through w_n, b:

\( J(w,b) = \frac{1}{2m} \sum_{i=1}^{m} \left( f_{w,b}(x^{(i)}) - y^{(i)} \right)^2 \)

For a one-feature model y = mx + c, if we plot the slope m and the intercept c against the MSE, the surface acquires a bowl shape; for some combination of m and c we get the least error. Gradient descent steps down this cost surface in the direction of steepest descent, hence the value of J decreases.
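Those three steps as a minimal sketch; the halving follows the 1/(2m) convention above, and the names are mine:

```python
import numpy as np

def mse_cost(pred, y):
    """Mean squared error with the 1/2 factor: J = (1/(2m)) * sum((pred - y)^2)."""
    err = pred - y           # step 1: difference between predicted and actual y
    sq = err ** 2            # step 2: square the difference
    return sq.mean() / 2.0   # step 3: take the mean (halved to simplify the gradient)
```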
Now that we know the basic concepts behind gradient descent and the mean squared error, let's implement what we have learned in Python, using NumPy. The equation of linear regression is y = w * X + b, where X is the input or independent variable, y is the output or dependent variable, and w and b are the weights and bias respectively. For multiple features, the gradient descent equation maintains the same form and is simply repeated for each of the n features. (What we implement here is batch gradient descent: every update considers all the examples. Stochastic gradient descent, which updates from one example at a time, is a common variant, and the loss can be any differentiable loss function.)

In the Octave version, main.m first loads the data set that we are going to use to train our software. For the Python version, we first fit the model: we initialize the weights and biases, either as zeros or, as I did, with the w parameter as a weights vector initialized to np.array([[1, 1, 1]]) and X as a DataFrame where each column represents a feature, with an added column of all 1s for the bias. Then we start the loop, calling a gradient_descent step on each pass and checking whether the loss is better than the previous one:

```python
def optimize(w, X, y, alpha=0.01, tol=1e-5, max_iters=10000):
    loss = 999999.0                                     # previous loss, initialized very high
    it = 0
    loss_arr = []                                       # history, for plotting J vs. iterations
    while it < max_iters:
        w, new_loss = gradient_descent(w, X, y, alpha)  # one update step (sketched below)
        loss_arr.append(new_loss)
        if abs(loss - new_loss) < tol * loss:           # stop once the improvement is tiny
            break
        loss, it = new_loss, it + 1
    return w, loss_arr
```

After training, I just create a plot with one variable and the output, and construct the prediction line based on the found values of theta0 and theta1.
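The loop above leans on a gradient_descent helper that performs one update and reports the loss. Here is a minimal sketch of such a helper; its signature, the 1-D arrays for w and y, and folding the bias into w through the column of ones are my assumptions about the missing code:

```python
import numpy as np

def gradient_descent(w, X, y, alpha=0.01):
    """One batch gradient descent step for f(x) = X @ w (bias folded into w)."""
    m = len(y)
    err = X @ w - y                  # f(x^(i)) - y^(i) for all m examples
    grad = X.T @ err / m             # vector of partial derivatives dJ/dw_j
    w = w - alpha * grad             # step against the gradient
    loss = (err ** 2).mean() / 2.0   # cost before the step, for the convergence test
    return w, loss
```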
Let's see what this looks like when you implement gradient descent; in particular, let's take a look at the derivative term. The error term still takes a prediction f of x minus the target y, and the term multiplied by alpha is the derivative of the cost function J with respect to the parameter w_j; similarly, we have an update rule for the parameter b (with univariate regression, we had only one feature, so there was a single w). Each call of the update function above represents one iteration of gradient descent.

To build intuition, let's try applying gradient descent to a slope m and an intercept c and approach it step by step. Initially let m = 0 and c = 0, and let L be our learning rate, which controls how much m changes with each step. Iterating the updates, the cost eventually goes down to the minimum of the function.

How do you check whether gradient descent with multiple variables converged correctly? 1) Check the cost: an automatic convergence test uses something like abs(E_after - E_before) < 0.00001 * E_before, i.e. it declares convergence when the relative decrease of the cost in one iteration falls below a small threshold. 2) Check if your variables have stopped changing; you can adopt a very similar strategy to the one above for the parameters themselves. 3) Check the gradient: gradient descent converges to a local minimum, meaning that the first derivative should be zero there (and the second derivative non-negative). In floating point you cannot compare the derivative to exactly zero, so usually one uses approximations, such as treating magnitudes below roughly 1e-20 as zero. Checking the cost and the parameters together will tell you if the algorithm has converged.
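A minimal sketch combining those three checks; the function name and the thresholds are arbitrary placeholders of mine, to be tuned per problem:

```python
import numpy as np

def has_converged(loss_prev, loss_new, w_prev, w_new, grad, rel_tol=1e-5):
    """Combine the cost, parameter, and gradient convergence tests."""
    cost_flat = abs(loss_prev - loss_new) < rel_tol * loss_prev  # cost stopped decreasing
    params_flat = np.max(np.abs(w_new - w_prev)) < rel_tol       # parameters stopped changing
    grad_small = np.linalg.norm(grad) < 1e-8                     # gradient approximately zero
    return cost_flat or params_flat or grad_small
```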
Polynomial regression, revisited. Our hypothesis function need not be linear (a straight line) if that does not fit the data well. We can change the behavior or curve of our hypothesis function by making it a quadratic, cubic or square root function (or any other form). For example, if our hypothesis function is \( h_\theta(x) = \theta_0 + \theta_1 x_1 \), then we can create additional features based on \( x_1 \) to get the quadratic function \( h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_1^2 \) or the cubic function \( h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_1^2 + \theta_3 x_1^3 \). In the cubic version, we have created new features \( x_2 \) and \( x_3 \) where \( x_2 = x_1^2 \) and \( x_3 = x_1^3 \). To make it a square root function, we could do: \( h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 \sqrt{x_1} \). One important thing to keep in mind is that if you choose your features this way, then feature scaling becomes very important; a sketch of building and scaling such features follows below. Feature selection is not discussed in these notes, but it should always be considered when working with real data and a real model.

To close, let's implement multiple linear regression with gradient descent on synthetic data. First, let's import the prerequisite packages; next, we create a dataset of 200 samples with 7 features using sklearn's make_regression:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=200, n_features=7)
```

That's it for gradient descent for multiple regression: you now know multiple linear regression, and the tricks above, feature scaling and a well-chosen learning rate, will help make it work much better.
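Finally, a minimal sketch of creating the cubic features and scaling them, assuming x1 is a 1-D NumPy array; the helper name is mine. The scaling matters because the ranges explode: if \( x_1 \) runs from 1 to 1000, then \( x_1^2 \) runs up to \( 10^6 \) and \( x_1^3 \) up to \( 10^9 \):

```python
import numpy as np

def cubic_features(x1):
    """Build [x1, x1^2, x1^3] and scale each column into a comparable range."""
    X = np.c_[x1, x1 ** 2, x1 ** 3]   # new features x2 = x1^2, x3 = x1^3
    mu = X.mean(axis=0)
    s = X.std(axis=0)
    return (X - mu) / s               # mean-normalized and scaled, ready for gradient descent
```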