L1 loss, abs L1 loss, and L2 loss
We commonly use L2 loss in deep learning. My question is: why not use L1 loss? Here, I'd like to run an experiment to explain.
Experiment setup
We are going to estimate
w*x = y
where there is only one parameter, w. To make the problem even simpler, we also fix the value of x, picking x = 2.
So what we are going to estimate is
2*w = y
For the regular w*x + b = y estimation, please check this code.
We set the true (unknown) w to 1, and we are going to estimate this w.
Code
Prepare the data
# simulate the L1 loss and L2 loss
# simulate w*x = y where x is a fixed number 2
# It means we are going to simulate 2*w = y
# a noise term z will be added: 2*w + z = y
# if add_noise is False, there is no noise,
# else some Gaussian noise will be added
import numpy as np
import random
import matplotlib.pyplot as plt

# number of samples
N = 20
# initial guess for w
INIT_w = 20
# our target w is 1.0
add_noise = False
x = 2*np.ones((N,))
if add_noise:
    z = np.random.normal(0, 0.05, N)
    y = 1.0*x + z
else:
    y = 1.0*x
Define the loss
def loss(y_hat, y, loss='L1'):
    if loss == 'L1':
        print('L1 loss')
        # note: this is the *signed* difference, not the absolute value
        return np.sum(y_hat - y)
    elif loss == 'L2':
        print('L2 loss')
        return np.sum(np.square(y_hat - y))
Train the model using a gradient descent algorithm
epochs = 20
batch_size = 2
batch_start_idx = list(range(0, N, batch_size))

l1_losses, l2_losses = [], []
w_l1_loss, w_l2_loss = [], []
# set learning rate as 0.01
lr = 0.01

for loss_type in ['L1', 'L2']:
    w = INIT_w
    for epoch in range(epochs):
        for batch_i in batch_start_idx:
            this_x, this_y = x[batch_i:batch_i+batch_size], y[batch_i:batch_i+batch_size]
            y_hat = w*this_x
            L = loss(y_hat, this_y, loss_type)
            if loss_type == 'L1':
                l1_losses.append(L)
                print(f"{loss_type}, {L}")
                # L = y_hat - y, dL/dy_hat = 1
                # dy_hat/dw = x
                # dL/dw = 1*x
                gradient = np.sum(this_x)
                print(gradient)
                w -= lr*gradient
                w_l1_loss.append(w)
            elif loss_type == 'L2':
                l2_losses.append(L)
                print(f"{loss_type}, {L}")
                # L = (y_hat - y)**2, dL/dy_hat = 2(y_hat - y)
                # dy_hat/dw = x
                # dL/dw = 2(y_hat-y)*x
                gradient = np.sum(2*(y_hat-this_y)*this_x)
                w -= lr*gradient
                w_l2_loss.append(w)
Visualization of the algorithm
plt.plot(range(len(l1_losses)), l1_losses, c='r', label='L1 loss')
plt.plot(range(len(l2_losses)), l2_losses, c='g', label='L2 loss')
plt.xlabel('step')
plt.ylabel('loss')
if add_noise:
    plt.title('initialize w:'+str(INIT_w)+' with noise')
else:
    plt.title('initialize w:'+str(INIT_w)+' without noise')
plt.legend(loc='best')
if add_noise:
    plt.savefig('loss_l1_l2_with_noise.png')
else:
    plt.savefig('loss_l1_l2_without_noise.png')
plt.show()
plt.close()

plt.plot(range(len(w_l1_loss)), w_l1_loss, c='r', label='L1 loss')
plt.plot(range(len(w_l2_loss)), w_l2_loss, c='g', label='L2 loss')
plt.xlabel('step')
plt.ylabel('estimate w (target is 1)')
if add_noise:
    plt.title('initialize w:'+str(INIT_w)+' with noise')
else:
    plt.title('initialize w:'+str(INIT_w)+' without noise')
plt.legend(loc='best')
if add_noise:
    plt.savefig('w_l1_l2_with_noise.png')
else:
    plt.savefig('w_l1_l2_without_noise.png')
plt.show()
The full code
The full code can be found here:
Visualization results
From the second plot, we can see that the estimation fails with the (signed) L1 loss: gradient descent keeps pushing the loss toward smaller values, and since the signed difference is unbounded below, the loss can always be made smaller by decreasing w. If we keep training, w will go to -inf.
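To see why, here is a quick numerical illustration (a sketch based on the setup above, not part of the original script): with the signed L1 loss, the batch gradient is sum(this_x), which equals 4 here no matter what w is, so every update moves w down by lr*4 = 0.04 and nothing ever stops the descent.

# Quick illustration (assumed numbers from the setup above: x = 2, batch
# size 2, lr = 0.01). The signed L1 gradient is sum(this_x) = 4, which is
# constant and positive, so w decreases by 0.04 on every single step.
w = 20.0
lr = 0.01
constant_gradient = 2.0 + 2.0   # sum of x over one batch, independent of w
for step in range(5):
    w -= lr * constant_gradient
    print(step, w)              # 19.96, 19.92, 19.88, ... keeps falling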
What if we change the L1 loss to abs L1 loss?
Yes, it works. However, the L2 loss converges much faster, because its gradient is larger when the estimate is far from the optimum, while the abs L1 gradient keeps a constant magnitude, so its progress is slower but steady.
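For reference, here is a minimal sketch of the abs L1 loss and its (sub)gradient in this setting; the actual implementation in abs_l1_l2_loss.py may differ in detail.

# Minimal sketch of the abs L1 loss and its (sub)gradient for y_hat = w*x.
import numpy as np

def abs_l1_loss(y_hat, y):
    # L = |y_hat - y|
    return np.sum(np.abs(y_hat - y))

def abs_l1_gradient(y_hat, y, x):
    # dL/dy_hat = sign(y_hat - y), dy_hat/dw = x
    # dL/dw = sign(y_hat - y) * x   (subgradient 0 is used at y_hat == y)
    return np.sum(np.sign(y_hat - y) * x)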
What about adding some noise?
Gaussian noise is added with mean 0 and standard deviation sigma.
When sigma is small, say sigma = 0.05, we have:
When sigma is large, say sigma = 5:
The estimate based on the L2 loss is better than the one based on the abs L1 loss. We believe this is because the L2 gradient shrinks as the estimate gets close to the global minimum, whereas the magnitude of the abs L1 gradient stays the same, so the updates keep jumping around the minimum.
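A small sanity check (a sketch with assumed values, using the noise-free setup above) makes this visible: as w approaches the target 1, the L2 gradient shrinks toward 0 while the abs L1 gradient stays at 4.

# Sketch (assumed values, noise-free targets for the setup above): compare
# the gradient magnitudes of L2 and abs L1 as w approaches the target 1.
import numpy as np
this_x = np.array([2.0, 2.0])
this_y = 1.0 * this_x                                  # targets for the true w = 1
for w in [1.5, 1.1, 1.01]:
    y_hat = w * this_x
    g_l2 = np.sum(2*(y_hat - this_y)*this_x)           # ~8, 1.6, 0.16 -> shrinks
    g_abs_l1 = np.sum(np.sign(y_hat - this_y)*this_x)  # stays 4
    print(w, g_l2, g_abs_l1)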
How to address this issue?
Approach 1: use a smaller learning rate
The current learning rate is 0.001.
Change it to 0.0001.
We need more iterations in this case, and the result with the abs L1 loss is better than before.
Approach 2: use a decaying learning rate
We reset the learning rate to 0.01 and set decay = 1/2000; the learning rate then decays over time as
lr = init_lr*(1. / (1. + decay * iterations))
The update of w is the learning rate times the gradient, and with noise the gradient fluctuates from batch to batch; this explains why the estimate of w fluctuates. A smaller or decaying learning rate shrinks these fluctuations.
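As a sketch, the time-based decay can be wired into the mini-batch loop like this (assuming the same variables x, y, batch_size, batch_start_idx, epochs, and INIT_w defined in the earlier code, and the abs L1 subgradient; the exact code in abs_l1_l2_loss.py may differ).

# Sketch: time-based learning-rate decay inside the mini-batch loop above,
# using the abs L1 (sub)gradient for the update of w.
init_lr, decay = 0.01, 1.0/2000
iterations = 0
w = INIT_w
for epoch in range(epochs):
    for batch_i in batch_start_idx:
        this_x = x[batch_i:batch_i+batch_size]
        this_y = y[batch_i:batch_i+batch_size]
        lr = init_lr * (1. / (1. + decay * iterations))
        y_hat = w * this_x
        gradient = np.sum(np.sign(y_hat - this_y) * this_x)
        w -= lr * gradient
        iterations += 1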
Code related to the abs L1 loss
The code for this part is in the same repo as the previous one; the file name is abs_l1_l2_loss.py.
Thanks for reading.