Nonlocal regularized CNN for image segmentation



(Communicated by Tieyong Zeng)
Abstract. Non-local dependency is an important prior for many image segmentation tasks. Convolutional operations are building blocks that process one local neighborhood at a time, which means convolutional neural networks (CNNs) usually do not explicitly make use of this non-local prior in image segmentation tasks. Though pooling and dilated convolutions can enlarge the receptive field to use some nonlocal information during the feature extraction step, no nonlocal prior is imposed on the feature classification step in current CNN architectures. In this paper, we present a nonlocal total variation (TV) regularized softmax activation function for semantic image segmentation tasks. The proposed method can be integrated into the architecture of CNNs. To handle the difficulty of back-propagation caused by the non-smoothness of nonlocal TV, we develop a primal-dual hybrid gradient method to realize the back-propagation of nonlocal TV in CNNs. Experimental evaluations of the nonlocal TV regularized softmax layer on a series of image segmentation datasets showcase its good performance. Many CNNs can benefit from our proposed method on image segmentation tasks.
1. Introduction. Image segmentation has long been a hot topic and has attracted researchers from many fields all over the world. Generally speaking, given an image, image segmentation aims to classify every pixel into one of several classes. In the past few decades, many kinds of methods have been proposed for image segmentation. Depending on whether labels are available, these methods can be broadly divided into two types: unsupervised methods and supervised methods. Given no label prior, unsupervised methods such as thresholding methods [24], edge based methods [7], region based methods [1], partial differential equation (PDE) and variational based methods [12,21,20], graph partitioning methods [29] and their variations first appeared in the last century. These methods usually impose constraints derived from prior information such as image intensity or shape. When annotated samples are available, supervised methods learn valid information from a training dataset: discriminative features and context information are extracted, followed by a dense pixel-wise classification.
Total variation (TV), first introduced to computer vision by Rudin, Osher and Fatemi [26], exhibits prominent performance in image restoration problems. It is one of the most popular regularization methods in the image processing field due to its good behavior in minimization problems. In recent decades, TV has been explored by many researchers and extended to a series of forms for other image processing tasks, such as anisotropic TV [6], weighted TV [31], a fourth-order PDE model (two-step method) [18], higher-order TV [5], and non-local TV [8]. An effective framework has also been proposed in recent years: after introducing a novel region force term into the Potts model, it achieves good performance in multi-phase image segmentation and semi-supervised data clustering [33]. This method can be easily applied to high dimensional data clustering tasks via graph total variation.
Convolutional Neural Networks (CNNs) [14,15] have achieved distinguished performance in a series of tasks in the last decade. Especially in computer vision, CNNs showcase prominent abilities in learning discriminative features from large scale datasets. Leading other methods by a large margin, CNNs take first place in many tasks such as image classification, object detection and image segmentation. Semantic image segmentation is a dense classification task which aims to classify each pixel into a certain class. It not only segments a given image into several regions, but also tells which label each pixel belongs to [4,28].
Fully Convolutional Networks (FCNs) [17] were the first successful attempt at semantic image segmentation via an end-to-end CNN framework. Noh et al. [22] proposed an extension of FCNs: they used a VGG [30] 16-layer network as the convolution network, followed by a series of up-pooling and deconvolutional layers. Utilizing the spatial dependency between neighboring pixels, Conditional Random Fields (CRFs) were employed as a post-processing step after CNNs to refine segmentation results [16]. CNNs were also introduced to medical image processing due to their powerful abilities. Inspired by FCN, U-Net [25] uses a symmetric structure and adds FCN-like skip connections to concatenate feature maps from different levels. With abundant features from different levels, U-Net achieved very prominent performance and has since been applied to many medical imaging tasks such as image translation. In recent years, variations of U-Net have come forth. Attention U-Net [23] employs attention gates to help the CNN focus on extracting discriminative features from the foreground. R2U-Net [2] introduces recurrent residual blocks to U-Net and achieves better results on several retinal blood vessel segmentation datasets. Different variations of U-Net continue to improve segmentation performance on many kinds of medical image datasets.
However, convolution operators can only learn features from local context, while long range dependency is also important information in semantic image segmentation. Dilated convolution operators [35] can capture long range information: they have a larger receptive field with the same computation and memory costs while preserving resolution. But dilated convolution loses some position information in image segmentation. What is more, the receptive field of dilated convolution is not continuous, and dilated convolution does not work well on small objects. By capturing multi-scale feature information, well designed Hybrid Dilated Convolution (HDC) [32] can eliminate some of these disadvantages and improve segmentation performance. Although HDC works well when the training corpus is large enough, it can hardly showcase its performance when the training dataset is quite small. That is why dilated convolution operators seldom appear in CNNs for medical image segmentation.
Given a few training samples, we explore the potential of non-local operators and provide a novel way to capture long range information in CNNs. In summary, the contributions of this paper are as follows:
• We introduce graph total variation to the softmax activation function; one can easily extend this model to other activation functions in CNNs. Some earlier works have tried to introduce local total variation to the softmax activation function [10], but it is well known that non-local dependency is an important prior for the image segmentation problem.
• We introduce a primal-dual hybrid gradient method for our proposed regularized softmax activation function that enables end-to-end training.
• Experimental results show the good performance of nonlocal total variation. A local total variation regularized softmax activation function can produce smoother objects, but it may lose some details such as corners. It is numerically verified that nonlocal total variation can eliminate isolated regions and preserve object details at the same time.
The paper is organized as follows. In Section 2, we give a brief description of related work. Our proposed method is presented in Section 3, where we apply it to the softmax layer, give the general formulas for forward and backward propagation, and discuss some implementation details. The experimental results are described in Section 4, and the conclusions follow in Section 5.

2. Related work.

2.1. Multi-phase image segmentation. Let I(x) be an image defined on a domain Ω ⊂ R². The multi-phase image segmentation task is to classify Ω into K partitions, where K is the number of classes. Let {Ω_k}_{k=1}^K be the partitions; we have Ω = ∪_{k=1}^K Ω_k and Ω_{k'} ∩ Ω_k = ∅ when k' ≠ k. The Potts model is a general variational image segmentation model for multi-phase segmentation. It consists of two terms, a data fidelity term and a regularization term. Generally, it can be written as the following minimization problem:

(1)  min_{{Ω_k}} Σ_{k=1}^K ∫_{Ω_k} f_k(x) dx + Σ_{k=1}^K |∂Ω_k|_α.

The second term in Eq. (1) is a jump penalty, which is usually defined as

|∂Ω_k|_α = ∫_{∂Ω_k} α(x) ds,

where α(x) is an edge detector defined as α(x) = β / (1 + γ|∇I_σ|²). Here γ and β are manually set parameters controlling the behavior of the edge detector, and I_σ is the result of convolving the image I(x) with a Gaussian kernel g_σ. The jump penalty is a scaled sum of the total boundary length when α(x) is a constant λ ∈ R.
If we define an indicator function φ_k(x) (k = 1, 2, ..., K) on the k-th sub-domain,

φ_k(x) = 1 if x ∈ Ω_k, and φ_k(x) = 0 otherwise,

then the binary segmentation condition Σ_{k=1}^K φ_k(x) = 1, φ_k(x) ∈ {0, 1}, becomes a relaxed one:

(5)  S = { φ : Σ_{k=1}^K φ_k(x) = 1, φ_k(x) ∈ [0, 1] }.

Replacing the binary segmentation constraint on φ_k in (1) with (5), one obtains the following convex programming problem, which is a dual of a min-cut problem:

(6)  min_{φ ∈ S} Σ_{k=1}^K ∫_Ω f_k(x) φ_k(x) dx + Σ_{k=1}^K ∫_Ω α(x) |∇φ_k(x)| dx.

2.2. Graph model for data clustering. A graph model is a useful tool if we want to utilize pairwise relations between pixels. An undirected weighted graph G = (V, E, w) is constructed from a vertex set V, an edge set E and a weight function w : E → R⁺ defined on the edges. In the image segmentation task, each image can be seen as a graph in which each pixel is a vertex. For x_i, x_j ∈ V, w_ij = w(x_i, x_j) measures the similarity between the two vertexes. Since an image often has tens of thousands of pixels, the computation and memory cost would be extremely large for a complete graph. Therefore, we assume that each pixel is connected to only a portion of the other pixels, which yields a sparse affinity matrix W. There are several methods to measure the similarity between two pixels. Given a distance metric dist(·,·) on the feature vectors of two pixels x_i and x_j, the radial basis function (RBF) [27] is defined as:

w(x_i, x_j) = exp( − dist(x_i, x_j)² / (2σ²) ).

If we replace the constant 2σ² with the product of local variances σ(x_i)σ(x_j), we obtain the Zelnik-Manor and Perona function (ZMP) [36]:

w(x_i, x_j) = exp( − dist(x_i, x_j)² / (σ(x_i)σ(x_j)) ).
The cosine similarity function is also widely used to measure the similarity between two non-zero vectors. It is defined as:

w(x_i, x_j) = ⟨f_i, f_j⟩ / (‖f_i‖ ‖f_j‖).

In the fully connected pairwise CRF model [13], the weight function is often defined by pairwise potentials as

(10)  w(x_i, x_j) = µ(x_i, x_j) Σ_{m=1}^M λ_m exp( −½ (f_i − f_j)ᵀ Λ_m (f_i − f_j) ),

where f_i and f_j are feature vectors for pixels x_i, x_j, λ_m is a coefficient controlling the impact of each kernel, µ represents the label compatibility function, and Λ_m is a symmetric, positive-definite precision matrix.
Let µ(x_i, x_j) = 1 if x_i ≠ x_j and µ(x_i, x_j) = 0 otherwise, and use the RGB color vectors I_i, I_j and the spatial positions p_i, p_j as feature vectors. Given two different pixels, the weight function in Eq. (10) can then be rewritten as:

(11)  w(x_i, x_j) = λ_1 exp( − |p_i − p_j|² / (2σ_α²) − |I_i − I_j|² / (2σ_β²) ) + λ_2 exp( − |p_i − p_j|² / (2σ_γ²) ),

where σ_α, σ_β, σ_γ are parameters controlling the scale of the Gaussian kernels. The first term depends on both pixel colors and positions: pixels with small differences in positions and colors are likely to have the same label. The second term only takes the spatial correlation into account, so isolated points and regions will be removed.
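For concreteness, the two-kernel weight above can be sketched in NumPy. This is an illustrative sketch, not the paper's implementation: the function name `pairwise_weight` and the default values of λ₁, λ₂, σ_α, σ_β, σ_γ are our own choices.

```python
import numpy as np

def pairwise_weight(I_i, I_j, p_i, p_j,
                    lam1=1.0, lam2=1.0,
                    sigma_alpha=8.0, sigma_beta=0.1, sigma_gamma=3.0):
    """Two-kernel weight in the style of Eq. (11): an appearance kernel
    on positions and colors plus a smoothness kernel on positions only.
    All lambdas/sigmas are illustrative values, not the paper's."""
    dp2 = np.sum((p_i - p_j) ** 2)   # squared spatial distance
    dI2 = np.sum((I_i - I_j) ** 2)   # squared color distance
    appearance = lam1 * np.exp(-dp2 / (2 * sigma_alpha ** 2)
                               - dI2 / (2 * sigma_beta ** 2))
    smoothness = lam2 * np.exp(-dp2 / (2 * sigma_gamma ** 2))
    return appearance + smoothness
```

As expected, identical pixels receive the maximal weight λ₁ + λ₂, and the weight decays as color or spatial distance grows.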

2.3. Graph operators. After introducing the weight functions on the graph, we now give some graph operators. An important one is the gradient operator. Given u ∈ L²(V) defined on the vertex set, the non-local gradient operator is

(14)  (∇_w u)(x_i, x_j) = √(w_ij) ( u(x_i) − u(x_j) ).

Since we assume each pixel is connected to only a small portion of the other pixels, we obtain a sparse graph G in which each x_i has at most d neighbors, so each row of ∇_w u has at most d non-zeros. Correspondingly, the non-local divergence operator is given by

(16)  (div_w v)(x_i) = Σ_j √(w_ij) ( v(x_i, x_j) − v(x_j, x_i) ),

where v ∈ L²(V × V). With these conventions, the two operators satisfy the adjoint relation ⟨∇_w u, v⟩ = ⟨u, div_w v⟩.
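The two operators and the adjoint relation between them, which the dual formulation below relies on, can be checked numerically with a small dense sketch (the paper's graphs are sparse; function names are ours):

```python
import numpy as np

def graph_gradient(u, W):
    """(grad_w u)_{ij} = sqrt(w_ij) * (u_i - u_j); W is a symmetric
    affinity matrix (dense here, sparse in practice)."""
    s = np.sqrt(W)
    return s * (u[:, None] - u[None, :])

def graph_divergence(v, W):
    """(div_w v)_i = sum_j sqrt(w_ij) * (v_ij - v_ji), the adjoint of
    graph_gradient under the standard inner products."""
    s = np.sqrt(W)
    return np.sum(s * (v - v.T), axis=1)

# sanity check of the adjoint relation <grad_w u, v> == <u, div_w v>
rng = np.random.default_rng(0)
W = rng.random((5, 5)); W = (W + W.T) / 2   # symmetric weights
u = rng.random(5); v = rng.random((5, 5))
lhs = np.sum(graph_gradient(u, W) * v)
rhs = np.dot(u, graph_divergence(v, W))
assert np.isclose(lhs, rhs)
```

The adjoint identity holds for any symmetric W, which is what allows the total variation to be written in the dual (max) form used below.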
2.4. Discrete Potts model. Given a graph G = (V, E, w), we want to classify the vertexes V into K partitions, denoted by V_1, ..., V_K. The corresponding indicator function φ_k(x_i) for the k-th class is defined as:

φ_k(x_i) = 1 if x_i ∈ V_k, and φ_k(x_i) = 0 otherwise.

The discrete counterpart of the Potts model defined in Eq. (6) is given by:

(18)  min_{φ ∈ S} Σ_{k=1}^K ⟨f_k, φ_k⟩ + Σ_{k=1}^K NLTV_α(φ_k),

where f_k(·) is a region force function and NLTV_α(φ_k) is the α-weighted non-local total variation. As the dual norm of the ℓ¹-norm is the ℓ∞-norm, NLTV_α(φ_k) has the following form:

NLTV_α(φ_k) = max_{‖q_k‖_∞ ≤ α} ⟨∇_w φ_k, q_k⟩ = max_{‖q_k‖_∞ ≤ α} ⟨φ_k, div_w q_k⟩,

where ⟨·, ·⟩ is the standard inner product of two vectors, q_k is the dual variable of φ_k, and ∇_w and div_w are the non-local gradient and divergence operators defined in Eq. (14) and Eq. (16), respectively.

Inverse Problems and Imaging, Volume 14, No. 5 (2020), 891-911
3. Proposed method. Usually, a softmax layer is employed as the last layer of a neural network, converting an input feature vector into a probability distribution vector whose elements sum to 1.
3.1. Softmax function. Given an image of size N = N₁ × N₂, where N₁ and N₂ are the height and width, suppose we want to segment the image into K classes using a CNN. The softmax layer can be characterized by the following minimization problem:

min_{A ∈ S} Σ_{i=1}^N Σ_{k=1}^K ( −o_ik a_ik + a_ik ln a_ik ),

where A = (a_ik) ∈ R^{N×K} is the activation, S is the soft segmentation condition defined in (5), and o = (o_ik) ∈ R^{N×K} is the feature map taken as input. Its minimizer is exactly the softmax function:

(22)  a_ik = exp(o_ik) / Σ_{k'=1}^K exp(o_ik').
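The claim that softmax minimizes this entropy-regularized linear objective can be checked numerically; the sketch below is illustrative (function names are ours), and the minimum value −ln Σ_k exp(o_ik) follows from the standard log-sum-exp duality:

```python
import numpy as np

def softmax(o):
    """Row-wise softmax a_ik = exp(o_ik) / sum_k' exp(o_ik'),
    shifted by the row maximum for numerical stability."""
    z = o - o.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def objective(a, o):
    """Objective <a, -o> + <a, ln a>, whose minimizer over rows
    summing to 1 is softmax(o)."""
    return np.sum(-o * a + a * np.log(a))

o = np.array([[1.0, 2.0, 0.5]])
a_star = softmax(o)
# the minimum value equals -log(sum_k exp(o_k))
assert np.isclose(objective(a_star, o), -np.log(np.exp(o).sum()))
```

Sampling other points on the probability simplex always gives a larger objective value, confirming the minimizer property.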

3.2. Proposed non-local TV regularized softmax function. Now we replace the region force term in Eq. (18) with the entropy data term of the softmax function (22) and regularize the prediction result by non-local total variation. Setting the edge detector α(x) to a constant parameter λ, the regularized softmax function is defined as:

(25)  min_{A ∈ S} Σ_{k=1}^K ( ⟨A_k, −o_k⟩ + ⟨A_k, ln A_k⟩ ) + λ Σ_{k=1}^K NLTV(A_k).

Using the variational (dual) formulation of the non-local total variation,

NLTV(A_k) = max_{‖η_k‖_∞ ≤ 1} ⟨∇_w A_k, η_k⟩,

where η_k ∈ R^{N×N} is the dual variable of A_k, the minimization problem Eq. (25) can be reformulated as a saddle-point problem:

min_{A ∈ S} max_{‖η_k‖_∞ ≤ λ} Σ_{k=1}^K ⟨A_k, −o_k⟩ + ⟨A_k, ln A_k⟩ + ⟨∇_w A_k, η_k⟩.

This saddle-point problem can be solved by the primal-dual hybrid gradient (PDHG) method, updating the dual variables η_k and the primal variables A_k alternately. The iteration is given by

(28)  η_k^{t+1} = Proj_{‖·‖_∞ ≤ λ}( η_k^t + τ ∇_w A_k^t ),   A^{t+1} = S( o − div_w η^{t+1} ),

where S(·) denotes the softmax function (22) and τ is the step size. We also record the primal energy and the dual energy during the iteration to monitor the convergence of the algorithm. The primal energy E_P(A) is:

(29)  E_P(A) = Σ_{k=1}^K ⟨A_k, −o_k⟩ + ⟨A_k, ln A_k⟩ + λ NLTV(A_k).

The dual energy E_D(η) is obtained by minimizing the saddle-point objective over A ∈ S in closed form:

(30)  E_D(η) = −Σ_{i=1}^N ln Σ_{k=1}^K exp( o_ik − (div_w η_k)_i ).

There are two stopping criteria: a maximum of 1500 iterations is reached, or the relative absolute duality gap is smaller than a threshold e, i.e.

|E_P(A) − E_D(η)| / |E_P(A)| < e,

where e = 10⁻⁵ in our experiments. Replacing softmax with the regularized softmax, we obtain the regularized activation A* = S(o − div_w(η*)), η* = (η*_1, η*_2, ..., η*_K). In our numerical experiments, we set τ = 0.03 and λ = 3. Generally, we initially select a large λ and a small step size τ to run the algorithm. Since the parameters λ and τ are image dependent, we iteratively fine-tune them and finally select the best set. The whole procedure is summarized as Algorithm 1.

Algorithm 1 Non-local TV regularized softmax
1: Input: feature map o, affinity matrix W, parameters λ, τ, iteration number T
2: Initialize A⁰ = S(o), η⁰ = 0
3: for t = 0, ..., T do
4:   for k = 1, ..., K do
5:     η_k^{t+1} = Proj_{‖·‖_∞ ≤ λ}( η_k^t + τ ∇_w A_k^t )
6:   end for
7:   calculate div_w η^{t+1}
8:   for k = 1, ..., K do
9:     A_k^{t+1} = S( o − div_w η^{t+1} )_k
10:  end for
11: end for
12: Output: A* = A^{T+1}
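A minimal NumPy sketch of the PDHG iteration in Eq. (28) on a small dense graph may help make the algorithm concrete. It is not the paper's implementation: the paper uses a sparse k-nearest-neighbor graph and the duality-gap stopping rule, whereas this sketch simply stops when the iterates stabilize; all names are ours.

```python
import numpy as np

def nonlocal_tv_softmax(o, W, lam=3.0, tau=0.03, n_iter=1500, tol=1e-5):
    """PDHG iteration of Eq. (28) on a dense symmetric affinity matrix.
    o: (N, K) feature map, W: (N, N) affinity matrix."""
    N, K = o.shape
    s = np.sqrt(W)

    def softmax(x):
        z = x - x.max(axis=1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=1, keepdims=True)

    def grad_w(u):                  # (N,) -> (N, N)
        return s * (u[:, None] - u[None, :])

    def div_w(v):                   # (N, N) -> (N,)
        return np.sum(s * (v - v.T), axis=1)

    A = softmax(o)                  # A^0 = S(o)
    eta = np.zeros((K, N, N))       # one dual variable per class
    for _ in range(n_iter):
        # dual ascent step followed by projection onto {|eta| <= lam}
        for k in range(K):
            eta[k] = np.clip(eta[k] + tau * grad_w(A[:, k]), -lam, lam)
        # primal update through the softmax
        d = np.stack([div_w(eta[k]) for k in range(K)], axis=1)
        A_new = softmax(o - d)
        if np.max(np.abs(A_new - A)) < tol:   # simplified stopping rule
            A = A_new
            break
        A = A_new
    return A
```

Each row of the output remains a probability distribution, so the layer can be dropped in wherever a plain softmax is used.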
Notice that the second term in Eq. (25) can be seen as a negative entropy term which enforces A to be smooth. If we add a control parameter ε > 0 to it, Eq. (25) becomes

(34)  min_{A ∈ S} Σ_{k=1}^K ( ⟨A_k, −o_k⟩ + ε ⟨A_k, ln A_k⟩ ) + λ Σ_{k=1}^K NLTV(A_k).

The corresponding primal minimizer is

A = S( (o − div_w η) / ε ).

We can see that adding the control parameter ε is equivalent to rescaling the argument of the regularized softmax by a factor 1/ε. In all our experiments, we set ε = 0.5.

3.3. General convolutional neural network for semantic image segmentation. A general convolutional neural network consists of sets of convolution layers and activation layers. Given an input v, a convolution layer can be formulated as:

o = W v + b,

where W is a linear operator such as a convolution or deconvolution, and b is a bias.
An activation layer takes o as input and outputs v = A(o), where A can be ReLU, softmax, sigmoid, sampling or another activation function.
Given an image v⁰ as input, a general convolutional neural network with L layers can be described by the recursive connections

o^l = W^l v^{l-1} + b^l,   v^l = A^l(o^l),   l = 1, ..., L.

A widely used loss function in many tasks is the cross entropy, given by

L = −Σ_{i=1}^N Σ_{k=1}^K y_ik ln a_ik,

where y_ik is the one-hot ground truth label. The learning algorithm is a gradient descent method:

Θ^{step+1} = Θ^{step} − τ_Θ ∂L/∂Θ^{step},   step = 1, 2, ...,

where step is the training iteration number and τ_Θ is a hyper-parameter controlling the learning rate. The gradients ∂L/∂Θ^l can be calculated by the back-propagation technique using the chain rule: letting Δ^l = ∂L/∂o^l, the back-propagation scheme runs through the recursion above in reverse order.

3.4. Back-propagation of the regularized softmax. Since η^t only contributes to computing A^t for t = 1, ..., T+1, the gradient of L with respect to η^t is given by

(42)  ∂L/∂η^t = ∂L/∂A^t · ∂A^t/∂η^t,   t = 1, ..., T+1.

Eq. (28) can be reformulated as:

ξ^{t+1} = η^t + τ ∇_w A^t,   η^{t+1} = Proj_{‖·‖_∞ ≤ λ}(ξ^{t+1}),   A^{t+1} = S(o − div_w η^{t+1}).

ξ^t contributes to computing both η^t and ξ^{t+1} for t = 1, ..., T, while ξ^{T+1} contributes to computing η^{T+1} only. Then the gradient of L with respect to ξ^t is given by

(44)  ∂L/∂ξ^t = ∂L/∂η^t · ∂η^t/∂ξ^t,                    t = T+1,
       ∂L/∂ξ^t = ∂L/∂η^t · ∂η^t/∂ξ^t + ∂L/∂ξ^{t+1},     t = 1, ..., T.

A^t is an input for computing ξ^{t+1} for t = 0, ..., T, so the gradient of L with respect to A^t is given by

(45)  ∂L/∂A^t = ∂L/∂ξ^{t+1} · ∂ξ^{t+1}/∂A^t,   t = 0, ..., T.

o contributes to computing each A^t for t = 0, ..., T+1, and A⁰ is initialized with S(o). Finally, the gradient of L with respect to o is given by

(46)  ∂L/∂o = Σ_{t=0}^{T+1} ∂L/∂A^t · ∂A^t/∂o.

∂L/∂A^{T+1} is given by the loss layer in the backward propagation stage, so we can successively obtain ∂L/∂η^{T+1}, ∂L/∂ξ^{T+1}, ∂L/∂A^T, ..., ∂L/∂η^1, ∂L/∂ξ^1, ∂L/∂A^0 from Eq. (42), Eq. (44) and Eq. (45).
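At the end of this chain, the loss layer supplies ∂L/∂A^{T+1}. For the plain softmax plus cross entropy combination, this gradient with respect to the pre-activation has the well-known closed form A − Y, which the following self-contained sketch verifies by finite differences (function names are ours):

```python
import numpy as np

def softmax(o):
    z = o - o.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(o, Y):
    """L = -sum_ik y_ik * ln(a_ik) with a = softmax(o)."""
    A = softmax(o)
    return -np.sum(Y * np.log(A + 1e-12))

def cross_entropy_grad(o, Y):
    """For plain softmax + cross entropy, dL/do = A - Y."""
    return softmax(o) - Y

# finite-difference check of the analytic gradient
rng = np.random.default_rng(0)
o = rng.normal(size=(4, 3))
Y = np.eye(3)[rng.integers(0, 3, size=4)]   # one-hot labels
g = cross_entropy_grad(o, Y)
eps = 1e-6
num = np.zeros_like(o)
for idx in np.ndindex(o.shape):
    op, om = o.copy(), o.copy()
    op[idx] += eps; om[idx] -= eps
    num[idx] = (cross_entropy(op, Y) - cross_entropy(om, Y)) / (2 * eps)
assert np.allclose(g, num, atol=1e-5)
```

Note that this only covers the standard softmax endpoint of the chain; for the regularized softmax, the full recursion through η^t and ξ^t above is still required.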
At last, ∂L/∂o is given by Eq. (46).

3.5. Implementation details. Since the total variation in this paper is defined on a graph, we treat each input image as a graph G = (V, E, w) in which each pixel is a vertex. One essential problem is how to define a proper edge set E and the weights of the edges. We assume that each pixel is connected to at most d neighbors, and these neighbors are chosen according to the distances between the feature vectors of pixels; the four geometrically nearest neighbors may not be among these d neighbors. When each pixel is connected to every other pixel, G is a fully connected graph. When each pixel is connected to only a few neighboring pixels, G is a sparse graph, and ∇_w A_k and div_w η_k are both sparse. We tried different d and found that a small d works well; we show some experimental results for different d in Section 4. When we introduce the regularized softmax into a CNN, we need to keep each A_k^t, η_k^t and some intermediate variables in graphics memory during the forward propagation stage, as they will be used to compute gradients in the backward propagation stage. Therefore, if d or t is too large, considerable computation and memory resources are required. We use a small t and d in our experiments, but there is still an obvious regularization effect.

Figure 1. An example of segmentation results obtained by applying the algorithm of [34] and our proposed method to an image from BSD500. When the 4 geometrically nearest neighbors are used with weights set to 1, the segmentation is quite smooth and misses details (Figure 1(b)). When Eq. (11) is used to compute W, the segmentation results have more details and better accuracy.
Since the limited GPU memory can only store the variables of at most dozens of steps, we perform Eq. (28) for only one or a few steps in each iteration of the training stage, which already brings a regularization effect. What is more, even though the primal energy curve keeps decreasing for hundreds of steps, the segmentation results change only slightly after dozens of steps. It is a trade-off between accuracy, memory resources and efficiency. We initialize ξ⁰ and η⁰ to 0, so the first iteration is

η_k^1 = Proj_{‖·‖_∞ ≤ λ}( τ ∇_w A_k^0 ),   A^1 = S( o − div_w η^1 ),

with A⁰ = S(o). According to the back-propagation procedure described in Subsection 3.4, the gradient of L with respect to o can then be computed easily.

4. Experimental results. In our experiments, we rescale the intensities of all images to [0,1]. First of all, we try different d and select a proper one by comparing the segmentation results on a toy example.
In [34], several images from BSD500 [19] were selected to test their algorithm, and we use the same image for comparison. In their experiments, each pixel has 4 neighbors and the weights of the edges are set to 1. In our experiments, we use Eq. (11) as our weight function and select the nearest 4, 10, 20 and 40 neighbors, respectively.
From Figure 1, we can see that, when using 4 geometrical neighbors with constant weights, the segmentation result is properly regularized and smoothed, but without many details. However, when using weights computed by Eq. (11), more details are preserved. When there are only a few neighbors, the segmentation results appear a little noisy: there are many obvious isolated small regions on the vegetables and planks. The segmentation results become smoother as the number of neighbors increases. Nevertheless, a large number of neighbors requires extra computation and memory resources. In our experiments, we use d = 20 for the WBC Dataset [37] and d = 10 for the CamVid Dataset [11].
For each network, we use the SGD solver with a momentum of 0.9. We set the learning rate to 0.0001 for Unet and its variations; the weights of Unet, RUnet, AUnet and ARUnet are randomly initialized, while the weights of NLUnet and NLAUnet are finetuned from Unet and AUnet, respectively. We set the learning rate to 0.001 for Segnet, RSegnet and NLSegnet. Like the authors of Segnet, we initialize the weights of Segnet and RSegnet from the VGG model trained on ImageNet [9]. The weights of NLSegnet are finetuned from Segnet.
In the data preparation stage, we compute the affinity matrix for each image. Since the affinity matrix is sparse, we use two matrices to represent it. One is W = (w_ij), which keeps the edge weights computed by Eq. (11). The other is W_idx = (widx_ij), which keeps the indexes of the nearest neighbors of the i-th pixel. We save the two matrices as local files so that we can load them during the training and testing stages. Global pixel accuracy and mean intersection over union (mIoU) are two common metrics in image segmentation tasks, and we use them as our quantitative measures.
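The two-matrix storage scheme can be sketched as follows. This is an O(N²) illustration with an RBF kernel standing in for Eq. (11), and the function name and parameters are our own:

```python
import numpy as np

def build_knn_affinity(features, d=10, sigma=1.0):
    """Build the two (N, d) arrays representing the sparse affinity
    matrix: W holds the edge weights of each pixel's d nearest
    neighbours, W_idx holds their indices. Brute-force O(N^2) search,
    illustration only; the weight is an RBF kernel here for brevity."""
    # pairwise squared distances between feature vectors
    d2 = np.sum((features[:, None, :] - features[None, :, :]) ** 2, axis=2)
    np.fill_diagonal(d2, np.inf)         # a pixel is not its own neighbour
    W_idx = np.argsort(d2, axis=1)[:, :d]
    W = np.exp(-np.take_along_axis(d2, W_idx, axis=1) / (2 * sigma ** 2))
    return W, W_idx
```

Both arrays can then be serialized to disk once per image and loaded cheaply during training and testing, as described above.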
When evaluating a standard machine learning model, the prediction results are usually classified into four categories: true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). Global accuracy gives the percentage of pixels in all images which are correctly classified:

Accuracy = (TP + TN) / (TP + TN + FP + FN).

The Intersection over Union (IoU) metric, also called the Jaccard index, calculates the percentage overlap between the ground truth mask and the prediction output:

IoU = TP / (TP + FP + FN).

For multi-class segmentation tasks, the mean IoU (mIoU) is the mean value of the IoU over all classes. The RE score defined in [10] measures the regularization effect of a segmentation result: segmentation results with lower RE scores have smoother edges and fewer isolated regions.
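Both metrics can be computed directly from the per-class confusion counts; a small sketch (function names are ours):

```python
import numpy as np

def global_accuracy(pred, gt):
    """Fraction of pixels correctly classified."""
    return np.mean(pred == gt)

def mean_iou(pred, gt, num_classes):
    """mIoU: per-class IoU = TP / (TP + FP + FN), averaged over the
    classes that appear in the prediction or the ground truth."""
    ious = []
    for c in range(num_classes):
        tp = np.sum((pred == c) & (gt == c))
        fp = np.sum((pred == c) & (gt != c))
        fn = np.sum((pred != c) & (gt == c))
        denom = tp + fp + fn
        if denom > 0:
            ious.append(tp / denom)
    return np.mean(ious)
```

For example, for pred = [0, 0, 1, 1] against gt = [0, 1, 1, 1], the global accuracy is 0.75 and the per-class IoUs are 1/2 and 2/3.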

4.1. WBC dataset. There are two sub-datasets in the White Blood Cell Image Dataset [37]. The image size in Dataset 1 is 120x120, which is too small for a CNN-based segmentation task. Dataset 2 consists of one hundred 300x300 color images, each with one white blood cell in the center. Each image consists of three classes: nucleus, cell sap and background. Compared to Dataset 1, Dataset 2 is more suitable for a segmentation task and is thus selected in our experiments. The training dataset contains 60 images randomly picked from WBC Dataset 2; the others are used for testing. We finetune Unet with non-local softmax (NLUnet) from Unet for 10k iterations. The CNN weights of Unet and RUnet are randomly initialized and they are trained for 20k iterations. Since the non-local softmax takes up some graphics memory for computing ∇_w A and div_w η, the mini-batch size is three.
Since the affinity matrix W measures the similarity between pixels, if the pixel color values are perturbed, W becomes inaccurate and wrong pixels are selected as nearest neighbors. In our experiments, all the images used in the training and testing stages are clean images; no noise is added to them. From Table 1 we can see that both the mIoU and the accuracy of NLUnet are improved compared to RUnet on the testing dataset. The RE score of NLUnet is higher than that of RUnet, but lower than that of Unet. This is because NLUnet can eliminate some isolated regions and produce smooth edges while still preserving some details.
We show the convergence of RSoftmax and NLSoftmax in Figure 2(a) and Figure 2(b), respectively. The primal energy and dual energy of NLSoftmax are computed by Eq. (29) and Eq. (30), respectively. The primal energy and dual energy of RSoftmax are computed by the same equations after replacing the non-local operators ∇_w, div_w with the local operators ∇, div. In Figure 2, the y-axis represents the energy value and the x-axis represents the iteration number. Since we use a very small step size τ = 0.03, the primal and dual energies converge within 1000 iterations. We could also use a larger step size to make them converge faster.
Figure 3. (a) Enlarged view; (b) ground truth; (c) Unet [25]; (d) RUnet [10]; (e) NLUnet.

In Figure 3, we can see that NLUnet provides more details compared to RUnet. In Figure 3 column 1, the segmentation result of Unet misses some of the nucleus. RUnet provides better segmentation results, with nucleus regions closer to the ground truth, but still misses some details. NLUnet achieves the best segmentation result: there are fewer isolated regions and the edges are smoother compared to Unet, while details are well preserved. Figure 4 is an enlarged view in which we can see the segmentation details clearly. In Figure 3 column 3, there are two thin lines connecting different parts of the nucleus (white region); Unet misses one of them and RUnet misses both. Surprisingly, NLUnet successfully preserves these details. If we take a closer look at the contours of the nucleus and cell sap, we can see that the result of Unet is quite rough while RUnet gives much smoother edges. The edges provided by NLUnet are smoother than those of Unet and closer to the ground truth than those of RUnet. In Figure 3 column 4, we can see that the segmentation result of Unet is fragmented. RUnet gives a smooth segmentation result, but the nucleus is smaller than the ground truth due to the regularization effect, while NLUnet gives a relatively good result that is closer to the ground truth.
Since several variations of Unet have appeared in recent years, we also use Attention Unet (AUnet) [23] to further evaluate the performance of our method. Compared with the original Unet, Attention Unet introduces attention gates to help the network focus its attention on the foreground. We simply add attention gates to Unet as the authors do, using our own Caffe implementation. Except for the attention gates, the other configurations are the same as for Unet.
From Table 2 we can see that the attention gates help improve the performance of Unet. Nevertheless, the Attention Unet with local regularized softmax activation function (RAUnet) and the Attention Unet with non-local regularized softmax activation function (NLAUnet) further improve the mIoU and accuracy, and our proposed method achieves the best result. In Figure 5, we can see that the segmentations of the nucleus (white regions) are very close to the ground truth. Compared with Unet in Figure 3, AUnet gives a more complete nucleus. Inside the cell sap, there are some bubbles which look very close to the background; this may distract the attention gates such that some cell sap pixels are wrongly classified as background.

4.2. CamVid dataset. The CamVid Dataset [11] consists of a sequence of road scene images of size 360x480 collected by driving a car in the city of Cambridge. There are 367 images in the training dataset and 233 images in the testing dataset. The dataset contains 11 classes; pixels that do not belong to these 11 classes are ignored in both the training and testing stages. The authors of Segnet chose this dataset as their benchmark. We apply our non-local regularized softmax layer to Segnet, keeping the other configurations the same. The initial weights of Segnet are finetuned from the VGG model trained on ImageNet, with a mini-batch size of four. The CNN weights of NLSegnet are initialized from Segnet and finetuned for 3k iterations with the learning rate fixed to 0.01. The mini-batch size of NLSegnet is 1.
Figure 6. (a) Enlarged view; (b) ground truth; (c) Segnet [3]; (d) RSegnet [10]; (e) NLSegnet.

From Table 3 we can see that both the mIoU and the accuracy of NLSegnet are improved compared to RSegnet on the testing dataset. The RE score of NLSegnet is higher than that of RSegnet, but lower than that of Segnet. This result is very similar to that on the WBC dataset. It is important to note that the mIoU is significantly improved from 57.79 to 59.84 by NLSegnet. Since mIoU measures the mean intersection over union over all classes, the main gain in mIoU comes from classes with a small proportion of pixels, such as pole and traffic sign. As the non-local softmax can preserve more details, these minor classes benefit greatly.
In Figure 6, we can find that NLSegnet preserves many details such as tree branches, poles and roof tops. In Figure 6, many isolated points and regions are removed in RSegnet and NLSegnet. There is a pink signal sign on the left hand side; its square shape is well preserved by NLSegnet, and distinct details can be found in the enlarged view in Figure 8. However, the signal sign is distorted and becomes irregular in the segmentation results of Segnet and RSegnet. In Figure 6 column 2, the roof top on the left hand side is well preserved by NLSegnet, whose segmentation result is nearly the same as the ground truth; more details can be found in Figure 7. The segmentation result of Segnet is very coarse, whereas RSegnet gives smooth edges but misses some details.

5. Conclusions and future work. Even though the regularized softmax with local operators can eliminate scattered points and tiny regions and give smooth edges, some details are often missed. Inspired by the regularized softmax with local operators, we successfully apply non-local operators to the regularized softmax. Based on the experimental results on the WBC Dataset and the CamVid Dataset, our proposed method clearly helps improve the performance of Unet, Attention Unet and Segnet. The proposed method not only inherits the regularization property of the regularized softmax, but also showcases prominent performance by preserving many more details. Since our method is a variation of the softmax activation function, it is applicable to all networks with softmax; in particular, it shows its strength on small datasets with simple network structures. At present, the parameters in the pairwise potential Eq. (11) are manually tuned. In the future, we will find a way to generate the affinity matrix W online efficiently and make the parameters in Eq. (11) learnable.