xNLMeans for AviSynth 2.6 DOCUMENTATION   v 0.03

ABSTRACT

XNLMeans is an AviSynth plugin implementation of the Non Local Means denoising algorithm, as described by Buades et al. This implementation provides several optimizations and extensions over the original publication and other implementations.

FUNCTIONS SYNTAX

clip xNLMeans(clip, int "a", int "s", int "d", float|clip "h", float "sdev", int "planes", bool "lsb", clip "mask", clip "rclip", float "digits", float "diffout", float "lcomp", float "vcomp", float "wmin", float "dw", int "threads")
float xNLMeans_VersionNumber()

PIXEL FORMATS

XNLMeans supports all 8 bit AviSynth 2.6 frame formats and planar stacked 16 bit frame formats (see the dither package for a description). 8 bit YUY2 and RGB frames are internally converted, as the algorithm natively works on planar frames. See the remarks below. This plugin performs all computations on single-component greyscale planes of image frames, i.e. R,G,B or Y,U,V separately. In the 'rclip' section, a method to use R+G+B together is shown.

METHOD OF OPERATION

The Non Local Means algorithm proposes, for every single pixel in an image, to take the spatial pattern of its surrounding pixels as a fingerprint (or patch) for this pixel, and then to derive the filtered result from the weighted mean value of all other pixels in the image with similar surrounding fingerprints, where the weights match the similarities. When this is implemented as an algorithm, the comparison is restricted to a seek square of (2a+1)·(2a+1) possibly similar pixels. Otherwise it would take too much time to complete the calculation, and the pixels in this area should provide sufficient data for practical use. Also, only a square of (2s+1)·(2s+1) pixels is defined as the region of surrounding pixels that determines the fingerprint. More central positions in the fingerprint area may or may not be given a higher weight than the more distant ones.
In any case, the differences between corresponding pixels in the two fingerprints are averaged, resulting in a scalar 'similarity' between them.

The paper introduces the term method noise. This is the specific noise image which is extracted from a noisy image by a denoising method, i.e. the difference between the noisy and the denoised image. If the noise of a noisy image is of random nature, without relations between the noise parts of neighboring pixels, then the ideal method noise is also uncorrelated between pixels, with fine structure and without any original image content.

Optimized Computation Order for Geometrical Convolution

XNLMeans uses an unusual computation order to calculate the convolutions. It is obvious that there are many repeated calculations in the algorithm. More specifically, with a pixel to be filtered (x,y) pxy and a possibly similar test pixel (x+a,y+b), the ranges (x-s,y-s)...(x+s,y+s) and (x+a-s,y+b-s)...(x+a+s,y+b+s) are the two fingerprints. With the center pixel (x,y+1) and the test pixel (x+a,y+b+1), the new fingerprints just lack the top rows and have additional bottom rows (and for non-flat weighting, all pixel weights are different). Instead of keeping the pixel to be filtered fixed and comparing all possibly similar pixels around it in a sequence, xNLMeans fixes one offset after another and walks through the whole image with it. This makes it possible to calculate intermediate results for each row and then just add up the appropriate intermediate results. For the 1st pixel in each sequence, and when a mask is used, this benefit does not (fully) apply, but it is valid for the majority of pixels. Roughly speaking, for the addition of (2s+1)·(2s+1) weighted differences, only (2s+1) differences plus (2s+1) intermediate results must be added, so this part of the algorithm is about (2s+1)/2 times faster: with s=3 a factor of 3.5, with s=4 a factor of 4.5, and so on.
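As an illustration of the basic idea only (ignoring xNLMeans' computation order, multiscale fingerprints and all other optimizations), a naive NL-means pass over a 2D array can be sketched like this; the function name and the example values for a, s and h are arbitrary, and the flat (unweighted) fingerprint is used:

```python
import numpy as np

def nlmeans_naive(img, a=3, s=1, h=10.0):
    """Naive Non Local Means on a 2D float array.
    a: seek radius, s: fingerprint radius, h: filter strength.
    Flat fingerprint weights; the center pixel itself is skipped
    during the search, as described in the text."""
    H, W = img.shape
    pad = np.pad(img, a + s, mode='reflect')
    out = np.empty_like(img)
    for y in range(H):
        for x in range(W):
            cy, cx = y + a + s, x + a + s
            patch = pad[cy - s:cy + s + 1, cx - s:cx + s + 1]
            acc, wsum, wmax = 0.0, 0.0, 0.0
            for dy in range(-a, a + 1):
                for dx in range(-a, a + 1):
                    if dy == 0 and dx == 0:
                        continue            # exclude the pixel itself
                    qy, qx = cy + dy, cx + dx
                    q = pad[qy - s:qy + s + 1, qx - s:qx + s + 1]
                    D = np.mean((q - patch) ** 2)   # flat dissimilarity
                    w = np.exp(-D / h ** 2)          # intensity weight
                    wmax = max(wmax, w)
                    acc += w * pad[qy, qx]
                    wsum += w
            # the paper's proposal: the original pixel contributes
            # with the maximum weight found among its neighbors
            acc += wmax * pad[cy, cx]
            wsum += wmax
            out[y, x] = acc / wsum if wsum > 0 else img[y, x]
    return out
```

On a constant image every fingerprint matches perfectly, so the output equals the input; real use would of course rely on the optimized plugin, not this O(a²·s²) loop.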
Fingerprint Similarity and the Geometrical Weight Function

The original paper makes no statement about position (distance from the center) weights wxy of pixels in the fingerprints. The differences of all corresponding pixels might just be added up to obtain the dissimilarity - called a 'flat' geometrical weight. Alternatively, a decreasing weight, e.g. associated with the Euclidean distance r = √(x²+y²) from the center, may be assigned to the regions. This can be 1/r, 1/r², e^(-r) or something else. For speed reasons, xNLMeans works with intermediate results for fingerprint rows. To keep independence from the pixel's angle, it needs a 'separable' weight function that applies the total weight in a first step to the pixels of the rows and then in a 2nd step to the rows of the area. The Gaussian function f(x) = exp(-x²/2s²) meets the resulting requirement f(x)·f(y) = f(√(x²+y²)). Consequently, xNLMeans uses

wx  = exp(-x²/2sdev²)
wy  = exp(-y²/2sdev²)
wxy = wx·wy

for the geometrical weight function, if the parameter sdev ≠ 0. The averaged differences between corresponding pixels

D = 1/Z · Σ wxy·(qxy - pxy)²

where x,y run over (2s+1)·(2s+1) and 1/Z is the normalizer with Z = Σ wxy, form the scalar 'dissimilarity' D between the fingerprints.

The Fingerprint Size Dilemma, XNLMeans Multiscale Fingerprints

It is obvious that very big fingerprints are nonsense, because the Non Local Means denoising idea is based on keeping non-related other content out of the fingerprint. On the other hand, too small fingerprints do not produce a sufficiently averaged (i.e. reliable) similarity value. The best choice depends on the image content: e.g. for sky, a big window is best, while for rich detail it prevents finding similar fingerprints. For that reason, xNLMeans calculates two or three fingerprints for each pixel: a flat one with diameter 2·s+1, a Gaussian weighted one with diameter 2·min(s,3)+1 - and if s>3, a third flat one with diameter 7. The fingerprint with the best similarity wins.
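The separability requirement can be checked numerically; this small sketch (hypothetical function name, arbitrary sdev) verifies f(x)·f(y) = f(√(x²+y²)) on a grid of offsets:

```python
import math

def f(r, sdev=1.5):
    """Gaussian geometrical weight exp(-r^2 / (2*sdev^2))."""
    return math.exp(-r * r / (2.0 * sdev * sdev))

# the separable product wx*wy equals the radial weight f(sqrt(x^2+y^2)),
# so row weights and column weights can be applied in two passes
for x in range(-3, 4):
    for y in range(-3, 4):
        wxy = f(x) * f(y)
        radial = f(math.hypot(x, y))
        assert abs(wxy - radial) < 1e-12
```

This is exactly why the Gaussian qualifies while e.g. 1/r or e^(-r) do not: only exp(-r²/c) factors into a function of x times a function of y.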
This makes for very good averaging for unstructured content and better comparison performance for structured content. This mode of operation can be deactivated by setting sdev = 0.

Intensity Difference Weight Function

When the dissimilarity D between two pixels has been calculated as the result of the convolution, it must be transformed into the weight of the contribution of qxy to the filtered pxy. The choice of this function is arbitrary, as with the geometrical weight function. However, exp(-D/h²) has originally been proposed for this. Since D is the mean squared difference of the region convolution, exp(-D/h²) is effectively exp(-mean(diff²)/h²), i.e. again the Gauss function. Compared with other possible intensity weight functions like h/D or h-D, exp(-D/h²) did a superior job in experiments recovering test images with superimposed white noise.

Pixel Difference Calculation, Luma Offset Compensation

The above term D is still a simplification. As said, the mean value of the difference squares is used to calculate the difference value. The calculation of differences can be refined by determining the averaged offset between two fingerprints:

ofs = E(X) = 1/Z · Σ wxy·(qxy - pxy)

which can be used to calculate the averaged difference of variances according to

Var(X) = E(X²) - [E(X)]² = E(X²) - ofs²

(known as the König-Huygens formula or Steiner translation theorem). However, using D - ofs² does not improve the visual result, but amplifies pale image details in the method noise and blurs the output. In reverse, D + k·ofs² recovers such details better; therefore xNLMeans allows setting k as the parameter lcomp, with a weird input scale: [-1...inf) is k according to the formula, (-inf...-1) is an automatic mode with -(lcomp + 1) as a user settable control factor.

Original Pixel Output Weight

Each pixel has a perfect fingerprint compared to itself, and would theoretically add its own value to the output value with weight 1.
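The König-Huygens identity used here is easy to verify on sample data; the weights and pixel differences below are made up purely for illustration:

```python
# arbitrary example data: fingerprint pixel differences (q - p)
# and their geometrical weights wxy
diffs = [1.0, -2.0, 3.0, 0.5, -1.5]
w     = [1.0,  0.8, 0.6, 0.8,  1.0]

Z    = sum(w)
ofs  = sum(wi * d     for wi, d in zip(w, diffs)) / Z   # E(X), the offset
D    = sum(wi * d * d for wi, d in zip(w, diffs)) / Z   # E(X^2), the dissimilarity
var  = sum(wi * (d - ofs) ** 2 for wi, d in zip(w, diffs)) / Z

# Koenig-Huygens / Steiner: Var(X) = E(X^2) - [E(X)]^2 = D - ofs^2
assert abs(var - (D - ofs ** 2)) < 1e-12
```

So subtracting ofs² from D yields the weighted variance of the differences; per the text above, xNLMeans instead adds k·ofs² because the variance form blurred the output in practice.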
In the real world, this often yields almost unfiltered input, so the pixel itself is excluded from the neighbor search. Without further postprocessing, this output only produces good results for pixels with several similar neighbors, while inaccuracies for pixels with only one or two still rather unsimilar neighbors create ugly artifacts. The original paper proposes to observe the weights all similar pixels contribute, and to assign the maximum weight of any similar pixel to the original pixel. This doubtlessly makes the filter robust: for pixels that find only one similar pixel, it produces a 50% blend effect, while reducing the blend where many similar pixels are found. But the paper lacks further justification of the method. As another effect, the numerical accuracy of the implementation is relevant. A numerical accuracy of e.g. 6 decimal digits might provide 'crispier' output than 20 digits, just because it finds no suitable neighbors at all in craggy image parts, so noisy input pixels are passed through, while a more accurate filter removes more pixel contrast. In xNLMeans, the attempt is made to provide a smoother transition between fully filtered pixels and input pixels, yet also provide the original algorithm. The weight used to add the input pixel to the filter output is (with 'wmax' being the maximum weight of the suitable neighbors):

weight =  wmin·wmax                     if wmin > 0
weight = -wmin·wmax·correction_factor   if wmin < 0

So, with wmin = 1 → weight = wmax, which is consistent with the paper. With wmin = -1, a blending which also depends on a correction factor is used. The factor uses the relation between the found neighbor weight and the digits parameter to make the input blend small when good neighbors were found, and high when the weight of the best found neighbor was still small. wmin < 0 is intended as an automatic mode, with -1 being the neutral setting. digits sets the soft weight transition point when wmin < 0.
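The weight formula above can be sketched as follows; correction_factor is only a placeholder for the internal automatic blend control (which depends on the best neighbor weight and the digits parameter and is not documented here):

```python
def original_pixel_weight(wmin, wmax, correction_factor=1.0):
    """Weight with which the input pixel is added to the filter output.
    wmax: maximum weight found among the suitable neighbors.
    correction_factor is a placeholder for the undocumented internal
    automatic blend control, used only in the wmin < 0 mode."""
    if wmin > 0:
        return wmin * wmax
    return -wmin * wmax * correction_factor

# wmin = 1 reproduces the paper's proposal: the input pixel simply
# gets the maximum neighbor weight wmax
assert original_pixel_weight(1.0, 0.7) == 0.7
```

With only one similar neighbor, wmax is that neighbor's weight and wmin = 1 yields the 50% blend effect mentioned above, since input and neighbor then carry equal weight.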
The default setting is chosen to balance good visual reproduction of small details (grass, foliage, pebbles, water ripples and other rough surfaces) against leveling of smooth image parts, even with higher h settings. A lower digits setting keeps more details and stains. Internally, xNLMeans provides 3 more decimal digits of accuracy headroom. Neighbors with even less weight are discarded to save a bit of processing time.

Variance Compensation

Dealing with absolute differences has a strong effect when the values inside the fingerprint vary little. In image parts with edges and sharp objects, there is less chance to find equally 'similar' neighbor pixels. This may be undesirable depending on the noise source. Mosquito noise can cause very distinct noise pixels which are hard to fight with the absolute approach. XNLMeans can calculate the mean variance of the pixel fingerprints, and rate fingerprint dissimilarities relative to this variance:

D_factor(pxy) = [Var(pxy fingerprint)]^(-vcomp)
vcomp = [0...1], see above for the Var() calculation

With vcomp = 0, all pixels' fingerprint D values weigh '1' regardless of the variance inside the fingerprints, which is the original behaviour. In typical applications, variance compensation does not level smooth image parts as well as the absolute approach, and kills crispness. So it is here for experiments, e.g. on mosquito noise.

Mask

Since the algorithm consumes much computing power, it can be desirable to restrict the filtering to certain image parts, e.g. edges with artefacts. For this purpose, xNLMeans supports a mask clip. If a mask clip is defined, it must have the same dimensions and length as the input clip, and either 'planes' must be 1 or the input clip must have pixel format YV24, Y8, or RGB. Only luma from the mask clip is used for all filtered planes. Mask pixels with intensity 0 are left unfiltered to speed up the filter. Values 1..254 cause a blend between input and filter output, and 255 outputs the filter result unblended. (!)
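Assuming the blend for intermediate mask values is linear in the mask intensity (the text only says 'a blend', so the exact ramp is an assumption), the per-pixel mask behaviour corresponds to:

```python
def apply_mask(src, filtered, mask):
    """Blend filter output over source per pixel using an 8 bit mask.
    0 = source passed through, 255 = fully filtered; intermediate
    values blend linearly (an assumption for this sketch)."""
    return [s + (f - s) * m / 255.0 for s, f, m in zip(src, filtered, mask)]

src      = [100.0, 100.0, 100.0]
filtered = [ 60.0,  60.0,  60.0]
mask     = [0, 255, 128]
out = apply_mask(src, filtered, mask)
assert out[0] == 100.0 and out[1] == 60.0   # 0 keeps input, 255 keeps filter result
```

Note that this only describes the output blend; as stated below, the full seek area around masked-in pixels is still processed.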
The mask must not be 'cored' to YUV conformant values [16...235]; the complete range [0...255] is used. (!) Even for RGB32 clips, the contained alpha channel is not used as a mask (because that would have made filtering harder for RGB32 clips with transparency); it must be provided separately, if desired. The mask only restricts which output pixels are to be filtered. Around these pixels, the full seek area will be processed. (!) The mask is always an 8 bit clip. With 16 bit clips, the mask's height must match the high byte part of the clip.

Rclip

A clip other than the source clip (e.g. a preprocessed copy of the source clip) may be used as the weighting source. If it is known that the three components of an RGB or YUV image together provide a sharp and less noisy source for filtering these components, they may be combined in a ConvertToY8() or similar prefiltering step and fed to xNLMeans as rclip. With the rclip parameter set, xNLMeans uses rclip to determine the neighbor weights, and the source clip as the input image of the filter.

Method Noise Preview & Output for External Processing

For a preview of the filter effect, the 'method noise image' is a useful helper. When the diffout parameter is > 0, xNLMeans displays the inverted method noise, multiplied by the parameter, instead of the filtered output. This output can also be fed to external post processing before applying it to the source.

Sequence of Global Shift and Content Based Stop

XNLMeans allows balancing the filtering effort based on the image content. The parameter dw is a weight delta. In every frame plane run, xNLMeans adds up the weight of all processed pixels. If the following formula is true, then the filtering of the plane stops:

last_pass_overall_weight < dw · previously_accumulated_overall_weight

I.e. when the last plane run was not able to add at least a given threshold to the already achieved overall weight, further processing is assumed to be a waste of time.
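The stop criterion can be sketched as a loop over plane runs; the pass weight list below merely stands in for the weight sums that successive runs at growing offset radii would accumulate:

```python
def run_with_content_stop(pass_weights, dw):
    """Accumulate per-pass overall weights and stop as soon as the
    last pass contributed less than dw times the total accumulated
    so far. pass_weights stands in for the weight sums of successive
    plane runs (hypothetical data, not real plugin output)."""
    total = 0.0
    passes_done = 0
    for wpass in pass_weights:
        if total > 0.0 and wpass < dw * total:
            break          # further processing assumed to be a waste of time
        total += wpass
        passes_done += 1
    return passes_done, total

# with decreasing contributions and dw = 0.1, the tail passes are skipped
done, total = run_with_content_stop([10.0, 5.0, 2.0, 0.1, 0.05], 0.1)
assert done == 3
```

Setting dw = 0 makes the condition never trigger, which matches the documented fallback to the original, slower behaviour.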
This is done per thread and might not hold for certain image parts, however. If xNLMeans produces incomplete filtering, set dw to very small values, or even zero, which provides the original, slower behaviour. Because with this content based stop it is not known in advance how much of the seek range will be completed, it is not a good idea to start the seek at the typical corner (x-a,y-a,current_frame-d). Instead, xNLMeans starts with the direct neighbors of (x,y,current_frame) and enlarges the radius step by step. Non-motion-compensated temporal filtering tends to introduce ghost structures, so current_frame is evaluated completely first. To fetch other frames, xNLMeans uses AviSynth's frame cache.

YUY2, RGB24, RGB32, 16 Bits

XNLMeans natively processes the planar memory layouts Y8, YV411, YV12, YV16, YV24 as 8 bit or stacked 16 bit planes for the source clip. It converts YUY2 to YV16 via an AviSynth 'ConvertToYV16()' call. It converts the R, G and B components of RGB24 or RGB32 via AviSynth 'ShowRed()', 'ShowGreen()', 'ShowBlue()' to three Y8 planes and processes these separately. The planes parameter is ignored in this case. After processing, RGB24 is reconstructed via 'MergeRGB()', RGB32 via 'MergeARGB()'. The alpha channel is transferred 1:1 to the output clip. The alpha channel is not used as a mask, because otherwise it would be impossible to process RGB32 clips without the mask specifics of xNLMeans. If the alpha channel is to be filtered, use 'xNLMeans(ShowAlpha(c)...)'. To use it as the mask, use 'xNLMeans(...mask=ShowAlpha(c))'. The h, mask and rclip clips must be provided as planar frametype clips. For more restrictions on these clips see the respective sections.

More Features

XNLMeans provides automatic settings for most parameters, and the ability to provide h as a clip instead of a float value in order to adapt filtering to varying clip content. The automatic settings were found by statistical 'training' with a set of reference images.
They are activated by setting the parameter to a negative value, which serves as a factor for the automatic operation.

Speed

XNLMeans is an MT_NICE_FILTER for Avisynth_MT or Avisynth+, but without SIMD instructions or GPU acceleration. However, it contains its own multithreading logic for use in a single threaded script. Compared to a straightforward implementation, it provides these acceleration features:

- Multithreading
- Neighbor weight re-use for both compared pixels in many modes
- Separated row- and column-convolution, with a typical speedup of 3.5 or more for the convolution if no mask is used (horizontal mask structures degrade this benefit)
- Content based stop; the gain depends on the image content, but it allows a large 'a' setting while often working at about the speed of a=1...2, without visible degradation (dw>0)

A mask is profitable if it fills < 20% of the frame. Otherwise it forces a lot of row convolutions to be done for only few pixels, and the dual use of the similarity is wasted on masked compare pixels. Even if the mask fills much of the frame, it costs only the additional mask processing. Convolution separation and dual use are still active and generate the more benefit, the more pixels the mask 'activates'. Conditional processing is reduced to the minimum in the worker threads, especially in the convolution loops. Instead, the plugin uses a set of slightly different variations of the worker function, containing just the necessary operations. Compiled with MS Visual Studio 2015, all compiler switches set to "fast". The plugin needs the VS 2015 runtime library (CRT).

PARAMETERS

a (default: 6)
Radius of the environment in which to find suitable replacement pixels.

s (default: 4)
Radius of the similarity region forming the pixel fingerprints.

d (default: 0, maximum: 2)
Temporal radius of the pixel compare environment.

h (default: 1.)
A) float value: normalizer for the intensity difference weight function, i.e. filter strength.
Hint: h is too high if you notice object details besides noise in the 'diff' output. Set h just below that limit.
0... : manual setting
B) clip: must be of the same length as the source clip. The first pixel of the first plane is fetched and divided by 25 as the frame filter strength. Note: to have separate h for the source clip planes too, the filter must be called three times, with only one source clip plane to process each time, and different h clips.

sdev (default: -1.)
Gaussian standard deviation for the fingerprint geometrical weight exp(-r²/2sdev²), r being the distance of a pixel from the fingerprint center. With the default value, a pixel at a distance of 1.5 pixel sizes from the center pixel has a geometrical weight of exp(-1). Larger values give more weight to more distant pixels and make the output more blurry. Negative values activate an automatic setting according to the h setting for the frame. sdev = 0 deactivates multiscale fingerprints and activates single flat fingerprint operation.

planes (default: 1)
1 = Y plane, 2 = U plane, 4 = V plane. Add the values of the planes to be processed. The default only processes luma. Note that for RGB24, RGB32 'planes' is not used; all components are processed.

lsb (default: false)
Activates 16 bit processing of stacked frames from the 'dither' package.

mask (no default)
If a mask clip is defined, it must have the same dimensions and length as the input clip, and either 'planes' must be 1 or the input clip must have pixel format YV24, RGB24 or RGB32. Only luma from the mask clip is used for all filtered planes. Mask pixels with intensity 0 are left unfiltered, which can speed up the filter. Values 1..254 cause a blend between input and filter output, and 255 outputs the filter result unblended. Please note that the mask must not be 'cored' to YUV conformant values 16...235; the complete range 0...255 is used. Always 8 bits, also when lsb is used for the input/reference/output clips.
rclip (no default)
If a reference clip is defined, it must have the same dimensions and pixel type as the input clip. Note that RGB24 and RGB32 sources are processed as Y8, so they need the same Y8 as rclip for all three components. To determine the neighbor weights, the filter uses rclip; for the filter output, it replaces the rclip pixel intensity with the source intensity.

digits (default: -1.)
Controls the smallest neighbor weight that is still filtered. See the 'Original Pixel Output Weight' section.

diffout (default: 0.)
Returns the inverted difference between the filter output and the input clip (the inverted method noise), to help find the best parameters, or for external post processing.

lcomp (default: -2.)
Neighborhood offset compensation, see the filter description above. Original behavior with lcomp = 0. To provide the full range of possible settings, it uses a weird range scheme:
(-inf...-1) : automatic setting with -(lcomp+1) factor
[-1...0)    : the mean difference E(X²) is reduced to E(X²)+lcomp·[E(X)]²
0           : off, original behaviour like the paper proposal, faster loop
(0...inf)   : the mean difference E(X²) is increased to E(X²)+lcomp·[E(X)]²

vcomp (default: 0.)
Variance compensation. D0 is the variance of the fingerprint itself. The difference weighting function then uses a (D0/h)^vcomp factor.

wmin (default: -1.)
Amount of original pixel to contribute to the filter output, relative to wmax, which is the weight of the most similar pixel found. wmin < 0 is an automatic mode, see the 'Original Pixel Output Weight' section.

dw (default: -1.)
Delta weight of the last pass, to allow a content based stop. When a pass produces less than dw times the previously achieved overall weight sum, the filter stops. Smaller values make the output more precise and the filter slower. dw < 0 is an automatic mode. Can be tuned to e.g. -0.1 to increase output precision at the cost of speed.

threads (default: 0)
The number of threads for the filter algorithm.
0 chooses the number according to the CPU cores.

REMARKS and TODOs

Multiscale processing speedup should be achieved outside of the filter with AviSynth means.
The filter is designed for progressive video. Deinterlace first.
Bugs are probable.

CHANGE LIST

03/24/2016 v0.03
- all automatic modes trained from scratch
- automatic h estimator removed, too unreliable
- multiscale fingerprints
- fixed: fastmode calculation not correct near strip edge in multithread operation
- vcomp possible with fast mode, modified default values and automatic control formulas
- fixed: possibly 16 bit LSB partially incorrect (missing explicit typecasts to float with divisions)
- fixed: refclip minweight incorrect
- fixed: diffout mode minweight incorrect

02/15/2016 v0.02
- center parameter removed, no filter improvement noticeable
- MT version, added new parameters, bugfixes

12/12/2015 v0.01
- initial release

CONTACT

doom9.org forum, martin53