A cover song is a live performance, a remix, or a new recording of a previously recorded track. Cover song identification (CSI) is challenging because changes in timbre, rhythm, song structure, main key, and lyrics introduced during cover song generation may produce highly different cover versions [1-2]. Practical applications of CSI include copyright protection and music-archive management.
One commonly used musical property for CSI is the tonal content of music, represented by the chromagram or pitch class profiles, which are largely independent of timbre and loudness and thus suitable for CSI. The chromagram vector is extracted over a short-time interval (called a frame) by quantifying the spectral energy of the octave-folded subbands. By pooling spectral energy into one octave, chromagram features treat pitches that differ by an octave as the same pitch class, which is necessary for CSI. However, the chromagram alone cannot fully cope with the timbre variations caused by changing the singer or instrumentation during cover song generation. To cope with these large variations in timbre, several approaches have been proposed. Müller and Ewert proposed the chroma DCT-reduced log pitch (CRP), which uses only the upper discrete cosine transform (DCT) coefficients of the spectral energy when extracting the chromagram, on the assumption that the lower-frequency components of the spectral energy are closely related to timbre and should be discarded. In earlier work, trend estimation filters such as the moving average and the Hodrick-Prescott (HP) filter were used to remove the trend of the spectral energy, a smoothly varying component assumed to be closely related to timbre. Although the HP filter improved the CSI accuracy, it has only one parameter for tuning its filtering characteristics, which limits further enhancement of the CSI performance. In an effort to find a trend-estimation filter that can be easily adjusted to attain the best CSI performance, this paper employs Savitzky-Golay (SG) filters. Using a least-squares fit with a polynomial function as the filter kernel, the SG filter is able to reduce noise and find a trend line for a signal. SG filters are optimal in the sense that they minimize the least-squares error in fitting a polynomial to frames of a noisy signal.
The SG filter was originally proposed for analytical chemistry and has been utilized in a number of applications including digital control systems, speech recognition, denoising, and signal enhancement. In this paper, the CSI performance of the chromagram using the SG filter is experimentally compared with that using other types of filters.
II. PROPOSED CHROMAGRAM EXTRACTION METHOD
The baseline of the chromagram used in this paper is the chroma log pitch (CLP) , whose extraction is based on a pitch-frequency scale as shown in Fig. 1(a). First, the input music signal is decomposed into 88 frequency bands with center frequencies corresponding to the MIDI pitches p = 21 to p = 108. Further details on the frequency band positions and bandwidth are described in . At each of the 88 subbands, the short-time mean-square power (local energy) is calculated. As in , we add 20 zeros at the beginning and 12 at the end to construct a 120-dimensional feature vector where the entries correspond to MIDI pitches from p=1 to p=120. Then a logarithmic compression on the pitch representation is applied to account for the logarithmic sensation of sound intensity. Finally, the 12-dimensional chromagram is obtained by chroma binning, which adds up the corresponding values of the pitch representation that belong to the same chroma.
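The last three steps of the CLP extraction (zero padding, logarithmic compression, and chroma binning) can be sketched as follows. This is a minimal illustration in Python with NumPy, not the implementation used in the experiments. The function name, the compression form log(1 + eta * e), and the value of eta are assumptions made for the example, and the 88 subband energies are taken as given.

```python
import numpy as np

def chroma_log_pitch(band_energy, eta=1000.0):
    """Illustrative CLP for one frame.

    band_energy: 88 local energies for MIDI pitches p = 21..108.
    eta: compression constant (an assumption, not taken from the text).
    """
    band_energy = np.asarray(band_energy, dtype=float)
    if band_energy.shape != (88,):
        raise ValueError("expected 88 subband energies")
    # 20 zeros in front and 12 at the end give 120 entries for p = 1..120.
    pitch = np.concatenate([np.zeros(20), band_energy, np.zeros(12)])
    # Logarithmic compression (assumed form); zero entries stay zero.
    log_pitch = np.log(1.0 + eta * pitch)
    # Chroma binning: entry i holds MIDI pitch p = i + 1, chroma = p mod 12.
    chroma = np.zeros(12)
    for i, value in enumerate(log_pitch):
        chroma[(i + 1) % 12] += value
    return chroma
```

With this indexing, chroma index 0 corresponds to C (for example MIDI pitch 60) and index 9 to A (MIDI pitches 57 and 69), so notes one octave apart fall into the same bin.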
This work extends previous work in which the chroma trend-removed log pitch (CTP) was proposed. Trend estimation decomposes a time series into a medium-to-long-term trend part and a short-term cycle part in order to detect and predict tendencies and regularities in the series without any a priori information about the signal. Mathematically, the decomposition of a given time series $y_n$ into a trend $x_n$ and a cycle $c_n$ is expressed by

$$y_n = x_n + c_n \tag{1}$$
for n = 1, 2, ..., N. The CTP is obtained by removing the trend of the spectral energy, on the assumption that this trend is closely related to timbre and should therefore be removed for timbre invariance. The overview of the CTP extraction is shown in Fig. 1(b): 1) estimating the trend of the 120-dimensional logarithmically compressed pitch representation; 2) subtracting the estimated trend from the pitch representation, which is equivalent to taking the cycle part in (1); 3) taking the positive part of the trend-subtracted pitch representation by half-wave rectification; and 4) performing the chroma binning. Since the local peaks of the spectral energy are related to music-specific harmony, emphasizing the peaky tonal components and reducing noise by trend removal together with half-wave rectification help improve CSI accuracy. The performance of the CTP depends on the trend estimation, which is addressed using the SG filter in Section 2.2.
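The four CTP steps listed above can be sketched as follows. This is a minimal illustration in Python with NumPy. The trend estimator is passed in as a function; the centered moving average used as the default, its edge padding, and all function names are assumptions for the example rather than the configuration evaluated in this paper.

```python
import numpy as np

def moving_average_trend(x, half_width=4):
    """Illustrative trend estimator: centered moving average.

    Edge samples are handled by repeating the border values (an assumption).
    """
    length = 2 * half_width + 1
    kernel = np.ones(length) / length
    padded = np.pad(np.asarray(x, dtype=float), half_width, mode="edge")
    return np.convolve(padded, kernel, mode="valid")

def chroma_trend_removed(log_pitch, trend_fn=moving_average_trend):
    """Illustrative CTP for one frame of a 120-dimensional log pitch vector."""
    log_pitch = np.asarray(log_pitch, dtype=float)
    trend = trend_fn(log_pitch)          # 1) estimate the trend
    cycle = log_pitch - trend            # 2) keep the cycle part
    peaks = np.maximum(cycle, 0.0)       # 3) half-wave rectification
    # 4) chroma binning: entry i holds MIDI pitch p = i + 1
    midi_pitch = np.arange(1, log_pitch.size + 1)
    return np.bincount(midi_pitch % 12, weights=peaks, minlength=12)
```

For a flat pitch representation with a single peak, the flat part is absorbed by the trend and only the peak survives rectification, which is the behaviour the trend removal is meant to produce.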
Various trend estimation methods have been used across a number of disciplines, such as macroeconomics, geophysics, biology, and the social sciences. Among them, this paper considers the SG filter. Using a least-squares fit with a polynomial function as the filter kernel, the SG filter is able to reduce noise and find a trend line for a signal. SG filters are optimal in the sense that they minimize the least-squares error in fitting a polynomial to frames of a noisy signal. Let y be the signal to be SG filtered. For a frame of length 2M + 1 from the signal y, denoted by $y_{-M}, y_{-M+1}, \ldots, y_m, \ldots, y_{M-1}, y_M$, we find the coefficients $a_k$ of a fitting polynomial of degree K given by

$$p(m) = \sum_{k=0}^{K} a_k m^k \tag{2}$$

with the objective of minimizing the following least-squares fitting error:

$$E = \sum_{m=-M}^{M} \left( p(m) - y_m \right)^2 \tag{3}$$
The original paper by Savitzky and Golay showed that, at each position, the smoothed output value obtained by sampling the fitted polynomial is identical to a fixed linear combination of the local set of input samples; i.e., the 2M+1 input samples within the approximation interval are effectively combined by a fixed set of weighting coefficients that can be computed once for a given polynomial degree K and approximation interval of length 2M+1. Thus, the same weighting coefficients are obtained for each group of 2M+1 input samples, and so least-squares smoothing can be regarded as a shift-invariant discrete convolution process. That is, the output samples of the SG filters can be computed by a discrete convolution instead of the polynomial fitting process in (2) and (3). A detailed analysis of the derivation and the characteristics of the SG filters is presented in .
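The equivalence between frame-by-frame polynomial fitting and a fixed convolution can be checked numerically. The following is a minimal sketch in Python with NumPy; the function names are illustrative. The weights are obtained as the first row of the pseudo-inverse of the matrix whose entries are m^k, since the smoothed value is the fitted polynomial evaluated at the frame center, p(0) = a_0.

```python
import numpy as np

def sg_coefficients(M, K):
    """Smoothing weights of an SG filter with frame length 2M+1 and degree K."""
    m = np.arange(-M, M + 1)
    A = np.vander(m, K + 1, increasing=True).astype(float)  # A[i, k] = m_i ** k
    # The least-squares coefficients are a = pinv(A) @ y, and the output is a_0,
    # so the filter weights are the first row of pinv(A).
    return np.linalg.pinv(A)[0]

def sg_smooth_by_fitting(y, M, K):
    """Reference: fit a degree-K polynomial to every interior frame."""
    y = np.asarray(y, dtype=float)
    m = np.arange(-M, M + 1)
    out = []
    for center in range(M, len(y) - M):
        frame = y[center - M:center + M + 1]
        a = np.polynomial.polynomial.polyfit(m, frame, K)  # low-to-high order
        out.append(a[0])  # value of the fitted polynomial at m = 0
    return np.array(out)

def sg_smooth_by_convolution(y, M, K):
    """Same output computed with one fixed set of weights."""
    h = sg_coefficients(M, K)
    # The smoothing weights are symmetric, so flipping in convolve is harmless.
    return np.convolve(np.asarray(y, dtype=float), h, mode="valid")
```

For M = 4 and K = 2 the weights are (-21, 14, 39, 54, 59, 54, 39, 14, -21)/231, and both functions return the same interior samples up to rounding error.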
There is no particular recommendation for the polynomial degree K and the frame length 2M+1, which need to be set beforehand. In most cases, polynomials of degree up to 3 have been used for the SG filters. In particular, a polynomial of degree 0 corresponds to the moving average filter. For a given polynomial degree, a larger frame length gives more noise suppression and a smaller error variance of the filter output; however, when the frame length is too large, the filter output becomes distorted and biased relative to the actual signal. The selection of the frame length and the polynomial degree of the SG filters for CTP extraction is addressed in Section III.
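Two of the statements above can be illustrated numerically: the degree-0 SG filter is the moving average, and for a fixed degree a longer frame suppresses more white noise. The sketch below, in Python with NumPy, uses illustrative function names; the noise gain is taken as the sum of squared weights, which is the ratio of output to input variance under the assumption of white input noise.

```python
import numpy as np

def sg_coefficients(M, K):
    """Smoothing weights of an SG filter with frame length 2M+1 and degree K."""
    m = np.arange(-M, M + 1)
    A = np.vander(m, K + 1, increasing=True).astype(float)
    return np.linalg.pinv(A)[0]

def noise_gain(h):
    """Output variance divided by input variance for white input noise."""
    return float(np.sum(np.asarray(h) ** 2))

if __name__ == "__main__":
    # Degree 0: every weight equals 1 / (2M + 1), i.e. a moving average.
    print(sg_coefficients(3, 0))
    # Degree 2: the noise gain shrinks as the frame gets longer.
    for M in (2, 4, 6):
        print(2 * M + 1, noise_gain(sg_coefficients(M, 2)))
```

The noise gain only captures the variance side of the trade-off; the distortion and bias introduced by an overly long frame depend on the signal and are not measured by this quantity.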
III. EXPERIMENTAL RESULTS
The CSI accuracy of the proposed salient chromagram based on the SG filters was evaluated on two cover song datasets using the CSI method in . The first cover song dataset (abbreviated as covers80) is the one used by Dan Ellis in his work . The covers80 dataset consists of 80 pairs of original and cover songs (160 songs in total), which are available online. The second cover song dataset (abbreviated as covers330) is composed of 1000 songs, of which 330 songs are test data (30 original songs and 10 cover versions of each original song) and the other 670 songs were included as imposters. The covers330 dataset was collected by the author. For the covers80 dataset, we calculated the precision at one, P@1, which is the rate of covers correctly identified at rank 1 when each song is queried against the 160 songs in the dataset. For the covers330 dataset, we queried the 330 cover songs against all 1000 songs and computed the mean number of covers identified in the top 10 (MNCI10). For both datasets, we evaluated the average rank of the first correctly identified cover (Rank1) and the mean average precision (MAP). We followed the experimental procedures of MIREX 2020 .
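The four measures can be computed from a pairwise distance matrix as sketched below. This is a minimal illustration in Python with NumPy, not the evaluation code used in the experiments. The function name, the use of integer clique labels, and the convention of excluding the query itself from the ranked list are assumptions for the example. Imposter songs would receive unique labels and be left out of `query_idx`.

```python
import numpy as np

def evaluate_csi(dist, labels, query_idx=None, top_n=10):
    """Illustrative CSI measures from a (songs x songs) distance matrix.

    labels[i] identifies the cover set of song i; smaller distance = more similar.
    """
    dist = np.asarray(dist, dtype=float)
    labels = np.asarray(labels)
    if query_idx is None:
        query_idx = np.arange(len(labels))
    p_at_1, first_rank, avg_prec, n_top = [], [], [], []
    for q in query_idx:
        order = np.argsort(dist[q], kind="stable")
        order = order[order != q]            # the query is not retrieved
        rel = labels[order] == labels[q]     # True where a result is a cover
        hits = np.flatnonzero(rel)           # 0-based ranks of the covers
        if hits.size == 0:
            continue                         # query has no cover in the set
        p_at_1.append(float(rel[0]))
        first_rank.append(hits[0] + 1)
        # Average precision: mean over covers of (covers so far) / (rank).
        avg_prec.append(np.mean(np.arange(1, hits.size + 1) / (hits + 1)))
        n_top.append(int(rel[:top_n].sum()))
    return {
        "P@1": float(np.mean(p_at_1)),
        "Rank1": float(np.mean(first_rank)),
        "MAP": float(np.mean(avg_prec)),
        f"MNCI{top_n}": float(np.mean(n_top)),
    }
```

When every query has exactly one cover in the database, as in covers80, the average precision of a query reduces to the reciprocal of the rank of that cover, so MAP is never smaller than P@1.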
Each song in the datasets was converted to mono at a sampling frequency of 22050 Hz and then divided into frames of 200 ms with an overlap of 100 ms; for each frame, a 12-dimensional chromagram vector was computed as a low-level feature. The 12-dimensional chromagram vector was normalized with respect to the Euclidean norm to have unit length. In extracting the chromagram, we utilized the pitch representation in the chroma toolbox with the default parameter settings. From the pitch representation, we extracted three different types of chromagram, CLP, CRP, and CTP, for evaluating the cover song identification performance. In extracting the CTP, we utilized the SG filters and the HP filters. The HP filters performed best in the previous work .
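The framing and normalization described above can be sketched as follows in Python with NumPy. The function names are illustrative, dropping an incomplete final frame and mapping an all-zero chroma vector to zeros are assumptions, and the actual pitch decomposition performed by the chroma toolbox is not reproduced here.

```python
import numpy as np

SR = 22050                       # sampling frequency in Hz
FRAME = int(round(0.200 * SR))   # 200 ms -> 4410 samples
HOP = int(round(0.100 * SR))     # 100 ms overlap -> hop of 2205 samples

def frame_signal(x, frame=FRAME, hop=HOP):
    """Split a mono signal into overlapping frames (incomplete tail dropped)."""
    x = np.asarray(x, dtype=float)
    n_frames = 1 + (len(x) - frame) // hop if len(x) >= frame else 0
    idx = np.arange(frame)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def normalize_chroma(chroma, eps=1e-12):
    """Scale each 12-dimensional chroma vector to unit Euclidean norm.

    All-zero vectors are returned as zeros (an assumption for this sketch).
    """
    chroma = np.asarray(chroma, dtype=float)
    norm = np.linalg.norm(chroma, axis=-1, keepdims=True)
    return chroma / np.maximum(norm, eps)
```

With these settings, one second of audio yields nine full frames, so the chromagram has a rate of roughly ten feature vectors per second.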
Table 1 and Table 2 show the CSI performance of the CTP using the SG filters. Frame lengths up to 13 were considered. We note that the impulse response of an SG filter with an odd degree K is the same as that with degree K-1 . Thus, we consider only SG filters with even degrees up to 2 (K = 0 and K = 2). For both SG and HP filters, the CSI accuracy of the CTP was better than that of the previous chromagrams, CLP and CRP. We note that the performance of the CTP using the HP filter in Tables 1 and 2 is the best CSI accuracy achieved with the HP filter by adjusting its smoothing parameter. Regarding the type of trend filter, the SG filter with M=4 and K=2 showed the best performance on both datasets. The SG filter was more effective than the HP filter in boosting identification accuracy. Depending on the values of M and K, the SG filters possess different frequency responses. The impulse and frequency responses of the SG filter with M=4 and K=2 are shown in Fig. 2. The 3-dB cutoff frequency of the SG filter with M=4 and K=2 was 0.243π rad/sample.
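Both the odd-degree property and the cutoff frequency can be checked numerically. The following is a minimal sketch in Python with NumPy with illustrative function names. Frequency is normalized so that 1.0 corresponds to the Nyquist frequency (π rad/sample), and the cutoff is located on a dense grid, which is an assumption of this example rather than the procedure used for Fig. 2.

```python
import numpy as np

def sg_coefficients(M, K):
    """Smoothing weights of an SG filter with frame length 2M+1 and degree K."""
    m = np.arange(-M, M + 1)
    A = np.vander(m, K + 1, increasing=True).astype(float)
    return np.linalg.pinv(A)[0]

def cutoff_3db(h, n_grid=20001):
    """First normalized frequency (1.0 = Nyquist) where |H| drops to 1/sqrt(2)."""
    h = np.asarray(h, dtype=float)
    M = (len(h) - 1) // 2
    omega = np.linspace(0.0, np.pi, n_grid)
    m = np.arange(-M, M + 1)
    # Zero-phase response of a symmetric filter: H(w) = sum_m h[m] cos(w m).
    H = np.cos(np.outer(omega, m)) @ h
    below = np.flatnonzero(np.abs(H) <= 1.0 / np.sqrt(2.0))
    return float(omega[below[0]] / np.pi)

if __name__ == "__main__":
    # Odd degree K gives the same weights as degree K - 1.
    print(np.allclose(sg_coefficients(4, 3), sg_coefficients(4, 2)))
    print(cutoff_3db(sg_coefficients(4, 2)))
```

For M = 4 and K = 2 the grid search should return a value close to the 0.243 quoted above, with the small difference depending on the grid resolution and rounding.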
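Table 1. CSI performance on the covers80 dataset.
|Chromagram ||Rank1||P@1||MAP|
|CTP-HP filter ||17.23||0.613||0.669|

Table 2. CSI performance on the covers330 dataset.
|Chromagram ||Rank1||MNCI10||MAP|
|CTP-HP filter ||4.24||7.430||0.767|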
In this paper, SG filters have been used to remove the trend of the spectral energy in order to extract a salient chromagram. Removing the trend emphasizes the tonal content of music, which is preserved under a wide range of the distortions that may occur during cover song generation. An appropriate choice of the trend-estimation filter, which is the issue addressed in this paper, is of the utmost importance for attaining the best performance. Experimental results on two datasets show that the SG filter with appropriately chosen parameters is effective in improving CSI accuracy. Future work includes filter design for more discriminative and resilient chromagram extraction.