Learning Transferable Subspace for Human Motion Segmentation
Document Details
Northeastern University
Lichen Wang, Zhengming Ding, Yun Fu
Summary
This paper proposes a novel transferable subspace clustering approach for human motion segmentation. The method leverages relevant source data to improve the clustering performance of target temporal data, transforming the original data into a shared low-dimensional feature space. A graph regularizer is employed to preserve sequence knowledge in the learned representation.
Full Transcript
The Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18)

Learning Transferable Subspace for Human Motion Segmentation

Lichen Wang,* Zhengming Ding,* Yun Fu*†
* Department of Electrical & Computer Engineering, Northeastern University, Boston, USA
† College of Computer & Information Science, Northeastern University, Boston, USA
[email protected], [email protected], [email protected]

Abstract

Temporal data clustering is a challenging task. Existing methods usually explore a data self-representation strategy, which may hinder clustering performance when data are insufficient or corrupted. In real-world applications, a large amount of related labeled data is easily accessible. To this end, we propose a novel transferable subspace clustering approach that explores useful information from relevant source data to enhance clustering performance on target temporal data. We transform the original data into a shared, low-dimensional, and distinctive feature space by jointly seeking an effective domain-invariant projection. In this way, the well-labeled source knowledge helps obtain a more discriminative target representation. Moreover, a graph regularizer is designed to incorporate temporal information and preserve more sequence knowledge in the learned representation. Extensive experiments on three human motion datasets illustrate that our approach outperforms state-of-the-art temporal data clustering methods.

Figure 1: Framework of our approach. The source information attempts to help target data clustering through knowledge transfer, which is used to reconstruct the target data. Furthermore, we seek a domain-invariant projection to align the different distributions of the two datasets. (The figure shows labeled source data from the MAD dataset and unlabeled target data from the Weizmann dataset projected into a common new space, where the target actions are clustered and segmentation labels are produced.)

Introduction

Temporal data segmentation is a critical technique used in many real-world applications as a data preprocessing step, such as natural language processing, motion analysis, and action recognition. The goal is to divide a long temporal data sequence into several short, non-overlapping, meaningful, and reasonable segments. Temporal segmentation can easily be concatenated with other post-processing methods to further enhance task performance. Consider a video sequence that contains several continuous actions: conventional action recognition approaches are designed to recognize videos containing only a single action, so a preprocessing segmentation step is essential.

Compared with independent static data, the successive information residing in data points is a unique cue to guide clustering algorithms, and it is effective to fully utilize this data dependency to improve clustering performance. A comprehensive survey (Keogh and Kasetty 2003) reveals that temporal data clustering is difficult due to high data dimension and complex temporal connections. Based on the categorization in (Yang and Chen 2011), there are three lines of temporal clustering methods: model-based (Smyth 1999; Xiong and Yeung 2003), temporal-proximity-based (Keogh and Kasetty 2003), and representation-based algorithms (Dimitrova and Golshani 1995; Chen and Chang 2000). Among them, the representation-based approach is the most popular, especially methods built on subspace clustering algorithms.

Subspace clustering performs well in a wide range of applications. Several representative subspace clustering methods have been proposed, including sparse subspace clustering (SSC) (Elhamifar and Vidal 2009), least-square regression (LSR) (Lu et al. 2012), and low-rank representation (LRR) (Liu et al. 2013). The core idea of subspace clustering is to learn a distinctive, low-dimensional data representation.
The representations are used as input for existing clustering algorithms, which makes them applicable to temporal data clustering. Several methods are designed specifically for temporal data segmentation. (Li, Li, and Fu 2015) designed a dictionary that is simultaneously updated during learning to obtain expressive codings. (Talwalkar et al. 2013) proposed a divide-and-conquer algorithm for large-scale subspace segmentation. Because these methods rely on a self-representation strategy, they may hinder clustering performance when the data are insufficient or corrupted. Furthermore, since they belong to the unsupervised learning scenario, there is no extra information to guide the segmentation process, and it is difficult to segment data into the expected groups. Supervised learning is not an ideal solution either, due to the high cost of generating labeled data.

Although task-specific labeled data are difficult to obtain, related and well-labeled data are widely available. It is reasonable to explore information from related data to help the clustering process. To this end, we propose a novel transferable temporal data clustering approach whose goal is to exploit source knowledge to improve segmentation performance in the target domain. This is a challenging task, because the source-target distribution gap can easily cause negative transfer. The crucial idea of our model is to learn a domain-shared subspace in which a reconstruction-based strategy guides the knowledge transfer. The newly learned representations of the target domain are then effective enough for further clustering. To the best of our knowledge, this is the pioneering work exploring transfer learning for temporal data clustering. The contributions of this work are as follows:
- We propose a novel transfer-learning-based subspace clustering approach, which adapts useful information from relevant data to improve the clustering performance on target temporal data. Specifically, a reconstruction-based strategy is adopted to guide the knowledge transfer by seeking effective new representations.
- A domain-invariant projection is learned to mitigate the data distribution differences between source and target domains. Meanwhile, a graph regularizer is built to capture the temporal information of source and target for better clustering.
- Non-trivially, an optimization algorithm is designed to learn the representation and projection simultaneously and efficiently; they are then used to construct a robust affinity graph for further motion segmentation.

Related Work

We discuss three lines of related work: subspace clustering, temporal data segmentation, and transfer learning.

Subspace Clustering. Subspace clustering seeks clusters in different, distinctive subspaces. Sparse subspace clustering (SSC) (Elhamifar and Vidal 2009) uses sparsity constraints to learn a sparse representation of the data. Least-square regression (LSR) (Lu et al. 2012) encourages a grouping effect that tends to group highly correlated data together by using the Frobenius norm. Low-rank representation (LRR) (Liu et al. 2013) analyzes the global structure of the feature space and seeks the lowest-rank representation. Discriminative subspace clustering (DSC) (Zografos, Ellis, and Mester 2013) incorporates discriminative information by using a quadratic classifier trained from unlabeled data. However, these methods are not well designed for temporal data segmentation: they model the data points independently and neglect the temporal relationship among sequential points.

Temporal Data Clustering. The goal of temporal data clustering is to divide long temporal data into short, non-overlapping, and meaningful groups. Semi-Markov K-means clustering (Robards and Sunehag 2009) is designed to find repeated patterns residing in temporal data. Hierarchical aligned cluster analysis (Zhou, De la Torre, and Hodgins 2013) uses a dynamic time alignment kernel to cluster time series. The maximum-margin clustering method (Hoai and De la Torre 2012) simultaneously recognizes the length and position of the corresponding segments. Temporal Subspace Clustering (TSC) (Li, Li, and Fu 2015) jointly learns a dictionary and representations with a regularization that encodes temporal information. These temporal clustering methods mine clustering information only from the data itself, which makes it difficult to robustly cluster temporal data into reasonable, meaningful, and expected results. We therefore propose a transfer-learning-based segmentation approach that borrows information from labeled extra data to improve target clustering performance.

Transfer Learning. Transfer learning is a popular technique that transfers knowledge from one task to different but related tasks. A comprehensive survey can be found in (Pan and Yang 2010). According to (Pan and Yang 2010), our approach belongs to the transductive transfer scenario (Zhang, Zhang, and Ye 2012; Dai et al. 2007; Ando and Zhang 2005; Blitzer, McDonald, and Pereira 2006; Daumé III, Kumar, and Saha 2010; Argyriou, Evgeniou, and Pontil 2007; Wang and Mahadevan 2008): the source and target tasks are the same, but the source and target domains differ. Domain shift is the key problem of transfer learning, since the source and target data distributions are inconsistent. One solution is to learn a data representation that adjusts both distributions and yields a well-aligned feature space (Zhang, Zhang, and Ye 2012; Ding and Fu 2017). Low-rank transfer learning has been proposed to better use incomplete multiple-source data (Ding, Shao, and Fu 2016; 2014; 2015). Our approach is different from previous work: the methods above mainly focus on classification problems and transfer label information between domains, whereas our approach transfers clustering information for temporal data segmentation tasks.

The Proposed Approach

We explore the transfer learning idea for temporal data clustering in a semi-supervised fashion.

Learning Transferable Representation

To transfer knowledge across domains, we adopt a reconstruction-based scheme to seek new representations that lie in a similar data distribution. The reconstruction strategy is

  X \approx X_S Z,   (1)

where X = [X_S, X_T] \in R^{d \times n} is the concatenation of the source and target sample matrices X_S \in R^{d \times n_S} and X_T \in R^{d \times n_T}, and each column of X represents one sample. d is the feature dimension, n_S and n_T are the sample numbers, and n = n_S + n_T. Z = [z_1, z_2, ..., z_n] \in R^{n_S \times n} is the learned representation of X, where each z_i is the representation of x_i; Z = [Z_S, Z_T] is likewise the concatenation of the source and target representations, with Z_S \in R^{n_S \times n_S} and Z_T \in R^{n_S \times n_T}.
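Before the projection P is introduced below, it may help to see the plain reconstruction of Eq. (1) in code. The following is a minimal NumPy sketch of the ridge-regularized variant (i.e., Eq. (3) below with P fixed to the identity), which has a closed-form solution; the dimensions and data are toy values for illustration, not from the paper.

```python
import numpy as np

# Toy dimensions (illustrative only): d features, nS source / nT target frames.
d, nS, nT = 50, 120, 80
rng = np.random.default_rng(0)
XS = rng.standard_normal((d, nS))   # labeled source sequence
XT = rng.standard_normal((d, nT))   # unlabeled target sequence
X = np.hstack([XS, XT])             # X = [XS, XT] in R^{d x n}
lam1 = 0.015                        # lambda_1, the paper's default value

# Closed-form minimizer of ||X - XS Z||_F^2 + lam1 ||Z||_F^2:
#   Z = (XS^T XS + lam1 I)^{-1} XS^T X,  with Z in R^{nS x n}
Z = np.linalg.solve(XS.T @ XS + lam1 * np.eye(nS), XS.T @ X)
print(Z.shape)  # (nS, nS + nT): the columns are the new codes [ZS, ZT]
```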
As mentioned above, X_S and X_T have different data distributions. If X_S is directly used for reconstruction, the learned representation Z would contain high reconstruction error. Therefore, we jointly seek a domain-invariant projection P \in R^{r \times d} that reduces the distribution gap between X_S and X_T during training, where r is the dimension of the projected space. Thus, we refine the reconstruction strategy as

  P X \approx P X_S Z.   (2)

To solve the constraint of Eq. (2), we design a formulation based on least-square regression, with the objective function

  \min_{P,Z} \|P X - P X_S Z\|_F^2 + \lambda_1 (\|P\|_F^2 + \|Z\|_F^2),   (3)

where \|\cdot\|_F is the Frobenius norm, \|Z\|_F^2 = \sum_{i} \sum_{j} |Z_{i,j}|^2, and \lambda_1 is a trade-off parameter. The term \|P X - P X_S Z\|_F^2 captures the reconstruction error, while \|P\|_F^2 and \|Z\|_F^2 constrain the variable scales and model the global subspace structure in X. Moreover, the Frobenius norm is helpful for learning a more distinctive structure in Z (Lu et al. 2012). That is to say, this reconstruction ensures that same-cluster data reconstruct the same cluster in both source and target domains, which guides the knowledge transfer.

Temporal Graph Regularizer

Since successive information is crucial to guide the segmentation process, we design a graph regularization function f_t(Z) to incorporate temporal information. The purpose of f_t(Z) is to make the neighbors of a learned representation sample close to it. In our approach, z_i is the i-th column of Z, corresponding to the i-th sample x_i, and f_t(Z) regulates its neighbors [z_{i-s/2}, ..., z_{i-2}, z_{i-1}, z_{i+1}, z_{i+2}, ..., z_{i+s/2}] to stay close to z_i, where s is the length of the relevant-frame window. The expression of f_t(Z) is

  f_t(Z) = \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij} \|z_i - z_j\|_2^2 = \mathrm{tr}(Z L_W Z^T),   (4)

where \mathrm{tr}(\cdot) denotes the matrix trace, defined as the sum of the elements on the main diagonal. L_W = D - W is the graph Laplacian matrix (Merris 1994), with D_{ii} = \sum_{j} w_{ij}, and W \in R^{n \times n} is the weight regularization matrix with elements

  w_{ij} = \begin{cases} 1, & |i - j| \le s \text{ and } l(z_i) = l(z_j), \text{ for source} \\ 1, & |i - j| \le s, \text{ for target} \\ 0, & \text{otherwise,} \end{cases}   (5)

where l(z_i) denotes the action label of z_i in the source data. Different from previous work (Tierney, Gao, and Guo 2014), which assumes every frame sample has temporal connections, our approach applies the temporal constraint in the source part only to representation samples belonging to the same group. Note that the source classes are not required to overlap with the target classes. We assume that every target data point has temporal connections, since the target segmentation is unknown.

From Eq. (5), we can see that when the distance between the i-th and j-th sample points is at most s, f_t(Z) regulates the learned representations to be close; when the distance is greater than s, there is no regularization between the two samples. To better visualize how f_t(Z) preserves label and temporal information, Figure 2 illustrates the structure of W in a simple case with s = 3, n_S = 9, and n_T = 6. The sequential property is preserved by the constraint weights between neighboring data points. Furthermore, the correlation weights between two different groups in the source data are set to zero; this strategy fully utilizes the label information to constrain the coding result. In this work, only binary weights are used to generate W. Other graph weighting algorithms, and separate weight values for the temporal and label constraints, could be deployed to further improve performance. Moreover, to preserve the non-negativity of features, the constraint Z \ge 0 is added to the objective function.

Figure 2: Structure of W in a simple case where s = 3, n_S = 9 and n_T = 6 (a labeled source block with three groups and an unlabeled target block; zeros elsewhere).
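The weight matrix of Eq. (5) and the Laplacian of Eq. (4) are straightforward to assemble. The sketch below follows the block structure of Figure 2: cross-domain entries are left at zero, which is our reading of the figure rather than something the text states explicitly, and the function names are illustrative.

```python
import numpy as np

def temporal_weight_matrix(source_labels, n_T, s):
    """Binary W of Eq. (5): frames within s steps are linked; in the
    labeled source block, only frames sharing an action label are linked."""
    n_S = len(source_labels)
    n = n_S + n_T
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i == j or abs(i - j) > s:
                continue
            if i < n_S and j < n_S:        # source block: same label required
                W[i, j] = float(source_labels[i] == source_labels[j])
            elif i >= n_S and j >= n_S:    # target block: temporal link only
                W[i, j] = 1.0
    return W

def graph_laplacian(W):
    """L_W = D - W with D_ii = sum_j w_ij (Merris 1994)."""
    return np.diag(W.sum(axis=1)) - W

# The toy case of Figure 2: s = 3, nS = 9 (three labeled groups), nT = 6.
labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]
W = temporal_weight_matrix(labels, n_T=6, s=3)
L_W = graph_laplacian(W)
```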
Putting all the terms together, our objective function is

  \min_{P,Z} \|P X - P X_S Z\|_F^2 + \lambda_1 (\|Z\|_F^2 + \|P\|_F^2) + \lambda_2 \mathrm{tr}(Z L_W Z^T),
  \mathrm{s.t.} \; Z \ge 0, \; P X H X^T P^T = I,   (6)

where \lambda_2 balances the weights of the different terms. The constraint P X H X^T P^T = I preserves the data variance after adaptation, which introduces additional discriminating ability into the learned projection P. X \in R^{d \times n} is the input matrix, and H = I - \frac{1}{n} \mathbf{1} is the centering matrix, where \mathbf{1} \in R^{n \times n} is the all-one matrix and I is the identity matrix.

After obtaining the representation Z, a graph G is generated for the sequential clustering process. To segment temporal data accurately, the within-cluster samples should be highly correlated with each other. G is defined by

  G(i, j) = \frac{z_i^T z_j}{\|z_i\|_2 \|z_j\|_2},   (7)

where z_i is one column of the representation. Once G is generated, an effective conventional clustering algorithm, Normalized Cuts (Shi and Malik 2000), performs the final clustering. The full approach is shown in Algorithm 1.

Algorithm 1. Motion Subspace Clustering
Input: Source data X_S and target data X_T, step size \eta, cluster number k, parameters \lambda_1, \lambda_2, \alpha, s
Output: Clustering result Y
1: Generate temporal constraint matrix W and L_W
2: while not converged do
3:   Update P_{k+1} using Eq. (13), fixing the other variables;
4:   Update V_{k+1} using Eq. (12), fixing the other variables;
5:   Update Z_{k+1} using Eq. (15), fixing the other variables;
6:   Update \Lambda_{k+1} = \Lambda_k + \eta \alpha (V_{k+1} - Z_{k+1});
7:   k = k + 1
8: end while
9: Build an undirected graph G based on Eq. (7)
10: Use NCut to form k clusters and obtain the index Y
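Steps 9-10 of Algorithm 1 could look like the following sketch, which computes the cosine affinity of Eq. (7) and then clusters it. scikit-learn's SpectralClustering is used here as a convenient stand-in for the NCut implementation of Shi and Malik (2000); that substitution is an assumption of this sketch, not the authors' code.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def affinity_graph(Z, eps=1e-12):
    """Eq. (7): G(i, j) = <z_i, z_j> / (||z_i|| ||z_j||), i.e. cosine
    similarity between columns of the learned representation Z."""
    norms = np.linalg.norm(Z, axis=0) + eps
    G = (Z.T @ Z) / np.outer(norms, norms)
    return np.clip(G, 0.0, 1.0)   # keep affinities non-negative for NCut

def segment(Z_target, k):
    """Cluster the target representations into k temporal segments."""
    G = affinity_graph(Z_target)
    model = SpectralClustering(n_clusters=k, affinity="precomputed",
                               random_state=0)
    return model.fit_predict(G)   # frame-wise cluster indices Y
```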
Optimization

Solving Eq. (6) is challenging since explicit solutions are hard to obtain. We therefore solve for the variables with an iterative strategy, ADMM (Boyd et al. 2011), optimizing one variable while fixing the others. An auxiliary variable V \in R^{n_S \times n} is introduced, and Eq. (6) is transformed into

  \min_{P,V,Z} \|P X - P X_S V\|_F^2 + \lambda_1 (\|V\|_F^2 + \|P\|_F^2) + \lambda_2 \mathrm{tr}(V L_W V^T),
  \mathrm{s.t.} \; V = Z, \; Z \ge 0, \; P X H X^T P^T = I.   (8)

We then convert Eq. (8) into an augmented Lagrangian function (Glowinski and Le Tallec 1989):

  L = \frac{1}{2} \|P X - P X_S V\|_F^2 + \lambda_1 (\|V\|_F^2 + \|P\|_F^2) + \lambda_2 \mathrm{tr}(V L_W V^T) + \langle \Lambda, V - Z \rangle + \frac{\alpha}{2} \|V - Z\|_F^2,
  \mathrm{s.t.} \; Z \ge 0, \; P X H X^T P^T = I,   (9)

where \Lambda \in R^{n_S \times n} is the Lagrangian multiplier and \alpha is a trade-off parameter. ADMM alternately minimizes L with respect to Z, V, and P. At the beginning of the optimization, P and Z are initialized with random values, while V and \Lambda are initialized to zero.

Update V: Ignoring the other variables, the Lagrangian (9) reduces to

  \min_{V} \frac{1}{2} \|P X - P X_S V\|_F^2 + \lambda_1 \|V\|_F^2 + \lambda_2 \mathrm{tr}(V L_W V^T) + \langle \Lambda, V - Z \rangle + \frac{\alpha}{2} \|V - Z\|_F^2.   (10)

Setting the derivative of L with respect to V to zero,

  \frac{\partial L}{\partial V} = (P X_S)^T (P X_S V - P X) + 2 \lambda_1 V + 2 \lambda_2 V L_W + \Lambda + \alpha (V - Z) = 0,   (11)

which yields

  [(P X_S)^T (P X_S) + (\alpha + 2 \lambda_1) I] V + V (2 \lambda_2 L_W) = (P X_S)^T P X - \Lambda + \alpha Z.   (12)

Eq. (12) is a Sylvester equation, which is solved by the Bartels-Stewart algorithm (Bartels and Stewart 1972).

Update P: Ignoring the other variables and converting from Frobenius norm to trace form,

  P = \arg\min_{P X H X^T P^T = I} \|P X - P X_S Z\|_F^2 + \lambda_1 \|P\|_F^2
    = \arg\min_{P X H X^T P^T = I} \mathrm{tr}(P (X - X_S Z)(X - X_S Z)^T P^T) + \lambda_1 \mathrm{tr}(P P^T)
    = \arg\min_{P X H X^T P^T = I} \mathrm{tr}(P [(X - X_S Z)(X - X_S Z)^T + \lambda_1 I] P^T).   (13)

Eq. (13) can be solved by generalized eigendecomposition:

  [(X - X_S Z)(X - X_S Z)^T + \lambda_1 I] \rho = \gamma X H X^T \rho,   (14)

where \gamma is the eigenvalue corresponding to the eigenvector \rho of the generalized eigendecomposition. P = [\rho_0, ..., \rho_{r-1}]^T, where the \rho_i are the eigenvectors associated with the smallest eigenvalues of the eigenvalue problem.

Update Z: Ignoring the other variables, the function reduces to

  \min_{Z} \langle \Lambda, V - Z \rangle + \frac{\alpha}{2} \|V - Z\|_F^2.   (15)

The closed-form solution of Eq. (15) is Z = V + \frac{\Lambda}{\alpha}. To meet the non-negativity constraint on Z, the update rule becomes

  Z = F_+(V + \frac{\Lambda}{\alpha}),   (16)

where (F_+(A))_{ij} = \max(A_{ij}, 0) enforces the non-negative value constraint element-wise, A_{ij} being an element of matrix A. The update steps are executed iteratively until convergence. The optimization process is summarized in Algorithm 1.

Complexity Analysis

There are two key time-consuming parts in our model during optimization: Step 3 (eigendecomposition) and Step 4 (updating V). Specifically, Step 3 involves an eigendecomposition, which costs O(d^3) for a matrix of size d x d. Step 4 updates V using the Bartels-Stewart algorithm, with complexity O(n_S^2 n). Denoting the number of iterations by t, the total computational complexity of our approach is O(t d^3 + t n_S^2 n). Step 3 can be reduced to O(d^{2.376}) using the Coppersmith-Winograd algorithm (Coppersmith and Winograd 1987). Step 4 mainly suffers from the size of the source data; in general, a more effective basis could be sought from the source to reduce its size. In this sense, our approach is applicable to real-world scenarios.
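Under the updates above, one ADMM iteration (steps 3-6 of Algorithm 1) might be sketched as follows. `solve_sylvester` implements Bartels-Stewart for Eq. (12) and SciPy's generalized symmetric eigensolver handles Eq. (14); the small ridge added to X H X^T to keep it positive definite, and the function name itself, are numerical assumptions of this sketch rather than details given in the paper.

```python
import numpy as np
from scipy.linalg import solve_sylvester, eigh

def admm_step(X, XS, Z, V, Lam, L_W, r, lam1, lam2, alpha, eta, eps=1e-8):
    """One pass of Algorithm 1 (steps 3-6), following Eqs. (12)-(16)."""
    d, n = X.shape

    # --- Update P via the generalized eigenproblem of Eq. (14) ---
    R = X - XS @ Z
    A = R @ R.T + lam1 * np.eye(d)
    H = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = X @ H @ X.T + eps * np.eye(d)        # ridge keeps B positive definite
    _, vecs = eigh(A, B)                     # eigenvalues in ascending order
    P = vecs[:, :r].T                        # r eigenvectors, smallest gamma

    # --- Update V by solving the Sylvester equation of Eq. (12) ---
    PXS = P @ XS
    A_syl = PXS.T @ PXS + (alpha + 2 * lam1) * np.eye(XS.shape[1])
    B_syl = 2 * lam2 * L_W
    Q = PXS.T @ (P @ X) - Lam + alpha * Z
    V = solve_sylvester(A_syl, B_syl, Q)

    # --- Update Z with the non-negative projection of Eq. (16) ---
    Z = np.maximum(V + Lam / alpha, 0.0)

    # --- Dual ascent on the multiplier (step 6) ---
    Lam = Lam + eta * alpha * (V - Z)
    return P, V, Z, Lam
```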
Experiment

Temporal Datasets

Multi-modal Action Detection (MAD) Dataset (Huang et al. 2014) contains multi-modal actions of 20 subjects recorded by Microsoft Kinect in three formats: regular RGB images at 640 x 480 resolution, 3D depth images, and human skeleton information. Each subject performs 35 actions in two different indoor environments. Figure 3(a) shows sample frames from the MAD dataset.

Keck Gesture Dataset (Jiang, Lin, and Davis 2012) consists of 14 different actions from military signals, at 640 x 480 resolution. Three subjects perform the 14 gestures, and each action is repeated 3 times by each subject, yielding 3 x 3 x 14 = 126 human action videos. The actions are captured by a fixed-position camera against a static, simple background. Sample frames are shown in Figure 3(b).

Weizmann Dataset (Gorelick et al. 2007) contains 90 video sequences of 10 different actions performed by nine subjects in outdoor environments, at 180 x 144 resolution with a frame rate of 50 fps. The subjects perform ten regular actions, including run, walk, and skip. Sample frames are shown in Figure 3(c).

Figure 3: Example frames of the three human motion datasets: (a) MAD, (b) Keck, (c) Weizmann.

Experimental Setup

We extract low-level HoG features (Dalal and Triggs 2005) and obtain a 324-dimensional feature vector for each frame. Since different datasets contain different numbers of actions, we randomly choose ten actions performed by the same subject from each dataset to make segmentation results comparable across datasets. In evaluation, we use 5 randomly selected sequences from the source datasets and report the average performance. Moreover, since the Weizmann and Keck datasets contain only a single action per sequence, we follow the setting of (Hoai and De la Torre 2012) and concatenate single actions from the same subject into one long sequence; each sequence contains 10 actions with 1000-3000 frames. The parameters \lambda_1 and \lambda_2 are set to 0.015 and 12 by default, the correlated frame distance is s = 9, and the projection size is r = 80. Parameter sensitivity is analyzed later in this section. Our approach is compared with five state-of-the-art methods:

- Low-Rank Representation (LRR) (Liu et al. 2013) seeks the representation of lowest rank among all candidates. LRR captures the global structure of the data and is an effective approach for subspace segmentation.
- Ordered Subspace Clustering (OSC) (Tierney, Gao, and Guo 2014) explicitly enforces the representations of temporally adjacent data to be close and achieves strong clustering performance on sequential data.
- Sparse Subspace Clustering (SSC) (Elhamifar and Vidal 2009) assumes that each point has a sparse representation and enforces a sparsity constraint to learn it.
- Least Square Regression (LSR) (Lu et al. 2012) is an efficient clustering approach; by using the Frobenius norm, LSR lets highly correlated data be effectively grouped together.
- Temporal Subspace Clustering (TSC) (Li, Li, and Fu 2015) proposes a temporal Laplacian regularization and jointly learns a dictionary to obtain expressive, distinctive codings for time series data.

We evaluate the compared methods by running the code provided by the authors, with all parameters tuned to achieve the best performance. Both Normalized Mutual Information (NMI) and Accuracy (ACC) are used to measure clustering performance.
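As a reference point, clustering accuracy is conventionally computed by finding the best one-to-one match between predicted clusters and ground-truth labels via the Hungarian algorithm; the paper does not spell out its exact metric implementation, so the sketch below shows the standard recipe together with scikit-learn's NMI.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_accuracy(y_true, y_pred):
    """Best-permutation accuracy: match predicted clusters to labels
    with the Hungarian algorithm, then score as plain accuracy."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    k = max(y_true.max(), y_pred.max()) + 1
    overlap = np.zeros((k, k), dtype=int)
    for t, p in zip(y_true, y_pred):
        overlap[t, p] += 1
    rows, cols = linear_sum_assignment(-overlap)   # maximize total overlap
    return overlap[rows, cols].sum() / len(y_true)

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [1, 1, 0, 0, 2, 2]
print(clustering_accuracy(y_true, y_pred))            # 1.0
print(normalized_mutual_info_score(y_true, y_pred))   # 1.0
```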
Table 1: Clustering accuracies and NMI of the compared methods on three human motion datasets. Names in brackets denote the respective source dataset. The best and second best clustering results are denoted by bold and italic respectively in the original paper.

(a) Results on MAD Dataset
Method | Acc | NMI
LRR | 0.2397 | 0.2249
OSC | 0.4327 | 0.5589
SSC | 0.3817 | 0.4758
LSR | 0.3979 | 0.3667
TSC | 0.5556 | 0.7721
TSC (Weiz) | 0.5418 | 0.7684
TSC (Keck) | 0.5473 | 0.7691
Ours (Weiz) | 0.5736 | 0.8202
Ours (Keck) | 0.5792 | 0.8286

(b) Results on Keck Dataset
Method | Acc | NMI
LRR | 0.4297 | 0.4862
OSC | 0.4393 | 0.5931
SSC | 0.3137 | 0.3858
LSR | 0.4894 | 0.4548
TSC | 0.4781 | 0.7129
TSC (MAD) | 0.4653 | 0.6935
TSC (Weiz) | 0.4548 | 0.6862
Ours (MAD) | 0.5395 | 0.8049
Ours (Weiz) | 0.5485 | 0.7928

(c) Results on Weizmann Dataset
Method | Acc | NMI
LRR | 0.3638 | 0.4382
OSC | 0.5216 | 0.7047
SSC | 0.4576 | 0.6009
LSR | 0.5091 | 0.5093
TSC | 0.6111 | 0.8199
TSC (MAD) | 0.5961 | 0.8032
TSC (Keck) | 0.5931 | 0.7971
Ours (MAD) | 0.6208 | 0.8509
Ours (Keck) | 0.6030 | 0.8326

Performance Comparison

In our approach, we set one dataset as source and another as target. Since there are three datasets, each target dataset is segmented using each of the other two datasets as source. Results are shown in Table 1. The other methods cannot utilize source information, so we only input the target data and report their segmentation results. From Table 1 we can see that our approach outperforms the other methods: compared with the second best approach, TSC, ours achieves average improvements of 5.3% in accuracy and 7.1% in NMI.

We visualize the segmentation results of our approach and the other methods on one MAD video in Figure 4, where different colors indicate different segments (i.e., clusters). We observe that SSC, LRR, and LSR produce many fragments within a single action sequence, which is not reasonable; one reason is that these methods do not preserve temporal information between adjacent points. OSC and TSC perform better since both are designed for temporal data segmentation, but they still suffer from data variations: OSC still produces fragments, and TSC is not sensitive when segmenting similar temporal data. Because our approach utilizes extra source data and temporal graph regularization, it obtains continuous segments and distinguishes similar but different actions, achieving better and more accurate results than the other methods.

Figure 4: Visualization of clustering results. Ten colors denote ten different actions. LRR, LSR and SSC cannot segment temporal data accurately since no temporal information is preserved in their models. The results of OSC and TSC are better but still contain fragments and inaccurate parts, while our approach contains fewer fragments and is sensitive to similar actions.

Our approach fails to achieve significant improvement on the Weizmann dataset. We find several sources of variation in Weizmann. First, since the action sequences are directly concatenated, the subject positions and motion patterns sometimes change suddenly between two sequences, such as switching from moving left to moving right. Second, since the videos were captured at different times, the illumination is inconsistent across videos, and the subjects' visual appearance, such as clothes and hair styles, also changes. In the other two datasets, by contrast, the subjects perform all actions without appearance changes and under consistent illumination, which is closer to real-world applications. These differences give the compared approaches distinctive cues, while our method can obtain only limited benefit from the source data; the improvement of our approach is therefore less pronounced because the data are already distinctive.

Source Data Analysis

To compare fairly with the other methods, we concatenate the source and target data together as input to the second best approach, TSC, to demonstrate that our improvement does not come simply from data augmentation. Table 1 shows the result: TSC's performance drops slightly (about 1%), which indicates two facts. First, simply increasing the number of data samples cannot improve clustering performance. Second, performance degrades if a model cannot appropriately align the distribution gap between source and target. This demonstrates that our approach is able to align the different distributions well and transfer useful information to improve segmentation performance.

To show that the source data are crucial for segmentation performance, we manipulate the dictionary and test the model, setting Weizmann as source and MAD as target. First, the original source is used as the baseline (Ours-1). Second, we randomly shuffle the sequence order of the dictionary (Ours-2).
Third, SSC, LRR and LSR segment a lot of fragments in one action a Gaussian distribution noise xn ∼ N (0, 1) is added to dic- sequence which is not reasonable. One reason is that these tionary (Ours-3). The result is shown in table 2. We observe methods do not preserve temporal information between ad- that shuffling the dictionary sequence has less than 0.2% jacent points. OSC and TSC have better performance since negative influence on segmentation performance. However, both methods are designed for temporal data segmentation. the result drops more than 7 % in accuracy and NMI when However, they still suffer from data variations. OSC still ex- adding Gaussian noise. Figure 6 further shows the perfor- ists fragments situation and TSC is not sensitive in segment- mance with different scales of noise added to dictionary. We ing similar temporal data. However, since our approach uti- observe that the model achieves the best performance with- lizes extra source data and temporal graph regularization, it out noise, and as the noise increases, the performance drops is able to obtain continuous segments and distinguish sim- significantly and then gradually becomes stable. The results ilar but different actions. Therefore, our approach achieves denote that clean and good structure of source data can help better and more accurate results than other methods. to learn more distinctive representations and improve seg- Our approach fails to achieve significant improvement in mentation performance. Adding noise would weaken or de- Weizmann dataset. We find out that there are various differ- stroy the structure information and weaken the reconstruc- ences in Weizmann Dataset. First, since the action sequences tion performance, thus it reduces the clustering performance. 4200 $&& 10, $&& 10, .HFN.HFN.HFN.HFN :HL]PDQQ :HL]PDQQ :HL]PDQQ :HL]PDQQ 0$' 0$' 0$' 0$' 6HTXHQWLDO1HLJKERUVV 6HTXHQWLDO1HLJKERUVV 3URMHFWLRQ6L]H 3URMHFWLRQ6L]H (a) (b) Figure 5: (a) Performance in different number of sequential neighbors s. (b) Performance in different project size r. 10, $&& $&& 10, 1RLVH9DULDQFH 1RLVH9DULDQFH Figure 7: Parameter sensitivity tested on Keck dataset. Left Figure 6: Segmentation result when different variances of and right are visualization result of NMI and ACC metrics Gaussian noise is added to dictionary. with different value of λ1 and λ2. Table 3: Results based on modifications of our model by Parameter Analysis changing the reconstruction term. Modified Model ACC NMI We use different values to test the parameter sensitivity of Model-1 0.5415 0.8111 our model on Keck and MAD as source. Figure 5(a) shows Model-2 0.5224 0.8066 that when s 5, the performance would be accurate and Model-3 0.5268 0.7736 stable. λ1 constraints P and Z scale, and λ2 regulates the Model-4 0.4530 0.6866 temporal data constraint. Figure 5(b) shows the parameter sensitivity of r. It indicates that if r 20, the result is rel- atively stable even though there is a little fluctuation as r Model Variant Analysis increases. s is also a major parameter in our model. The re- To verify the effectiveness of each term in our model, we sult in Figure 7 demonstrates that our approach can get an change the first reconstruction term in Eq. (3), and the results accurate result when λ1 in the range of [0.02, 0.1] and λ2 in are shown in Table 3. We set MAD as source and Keck as the range of [9, 20]. The ranges of λ1 and λ2 are wide. target. First, we test the original model as baseline in Model- 1. 
Second, we remove P, giving Model-2: \|X - X_S Z\|_F^2. Third, we remove P and concatenate X_S and X_T as the dictionary, giving Model-3: \|X - X Z\|_F^2. Fourth, we keep P but use both source and target data as the dictionary, giving Model-4: \|P X - P X Z\|_F^2. We observe that when the projection P is removed, segmentation accuracy drops by more than 3%, which indicates that P is essential for connecting source and target data and improving performance. When X_T is concatenated into the dictionary, segmentation performance drops significantly; we attribute this to the fact that, with X_T in the dictionary, the target data can be represented by X_T instead of X_S, so X_S has a negative influence on segmentation. From these experimental results, we conclude that every term in our approach is required and contributes to segmentation performance.

Table 3: Results of modifications of our model obtained by changing the reconstruction term.
Modified Model | ACC | NMI
Model-1 | 0.5415 | 0.8111
Model-2 | 0.5224 | 0.8066
Model-3 | 0.5268 | 0.7736
Model-4 | 0.4530 | 0.6866

Conclusion

We introduced a novel transfer-learning-based temporal data clustering approach. The approach adapts useful information from relevant source data and transfers knowledge to target temporal data segmentation tasks. Specifically, a reconstruction-based strategy was adopted to guide the knowledge transfer, a domain-invariant projection was learned to mitigate the data distribution differences, and a graph regularizer was built to capture the temporal information. Our approach outperformed state-of-the-art temporal subspace clustering methods on three human motion datasets, and the results also demonstrated that it is robust, accurate, and insensitive to its parameters.

Acknowledgments

This research is supported in part by the NSF IIS award 1651902, ONR Young Investigator Award N00014-14-1-0484, and U.S. Army Research Office Award W911NF-17-1-0367.

References

Ando, R. K., and Zhang, T. 2005. A high-performance semi-supervised learning method for text chunking. In ACL, 1-9.
Argyriou, A.; Evgeniou, T.; and Pontil, M. 2007. Multi-task feature learning. NIPS 19:41.
Bartels, R. H., and Stewart, G. 1972. Solution of the matrix equation AX + XB = C [F4]. Communications of the ACM 15(9):820-826.
Blitzer, J.; McDonald, R.; and Pereira, F. 2006. Domain adaptation with structural correspondence learning. In ACL EMNLP, 120-128.
Boyd, S.; Parikh, N.; Chu, E.; Peleato, B.; and Eckstein, J. 2011. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning 3(1):1-122.
Chen, W., and Chang, S.-F. 2000. Motion trajectory matching of video objects. In Storage and Retrieval for Media Databases, volume 3972, 544-553.
Coppersmith, D., and Winograd, S. 1987. Matrix multiplication via arithmetic progressions. In ACM Symposium on Theory of Computing, 1-6.
Dai, W.; Xue, G.-R.; Yang, Q.; and Yu, Y. 2007. Co-clustering based classification for out-of-domain documents. In ACM SIGKDD, 210-219.
Dalal, N., and Triggs, B. 2005. Histograms of oriented gradients for human detection. In IEEE CVPR, volume 1, 886-893.
Daumé III, H.; Kumar, A.; and Saha, A. 2010. Frustratingly easy semi-supervised domain adaptation. In ACL NLP, 53-59.
Dimitrova, N., and Golshani, F. 1995. Motion recovery for video content classification. ACM TOIS 13(4):408-439.
Ding, Z., and Fu, Y. 2017. Robust transfer metric learning for image classification. IEEE TIP 26(2):660-670.
Ding, Z.; Shao, M.; and Fu, Y. 2014. Latent low-rank transfer subspace learning for missing modality recognition. In AAAI, 1192-1198.
Ding, Z.; Shao, M.; and Fu, Y. 2015. Deep low-rank coding for transfer learning. In IJCAI, 3453-3459.
Ding, Z.; Shao, M.; and Fu, Y. 2016. Transfer learning for image classification with incomplete multiple sources. In IEEE IJCNN, 2188-2195.
Elhamifar, E., and Vidal, R. 2009. Sparse subspace clustering. In IEEE CVPR, 2790-2797.
Glowinski, R., and Le Tallec, P. 1989. Augmented Lagrangian and Operator-Splitting Methods in Nonlinear Mechanics. SIAM.
Gorelick, L.; Blank, M.; Shechtman, E.; Irani, M.; and Basri, R. 2007. Actions as space-time shapes. IEEE TPAMI 29(12):2247-2253.
Hoai, M., and De la Torre, F. 2012. Maximum margin temporal clustering. In AI and Statistics, 1-9.
Huang, D.; Yao, S.; Wang, Y.; and De la Torre, F. 2014. Sequential max-margin event detectors. In ECCV, 410-424.
Jiang, Z.; Lin, Z.; and Davis, L. 2012. Recognizing human actions by learning and matching shape-motion prototype trees. IEEE TPAMI 34(3):533-547.
Keogh, E., and Kasetty, S. 2003. On the need for time series data mining benchmarks: a survey and empirical demonstration. Data Mining and Knowledge Discovery 7(4):349-371.
Li, S.; Li, K.; and Fu, Y. 2015. Temporal subspace clustering for human motion segmentation. In IEEE ICCV, 4453-4461.
Liu, G.; Lin, Z.; Yan, S.; Sun, J.; Yu, Y.; and Ma, Y. 2013. Robust recovery of subspace structures by low-rank representation. IEEE TPAMI 35(1):171-184.
Lu, C.-Y.; Min, H.; Zhao, Z.-Q.; Zhu, L.; Huang, D.-S.; and Yan, S. 2012. Robust and efficient subspace segmentation via least squares regression. In ECCV, 347-360.
Merris, R. 1994. Laplacian matrices of graphs: a survey. Linear Algebra and its Applications 197:143-176.
Pan, S. J., and Yang, Q. 2010. A survey on transfer learning. IEEE TKDE 22(10):1345-1359.
Robards, M. W., and Sunehag, P. 2009. Semi-Markov kmeans clustering and activity recognition from body-worn sensors. In IEEE ICDM, 438-446.
Shi, J., and Malik, J. 2000. Normalized cuts and image segmentation. IEEE TPAMI 22(8):888-905.
Smyth, P. 1999. Probabilistic model-based clustering of multivariate and sequential data. In AI and Statistics, 299-304.
Talwalkar, A.; Mackey, L.; Mu, Y.; Chang, S.-F.; and Jordan, M. I. 2013. Distributed low-rank subspace segmentation. In IEEE ICCV, 3543-3550.
Tierney, S.; Gao, J.; and Guo, Y. 2014. Subspace clustering for sequential data. In IEEE CVPR, 1019-1026.
Wang, C., and Mahadevan, S. 2008. Manifold alignment using Procrustes analysis. In ACM ICML, 1120-1127.
Xiong, Y., and Yeung, D.-Y. 2003. Mixtures of ARMA models for model-based time series clustering. In IEEE ICDM, 717-720.
Yang, Y., and Chen, K. 2011. Temporal data clustering via weighted clustering ensemble with different representations. IEEE TKDE 23(2):307-320.
Zhang, C.; Zhang, L.; and Ye, J. 2012. Generalization bounds for domain adaptation. In NIPS, 3320-3328.
Zhou, F.; De la Torre, F.; and Hodgins, J. K. 2013. Hierarchical aligned cluster analysis for temporal clustering of human motion. IEEE TPAMI 35(3):582-596.
Zografos, V.; Ellis, L.; and Mester, R. 2013. Discriminative subspace clustering. In IEEE CVPR, 2107-2114.