Data

Source Data (not directly related to the task)

Target Data (directly related to the task)

Each kind of data can be labeled or unlabeled, which gives four cases in total. Transfer learning is only needed when target data is scarce.

one-shot learning: only a few examples in the target domain.

Labeled Target Data + Labeled Source Data

Fine-Tune

Train a model on the source data, then fine-tune it with the small amount of target data. The resulting model is judged mainly by how well it performs on the target domain.

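The difficulty is that the target data is very limited, so fine-tuning can easily overfit and the model may still perform poorly on the target domain. Two techniques help: conservative training and layer transfer.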

Conservative Training

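Train NN1 on the large amount of source data.

Train NN2 on the small amount of target data.

The two models should stay similar without being identical. A regularization term is added so that NN2 and NN1 produce similar outputs on the same data (or, alternatively, so that their parameters stay close).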
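A minimal sketch of one conservative-training update, assuming the regularizer penalizes the squared distance between the parameters of NN2 and NN1 (penalizing the difference between their outputs is another option). The function name, `lam`, `lr`, and the toy weights are illustrative assumptions, not part of the original notes.

```python
# Conservative training sketch: update NN2 while penalizing drift from NN1.
# Assumption: the regularizer is lam * ||theta2 - theta1||^2 on the parameters.

def conservative_step(theta2, theta1, task_grad, lam=0.1, lr=0.1):
    """One gradient step on  L_task(theta2) + lam * ||theta2 - theta1||^2."""
    new_theta = []
    for w2, w1, g in zip(theta2, theta1, task_grad):
        # Gradient of the penalty term pulls each weight back toward NN1.
        penalty_grad = 2.0 * lam * (w2 - w1)
        new_theta.append(w2 - lr * (g + penalty_grad))
    return new_theta


theta1 = [1.0, -2.0]      # weights learned on source data (NN1)
theta2 = [1.5, -1.0]      # current weights of NN2 after some target-data updates
zero_grad = [0.0, 0.0]    # pretend the target-task loss is flat at this point
stepped = conservative_step(theta2, theta1, zero_grad)
```

With a flat task loss, the only force left is the penalty, so `stepped` moves slightly back toward `theta1`. A larger `lam` keeps NN2 closer to NN1.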

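Layer Transfer

Train NN1 on the source data.

Copy some layers of the trained NN1 into a new network NN2 that is trained on target data. When training NN2, only the parameters of the layers that were not copied need to be trained.

Which layers should be transferred? It depends on the task. In speech recognition the last few layers are usually transferred: the first layers deal with how sounds are pronounced, which varies from speaker to speaker, while the later layers map pronunciation to the recognition result, which is the same for everyone. In image tasks the first few layers are usually transferred, because they detect the simplest shapes and textures, and the features become more complex in later layers.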
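Layer transfer can be sketched as follows, with a model represented as a plain dictionary from layer name to weights. The layer names, the toy weights, and the helper functions are hypothetical; a real framework would freeze parameters through its own mechanism.

```python
# Layer transfer sketch with a model represented as {layer_name: weights}.
# Layer names and the plain-list "weights" are hypothetical placeholders.

def transfer_layers(nn1, nn2, copied):
    """Copy the chosen layers from nn1 into nn2; return names still trainable."""
    for name in copied:
        nn2[name] = list(nn1[name])      # copy, so nn1 is left untouched
    return [name for name in nn2 if name not in copied]


def sgd_step(model, grads, trainable, lr=0.1):
    """Update only the layers that were not copied (copied layers stay frozen)."""
    for name in trainable:
        model[name] = [w - lr * g for w, g in zip(model[name], grads[name])]


nn1 = {"layer1": [0.5, 0.5], "layer2": [1.0, -1.0], "output": [2.0]}
nn2 = {"layer1": [0.0, 0.0], "layer2": [0.0, 0.0], "output": [0.0]}

# Image-style choice: transfer the first layers, train the rest on target data.
trainable = transfer_layers(nn1, nn2, copied=["layer1", "layer2"])
grads = {"layer1": [1.0, 1.0], "layer2": [1.0, 1.0], "output": [1.0]}
sgd_step(nn2, grads, trainable)
```

For a speech-style setup, the `copied` list would name the last layers instead, and the first layers would be the ones left trainable.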

Multitask Learning

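In multitask learning we require the model to perform well on both the target domain and the source domain.

In one case, task A and task B have something in common that can be represented directly: the first few layers of the model share parameters, and the model then splits into task-specific branches.

In the other case, the input features shared by task A and task B cannot be determined in advance, so the model starts with a separate input branch per task. The first few independent layers map each input into a common domain (they extract features first). Once a common representation is obtained, the middle layers share parameters, and the model finally splits into branches again.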
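A small sketch of the first case (shared early layers, then one branch per task), assuming tiny hand-written linear layers. All weights and the squared-error loss are made up for illustration.

```python
# Multitask sketch: one shared layer feeds two task-specific heads.
# The tiny linear "layers" and all weights are made up for illustration.

def linear(x, weights, bias):
    """Dense layer: weights is a list of rows, one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(v):
    return [max(0.0, a) for a in v]


shared_w, shared_b = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]   # shared by both tasks
head_a_w, head_a_b = [[1.0, 1.0]], [0.0]                     # task A branch
head_b_w, head_b_b = [[1.0, -1.0]], [0.5]                    # task B branch


def forward(x):
    h = relu(linear(x, shared_w, shared_b))      # common representation
    return linear(h, head_a_w, head_a_b), linear(h, head_b_w, head_b_b)


def joint_loss(x, target_a, target_b):
    """Sum of both tasks' squared errors; both terms depend on the shared layer."""
    out_a, out_b = forward(x)
    return (out_a[0] - target_a) ** 2 + (out_b[0] - target_b) ** 2


out_a, out_b = forward([2.0, 3.0])
```

Because both loss terms depend on `shared_w`, training on the joint loss forces the shared layer to be useful for both tasks at once.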

Unlabeled Target Data + Labeled Source Data

Domain Adversarial Training

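Architecture

feature extractor → label predictor + domain classifier

The feature extractor's job is to remove domain-specific characteristics. To measure how well it does, a domain classifier is attached after it and has to decide which domain each feature comes from. If training works well, the domain classifier can no longer tell which domain a feature comes from; in other words, it ends up failing.

This structure alone does not learn anything useful, though: if the feature extractor always outputs 0, the domain classifier can never tell the domains apart. The feature extractor's task therefore has to be made harder by also attaching a label predictor after it, so that the feature extractor removes domain characteristics while preserving as much of the information needed to predict the label as possible.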

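Training

During backpropagation, a gradient reversal layer is added between the feature extractor and the domain classifier, so the gradient from the domain classifier is negated before it reaches the feature extractor. Between the feature extractor and the label predictor, ordinary gradient descent applies.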
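The gradient reversal layer can be sketched with hand-written forward and backward passes. The class name, the scaling factor `lam`, and the gradient values are assumptions for illustration; in a real framework this would be implemented as a custom autograd operation.

```python
# Gradient reversal layer (GRL) sketch with hand-written forward/backward passes.
# The class name and the scaling factor `lam` are assumptions for illustration.

class GradientReversal:
    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, features):
        # Identity in the forward pass: the domain classifier sees the features as-is.
        return list(features)

    def backward(self, grad_from_domain_classifier):
        # Flip the sign on the way back, so the feature extractor is updated
        # to increase the domain classifier's loss.
        return [-self.lam * g for g in grad_from_domain_classifier]


grl = GradientReversal(lam=0.5)
features = [0.2, -1.0, 3.0]
passed = grl.forward(features)

grad_domain = [1.0, -2.0, 0.0]     # d(domain loss)/d(features), made-up numbers
grad_label = [0.1, 0.1, 0.1]       # d(label loss)/d(features), made-up numbers

# Gradient that reaches the feature extractor: the normal label gradient
# plus the reversed domain gradient.
reversed_grad = grl.backward(grad_domain)
grad_features = [gl + gr for gl, gr in zip(grad_label, reversed_grad)]
```

The domain classifier itself is still trained with the unreversed gradient, so it keeps trying to separate the domains while the feature extractor works against it.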

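Zero-shot learning

A database stores every possible class together with its attributes.

The NN takes the data as input and outputs its attributes. The attributes are then looked up in the database to find the matching class.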
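The lookup step can be sketched as a nearest-neighbor search over the database. The attribute names, classes, and predicted scores below are invented examples, and squared Euclidean distance is an assumed choice of metric.

```python
# Zero-shot lookup sketch: predicted attributes -> nearest class in a database.
# The attribute names and classes are invented examples, not from the notes.

ATTRIBUTES = ("furry", "four_legs", "has_tail")

CLASS_DB = {
    "dog":   (1, 1, 1),
    "fish":  (0, 0, 1),
    "chimp": (1, 0, 0),
}


def lookup_class(predicted, db=CLASS_DB):
    """Return the class whose attribute vector is closest to the prediction."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(db, key=lambda name: sq_dist(db[name], predicted))


# Suppose the network outputs these attribute scores for an unseen image.
predicted_attributes = (0.9, 0.2, 0.1)
label = lookup_class(predicted_attributes)
```

The class found this way does not need to appear in the training data, as long as its attributes are listed in the database.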

attribute embedding

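If the attributes are very complex, consider attribute embedding, i.e. reducing the dimensionality of both the inputs and the attributes.

There is an embedding space in which each original input $x^n$ and its attribute vector $y^n$ are projected by NNs to the same point or to nearby points, $f(x^n)$ and $g(y^n)$. Training should not simply minimize the distance between $f(x^n)$ and $g(y^n)$, because the model could then project all data to a single point. Instead, the distance between $f(x^n)$ and $g(y^n)$ should be smaller than the distance between $f(x^n)$ and every other $g(y^m)$, $m \neq n$, by at least a predefined margin $k$.

Given an unseen input, project it into the embedding space and find the attribute whose projection in that space is closest.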
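A sketch of a margin-based loss of this kind and of the nearest-attribute lookup, assuming Euclidean distance and a hinge form $\max(0,\ k + d(f(x^n), g(y^n)) - \min_{m \neq n} d(f(x^n), g(y^m)))$. The exact loss form, the function names, and the toy vectors are assumptions; the inputs are taken to be already-embedded vectors.

```python
# Margin-based attribute embedding loss, using Euclidean distance.
# f_x / g_y are assumed to be already-embedded vectors; k is the margin.
import math


def dist(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def margin_loss(f_x, g_y, k=1.0):
    """Sum over n of max(0, k + d(f(x^n), g(y^n)) - min_{m != n} d(f(x^n), g(y^m)))."""
    total = 0.0
    for n, fx in enumerate(f_x):
        matched = dist(fx, g_y[n])
        closest_other = min(dist(fx, gy) for m, gy in enumerate(g_y) if m != n)
        total += max(0.0, k + matched - closest_other)
    return total


def nearest_attribute(fx, g_y):
    """Index of the attribute embedding closest to an embedded input."""
    return min(range(len(g_y)), key=lambda m: dist(fx, g_y[m]))


g_y = [[0.0, 0.0], [3.0, 4.0]]            # embedded attributes g(y^1), g(y^2)
well_separated = [[0.0, 0.0], [3.0, 4.0]]  # each f(x^n) sits on its own g(y^n)
collapsed = [[0.0, 0.0], [0.0, 0.0]]       # everything mapped to one point

loss_good = margin_loss(well_separated, g_y)
loss_collapsed = margin_loss(collapsed, collapsed)
match = nearest_attribute([2.5, 3.5], g_y)
```

Here the collapsed solution has a larger loss than the well-separated one, which is the point of the margin: mapping everything to a single point is penalized rather than rewarded.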

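attribute embedding + word embedding

Applies when no attribute database is available: the word vector of each class name takes the place of its attribute vector.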

convex combination of semantic embedding

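The output does not commit to a single class; it only gives the probability of each possible class. These probabilities are used as weights in a linear combination of the class vectors, giving a new mixed vector, and the class vector closest to this new vector is chosen.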
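The procedure can be sketched as a probability-weighted sum followed by a nearest-neighbor search. The class names, probabilities, and 2-D class vectors are made-up illustrations, and Euclidean distance is an assumed choice of metric.

```python
# Convex combination of semantic embeddings sketch.
# Class names, probabilities, and 2-D word vectors are made-up illustrations.
import math

SEEN_VECTORS = {"lion": [1.0, 0.0], "tiger": [0.0, 1.0]}
ALL_VECTORS = {"lion": [1.0, 0.0], "tiger": [0.0, 1.0], "liger": [0.5, 0.5]}


def combine(probs, vectors):
    """Probability-weighted sum of the class vectors of the seen classes."""
    dim = len(next(iter(vectors.values())))
    mixed = [0.0] * dim
    for name, p in probs.items():
        for i, v in enumerate(vectors[name]):
            mixed[i] += p * v
    return mixed


def nearest_class(vec, vectors):
    """Name of the class whose vector is closest to vec."""
    def d(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(vectors, key=lambda name: d(vectors[name], vec))


# A classifier trained on the seen classes is undecided between lion and tiger.
probs = {"lion": 0.5, "tiger": 0.5}
mixed = combine(probs, SEEN_VECTORS)
prediction = nearest_class(mixed, ALL_VECTORS)
```

In this toy setup the mixed vector lands on a class the classifier was never trained on, which is how the method can output unseen classes using only an off-the-shelf classifier and class vectors.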

Labeled Target Data + Unlabeled Source Data

Self-taught learning

Unlabeled Target Data + Unlabeled Source Data

Self-taught clustering