<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Jexus Scripts</title>
    <description>Scripts by a student in electrical engineering</description>
    <link>https://voidism.github.io//</link>
    <atom:link href="https://voidism.github.io//feed.xml" rel="self" type="application/rss+xml" />
    <pubDate>Mon, 18 Aug 2025 22:55:01 +0000</pubDate>
    <lastBuildDate>Mon, 18 Aug 2025 22:55:01 +0000</lastBuildDate>
    <generator>Jekyll v3.10.0</generator>
    
      <item>
        <title>US EECS PhD Application Experience (ML/DL/NLP/Speech)</title>
        <description>US EECS PhD Application Experience (ML/DL/NLP/Speech)

This post also appears on the PTT studyabroad board: https://www.ptt.cc/bbs/studyabroad/M.1616171844.A.181.html

Research Areas
Machine Learning for Natural Language Processing, Speech Processing

Admission
(all applications for 2021 Fall)
MIT EECS PhD (1/20 interview, 1/27 accept; my first formal offer, so I had to say goodbye to everyone else...)
CMU LTI MLT (1/20 interview, 2/4 accept; applied for the PhD, redirected to the Master's)
UCSB CS PhD (1/5 interview, 2/5 accept; the professor gave me a verbal offer on 1/7)
UW ECE PhD (no interview, 2/5 accept)

Rejection
UC Berkeley EECS PhD (1/12 interview, 2/20 reject)
USC CS PhD (no interview, 3/16 reject)

Pending
(I probably won't receive an admission from any of these now)
Georgia Tech CS PhD (1/8 + 1/19, two interviews; they were about to send me an offer, but after hearing I already had MIT, they didn't...)
UCLA CS PhD
NYU CS PhD

Background
NTU EE, BS, 2016-2020
GPA: overall 4.18/4.3, major 4.22/4.3
Rank: 15/177 (8%), Presidential Award x4

GRE: 322 (V152/Q170/AW3.0)
TOEFL: 102 (R30/L26/S22/W24)

Publications
Interspeech 2020 x1 (first author)
EMNLP 2020 x2 (first author/long/main + co-first author/short/findings)
EMNLP 2019 workshop x1 (co-first author/long)
Under review: NAACL 2021 x2, ICASSP 2020 x1 (second author)

Other Experience
TA: Deep Learning for Human Language Processing 2020
Intern/RA @ Academia Sinica (2018-2019)

Recommendation Letters
NTU undergrad research advisors x3

For more details, see my website or Google Scholar.

Research Experience
After finishing the ML course in the spring of my sophomore year, I started to find it amazing that machines could understand and generate human language (speech or text), answering questions or producing sentences. The field also seemed to have plenty of unknowns and things not yet done well waiting to be explored (BERT had not come out back then...). Hoping that one day we could computationally understand the secrets of human language, I dove head-first into research and gradually grew more interested, e.g. in mysterious phenomena like Multilingual BERT aligning the embedding spaces of different languages very accurately in a fully unsupervised way, with no paired data at all.

From my junior year through the first semester of senior year, research did not really go smoothly. Every time I thought of a new idea, wrote the code, and ran the experiments, all kinds of problems showed up. I kept failing, switching directions, writing code, and running experiments again, pouring huge amounts of time into research every week with nothing to show for it, staying up until morning for days on end before conference deadlines, and losing sleep in dread over results worse than expected, etc. Fortunately things slowly took a turn for the better, and in the spring of my senior year I finally squeezed out some results and got a few conference papers accepted.

I am truly grateful to my advisor Prof. Hung-yi Lee for his support all along. I got into ML through his YouTube lectures starting in my sophomore year. He can always simplify the complex, conveying very complicated concepts to students through a low-dimensional manifold(?), and he could also quickly grasp the details of my research ideas and extend the discussion further. He may look easygoing, but he is actually very attentive: he notices tiny details, and he genuinely cares about how his students feel while doing research. I started doing undergrad research with him in the fall of my junior year. I remember spending the two summer months stuck on the same topic without results, and my first conference submission in senior year getting brutally rejected; it was his constant encouragement and direction that helped me adjust my mindset and keep going, until my first conference paper was finally accepted in the spring of my senior year.

I also want to thank Prof. Yun-Nung (Vivian) Chen for her help along the way. I joined MiuLab as an undergrad researcher in the fall of my junior year and worked on a pile of topics that never panned out; I remember discussing with a senior student until two or three in the morning for quite a stretch, and still failing again and again. Even so, she always gave me lots of encouragement and advice on research directions, and carefully helped me revise my papers; in the end I got two EMNLP papers accepted in the spring of my senior year. She also gave me plenty of school-selection advice and help during application season, which led to many of my admissions.

Finally, I want to thank Prof. Lin-shan Lee. Although I interacted with him less directly than with the other two advisors, in revising papers with him I learned how to tell a research story well (he carefully deleted, rewrote, and reordered essentially every sentence of the paper). Since taking his course in my sophomore year I have kept learning from his rich life experience, and he generously wrote my third recommendation letter, which helped a great deal in my applications.


  p.s. Prof. Lee's "Signals and Life" lectures 2019 &amp;amp; 2020 are recorded on YouTube.
They are very helpful for freshmen and sophomores who still feel lost about the future.
https://www.youtube.com/watch?v=PtmthIH1JJs
https://www.youtube.com/watch?v=mBsdgTYqMio



  This reminds me of the NTUEE podcast that a brilliant classmate of mine recently launched, interviewing alumni working in industry in North America. It is up on all major platforms; give it a listen if you are interested:
https://open.firstory.me/story/cklv1nqvxw5uo0996i1tc9x67/platforms



  Speaking of podcasts, I can also recommend a good NLP podcast made by AI2, which interviews many popular professors and PhD students; pick whichever topics interest you.
https://podcasts.apple.com/us/podcast/nlp-highlights/id1235937471
https://soundcloud.com/nlp-highlights


Thoughts
In recent years, CS PhD admissions in AI/ML have grown more competitive every cycle. If you browse the pages of big-name labs at top schools, you may find that the students they admit had already been doing research or visiting at famous US labs as undergrads, holding stacks of first-author top-conference papers plus recommendation letters from big names (with a few exceptions, of course). In my view, "connection, publication, and degree of alignment" are the three essentials of an application. When choosing schools, nearly half of my list were labs that Taiwanese students had joined before, so I could at least be sure the professor was interested in graduates of Taiwanese schools, and most of them knew, or knew of, my advisors at NTU. Even without connections, look for newly arrived faculty: in their first year or two they are in expansion mode, usually short of students, and more willing to interview a range of candidates. Many schools have also been building or expanding NLP groups lately and hiring strong new professors with bright prospects, which is worth considering.


  p.s. You can set up a Twitter account just for following professors; people usually share their group's new papers there, so you get first-hand information, and even some new professors' student-recruiting announcements show up.
This is my account: https://twitter.com/YungSungChuang, which follows many professors in related fields.


As for publications, two or three top-conference papers has pretty much become the standard kit for AI/ML PhD applicants these days. More important than the count, I think, is whether the professor interviewing you has a feel for what you did, i.e. how closely your work aligns with their direction. Even with many papers, if the professor cannot see the value or strength of the research, it will not help much (unless they are really short of people, in which case it may matter less; any student who can do research gets considered).

In my own interviews, I met professors whose directions matched mine closely; what they worked on overlapped with my previous papers, and one professor even said they had recently submitted a paper to NAACL that looked a lot like my EMNLP long paper from last year... Conversely, the schools where I did not even get an interview (USC, NYU) mostly had directions slightly different from mine (even though we all do NLP research). I had read many of their labs' papers and laid out my thoughts on that research in my SoP, but the professors still did not buy it... They probably already had plenty of better-aligned candidates, so my turn simply never came.

For students who are still picking research topics and have a year or two to prepare, I would consider working on topics close to those of the labs you are interested in (which means setting a target early); with results in that direction, the SoP will also come together much more smoothly. On the other hand, if your work is too narrowly focused and only suits a few specific labs, that is risky as well. I worked on multiple topics at once (NLP + Speech), so I could apply on both sides and spread the risk; it is a trade-off.

SoP
Because I had worked on many different topics, the first draft of my SoP was scattered, like a running diary that merely listed everything I had done, and I could not find a good story to tie it together. Then one day I came across the SoP that Nelson Liu (a Stanford PhD student of Percy Liang) shares on his blog:
https://blog.nelsonliu.me/2020/11/11/phd-personal-statement/. Following his approach, I grouped all my projects under two big themes and introduced each small project around them (why did I do this project? for that theme).
That way the whole SoP reads as one coherent piece. I chose 1. Generalizability and 2. Efficiency of NLP models as my themes. The first professor I interviewed with said the two directions in my SoP were exactly the two directions he wanted to pursue (even though the project details were not identical), and he gave me a verbal offer the day after the interview.

Nelson Liu also mentions in his blog that he did not describe concrete future projects in his SoP, fearing that weak ideas would backfire; in hindsight, though, he thinks doing so would have been better:


  “I also regret not being more concrete about maybe specific projects I want to do in the future, but I recognize that it’s easy to say that in hindsight. As I was writing the statement, I was definitely afraid that my misinformed senior-year undergraduate research opinions would turn off any NLP faculty with the misfortune of reading my application, so I chose to be conservative instead and say less. Looking back, I think it might have actually looked better to have stronger personal opinions with more evidence for why I feel the way I do.”


When I first wrote the professor-specific paragraphs of my SoP, I was likewise afraid my ideas were too naive, so I conservatively wrote only "I am interested in topic XXX". On a senior's strong advice, I changed this: for each professor I came up with an extension topic related to their published papers (or combining them with my own work), that is:

"I am interested in XXX and would like to develop it toward YYY"
"I am interested in XXX, and think AAA could be combined with BBB to achieve CCC"

To write these, I spent quite a lot of time reading through each professor's papers and then thinking up reasonably feasible ideas, about one or two weeks in total. Tiring as it was, it at least told each professor: I have read your work seriously, and I have ideas and a sense of where this field can go.


  p.s. Nelson Liu's blog has another great article, an interview-plus-survey summarizing many of his classmates' experiences applying to NLP PhD programs; it is well worth reading:
https://blog.nelsonliu.me/2019/10/24/student-perspectives-on-applying-to-nlp-phd-programs/


Interview
Besides the first professor, who asked some math questions about probabilistic models, and another who asked me to derive backpropagation in Overleaf and turn it in the next day, most interviews were basically conversations: you present your work (a simple slide deck helps), and the professor asks about details or your views on extensions. My feeling is that if a professor digs into very detailed parts of your research (a sign of interest), the conversation goes pleasantly, and they even end up pitching how good their school is, your chances are probably high. Conversely, if they ask few detail questions after listening (no feel for it), as with the Berkeley professor who interviewed me, and the conversation never clicks, it probably will not work out. Of course, this is only my experience with a handful of professors and is for reference only; interviews in other fields may differ.

Reaching Out to Professors
As for emails to professors before admission: some say they help, some say they don't. I think there is no harm in sending them (though do spend time polishing the content; you will hesitate again and again before hitting send XD). You lose nothing if there is no reply, unless the professor's website explicitly says not to email them, in which case really don't.

I sent mine right after the application deadlines, before Christmas. Much like the SoP, besides introducing who I am and what I have done, I included some ideas about the professor's research, while keeping the email short enough that people would actually finish it. Of the 6 emails I sent, 3 got replies, but one professor who never replied or interviewed me still gave me an offer (and only contacted me after admission), so the outcome does not seem strongly tied to these emails.

Also, right before the deadlines I attended EMNLP 2020, held online due to the pandemic, which provided a Rocket.chat room where you could directly message any attendee, including professors you wanted to apply to. At an in-person conference, with my broken English, I would hardly dare approach a professor out of the blue XD, but text messages carry much less pressure. So I messaged several professors, and most replied (those few days are exactly for networking, after all). I got to ask whether their labs were taking students next year; had a professor said no, I could have applied to someone else instead. One professor even set up a video chat to discuss research with me (half an interview). So if you have the chance to attend a conference, make the most of it, even a virtual one.

Summary
Until my junior year I was not even sure I wanted to go abroad. My parents are ordinary people who never went to college; at first they did not really support studying abroad and just hoped I would find a stable job to support the family, and they could not have afforded the expensive tuition of a Master's degree abroad anyway.

After I started doing research in my junior and senior years, I realized I did not want to be only an engineer implementing and optimizing against a spec (even though that can also be a happy and valuable life); I wanted to come up with things nobody had thought of and make them real. Every time I read papers full of ideas and contributions from labs abroad, I thought that one day I wanted to be as good as they are.

So I gradually shifted my goal to applying for PhDs (a PhD also comes with a stipend to live on), pushed hard on publications, and gambled by not applying to any Master's programs; only then did my parents come around to supporting me.

I spent nearly all of the last two of my four undergraduate years on research, with little quality of life, coding into the small hours, and worsening myopia, which is really not great. I hope PhD life can be a bit more work/life balanced (is that even possible? XD)

Everything above is specific to my own field and may not apply to others.
I will add more if I think of anything new.
I hope this post helps someone who needs it!
</description>
        <pubDate>Sat, 20 Mar 2021 00:00:00 +0000</pubDate>
        <link>https://voidism.github.io//notes/2021/03/20/PhD-Application/</link>
        <guid isPermaLink="true">https://voidism.github.io//notes/2021/03/20/PhD-Application/</guid>
        
        <category>NLP</category>
        
        <category>Speech</category>
        
        <category>PhD</category>
        
        <category>Application</category>
        
        
        <category>notes</category>
        
      </item>
    
      <item>
        <title>My Work - Lifelong Language Knowledge Distillation</title>
        <description>Lifelong Language Knowledge Distillation

This is my work accepted by the EMNLP 2020 conference (long paper).

TL;DR

The catastrophic forgetting problem in lifelong language learning (LLL) can be mitigated if lifelong language learners continuously learn from different teachers (mainly for language generation, w/ seq-KD &amp;amp; word-KD).


  arXiv: https://arxiv.org/abs/2010.02123
  github: https://github.com/voidism/L2KD
  video: https://youtu.be/t3Ee5fA8mCo




Abstract
It is challenging to perform lifelong language learning (LLL) on a stream of different tasks without any performance degradation compared to the multi-task counterparts. To address this issue, we present Lifelong Language Knowledge Distillation (L2KD), a simple but efficient method that can be easily applied to existing LLL architectures in order to mitigate the degradation. Specifically, when the LLL model is trained on a new task, we assign a teacher model to first learn the new task, and pass the knowledge to the LLL model via knowledge distillation. Therefore, the LLL model can better adapt to the new task while keeping the previously learned knowledge. Experiments show that the proposed L2KD consistently improves previous state-of-the-art models, and the degradation compared to multi-task models in LLL tasks is well mitigated for both sequence generation and text classification tasks.
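
As a rough illustration of the word-level distillation signal mentioned in the TL;DR (this is generic word-KD, not the exact L2KD objective; all shapes and names below are made up):

```python
import numpy as np

def softmax(logits, temperature=1.0):
    # numerically stable softmax over the last axis
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def word_kd_loss(student_logits, teacher_logits, temperature=2.0):
    """Word-level KD: cross-entropy between the teacher's and the student's
    per-token distributions, averaged over positions.
    Shapes: (seq_len, vocab_size)."""
    p_teacher = softmax(teacher_logits, temperature)
    log_p_student = np.log(softmax(student_logits, temperature))
    return float(-(p_teacher * log_p_student).sum(axis=-1).mean())

# toy example: 4 positions, vocabulary of 10
rng = np.random.default_rng(0)
t = rng.normal(size=(4, 10))
loss_match = word_kd_loss(t, t)                      # student agrees with teacher
loss_off = word_kd_loss(rng.normal(size=(4, 10)), t)  # student disagrees
```

Since cross-entropy is minimized when the student matches the teacher, `loss_off` can never fall below `loss_match`.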
</description>
        <pubDate>Sun, 01 Nov 2020 00:00:00 +0000</pubDate>
        <link>https://voidism.github.io//abstract/2020/11/01/L2KD/</link>
        <guid isPermaLink="true">https://voidism.github.io//abstract/2020/11/01/L2KD/</guid>
        
        <category>EMNLP</category>
        
        <category>nlp</category>
        
        <category>deep_learning</category>
        
        
        <category>abstract</category>
        
      </item>
    
      <item>
        <title>My Work - Dual Inference for Improving Language Understanding and Generation</title>
        <description>Dual Inference for Improving Language Understanding and Generation

This is my work accepted by the EMNLP 2020 (findings paper).

TL;DR

We improve the performance of NLU and NLG by leveraging the duality between them at inference time. Our method can improve NLU/NLG without retraining the models, which matters as ever larger pre-trained models keep arriving.


  arXiv: https://arxiv.org/abs/2010.04246
  github: https://github.com/MiuLab/DuaLUG


Abstract
Natural language understanding (NLU) and natural language generation (NLG) hold a strong dual relationship, where NLU aims at predicting semantic labels based on natural language utterances and NLG does the opposite. Prior work mainly focused on exploiting the duality in model training in order to obtain models with better performance. However, given the fast-growing scale of models in the current NLP area, we sometimes have difficulty retraining whole NLU and NLG models. To better address the issue, this paper proposes leveraging the duality in the inference stage without the need for retraining. Experiments on three benchmark datasets demonstrate the effectiveness of the proposed method on both NLU and NLG, showing great potential for practical use.
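
To make the idea concrete, here is my own toy paraphrase of inference-time duality (not the paper's exact algorithm): rescore each candidate semantic label y by combining the NLU direction p(y|x) with the NLG direction p(x|y) and a prior p(y), then pick the best. All scores below are invented.

```python
import math

def dual_score(log_p_y_given_x, log_p_x_given_y, log_p_y, alpha=0.5):
    # NLU evidence p(y|x) blended with Bayes-flipped NLG evidence p(x|y)p(y)
    return alpha * log_p_y_given_x + (1 - alpha) * (log_p_x_given_y + log_p_y)

# made-up candidate scores for one utterance x
candidates = {
    "inform(food=italian)": {"nlu": math.log(0.4), "nlg": math.log(0.6), "prior": math.log(0.5)},
    "request(area)":        {"nlu": math.log(0.5), "nlg": math.log(0.1), "prior": math.log(0.5)},
}
best = max(candidates, key=lambda y: dual_score(
    candidates[y]["nlu"], candidates[y]["nlg"], candidates[y]["prior"]))
```

Here the NLG direction overturns the NLU model's slight preference for the second label, which is the point: the two directions check each other without retraining either model.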
</description>
        <pubDate>Sun, 01 Nov 2020 00:00:00 +0000</pubDate>
        <link>https://voidism.github.io//abstract/2020/11/01/DualInf/</link>
        <guid isPermaLink="true">https://voidism.github.io//abstract/2020/11/01/DualInf/</guid>
        
        <category>EMNLP</category>
        
        <category>nlp</category>
        
        <category>deep_learning</category>
        
        
        <category>abstract</category>
        
      </item>
    
      <item>
        <title>Which layer preserves the best cross-lingual representations in multilingual-BERT?</title>
        <description>Which layer preserves the best cross-lingual representations in multilingual-BERT?

In NLP research, variation across languages is a non-negligible issue that does not appear in other DL research fields (e.g. computer vision). There are over 7000 languages in the world, while most NLP datasets/corpora are in English. Cross-lingual transfer from English to other languages is desirable, especially for languages with less training data.

Before BERT appeared, research on cross-lingual transfer mainly focused on aligning independently trained monolingual word embeddings for different languages with supervised/adversarial methods (e.g. MUSE). However, after Google released the multilingual version of BERT (Multilingual-BERT), people were surprised to find cross-lingual transferability in BERT for many languages without any supervision or adversarial training (How multilingual is Multilingual BERT?; Zero-shot Reading Comprehension by Cross-lingual Transfer Learning with Multi-lingual Language Representation Model).

I found this interesting phenomenon last summer and did some t-SNE visualizations of BERT's hidden contextualized representations (from the 8th layer) using the WMT-17 paired en-zh corpus.



  If you understand Chinese, it is striking to see how well the Chinese words match their English counterparts.


I also visualized each layer of BERT; the best-aligned layers seem to be layers 7~8. In contrast, the first few layers (1~4) and the last ones (11~12) are not aligned as nicely.



  There are 12 pictures in this GIF, one from each of the 12 BERT layers.


To examine the alignment quality of BERT at different layers, I ran experiments on XNLI, a dataset that tests the cross-lingual transferability of NLI models. For a fair comparison, I did not feed features extracted from each layer directly into an XNLI model: different layers serve different levels of function for natural language understanding, and lower-level features may not score well enough on XNLI even if they are aligned correctly.

Instead, I fixed the first N layers of the BERT model and trained the layers after N on the XNLI dataset (so the total model depth the data propagates through is unchanged).
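
A minimal sketch of this fixing scheme (a toy 12-layer stack stands in for BERT here; in the real experiment the frozen layers were the Transformer layers of multilingual BERT loaded from a pretrained checkpoint):

```python
import torch.nn as nn

# Toy stand-in: a stack of 12 "layers". In the real experiment these were
# the 12 Transformer layers of multilingual BERT.
layers = nn.ModuleList([nn.Linear(16, 16) for _ in range(12)])

def fix_first_n(layers, n):
    """Freeze layers 0..n-1; keep layers n..11 trainable."""
    for i, layer in enumerate(layers):
        trainable = i not in range(n)  # True exactly for layers n and above
        for p in layer.parameters():
            p.requires_grad = trainable

fix_first_n(layers, 8)  # fix the first 8 layers, train the rest on XNLI
num_trainable = sum(1 for p in layers.parameters() if p.requires_grad)
```

The optimizer is then built only from the parameters that still require gradients, so the frozen layers keep their pretrained (and hopefully aligned) representations.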



  experimental results of XNLI


The results show that performance on XNLI correlates directly with the visualized alignment quality: the representation from layer 8 transfers best cross-lingually. On the other hand, performance on the English dataset is not significantly affected by the number of fixed layers.



  Visualization of the XNLI results
-1 means fine-tuning all layers
0 means fine-tuning everything except the embedding layer
N = 1~11 means the first N layers are fixed

</description>
        <pubDate>Mon, 27 Jan 2020 00:00:00 +0000</pubDate>
        <link>https://voidism.github.io//notes/2020/01/27/which-layer-preserves-the-best-cross-lingual-representations/</link>
        <guid isPermaLink="true">https://voidism.github.io//notes/2020/01/27/which-layer-preserves-the-best-cross-lingual-representations/</guid>
        
        <category>NLP</category>
        
        <category>deep_learning</category>
        
        
        <category>notes</category>
        
      </item>
    
      <item>
        <title>What has the positional &quot;embedding&quot; learned?</title>
        <description>What has the positional “embedding” learned?

In recent years, powerful Transformer models have become standard equipment for NLP tasks, and the positional embedding/encoding placed in front of these models has likewise been taken for granted as the standard component for capturing positional information. In the original encoder-decoder Transformer for machine translation (Vaswani et al. 2017), the positional “encoding” fills the weight matrix with sinusoidal waves. Presenting position information as sin/cos waves makes some intuitive sense, with different frequencies used in different dimensions, as shown in the figure.



  source: http://nlp.seas.harvard.edu/images/the-annotated-transformer_49_0.png
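
For reference, that sinusoidal encoding can be generated directly with the standard formula from Vaswani et al. 2017 (the dimensions below are just an example):

```python
import numpy as np

def sinusoidal_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)); PE[pos, 2i+1] = cos(same angle)."""
    pos = np.arange(max_len)[:, None]      # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]  # even dimension indices
    angle = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe

pe = sinusoidal_encoding(512, 768)  # BERT-base-like shape
```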


However, many Transformer-encoder-based pretrained models (BERT, XLNet, GPT-2... in 2018~2019) use a fully learnable matrix as the positional “embedding” in place of the sinusoidal waves.
This positional embedding is trained just like a normal word embedding layer: each row of the matrix is independent, regardless of which position index it represents. So far, there has been little discussion of what the positional embeddings actually learn.

Since I wondered whether the positional “embedding” learns any physical notion of position or is just a black-box parameter matrix, I ran some small experiments to probe it:

Regression

I trained a linear regression model whose input is a vector from the positional embedding and whose output is a scalar: the corresponding position.

BERT
bert-base-cased is used. The model was trained for 10000 epochs.

The training set consists of the vectors at even positions, i.e. [2*x for x in range(position_size//2)].
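
The probing protocol looks roughly like this. Note this is a stand-in reconstruction: a synthetic embedding matrix replaces the real BERT weights (which would be loaded from the pretrained checkpoint), purely to show the even/odd train/test split and the regression fit.

```python
import numpy as np

rng = np.random.default_rng(0)
position_size, dim = 512, 64

# Stand-in for a learned positional embedding matrix (one row per position);
# in the real probe this comes from the pretrained model. We inject an
# explicit positional signal into one dimension so the toy probe has
# something to recover.
emb = rng.normal(scale=0.1, size=(position_size, dim))
emb[:, 0] = np.arange(position_size) / position_size

train_x, train_y = emb[0::2], np.arange(0, position_size, 2)  # even positions
test_x, test_y = emb[1::2], np.arange(1, position_size, 2)    # held-out odd positions

def with_bias(x):
    # append a constant column so the regression has a bias term
    return np.hstack([x, np.ones((len(x), 1))])

# ordinary least squares fit on even positions
w, *_ = np.linalg.lstsq(with_bias(train_x), train_y, rcond=None)
pred = with_bias(test_x) @ w
mae = float(np.abs(pred - test_y).mean())  # error on the unseen odd positions
```

A low test error on the held-out odd positions means the rows of the matrix encode position in a way a linear model can read off.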

The regression model, tested on all even/odd positions, shows that BERT's positional embedding models positional information poorly, especially for positions &amp;gt; 400.


  X: the position of the input vector
Y: predicted scalar for the input


I also trained a model whose training set contains all positions (even and odd); however, the results change little.


  X: the position of the input vector
Y: predicted scalar for the input


I think the poor results for positions &amp;gt; 400 are because the original BERT implementation does not fill every batch with full sequences of length 512 (RoBERTa does, though).

RoBERTa

The same experiment on RoBERTa, with the training set being the even-position vectors:



  X: the position of the input vector
Y: predicted scalar for the input


Training set equal to all even+odd vectors:



  X: the position of the input vector
Y: predicted scalar for the input


The results are better than BERT’s.

GPT-2

The same experiment on GPT-2, with training set equal to the even vectors:



Training set equal to all even+odd vectors:



GPT-2 has a longer positional embedding table (1024 positions).
I think GPT-2's good results come from left-to-right language modeling: the GPT-2 model has to be more sensitive to the positions of its inputs. The masked language modeling task (BERT and RoBERTa), on the other hand, can rely more on bag-of-words information in the sentence, so positional information is not as important to BERT for its task.
</description>
        <pubDate>Sun, 26 Jan 2020 00:00:00 +0000</pubDate>
        <link>https://voidism.github.io//notes/2020/01/26/What-has-the-positional-embedding-learned/</link>
        <guid isPermaLink="true">https://voidism.github.io//notes/2020/01/26/What-has-the-positional-embedding-learned/</guid>
        
        <category>NLP</category>
        
        <category>deep_learning</category>
        
        
        <category>notes</category>
        
      </item>
    
      <item>
        <title>My Work - SpeechBERT: An Audio-and-text Jointly Learned Language Model for End-to-end Spoken Question Answering</title>
        <description>SpeechBERT: An Audio-and-text Jointly Learned Language Model for End-to-end Spoken Question Answering


  Updated in 2020/11/01 after published in Interspeech 2020


Authors: Yung-Sung Chuang (me), Chi-Liang Liu, Hung-Yi Lee, and Lin-shan Lee

ArXiv: https://arxiv.org/abs/1910.11559

Introduction

While various end-to-end models for spoken language understanding tasks have been explored recently, this paper is probably the first known attempt to challenge the very difficult task of end-to-end spoken question answering (SQA). Learning from the very successful BERT model for various text processing tasks, here we proposed an audio-and-text jointly learned SpeechBERT model. This model outperformed the conventional approach of cascading ASR with the following text question answering (TQA) model on datasets including ASR errors in answer spans, because the end-to-end model was shown to be able to extract information out of audio data before ASR produced errors. When ensembling the proposed end-to-end model with the cascade architecture, even better performance was achieved. In addition to the potential of end-to-end SQA, the SpeechBERT can also be considered for many other spoken language understanding tasks just as BERT for many text processing tasks.

TL;DR (for the 1st version)

Previous Spoken Question Answering (SQA) datasets were handled by first transcribing the speech into text with ASR and then treating the result as an ordinary text QA dataset with a QA model on top. However, prior work pointed out that because ASR output contains recognition errors, SQA accuracy drops substantially compared with plain-text QA (by as much as 20%, as shown below).



So we built what is probably the first end-to-end SQA model: SpeechBERT.



Borrowing the pre-trained language model (BERT) methodology, we pre-train a single cross-modal BERT on both speech and text, then fine-tune it on the SQA dataset. The final performance already beats conventional QA models. It still falls slightly behind the ASR + BERT approach overall, but when the QA questions are grouped by ASR Word Error Rate (WER), the error rate of ASR + BERT rises sharply on high-WER (~80%) questions, while our SpeechBERT maintains its usual level.


</description>
        <pubDate>Sun, 17 Nov 2019 00:00:00 +0000</pubDate>
        <link>https://voidism.github.io//project/2019/11/17/SpeechBERT/</link>
        <guid isPermaLink="true">https://voidism.github.io//project/2019/11/17/SpeechBERT/</guid>
        
        <category>nlp</category>
        
        <category>speech_processing</category>
        
        <category>deep_learning</category>
        
        
        <category>project</category>
        
      </item>
    
      <item>
        <title>My Work - Towards Understanding of Medical Randomized Controlled Trials by Conclusion Generation</title>
        <description>Towards Understanding of Medical Randomized Controlled Trials by Conclusion Generation

In Proceedings of the 10th International Workshop on Health Text Mining and Information Analysis at EMNLP (LOUHI 2019)

Authors: Alexander Te-Wei Shieh, Yung-Sung Chuang, Shang-Yu Su, and Yun-Nung Chen

ArXiv: https://arxiv.org/abs/1910.01462
Code: https://github.com/MiuLab/RCT-Gen

Introduction
Randomized controlled trials (RCTs) represent the paramount evidence of clinical medicine. Using machines to interpret the massive amount of RCTs has the potential of aiding clinical decision-making. We propose an RCT conclusion generation task from the PubMed 200k RCT sentence classification dataset to examine the effectiveness of sequence-to-sequence models at understanding RCTs. We first build a pointer-generator baseline model for conclusion generation. Then we fine-tune the state-of-the-art GPT-2 language model, which is pre-trained with general domain data, for this new medical domain task.
Both automatic and human evaluation show that our GPT-2 fine-tuned models achieve improved quality and correctness in the generated conclusions compared to the baseline pointer-generator model. 
Further inspection points out the limitations of this current approach and future directions to explore.

Model

We modified the code from huggingface/pytorch-pretrained-bert and adjusted the attention mask for fine-tuning on the seq2seq data format (from source to conclusion).



Requirements

  python3
  torch&amp;gt;=0.4.0
  nltk
  rouge


install them by:
pip install -r requirements.txt


Usage

Fine-tuning from official gpt-2 pretrained weights
usage: gpt2_train.py [-h] [--save_model_name SAVE_MODEL_NAME]
                     [--train_file TRAIN_FILE] [--dev_file DEV_FILE]
                     [--n_epochs N_EPOCHS] [--batch_size BATCH_SIZE]
                     [--pred_file PRED_FILE] [--example_num EXAMPLE_NUM]
                     [--mode MODE]

optional arguments:
  -h, --help            show this help message and exit
  --save_model_name SAVE_MODEL_NAME
                        pretrained model name or path to local checkpoint
  --train_file TRAIN_FILE
                        training data file name
  --dev_file DEV_FILE   validation data file name
  --n_epochs N_EPOCHS
  --batch_size BATCH_SIZE
  --pred_file PRED_FILE
                        output prediction file name
  --example_num EXAMPLE_NUM
                        output example number, set to `-1` to run all examples


Testing trained model
usage: gpt2_eval.py [-h] [--model_name MODEL_NAME] [--dev_file DEV_FILE]
                    [--pred_file PRED_FILE] [--example_num EXAMPLE_NUM]

optional arguments:
  -h, --help            show this help message and exit
  --model_name MODEL_NAME
                        pretrained model name or path to local checkpoint
  --dev_file DEV_FILE   validation data file name
  --pred_file PRED_FILE
                        output prediction file name
  --example_num EXAMPLE_NUM
                        output example number, set to `-1` to run all examples


Data

We used the PubMed 200k RCT dataset, which was originally constructed for sequential short text classification, with each sentence labeled as background, objective, methods, results and conclusions.

We concatenated the background, objective, and results sections of each RCT paper abstract as the model input; the model's goal is to generate the conclusions. If hint words are needed, simply concatenate them right after the results section. The transformed sample csv file can be found in data/.
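
As a small illustration of this transformation (the field names and sentence grouping below are assumptions for the sketch, not the dataset's exact schema):

```python
def build_example(abstract_sections, hint_words=None):
    """Concatenate background/objective/results as the source sequence;
    the conclusions section is the generation target.
    `abstract_sections` maps a section label to a list of sentences."""
    source_parts = []
    for section in ("background", "objective", "results"):
        source_parts.extend(abstract_sections.get(section, []))
    if hint_words:
        source_parts.append(" ".join(hint_words))  # hints go right after results
    target = " ".join(abstract_sections.get("conclusions", []))
    return " ".join(source_parts), target

src, tgt = build_example(
    {"background": ["RCTs are key evidence."],
     "results": ["The drug reduced risk."],
     "conclusions": ["The drug is effective."]},
    hint_words=["drug", "risk"])
```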

Citation

Please use the following bibtex entry:

@inproceedings{alex2019understanding,
  title     = {Towards Understanding of Medical Randomized Controlled Trials by Conclusion Generation},
  author    = {Shieh, Alexander Te-Wei and Chuang, Yung-Sung and Su, Shang-Yu and Chen, Yun-Nung},
  booktitle = {In Proceedings of the 10th International Workshop on Health Text Mining and Information Analysis at EMNLP (LOUHI 2019)},
  eprint    = {1910.01462},
  year      = {2019}
}

</description>
        <pubDate>Sun, 17 Nov 2019 00:00:00 +0000</pubDate>
        <link>https://voidism.github.io//project/2019/11/17/RCT-gen/</link>
        <guid isPermaLink="true">https://voidism.github.io//project/2019/11/17/RCT-gen/</guid>
        
        <category>EMNLP</category>
        
        <category>nlp</category>
        
        <category>deep_learning</category>
        
        
        <category>project</category>
        
      </item>
    
      <item>
        <title>RoBERTa - A Robustly Optimized BERT Pretraining Approach</title>
        <description>RoBERTa: A Robustly Optimized BERT Pretraining Approach

Paper Link

TL;DR
Finds better settings for training BERT; the main improvements:

  Train longer; use a larger batch size; use more data (though that is not the main factor)
  Remove next sentence prediction
    
      (Note: rather than saying next sentence prediction (NSP) itself should be removed, the real issue is that 50% of the time an unrelated sentence is placed in BERT's input, and that is what hurts performance. Given that unrelated sentences are present anyway, doing NSP is actually still better. The BERT authors probably missed that NSP is useless because their ablation still included unrelated sentences, so NSP appeared to help.)
    
  
  Use long sequences
  Re-sample the masked positions every time
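
The last point, re-sampling mask positions on every pass (RoBERTa's dynamic masking, versus BERT's static masking fixed at preprocessing time), can be sketched as follows; this is a toy illustration, not the actual implementation:

```python
import random

def dynamic_mask(tokens, mask_rate=0.15, mask_token="[MASK]", seed=None):
    """Re-sample which positions are masked on every call, so each epoch
    sees a different masking of the same sentence."""
    rng = random.Random(seed)
    k = max(1, round(mask_rate * len(tokens)))
    positions = set(rng.sample(range(len(tokens)), k))
    return [mask_token if i in positions else tok for i, tok in enumerate(tokens)]

sent = "the quick brown fox jumps over the lazy dog".split()
epoch1 = dynamic_mask(sent, seed=1)  # a fresh masking each epoch
epoch2 = dynamic_mask(sent, seed=2)
```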


RoBERTa takes the single-model state-of-the-art on SQuAD, GLUE, and RACE (except SQuAD 2.0, where it is second by a small margin).

Slide:


  Please wait a minute for the embedded frame to be displayed. Reading it on a computer screen is better.



In practice, though, this paper does not really beat XLNet; it only wins when trained much longer, as the comparison below shows:



But since everyone seemed to think XLNet beat BERT only because its dataset is ten times the size of the original BERT's, the XLNet authors published this rebuttal:

A Fair Comparison Study of XLNet and BERT with Large Models
https://medium.com/@xlnet.team/a-fair-comparison-study-of-xlnet-and-bert-with-large-models-5a4257f59dc0


</description>
        <pubDate>Thu, 01 Aug 2019 00:00:00 +0000</pubDate>
        <link>https://voidism.github.io//slideshare/2019/08/01/RoBERTa/</link>
        <guid isPermaLink="true">https://voidism.github.io//slideshare/2019/08/01/RoBERTa/</guid>
        
        <category>NLP</category>
        
        <category>deep_learning</category>
        
        <category>slide</category>
        
        
        <category>slideshare</category>
        
      </item>
    
      <item>
        <title>Visualizing and Measuring the Geometry of BERT</title>
        <description>Visualizing and Measuring the Geometry of BERT

Paper Link

TL;DR

A short and sweet paper. What it does:

  Attention weights carry dependency information
  Tree embedding: should be viewed with squared Euclidean distance (proof + visualization)
  The word-sense separation phenomenon in embeddings


There is also a beautifully made blog post: https://pair-code.github.io/interpretability/bert-tree/

Slide:


  Please wait a minute for the embedded frame to be displayed. Reading it on a computer screen is better.


</description>
        <pubDate>Thu, 01 Aug 2019 00:00:00 +0000</pubDate>
        <link>https://voidism.github.io//slideshare/2019/08/01/Geometric-Bert/</link>
        <guid isPermaLink="true">https://voidism.github.io//slideshare/2019/08/01/Geometric-Bert/</guid>
        
        <category>NLP</category>
        
        <category>deep_learning</category>
        
        <category>slide</category>
        
        
        <category>slideshare</category>
        
      </item>
    
      <item>
        <title>Multi-Source Unsupervised Domain Adaptation</title>
        <description>Multi-Source Unsupervised Domain Adaptation Challenge
DLCV 2019 final project

Poster Link


  We achieve 1st/2nd place on Kaggle public/private leaderboard! (team: b05901dlcv)
https://www.kaggle.com/c/dlcv-spring-2019-final-project-1/leaderboard


Yung-Sung Chuang, Chen-Yi Lan, Hung-Ting Chen, Chang-Le Liu
Department of Electrical Engineering, National Taiwan University

Abstract

  In this work, we tried many unsupervised domain adaptation (UDA) models on the multi-source dataset DomainNet [1]. We mainly used Adversarial Discriminative Domain Adaptation (ADDA) [2] and Maximum Classifier Discrepancy (MCD) [3] in this work, and slightly adjusted the ADDA training process to make it better suited to the multi-source challenge, as described below.
  We also tried M$^3$SDA [1], which is designed for the multi-source setting; however, its training accuracy got stuck and could not beat single-source based methods in our experiments.




  Data Examples from DomainNet - http://ai.bu.edu/M3SDA/




  Data Statistics from DomainNet - http://ai.bu.edu/M3SDA/


Fuzzy Adversarial Discriminative Domain Adaptation
Method

We took the idea from ADDA [2] and made slight modifications to the training process. We name this method FADDA:


  In Stage 1, we pretrain the feature extractor and classifier on source data using standard cross-entropy loss, until the model converges.

  In Stage 2, the full model is shown in Figure 2. We initialize the source feature extractor $G_s$, the target feature extractor $G_t$, and the classifier $F$ with the model pretrained in the first stage.



Training Steps in Stage 2

Note that $G_s$ and $G_t$ are initialized as the same model. For the discriminator $D$, we use a two-layer fully connected neural network. We then train the whole model jointly in the following order, which differs slightly from the original ADDA model:

(For simplicity, we denote the three source domain training set as $S_1$, $S_2$, $S_3$ respectively, and the batches as $b_i^{S_1}$, $b_{i}^{S_2}$, $b_{i}^{S_3}$. The target domain training set and the batches are denoted as $T$ and $b_{i}^{T}$).


  Step 1. Train three batches $b_{i}^{S_1} \in S_1$, $b_{i}^{S_2} \in S_2$, and $b_{i}^{S_3} \in S_3$ consecutively. In each minibatch, we compute the losses $\mathcal{L}_{adv,src}$ and $\mathcal{L}_{class}$, where $\mathcal{L}_{adv,src}$ stands for the adversarial loss for $D$ and $\mathcal{L}_{class}$ for the cross-entropy loss for $F$:


$$ \mathcal{L}_{class} = -\sum_{i, j} \sum_{k \in K, \mathbf{x}_s \in b_{i}^{S_j}} \mathbf{1}_{[k = y_s]}\log F(G_s(\mathbf{x}_{s}))$$


$$ \mathcal{L}_{adv,src} = \sum_{i, j} \sum_{\mathbf{x}_s \in b_{i}^{S_j}} -\log D(G_s(\mathbf{x}_{s}))$$



  Step 2. Train three minibatches $b_{3i}^{T}$, $b_{3i+1}^{T}$, $b_{3i+2}^{T} \in T$ and compute $\mathcal{L}_{adv,tgt}$ for $D$, with all the other modules fixed, and only consider the gradient of $D$. We train three minibatches in this step to balance the amount of data seen in step 1.



$$ \mathcal{L}_{adv,tgt} = \sum_{j} \sum_{\mathbf{x}_t \in b^{T}_{j}} -\log (1 - D(G_t(\mathbf{x}_{t})))$$



  Step 3. After Steps 1 and 2, we optimize $G_s$, $F$, and $D$ simultaneously:



$$ \min_{G_s,F,D} \mathcal{L}_{class} + \mathcal{L}_{adv,src} + \mathcal{L}_{adv,tgt}$$



  Step 4. Lastly, we optimize $G_t$ (which acts like the generator in standard adversarial training) on $b_{3i}^{T}$, $b_{3i+1}^{T}$, $b_{3i+2}^{T} \in T$ to fool $D$, with both $D$ and $F$ fixed:



$$ \min_{G_t} -\mathcal{L}_{adv,tgt}$$
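
The Stage 2 losses above can be sketched numerically. Below is a minimal NumPy illustration (the helper names are ours, not from the post); it ignores autograd and the actual networks, treating the classifier logits and discriminator outputs as given arrays:

```python
import numpy as np

def class_loss(logits, labels):
    """Cross-entropy L_class: negative log-probability of the true class,
    summed over a source batch. `logits` are the raw outputs of F(G_s(x))."""
    z = logits - logits.max(axis=1, keepdims=True)          # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].sum()

def adv_src_loss(d_src):
    """L_adv,src = -sum log D(G_s(x_s)), where D outputs P(input is source)."""
    return -np.log(d_src).sum()

def adv_tgt_loss(d_tgt):
    """L_adv,tgt = -sum log(1 - D(G_t(x_t)))."""
    return -np.log(1.0 - d_tgt).sum()

# Step 3 minimizes class_loss + adv_src_loss + adv_tgt_loss over G_s, F, D;
# Step 4 minimizes the negated adv_tgt_loss over G_t to fool D.
```

In a real run these scalars would be computed inside a deep-learning framework so that gradients flow back into $G_s$, $G_t$, $F$, and $D$.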


Maximum Classifier Discrepancy (MCD)
Motivation
Distribution-matching-based UDA algorithms (e.g. ADDA…) have some problems:

  They only align the latent feature distributions, without knowing whether the decision boundary trained on the source domain is still appropriate for the target domain (Figure 3).
  The generator tends to generate ambiguous features near the boundary, because doing so makes the two distributions similar.



Method
To take class relationships into account, the MCD method [3] aligns source and target features by utilizing two task-specific classifiers as a discriminator, exploiting the relationship between decision boundaries and target samples. See the example in Figure 4.




  First, we pick out target samples that are likely to be misclassified by the classifier learned from source samples.
  Second, by minimizing the disagreement of the two classifiers on the target predictions while updating only the generator, the generator avoids generating target features outside the support of the source.


Training Steps

  Step 1. Train both classifiers and the generator to classify the source samples correctly.



$$\min_{G,F_1,F_2} \mathcal{L}_{class}(X_{s},Y_{s})$$



  Step 2. Fix the generator and train the two classifiers ($F_1$ and $F_2$) to maximize their discrepancy on target features. At the same time, we still train $F_1$ and $F_2$ to minimize the classification loss, in order to keep the performance on source data.



$$\min_{F_1,F_2} \mathcal{L}_{class}(X_{s},Y_{s}) - \mathcal{L}_{\rm adv}(X_{t})$$



$$\mathcal{L}_{\rm adv}(X_{t}) = {\mathbb{E}_{\mathbf{x_{t}}\sim X_{t}}}[d(p_1(\mathbf{y}|\mathbf{x_t}),p_2(\mathbf{y}|\mathbf{x_t}))]$$





  Step 3. Fix the classifiers and train the generator to minimize the discrepancy between the two classifiers. Step 3 is repeated 4 times in our experiments.



$$\min_{G} \mathcal{L}_{\rm adv}(X_{t})$$
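
The discrepancy $d$ driving Steps 2 and 3 can be sketched numerically. The choice of $d$ as the mean absolute difference between the two classifiers' class probabilities follows the MCD paper [3]; the helper names below are ours:

```python
import numpy as np

def softmax(z):
    """Row-wise softmax over class logits."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def discrepancy(logits1, logits2):
    """d(p1, p2): mean absolute difference between the two classifiers'
    class-probability outputs, averaged over classes and over the batch."""
    p1, p2 = softmax(logits1), softmax(logits2)
    return np.abs(p1 - p2).mean(axis=1).mean()

# Classifiers that agree give zero discrepancy; disagreement raises it.
same = np.array([[2.0, 0.0]])
diff = np.array([[0.0, 2.0]])
print(discrepancy(same, same))               # 0.0
print(round(float(discrepancy(same, diff)), 3))   # 0.762
```

Step 2 maximizes this quantity over $F_1, F_2$ (the minus sign in the objective) while Step 3 minimizes it over $G$, giving the adversarial min–max interplay.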




Model Details and Training Settings

The model details and training settings of our experiments are listed below:


  Feature extractor: We choose ResNet-50, ResNet-152, and Inception-ResNet-v2 [4] as the feature extractor $G$ in our experiments.
  Classifier: In all experiments (FADDA and MCD), we use a simple one-layer fully-connected network as the classifier $F$, which projects from the feature dimension (e.g. 2048, 1536, …) to the number of classes (e.g. 345).
  Discriminator: In FADDA, we use a simple three-layer fully-connected network as the discriminator. The input size is the feature dimension, and the hidden layers have size 512.
  Optimizer: We use SGD with learning rate $10^{-4}$, momentum 0.9, and weight decay $10^{-4}$ for all modules.
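
A shape-level sketch of the heads described above, in NumPy with random stand-in weights (the exact discriminator layout feature → 512 → 512 → 1 is our assumption; a real run would use a deep-learning framework):

```python
import numpy as np

rng = np.random.default_rng(0)
feat_dim, n_class, hidden = 2048, 345, 512   # ResNet-50 features, DomainNet classes

relu = lambda x: np.maximum(x, 0.0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

# Classifier F: a single linear projection from feature dimension to class logits.
W_f = rng.normal(0.0, 0.01, (feat_dim, n_class))
classify = lambda h: h @ W_f

# Discriminator D: three fully-connected layers with 512-unit hidden layers,
# ending in a sigmoid that scores P(feature came from the source domain).
W1 = rng.normal(0.0, 0.01, (feat_dim, hidden))
W2 = rng.normal(0.0, 0.01, (hidden, hidden))
W3 = rng.normal(0.0, 0.01, (hidden, 1))
discriminate = lambda h: sigmoid(relu(relu(h @ W1) @ W2) @ W3)

h = rng.normal(size=(4, feat_dim))           # a batch of 4 feature vectors
print(classify(h).shape, discriminate(h).shape)   # (4, 345) (4, 1)
```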


Experiment Results

Before applying our methods, we also train a baseline model on the combined source data without any adaptation, called the “naive” method in Table 1.

Table 1: Main Experiment Results


| Method | inf, qdr, rel $\rightarrow$ skt | inf, skt, rel $\rightarrow$ qdr | qdr, skt, rel $\rightarrow$ inf | inf, qdr, skt $\rightarrow$ rel |
|---|---|---|---|---|
| weak baseline | 23.1 | 11.8 | 8.2 | 41.8 |
| strong baseline | 33.7 | 13.3 | 13.0 | 53.1 |
| naive - ResNet-50 | 37.1 | 9.9 | 16.7 | 55.0 |
| naive - ResNet-152 | 42.7 | 12.6 | 18.8 | 56.2 |
| naive - Incep-ResNet-v2 | 47.4 | 13.5 | 21.5 | 60.6 |
| FADDA - ResNet-152 | 44.1 | 16.3 | 20.0 | 59.4 |
| FADDA - Incep-ResNet-v2 | 47.1 | 15.2 | 19.8 | 63.6 |
| MCD - Incep-ResNet-v2 | 48.8 | 14.9 | 22.8 | 64.7 |


Table 2: Comparison between ADDA and FADDA.


| Method | inf, qdr, rel $\rightarrow$ skt | inf, skt, rel $\rightarrow$ qdr | qdr, skt, rel $\rightarrow$ inf | inf, qdr, skt $\rightarrow$ rel |
|---|---|---|---|---|
| ADDA - Incep-ResNet-v2 | 46.4 | 12.8 | 18.6 | 62.3 |
| FADDA - Incep-ResNet-v2 | 47.1 | 15.2 | 19.8 | 63.6 |


Table 3: Comparison between single-source MCD, multi-source MCD, and M$^3$SDA.

We have also tried using multiple pairs of classifiers, one per source domain, in our MCD method, similar to M$^3$SDA (but without moment matching). However, the results were not as expected and could not even match the source-combined MCD method.


| Method | inf, qdr, rel $\rightarrow$ skt | inf, skt, rel $\rightarrow$ qdr | qdr, skt, rel $\rightarrow$ inf | inf, qdr, skt $\rightarrow$ rel |
|---|---|---|---|---|
| single-MCD - ResNet-50 | 43.2 | 11.7 | 19.5 | 57.1 |
| multi-MCD - ResNet-50 | 33.9 | 9.3 | 11.5 | 44.6 |
| M$^3$SDA - ResNet-50 | - | - | - | 43.7 |


Conclusion


  The naive method performs reasonably well; it is strong enough to pass all the strong baselines.
  In most cases, Inception-ResNet-v2 outperforms ResNet-50 and ResNet-152.
  We proposed FADDA, which performs slightly better than the original ADDA.
  Single-MCD is more stable and powerful to train than multi-MCD and M$^3$SDA in our experiments. This indicates that multi-source methods are still challenging to design; in our case, they could not leverage the differences between the source domains to improve accuracy.
  We think the images from the quickdraw domain (composed only of black lines on a white background) are so different from normal images that the single best model on it is not Incep-ResNet-v2 (MCD) but ResNet-152 (FADDA).


References


  [1] Xingchao Peng, Qinxun Bai, Xide Xia, Zijun Huang, Kate Saenko, and Bo Wang. Moment matching for multi-source domain adaptation. arXiv preprint arXiv:1812.01754, 2018.
  [2] Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. Adversarial discriminative domain adaptation. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2962–2971, 2017.
  [3] Kuniaki Saito, Kohei Watanabe, Yoshitaka Ushiku, and Tatsuya Harada. Maximum classifier discrepancy for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3723–3732, 2018.
  [4] Christian Szegedy, Sergey Ioffe, and Vincent Vanhoucke. Inception-v4, inception-resnet and the impact of residual connections on learning. CoRR, abs/1602.07261, 2016.

</description>
        <pubDate>Sun, 30 Jun 2019 00:00:00 +0000</pubDate>
        <link>https://voidism.github.io//project/2019/06/30/MultiSourceDomainAdaptation/</link>
        <guid isPermaLink="true">https://voidism.github.io//project/2019/06/30/MultiSourceDomainAdaptation/</guid>
        
        <category>Computer_vision</category>
        
        <category>deep_learning</category>
        
        
        <category>project</category>
        
      </item>
    
  </channel>
</rss>
