<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://jasonppy.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://jasonppy.github.io/" rel="alternate" type="text/html" /><updated>2025-10-14T22:33:56-04:00</updated><id>https://jasonppy.github.io/feed.xml</id><title type="html">Puyuan Peng</title><subtitle>Personal website of Puyuan Peng</subtitle><author><name>Puyuan Peng</name></author><entry><title type="html">Why You Should Start an AI PhD Now</title><link href="https://jasonppy.github.io/story/best-time-AI-phd/" rel="alternate" type="text/html" title="Why You Should Start an AI PhD Now" /><published>2025-10-14T00:00:00-04:00</published><updated>2025-10-14T00:00:00-04:00</updated><id>https://jasonppy.github.io/story/best-time-AI-phd</id><content type="html" xml:base="https://jasonppy.github.io/story/best-time-AI-phd/"><![CDATA[<h2 id="1-you-are-pushing-the-boundary-of-this-industrial-revolution">1. You are pushing the boundary of this industrial revolution</h2>

<p>You are pushing the boundary of this industrial revolution — and you are the one deciding which direction to push it. If you have big ambitions, there is nothing bigger than this. Yes, with just two GPUs at your school lab you won’t build the next LLM that tops every leaderboard, but you could design a new architecture that runs faster and uses fewer resources, you could analyze why those “PhD-level” LLMs still fail at basic reasoning tasks like counting, you could even invent an entirely new paradigm.</p>

<p>When Hinton worked on neural nets in the 1980s, most AI researchers were focused on symbolic reasoning and expert systems — neural nets were widely dismissed as impractical. When LeCun developed convolutional neural nets in the 1990s, the community was focused on support vector machines and other statistical methods — people even called them “convoluted” neural nets to mock their complexity. When GPT-3 came out in 2020, I had a job interview where the interviewer told me that OpenAI was only pushing GPT for branding purposes, because everyone else was using bidirectional Transformers like BERT, which performed much better than causal Transformers like GPT.</p>

<p>You might not be the author of a revolutionary paper like GPT-3, but that paper cited 146 others — you could be the author of one of them. Each of those 146 papers also cited hundreds more, and your name might appear somewhere down the chain. Most scientific progress is incremental: even if your work isn’t the one that changes everything, it could be the building block that makes the final breakthrough possible.</p>

<p>With all these tech companies dropping new models every week, it’s easy to get the illusion that research only happens in industry. But that’s completely wrong — we in industry constantly read academic papers, and sometimes realize that people in academia have already found smarter ways to solve the same problems. Academia isn’t falling behind; in fact, many of the best ideas originate there and are later scaled up and turned into products by industry.</p>

<h2 id="2-honing-your-all-round-skills-in-building-ai">2. Honing your all-round skills in building AI</h2>

<p>During your PhD, you own your project end-to-end, which means you touch almost every part of an AI system — data, infrastructure, modeling, evaluation, and often even building demos and promoting your work. You become deeply familiar with every component that makes AI systems work.</p>

<p>Owning a project end-to-end might not sound like a big deal, but it’s actually rare in industry. In most companies, you work in a team where each member focuses on a small piece of the system. Because of this, someone who has hands-on experience with the full pipeline becomes much more valuable than someone who has only specialized in one area.</p>

<p>When I started my job search, I watched videos about how to get a job in big tech as a new grad — and was frustrated that the most important thing seemed to be solving LeetCode problems. My advisor pointed out that job hunting for CS undergrads and AI PhDs is very different. As a PhD, you already have a multi-year track record of publications and open-source contributions. If your record shows that your skills align with what a hiring team needs, LeetCode shouldn’t matter as much. During my own job search, I failed many coding interviews because I didn’t prepare LeetCode — yet half of the companies I interviewed with still made me offers because my research stood out.</p>

<h2 id="3-suffering-builds-mental-strength">3. Suffering builds mental strength</h2>

<p>To outsiders, AI might look like a glamorous field — every day there are new models, new products, new funding rounds. But doing a PhD in AI often means long periods of quiet thinking, working alone, and struggling with problems that don’t seem to move forward.</p>

<p>The nature of scientific research is that 80% of your ideas won’t work. Of the remaining 20%, half will fail due to implementation errors. When things don’t work, you’re usually the only one who can fix them, because you’re the only person who knows every detail of your project. You might wish everyone else were struggling too, but in such a fast-moving field, you’ll constantly see people posting shiny new results on social media — which makes the struggle even harder.</p>

<p>The life of an AI PhD often looks like this: wake up, figure out why yesterday’s idea didn’t work, modify it, code, debug, and think — all on your own. What keeps you going? You need to genuinely love your project, enjoy coding, and be addicted to problem-solving.</p>

<p>When you embark on a journey knowing that nine out of ten attempts will fail — and still do it anyway — you are a true adventurer who values the process more than the outcome. And if you can persist like that for five years and succeed in the end, the rest of your career will only feel easier.</p>

<h2 id="the-end">The End</h2>

<p>You’re not just building models — you’re building understanding. A PhD might turn you into a celebrated researcher whose every paper shakes the field, but more often, it makes you a quiet thinker — someone who works in silence, guided by the belief that what you’re doing matters, and patiently trying out every idea that might bring it to life.</p>

<p>Whether or not anyone notices, the world moves a little further because you tried.</p>]]></content><author><name>Puyuan Peng</name></author><category term="Story" /><summary type="html"><![CDATA[1. You are pushing the boundary of this industrial revolution]]></summary></entry><entry><title type="html">Why You Shouldn’t Start an AI PhD Now</title><link href="https://jasonppy.github.io/story/worst-time-AI-phd/" rel="alternate" type="text/html" title="Why You Shouldn’t Start an AI PhD Now" /><published>2025-10-14T00:00:00-04:00</published><updated>2025-10-14T00:00:00-04:00</updated><id>https://jasonppy.github.io/story/worst-time-AI-phd</id><content type="html" xml:base="https://jasonppy.github.io/story/worst-time-AI-phd/"><![CDATA[<h2 id="1-research-direction-is-largely-decided-by-industry">1. Research direction is largely decided by industry</h2>

<p>I pursued three different research directions during my PhD, and two of them were heavily inspired by papers from industry — my Prompting Whisper work was a follow-up to OpenAI’s Whisper, and my VoiceCraft model was inspired by Microsoft’s VALL-E.</p>

<p>In fact, industry is the one that decides how the game is played, because they have far more compute and data. I’ve heard it more than once: someone at school works on a task for three months, finally gets it to work, and then a big tech company releases a much larger model trained on vastly more data that achieves state-of-the-art results on ten tasks — including that very task the poor PhD student has been working on. It’s tough to accept that something you’ve spent months improving by 5% is suddenly overtaken by an industrial model that improves it by 30%.</p>

<p>Academia barely stands a chance when competing with industry in building state-of-the-art AI models. If you’re only interested in chasing leaderboards, very few academic labs will have the resources to support your work.</p>

<p>Perhaps more crushingly, some of the most impactful work in modern AI wasn’t done by people with PhDs. For example, Alec Radford — the creator of GPT, CLIP, and Whisper — does not have a PhD.</p>

<h2 id="2-poverty">2. Poverty</h2>

<p>Doing a PhD never makes financial sense, even in the hottest field — AI.</p>

<p>Right now, entry-level AI research scientists at big tech typically make between $300k and $700k a year. Let’s take the average, $500k. It sounds like a lot for a new grad, but it takes at least five years to complete a PhD before you can even interview for such a job. During those five years, you make $20–40k a year — barely enough to get by.</p>

<p>In another world, if you hadn’t gone for a PhD five years ago and had instead started as a software engineer, working just as hard as you would in a PhD, you would be a senior or staff engineer at a big tech company by now — with median salaries of roughly $500k and $770k, respectively. Plus, during those five years, you’d have earned far more money as a junior and mid-level software engineer than as a PhD student.</p>

<p>I still remember the feeling when I had only $2,000 in my bank account when I arrived at UT Austin for my PhD in September 2021. My rent was $1,100, and my PhD stipend after tax was $2,000. It wasn’t fun at all knowing that every time I got a direct deposit, more than half of it went straight to rent.</p>

<h2 id="3-struggling">3. Struggling</h2>

<p>Almost every PhD student goes through periods of depression. The situation is especially tough for AI PhDs right now because the field is so crowded — almost any idea you can think of is either already published or will be published next month by someone else.</p>

<p>Doing impactful research requires a mindset completely opposite to the state of AI right now. You need to spend endless hours surveying the literature, finding a unique angle to tackle a problem, adapting or building codebases, running experiments, and analyzing why things don’t work. During this time (often 3–8 months), there will be weeks or even months when you feel you’re not making any progress — and in the frantic pace of AI, where new models are released daily, you’ll feel even worse.</p>

<p>Unlike in industry, where you usually work in a team, most PhD projects are done alone. Yes, you’ll have collaborators who provide feedback during meetings, but you’re the one actually writing the code and running the experiments. Often, the reason something doesn’t work isn’t because the idea is wrong, but because the implementation is slightly off — and in those cases, you’re the only one who can fix it.</p>

<hr />
<p>If you have reached here, please check out my article on <a href="https://jasonppy.github.io/story/best-time-AI-phd/">why this is the best time to do a PhD in AI</a>.</p>]]></content><author><name>Puyuan Peng</name></author><category term="Story" /><summary type="html"><![CDATA[1. Research direction is largely decided by industry]]></summary></entry><entry><title type="html">PhD in AI – My Experience</title><link href="https://jasonppy.github.io/story/phd-in-AI-my-experience/" rel="alternate" type="text/html" title="PhD in AI – My Experience" /><published>2025-09-29T00:00:00-04:00</published><updated>2025-09-29T00:00:00-04:00</updated><id>https://jasonppy.github.io/story/phd-in-AI-my-experience</id><content type="html" xml:base="https://jasonppy.github.io/story/phd-in-AI-my-experience/"><![CDATA[<p>In October 2024, I contacted 9 companies for research positions in AI. I received interview invites from 8 of them, finished full interview loops with 6, and got offers from 4. The total market cap of the 4 companies that offered me a research scientist job was $10 trillion by mid-September 2025. I chose Meta.</p>

<p>I couldn’t have imagined this 4 years ago, when I set foot in the CS department at UT Austin, with no prior CS degree and very little research experience.</p>

<h2 id="1-getting-into-ai-without-a-cs-degree">1. Getting into AI without a CS degree</h2>

<p><em>Jan 2020 – Dec 2020</em></p>

<p>I did my undergrad in math at Beijing Normal University and a master’s in statistics at the University of Chicago.</p>

<p>In my first year of the master’s program, I tried to get a job in data science. I practiced LeetCode but found it too hard and impractical. I tried Spark, scikit-learn, and other data analysis tools and found them too boring. After submitting about 30 applications, only 3 companies sent me online assessments, and none gave me an offer. I realized the field was overly crowded.</p>

<p>At the same time, I took a course offered by Prof. Karen Livescu at TTIC called Speech Technologies and did a successful course project, which convinced Prof. Livescu to continue working with me during the summer of 2020. She eventually recommended me to her academic sibling Prof. David Harwath at UT Austin for PhD studies. Both Karen and David did their PhDs at MIT under Prof. Jim Glass (the lab that produced many of the top speech researchers in the US).</p>

<p>I was shocked when I received the offer letter from the UT Computer Science department because their website explicitly said that their PhD program is very competitive and not recommended for people without a CS background—and I had no CS background. In addition, they required a TOEFL transcript, but I only had an expired IELTS score. With no hope of getting a reply, I emailed the admin. To my surprise, they replied immediately, saying that a barely expired IELTS would work—just email them the transcript and they’d add it for me.</p>

<h2 id="2-preparing-for-a-fast-track-phd">2. Preparing for a fast-track PhD</h2>

<p><em>Apr 2021 – Aug 2021</em></p>

<p>After accepting the PhD offer, during my meeting with David I asked him how many years he expected a PhD to take. David said 6 years was the standard, and he himself had taken 7 because he changed research direction during his PhD.</p>

<p>I wasn’t interested in spending 6 years, even though I might need to, since I had to take undergrad CS courses to make up for my non-CS background.</p>

<p>To make up for my weak research background, two months after I accepted the PhD offer, in June 2021, I emailed David for research directions. To my surprise, he replied with a long list of 10 research ideas (which I knew very little about). I tried my best to search for anything I could find before our next meeting, and during the meeting, we decided to go with using Transformers to model visually grounded speech. I picked this topic because:</p>

<ol>
  <li>Transformers had already proven to be the next-generation architecture—low risk and very rewarding for a research newbie like me.</li>
  <li>The research area of visually grounded speech was co-created by my advisor David, so I could receive a lot of great feedback from him.</li>
</ol>

<p>Before I even started working on this project, I spent a month reading the PyTorch Lightning codebase and wrote a simplified version of it for myself. That way, I could avoid writing boilerplate code while having a framework I was very familiar with.</p>

<p>Doing the infrastructure work first turned out to be extremely beneficial—the framework I wrote ended up being used in almost every one of my PhD projects, and as I progressed, I continuously refined it to add new features and make it more flexible.</p>

<h2 id="3-first-year--sweet-academia">3. First year – sweet academia</h2>

<p><em>Sep 2021 – May 2022</em></p>

<p>Once the infrastructure was done, the project was very straightforward. Just like Transformers were outperforming ConvNets in many other fields, with enough data and proper pretraining, Transformers led to significantly better performance in visually grounded speech as well. I felt like I did very little work, and the project just worked. After about a month, our model started to outperform baselines, and another month later, we further boosted the performance and started writing the paper.</p>

<p>We posted the paper in September 2021—the first month of my PhD!</p>

<p>The wonderful start encouraged me to work harder. I decided not to take weekends off for the first year of my PhD. To force myself to go to campus every day rather than stay in my apartment, I bought an unlimited meal plan at UT. For $1,600 per semester, I got unlimited breakfast, lunch, and dinner at any dining hall. This also helped me avoid cooking or eating out. That was especially important because when I settled into my apartment near UT, I only had $2,000 left in my bank account. And in the first year, my salary after tax was only $2,000 per month, while my rent was $1,100.</p>

<p>The only thing I thought about in my first year was visually grounded speech. One issue with the model I published in September 2021 was that it had many complicated modules, making it difficult for others to use. I therefore proposed simplifying the architecture into a two-tower model: a vision tower and an audio tower. Unfortunately, after a few weeks of tuning, the simplified model could not surpass the old, more complicated one.</p>

<p>The vision transformer I used was called DINO. One intriguing behavior of DINO is that even though it was never trained on semantic labels, its internal representation could segment an image into different semantic parts. Out of curiosity, I looked at the self-attention and internal representation in my audio transformer, and was shocked to see that the model was segmenting out individual syllables and words in continuous speech! The model had never seen words during training. This was very similar to how humans acquire language—by looking and listening.</p>

<p>We summarized the finding and wrote it up into two papers on self-supervised word and syllable discovery. My work on visually grounded speech received a lot of attention, and I got to present at ENS Paris, TTIC, and UT’s Developing Intelligence Lab—all in my first year.</p>

<h2 id="4-second-year--the-realization-and-the-pivot">4. Second year – the realization and the pivot</h2>

<p><em>May 2022 – May 2023</em></p>

<p>Publishing papers and giving (virtual) talks around the world made my first year feel smooth. What I didn’t expect was to then enter a 6-month stretch where nothing worked.</p>

<p>Hooked on the idea of emergent linguistic unit discovery, I explored every possible method to make models discover linguistic units more accurately without textual supervision. Around this time, David connected me with Abdelrahman Mohamed and Daniel Li from Meta (who later became long-term collaborators and had a huge impact on my life). Even though my connections broadened, my research stalled. During weekly meetings with David, Abdo, and Daniel, I sometimes felt embarrassed that everything I tried had failed.</p>

<p>To make things worse, I realized that citations are often treated as a key measure of PhD achievement. My visually grounded speech papers weren’t being cited much. It was disappointing to see that even though well-known researchers were interested, the work wasn’t spreading widely.</p>

<p>From April to September 2022, I gradually sank into depression. I had published 3 papers in my first year, but they seemed to have little impact. Now I was stuck, and nothing was working. I remember sitting in a chair listening to podcasts or audiobooks for hours, but I couldn’t shake the sense of failure.</p>

<p>Then, at the end of September 2022, OpenAI released their first speech model—Whisper. I immediately realized it was my chance to pivot. Whisper is a speech recognition model trained on web-scale data. Textual large language models trained on web-scale data had already shown emergent zero-shot capabilities: during inference, you can prompt them to do tasks they weren’t explicitly trained on. Whisper is essentially an audio-conditioned LLM—could we prompt it to perform unseen tasks during inference?</p>

<p>After some trial and error, I found that with carefully designed prompts, Whisper could indeed handle unseen recognition tasks. Since I had less experience in recognition, I contacted Prof. Shinji Watanabe at CMU, who connected me with his student Brian Yan. We started collaborating and quickly landed the paper.</p>

<p>Our Prompting Whisper paper received far more attention from industry than my visually grounded speech work. For the first time, I saw people talking about my research online—without me initiating it.</p>

<p>Wanting to continue the momentum, I tried fine-tuning Whisper to push performance even further. That turned out to be a mistake: fine-tuning wasn’t intellectually interesting, and the massive compute required made iteration painfully slow. Sometimes I’d launch an experiment, wait days for results, and get nothing useful.</p>

<h2 id="5-third-year--pivot-again-and-an-8k-github-star-model">5. Third year – pivot again and an 8k-GitHub-star model!</h2>

<p><em>May 2023 – May 2024</em></p>

<p>After struggling with fine-tuning Whisper for a month, I started looking for new directions. Speech recognition was maturing—approaching commercial quality—but speech generation, such as text-to-speech, was lagging behind. Commercial systems needed huge amounts of engineering to work properly.</p>

<p>In January 2023, Microsoft released VALL-E, a large language model for text-to-speech with zero-shot voice cloning. Its elegant, scalable design convinced me: this was the GPT moment for speech.</p>

<p>I decided to dive into speech generation, even though I was the only person in my lab working on it—and my advisor didn’t have much experience in the field either.</p>

<p>To catch up, I spent a month reading papers and reaching out to seasoned researchers for virtual chats. I chose to work on a unified LLM-based model for speech editing and voice cloning text-to-speech. I loved audiobooks and podcasts, and editing—removing filler words or replacing small sections—was a natural use case. Voice cloning TTS could be seen as a subtask of editing, so it made sense to unify them.</p>

<p>The project wasn’t easy. Being the only person in the lab working in a new area meant solving a lot of hidden problems that papers didn’t explain. I spent 8 months on it. For example, the baseline model wasn’t trained on large-scale data. To compare fairly, I had to retrain it on our data—but its code wasn’t scalable. I had to rewrite it entirely. It was also my first time coding up a language model from scratch, and small mistakes, like misplacing a special token, could break the whole system.</p>

<p>By January 2024, the model—VoiceCraft—was ready. We quickly realized there was nothing else as powerful. Instead of just publishing, we did three things that turned VoiceCraft into a brand:</p>

<ol>
  <li>Open-sourced the code and weights</li>
  <li>Provided ongoing support for developers worldwide through GitHub</li>
  <li>Built polished demos and shared them widely</li>
</ol>

<p>That made all the difference. VoiceCraft went viral: 8,000+ GitHub stars, posts with hundreds of thousands of views, and even a subreddit created by the community. Someone posted on Reddit, “VoiceCraft: I’ve never been more impressed in my entire life!” Famous figures like Marc Andreessen followed me, and VCs from Sequoia and Microsoft Ventures reached out.</p>

<p>For two months, a community grew around VoiceCraft—building demos, making it accessible to non-technical users.</p>

<p>VoiceCraft became the first open-sourced LLM-based text-to-speech research project. Its success inspired more researchers to open-source their own models, sparking a boom in speech generation research.</p>

<p>Afterward, I did an internship at Meta NYC with Wei-Ning Hsu. (I’ll share more about that in another post.)</p>

<h2 id="6-fourth-year--the-busiest-year-of-my-life">6. Fourth year – the busiest year of my life</h2>

<p><em>Sep 2024 – Apr 2025</em></p>

<p>When I returned, I was in my fourth and final year.</p>

<p>The fall of 2024 was my busiest semester: I was juggling 4 things at once—1) thesis writing, 2) job search and interviews, 3) my final project at school, and 4) my internship project at Meta.</p>

<p>From mid-September to late November, my days started at 7:30. I’d read a mantra I wrote to prepare myself mentally, then go to the gym to prepare myself physically. Then the real work began.</p>

<p>Interviews were the hardest part. Early on, I bombed both coding and behavioral interviews. For coding, I wasn’t prepared—by choice. I decided not to grind LeetCode, because the problems felt disconnected from real research. I also didn’t have time, with my thesis and projects demanding attention. Looking back, now that I’ve gone through 50 interviews, I’d say: if you can make time, practice LeetCode as much as possible.</p>

<p>For non-technical interviews, I didn’t realize how much doing a PhD had eroded my communication skills. I often said awkward things. Luckily, I kept notes reflecting on every interview, and gradually improved.</p>

<p>I usually stopped working after 7, went for a 40-minute walk, had dinner, listened to audiobooks, read my mantra again, and slept before midnight.</p>

<p>In the end, 4 out of 8 companies gave me offers. As it turned out, even if you bomb coding interviews, they’ll still hire you if your research skills stand out.</p>

<p>By early December, I started negotiating offers. For the first time that semester, I felt I didn’t need my mantra every morning and night for strength.</p>

<p>My final semester in spring 2025 was still busy, but much more relaxed. I went out with friends more and savored the achievements of my PhD. Time flew. On April 1st, I officially graduated.</p>

<h2 id="looking-back">Looking back</h2>

<p>From arriving at UT with no CS background and just $2,000 in my bank account, to creating an 8,000-star GitHub repo and joining Meta—I could never have predicted this journey.</p>

<p>A PhD isn’t just about publishing papers. It’s about persistence through failure, learning when to pivot, and building something that excites others.</p>

<p>If you’re considering a PhD or a career in AI: you don’t need to have it all figured out at the start. Just begin, keep learning, and adapt along the way. The opportunities will come.</p>]]></content><author><name>Puyuan Peng</name></author><category term="Story" /><summary type="html"><![CDATA[In October 2024, I contacted 9 companies for research positions in AI. I received interview invites from 8 of them, finished full interview loops with 6, and got offers from 4. The total market cap of the 4 companies that offered me a research scientist job was $10 trillion by mid-September 2025. I chose Meta.]]></summary></entry><entry><title type="html">Deep RL 12 Reinforcement Learning and Control as Probabilistic Inference</title><link href="https://jasonppy.github.io/deeprl/deeprl-12-control-as-inference/" rel="alternate" type="text/html" title="Deep RL 12 Reinforcement Learning and Control as Probabilistic Inference" /><published>2021-05-15T00:00:00-04:00</published><updated>2021-05-15T00:00:00-04:00</updated><id>https://jasonppy.github.io/deeprl/deeprl-12-control-as-inference</id><content type="html" xml:base="https://jasonppy.github.io/deeprl/deeprl-12-control-as-inference/"><![CDATA[<p>Please check out Professor Sergey Levine’s excellent tutorial: <a href="https://arxiv.org/pdf/1805.00909.pdf">Levine 18’</a></p>]]></content><author><name>Puyuan Peng</name></author><category term="DeepRL" /><category term="RL" /><category term="Notes" /><summary type="html"><![CDATA[Please check out Professor Sergey Levine’s excellent tutorial: Levine 18’]]></summary></entry><entry><title type="html">Being a Happier PhD Student</title><link href="https://jasonppy.github.io/%E4%B8%AD%E6%96%87/thoughts-phd-advice/" rel="alternate" type="text/html" title="Being a Happier PhD Student" /><published>2021-05-15T00:00:00-04:00</published><updated>2021-05-15T00:00:00-04:00</updated><id>https://jasonppy.github.io/%E4%B8%AD%E6%96%87/thoughts-phd-advice</id><content type="html" xml:base="https://jasonppy.github.io/%E4%B8%AD%E6%96%87/thoughts-phd-advice/"><![CDATA[<p>[Author] Kevin Gimpel</p>

<p>[Translator] Jason</p>

<h2 id="写在前面">Preface</h2>

<p>Kevin Gimpel is an assistant professor at TTIC. His research area is natural language processing; in recent years his main interests include representation learning, structured prediction, robust and data-efficient NLP, and world modeling for NLP. According to Google Scholar, as of June 2021 Kevin has been cited more than 9,600 times.</p>

<p>I had the honor of taking Kevin’s course while studying at the University of Chicago. He taught with great care and would personally work through every assignment before handing it out. I have read this advice of his for PhD students many times and felt it deserved a wider audience. When I told him I wanted to translate it, he gladly agreed. The original is available at: https://home.ttic.edu/~kgimpel/etc/phd-advice.pdf.</p>

<h2 id="正文">正文</h2>

<p>博士生的生活中充满各种危险。</p>

<p>敌人们从各个角度蹿出来。</p>

<p>一些攻击源自外部：低收入、无关的课程、匿名评审。</p>

<p>但是更多的攻击来自于博士生自己：自我怀疑、焦虑、缺乏安全感，这些才是主要的对手。</p>

<p>举个例子，著名的学者印第安纳·琼斯教授（系列电影《夺宝奇兵》中虚构的历史学教授）。电影中外部的危险有很多，但是琼斯教授总能机智化解，最激烈的斗争反而都是内部的。</p>

<p>博士生涯是很漫长的。生涯前期，你的时间主要花在技能训练上 — 数学成熟度，代码能力，实验管理能力，广阔的视野。之后，你将开始更多地为你所在的领域做出学术贡献。在这些方面，已经有很多很好的建议供你参考。例如[desJardin, Dredze and Wallach, Guo, Stearns, inter alia]</p>

<p>常常被忽视的，是如何管理自己丰富的内心世界。本文提供一些建议，从具体到抽象，希望能帮助你成为一个更快乐的博士生。一个更快乐的学生能做出更好的工作，但更重要的是，一个更快乐的学生是一个更快乐的人。</p>

<h3 id="1-执行一个可持续的时间表">1 执行一个可持续的时间表</h3>
<p>博士生常常为如何安排时间而焦虑。我的建议是，确定一个可持续的时间表，并严格执行。</p>

<p>对于我来说，这意味着每周工作五天，每天八小时。不是一天五小时，另一天十一小时。不是每周工作六天，每天七小时。不是待在办公室九小时，但是花两小时在与工作无关的事情上。不是不想工作就不工作，然后用几个通宵来赶进度。以上这些我都试过了。</p>

<p>严格地执行一个可持续的时间表既能防止你工作太少，也能防止工作太多。工作太少或太多都会降低你的幸福感。</p>

<p>不要工作太多。找到工作之外的生活。发展与实验室其他同学的友谊（所以你可以和朋友们讨论工作），找到实验室以外的朋友（所以和他们在一起是你没办法讨论工作）。如果工作是你的一切，它的重要性将支配你的存在，那么不可避免的挫折会让你感到难以承受。</p>

<p>不要工作太少。科研具有风险性和开放性。高风险的点子往往被漫不经心的探索，或者根本不被探索。因为这样的点子可能不会有效，学生迟迟不想开始研究。这会带来压力，从而让学生浪费更多时间，并使自己与导师的关系变得紧张。学生担心探索高风险的点子会浪费他的时间，但他其实是因为拖延在浪费时间！</p>

<p>当你为了严格执行你的时间表而工作，你会开始做那些你不想做的事。你会很容易的就花一个小时探索那个高风险的项目，因为这样会帮助你完成每天工作n小时的超短期目标。当你必须花这一个小时来工作，而不是为不工作找借口，或者为做什么而焦虑，你应该抱着试试看的心态探索一下。谁知道呢？说不定你会收获有趣的发现。这引出了下一点。</p>

<h3 id="2-试试看">2 试试看</h3>
<p>科研是对事实的追求，而事实常常难以被抓住。事情总是不会按我们预料的发展。所以有些时候，我们应该停止理论，编程实现你的想法。</p>

<p>一个执行得很好的实验会产出一个实验结果。而这个结果在某种意义下总是一个发现。要么你得到了别人从没得到过的结果，要么你成功地复现了别人的结果。不管怎么样，我们都应停下来想想 — 这个结果意味着什么？它改变了我们某些的认知吗？它应该如何更新我们对世界的看法？</p>

<p>令人比较难接受的一点是，你得到的结果可能并没有包含很多信息量。很多时候，你都在设法说明你那闪亮的新点子比标准的基线模型要好。当处于这种心态时，人很容易在代码实现或者实验设计环节中犯马虎。这引出了下一点。</p>

<h3 id="3-你很可能有bug">3 你很可能有bug</h3>
<p>不论你的研究顺利与否，不管进展快慢，不管结果是好于预期还是不及预期，你很可能有bug。无数已经发表的论文中有bug。这些bug往往在论文发表很久之后才被人们发现，有时这些bug不会影响论文结论，有时会。作为研究者，人类是灵活而有创造力的，但是也很容易经常犯一些小错误。</p>

<p>当一个学生实现他那闪亮的新点子并获得第一批实验结果时，他通常要么非常高兴，要么非常沮丧。等待和理解这些最初的结果是令人精疲力尽的。更糟糕的是，最初的结果往往甚至没有包含很多信息。</p>

<p>我的建议是对两种情况都制定应对方案。不管得到的是正面还是负面的结果，都要努力排除bug。尝试你想法的变种，这要求你多次重复你的实验过程，从而提高发现错误的几率。在代码中添加更多的声明和注释。每一个结果都称得上是一个发现，但前提是没有bug。</p>

<p>bug的形式有很多。对于实验计算机科学家来说，bug是代码中的错误；如果你的研究是证明理论，bug可以是一个不成立的假设，或者步骤中的问题；如果你是生物学家，bug可以是来自任何来源的污染。</p>

<p>不管你的进展是顺利，不顺利，还是一般般：你很可能有bug。在你找到并修正了十个bug之后，你可能还有一个bug；在你基于同一个代码库发表了十篇精彩的论文之后，你可能还是有bug。</p>

<h3 id="4-忘记自己">4 忘记自己</h3>
<p>把注意力放在现象上。让我们来想象一个戒备心很重，没有安全感，争强好胜的气象学家。总是为自己气象预测的准确性焦虑着。预测不准确时轻描淡写，预测准确时却大肆宣扬。总是在拿自己的预测与他的竞争对手做比较。</p>

<p>现在，我们再来想象另一个气象学家，他诚实而谦虚，对自己预测的准确和偏差都直言不讳，并且解释原因。他忘记自己，把注意力都集中在气候现象的美，复杂性，和重要性上。这样的天气预报才是你更想看的。</p>

<p>博士生们（或者是所有人）都更容易成为那个没有安全感的气象学家 — 尽管像其他人一样容易犯错，却总是想树立自己的光辉形象。没有人在乎你是否聪明。他们看天气预报是因为他们关心天气，而不是关心你。（唯一在乎你是否聪明的人可能就是那些考虑招聘你的人。但是同样的，招聘官更喜欢那些能把自己沉浸在丰富多彩的科研中的求职者）</p>

<p>如果你过分沉湎于自己的才华和名声，你永远不会快乐。过分的自我意识会带来压力，并扼杀享受科学发现之美的能力。谦卑能使人接受现实的奇妙和丰富。正如切斯特顿所说的：</p>

<p>“接下来的话可能会被误解；但我应该先告诉人们不要自我陶醉（but I shouldbegin my sermon by telling people not to enjoy themselves）。我应该告诉他们享受舞蹈、戏剧、游乐设施、香槟和牡蛎；享受爵士乐，鸡尾酒和夜总会，如果他们没有其他更好的选择；享受重婚和入室盗窃以及其他任何犯罪，而不是自我陶醉；永远不要学会自我陶醉。人类是快乐的，只要他们还会惊讶、接纳、感激……”</p>

<p>“当一个人的自我意识比外在的惊喜和冒险更占统治地位时，会变得极度自我挑剔，并有希望破灭的感觉。这是极度的饥渴和绝望的象征。”</p>

<p>宣传优秀的科研成果。如果那正好是你的成果，没问题。但不能因为成果是你的，所以它就是优秀的。切斯特顿接下来写到：“一个人的自负会使他自己成为事物唯一的评价标准，而不是让事实作为评价标准……”</p>

<p>研究是在一个社区里进行的，过度的自我宣传是格格不入并且有害的，这与社区的健康运转背道而驰。你认为印第安纳·琼斯会自我宣传吗？他正忙着拯救世界呢！如果拯救世界是你的日常，你不需要自我宣传，别人会为你拍电影。</p>

<h3 id="5-不要站边-包括你自己这边">5 不要站边 （包括你自己这边）</h3>
<p>你的领域会出现一些有争议的问题。比如说，谁对某项发明的贡献最大？某个成果是否有效？某个数据集或任务是应该被保留还是抛弃？应该接纳还是批评某个新的结论、度量方式或者研究方法？</p>

<p>有时候，一个坚定不移的立场是必要的。在这种情况下，果断选择自己的立场。</p>

<p>但是大部分问题都不值得为之站边，因为通常两边都有相当的证据。由于我们天生的不等所有证据都呈现之前就选择一个信念的不理性习惯，大部分研究者都会站边。有意识的或无意识的，通常是主观的标准在帮他们做选择，比如个性，智力，风格或者学术派别。有时这些主观标准所占权重大于任何客观标准。</p>

<p>科学家应该能够退一步，从更广阔的角度看待问题。最简单的方法就是拒绝站边。这样你会更快乐，因为当你回顾关于这个问题的新证据时，你不会因为需要确认自己偏见而感到压力。人们会注意到你在寻求真理，并尊重你，即使在站在中间可能会感到不舒服（而且孤独）。</p>

<p>当机构用纳税人的钱支持我们做科研时，他们是在支持客观的科学家在一个社区里共同工作，为人类知识添砖加瓦。这些钱不能被用来助长鸡毛蒜皮的八卦，或者为两派的斗争输送“弹药”，或者让人们相信你的贡献比某人更多。这都是滥用公款，而且不会让你更快乐。</p>

<h3 id="6-别拿自己和他人作比较">6 别拿自己和他人作比较</h3>
<p>博士生所经历的大部分压力来自于将自己与他人进行比较。学生试图与她臆想中的对手竞争，却无法辨别这种竞争是否合理。</p>

<p>工作时长是最具体的比较形式之一。有些人吹嘘工作时间长，有些人吹嘘工作时间短，有时同一个人甚至会既吹嘘自己工作的时间长，又吹嘘自己工作的时间短！</p>

<p>拒绝参与这个游戏，它毫无益处。时常为这种事情操心会让你把注意力过分的集中在自己身上。如果总在拿自己与他人作比较，你将没办法专注于你工作的质量，理解的深度，和视角的广度。</p>

<p>每个人都不一样，所以拿你自己和他人作比较是完全不合适的。只有将所有变量都控制了之后，比较才是合理的。但是任何两个人之间都有太多的不同点。所以任何的比较都将更多地反映那些无关紧要的不同点，而不是与你科研能力相关的品质。如果别人发的论文比你多，那不应让你陷入绝望；如果别人发的论文比你少，那不应给你带来狂喜。</p>

<p>让我们来明确一件事：你不是最好的。这没关系。你不能做所有事，你也不会做出人类历史上最伟大的发现。但是你可以做一些事，你有机会为科学做出真实的，经得起时间考验的，之前别人从未发现，甚至不会发现的贡献。</p>

<h3 id="7-结语">7 结语</h3>
<p>现实很奇怪。我们构造出的所有世界都相对简单，而现实中总有些东西是我们无法理解的。我们也永远不会将它完全理解。研究是对现实的探索，这是一次美妙的探索，它让我们保持谦虚。真正的谦虚是努力看清事物本来的样子，而不是我们所希望的样子。</p>

<p>一旦你摆脱了内心的自负和自我陶醉，通过扎根于现实的生活进入真正的自由，你会更加快乐。</p>

<p>我将用两段引文来结束：</p>

<p>在成人的世界里，没有人是绝对的无信仰者，没有人不信仰某种事物。每个人都有信仰的对象，不同的只是选择信仰什么。我们之所以只能选择信仰某种神灵，或者心灵皈依 — 不论是耶稣或是阿拉，不论是耶和华、巫术母亲女神，或是佛教的四圣谛，亦或是某种不可亵渎的道德准则 — 因为基本上其他的选择（例如对金钱、权力的信仰 — 译者注）都会让你陷于危险的境地。
<br />
如果你信仰金钱与物质，如果你们依靠那些东西来追求生命的意义，那么你将永远无法被满足，永远无法感到满足；信仰你自己的形象与魅力，你会永远觉得自己丑陋，当年龄的痕迹慢慢浮现，当人们为你送最后一程而哀悼时，你的心早已死了一百万次了。我们多少都懂这个道理，它早被编入了神话、谚语、警句、寓言之中。这个道理是很多伟大故事的骨架。诀窍只有一个，那就是让真理成为你意识的最高准则。信仰权力，你会感到虚弱、害怕，你会需要更多的权力以凌驾于他人之上，好麻痹自己的恐惧；信仰智力，想做个看起来很聪明的人，你总觉得自己是个愚蠢的骗子，永远处于事情即将败露的恐惧中。以上这些形式的信仰，本质并不邪恶，这些信仰是无意识的，是默认的设定，它会让你逐渐沉沦，日复一日，你会对眼前的事物越来越挑剔，错误的估量每个行为的意义。所谓的真实世界不会阻止你采用这种默认设定，因为恐惧、轻视、挫败、欲望和自我崇拜的燃料，正让这由人、金钱，和权力构成的真实世界很好地运转。当今社会文化产出了巨大的财富，舒适的生活，和个人的自由，那是一种处于万物中心，做自己的上帝的自由。这种自由当然有可取之处，但是不要忘了还有其他类型的自由。而在这汲汲营营的，处处宣扬胜利、成功、炫耀的世界，你很难听到人们谈论那种最宝贵的自由，那种自由需要专注、意识、自律，需要你真正关心他人，并在日常生活中通过各种各样看似笨拙的方式为他人付出。那才是真正的自由。你不应该无意识地选择默认设定，那是无意义的老鼠赛跑，通过不停地啃咬来品尝曾经的拥有与失去，仿佛身处无间道。</p>

<p>—大卫·福斯特·华莱士 [Wallace]</p>

<p>最后一段的翻译超出了我的能力，为了不影响原意，直接用原文：</p>

<p>And from a Christian perspective: We must not think Pride is something God forbids because He is offended at it, or that Humility is something He demands as due to Hisown dignity—as if God Himself was proud….He wants you to know Him: wants togive you Himself. And He and you are two things of such a kind that if you really get into any kind of touch with Him you will, in fact, behumble—delightedly humble, feeling the infinite relief of having for once gotrid of all the silly nonsense about your own dignity which has made you restless and unhappy all your life. He is trying to make you humble in order to make this moment possible: trying to take off a lot of silly, ugly, fancy-dress in which we have all got ourselves up and are strutting about like the little idiots we are. I wish I had got a bit further with humility myself: if I had, I could probably tell you more about the relief, the comfort, of taking the fancy-dress off—getting rid of the false self, with all its ‘Look at me’ and ‘Aren’t I a good boy?’ and all its posing and posturing. To get even near it, even for a moment, is like a drink of cold water to a man in a desert.</p>

<p>—C. S. Lewis, Mere Christianity References</p>

<h3 id="references">References</h3>
<p>[Chesterton] Chesterton, G. K. The Common Man.
<br />
[desJardin] desJardin, M. How to succeed in graduate school: A guide for students and advisors. https://www.eng.auburn.edu/~troppel/Advice_for_Grad_Students.pdf. Accessed: 2021-06-20.
<br />
[Dredze and Wallach] Dredze, M. and Wallach, H. M. How to be a successful PhD student (in computer science (in NLP/ML)). https://people.cs.umass.edu/~wallach/how_to_be_a_successful_phd_student.pdf. Accessed: 2021-06-20.
<br />
[Guo] Guo, P. J. Advice for new Ph.D. students. http://pgbovine.net/early-stage-PhD-advice.htm. Accessed: 2016-01-31.
<br />
[Lewis] Lewis, C. S. Mere Christianity.
<br />
[Stearns] Stearns, S. C. Some modest advice for graduate students. http://stearnslab.yale.edu/some-modest-advice-graduate-students. Accessed: 2021-06-20.
<br />
[Wallace] Wallace, D. F. This is Water. https://en.wikipedia.org/wiki/This_Is_Water. My translation partly referred to https://www.youtube.com/watch?v=nSYLeqWZwSw. Accessed: 2021-06-20.</p>]]></content><author><name>Puyuan Peng</name></author><category term="中文" /><summary type="html"><![CDATA[[Author] Kevin Gimpel]]></summary></entry><entry><title type="html">Deep RL 11 Model-Based Policy Learning</title><link href="https://jasonppy.github.io/deeprl/deeprl-11-model-based-policy-learning/" rel="alternate" type="text/html" title="Deep RL 11 Model-Based Policy Learning" /><published>2021-05-10T00:00:00-04:00</published><updated>2021-05-10T00:00:00-04:00</updated><id>https://jasonppy.github.io/deeprl/deeprl-11-model-based-policy-learning</id><content type="html" xml:base="https://jasonppy.github.io/deeprl/deeprl-11-model-based-policy-learning/"><![CDATA[<p>In this section, we study how to learn policies that utilize the known (or learned) dynamics. Why do we need to learn a policy? What’s wrong with the MPC approach from the previous lecture? The answer is that MPC is still an open-loop control method: even though the replanning mechanism provides some amount of closed-loop capability, the planning procedure is still unable to reason about the fact that more information will be revealed in the future and that we can act on that information. This is obviously suboptimal in the stochastic dynamics setting.</p>

<p>On the other hand, if we have an explicit policy, we can make a decision at each time step based on the state at that time step, so there is no need to plan the whole action sequence in one go. This is closed-loop planning, and it is more desirable in the stochastic dynamics setting.</p>

<h2 id="1-learn-policy-by-backprop-through-time-bptt">1 Learn Policy by Backprop Through Time (BPTT)</h2>
<p>Suppose we have learned dynamics \(s_{t+1} = f(s_t, a_t)\) and reward function \(r(s_t, a_t)\), and want to learn the optimal policy \(a_t = \pi_{\theta}(s_t)\). (Here I use deterministic dynamics and a deterministic policy to make the point; it also applies to stochastic dynamics, but the derivation is slightly more involved and will be introduced in the future. I also drop the parameter notation in the dynamics and reward function for simplicity.) As in policy gradient methods, our goal is:</p>

\[\begin{align}
\theta^* = \text{argmax}_{\theta} \mathbb{E}_{\tau\sim p(\tau)}\sum_t r(s_t, a_t)
\end{align}\]

<p>Since we have dynamics and reward function, we can write the objective as</p>

\[\begin{align}
\mathbb{E}_{\tau\sim p(\tau)}\sum_t r(s_t, a_t) = \sum_t r(f(s_{t-1}, a_{t-1}), \pi_{\theta}(f(s_{t-1}, a_{t-1}))), \text{ where } s_{t-1} = f(s_{t-2}, a_{t-2})
\end{align}\]

<p>Very similar to shooting methods, the objective is defined recursively, which leads to high sensitivity to the first actions and poor numerical stability. However, for shooting methods, if we formulate the problem as LQR, we can use dynamic programming to solve it in a very stable fashion. Unfortunately, unlike LQR, since the parameters of the policy couple all time steps, we cannot solve it by dynamic programming (i.e., we can’t compute the best policy parameters for the last time step, then solve for the policy parameters of the second-to-last time step, and so on).</p>

<p>What we can use is backpropagation:</p>
<div align="center"><img src="../assets/images/285-11-bptt.png" width="700" /></div>

<p>If you are familiar with recurrent neural networks, you might recognize that the kind of backprop shown above is the so-called Backpropagation Through Time, or BPTT, which is usually used on recurrent neural nets like LSTMs. BPTT famously suffers from vanishing or exploding gradients, because the Jacobians of all time steps get multiplied together. This issue can only get worse in policy learning: in sequence modeling we can choose architectures like LSTM that have good gradient behavior, but in model-based RL the dynamics has to fit the data, so we have no control over its gradient behavior.</p>
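<p>To make the mechanics concrete, here is a minimal sketch of BPTT-style policy learning in PyTorch. It assumes we already have a differentiable learned dynamics model and a differentiable reward; the network shapes and the quadratic reward are illustrative stand-ins, not any particular published implementation:</p>

<pre><code class="language-python">import torch
import torch.nn as nn

state_dim, action_dim, horizon = 4, 2, 20

# stand-in learned dynamics s' = f(s, a) and deterministic policy a = pi(s)
f = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.Tanh(),
                  nn.Linear(64, state_dim))
pi = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(),
                   nn.Linear(64, action_dim))

def r(s, a):
    # stand-in differentiable reward; in practice learned or known
    return -(s ** 2).sum(-1) - 0.01 * (a ** 2).sum(-1)

opt = torch.optim.Adam(pi.parameters(), lr=1e-3)

s = torch.randn(32, state_dim)            # batch of initial states
total_reward = 0.0
for t in range(horizon):
    a = pi(s)
    total_reward = total_reward + r(s, a).mean()
    s = f(torch.cat([s, a], dim=-1))      # gradients flow through every step

loss = -total_reward                      # maximize reward
opt.zero_grad()
loss.backward()   # this single call is BPTT: the Jacobians of all time steps
                  # get multiplied along the chain, hence the gradient issues
opt.step()
</code></pre>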

<p>In the next two sections, we will introduce two popular approaches to model-based RL. The first is a bit controversial: it performs model-free optimization (policy gradient, actor-critic, Q-learning, etc.) and uses the model only to generate synthetic data. Despite seeming backwards, this idea can work very well in practice. The second is to use simple local models and local policies, which can be solved with stable algorithms.</p>

<h2 id="2-model-free-optimization-with-a-model">2 Model-Free Optimization with a Model</h2>
<p>Reinforcement learning is about getting better by interacting with the world, and this interactive, trial-and-error process can be time consuming (even in a simulator, sometimes). If we had a mathematical model of how the world works, we could effortlessly generate data (transitions) from it for model-free algorithms to learn from. However, it’s impossible to have a comprehensive mathematical model of the world, or even of the specific environment in which we want to run our RL algorithms. Nevertheless, a learned dynamics model is a representation of the environment, and we can use it to generate data.</p>

<p>The general idea is to use the learned dynamics to provide more training data for model-free algorithms, by generating model-based rollouts from real-world states.</p>

<p>The general algorithm is the following:</p>

<ol>
  <li>run <em>some</em> policy in the environment to collect data \(\mathcal{B}\)</li>
  <li>sample minibatch \(\{(s, a, r, s')\}\) from \(\mathcal{B}\) uniformly</li>
  <li>use \(\{(s, a, r, s')\}\) to update the model and reward function</li>
  <li>sample minibatch \(\{s\}\) from \(\mathcal{B}\)</li>
  <li>for each \(s\), perform a model-based <strong>k</strong>-step rollout with the neural net policy \(\pi_{\theta}\) or the policy induced by Q-learning, and collect transitions \(\{(s, a, r, s')\}\)</li>
  <li>use the transitions for a policy gradient update or a Q-function update. Repeat from step 4 a few times (inner loop); then go to step 1 (outer loop).</li>
</ol>

<p>A few things need to be clarified. The algorithm above is very general and explicitly considers both policy gradient and Q-learning; this affects what we actually do in steps 1, 5, and 6. If we use policy gradient, then in steps 1 and 5 we run the learned policy, and in step 6 we perform a policy gradient update. If we use Q-learning, then in steps 1 and 5 we run the policy induced by the learned Q-function, e.g., an \(\epsilon\)-greedy policy, and in step 6 we update the Q-function by taking the gradient of the temporal difference error.</p>

<p>The model-based rollout length <strong>k</strong> is a very important hyperparameter. Since we rely completely on \(f_{\phi}(s_t, a_t)\) during model-based rollouts, the discrepancy between \(f_{\phi}(s_t, a_t)\) and the ground-truth dynamics can result in a distribution shift problem, i.e., the expectation in the objective we optimize is over a distribution that is very different from the true distribution. We have encountered this issue several times before (e.g., imitation learning, TRPO, etc.). We know that if there is a discrepancy between the fitted dynamics and the true dynamics, the error between the true objective and the objective we optimize grows linearly with the length of the rollout. Therefore, we don’t want the model-based rollout to be too long; on the other hand, too short a rollout provides little learning signal, which is undesirable for the policy or Q-function update. We therefore need to choose an appropriate k for the algorithm.</p>
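<p>Below is a minimal sketch of step 5, the branched <strong>k</strong>-step rollout. The callables <code>model</code>, <code>policy</code>, and <code>reward_fn</code> are assumed to be trained elsewhere, and all names are illustrative (not taken from any specific codebase):</p>

<pre><code class="language-python">import numpy as np

def model_rollouts(model, policy, reward_fn, real_states, k):
    """Generate synthetic transitions by rolling the learned model k steps
    from a batch of states sampled from the real-data buffer."""
    synthetic = []
    s = real_states                  # shape: (batch, state_dim)
    for _ in range(k):
        a = policy(s)                # actions from the current policy
        s_next = model(s, a)         # learned dynamics prediction
        rew = reward_fn(s, a)
        synthetic.extend(zip(s, a, rew, s_next))
        s = s_next                   # continue the imagined trajectory
    return synthetic                 # fed to the model-free learner in step 6

# toy usage with stand-in callables
rollouts = model_rollouts(
    model=lambda s, a: s + 0.1 * a,
    policy=lambda s: -0.5 * s,
    reward_fn=lambda s, a: -(s ** 2).sum(-1),
    real_states=np.random.randn(8, 2),
    k=3,
)
</code></pre>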

<p>Since the initial state of every model-based rollout is sampled from real-world data, this algorithm can be intuitively understood as imagining different possibilities starting from real-world situations:</p>

<div align="center"><img src="../assets/images/285-11-branch.png" width="300" /></div>

<p>Here we give one instantiation of the general algorithm introduced above, which combines model-based RL with policy gradient methods. The algorithm is called Model-Based Policy Optimization, or MBPO <a href="https://arxiv.org/pdf/1906.08253.pdf">Janner et al. 19’</a>:</p>

<div align="center"><img src="../assets/images/285-11-mbpo.png" width="700" /></div>

<p>For instantiations with Q-learning, see <a href="https://arxiv.org/pdf/1603.00748.pdf">Gu et al. 16’</a> and <a href="https://arxiv.org/pdf/1803.00101.pdf">Feinberg et al. 18’</a>.</p>

<h2 id="3-use-simpler-policies-and-models">3 Use Simpler Policies and Models</h2>

<h3 id="31-lqr-with-learned-models">3.1 LQR with Learned Models</h3>
<p>A local model is a model that is valid only in the neighborhood of one or a few trajectories. Previously, we studied (i)LQR, which assumes linear dynamics (it approximates the dynamics by a linear function). This may be too simple for most scenarios globally, but it can be a good assumption locally: for one or a few very close trajectories, we can assume linear dynamics. Given such trajectories, we fit linear dynamics to them by linear regression at each time step, run (i)LQR to get actions, and execute those actions in the environment. This yields new trajectories, to which we again fit linear dynamics, then run (i)LQR and execute the planned actions, and so on.</p>

<p>The procedure looks like the following:</p>

<div align="center"><img src="../assets/images/285-11-lqr.png" width="600" /></div>

<p>Where the <em>local</em> linear dynamics is defined as</p>

\[p(x_{t+1}\mid x_t, u_t) = \mathcal{N}(A_t x_t + B_t u_t + c_t, \Sigma)\]

<p>Where \(A_t, B_t, c_t\) are fitted using the trajectories \(\{ \tau_i \}\). \(\Sigma\) can be tuned as a hyperparameter or also be estimated from data.</p>
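<p>As a minimal sketch of this per-time-step linear regression (assuming the \((x_t, u_t, x_{t+1})\) tuples at one time step have been gathered across the sampled trajectories; names are illustrative):</p>

<pre><code class="language-python">import numpy as np

def fit_local_dynamics(states_t, actions_t, states_t1):
    """Least-squares fit of x_{t+1} ~ A_t x_t + B_t u_t + c_t at one time step."""
    N = states_t.shape[0]
    X = np.hstack([states_t, actions_t, np.ones((N, 1))])  # regressors
    # solve X @ W ~ states_t1 in the least-squares sense, W = [A_t; B_t; c_t]^T
    W, *_ = np.linalg.lstsq(X, states_t1, rcond=None)
    dx, du = states_t.shape[1], actions_t.shape[1]
    A_t, B_t, c_t = W[:dx].T, W[dx:dx + du].T, W[-1]
    residual = states_t1 - X @ W
    Sigma = np.cov(residual.T)         # noise covariance estimated from data
    return A_t, B_t, c_t, Sigma

# toy usage: 100 samples of a 3-dim state, 2-dim action system at time t
x = np.random.randn(100, 3)
u = np.random.randn(100, 2)
x1 = x @ np.random.randn(3, 3).T + u @ np.random.randn(3, 2).T \
     + 0.01 * np.random.randn(100, 3)
A, B, c, Sig = fit_local_dynamics(x, u, x1)
</code></pre>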

<p>The policy is defined as</p>

\[p(u_t\mid x_t) = \mathcal{N}(K_t (x_t - \hat{x_t}) + k_t + \hat{u_t}, \Sigma_t)\]

<p>Note that this corresponds to iLQR, i.e., \(K_t, k_t\) are calculated from the fitted dynamics and \(\hat{x_t}, \hat{u_t}\) are the actual states and actions in the trajectories \(\{ \tau_i \}\). \(\Sigma_t\) is set to \(Q_{u_t, u_t}^{-1}\), which is an intermediate result of running iLQR. Intuitively, \(Q_{u_t, u_t}\) measures how strongly the cost-to-go depends on the action. If it is small, the total reward doesn’t depend very strongly on the action, which means many different actions may lead to similar reward; it is then a good idea to test out different actions, so we want the variance of \(p(u_t\mid x_t)\) to be high, and vice versa. Setting \(\Sigma_t\) to \(Q_{u_t, u_t}^{-1}\) gives us exactly this property.</p>

<p>One more thing to notice: since the fitted dynamics is only valid locally, if the actions we take lead to a very different state distribution, then the subsequent planned actions might be very bad and lead to even worse results. Therefore, we need to make sure the new trajectory distribution stays close enough to the old one. This can be enforced, again, using a KL divergence constraint:</p>

\[D_{\text{KL}}(p_{\text{new}}(\tau) \,\|\, p_{\text{old}}(\tau))\]

<p>For details about how this is implemented, please see <a href="https://papers.nips.cc/paper/2014/file/6766aa2750c19aad2fa1b32f36ed4aee-Paper.pdf">Levine and Abbeel 14’</a>.</p>

<h3 id="32-guided-policy-search">3.2 Guided Policy Search</h3>
<p>If we have a bunch of local policies, e.g., \(\{\pi_{\text{LQR, i}}\}_{i}\), derived from local models (e.g., LQR models), we can distill the knowledge of these local policies into a global policy via supervised learning.</p>

<p>The idea above can be viewed as a special case of a more general framework known as knowledge distillation (<a href="https://arxiv.org/pdf/1503.02531.pdf">Hinton et al. 15’</a>). Here we have a bunch of weak policies (the local policies) that we could ensemble into a strong policy, but rather than using the ensemble directly, we distill its knowledge into one global neural network policy \(\pi_{\theta}\). We train the neural network on the trajectories used for training the LQR parameters and policies, except that instead of directly training the neural net policy to output the one actual action at each time step, we train it to predict the probability of each action given the state.</p>

<p>In order for the algorithm to work better, we want the LQR policies \(\{\pi_{\text{LQR, i}}\}_{i}\) to be close to the neural net policy \(\pi_{\theta}\). We use KL divergence to enforce that, which can be implemented by modifying the cost function of LQR.</p>

<p>The algorithm sketch is the following:</p>

<ol>
  <li>Optimize each local policy \(\pi_{\text{LQR, i}}\) on initial state \(x_{0,i}\) w.r.t. \(\tilde{c}_{k,i}(x_t, u_t)\)</li>
  <li>use samples from step 1 to train \(\pi_{\theta}\) to mimic each \(\pi_{\text{LQR, i}}\)</li>
  <li>update cost function \(\tilde{c}_{k+1,i}(x_t, u_t) = c(x_t, u_t) + \lambda_{k+1, i}\log \pi_{\theta}(u_t\mid x_t)\). Go to step 1.</li>
</ol>

<p>Where \(k\) indexes the iteration of the algorithm and \(i\) indexes the different LQR models, each instantiated by starting from a different initial state. Step 3 makes the local policies and the global policy close to each other in terms of KL divergence, and \(\lambda_{k+1, i}\) is the Lagrange multiplier. This is just a sketch of the algorithm; for details, please check out the original paper by <a href="https://www.jmlr.org/papers/volume17/15-522/15-522.pdf">Levine and Finn et al. 16’</a>.</p>
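<p>Below is a minimal sketch of the supervised distillation in step 2, assuming the local LQR controllers supply mean actions at states sampled from their trajectories (all names are illustrative). For Gaussian policies with a shared, fixed covariance, this squared-error objective matches the KL term up to constants:</p>

<pre><code class="language-python">import torch
import torch.nn as nn

state_dim, action_dim = 4, 2
pi_theta = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(),
                         nn.Linear(64, action_dim))   # global policy (mean)
opt = torch.optim.Adam(pi_theta.parameters(), lr=1e-3)

def distill_step(states, lqr_mean_actions):
    """One supervised step: match the global policy's mean to the local
    policies' mean actions at the sampled states."""
    loss = ((pi_theta(states) - lqr_mean_actions) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# toy usage: states and mean actions collected from the local controllers
s = torch.randn(256, state_dim)
a = torch.randn(256, action_dim)
distill_step(s, a)
</code></pre>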

<p>A similar approach can also be extended to the multitask transfer scenario:</p>

<div align="center"><img src="../assets/images/285-11-multi.png" width="700" /></div>

<p>Where the loss function for training the global policy is</p>

\[\mathcal{L}^i = \sum_{a\in \mathcal{A}_{E_i}} \pi_{E_i}(a\mid s)\log \pi_{\theta}(a\mid s)\]

<p>For details, please see <a href="https://arxiv.org/pdf/1511.06342.pdf">Parisotto et al. 16’</a>.</p>

<h2 id="4-demo-end-to-end-training-of-deep-visuomotor-policies">4 Demo: End-to-End Training of Deep Visuomotor Policies</h2>
<p><a href="https://www.jmlr.org/papers/volume17/15-522/15-522.pdf">Levine and Finn et al. 16’</a></p>
<iframe width="1424" height="652" src="https://www.youtube.com/embed/Q4bMcUk6pcw" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>]]></content><author><name>Puyuan Peng</name></author><category term="DeepRL" /><category term="RL" /><category term="Notes" /><summary type="html"><![CDATA[In this section, we study how to learn policies utilize the known (learned) dynamics. Why do we need to learn a policy? What’s wrong with MPC in the previous lecture? The answer is that MPC is still an open loop control methods, even though the replanning machanism provides some amount of closed-loop capability, but the planning procedure still is unable to reason under the fact that more information will be revealed in the future and we can act based on that information. This is obviously suboptimal in the stochastic dynamics setting.]]></summary></entry><entry><title type="html">Deep RL 10 Model-based Reinforcement Learning</title><link href="https://jasonppy.github.io/deeprl/deeprl-10-model-based-rl/" rel="alternate" type="text/html" title="Deep RL 10 Model-based Reinforcement Learning" /><published>2021-05-05T00:00:00-04:00</published><updated>2021-05-05T00:00:00-04:00</updated><id>https://jasonppy.github.io/deeprl/deeprl-10-model-based-rl</id><content type="html" xml:base="https://jasonppy.github.io/deeprl/deeprl-10-model-based-rl/"><![CDATA[<p>Previous lecture is mainly about how to plan actions to take when the dynamics is known. In this lecture, we study how to learn the dynamics. We will also introduce how to incorporate planning in the model learning process and therefore form a complete decision making algorithm.</p>

<p>Again, most of the algorithms will be introduced in the context of deterministic dynamics, i.e. \(s_{t+1} = f(s_t, a_t)\), but almost all of these algorithms can just as well be applied in the stochastic dynamics setting, i.e. \(s_{t+1}\sim p(s_{t+1}\mid s_t, a_t)\), and when the distinction is salient, I’ll make it explicit.</p>

<h2 id="1-basic-model-based-rl">1 Basic Model-based RL</h2>
<p>How do we learn a model? The most direct way is supervised learning. Similar to ideas used before, we run a random policy to collect transitions, and then fit a neural net to those transitions:</p>

<ol>
  <li>run base policy \(\pi_0(a\mid s)\) (e.g. random policy) to collect \(\mathcal{D} = \{ (s_i, a_i, s'_i) \}\)</li>
  <li>learn dynamics model \(f_{\theta}(s,a)\) to minimize \(\sum_{i}\left\| s'_i - f_{\theta}(s_i, a_i) \right\|^2\)</li>
  <li>plan through \(f_{\theta}(s,a)\) to choose actions.</li>
</ol>

<p>Where in step 3, we can use CEM, MCTS, LQR etc.</p>
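<p>Below is a minimal sketch of these three steps, using random shooting as a simple stand-in planner for CEM/MCTS/LQR. The toy system, network sizes, and reward are illustrative assumptions:</p>

<pre><code class="language-python">import numpy as np
import torch
import torch.nn as nn

state_dim, action_dim, H = 3, 1, 15

# step 1: collect (s, a, s') with a random policy -- here faked with a toy system
def true_step(s, a):
    return 0.9 * s + 0.1 * np.tanh(a)

S = np.random.randn(5000, state_dim)
A = np.random.uniform(-1, 1, (5000, action_dim))
S1 = true_step(S, A)

# step 2: fit f_theta(s, a) by minimizing the squared error
f = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                  nn.Linear(64, state_dim))
opt = torch.optim.Adam(f.parameters(), lr=1e-3)
X = torch.tensor(np.hstack([S, A]), dtype=torch.float32)
Y = torch.tensor(S1, dtype=torch.float32)
for _ in range(500):
    opt.zero_grad()
    loss = ((f(X) - Y) ** 2).mean()
    loss.backward()
    opt.step()

# step 3: plan through f_theta by random shooting
def plan(s0, n_candidates=256):
    with torch.no_grad():
        a_seqs = torch.rand(n_candidates, H, action_dim) * 2 - 1
        s = torch.tensor(s0, dtype=torch.float32).repeat(n_candidates, 1)
        returns = torch.zeros(n_candidates)
        for t in range(H):
            a = a_seqs[:, t]
            returns += -(s ** 2).sum(-1)      # stand-in reward r(s, a)
            s = f(torch.cat([s, a], dim=-1))
        return a_seqs[returns.argmax()]       # best sampled action sequence

best_actions = plan(np.zeros(state_dim))
</code></pre>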

<p>Does this work? Well, in some cases. For example, if we have a full physics model of the dynamics and only need to fit a few parameters, this method can work. But still, some care should be taken to design a good base policy.</p>

<p>In general, however, this doesn’t work, and the reason is very similar to the one we encountered in imitation learning — distribution shift. The data we used to learn the dynamics comes from the trajectory distribution induced by the random policy \(\pi_0\), but when we plan through the model, we can think of the algorithm as using another policy \(\pi_f\), and the trajectory distribution induced by this policy can be very different from the one induced by the base policy. The consequence is that when we plan actions, we arrive at state-action pairs about which the dynamics model is very uncertain, because it has never been trained on similar data! It will therefore make bad predictions of the following states, which in turn lead to bad actions; this compounds until we are planning entirely on the wrong states (the prediction diverges from reality). The intuitive plot is shown below:</p>

<div align="center"><img src="../assets/images/285-10-mismatch.png" width="700" /></div>

<p>How do we deal with it? The same way DAgger deals with distribution shift in imitation learning: we just need to make sure that the training data comes from the current dynamics (the current policy). This leads to the first practical model-based RL algorithm:</p>

<ol>
  <li>run base policy \(\pi_0(a\mid s)\) (e.g. random policy) to collect \(\mathcal{D} = \{ (s_i, a_i, s'_i) \}\)</li>
  <li>learn dynamics model \(f_{\theta}(s,a)\) to minimize \(\sum_{i}\left\| s'_i - f_{\theta}(s_i, a_i) \right\|^2\)</li>
  <li>plan through \(f_{\theta}(s,a)\) to choose actions.</li>
  <li>execute those actions and add the resulting transitions \(\{ (s_j, a_j, s'_j) \}\) to \(\mathcal{D}\). Go to step 2.</li>
</ol>

<p>However, even though the data is updated based on the learned dynamics, as long as we are replanning, we always induce a new trajectory distribution that differs a little from the previous one. In other words, the distribution shift never fully goes away. Therefore, as we plan through \(f_{\theta}(s,a)\), the actual trajectory will gradually deviate from the predicted trajectory, which leads to bad actions.</p>

<p>We can improve this algorithm by executing only the first planned action, observing the next state that this action leads to, and then replanning from that state, taking the first action again, and so on. In a word, at each step we take only the first planned action, observe the resulting state, and replan from there. Because at each time step we always act based on the actual state, this is more reliable than executing the whole planned action sequence in one go. The algorithm is</p>

<ol>
  <li>run base policy \(\pi_0(a\mid s)\) (e.g. random policy) to collect \(\mathcal{D} = \{ (s_i, a_i, s'_i) \}\)</li>
  <li>learn dynamics model \(f_{\theta}(s,a)\) to minimize \(\sum_{i}\left\| s'_i - f_{\theta}(s_i, a_i) \right\|^2\)</li>
  <li>plan through \(f_{\theta}(s,a)\) to choose actions.</li>
  <li>execute the first planned action and add the resulting transition \((s, a, s')\) to \(\mathcal{D}\). If we have reached the predefined maximal number of planning steps, go to step 2; else, go to step 3.</li>
</ol>

<p>This algorithm is called Model Predictive Control, or MPC. Replanning at each time step drastically increases the computational load, so people sometimes choose to shorten the planning horizon. While this might decrease the quality of individual plans, since we are constantly replanning, we can tolerate individual plans being less perfect.</p>
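<p>A minimal sketch of the MPC loop, reusing a planner like the <code>plan</code> function sketched above and assuming an environment that follows the gymnasium <code>reset</code>/<code>step</code> conventions (names are illustrative):</p>

<pre><code class="language-python">def run_mpc(env, plan_fn, n_steps, buffer):
    """MPC loop: replan at every step, execute only the first planned action,
    and aggregate the observed transitions for refitting the model."""
    s, _ = env.reset()
    for _ in range(n_steps):
        a_seq = plan_fn(s)        # plan a whole sequence through f_theta...
        a = a_seq[0].numpy()      # ...but execute only the first action
        s_next, r, terminated, truncated, _ = env.step(a)
        buffer.append((s, a, s_next))   # data for the next round of model fitting
        if terminated or truncated:
            s, _ = env.reset()
        else:
            s = s_next
    return buffer
</code></pre>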

<h2 id="2-uncertainty-aware-model-based-rl">2 Uncertainty-Aware Model-based RL</h2>
<p>Since we plan actions relying on the fitted dynamics, whether the dynamics model is a good representation of the world is crucial. When we use a high-capacity model like a neural network, we usually need to feed it a lot of data to get a good fit. But in model-based RL we usually don’t have much data at the beginning; in fact, we may only have bad data (generated by running some random policy). A neural network fit to this data will overfit it and will not represent the good parts of the world well. This leads the algorithm to take bad actions, which lead to bad states, which produce a dynamics model trained only on bad trajectories whose predictions on good states are unreliable, which again leads to bad actions, and so on. This looks like a chicken-and-egg problem, but if you think about it, the root of the issue is that planning on state predictions the model is not confident about leads to bad actions.</p>

<p>The solution is to quantify the uncertainty of the model, and take this uncertainty into consideration when planning.</p>

<p>First of all, it’s important to know that the uncertainty of a model is not the same thing as the probability the model assigns to some state prediction. Uncertainty here is not about the dynamics being noisy (aleatoric uncertainty), but about us not knowing what the dynamics are (epistemic uncertainty).</p>

<p>The way to avoid taking risky actions on uncertain states is to plan based on the <em>expected expected reward</em>. Wait, what is that? Yes, this is not a typo: the first expectation is with respect to the model uncertainty, and the second is with respect to the trajectory distribution. Mathematically, the objective is</p>

\[\begin{align}
&amp;\int \int \sum_t r(s_t, a_t) p_{\theta}(\tau) p(\theta) \text{d}\tau  \text{d}\theta  \\
&amp;= \int \left[\mathbb{E}_{\tau\sim p_{\theta}(\tau)}\sum_t r(s_t, a_t)\right] p(\theta)\text{d}\theta  \\
&amp;= \mathbb{E}_{\theta\sim p(\theta)}\left[ \mathbb{E}_{\tau\sim p_{\theta}(\tau)}\sum_t r(s_t, a_t) \right]
\end{align}\]

<p>Having an uncertainty-aware formulation, the next steps are:</p>

<ol>
  <li>how to get \(p(\theta)\)</li>
  <li>how to actually plan actions to optimize this objective</li>
</ol>

<h3 id="21-uncertainty-aware-neural-networks">2.1 Uncertainty-Aware Neural Networks</h3>
<p>In this subsection we discuss how to get \(p(\theta)\). First of all, the general direction is to learn \(p(\theta)\) from data, so we should explicitly write it as \(p(\theta\mid \mathcal{D})\), where \(\mathcal{D}\) is the data.</p>

<p>The first approach is <strong>Bayesian Neural Networks</strong>, or BNNs. To consider the problem from a Bayesian perspective, let’s first rethink our original approach: what is it that we are estimating when doing the supervised training in step 2 of MPC? (Here we write it slightly differently for illustration.)</p>

<blockquote>
  <p>learn dynamics model \(f_{\theta^*}(s,a)\) to minimize \(\sum_{i}\left\| s'_i - f_{\theta}(s_i, a_i) \right\|^2\)</p>
</blockquote>

<p>The \(\theta^*\) that we find is actually the maximum likelihood estimate, i.e.</p>

\[\theta^* = \text{argmax}_{\theta}p(\mathcal{D}\mid \theta)\]

<p>Adopting the Bayesian approach, we want to estimate the posterior distribution</p>

\[\begin{align*}
p(\theta\mid \mathcal{D}) 
=\frac{p(\mathcal{D}\mid \theta)p(\theta)}{p(\mathcal{D})}
\end{align*}\]

<p>However, this calculation is usually intractable. In the neural network setting, people usually resort to variational inference, which approximates the intractable true posterior \(p(\theta\mid \mathcal{D})\) by a tractable variational posterior \(p(\theta\mid\phi)\), obtained by minimizing the Kullback-Leibler (KL) divergence between the two, where \(\phi\) is learned from data. We will introduce variational inference in future lectures; for now, we give a simple hand-wavy example.</p>

<p>We define the variational posterior to be a fully factorized Gaussian:</p>

\[p(\theta \mid \phi) = \prod_j \mathcal{N}(\theta_j \mid \mu_j, \sigma^2_j)\]

<p>Where \(\mu_j\) and \(\sigma^2_j\) are learned such that the variational posterior is close to the true posterior. Then we use \(p(\theta \mid \phi)\) as the distribution over dynamics and take actions accordingly.</p>

<p>The second approach, which is conceptually simpler and usually works better than BNNs, is bootstrap ensembles. The idea is to train many independent neural dynamics models and average them. Mathematically, we learn independent neural network parameters \(\theta_1, \theta_2, \cdots, \theta_m\), and the ensembled posterior is</p>

\[p(\theta\mid \mathcal{D}) = \frac1m \sum_{j=1}^{m}\delta(\theta - \theta_j)\]

<p>Where \(\delta\) is the Dirac delta function, and the probability of state \(s_{t+1}\) under the dynamics ensemble is the average of the probabilities under each independent neural dynamics model:</p>

\[\begin{align}
\int p(s_{t+1}\mid s_t, a_t, \theta)p(\theta\mid \mathcal{D}) \text{d}\theta
= \frac1m \sum_{j=1}^m p(s_{t+1}\mid s_t, a_t, \theta_j)
\end{align}\]

<p>But how do we get the \(m\) independent neural dynamics models? We use the bootstrap. The idea is to resample the dataset \(\mathcal{D}\) with replacement to get \(m\) datasets, and train one dynamics model on each. The bootstrap method was developed by the statistician Bradley Efron, starting in 1979. It has a solid statistical foundation and has been applied in many areas; I encourage interested readers to check out <a href="https://cds.cern.ch/record/526679/files/0412042312_TOC.pdf">this book</a> by Efron and Tibshirani.</p>

<p>In practice, people find that for neural dynamics models it is not necessary to resample the data. What people do is train the networks on the same dataset but with different random seeds; the randomness of SGD (initialization and minibatch ordering) makes the resulting networks sufficiently independent. A sketch is given below.</p>
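<p>Here is a minimal sketch of such an ensemble in PyTorch; the <code>DynamicsNet</code> architecture and all hyperparameters are illustrative assumptions, and <code>states</code>, <code>actions</code>, <code>next_states</code> are assumed to be float tensors:</p>

<pre><code class="language-python">
import torch
import torch.nn as nn

# A sketch of an ensemble of neural dynamics models trained on the same
# data with different random seeds. The architecture and hyperparameters
# are illustrative; states, actions, next_states are float tensors.

class DynamicsNet(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim))

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))    # predicts s'

def train_ensemble(states, actions, next_states, m=5, epochs=100, lr=1e-3):
    models = []
    for seed in range(m):
        torch.manual_seed(seed)                       # different seed per member
        model = DynamicsNet(states.shape[-1], actions.shape[-1])
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        for _ in range(epochs):
            perm = torch.randperm(len(states))        # SGD shuffling adds diversity
            pred = model(states[perm], actions[perm])
            loss = ((pred - next_states[perm]) ** 2).mean()
            opt.zero_grad(); loss.backward(); opt.step()
        models.append(model)
    return models
</code></pre>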

<h3 id="22-plan-with-uncertainty">2.2 Plan with Uncertainty</h3>
<p>Having an uncertainty-aware dynamics model, i.e. a distribution over dynamics, it’s very natural to derive an <em>uncertainty-aware</em> MPC algorithm. Recall that in the MPC algorithm, we plan using the objective</p>

\[J(a_1, \cdots, a_T) = \sum_{t=1}^{T}r(s_t, a_t), \quad \text{where } s_t = f_{\theta}(s_{t-1}, a_{t-1})\]

<p>Now the objective has changed to</p>

\[\begin{align}\label{un_obj}
&amp;J(a_1, \cdots, a_T) = \frac1m \sum_{j=1}^{m}\sum_{t=1}^{T}r(s_{t,j}, a_t)\\
&amp;\text{ where } s_{t,j} = f_{\theta_j}(s_{t-1,j}, a_{t-1}) \text{ or } s_{t,j} \sim p(s_t\mid s_{t-1,j}, a_{t-1}, \theta_j)\\
&amp;\text{ and } \theta_j \sim p(\theta\mid \mathcal{D})
\end{align}\]

<p>With this, we can write out the uncertainty-aware MPC algorithm:</p>

<ol>
  <li>run base policy \(\pi_0(a\mid s)\) (e.g. random policy) to collect \(\mathcal{D} = \{ (s_i, a_i, s'_i) \}\)</li>
  <li>estimate the posterior distribution of dynamics parameters \(p(\theta\mid \mathcal{D})\)</li>
  <li>sample \(m\) dynamics from \(p(\theta\mid \mathcal{D})\)</li>
  <li>plan through the ensemble dynamics to choose actions.</li>
  <li>execute the first planned action and add the resulting transition \((s, a, s')\) to \(\mathcal{D}\). If we have reached the predefined maximal number of planning steps, go to step 2; otherwise, go to step 3.</li>
</ol>

<p>You might notice that this algorithm seems not to use the objective, i.e. equation \(\ref{un_obj}\). But at step 4 the algorithm is in fact planning based on equation \(\ref{un_obj}\); since the reward relies on the ensemble dynamics, we say for convenience “plan through the ensemble dynamics to choose actions”.</p>
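<p>Concretely, scoring one candidate action sequence under this objective amounts to averaging rollouts over the ensemble members. A minimal sketch, assuming deterministic members and a known <code>reward_fn</code>:</p>

<pre><code class="language-python">
import numpy as np

# A sketch of scoring one candidate action sequence under the ensemble
# objective: average the return over the m dynamics models. Each model
# maps (s, a) -&gt; s'; reward_fn is assumed known.

def ensemble_return(models, reward_fn, s0, actions):
    returns = []
    for f in models:                  # expectation over p(theta | D)
        s, total = s0, 0.0
        for a in actions:
            total += reward_fn(s, a)
            s = f(s, a)               # this member's predicted next state
        returns.append(total)
    return np.mean(returns)
</code></pre>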

<h2 id="3-model-based-rl-with-images">3 Model-Based RL with Images</h2>
<p>Previously we’ve been assuming that the state is observable, because we’ve been using transitions \(\{ (s_i, a_i, s'_i) \}\) for supervised learning of the dynamics (or of a distribution over dynamics). In some cases, especially when the observation is an image, directly treating it as the state for supervised learning of dynamics can be troublesome, for the following reasons:</p>

<ol>
  <li>High dimensionality. We are fitting \(s_{t+1} = f_{\theta}(s_t, a_t) \text{ or } s_{t+1} \sim p_{\theta}(s_{t+1}\mid s_t, a_t)\); if \(s_{t+1}\) is an image, then its dimension is \(3\times\text{H}\times\text{W}\), which can be very large, and accurate prediction becomes very difficult.</li>
  <li>Redundancy. Many parts of the images can stay unchanged during the whole process, which leads to redundancy in the data.</li>
  <li>Partial observability. There are things that static images cannot directly represent, such as velocity and acceleration; you might derive them from sequences of images, but that requires extra, potentially nontrivial effort and might not be accurate.</li>
</ol>

<p>We will now introduce the state-space model for POMDPs, which treats states as latent variables and models observations using distributions conditioned on states.</p>

<p>Let’s recall how the dynamics is learned when we assume states are observable. We parameterize the dynamics using a neural net with parameters \(\theta\):</p>

\[p(s_{1:T}) = \prod_{t=1}^{T}p_{\theta}(s_{t+1}\mid s_t, a_t)\]

<p>Note that we slightly abuse the notation for clarity, for example \(p_{\theta}(s_{1}\mid s_0, a_0) = p_{\theta}(s_1)\).</p>

<p>And we solve for \(\theta\) using maximum likelihood on the collected transitions \(\{ (s^i_{t+1}, s^i_t, a^i_t) \}_{i,t=1}^{N,T}\):</p>

\[\max_{\theta}\frac1N \sum_{i=1}^{N} \sum_{t=1}^{T} \log p_{\theta}(s^i_{t+1}\mid s^i_t, a^i_t)\]

<p>Now consider the state unobservable. We have:</p>

\[p(s_{1:T}, o_{1:T}) = \prod_{t=1}^{T}p_{\theta}(s_{t+1}\mid s_t, a_t)p_{\phi}(o_t\mid s_t)\]

<p>Where \(p_{\theta}(s_{t+1}\mid s_t, a_t)\) is the transition model and \(p_{\phi}(o_t\mid s_t)\) is the observation model. Similarly, we solve for \(\theta\) and \(\phi\) using maximum likelihood:</p>

\[\begin{align}
&amp;\log \prod_{t=1}^{T}  p_{\phi}(o_{t}\mid s_t) \nonumber \\
&amp;=\log \mathbb{E}_{(s_t, s_{t+1}) \sim p(s_t, s_{t+1}\mid o_{1:t}, a_{1:t})}\prod_{t=1}^{T}  p_{\theta}(s_{t+1}\mid s_t, a_t) p_{\phi}(o_{t}\mid s_t) \nonumber \\
&amp;\geq \mathbb{E}_{(s_t, s_{t+1}) \sim p(s_t, s_{t+1}\mid o_{t}, a_{t})} \log \prod_{t=1}^{T}  p_{\theta}(s_{t+1}\mid s_t, a_t) p_{\phi}(o_{t}\mid s_t) \nonumber \\
&amp;\approx \frac1N \sum_{i=1}^{N} \sum_{t=1}^{T} \log p_{\theta}(s^i_{t+1}\mid s^i_t, a^i_t)+ \log p_{\phi}(o^i_{t}\mid s^i_t) \label{latent_obj}
\end{align}\]

<p>We maximize equation \(\ref{latent_obj}\), which is a lower bound of the log likelihood. Note that it uses a single-sample estimate of the expectation (over \((s_t, s_{t+1})\)); more samples can be used.</p>

<p>One issue is that by Bayes’ rule,</p>

\[\begin{align}
&amp;p(s_t, s_{t+1}\mid o_{t}, a_{t}) \\
&amp;= p_{\theta}(s_{t+1}\mid s_t, a_t) p(s_t\mid o_t)  \\
&amp;= p_{\theta}(s_{t+1}\mid s_t, a_t) \frac{ p_{\phi}(o_t\mid s_t)p(s_t) }{p(o_t)}
\end{align}\]

<p>and \(p(s_t\mid o_t)\) is intractable. Thus we can learn another neural net \(q_{\psi}(s_t\mid o_t)\). A full treatment involves variational inference, which we will cover in future lectures. In this lecture, we simplify the problem and model the posterior of the state as a delta function, i.e. \(q_{\psi}(s_t\mid o_t) = \delta(s_t = g_{\psi}(o_t))\), which is just \(s_t = g_{\psi}(o_t)\).</p>

<p>Plugging this into the objective (equation \(\ref{latent_obj}\)), we have</p>

\[\begin{equation}\label{real_obj}
\frac1N \sum_{i=1}^{N} \sum_{t=1}^{T} \log p_{\theta}(g_{\psi}(o^i_{t+1})\mid g_{\psi}(o^i_t), a^i_t)+ \log p_{\phi}(o^i_{t}\mid g_{\psi}(o^i_t))
\end{equation}\]

<p>We maximize this to find \(\theta, \phi\) and \(\psi\). In case you are wondering: assuming \(s_t\) can be deterministically derived from \(o_t\) doesn’t mean \(p_{\phi}(o_{t}\mid s_t)\) is also a delta function, because \(g_{\psi}(\cdot)\) can be many-to-one, i.e. multiple observations can map to the same state.</p>
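<p>Here is a minimal sketch of this simplified objective, assuming Gaussian likelihoods (so the log-likelihoods reduce to squared errors up to constants) and illustrative MLP architectures for the encoder, decoder, and latent dynamics:</p>

<pre><code class="language-python">
import torch
import torch.nn as nn

# A sketch of the simplified latent state-space objective with a
# deterministic encoder s_t = g_psi(o_t). Squared errors stand in for
# Gaussian log-likelihoods (up to constants); the MLPs are illustrative.

class LatentModel(nn.Module):
    def __init__(self, obs_dim, state_dim, action_dim, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, state_dim))
        self.decoder = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, obs_dim))
        self.dynamics = nn.Sequential(nn.Linear(state_dim + action_dim, hidden),
                                      nn.ReLU(), nn.Linear(hidden, state_dim))

    def loss(self, o_t, a_t, o_next):
        s_t = self.encoder(o_t)                         # s_t = g_psi(o_t)
        s_next = self.encoder(o_next)                   # s_{t+1} = g_psi(o_{t+1})
        pred_next = self.dynamics(torch.cat([s_t, a_t], dim=-1))
        recon = self.decoder(s_t)                       # observation model
        dyn_loss = ((pred_next - s_next) ** 2).mean()   # log p_theta(s'|s,a)
        obs_loss = ((recon - o_t) ** 2).mean()          # log p_phi(o|s)
        return dyn_loss + obs_loss
</code></pre>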

<p>Lastly, if we want to plan using iLQR or to plan better in general, we usually also want to model the cost (reward) function, either deterministically as \(r_t = r_{\xi}(s_t, a_t)\) or stochastically as \(r_t \sim p_{\xi}(r_t\mid s_t, a_t)\). With the observed transitions and rewards \(\{ (o^i_t, a^i_t, r^i_t) \}_{i,t=1}^{N,T}\), and similarly to how we derived \(\ref{real_obj}\), we maximize the objective</p>

\[\frac1N \sum_{i=1}^{N} \sum_{t=1}^{T} \log p_{\theta}(s^i_{t+1}\mid s^i_t, a^i_t)+ \log p_{\phi}(o^i_{t}\mid s^i_t) + \log p_{\xi}(r^i_t\mid s^i_t, a^i_t)\]

<p>Finally, I want to point out that sometimes it’s difficult to build a compact state space for the observations, and directly modeling observations and predicting future observations can actually work better. That is, instead of learning a latent state \(s_t = g_{\psi}(o_t)\), we model \(p(o_{t+1} \mid o_t, a_t)\) and plan actions accordingly. We will not cover this branch here, and encourage interested readers to check out <a href="https://arxiv.org/pdf/1610.00696.pdf">Finn et al. 17’</a> and <a href="http://proceedings.mlr.press/v78/frederik%20ebert17a/frederik%20ebert17a.pdf">Ebert et al. 17’</a>; these two papers both directly model observations and plan actions using MPC.</p>

<h2 id="4-demo-embed-to-control-e2c">4 Demo: <a href="https://arxiv.org/pdf/1506.07365.pdf">Embed to Control (E2C)</a></h2>
<iframe width="1424" height="652" src="https://www.youtube.com/embed/fyQ8tY0iaRI" title="YouTube video player" frameborder="0" allow="accelerometer; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>]]></content><author><name>Puyuan Peng</name></author><category term="DeepRL" /><category term="RL" /><category term="Notes" /><summary type="html"><![CDATA[Previous lecture is mainly about how to plan actions to take when the dynamics is known. In this lecture, we study how to learn the dynamics. We will also introduce how to incorporate planning in the model learning process and therefore form a complete decision making algorithm.]]></summary></entry><entry><title type="html">Deep RL 9 Model-based Planning</title><link href="https://jasonppy.github.io/deeprl/deeprl-9-model-based-planning/" rel="alternate" type="text/html" title="Deep RL 9 Model-based Planning" /><published>2021-04-29T00:00:00-04:00</published><updated>2021-04-29T00:00:00-04:00</updated><id>https://jasonppy.github.io/deeprl/deeprl-9-model-based-planning</id><content type="html" xml:base="https://jasonppy.github.io/deeprl/deeprl-9-model-based-planning/"><![CDATA[<p>Let’s recall the reinforcement learning goal — we want to maximaze the expected reward (or expected discounted reward in the infinite horizon case)</p>

\[\begin{equation}
\mathbb{E}_{\tau\sim p(\tau)}\sum_{t=1}^T r(s_t, a_t)
\end{equation}\]

<p>where</p>

\[\begin{equation}
p(\tau) = p(s_1)\prod_{t=1}^{T}p(s_{t+1}\mid s_t, a_t)\pi(a_t\mid s_t)
\end{equation}\]

<p>In most methods that we’ve introduced so far, such as policy gradient, actor-critic, Q-learning, etc. the transition dynamics \(p(s_{t+1}\mid s_t, a_t)\) is assumed to be unknown. But in many cases, the dynamics is actually known to us, such as the game of Go (we know what the board will look like after we make a move), Atari games, car navigation, anything in simulated environments (although we may not want to utilize the dynamics in this case) etc.</p>

<p>Knowing the dynamics provides additional information, which in principle should improve the actions we take. In this lecture, we study how to plan actions to maximize the expected reward when the dynamics is known. We will mostly study deterministic dynamics, i.e. \(s_{t+1} = f(s_t, a_t)\), although we will also generalize some methods to stochastic dynamics, i.e. \(s_{t+1} \sim p(s_{t+1}\mid s_t, a_t)\).</p>

<h2 id="1-open-loop-planning">1 Open-loop planning</h2>
<p>If we know the deterministic dynamics, then given the first state \(s_1\), we should be able to know all the remaining states given the action sequence (and therefore the rewards). Open-loop planning aims at directly giving an optimal action sequence without waiting for the trajectory to unfold.</p>
<div align="center"><img src="../assets/images/285-10-open.png" width="700" /></div>

<p>Below we introduce two methods that completely ignore feedback control and optimize the objective as a black box; that is to say, the methods do not even utilize the known dynamics. For simplicity, let’s write the objective, i.e. the expected return, as \(J(\mathbf{A})\), where \(\mathbf{A} := a_1, a_2, \cdots, a_T\). The goal is to find the \(\mathbf{A^*}\) that maximizes this objective.</p>

<p>The first method is called random shooting, which can be explained in one line: randomly sample \(\mathbf{A_1}, \mathbf{A_2}, \cdots, \mathbf{A_N}\) from some distribution (e.g. uniform) and then choose the one that gives the highest \(J(\mathbf{A_i})\) as \(\mathbf{A^*}\).</p>

<p>Random shooting seems like a bad idea, but it actually works well on some low-action-dimension, short-horizon problems. And it’s very easy to implement and parallelize.</p>

<p>However, this is still an overly simple method that relies entirely on luck. One method that can dramatically improve on random shooting while maintaining its benefits is the cross-entropy method, or CEM. Below is the CEM algorithm (a code sketch follows after the discussion):</p>

<ol>
  <li>
    <p>Initialize the actions sequence distribution \(p(\mathbf{A})\)</p>
  </li>
  <li>
    <p>sample \(\mathbf{A_1}, \mathbf{A_2}, \cdots, \mathbf{A_N}\) from \(p(\mathbf{A})\)</p>
  </li>
  <li>
    <p>evaluate \(J(\mathbf{A_1}), J(\mathbf{A_2}), \cdots, J(\mathbf{A_N})\)</p>
  </li>
  <li>
    <p>pick the elites \(\mathbf{A_{i_1}}, \mathbf{A_{i_2}}, \cdots, \mathbf{A_{i_M}}\) with the highest value, where \(M &lt; N\)</p>
  </li>
  <li>
    <p>refit \(p(\mathbf{A})\) to the elites. Go to 2.</p>
  </li>
</ol>

<p>Where setting \(M\) to \(10\%\) of \(N\) is usually a good choice. The key to CEM is that the action distribution is constantly updated based on the action evaluations. This helps the algorithm find, and concentrate probability mass on, regions where actions are more likely to give high value.</p>

<p>Similar to random shooting, CEM is easy to implement and parallelize, but it also has harsh dimensionality limits (action-space dimension times horizon). The exact limit obviously depends on the problem, but generally these methods cannot go beyond roughly \(60\) dimensions, e.g. an action dimension of \(5\) with a time horizon of \(12\).</p>
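<p>Here is a minimal CEM sketch, assuming \(p(\mathbf{A})\) is a diagonal Gaussian over flattened action sequences and \(J\) is an assumed callable that scores a sequence (e.g. by rolling it out through the known dynamics):</p>

<pre><code class="language-python">
import numpy as np

# A minimal CEM sketch: p(A) is a diagonal Gaussian over flattened action
# sequences, and J is an assumed callable that scores a sequence.

def cem(J, dim, n_iters=10, n_samples=1000, elite_frac=0.1):
    mu, sigma = np.zeros(dim), np.ones(dim)                # initialize p(A)
    n_elite = int(elite_frac * n_samples)
    for _ in range(n_iters):
        A = np.random.randn(n_samples, dim) * sigma + mu   # sample A_1..A_N
        scores = np.array([J(a) for a in A])               # evaluate J(A_i)
        elites = A[np.argsort(scores)[-n_elite:]]          # top M sequences
        mu, sigma = elites.mean(0), elites.std(0) + 1e-6   # refit p(A)
    return mu                                              # approximately A*
</code></pre>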

<h2 id="2-monte-carlo-tree-search-mcts">2 Monte Carlo Tree Search (MCTS)</h2>
<p>In this section we introduce the famous Monte Carlo Tree Search algorithm, or MCTS, which has been used in <a href="https://deepmind.com/research/case-studies/alphago-the-story-so-far">AlphaGo</a>. MCTS is used when the action space is discrete.</p>

<p>We formulate the problem of planning as a tree search, where the nodes are states and taking different actions makes the tree branch out to different nodes. Note that the transitions can be stochastic and the state space can be continuous; in fact, we don’t worry too much about the actual state but only focus on the time step of a state, i.e. \(s_t\) can represent different states at time step \(t\).</p>

<p>Starting from the initial state \(s_1\), a naive idea is to try every action at every state and collect the rewards. After the tree is fully unfolded, pick the path that gives the biggest reward.</p>

<div align="center"><img src="../assets/images/285-9-naive.png" width="700" /></div>

<p>However, this is prohibitively expensive, as the computational complexity is \(O(\lvert\mathcal{A}\rvert^{T})\). MCTS is a heuristic method that can approximate state-action values without fully expanding the tree. The algorithm is the following:</p>

<ol>
  <li>
    <p>Choose a leaf node \(s_l\) by applying TreePolicy recursively from \(s_1\)</p>
  </li>
  <li>
    <p>Run DefaultPolicy(\(s_{l}\)) and evaluate the value of \(s_l\)</p>
  </li>
  <li>
    <p>Update all values in tree between \(s_1\) and \(s_l\). While within the computational budget, go back to step 1.</p>
  </li>
</ol>

<p>When the algorithm is done, we take the best action starting from \(s_1\).</p>

<p>Now let’s first explain in detail what each step means, and then we will show an example of how MCTS works.</p>

<p><strong>Step 1</strong>. The TreePolicy is basically a node selection strategy. We start from \(s_1\) and recursively apply it to descend through the tree until we find a node that satisfies the strategy, and select that node. While there are many strategies, we only introduce the most popular one, namely Upper Confidence Bounds for Trees, or the UCT policy. UCT(\(s_t\)) works as follows: if \(s_t\) is not fully expanded, i.e. there are actions from \(s_t\) that we haven’t taken, take such an action (if there are several, randomly choose one); otherwise, choose the child node \(s_{t+1}\) with the best score \(\text{Score}(s_{t+1})\), defined as</p>

\[\begin{equation}\label{score}
\text{Score}(s_{t+1}) = \frac{Q(s_{t+1})}{N(s_{t+1})} + 2C \sqrt{\frac{2\ln N(s_t)}{N(s_{t+1})}}
\end{equation}\]

<p>Where \(Q(s_{t+1})\) is the value of the node \(s_{t+1}\). Note that this is not the Q-function we’ve defined previously in this course, but an accumulated value: every time we evaluate the node itself or one of its descendants, we add the value to it. For example, for node \(s_{t+1}\), if we evaluate the node itself to be \(10\) and later on in the algorithm we evaluate two of its descendants to be \(5\) and \(11\), then \(Q(s_{t+1}) = 10 + 5 + 11 = 26\). \(N(s_{t+1})\) is the number of times the node has been visited; in this example, \(N(s_{t+1})\) is \(3\).</p>

<p>Equation \(\ref{score}\) is very intuitive. The first term measures the estimated value of the node; the second term measures how rarely the node has been visited: if \(N(s_t)\) is big while \(N(s_{t+1})\) is small, then many visits to \(s_t\) have not been passed down to \(s_{t+1}\) but to its siblings, which indicates that we might want to visit \(s_{t+1}\) more often.</p>

<p><strong>Step 2</strong>. When we decide to take some action and arrive at node \(s_l\), we run the DefaultPolicy from this state (until the episode terminates) and collect the reward (this is what we call evaluating the value of the node).</p>

<p><strong>Step 3</strong>. We add the reward to the value \(Q\) of every node along the path that we followed to get to node \(s_l\), and also update the \(N\) of each node along the path.</p>
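<p>The bookkeeping of these steps can be sketched as follows; the <code>Node</code> class, with fields <code>Q</code>, <code>N</code>, <code>parent</code>, <code>children</code>, and <code>untried_actions</code>, is an assumption for illustration, and the DefaultPolicy rollout is left abstract:</p>

<pre><code class="language-python">
import math

# A sketch of the MCTS bookkeeping. Node is an assumed class with fields
# Q (accumulated value), N (visit count), parent, children, and
# untried_actions (actions not yet expanded from this node).

def uct_score(parent, child, c=1.0):
    # Average value (exploitation) plus a bonus that grows when the child
    # has been visited rarely relative to its parent (exploration).
    return child.Q / child.N + 2 * c * math.sqrt(2 * math.log(parent.N) / child.N)

def tree_policy(root):
    """Step 1: descend until we reach a node that still has untried actions."""
    node = root
    while not node.untried_actions and node.children:
        node = max(node.children, key=lambda ch: uct_score(node, ch))
    return node

def backup(node, value):
    """Step 3: add the rollout value and one visit to every node on the path."""
    while node is not None:
        node.N += 1
        node.Q += value
        node = node.parent
</code></pre>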

<p>Here is the illustration by Prof. Sergey Levine; the illustration starts at 16:50.</p>

<iframe width="1904" height="832" src="https://www.youtube.com/embed/pd9mKcH4kkk?list=PL_iWQOsE6TfURIIhCrlt-wj9ByIVpbfGc" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>

<h2 id="3-linear-quadratic-regulator-lqr">3 Linear Quadratic Regulator (LQR)</h2>
<p>You might notice that the methods introduced in the previous two sections actually do not require a known dynamics. In this section, we will finally introduce methods that do require and utilize a known dynamics. Since the methods in this section are mostly studied in the optimal control community, we will follow their notation and denote the action as \(u_t\), the state as \(x_t\), the dynamics as \(x_{t+1} = f(x_t, u_t)\) or \(x_{t+1}\sim p(x_{t+1}\mid x_t, u_t)\), and the cost as \(c(x_t, u_t)\). This is the first time the term “cost” appears in this series, but it’s really just the opposite of reward: reward measures how <em>good</em> a state-action pair is, and cost measures how <em>bad</em> a state-action pair is. Note that, different from the classic RL setting, in addition to the dynamics we also assume the cost function is known.</p>

<p>Similar to policy gradient methods, we aim at directly minimizing the sum of cost:</p>

\[\begin{align}
&amp;\min_{u_1,\cdots, u_T}\sum_{t=1}^{T}c(x_t, u_t) \\
&amp;\text{s.t. } x_{t+1} = f(x_t, u_t)
\end{align}\]

<p>We can actually incorporate the constraint into the objective and make it an unconstrained optimization problem:</p>

\[\begin{align}\label{obj}
&amp;\min_{u_1,\cdots, u_T}\, c(x_1, u_1) + c(f(x_1,u_1), u_2) + \cdots + c(f(f(\cdots)\cdots),u_T)
\end{align}\]

<p>The Linear Quadratic Regulator, or LQR, further simplifies this problem by assuming linear dynamics and quadratic cost:</p>

\[\begin{equation}
\begin{aligned}
  &amp; f(x_t, u_t) = F_t
  \begin{bmatrix}
    x_t \\
    u_t
  \end{bmatrix} + f_t \\
  &amp; c(x_t, u_t) = \frac12
  \begin{bmatrix}
  x_t \\
  u_t
  \end{bmatrix}^T C_t
  \begin{bmatrix}
  x_t \\
  u_t
  \end{bmatrix} +
  \begin{bmatrix}
  x_t \\
  u_t
  \end{bmatrix}^T c_t
\end{aligned}
\end{equation}\]

<p>Note that \(F_t, f_t, C_t, c_t\) are all known quantities.</p>

<p>To solve LQR, the simplest method is just to take the derivatives of the objective w.r.t. the actions and set them to \(0\). But this is numerically very unstable, because the sensitivities of the cost to actions at different time steps are very different; for example, the first action appears in every term of the objective and has a huge effect on the total cost, while the last action has a very small effect.</p>

<p>We introduce a stable iterative method to solve LQR. We start from the last action \(u_T\), since it doesn’t affect previous states and also has no effect on future states (there is no future state!). Treating all terms that are not affected by \(u_T\) as constant, we can write the cost as</p>

\[\begin{equation}
\begin{aligned}
Q(x_T, u_T) = \text{const.}
+ \frac12 
\begin{bmatrix}
x_T \\
u_T
\end{bmatrix}^T C_T
\begin{bmatrix}
x_T \\
u_T
\end{bmatrix} +
\begin{bmatrix}
x_T \\
u_T
\end{bmatrix}^T c_T
\end{aligned}
\end{equation}\]

<p>take the derivative</p>

\[\begin{align*}
&amp;\nabla_{u_T}Q(x_T, u_T) = C_{u_T, x_T}x_T + C_{u_T, u_T}u_T + c_{u_T} = 0 \\ 
&amp;\Rightarrow u_T = -C_{u_T, u_T}^{-1}(C_{u_T, x_T}x_T + c_{u_T})
\end{align*}\]

<p>where</p>

\[\begin{equation}
\begin{aligned}
C_T = 
\begin{bmatrix}
C_{x_T, x_T} &amp; C_{x_T, u_T}\\
C_{u_T, x_T} &amp; C_{u_T, u_T}
\end{bmatrix} \quad
c_T = 
\begin{bmatrix}
c_{x_T} \\
c_{u_T}
\end{bmatrix}
\end{aligned}
\end{equation}\]

<p>To better see the pattern (useful for later derivation), we denote</p>

\[\begin{align*}
&amp;K_T = -C_{u_T, u_T}^{-1}C_{u_T, x_T} \\
&amp;k_T = - C_{u_T, u_T}^{-1}c_{u_T}
\end{align*}\]

<p>and write \(u_T\) as</p>

\[\begin{equation}\label{xt}
u_T = K_Tx_T + k_T
\end{equation}\]

<p>This equation shows that the optimal \(u_T\) is a linear function of \(x_T\).</p>

<p>Our goal is to represent \(u_t\)’s using \(x_t\)’s and then once we have the first state \(x_1\), we can get \(u_1\), and then via the dynamics we have \(x_2\) and then \(u_2\) etc. This way, we can get all the actions (and states).</p>

<p>Now let’s try to represent the optimal \(u_{T-1}\) using \(x_{T-1}\). Note that \(u_{T-1}\) can only affect \(x_T, u_T\), so we can treat all terms that are not affected by \(u_{T-1}\) as constant and write the objective as</p>

\[\begin{equation}
\begin{aligned}
&amp;Q(x_{T-1}, u_{T-1}) \\ &amp;= \text{const.}+
\frac12 
\begin{bmatrix}
x_{T-1} \\
u_{T-1}
\end{bmatrix}^T C_{T-1}
\begin{bmatrix}
x_{T-1} \\
u_{T-1}
\end{bmatrix} +
\begin{bmatrix}
x_{T-1} \\
u_{T-1}
\end{bmatrix}^T c_{T-1} \\
 +&amp; \frac12 
\begin{bmatrix}
x_T \\
K_Tx_T + k_T
\end{bmatrix}^T
C_T
\begin{bmatrix}
x_T \\
K_Tx_T + k_T
\end{bmatrix}
+
\begin{bmatrix}
x_T \\
K_Tx_T + k_T
\end{bmatrix}^T
c_T \\
&amp;=\text{const.}+\frac12 
\begin{bmatrix}
x_{T-1} \\
u_{T-1}
\end{bmatrix}^T C_{T-1}
\begin{bmatrix}
x_{T-1} \\
u_{T-1}
\end{bmatrix} +
\begin{bmatrix}
x_{T-1} \\
u_{T-1}
\end{bmatrix}^T c_{T-1} + \frac12 x_T^TV_Tx_T + x^T_T v_T
\end{aligned}
\end{equation}\]

<p>Where \(V_T, v_T\) are terms that depend on \(C_T, c_T\) only. We can see that this is again a sum of linear and quadratic terms in \(x_{T-1}, u_{T-1}\).</p>

<p>We can take the derivative of it w.r.t. \(u_{T-1}\) and set it to \(0\). We will get:</p>

\[\begin{equation}\label{xt-1}
u_{T-1} = K_{T-1}x_{T-1} + k_{T-1}
\end{equation}\]

<p>Where \(K_{T-1}\) and \(k_{T-1}\) are functions of \(F_{T-1}, f_{T-1}, C_{T-1}, c_{T-1}, V_T, v_T\). The expression is a bit hairy, but the important thing to know is that \(K_{T-1}\) and \(k_{T-1}\) are known quantities.</p>

<p>Therefore we have shown that we can always represent \(u_t\) as a linear function of \(x_t\).</p>

<p>The full algorithm consists of first starting from time step \(T\) and going backward to represent each \(u_t\) using \(x_t\), and then running forward from time step \(1\) to get the state and action at every time step.</p>

<p>Concretely, the backward iteration is</p>

<div align="center"><img src="../assets/images/285-9-lqrb.png" width="700" /></div>

<p>And the forward iteration is</p>

<div align="center"><img src="../assets/images/285-9-lqrf.png" width="500" /></div>

<h2 id="4-lqr-for-stochastic-and-nonlinear-systems">4 LQR for Stochastic and Nonlinear Systems</h2>
<h3 id="41-guassian-dynamics">4.1 Guassian Dynamics</h3>
<p>When the dynamics is stochastic, we want to minimize the expected cost</p>

\[\begin{equation}\label{sto}
\min_{u_1, \cdots, u_T}\mathbb{E}\sum_{t=1}^{T}c(x_t, u_t)
\end{equation}\]

<p>Where the expectation is taken w.r.t. the dynamics \(p(x_{t+1}\mid x_t, u_t)\).</p>

<p>Here we briefly introduce applying LQR in a special case of stochastic dynamics: Gaussian linear dynamics</p>

\[\begin{align*}
p(x_{t+1}\mid x_t, u_t) = \mathcal{N}(
  F_t \begin{bmatrix} x_t \\ u_t \end{bmatrix} + f_t, \Sigma_t
)
\end{align*}\]

<p>It turns out that if the cost is still quadratic in state and action, the objective in equation \(\ref{sto}\) can be solved analytically and we can apply the same iterative procedure and actually get the same solution \(u_t = K_t x_t + k_t\). Details are left to the readers.</p>

<h3 id="42-iterative-lqr-ilqr-for-nonlinear-systems">4.2 Iterative LQR (iLQR) for Nonlinear Systems</h3>
<p>Now we get rid of the assumption that the dynamics is linear and cost is quadratic.</p>

<p>We can use a first-order Taylor expansion to approximate the dynamics as</p>

\[\begin{align}
f(x_t, u_t) \approx f(\hat{x}_t, \hat{u}_t) + \nabla_{x_t, u_t}f(\hat{x}_t, \hat{u}_t)
\begin{bmatrix}
x_t - \hat{x}_t \\
u_t - \hat{u}_t
\end{bmatrix}
\end{align}\]

<p>and a second-order Taylor expansion to approximate the cost as</p>

\[\begin{align}
c(x_t, u_t) \approx c(\hat{x}_t, \hat{u}_t) + \nabla_{x_t, u_t}c(\hat{x}_t, \hat{u}_t)
\begin{bmatrix}
x_t - \hat{x}_t \\
u_t - \hat{u}_t
\end{bmatrix} \\
+
\frac12 
\begin{bmatrix}
x_t - \hat{x}_t \\
u_t - \hat{u}_t
\end{bmatrix}^T 
\nabla_{x_t, u_t}^2 c(\hat{x}_t, \hat{u}_t)
\begin{bmatrix}
x_t - \hat{x}_t \\
u_t - \hat{u}_t
\end{bmatrix}
\end{align}\]

<p>Denote</p>

\[\begin{align}
\delta x_t = x_t - \hat{x}_t \\
\delta u_t = u_t - \hat{u}_t \\
f_t = f(\hat{x}_t, \hat{u}_t) \\
F_t = \nabla_{x_t, u_t}f(\hat{x}_t, \hat{u}_t) \\
c_t = \nabla_{x_t, u_t}c(\hat{x}_t, \hat{u}_t)  \\
C_t = \nabla_{x_t, u_t}^2 c(\hat{x}_t, \hat{u}_t)
\end{align}\]

<p>There is no need to worry about the constant term \(c(\hat{x}_t, \hat{u}_t)\) in the cost approximation, as it will disappear when we take the derivative, i.e. it will not affect the solution.</p>

<p>We can first randomly pick a sequence of actions as the \(\hat{u}_t\)’s and then get the states \(\hat{x}_t\) by running them through the true dynamics. Then, we run the backward and forward LQR passes on</p>

\[\begin{equation}
\begin{aligned}
  &amp; f(\delta x_t, \delta u_t) = F_t
  \begin{bmatrix}
    \delta x_t \\
    \delta u_t
  \end{bmatrix} + f_t \\
  &amp; c(\delta x_t, \delta u_t) = \frac12
  \begin{bmatrix}
  \delta x_t \\
  \delta u_t
  \end{bmatrix}^T C_t
  \begin{bmatrix}
  \delta x_t \\
  \delta u_t
  \end{bmatrix} +
  \begin{bmatrix}
  \delta x_t \\
  \delta u_t
  \end{bmatrix}^T c_t
\end{aligned}
\end{equation}\]

<p>which gives \(\delta x_t, \delta u_t\); adding them to \(\hat{x}_t, \hat{u}_t\) gives the new \(x_t\)’s and \(u_t\)’s, which we then denote as \(\hat{x}_t, \hat{u}_t\), and we repeat the process. Putting it all in one place, the algorithm is the following:</p>

<div align="center"><img src="../assets/images/285-9-ilqr.png" width="700" /></div>

<p>Note that in the forward pass of LQR, we use the true dynamics, rather than the linear approximation, to get the states. When the \(\hat{x}_t, \hat{u}_t\)’s are very close to the \(x_t, u_t\)’s newly obtained from the current LQR forward iteration, we say the algorithm has converged.</p>

<p>This algorithm is very similar to Newton’s method; in fact, the only difference is that Newton’s method would approximate the dynamics using a second-order Taylor expansion.</p>

<p>Since we are using approximations, too big a step in the update may lead to worse results, due to the approximations being inaccurate far from the expansion point. To remedy this, when running the forward pass to get \(u_t\), we introduce a parameter \(\alpha\) and change the update rule to</p>

\[\begin{equation}
u_t = K_t(x_t - \hat{x}_t) + \alpha k_t  + \hat{u}_t
\end{equation}\]

<p>\(\alpha\) controls the step size of the update (how much \(u_t\) will deviate from \(\hat{u}_t\)), and we can perform a line search over \(\alpha\) until we see an improvement in the cost.</p>
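<p>A minimal sketch of this search, where <code>rollout(alpha)</code> is an assumed helper that runs the forward pass with the given \(\alpha\) through the true dynamics and returns the total cost:</p>

<pre><code class="language-python">
# A sketch of the backtracking search over alpha in the iLQR forward pass.
# rollout(alpha) is an assumed helper that runs the forward pass through
# the true dynamics with the given alpha and returns the total cost.

def line_search(rollout, current_cost, alphas=(1.0, 0.5, 0.25, 0.1, 0.05)):
    for alpha in alphas:               # try progressively smaller steps
        cost = rollout(alpha)
        if cost &lt; current_cost:        # accept the first improving step
            return alpha, cost
    return 0.0, current_cost           # no improvement: keep the old trajectory
</code></pre>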

<h2 id="5-demo-autonomous-helicopter-stanford-and-complex-behaviour-sythesis-uwashington">5 Demo: Autonomous Helicopter (Stanford) and Complex Behaviour Sythesis (UWashington)</h2>
<iframe width="1424" height="652" src="https://www.youtube.com/embed/Idn10JBsA3Q" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>
<iframe width="1424" height="652" src="https://www.youtube.com/embed/anIsw2-Lbco" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>]]></content><author><name>Puyuan Peng</name></author><category term="DeepRL" /><category term="RL" /><category term="Notes" /><summary type="html"><![CDATA[Let’s recall the reinforcement learning goal — we want to maximaze the expected reward (or expected discounted reward in the infinite horizon case)]]></summary></entry><entry><title type="html">Deep RL 8 Advanced Policy Gradient</title><link href="https://jasonppy.github.io/deeprl/deeprl-8-advanced-policy-gradient/" rel="alternate" type="text/html" title="Deep RL 8 Advanced Policy Gradient" /><published>2021-04-25T00:00:00-04:00</published><updated>2021-04-25T00:00:00-04:00</updated><id>https://jasonppy.github.io/deeprl/deeprl-8-advanced-policy-gradient</id><content type="html" xml:base="https://jasonppy.github.io/deeprl/deeprl-8-advanced-policy-gradient/"><![CDATA[<p>At the end of previous lecture, we talked about the issues with Q-learning, one of them is that it’s not directly optimizing the expected return and it can take a long time before the return starts to improve. On the other hand, policy gradient methods are direclty optimizing the expected return, although we cannot guarantee that the return will improve every gardient update. At the same time, we know that classic policy iteration can improve the expected return at each iteration, but this method cannot be applied to large scale problems.</p>

<p>In this section, we derive stable policy gradient methods by first framing them as policy iteration.</p>

<h2 id="1-policy-gradient-as-policy-iteration">1 Policy Gradient as Policy Iteration</h2>
<p>Let’s write down the difference between the expected return under the previous policy \(\pi_{\theta}\) (which we will also denote \(q\)) and under the new (updated) policy \(\pi_{\theta'}\):</p>

\[\begin{align}
&amp;J(\theta') - J(\theta)\\
&amp;= \mathbb{E}_{\tau \sim p_{\theta'}(\tau)}\left[ \sum_t \gamma^tr(s_t, a_t) \right] - \mathbb{E}_{\tau \sim p_{\theta}(\tau)}\left[ \sum_t \gamma^tr(s_t, a_t) \right] \\
&amp;= \mathbb{E}_{\tau \sim p_{\theta'}(\tau)}\left[ \sum_t \gamma^tr(s_t, a_t) \right] - \mathbb{E}_{s_0 \sim p(s_0)}\left[ V^{q}(s_0) \right] \\
&amp;= \mathbb{E}_{\tau \sim p_{\theta'}(\tau)}\left[ \sum_t \gamma^tr(s_t, a_t) \right] - \mathbb{E}_{\tau \sim p_{\theta'}(\tau)}\left[ V^{q}(s_0) \right] \\
&amp;= \mathbb{E}_{\tau \sim p_{\theta'}(\tau)}\left[ \sum_t \gamma^tr(s_t, a_t) \right] - \mathbb{E}_{\tau \sim p_{\theta'}(\tau)}\left[ V^{q}(s_0) + \sum_{t=1}^{\infty}\gamma^{t}V^{q}(s_{t}) - \sum_{t=1}^{\infty}\gamma^{t}V^{q}(s_{t}) \right] \\
&amp;= \mathbb{E}_{\tau \sim p_{\theta'}(\tau)}\left[ \sum_t \gamma^tr(s_t, a_t) \right] + \mathbb{E}_{\tau \sim p_{\theta'}(\tau)}\left[ \sum_t\gamma^{t}(\gamma V^{q}(s_{t+1}) - V^{q}(s_{t})) \right] \\
&amp;= \mathbb{E}_{\tau \sim p_{\theta'}(\tau)}\left[ \sum_t \gamma^t(r(s_t, a_t) + \gamma V^{q}(s_{t+1}) - V^{q}(s_{t})) \right] \\
&amp;= \mathbb{E}_{\tau \sim p_{\theta'}(\tau)}\left[ \sum_t \gamma^t A^{\pi_{\theta}}(s_t, a_t) \right] 
\end{align}\]

<p>Here we have proved an interesting equality:</p>

\[\begin{equation}\label{diff}
J(\theta') - J(\theta) = \mathbb{E}_{\tau \sim p_{\theta'}(\tau)}\left[ \sum_t \gamma^t A^{\pi_{\theta}}(s_t, a_t) \right] 
\end{equation}\]

<p>The difference in expected return equals the expected advantage of the previous policy \(q\) under the trajectory distribution of the new policy \(\pi_{\theta'}\).</p>

<p>Note that we haven’t done anything policy-gradient specific, so this equality is universal. We can use it to understand why policy iteration improves the expected return at every iteration, i.e. why \(J(\theta') - J(\theta) \geq 0\): in policy iteration, the policy is deterministic and updated as \(\pi'(s) = \text{argmax}_a A^{\pi}(s_t, a_t)\). Therefore, when the \(s_t, a_t\) come from the new policy \(\pi'\), we always have \(A^{\pi_{\theta}}(s_t, a_t) \geq 0\), and thus \(J(\theta') - J(\theta) \geq 0\).</p>

<p>Now let’s consider how to get this monotonic improvement in expected return in policy gradient methods. Well, it cannot be guaranteed theoretically, because we need to introduce some approximation in order to derive a policy gradient algorithm from equation \(\ref{diff}\). Nevertheless, the resulting method, TRPO, is the first stable RL algorithm in the sense that the return improves gradually during training (whereas another popular method at the time, DQN, is very unstable).</p>

<h2 id="2-trust-region-policy-optimization-trpo-setup">2 Trust Region Policy Optimization (TRPO) Setup</h2>
<p>As a policy gradient method, TRPO aims at directly maximizing equation \(\ref{diff}\), but this cannot be done directly, because the trajectory distribution is under the new policy \(\pi_{\theta'}\), while the sample trajectories that we have can only come from the previous policy \(q\).</p>

<p>This might remind you of the importance sampling we used to derive the off-policy policy gradient. Indeed, we will rewrite equation \(\ref{diff}\) using importance sampling:</p>

\[\begin{align}
&amp;J(\theta') - J(\theta) \\
&amp;= \mathbb{E}_{\tau \sim p_{\theta'}(\tau)}\left[ \sum_t \gamma^t A^{\pi_{\theta}}(s_t, a_t) \right] \\
&amp;= \sum_t\mathbb{E}_{s_t\sim p_{\theta'}(s_t)}\left[ \mathbb{E}_{a_t \sim \pi_{\theta'}} \gamma^t A^{\pi_{\theta}}(s_t, a_t)\right]\\
&amp;= \sum_t\mathbb{E}_{s_t\sim p_{\theta'}(s_t)}\left[ \mathbb{E}_{a_t \sim q} \frac{\pi_{\theta'}(a_t\mid s_t)}{q(a_t\mid s_t)}\gamma^t A^{\pi_{\theta}}(s_t, a_t)\right] \label{diff_importance}
\end{align}\]

<p>However, even though we no longer need to sample from \(p_{\theta'}(\tau)\), sampling from \(p_{\theta'}(s_t)\) is still impossible. A natural question is: can we just use \(p_{\theta}(s_t)\)? I.e., can we approximate the equation above by</p>

\[\begin{align}
&amp;\approx \sum_t\mathbb{E}_{s_t\sim p_{\theta}(s_t)}\left[ \mathbb{E}_{a_t \sim q} \frac{\pi_{\theta'}(a_t\mid s_t)}{q(a_t\mid s_t)}\gamma^t A^{\pi_{\theta}}(s_t, a_t)\right] \\
&amp;= \mathbb{E}_{\tau \sim p_{\theta}(\tau)}\left[ \sum_t \frac{\pi_{\theta'}(a_t\mid s_t)}{q(a_t\mid s_t)}\gamma^t A^{\pi_{\theta}}(s_t, a_t) \right] \label{final}
\end{align}\]

<p>Equation \(\ref{final}\) leads to almost the same gradient as the off-policy policy gradient, but with the reward \(r(s_t, a_t)\) replaced by the advantage \(A^{\pi_{\theta}}(s_t, a_t)\). And you might remember that back then we also used \(p_{\theta}(s_t)\) to approximate \(p_{\theta'}(s_t)\), and briefly mentioned that this approximation error “is bounded when the <strong>gap</strong> between \(q\) and \(\pi_{\theta'}\) is not too big”.</p>

<p>Now let’s try to quantitatively characterize the <strong>gap</strong> between \(q\) and \(\pi_{\theta'}\). The first quantitative gap was actually introduced in lecture 2, where we derived the error bound of DAgger for imitation learning: we say \(\pi_{\theta'}\) is close to \(\pi_{\theta}\) if</p>

\[\begin{equation}\label{cond1}\left| \pi_{\theta'}(a_t\mid s_t) - \pi_{\theta}(a_t\mid s_t)\right|&lt; \epsilon, \forall s_t\end{equation}\]

<p>This will give</p>

\[\begin{align*}
&amp;\left| p_{\theta'}(s_t) - p_{\theta}(s_t) \right|\\
&amp;= \left| (1-\epsilon)^tp_{\theta}(s_t) + (1-(1-\epsilon)^t)p_{\text{mistake}}(s_t) - p_{\theta}(s_t) \right|\\
&amp;= (1-(1-\epsilon)^t)\left| p_{\text{mistake}}(s_t) - p_{\theta}(s_t) \right|\\
&amp;\leq 2(1-(1-\epsilon)^t)\\
&amp;\leq 2\epsilon t
\end{align*}\]

<p>This is very similar to the derivation we have for DAgger, and if there is anything that is unclear to you, please see lecture 2 section 3.2.</p>

<p>Now let’s reveal what \(\lvert p_{\theta'}(s_t) - p_{\theta}(s_t) \rvert \leq 2\epsilon t\) can bring us:</p>

<p>Since</p>

\[\begin{align*}
&amp;\mathbb{E}_{p_{\theta'}(s_t)}\left[ f(s_t) \right]\\
&amp;= \sum_{s_t}p_{\theta'}(s_t)f(s_t) \\
&amp;\geq \sum_{s_t}p_{\theta}(s_t)f(s_t) - \left|p_{\theta'}(s_t) - p_{\theta}(s_t)\right|\max_{s_t}f(s_t)\\
&amp;\geq \sum_{s_t}p_{\theta}(s_t)f(s_t) - 2\epsilon t \max_{s_t}f(s_t)
\end{align*}\]

<p>Therefore, we have</p>

\[\begin{align}
&amp;\sum_t\mathbb{E}_{s_t\sim p_{\theta'}(s_t)}\left[ \mathbb{E}_{a_t \sim q} \frac{\pi_{\theta'}(a_t\mid s_t)}{q(a_t\mid s_t)}\gamma^t A^{\pi_{\theta}}(s_t, a_t)\right] \\
&amp;\geq \sum_t\mathbb{E}_{s_t\sim p_{\theta}(s_t)}\left[ \mathbb{E}_{a_t \sim q} \frac{\pi_{\theta'}(a_t\mid s_t)}{q(a_t\mid s_t)}\gamma^t A^{\pi_{\theta}}(s_t, a_t)\right] - \sum_t 2\epsilon t C \\
&amp;\geq \sum_t\mathbb{E}_{s_t\sim p_{\theta}(s_t)}\left[ \mathbb{E}_{a_t \sim q} \frac{\pi_{\theta'}(a_t\mid s_t)}{q(a_t\mid s_t)}\gamma^t A^{\pi_{\theta}}(s_t, a_t)\right] - \frac{4\epsilon\gamma}{(1-\gamma)}D_{\text{KL}}^{\text{max}}(\theta,\theta') \\
\end{align}\]

<!-- $$\begin{align}
&\max_{\theta'} \sum_{t}\mathbb{E}_{s\sim p_{\theta}(s_t)}\left[ \mathbb{E}_{a_t \sim q} \frac{\pi_{\theta'}(a_t\mid s_t)}{q(a_t\mid s_t)}\gamma^t A^{\pi_{\theta}}(s_t, a_t)\right] \\
&\text{s.t.} \, \, \max_{s\sim p_{\theta}(s)}\left[D_{\text{KL}}(\pi_{\theta}(\cdot|s) \lVert \pi_{\theta'}(\cdot|s))\right] \leq \delta
\end{align}$$ -->

<!-- $$\begin{align}
&\max_{\theta'}\sum_{t=0}^{T} \frac{\pi_{\theta'}(a_t|s_t)}{q(a_t|s_t)}\sum_{t'=t}^{T}\gamma^{t'-t}r_{t'} \\
&\text{s.t.} \, \,\frac12 (\theta' - \theta)^T \left[\frac1T \sum_{t=1}^{T} \frac{\partial^2}{\partial \theta_i \partial \theta_j} D_\text{KL}(\pi_{\theta}(\cdot\mid s_t) \lVert \pi_{\theta'}(\cdot\mid s_t))\right](\theta' - \theta) < \delta
\end{align}$$ -->

<p>Where \(C \propto O(Tr_{\text{max}})\) in the finite horizon case, or \(C \propto O(\frac{r_{\text{max}}}{1-\gamma})\) in the infinite horizon case. This tells us two things: first, the approximate objective</p>

\[\sum_t\mathbb{E}_{s_t\sim p_{\theta}(s_t)}\left[ \mathbb{E}_{a_t \sim q} \frac{\pi_{\theta'}(a_t\mid s_t)}{q(a_t\mid s_t)}\gamma^t A^{\pi_{\theta}}(s_t, a_t)\right]\]

<p>is a lower bound of the original objective</p>

\[\sum_t\mathbb{E}_{s_t\sim p_{\theta'}(s_t)}\left[ \mathbb{E}_{a_t \sim q} \frac{\pi_{\theta'}(a_t\mid s_t)}{q(a_t\mid s_t)}\gamma^t A^{\pi_{\theta}}(s_t, a_t)\right]\]

<p>and this is good, as maximizing this approximate objective means maximizing a lower bound on the thing we initially wanted to maximize. Second, the error bound of the approximation is \(\sum_t 2\epsilon t C\); this error might seem big, because \(C\) is linear in the horizon and the maximal reward, but we can keep it small by keeping the gap between the new and old policies very small.</p>

<p>But how do we impose this constraint (equation \(\ref{cond1}\)) in practice?</p>

<p>Well, it’s not a very convenient constraint to use in practice. Luckily, we have</p>

\[\begin{equation}\label{cond2}
\left| \pi_{\theta'}(a_t\mid s_t) - q(a_t\mid s_t)\right| &lt;  \sqrt{\frac12 D_\text{KL}(\pi_{\theta} \lVert \pi_{\theta'})}, \forall s_t
\end{equation}\]

<p>and the KL divergence has nice properties that make it much easier to approximate!</p>

<p>Now we have the Trust Region Policy Optimization setup:</p>

\[\begin{align}
&amp;\theta' \leftarrow \text{argmax}_{\theta'}\, \sum_t\mathbb{E}_{s_t\sim p_{\theta}(s_t)}\left[ \mathbb{E}_{a_t \sim q} \frac{\pi_{\theta'}(a_t\mid s_t)}{q(a_t\mid s_t)}\gamma^t A^{\pi_{\theta}}(s_t, a_t)\right]\\
&amp; \text{subject to } D_\text{KL}(\pi_{\theta} \lVert \pi_{\theta'}) &lt; \epsilon
\end{align}\]

<p>For small enough \(\epsilon\), this is guaranteed to improve \(J(\theta') - J(\theta)\).</p>

<p>How do we solve this constrained optimization problem?</p>

<h2 id="3-solving-trpo">3 Solving TRPO</h2>
<p>In this section we introduce two ways of solving TRPO: dual gradient ascent and natural policy gradient.</p>
<h3 id="31-dual-gradient-ascent">3.1 Dual Gradient Ascent</h3>
<p>Dual gradient ascent augments the objective with a Lagrange multiplier to incorporate the constraint:</p>

\[\begin{align}
\mathcal{L}(\theta', \lambda) 
&amp;= \sum_t\mathbb{E}_{s_t\sim p_{\theta}(s_t)}\left[ \mathbb{E}_{a_t \sim q} \frac{\pi_{\theta'}(a_t\mid s_t)}{q(a_t\mid s_t)}\gamma^t A^{\pi_{\theta}}(s_t, a_t)\right] \\
&amp;- \lambda (D_\text{KL}(\pi_{\theta} \lVert \pi_{\theta'}) - \epsilon)
\end{align}\]

<p>This can be maximized by running the following two steps iteratively:</p>

<div align="center"><img src="../assets/images/285-8-dual.png" width="600" /></div>

<p>Where the first step can be incomplete, i.e. we just run a few gradient updates and then go to step 2.</p>

<h3 id="32-natural-policy-gradient">3.2 Natural Policy Gradient</h3>
<p>Natural policy gradient was introduced much earlier than TRPO, but it turns out to be a special case of TRPO.</p>

<p>To ease the notation, let’s denote the objective as</p>

\[\bar{A}(\theta') := \sum_t\mathbb{E}_{s_t\sim p_{\theta}(s_t)}\left[ \mathbb{E}_{a_t \sim q} \frac{\pi_{\theta'}(a_t\mid s_t)}{q(a_t\mid s_t)}\gamma^t A^{\pi_{\theta}}(s_t, a_t)\right]\]

<p>The idea of natural policy gradient is to use linear approximation to the objective \(\bar{A}(\theta')\) and quadratic approximation to the constraint. This will lead to a very simple optimization problem that can be solved analytically by hand.</p>

<p>Using a first-order Taylor expansion of \(\bar{A}(\theta')\) around \(\theta\), we have</p>

\[\begin{align*}
&amp;\bar{A}(\theta') \\
&amp;\approx \bar{A}(\theta) + \nabla_{\theta'}\bar{A}(\theta)^T(\theta' - \theta)\\
&amp;\propto \nabla_{\theta'}\bar{A}(\theta)^T(\theta' - \theta)
\end{align*}\]

<p>Where we drop the term that is constant in \(\theta'\).</p>

<p>As a side note, we have</p>

\[\begin{align}
&amp;\nabla_{\theta'}\bar{A}(\theta) \\
&amp;= \sum_t\mathbb{E}_{s_t\sim p_{\theta}(s_t)}\left[ \mathbb{E}_{a_t \sim q} \frac{\pi_{\theta}(a_t\mid s_t)}{\pi_{\theta}(a_t\mid s_t)}\gamma^t \nabla_{\theta}\log \pi_{\theta'}(a_t\mid s_t) A^{\pi_{\theta}}(s_t, a_t)\right] \\
&amp;= \sum_t\mathbb{E}_{s_t\sim p_{\theta}(s_t)}\left[ \mathbb{E}_{a_t \sim q} \gamma^t \nabla_{\theta}\log \pi_{\theta}(a_t\mid s_t) A^{\pi_{\theta}}(s_t, a_t)\right] \\
\end{align}\]

<p>Which is actually the actor-critic policy gradient.</p>

<p>Then we expand the constraint to second order:</p>

\[\begin{align*}
D_\text{KL}(\pi_{\theta} \lVert \pi_{\theta'}) \approx \frac12 (\theta' - \theta)^T\nabla^2 D_\text{KL}(\pi_{\theta} \lVert \pi_{\theta'})(\theta' - \theta)
\end{align*}\]

<p>Where the constant and first-order terms can be <a href="https://people.eecs.berkeley.edu/~pabbeel/cs287-fa09/lecture-notes/lecture20-2pp.pdf">shown</a> to both be zero.
We can approximate the constraint using samples \(\{(s_t, a_t, r_t)\}_{t=0}^{T}\):</p>

\[\begin{equation}\label{second}
\frac12 (\theta' - \theta)^T \left[\frac1T \sum_{t=1}^{T} \frac{\partial^2}{\partial \theta_i \partial \theta_j} D_\text{KL}(\pi_{\theta}(\cdot\mid s_t) \lVert \pi_{\theta'}(\cdot\mid s_t))\right](\theta' - \theta) &lt; \delta
\end{equation}\]

<p>Where the KL term can usually be calculated analytically.</p>

<p>Also, since</p>

\[\begin{equation*}
\nabla^2 D_\text{KL}(\pi_{\theta} \lVert \pi_{\theta'}) = \mathbb{E}_{s_t\sim p_{\theta}(s_t)}\left[ \nabla_{\theta}\log\pi_{\theta}(a_t\mid s_t) \nabla_{\theta}\log\pi_{\theta}(a_t\mid s_t)^T \right]
\end{equation*}\]

<p>where the right hand side is the Fisher information matrix of \(\pi_{\theta}(a_t\mid s_t)\).</p>

<p>With this, we can also approximate the constraint by</p>

\[\begin{equation}\label{fisher}
\frac12(\theta' - \theta)^T \left[\frac1T \sum_{t=1}^{T} \frac{\partial}{\partial \theta_i}\log \pi_{\theta}(a_t\mid s_t) \frac{\partial}{\partial \theta_j}\log \pi_{\theta}(a_t\mid s_t)^T\right](\theta' - \theta)&lt; \epsilon
\end{equation}\]

<p>Which approximation should we use? Equation \(\ref{second}\) uses the fact that the KL divergence between policies can usually be calculated analytically, so the MC estimator is more stable; but it requires taking second-order derivatives, which is not very compatible with automatic differentiation packages. Equation \(\ref{fisher}\) doesn’t require second-order derivatives, but it requires storing all the policy gradients along the trajectories, and since each \(\nabla_{\theta}\log \pi_{\theta}(a_t\mid s_t)\) outer product is a single-sample estimate, this approximation has larger variance.</p>

<p>Nevertheless, since the course uses the Fisher information matrix, we will follow it and express the constraint as</p>

\[\begin{align*}
\frac12 (\theta' - \theta)^T\mathbf{F}(\theta' - \theta)&lt; \epsilon
\end{align*}\]

<p>With the objective:</p>

\[\max_{\theta'} \nabla_{\theta}\bar{A}(\theta)^T(\theta' - \theta)\]

<p>We can solve this constrained optimization by hand and arrive at:</p>

\[\theta' = \theta + \alpha \mathbf{F}^{-1}\nabla_{\theta}\bar{A}(\theta)\]

<p>Where</p>

\[\alpha = \sqrt{\frac{2\epsilon}{\nabla_{\theta}\bar{A}(\theta)^T\mathbf{F}^{-1}\nabla_{\theta}\bar{A}(\theta)}}\]
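<p>As a sketch, the natural policy gradient update with the sample-based Fisher matrix (equation \(\ref{fisher}\)) looks like the following; <code>grad_log_probs</code> (the per-step score vectors) and <code>pg</code> (the usual policy gradient estimate) are assumed to be precomputed:</p>

<pre><code class="language-python">
import numpy as np

# A sketch of the natural policy gradient step with the sample-based
# Fisher matrix. grad_log_probs holds the per-step score vectors
# (the gradients of log pi(a_t|s_t)); pg is the policy gradient estimate.

def natural_gradient_step(theta, pg, grad_log_probs, eps=0.01):
    T = len(grad_log_probs)
    F = sum(np.outer(g, g) for g in grad_log_probs) / T    # Fisher estimate
    F += 1e-6 * np.eye(len(theta))                         # damping for invertibility
    nat_grad = np.linalg.solve(F, pg)                      # F^{-1} grad A_bar
    alpha = np.sqrt(2 * eps / (pg @ nat_grad + 1e-12))     # from the KL constraint
    return theta + alpha * nat_grad
</code></pre>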

<h2 id="4-proximal-policy-optimization-ppo">4 Proximal Policy Optimization (PPO)</h2>
<p>PPO was proposed to deal with the issues of TRPO while maintaining its advantages. The component that makes TRPO stable is the trust region (i.e. the constraint), but the constrained optimization problem it leads to is difficult to solve.</p>

<p>Essentially, PPO differs from TRPO in the way it formulates the trust region in the optimization. Let</p>

\[r_t(\theta') = \frac{\pi_{\theta'}(a_t\mid s_t)}{q(a_t\mid s_t)}\]

<p>To make sure the new and old policies are close, TRPO formulates this as a constraint on the KL divergence; PPO instead incorporates it directly into the objective:</p>

\[\begin{equation}\label{ppo_obj}
\mathcal{L}^{\text{CLIP}} = \sum_t \mathbb{E}_{s_t,a_t \sim p_{\theta}(s_t, a_t)}\left[ \gamma^t \text{min}\left(r_t(\theta')A^{\theta}(s_t, a_t), \text{CLIP}(r_t(\theta'), 1-\epsilon, 1+\epsilon)A^{\theta}(s_t, a_t) \right) \right]
\end{equation}\]

<p>The first term in the min is the original TRPO objective (without incorporating the constraint). The clipping removes the incentive for moving \(r_t(\theta')\) outside of the interval \([1 − \epsilon, 1 + \epsilon]\) (the paper shows empirically that setting \(\epsilon=0.2\) gives the best results). Since we take the minimum of the clipped and unclipped objective, “the final objective is a lower bound on the unclipped objective.” “With this scheme, we only ignore the change in probability ratio when it would make the objective improve, and we include it when it makes the objective worse.” (Quoted sentences are directly from the PPO paper by <a href="https://arxiv.org/pdf/1707.06347.pdf">Schulman et al. 17’</a>.)</p>
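<p>The clipped objective is only a few lines in PyTorch. A minimal sketch, dropping the \(\gamma^t\) factor as most implementations do, and assuming the log-probabilities and advantage estimates are precomputed:</p>

<pre><code class="language-python">
import torch

# A sketch of the PPO clipped surrogate loss, dropping the gamma^t factor
# as most implementations do. log_probs_new/old and advantages are assumed
# to be precomputed tensors over a batch of (s_t, a_t) pairs.

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, eps=0.2):
    ratio = torch.exp(log_probs_new - log_probs_old)    # r_t(theta')
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps)
    # Elementwise min of the unclipped and clipped surrogates, negated
    # because optimizers minimize.
    return -torch.min(ratio * advantages, clipped * advantages).mean()
</code></pre>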

<h2 id="demo-openai-ppo">Demo: OpenAI PPO</h2>
<!-- <iframe width="1424" height="652" src="https://www.youtube.com/embed/KJ15iGGJFvQ" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe> -->
<div align="center"><iframe width="600" height="400" src="https://cdn.openai.com/openai-baselines-ppo/knocked-over-stand-up.mp4" title="OpenAI PPO" frameborder="0" allowfullscreen=""></iframe></div>

<p><a href="https://openai.com/blog/openai-baselines-ppo/">Link</a> to the article by OpenAI</p>]]></content><author><name>Puyuan Peng</name></author><category term="DeepRL" /><category term="RL" /><category term="Notes" /><summary type="html"><![CDATA[At the end of previous lecture, we talked about the issues with Q-learning, one of them is that it’s not directly optimizing the expected return and it can take a long time before the return starts to improve. On the other hand, policy gradient methods are direclty optimizing the expected return, although we cannot guarantee that the return will improve every gardient update. At the same time, we know that classic policy iteration can improve the expected return at each iteration, but this method cannot be applied to large scale problems.]]></summary></entry><entry><title type="html">Deep RL 7 Q-learning</title><link href="https://jasonppy.github.io/deeprl/deeprl-7-q-learning/" rel="alternate" type="text/html" title="Deep RL 7 Q-learning" /><published>2021-04-22T00:00:00-04:00</published><updated>2021-04-22T00:00:00-04:00</updated><id>https://jasonppy.github.io/deeprl/deeprl-7-q-learning</id><content type="html" xml:base="https://jasonppy.github.io/deeprl/deeprl-7-q-learning/"><![CDATA[<p>In this section we extend the online Q-iteration algorithm in the previous lecture by identifying the potential issues and introducing solutions. The improved algorithm can be very general and contains famous special cases such as DQN.</p>

<p>A bit of terminology: Q-learning and Q-iteration mean the same thing. The crucial part is whether there is a “fitted” in front of them; when there is, it means the Q-function is approximated using some parametric function (e.g. a neural network).</p>

<p>To see the issues with online Q-iteration, let’s write out the algorithm:</p>

<ol>
  <li>run some policy for one step and collect \((s, a, r, s')\)</li>
  <li>gradient update: \(\phi \leftarrow \phi - \alpha \frac{\partial Q_{\phi}(s,a)}{\partial \phi}\left(Q_{\phi}(s,a) - \left(r + \max_{a'}Q_{\phi}(s', a')\right)\right)\). Go to step 1.</li>
</ol>

<p>The first issue with this algorithm is that transitions close to each other in time are highly correlated; this leads the Q-function to locally overfit to windows of transitions and fail to see the broader context needed to fit the whole function accurately.</p>

<p>The second issue is that the target of the Q-function changes with every gradient step, while the gradient doesn’t account for that change. To explain: for the same transition \((s, a, r, s')\), when the current Q-function is \(Q_{\phi_1}\), the target is \(r + \max_{a'}Q_{\phi_1}(s', a')\); however, after one gradient update the Q-function becomes \(Q_{\phi_2}\), and the target also changes to \(r + \max_{a'}Q_{\phi_2}(s', a')\). It’s as if the Q-function is chasing its own tail.</p>

<h2 id="1-replay-buffers">1 Replay Buffers</h2>
<p>We solve the first issue in this section. Note that, as pointed out in the previous lecture, unlike policy gradient methods, which view data as trajectories, value-function methods (including Q-iteration) view data as transitions, i.e. snippets of trajectories. This means that the completeness of the data as whole trajectories doesn’t matter for learning a good Q-function.</p>

<p>Following this idea, we introduce replay buffers, a concept that was introduced to RL in the nineties.</p>

<p>A replay buffer \(\mathcal{B}\) stores many transition tuples \((s,a,s',r)\), which are collected every time we run a policy during training (so the transitions don’t have to come from the same/latest policy). In Q-iteration, if the transitions are random samples from \(\mathcal{B}\), then we don’t have to worry about them being correlated. This gives the algorithm:</p>

<div align="center"><img src="../assets/images/285-7-q-replay.png" width="700" /></div>

<p>Note that the data in the replay buffer still come from policies induced by the current Q-function (the greedy policy, epsilon-greedy, Boltzmann exploration, etc.). It is very common to just set \(K=1\), which makes the algorithm even more similar to the original online Q-iteration algorithm.</p>

<p>We can represent the algorithm with the following diagram to make it more intuitive:</p>

<div align="center"><img src="../assets/images/285-7-q-replay-plot.png" width="700" /></div>

<p>Since we are constantly adding new and possibly more relevant transitions to the buffer, we evict old transitions to keep the total number of transitions in the buffer fixed.</p>
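
<p>As a concrete reference, here is a minimal replay buffer sketch in Python (the class and method names here are our own, for illustration only); the fixed-capacity deque handles the eviction of old transitions automatically:</p>

<pre><code class="language-python">import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100000):
        # a deque with maxlen drops the oldest transition once capacity is reached
        self.buffer = deque(maxlen=capacity)

    def add(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        # uniform random sampling breaks the correlation between
        # temporally adjacent transitions
        return random.sample(self.buffer, batch_size)
</code></pre>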

<p>In the rest of this lecture, we will always use replay buffers in the algorithms that we introduce.</p>

<h2 id="2-target-networks">2 Target networks</h2>
<p>We deal with the second issue in this section. Instead of always calculating the target using the latest Q-function (which results in the Q-function chasing its own tail), we use a target network (which also outputs Q-values) that is not too far from the latest Q-function, but stays fixed for a considerable number of gradient steps.</p>

<p>Let’s see the Q-learning algorithm with both replay buffer and target network:</p>
<div align="center"><img src="../assets/images/285-7-q-replay-target.png" width="700" /></div>

<p>Note that the loop containing steps 2, 3, and 4 is just plain regression, as the target network \(Q_{\phi'}\) is fixed within the loop. In practice, we usually set \(K\) to be between \(1\) and \(4\), while setting \(N\) to something like \(10000\).</p>

<p>As a special case of the above algorithm, setting \(K = 1\) gives us the famous classic DQN algorithm (<a href="https://arxiv.org/abs/1312.5602">Mnih et al. 13’</a>). We can also switch steps 1 and 2, and the resulting algorithm still works.</p>

<p>You might feel a little uncomfortable with this algorithm: right after we assign the target network parameters \(\phi'\) to be the current Q-function parameters \(\phi\), the lag between \(Q_{\phi'}\) and \(Q_{\phi}\) is small, but as we keep updating \(Q_{\phi}\) in step 4, the lag grows. We might not want the lag to be constantly changing during training. To remedy this, we can use an exponentially decaying moving average to update the target network \(\phi'\) after every gradient update of \(\phi\) (or make \(N\) much smaller than \(10000\))</p>

\[\phi' \leftarrow \tau \phi' + (1-\tau)\phi\]

<p>where \(\tau\) is some value very close to \(1\), such as \(0.999\).</p>
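
<p>In code, this soft target update is one line per parameter; here is a sketch assuming two PyTorch networks, q_net and target_net, with identical architectures:</p>

<pre><code class="language-python">import torch

def soft_update(target_net, q_net, tau=0.999):
    # phi' = tau * phi' + (1 - tau) * phi, applied parameter-wise
    with torch.no_grad():
        for p_target, p in zip(target_net.parameters(), q_net.parameters()):
            p_target.mul_(tau).add_(p, alpha=1.0 - tau)
</code></pre>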

<p>For simplicity, we will sometimes just say “update \(\phi'\)” or “target update” in the rest of the lecture, rather than specifying exactly how \(\phi'\) is updated.</p>

<h2 id="3-overestimation-in-q-learning">3 Overestimation in Q-learning</h2>
<p>This section is based on <a href="https://papers.nips.cc/paper/2010/file/091d584fced301b442654dd8c23b3fc9-Paper.pdf">van Hasselt 10’</a> and <a href="https://arxiv.org/pdf/1509.06461.pdf">van Hasselt et al. 15’</a>.</p>

<p>Recall that, by definition, the value function and the Q-function in Q-learning are related by</p>

\[\begin{equation} \label{value}
V(s) = \max_{a}Q(s,a)
\end{equation}\]

<p>Since we don’t know the true Q-function, we need to estimate it using Monte Carlo samples.</p>

<p>Let’s use a simple example to show how we end up using the wrong estimator and overestimating \(\max_{a}Q(s,a)\).</p>

<p>Suppose there are three different actions we can take, \(a_1, a_2, a_3\). This means we need to estimate \(Q(s, a_1)\), \(Q(s,a_2)\), and \(Q(s, a_3)\) using their Monte Carlo samples and then take the max; i.e. the quantity we want is</p>

\[\begin{equation}\label{maxexp} \max\{ \mathbb{E}Q(s,a_1), \mathbb{E}Q(s,a_2), \mathbb{E}Q(s,a_3) \}
\end{equation}\]

<p>and we use a one-sample estimate of each Q-value, so our estimator of equation \(\ref{maxexp}\) is</p>

\[\begin{equation}\label{esti} \max \{ Q_{\phi}(s,a_1), Q_{\phi}(s,a_2), Q_{\phi}(s,a_3)  \}\end{equation}\]

<p>However, this is not an unbiased estimator of equation \(\ref{maxexp}\); rather, it is an unbiased estimate of</p>

\[\begin{equation} \label{expmax} \mathbb{E} \{\max\{ Q(s,a_1), Q(s,a_2), Q(s,a_3) \}\}
\end{equation}\]

<p>Since we have</p>

\[\mathbb{E} \{\max\{ Q(s,a_1), Q(s,a_2), Q(s,a_3) \}\} \geq  \max\{ \mathbb{E}Q(s,a_1), \mathbb{E}Q(s,a_2), \mathbb{E}Q(s,a_3) \}\]

<p>our estimator in equation \(\ref{esti}\) will overestimate the target in equation \(\ref{value}\).</p>

<p>To make it even more concrete, consider the case where for all three actions, the true Q-values are all zero, but our estimated Q-values are</p>

\[Q_{\phi}(s, a_1) = -0.1, Q_{\phi}(s, a_2) = 0, Q_{\phi}(s, a_3) = 0.1\]

<p>Then \(\max \{ Q_{\phi}(s,a_1), Q_{\phi}(s,a_2), Q_{\phi}(s,a_3)  \} = 0.1\), even though the true value \(\max_{a}Q(s,a)\) is \(0\).</p>

<p>To see why equation \(\ref{esti}\) overestimates from another angle: the function approximator \(Q_{\phi}\) is a noisy estimate of \(Q\), and in equation \(\ref{esti}\) we use this same \(Q_{\phi}\) both to select the best action and to evaluate its Q-value, i.e.</p>

\[\begin{equation}
\max \{ Q_{\phi}(s,a_1), Q_{\phi}(s,a_2), Q_{\phi}(s,a_3)  \}  = Q_{\phi}(s,\text{argmax}_{a_i}\, \{ Q_{\phi}(s,a_1), Q_{\phi}(s,a_2), Q_{\phi}(s,a_3)  \})
\end{equation}\]

<p>Thus the noise in \(Q_{\phi}\) enters both the selection and the evaluation, and leads to overestimation.</p>
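
<p>A quick simulation makes this bias easy to see. With true Q-values all zero and independent zero-mean noise on each estimate (the noise scale below is an arbitrary choice for illustration), the max of the noisy estimates is positive on average:</p>

<pre><code class="language-python">import numpy as np

rng = np.random.default_rng(0)
noise = rng.normal(0.0, 0.1, size=(100000, 3))   # noisy Q_phi(s, a_i); true values are 0

print(noise.max(axis=1).mean())   # roughly 0.085: E[max] is biased upward
print(noise.mean(axis=0).max())   # roughly 0: the max of the means is unbiased
</code></pre>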

<p>This suggests one solution to the problem: Double Q-learning, which uses two different Q-functions, one for selection and one for evaluation:</p>

\[\begin{align}
&amp;a^* = \text{argmax}_{a}Q_{\phi_{select}}(s,a) \\
&amp;\max_{a}Q(s,a) \approx Q_{\phi_{eval}}(s, a^*)
\end{align}\]

<p>And if \(Q_{\phi_{select}}\) and \(Q_{\phi_{eval}}\) are noisy in different ways, the overestimation problem goes away!</p>

<p>So, we need to learn two neural networks? Well, that’s one possible way, but we can actually just use the current network as \(Q_{\phi_{select}}\) and the target network as \(Q_{\phi_{eval}}\). I.e.</p>

\[\begin{align}
&amp;a^* = \text{argmax}_{a}Q_{\phi}(s,a) \\
&amp;\max_{a}Q(s,a) \approx Q_{\phi'}(s, a^*)
\end{align}\]

<p>These two networks are of course correlated, but they are sufficiently far apart (recall that we copy the current network into the target network only every \(10000\) or more gradient steps) that in practice this method works really well.</p>
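
<p>In code, the double Q-learning target differs from the vanilla target by a single line: the current network selects the action and the target network evaluates it. A sketch under the same naming assumptions as the earlier sketches:</p>

<pre><code class="language-python">import torch

def double_q_target(q_net, target_net, r, s_next, gamma=0.99):
    # both networks map a batch of states to Q-values of shape [batch, num_actions]
    with torch.no_grad():
        a_star = q_net(s_next).argmax(dim=1)        # select with the current network
        q_eval = target_net(s_next).gather(1, a_star.unsqueeze(1)).squeeze(1)
        return r + gamma * q_eval                   # evaluate with the target network
</code></pre>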

<h2 id="4-q-learning-with-n-step-returns">4 Q-learning with N-step Returns</h2>
<p>This section is based on <a href="https://arxiv.org/abs/1606.02647">Munos et al. 16’</a>.</p>

<p>In the actor-critic lecture, we talked about the bias-variance tradeoff between estimating the expected sum of rewards using \(\sum_t \gamma^{t}r_t\) and \(r_t + \gamma V_{\phi}(s_{t+1})\). The former is an unbiased one-sample estimate of the return, which has high variance; the latter is the one-step reward plus future rewards estimated by a fitted value function, which can be biased but has lower variance. Based on this, we can trade off bias and variance by using</p>

\[\sum_{t'=t}^{t+N-1}\gamma^{t'-t}r_{t'} + \gamma^{N} V_{\phi}(s_{t+N})\]

<p>where bigger \(N\) leads to smaller bias and higher variance.</p>

<p>Similarly, for Q-learning, we can estimate the target Q-value by</p>

\[\begin{equation} \label{trade}y_t = \sum_{t'=t}^{t+N-1}\gamma^{t'-t}r_{t'} + \gamma^{N} \max_{a_{t+N}}Q_{\phi}(s_{t+N},a_{t+N})\end{equation}\]

<p>This seems fine at first glance, but recall that \(y_t\) estimates the Q-value under the current policy (our objective is to minimize \(\sum_t\left\|Q_{\phi}(s_t,a_t) - y_t\right\|^2\)), so we need to make sure that the transitions \((s_{t'}, a_{t'},s_{t'+1})\) and rewards \(r_{t'}\) for \(t &lt; t' \leq t+N-1\) come from running the current policy.</p>

<p>There are several ways to deal with this:</p>

<ol>
  <li>Just ignore it and use whatever is in the buffer. This often works well in practice (see the sketch after this list).</li>
  <li>Compare every action along the trajectory with the action our current policy would take, and set \(N\) adaptively to the largest number of steps before the trajectory’s action and the policy’s action disagree. This way we only use on-policy data. It works well when the actual data are mostly on-policy and the action space is small.</li>
  <li>Importance sampling. Please see the original paper for detail.</li>
</ol>
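
<p>Here is the sketch promised in option 1: computing the N-step target directly from a stored snippet of a trajectory, ignoring whether the intermediate actions are on-policy (the snippet layout is an assumption for illustration):</p>

<pre><code class="language-python">import torch

def n_step_target(target_net, rewards, s_final, gamma=0.99):
    # rewards: [r_t, ..., r_{t+N-1}] taken from the buffer; s_final: s_{t+N}
    n = len(rewards)
    ret = sum((gamma ** k) * r for k, r in enumerate(rewards))
    with torch.no_grad():
        bootstrap = target_net(s_final).max()   # max_a Q_phi'(s_{t+N}, a)
    return ret + (gamma ** n) * bootstrap
</code></pre>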

<h2 id="5-q-learning-with-contiuous-actions">5 Q-learning with Contiuous Actions</h2>
<p>So far we’ve been assuming that \(\max_{a}Q_{\phi}(s,a)\) is a tractable and fast operation, because it appears in the inner loop of Q-learning algorithms. This is true for discrete action spaces, where we can just parameterize \(Q_{\phi}\) to take input \(s\) and output a vector of dimension \(\left|\mathcal{A}\right|\), where each entry of the vector is the Q-value for a specific action.</p>

<p>What if the action space is continuous?</p>

<p>We will briefly introduce three techniques that make Q-learning algorithms work in continuous action spaces by making the operation \(\max_{a}Q_{\phi}(s,a)\) fast.</p>

<h3 id="51-randomized-search">5.1 Randomized Search</h3>
<p>The simplest solution is to randomly sample a bunch of actions, take the one with the best estimated Q-value as our action, and take the corresponding Q-value as the value of the state, i.e.</p>

\[\max_{a}Q_{\phi}(s,a) \approx \max\{Q_{\phi}(s,a_1),Q_{\phi}(s,a_2),\cdots, Q_{\phi}(s,a_N)\}\]

<p>where the \(a_i\) are sampled from the action space \(\mathcal{A}\) for \(i=1,\dots,N\).</p>

<p>The advantage of this method is that it is extremely simple and easily parallelized; the disadvantage is that it is not very accurate, especially when the action space is high-dimensional.</p>
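
<p>A sketch of this randomized search, assuming actions live in the box \([-1, 1]^d\) and the network scores state-action pairs (both assumptions for illustration):</p>

<pre><code class="language-python">import torch

def random_search_max(q_net, s, action_dim, num_samples=64):
    # sample candidate actions uniformly from the action space
    actions = 2.0 * torch.rand(num_samples, action_dim) - 1.0
    s_batch = s.expand(num_samples, -1)    # s assumed to have shape [1, state_dim]
    q_values = q_net(s_batch, actions)     # [num_samples] estimated Q-values
    best = q_values.argmax()
    return actions[best], q_values[best]   # approximate argmax and max over actions
</code></pre>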

<p>There are more sophisticated randomized search methods, such as the cross-entropy method (which we will introduce in detail in later lectures) and CMA-ES. However, these methods do not really work when the dimension of the action space is higher than about \(40\).</p>

<h3 id="52-using-easily-maximazable-q-function-parameterization">5.2 Using Easily Maximazable Q-function Parameterization</h3>
<p>We can easily find the maximum of \(Q_{\phi}(s,a)\) if it is quadratic in \(a\). This leads to Normalized Advantage Functions, or NAF (<a href="https://arxiv.org/pdf/1603.00748.pdf">Gu et al. 16’</a>), which parameterizes the Q-function as</p>

\[Q_{\phi}(s,a) = -\frac12 (a - \mu_{\phi}(s))^TP_{\phi}(s)(a - \mu_{\phi}(s)) + V_{\phi}(s)\]

<p>And the architecture is</p>

<div align="center"><img src="../assets/images/285-7-naf.png" width="400" /></div>

<p>where the network takes in the state \(s\) and outputs a vector \(\mu_{\phi}(s)\), a positive-definite square matrix \(P_{\phi}(s)\), and a scalar value \(V_{\phi}(s)\).</p>

<p>Using this parameterization, we have</p>

\[\begin{align*}
&amp;\text{argmax}_a\,Q_{\phi}(s,a) = \mu_{\phi}(s)\\
&amp;\max_a Q_{\phi}(s,a) = V_{\phi}(s)
\end{align*}\]

<p>The disadvantage of this method is that representational power is sacrificed because of the restrictive quadratic form.</p>
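
<p>As a sketch of how the quadratic form can be evaluated: one common way to keep \(P_{\phi}(s)\) positive definite (an assumption here; see the paper for the exact construction) is to have the network output a lower-triangular factor \(L\) with positive diagonal and set \(P = LL^T\):</p>

<pre><code class="language-python">import torch

def naf_q_value(mu, L, V, a):
    # mu: [d], L: [d, d] lower-triangular with positive diagonal, V: scalar, a: [d]
    # P = L @ L.T is positive definite, so Q is a concave quadratic in a,
    # and therefore argmax_a Q(s, a) = mu and max_a Q(s, a) = V
    diff = a - mu
    P = L @ L.T
    return -0.5 * diff @ P @ diff + V
</code></pre>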

<h3 id="53-learn-an-approximate-maximizer">5.3 Learn an Approximate Maximizer</h3>
<p>Recall that in double Q-learning</p>

\[\max_{a}Q_{\phi'}(s,a) \approx Q_{\phi'}(s, \text{argmax}_a Q_{\phi}(s,a))\]

<p>the max operation can be fast if we can learn an approximate maximizer that outputs \(\text{argmax}_a Q_{\phi}(s,a)\). This is the idea of Deep Deterministic Policy Gradient, or DDPG (<a href="https://arxiv.org/pdf/1509.02971.pdf">Lillicrap et al. 15’</a>).</p>

<p>We parameterize the maximizer as a neural network \(\mu_{\theta}(s)\); that is, we want to find \(\theta\) such that</p>

\[\mu_{\theta}(s) = \text{argmax}_aQ_{\phi}(s,a)\]

<p>and therefore</p>

\[\max_{a}Q_{\phi'}(s,a) \approx Q_{\phi'}(s, \mu_{\theta}(s))\]

<p>This can be solved by stochastic gradient ascent, with the gradient update</p>

\[\theta \leftarrow \theta + \beta \frac{\partial Q_{\phi}(s,a)}{\partial \mu_{\theta}(s)}\frac{\partial \mu_{\theta}(s)}{\partial \theta}\]

<p>To avoid the maximizer chasing its own tail, similar to what happened to the Q-function in vanilla Q-learning, we use a target maximizer \(\theta'\) when computing the target</p>

\[y = r + \gamma Q_{\phi'}(s', \mu_{\theta'}(s'))\]

<p>and update \(\theta'\) based on the current \(\theta\) on a schedule during training.</p>

<p>The DDPG algorithm can be written as</p>

<div align="center"><img src="../assets/images/285-7-ddpg.png" width="700" /></div>

<h2 id="6-tips-for-praticioner">6 Tips for Praticioner</h2>
<p>Here are some tips for applying Q-learning methods:</p>

<ol>
  <li>Q-learning takes some care to stabilize. Runs with different seeds might be inconsistent. A large replay buffer helps improve stability.</li>
  <li>It takes some time to start working; performance might be no better than random for a while.</li>
  <li>Start with high exploration and gradually reduce.</li>
  <li>Bellman error gradients can be big; clip gradients or use the Huber loss (see the sketch after this list). (The Bellman error is \(\left\|Q_{\phi}(s,a) - \left(r + \gamma\max_{a'}Q_{\phi'}(s',a')\right)\right\|^2\).)</li>
  <li>Double Q-learning helps a lot in practice; it is simple and has no downsides.</li>
  <li>N-step returns also help a lot, but have some downsides (see the previous section on N-step returns).</li>
  <li>Schedule exploration (high to low) and learning rates (high to low); the Adam optimizer can help too.</li>
</ol>
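
<p>Here is the sketch referenced in tip 4: the Huber loss is quadratic for small Bellman errors and linear for large ones, which caps the gradient magnitude (network names are assumptions, as in the earlier sketches):</p>

<pre><code class="language-python">import torch
import torch.nn.functional as F

def bellman_loss(q_net, target_net, s, a, r, s_next, gamma=0.99):
    # standard target computed with a target network, as in the algorithms above
    with torch.no_grad():
        y = r + gamma * target_net(s_next).max(dim=1).values
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    # smooth_l1_loss is the Huber loss: its gradient saturates for large errors
    return F.smooth_l1_loss(q_sa, y)
</code></pre>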

<h2 id="7-demo-deepmind-dqn">7 Demo: DeepMind DQN</h2>
<iframe width="1424" height="652" src="https://www.youtube.com/embed/V1eYniJ0Rnk" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>]]></content><author><name>Puyuan Peng</name></author><category term="DeepRL" /><category term="RL" /><category term="Notes" /><summary type="html"><![CDATA[In this section we extend the online Q-iteration algorithm in the previous lecture by identifying the potential issues and introducing solutions. The improved algorithm can be very general and contains famous special cases such as DQN.]]></summary></entry></feed>