<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Generative AI at Work]]></title><description><![CDATA[Discussing all things Generative AI at work]]></description><link>https://genaiatwork.substack.com</link><image><url>https://substackcdn.com/image/fetch/$s_!5Nj6!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f595038-7bdb-4d15-aa22-ecdcbfd06ade_1024x1024.png</url><title>Generative AI at Work</title><link>https://genaiatwork.substack.com</link></image><generator>Substack</generator><lastBuildDate>Fri, 17 Apr 2026 18:26:32 GMT</lastBuildDate><atom:link href="https://genaiatwork.substack.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Nga Than]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[genaiatwork@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[genaiatwork@substack.com]]></itunes:email><itunes:name><![CDATA[Nga Than]]></itunes:name></itunes:owner><itunes:author><![CDATA[Nga Than]]></itunes:author><googleplay:owner><![CDATA[genaiatwork@substack.com]]></googleplay:owner><googleplay:email><![CDATA[genaiatwork@substack.com]]></googleplay:email><googleplay:author><![CDATA[Nga Than]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[An LLM Exercise on Retrieval: Essentials of Context Engineering]]></title><description><![CDATA[Exercise No. 2 in the AI exercises series]]></description><link>https://genaiatwork.substack.com/p/retrieval-essential-of-context-engineering</link><guid 
isPermaLink="false">https://genaiatwork.substack.com/p/retrieval-essential-of-context-engineering</guid><dc:creator><![CDATA[Nga Than]]></dc:creator><pubDate>Fri, 18 Jul 2025 15:01:06 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/c96c90db-0aff-4997-a470-571b9a61d7bf_1575x407.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>Context Engineering</h2><p>The emerging discourse on context engineering has captured everyone&#8217;s attention in the AI world. The concept, and the practice it names, captures the way engineers now structure the information they place in an LLM&#8217;s context window. It differs from prompt engineering, which describes how an engineer structures each prompt to make the LLM produce a certain output. With LLMs&#8217; growing capabilities, such as tool use and extended thinking, the information funneled into a model is far more structured and requires far more deliberate system design. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Rjl8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f71640f-93ce-4783-babe-398e70065580_1256x1754.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Rjl8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f71640f-93ce-4783-babe-398e70065580_1256x1754.png 424w, https://substackcdn.com/image/fetch/$s_!Rjl8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f71640f-93ce-4783-babe-398e70065580_1256x1754.png 848w, 
https://substackcdn.com/image/fetch/$s_!Rjl8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f71640f-93ce-4783-babe-398e70065580_1256x1754.png 1272w, https://substackcdn.com/image/fetch/$s_!Rjl8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f71640f-93ce-4783-babe-398e70065580_1256x1754.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Rjl8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f71640f-93ce-4783-babe-398e70065580_1256x1754.png" width="700" height="977.547770700637" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1f71640f-93ce-4783-babe-398e70065580_1256x1754.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1754,&quot;width&quot;:1256,&quot;resizeWidth&quot;:700,&quot;bytes&quot;:450304,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://genaiatwork.substack.com/i/168642627?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f71640f-93ce-4783-babe-398e70065580_1256x1754.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Rjl8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f71640f-93ce-4783-babe-398e70065580_1256x1754.png 424w, 
https://substackcdn.com/image/fetch/$s_!Rjl8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f71640f-93ce-4783-babe-398e70065580_1256x1754.png 848w, https://substackcdn.com/image/fetch/$s_!Rjl8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f71640f-93ce-4783-babe-398e70065580_1256x1754.png 1272w, https://substackcdn.com/image/fetch/$s_!Rjl8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f71640f-93ce-4783-babe-398e70065580_1256x1754.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Emerging discourse on Context Engineering by leading AI experts </figcaption></figure></div><p>Andrej Karpathy describes context engineering as both an art and a science of building <strong>industrial-strength</strong> LLM apps [emphasis mine]. It is a science, first and foremost, because there is structure to what one should put into the context window: examples, tools, chain-of-thought prompting, and so on. These are engineering practices the field has learned over the past three years (since ChatGPT launched in 2022), and we now know that certain prompt structures work better than others. In the age of agentic AI, where an LLM can plan and take actions given the right tools and functions, context engineering describes an entire practice in which the choice of tools can determine whether an agentic system succeeds or fails. Second, it is an art because &#8220;of the guiding intuition around LLM psychology of people spirits.&#8221; This phrase needs further unpacking. My reading is that each LLM has its own quirks: by mimicking human-generated data, it has internalized certain human behaviors. The art of context engineering thus lies in the fact that LLMs are not deterministic, their behaviors can be unpredictable, and it takes experience to get a feel for them. In other words, context engineering is &#8220;experiential&#8221;: it takes a lot of tinkering to get right. Understanding the fundamentals alone won&#8217;t replace building and tinkering with different LLMs in order to create an industrial-grade LLM app. </p><p>In many ways, whatever we&#8217;re doing now while creating an LLM app is, in one way or another, context engineering. 
</p><p>In creating the AI exercise series, I am mindful of this emerging discourse and would like to include as many context engineering exercises as I can. The most important context engineering exercise, in my opinion, is retrieval, and in particular retrieval-augmented generation (RAG). RAG extends the long-term memory of an LLM app and keeps the model up to date with the most relevant information. In <a href="https://www.amazon.com/AI-Engineering-Building-Applications-Foundation/dp/1098166302/ref=sr_1_3?crid=2INXIBAJ6FQHN&amp;dib=eyJ2IjoiMSJ9.4IRYkmtBjeQQ20xqiYjJmsB4uizFpDM3bknbKbbxNZ5JrlSrhPlQbNSjWQU9l2qqptdZlBfdR6kRvQ32NPthM-bD3kcJuh6AVjR-RW-X9ShKrLkg_U9U-kxmUF73pDeU1fHMcCsy4Abd-5idXNnr5D1dUS3JB5udxJh6w5clVnUrw5g33PgjoIP-5w0PweAy4VIKYLPstzhn9NM8dkwzhDck4n5tFdWbTWaNeMQgpuU.iBMEX6TQHlaz72NIShePJuWmRfzoiGV8eO5gQvy0R_g&amp;dib_tag=se&amp;keywords=AI+engineering&amp;qid=1752850175&amp;sprefix=ai+engineering%2Caps%2C276&amp;sr=8-3">AI Engineering</a>, Chip Huyen argues that before considering fine-tuning an LLM, one should experiment with prompting, then with adding RAG; only if those two fail might fine-tuning be the next option. RAG helps the model ground its response in a specific piece of information, arguably increasing the specificity and accuracy of the final output. </p><p>RAG is particularly helpful in domains where LLMs have little information, such as closed-source data: medical records, law cases, and so on. I created such an exercise for the legal domain, which is ripe for innovation and is where industrial-grade LLM apps and agentic systems might make the most impact, because the field relies so heavily on structured documents. 
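The retrieve-then-generate loop that RAG describes can be sketched in a few lines of stdlib-only Python. The toy statutes, query, and prompt template below are illustrative assumptions, not part of the exercise; a real system would swap in a stronger retriever and an actual LLM call:

```python
# Minimal RAG sketch: TF-IDF retrieval + context-augmented prompt (stdlib only).
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def tfidf_vectors(texts):
    """Build a TF-IDF vector (term -> weight) for each text."""
    tokenized = [tokenize(t) for t in texts]
    df = Counter(term for toks in tokenized for term in set(toks))
    n = len(texts)
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: (c / len(toks)) * math.log((1 + n) / (1 + df[t]))
                        for t, c in tf.items()})
    return vectors

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def retrieve(query, docs, k=1):
    """Return the k documents most similar to the query."""
    vecs = tfidf_vectors(docs + [query])
    qvec = vecs[-1]
    ranked = sorted(range(len(docs)), key=lambda i: cosine(qvec, vecs[i]), reverse=True)
    return [docs[i] for i in ranked[:k]]

def build_prompt(question, retrieved):
    """Fold the retrieved statutes into the context ahead of the question."""
    context = "\n\n".join(retrieved)
    return f"Use only the statutes below to answer.\n\n{context}\n\nQuestion: {question}\nAnswer:"
```

The prompt string produced by `build_prompt` is what would be sent to the LLM in place of the bare question.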
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hoU9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff85e0b67-5c15-406c-b3cb-b9c68812f8b7_1574x417.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hoU9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff85e0b67-5c15-406c-b3cb-b9c68812f8b7_1574x417.png 424w, https://substackcdn.com/image/fetch/$s_!hoU9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff85e0b67-5c15-406c-b3cb-b9c68812f8b7_1574x417.png 848w, https://substackcdn.com/image/fetch/$s_!hoU9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff85e0b67-5c15-406c-b3cb-b9c68812f8b7_1574x417.png 1272w, https://substackcdn.com/image/fetch/$s_!hoU9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff85e0b67-5c15-406c-b3cb-b9c68812f8b7_1574x417.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hoU9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff85e0b67-5c15-406c-b3cb-b9c68812f8b7_1574x417.png" width="1456" height="386" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f85e0b67-5c15-406c-b3cb-b9c68812f8b7_1574x417.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:386,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;img&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="img" title="img" srcset="https://substackcdn.com/image/fetch/$s_!hoU9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff85e0b67-5c15-406c-b3cb-b9c68812f8b7_1574x417.png 424w, https://substackcdn.com/image/fetch/$s_!hoU9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff85e0b67-5c15-406c-b3cb-b9c68812f8b7_1574x417.png 848w, https://substackcdn.com/image/fetch/$s_!hoU9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff85e0b67-5c15-406c-b3cb-b9c68812f8b7_1574x417.png 1272w, https://substackcdn.com/image/fetch/$s_!hoU9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff85e0b67-5c15-406c-b3cb-b9c68812f8b7_1574x417.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The description of the problem follows; it can also be found on my GitHub: https://github.com/ngathan/ai-exercises/tree/main/01_llm_retrieval</p><h1><strong>Retrieval Exercise</strong></h1><p><strong>Problem Statement:</strong> For domain-specific problems, such as answering complicated legal questions, using pre-trained LLMs as they are might not suffice. In such scenarios, we retrieve the right documents, add them to the context, and ask the same LLM the question to see how the accuracy changes. Once we have determined that, given the right document, the system does better than the pre-trained LLM alone, we set up a retrieval-augmented generation system and use an LLM to answer the original question.</p><p>The dataset used in this exercise is <a href="https://huggingface.co/datasets/reglab/housing_qa">housing_qa</a> from the <a href="https://reglab.stanford.edu/">Stanford Regulation, Evaluation and Governance Lab</a>. 
You can read more about the dataset on <a href="https://huggingface.co/datasets/reglab/housing_qa">Hugging Face</a> and in the original paper, <a href="https://arxiv.org/abs/2505.03970">A Reasoning-Focused Legal Retrieval Benchmark</a> by Zheng et al.</p><p>This exercise helps you set up a simple RAG system.</p><h3><strong>AI system design</strong></h3><p><strong>Step 1: Evaluate how an LLM answers legal questions on its own</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!AlYM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32396e42-7b83-4169-8264-9bc8dd6ffcba_727x293.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!AlYM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32396e42-7b83-4169-8264-9bc8dd6ffcba_727x293.png 424w, https://substackcdn.com/image/fetch/$s_!AlYM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32396e42-7b83-4169-8264-9bc8dd6ffcba_727x293.png 848w, https://substackcdn.com/image/fetch/$s_!AlYM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32396e42-7b83-4169-8264-9bc8dd6ffcba_727x293.png 1272w, https://substackcdn.com/image/fetch/$s_!AlYM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32396e42-7b83-4169-8264-9bc8dd6ffcba_727x293.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!AlYM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32396e42-7b83-4169-8264-9bc8dd6ffcba_727x293.png" width="727" height="293" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/32396e42-7b83-4169-8264-9bc8dd6ffcba_727x293.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:293,&quot;width&quot;:727,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;img&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="img" title="img" srcset="https://substackcdn.com/image/fetch/$s_!AlYM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32396e42-7b83-4169-8264-9bc8dd6ffcba_727x293.png 424w, https://substackcdn.com/image/fetch/$s_!AlYM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32396e42-7b83-4169-8264-9bc8dd6ffcba_727x293.png 848w, https://substackcdn.com/image/fetch/$s_!AlYM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32396e42-7b83-4169-8264-9bc8dd6ffcba_727x293.png 1272w, https://substackcdn.com/image/fetch/$s_!AlYM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32396e42-7b83-4169-8264-9bc8dd6ffcba_727x293.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft 
pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Step 2: Evaluate how an LLM answers legal questions given the right document</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8r-a!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbdb83f8c-7c12-45d3-a117-aa7283d26698_1093x298.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8r-a!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbdb83f8c-7c12-45d3-a117-aa7283d26698_1093x298.png 424w, 
https://substackcdn.com/image/fetch/$s_!8r-a!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbdb83f8c-7c12-45d3-a117-aa7283d26698_1093x298.png 848w, https://substackcdn.com/image/fetch/$s_!8r-a!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbdb83f8c-7c12-45d3-a117-aa7283d26698_1093x298.png 1272w, https://substackcdn.com/image/fetch/$s_!8r-a!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbdb83f8c-7c12-45d3-a117-aa7283d26698_1093x298.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8r-a!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbdb83f8c-7c12-45d3-a117-aa7283d26698_1093x298.png" width="1093" height="298" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bdb83f8c-7c12-45d3-a117-aa7283d26698_1093x298.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:298,&quot;width&quot;:1093,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;img&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="img" title="img" srcset="https://substackcdn.com/image/fetch/$s_!8r-a!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbdb83f8c-7c12-45d3-a117-aa7283d26698_1093x298.png 424w, 
https://substackcdn.com/image/fetch/$s_!8r-a!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbdb83f8c-7c12-45d3-a117-aa7283d26698_1093x298.png 848w, https://substackcdn.com/image/fetch/$s_!8r-a!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbdb83f8c-7c12-45d3-a117-aa7283d26698_1093x298.png 1272w, https://substackcdn.com/image/fetch/$s_!8r-a!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbdb83f8c-7c12-45d3-a117-aa7283d26698_1093x298.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Step 3: Evaluate the full RAG system, where the LLM answers legal questions using documents retrieved by a retrieval system</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5wrC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59c51e20-44b4-45b7-8242-13fb932eff90_1574x417.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5wrC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59c51e20-44b4-45b7-8242-13fb932eff90_1574x417.png 424w, https://substackcdn.com/image/fetch/$s_!5wrC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59c51e20-44b4-45b7-8242-13fb932eff90_1574x417.png 848w, https://substackcdn.com/image/fetch/$s_!5wrC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59c51e20-44b4-45b7-8242-13fb932eff90_1574x417.png 1272w, https://substackcdn.com/image/fetch/$s_!5wrC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59c51e20-44b4-45b7-8242-13fb932eff90_1574x417.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5wrC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59c51e20-44b4-45b7-8242-13fb932eff90_1574x417.png" width="1456" height="386" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/59c51e20-44b4-45b7-8242-13fb932eff90_1574x417.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:386,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;img&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="img" title="img" srcset="https://substackcdn.com/image/fetch/$s_!5wrC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59c51e20-44b4-45b7-8242-13fb932eff90_1574x417.png 424w, https://substackcdn.com/image/fetch/$s_!5wrC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59c51e20-44b4-45b7-8242-13fb932eff90_1574x417.png 848w, https://substackcdn.com/image/fetch/$s_!5wrC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59c51e20-44b4-45b7-8242-13fb932eff90_1574x417.png 1272w, https://substackcdn.com/image/fetch/$s_!5wrC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59c51e20-44b4-45b7-8242-13fb932eff90_1574x417.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Each step has an accuracy number associated with it. Remember to calculate these accuracies along the way, and check at the end whether they match your expectations.</p><p><strong>Hints:</strong></p><p>Hint 1: When building the pipeline, experiment with the first 20, 50, 100, or 200 data points in the dataset to check that the pipeline works as expected.<br>Hint 2 (Advanced level): Beyond the TF-IDF and n-gram approaches in the problem statement, you can experiment with more advanced embedding methods, such as embedding models from OpenAI or Anthropic.<br>Hint 3 (Advanced level): Implement a FAISS index to store all the statutes and speed up the cosine-similarity calculations over the entire dataset.</p><h2>Conclusion</h2><p>RAG is the bread and butter of context engineering, so understanding how retrieval works is essential to an AI engineer&#8217;s daily practice. This exercise helps you visualize and concretize how RAG works in the context of legal documents. 
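As a closing sketch, the per-step accuracy bookkeeping described above can be tracked with a small harness. The stub answer functions and toy labels below are placeholders for real LLM calls in each of the three settings:

```python
# Sketch of the three-step accuracy comparison: LLM alone, LLM with the
# gold document, and LLM with retrieved documents. The answer functions
# here are dictionary stubs standing in for actual LLM calls.
def accuracy(predictions, gold):
    """Exact-match accuracy over paired predictions and gold labels."""
    assert len(predictions) == len(gold)
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)

def evaluate(answer_fn, questions, gold):
    """Run an answer function over the questions and score it."""
    return accuracy([answer_fn(q) for q in questions], gold)

# Illustrative stand-ins for the three settings:
questions = ["q1", "q2", "q3", "q4"]
gold = ["yes", "no", "yes", "no"]

llm_alone = {"q1": "yes", "q2": "yes", "q3": "no", "q4": "no"}.get     # Step 1
llm_gold_doc = {"q1": "yes", "q2": "no", "q3": "yes", "q4": "no"}.get  # Step 2
llm_rag = {"q1": "yes", "q2": "no", "q3": "no", "q4": "no"}.get        # Step 3

acc1 = evaluate(llm_alone, questions, gold)     # 0.5
acc2 = evaluate(llm_gold_doc, questions, gold)  # 1.0  (upper bound with the right document)
acc3 = evaluate(llm_rag, questions, gold)       # 0.75 (retrieval errors cost some accuracy)
```

One would typically expect the gold-document accuracy to bound the RAG accuracy from above, with the LLM-alone baseline below both.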
The proposed solution is one of many solutions. If you come up with better solutions (faster, higher accuracy), please send pull and merge requests. </p>]]></content:encoded></item><item><title><![CDATA[AI exercises: First stop - LLM as Judge ]]></title><description><![CDATA[Coding into Knowing]]></description><link>https://genaiatwork.substack.com/p/ai-exercises-first-stop-llm-as-judge</link><guid isPermaLink="false">https://genaiatwork.substack.com/p/ai-exercises-first-stop-llm-as-judge</guid><dc:creator><![CDATA[Nga Than]]></dc:creator><pubDate>Fri, 11 Jul 2025 17:52:38 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!kD-X!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37df3cbb-b9a8-4711-a9af-ce040b75abe9_626x626.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kD-X!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37df3cbb-b9a8-4711-a9af-ce040b75abe9_626x626.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kD-X!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37df3cbb-b9a8-4711-a9af-ce040b75abe9_626x626.jpeg 424w, https://substackcdn.com/image/fetch/$s_!kD-X!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37df3cbb-b9a8-4711-a9af-ce040b75abe9_626x626.jpeg 848w, https://substackcdn.com/image/fetch/$s_!kD-X!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37df3cbb-b9a8-4711-a9af-ce040b75abe9_626x626.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!kD-X!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37df3cbb-b9a8-4711-a9af-ce040b75abe9_626x626.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kD-X!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37df3cbb-b9a8-4711-a9af-ce040b75abe9_626x626.jpeg" width="626" height="626" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/37df3cbb-b9a8-4711-a9af-ce040b75abe9_626x626.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:626,&quot;width&quot;:626,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Funny cat with judge wig and gavel isolated on white background | Premium  AI-generated image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Funny cat with judge wig and gavel isolated on white background | Premium  AI-generated image" title="Funny cat with judge wig and gavel isolated on white background | Premium  AI-generated image" srcset="https://substackcdn.com/image/fetch/$s_!kD-X!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37df3cbb-b9a8-4711-a9af-ce040b75abe9_626x626.jpeg 424w, https://substackcdn.com/image/fetch/$s_!kD-X!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37df3cbb-b9a8-4711-a9af-ce040b75abe9_626x626.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!kD-X!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37df3cbb-b9a8-4711-a9af-ce040b75abe9_626x626.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!kD-X!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37df3cbb-b9a8-4711-a9af-ce040b75abe9_626x626.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">AI-generated cat judge</figcaption></figure></div><h2>Coding into Knowing</h2><p>I have recently finished 
reading the book <a href="https://www.amazon.com/AI-Engineering-Building-Applications-Foundation/dp/1098166302/ref=sr_1_3?dib=eyJ2IjoiMSJ9.4IRYkmtBjeQQ20xqiYjJmi5oo3Yx0nXudokPdnqsW6kvPrSdQQ3IKCqjzb7dK5-oLCqGKRewdzRFqMCrNZ261z_VZuyBXKa7G3uBJ5UC_LaGMSZO3Avhe4WamaFVBO1AzB8hGOLhCmWg-9l24GrvnRJeV-hdR2uEg90DXKAo2C5xWlmuSB3HwVOC5VETqtqACTI4L8IYp6xHKbT7kH2P3rzeGmpmQyJBv3ZquJC171o.6-aYlqjC88_3oTDFoTwLHdb0uSDfDSmsrQkqoNWT3dM&amp;dib_tag=se&amp;keywords=AI+engineer&amp;qid=1752234186&amp;sr=8-3">AI Engineering</a> by Chip Huyen. One Amazon reviewer drew the analogy that the book is akin to a literature review paper: a book-length review of the AI engineering literature that summarizes, at a very high level, what generally works in practice. Unlike Chip&#8217;s previous book, <a href="https://www.amazon.com/dp/1098107969/?bestFormat=true&amp;k=designing%20ml%20systems%20by%20chip%20huyen&amp;ref_=nb_sb_ss_w_scx-ent-pd-bk-d_de_k1_1_17&amp;crid=2RXRITIO7KOIY&amp;sprefix=Chip%20Huyen%20design">Designing ML Systems</a>, which spells out the practical issues MLEs have to deal with when putting an ML system in place, AI Engineering read to me as far more theoretical and less practical. One reason is my positionality: I put ML models into deployment on a daily basis, and do not necessarily run AI applications every day. When I read <em>AI Engineering</em>, my brain ran wild with questions. Why should eval datasets be thought through before an AI system is even designed? How does an LLM judge work in practice? We know that retrieval-augmented generation works, but can we show concretely that RAG increases a system&#8217;s accuracy significantly, and if so, by how many percentage points? What is the right number of iterations and experiments for such a system? 
</p><p>When I was trained as a social scientist, one of the tools I used to familiarize myself with a theoretical concept was <em><strong>writing into knowing</strong></em>. In other words, I would not even know whether I understood a theoretical concept or a hypothesis until I wrote something down to delineate the concept, paint its different contours, and understand its different dimensions. </p><p>As a machine learning engineer, my equivalent tool is coding. So over the weekend, I literally <em><strong>coded myself into knowing</strong></em> some of the most important concepts in the book, such as LLM as judge, RAG, temperature and top_p sampling, and agentic systems. </p><p>I thought to myself: why not turn what Chip Huyen wrote in the book into bite-size exercises for AI engineers who want to turn concepts into reusable code? These exercises serve a few purposes: (1) AI engineers can use them in their production systems, (2) interview candidates can practice coding before their job interviews, (3) hiring managers can use them in the coding rounds of AI engineering interviews, and (4) of course, undergraduate and graduate students can translate concepts they have learned in class or read in research papers into something concrete. </p><p>There are a few reasons why I thought these exercises would help those four groups. First, as an engineer myself, I find that translating theory into practice takes a lot of <em><strong>engineering imagination</strong></em>, which requires time and thinking that most MLEs cannot afford. A friend of mine said that LLMs have leveled the playing field: most people have at most 2 to 3 years of work experience in pre-training, post-training, and building AI applications. So upskilling is an absolute must for MLEs. Yet, as a product MLE working day in and day out to improve the business, I can attest to the difficulty of this upskilling journey. 
Not all ML problems are AI problems, and carving out time to learn new AI methods and applications is challenging. These exercises are first and foremost for my colleagues who are working full time as engineers. </p><p>Second, for those who are trying to get a job in AI and are not already developing AI applications, it&#8217;s a tough market out there. If you have not built an AI application, you might not even be aware of many important engineering decisions, and trying to speak intelligently in an AI engineering interview without real-world experience is setting yourself up for disappointment. These exercises let aspiring AI engineering candidates go over the fundamentals, as well as the standard practices in industry, to prepare for their upcoming interviews. </p><p>Third, for hiring managers who want to test candidates&#8217; real AI knowledge instead of asking boring leetcode questions, these exercises are theirs to use (freely). The best AI engineering interview experience I had was at an AI startup, where they asked me to do some data engineering by interacting with their own LLM API. The exercise was quite long, almost 3 hours, but I had so much fun. I was writing prompts to understand the quirks of the LLM, trying to extract passages that I thought were relevant. By the end, I felt like I had learned something from the interview. It was not like a typical leetcode interview, where the answers are relatively scripted and I had to practice a lot beforehand. Not coming from a computer science background, I always have anxiety around certain leetcode questions, and I pray every time not to have to deal with them during interviews. LLM prompting was different: it felt like what I do on a daily basis, and what I would do on the job anyway. Who cares if, as an AI engineer, I couldn&#8217;t reverse a linked list? AI can solve that question much better than I can, but evaluating an AI system sounds much more relevant to the job. 
</p><p>And finally, students and professors in academia are always on my mind. These exercises are for you to simulate what could be done in industry. </p><p>In this post, I describe the first exercise in the AI exercises series: <em><strong><a href="https://huggingface.co/learn/cookbook/en/llm_judge">LLM as judge</a></strong></em>. For each exercise, there will be two Python files: the problem file, and the answer file with my example answers. I have posted the first set of files in this <a href="https://github.com/ngathan/ai-exercises">AI exercises repo</a>. Over the weekend, I will also convert those files into Jupyter notebooks for those who prefer notebooks for stepping through the workflow. </p><h2>LLM judge exercise</h2><p>The passage whose meaning I tried to tease out in <em>AI Engineering</em> is this one (emphasis mine): </p><blockquote><p>The challenges of evaluating open-ended responses have led many teams to fall back on human evaluation. As AI has successfully been used to automate many challenging tasks, can AI automate evaluation as well? The approach of <strong>using AI to evaluate AI is called AI as a judge or LLM as a judge</strong>. An AI model that is used to evaluate other AI models is called an <em>AI judge</em>.<strong><a href="https://learning.oreilly.com/library/view/ai-engineering/9781098166298/ch03.html#id937"><sup>15</sup></a></strong></p></blockquote><p>Alright, the idea that you can prompt one LLM to evaluate its own result, or the result of another LLM, is simple enough. But how does it really work in practice? How can one quantify the evaluation and the improvement? </p><p>I set up a simple exercise using <a href="https://huggingface.co/datasets/openai/gsm8k">the GSM8k dataset</a>, a set of grade-school math problems with a golden answer for each problem. 
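</p><p>Concretely, each GSM8k answer string ends its worked solution with a final line of the form &#8220;#### 72&#8221;, so the gold answer can be split off for exact-match scoring. Here is a minimal sketch; the sample text below is paraphrased for illustration, not a verbatim dataset row: </p>

```python
# GSM8k gold answers end with a line like "#### 72"; split it off for scoring.
def extract_gold_answer(answer_text):
    # Take everything after the final "####", drop whitespace and thousands commas.
    return answer_text.split("####")[-1].strip().replace(",", "")

# Illustrative answer text in the GSM8k format (paraphrased for this sketch).
sample = (
    "Natalia sold 48 clips in April and half as many in May.\n"
    "48 / 2 = 24 clips in May, and 48 + 24 = 72 clips in total.\n"
    "#### 72"
)
print(extract_gold_answer(sample))  # prints "72"
```

<p>Comparing this extracted string against the generator&#8217;s final answer gives the exact-match accuracy used at each step of the pipeline.</p><p>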
You can take a look at the description of this dataset on <a href="https://huggingface.co/datasets/openai/gsm8k">Hugging Face</a>.  One might argue that this dataset is probably already used in the pre-training phase of all the frontier models (such as the recent ones from OpenAI, Anthropic). However, the point of this exercise is not to evaluate how much of the benchmark dataset was already remembered by an LLM. The point of this exercise is to create an evaluation pipeline to demonstrate that LLM can evaluate its own answer, and show concretely what the main steps are. </p><p>Following is a diagram of the LLM judge evaluator system.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lmVq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e5507a0-1840-4882-b3ed-080f23509b50_4018x688.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lmVq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e5507a0-1840-4882-b3ed-080f23509b50_4018x688.png 424w, https://substackcdn.com/image/fetch/$s_!lmVq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e5507a0-1840-4882-b3ed-080f23509b50_4018x688.png 848w, https://substackcdn.com/image/fetch/$s_!lmVq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e5507a0-1840-4882-b3ed-080f23509b50_4018x688.png 1272w, https://substackcdn.com/image/fetch/$s_!lmVq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e5507a0-1840-4882-b3ed-080f23509b50_4018x688.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!lmVq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e5507a0-1840-4882-b3ed-080f23509b50_4018x688.png" width="4018" height="688" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2e5507a0-1840-4882-b3ed-080f23509b50_4018x688.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:688,&quot;width&quot;:4018,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:191879,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://genaiatwork.substack.com/i/168066877?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e8e794b-1fa5-4b3c-a30c-fe2f11823c5b_4018x688.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!lmVq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e5507a0-1840-4882-b3ed-080f23509b50_4018x688.png 424w, https://substackcdn.com/image/fetch/$s_!lmVq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e5507a0-1840-4882-b3ed-080f23509b50_4018x688.png 848w, https://substackcdn.com/image/fetch/$s_!lmVq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e5507a0-1840-4882-b3ed-080f23509b50_4018x688.png 1272w, https://substackcdn.com/image/fetch/$s_!lmVq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e5507a0-1840-4882-b3ed-080f23509b50_4018x688.png 1456w" 
sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">LLM as judge pipeline by Nga Than</figcaption></figure></div><p>There are essentially four steps in this pipeline. First, using prompt engineering, we ask an LLM to generate an answer for each math problem, and we evaluate the accuracy of this first step. Second, we send all the questions to an LLM judge to evaluate each answer. We also evaluate the judge&#8217;s verdicts against the previous answers, to gauge whether the judge agrees or disagrees with what the generator provided. Third, the self-reflection step generates new answers for any problem the judge has identified as wrongly answered. Fourth, we calculate the accuracy at the final step to evaluate whether the system&#8217;s accuracy improves after self-reflection. Seeing the accuracy improvement is the entire reward of this exercise. </p><p>In the problem file, <em>llm_judge_problem.py</em>, which you&#8217;ll find in <a href="https://github.com/ngathan/ai-exercises/tree/main/00_llm_judge">the repo</a>, I have set up the three parts (llm_generator, llm_judge, and self_reflection) as three different functions that you&#8217;ll need to implement on your own. </p><p>In the <em>llm_judge_answers.py</em> file, you&#8217;ll find my proposed answers to the different parts. </p><p>One prompting trick here is that this system should be thought of as one continuous chat: the LLM generator creates the first assistant response, that response is added to the chat, the user then asks the LLM judge question, and answers keep being appended until the end of the pipeline. To demonstrate, here&#8217;s the piece of code at the end of the pipeline, at the self-reflection step: </p><pre><code>chat = [
    {"role": "system", "content": math_answer_prompt},
    {"role": "user", "content": question},
    {"role": "assistant", "content": model_reasoning},
    {"role": "user", "content": "Is the provided solution correct or not? Check the reasoning and if there is any doubt then call it incorrect. "
                                "Please put your answer inside &lt;answer&gt;&lt;/answer&gt; tag and only answer 'correct' or 'wrong'"},
    {"role": "assistant", "content": judge_reasoning},
    {"role": "user", "content": "Okay since you think the answer is wrong can you generate a better response?"}
]</code></pre><p>These are the messages used as the parameter in an OpenAI chat completion API call at the self-reflection step. Essentially, I kept appending content to the chat history. It took me a while to figure out that an LLM-as-judge pipeline essentially functions as one continuous chat. </p><p>Using only 150 examples, after the first step my LLM generator answered with 91.33% accuracy, the LLM judge agreed with the generator on 92% of answers, and after the self-reflection step the accuracy increased to 93.33%. So with this simple setup, the system&#8217;s accuracy increases by 2 percentage points compared to prompt engineering alone. </p><p>After this initial success, I increased the number of examples from 150 to 1,319, the full test dataset. The three numbers are 90.30% for the LLM generator, 89.92% for the LLM judge, and a final accuracy of 90.67%. The improvement is relatively small, only a 0.37 percentage point increase. One could reasonably ask whether it is worth spending so many tokens on such a small gain. I would argue it is worth it when the problem requires very high recall, such as medical record extraction or legal document information retrieval. And the token cost might be minimal: in my case I used GPT-4.1 nano, and the entire exercise didn&#8217;t even cost me $1. </p><h2>Conclusion</h2><p>This exercise demonstrates that, given a golden eval dataset, one can relatively easily add an LLM-as-judge evaluator to the AI system one is building. If you have seen an LLM-as-judge system in deployment or in open-source code, please send it my way. I would love to see how it is deployed at industrial scale. What considerations do you need to take into account when building LLM as judge in industry? Are you using the same LLM in the generation and self-reflection steps? Or are you using different LLMs for different steps? 
How do you reduce latency when the numbers of API calls are in the millions or billions? </p>]]></content:encoded></item><item><title><![CDATA[Vibe Coding ]]></title><description><![CDATA[Machine Learning Scientists Should Embrace It]]></description><link>https://genaiatwork.substack.com/p/vibe-coding</link><guid isPermaLink="false">https://genaiatwork.substack.com/p/vibe-coding</guid><dc:creator><![CDATA[Nga Than]]></dc:creator><pubDate>Sun, 25 May 2025 04:43:23 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!jseB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71889a2f-0079-4b82-b52d-901205fc47b1_3168x4752.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!jseB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71889a2f-0079-4b82-b52d-901205fc47b1_3168x4752.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!jseB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71889a2f-0079-4b82-b52d-901205fc47b1_3168x4752.jpeg 424w, https://substackcdn.com/image/fetch/$s_!jseB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71889a2f-0079-4b82-b52d-901205fc47b1_3168x4752.jpeg 848w, https://substackcdn.com/image/fetch/$s_!jseB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71889a2f-0079-4b82-b52d-901205fc47b1_3168x4752.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!jseB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71889a2f-0079-4b82-b52d-901205fc47b1_3168x4752.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!jseB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71889a2f-0079-4b82-b52d-901205fc47b1_3168x4752.jpeg" width="240" height="360" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/71889a2f-0079-4b82-b52d-901205fc47b1_3168x4752.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:2184,&quot;width&quot;:1456,&quot;resizeWidth&quot;:240,&quot;bytes&quot;:1033949,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://genaiatwork.substack.com/i/164250436?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71889a2f-0079-4b82-b52d-901205fc47b1_3168x4752.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!jseB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71889a2f-0079-4b82-b52d-901205fc47b1_3168x4752.jpeg 424w, https://substackcdn.com/image/fetch/$s_!jseB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71889a2f-0079-4b82-b52d-901205fc47b1_3168x4752.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!jseB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71889a2f-0079-4b82-b52d-901205fc47b1_3168x4752.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!jseB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71889a2f-0079-4b82-b52d-901205fc47b1_3168x4752.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Photo by <a 
href="https://unsplash.com/@agforl24?utm_content=creditCopyText&amp;utm_medium=referral&amp;utm_source=unsplash">Tai Bui</a> on <a href="https://unsplash.com/photos/a-laptop-computer-sitting-on-top-of-a-desk-kifwuOLv7FE?utm_content=creditCopyText&amp;utm_medium=referral&amp;utm_source=unsplash">Unsplash</a></figcaption></figure></div><p>Vibe coding has become a buzzword since February 2025, when <a href="https://x.com/karpathy/status/1886192184808149383?lang=en">Andrej Karpathy</a> introduced the concept on X. In March, <a href="https://techcrunch.com/2025/03/06/a-quarter-of-startups-in-ycs-current-cohort-have-codebases-that-are-almost-entirely-ai-generated/">TechCrunch</a> reported that a quarter of the startups in YC&#8217;s Winter 2025 cohort have codebases that were mostly generated by AI. The idea has gained traction in many circles, and it excites critics and proponents alike. </p><p>I found a <a href="https://www.ibm.com/think/topics/vibe-coding">primer on vibe coding from IBM</a>. The basic idea is that, with the advancement of AI coding assistants and their incorporation into common IDEs (such as GitHub&#8217;s Copilot), one does not need to know how to code to produce functional code that might be ready for production. </p><p>Andrej Karpathy explained in his original X post: </p><blockquote><p>There's a new kind of coding I call "vibe coding", where you fully give in to the vibes, embrace exponentials, and forget that the code even exists. It's possible because the LLMs (e.g. Cursor Composer w Sonnet) are getting too good. Also I just talk to Composer with SuperWhisper so I barely even touch the keyboard. I ask for the dumbest things like "decrease the padding on the sidebar by half" because I'm too lazy to find it. I "Accept All" always, I don't read the diffs anymore. When I get error messages I just copy paste them in with no comment, usually that fixes it. The code grows beyond my usual comprehension, I'd have to really read through it for a while. 
Sometimes the LLMs can't fix a bug so I just work around it or ask for random changes until it goes away. It's not too bad for throwaway weekend projects, but still quite amusing. I'm building a project or webapp, but it's not really coding - I just see stuff, say stuff, run stuff, and copy paste stuff, and it mostly works.</p></blockquote><p>I took his words and made the following illustration: </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!OJ49!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8f1b0c3-5b4f-4aad-afad-389ee671e93c_1958x1066.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!OJ49!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8f1b0c3-5b4f-4aad-afad-389ee671e93c_1958x1066.png 424w, https://substackcdn.com/image/fetch/$s_!OJ49!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8f1b0c3-5b4f-4aad-afad-389ee671e93c_1958x1066.png 848w, https://substackcdn.com/image/fetch/$s_!OJ49!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8f1b0c3-5b4f-4aad-afad-389ee671e93c_1958x1066.png 1272w, https://substackcdn.com/image/fetch/$s_!OJ49!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8f1b0c3-5b4f-4aad-afad-389ee671e93c_1958x1066.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!OJ49!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8f1b0c3-5b4f-4aad-afad-389ee671e93c_1958x1066.png" 
width="678" height="369.26785714285717" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e8f1b0c3-5b4f-4aad-afad-389ee671e93c_1958x1066.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:793,&quot;width&quot;:1456,&quot;resizeWidth&quot;:678,&quot;bytes&quot;:166853,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://genaiatwork.substack.com/i/164250436?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8f1b0c3-5b4f-4aad-afad-389ee671e93c_1958x1066.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!OJ49!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8f1b0c3-5b4f-4aad-afad-389ee671e93c_1958x1066.png 424w, https://substackcdn.com/image/fetch/$s_!OJ49!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8f1b0c3-5b4f-4aad-afad-389ee671e93c_1958x1066.png 848w, https://substackcdn.com/image/fetch/$s_!OJ49!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8f1b0c3-5b4f-4aad-afad-389ee671e93c_1958x1066.png 1272w, https://substackcdn.com/image/fetch/$s_!OJ49!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8f1b0c3-5b4f-4aad-afad-389ee671e93c_1958x1066.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container 
restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Vibe Coding process, Illustrated by Nga Than </figcaption></figure></div><p>Essentially, a user interacts with an AI powered by LLM to generate code (java, python, SQL, etc). They then accept or reject that piece of code by reading it, or running it through some test programs. Whatever they are not happy with they can use prompt engineering or a voice assistant to interact with the LLM code generator. Everything happens via natural language. No code is generated by the user at all. Code is generated, fixed, and debugged by the LLM.  Andrej Karpathy himself said that vibe coding is good for &#8220;throwaway weekend projects.&#8221; He didn&#8217;t make a prediction that people actually would use this in the production environment. 
</p><p>In this blog post, I argue that machine learning scientists (ML scientists), such as machine learning engineers, data scientists, and researchers, should embrace vibe coding in their workflow. However, relying on vibe coding exactly as Andrej Karpathy described it will eventually deskill you as a coder. Thus, I recommend that ML scientists use vibe coding in the exploratory steps, then read through the code, create unit tests, and rewrite certain parts for production-ready software. Furthermore, when a repo is made public as the result of a research project that employed vibe coding as part of the scientific process, make that known to the community. </p><h3>Trust LLMs to produce reliable code</h3><p>If you know what you&#8217;re looking for, then vibe coding is a good exercise. I remember working as a data scientist and doing a lot of (ad hoc) data analysis and first-level prototyping. I was literally a Jupyter Notebook engineer, writing one notebook at a time. I spent a lot of time rewriting certain pandas commands and looking up how to conduct certain statistical tests. For every cell of the notebook, I had to do a few Google searches to verify its correctness and to confirm that the statistical tests I wanted to use actually did what they were supposed to do. </p><p>This way of working as a data scientist reminded me of how I would write as a sociologist. As a social scientist under the pressure of publish-or-perish culture, I would open a blank Word document every day, stare at it, and start thinking about what to write at the beginning of each writing session. Then I would meander into what I actually thought about the topic, my position, what I would argue for, conducting some web research on the side. As one of my favorite writing professors said, &#8220;I will know what I think about subject ABC by the end of the writing session.&#8221; In other words, writing is thinking. 
Writing helps the writer articulate their thoughts in precise terms, clear up any uncertainties, and understand their position in a particular debate. </p><p>Similarly, as a data scientist I came to understand a subject matter, a statistical test, or a business challenge by first staring at a blank Jupyter Notebook, writing one test at a time, one line of code at a time, building one ML model at a time. This process helped me dissect a business problem and figure out how to solve it with ML tools. The process is relatively messy, and one has to <em>iterate</em> many times. </p><p>In this scenario, I think vibe coding is really helpful. If I know what I am looking for, I can write a prompt such as: &#8220;use the given data file data.csv, perform feature engineering on all the features, and train an XGBoost model taking click-through rate as the target variable and the remaining features as predictive features. Calculate the precision and recall at the end.&#8221; At the end, a Jupyter notebook or a model.py file would be generated, and I could look through the code to understand how the AI generates code differently than what I would have written manually. If that worked, prototyping time would shrink significantly. Instead of writing one line of code at a time, I would essentially become an <em>evaluator</em>, verifying the correctness of the code and the analysis. </p><p>Another way of mini vibe coding is to generate one Jupyter Notebook cell at a time. This is possible when creating <a href="https://learn.microsoft.com/en-us/shows/github-copilot-series/using-copilot-with-jupyter-notebooks">a Jupyter Notebook using VSCode</a>, powered by GitHub Copilot. In my experience, GitHub Copilot is more of an autocompletion than total code generation: you write a prompt or some code, and Copilot generates the next code block for you. 
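</p><p>To make the evaluator role concrete, here is a minimal, dependency-free sketch of one check worth running over generated analysis code: re-deriving the final metrics by hand. The labels and predictions below are hypothetical placeholders, not from any real data.csv; an actual pipeline would use pandas and XGBoost.</p>

```python
def precision_recall(y_true, y_pred):
    """Compute precision and recall for binary labels (1 = click)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical labels and model predictions, standing in for a real test split.
y_true = [1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
print(precision_recall(y_true, y_pred))  # 2 true positives, 1 false positive, 1 false negative
```

<p>Checking that a generated notebook computes exactly this, instead of silently substituting accuracy for precision, is the kind of verification the evaluator role entails. 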
The ML scientist still has to make sense of the entire structure of the code from beginning to end. Think of a piece of software as an essay: Copilot tends to generate one paragraph at a time instead of writing the entire essay for you. I use this version of mini vibe coding on a daily basis to save time. However, because my attention is limited, I often prefer Copilot to generate fewer than 10 lines of code, so that I can skim through and approve them quickly. </p><p>What I have not tried is to simply prepare some data and prompt an LLM to conduct the analysis, build the model, and calculate precision and recall for me. I suspect it could handle such a procedure, since the ML model development workflow is relatively well defined. </p><h3>Iterate over the LLM-generated Code for Production </h3><p>When I became an ML engineer, I realized the difference between what an ML engineer does and what a data scientist does lies in the productionization aspect of the ML workflow. ML engineers are essentially software engineers with some ML background. In contrast, data scientists are statisticians and machine learning scientists who happen to also code. In other words, the emphasis of the first category is on shipping products, while the emphasis of the latter is on ML development. ML engineers write more production-ready code. They also write more unit tests. In my experience, the idea of &#8220;test-driven development&#8221; is more of a daily practice for MLEs than for data scientists. Data scientists can explore more, while MLEs are more driven towards production. </p><p>With this distinction in mind, when it comes to vibe coding and production, one has to be very careful with what the LLM has written. The humans in the loop, i.e. ML scientists (MLEs or data scientists), have to control the quality of the code, knowing where it would break in production and how to fix it. This comes with practice. 
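</p><p>As a small illustration of this kind of quality control, below is the shape of unit test I would write before promoting LLM-generated code. The function <code>min_max_scale</code> is a hypothetical stand-in for a generated feature-engineering helper; the constant-column case is exactly the sort of edge case that breaks in production.</p>

```python
def min_max_scale(values):
    """Scale a numeric feature column to [0, 1] (hypothetical generated helper)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Guard against division by zero on constant columns,
        # an edge case generated code often misses.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Unit tests an MLE would add before the helper reaches production:
def test_basic_scaling():
    assert min_max_scale([0, 5, 10]) == [0.0, 0.5, 1.0]

def test_constant_column():
    assert min_max_scale([3, 3, 3]) == [0.0, 0.0, 0.0]

test_basic_scaling()
test_constant_column()
```

<p>Tests like these encode where the code would break in production, and they keep doing that job long after the prompt that produced the helper is forgotten. 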
</p><p>There is no such thing as <em>vibe debugging, vibe troubleshooting, vibe code review, or vibe test-driven development. </em>Vibing is fun and useful in the early exploratory phase of model development. But once the model is locked in, software development has to follow the standard rigorous process of test-driven development: the software has to pass unit tests and be approved by someone (maybe an AI agent in the near future). </p><h3>Using LLMs to write Unit tests </h3><p>Unit tests, at least, are something that LLMs can help with. All MLEs are to a certain extent <a href="https://en.wikipedia.org/wiki/Software_testing">QA (Quality Assurance) engineers</a>. This may be where vibe coding can also help MLEs quickly prototype QA tests for their models, to make sure they will perform well in practice once they are put in production. </p><h3>On the Risk of Deskilling </h3><p>Critics of vibe coding, and of LLM-assisted coding in general, point out that by relying on LLMs to generate one&#8217;s code, one loses intimacy with the code being written, and therefore the knowledge of what is being generated. This is a real risk. Currently, junior software engineers are having a hard time finding jobs because companies assume they might not be as good as LLMs, which have been trained on millions of pieces of code written by experienced coders on the internet. Why hire someone who cannot code as well as an AI? Basically, the door is closing on junior programmers who would want to learn to debug code in a real work environment. So, as a whole industry, people are being deskilled one group at a time. </p><p>What about experienced MLEs and data scientists: are they also being deskilled if they rely too much on LLMs and AI to generate their analyses and production code? Possibly. </p><p>When one of my friends was asked whether he was using AI to generate SQL queries, he answered that AI annoyed him and slowed him down. 
He had a vision of what to accomplish, and AI tended to interfere with his thought process. His approach to data engineering is relatively top-down, while AI tools tend to vibe, writing with the flow. AI has some good ideas every once in a while, but its net effect for my friend is negative: it slows him down, because he can&#8217;t be bothered to articulate his entire complicated thinking in words. It&#8217;s faster to write the code. He even told me once: code doesn&#8217;t lie. So just look through the code. </p><p>In my experience with data engineering, AI is not as helpful as I thought it would be. The main reason is that, as a subject matter expert (SME), I know precisely which feature matters for which model, while AI lacks this crucial context. By typing a very elaborate prompt to guide the AI, I actually slow myself down, because it takes me less time to just write the code myself. To be clear and precise in prompting means adding more context, which sometimes means generating a chunk of text even longer than the SQL query itself. So why would I bother describing what should be done? Given the time constraint, I could simply write the query myself. </p><h3>Conclusion</h3><p>I think vibe coding has a role to play in the workflow of MLEs and data scientists. However, I see it as being good mainly in the early exploratory period of a piece of code or a project. Once the initial code has been generated, in order to put it through production and code review, the human worker still has to scrutinize the accuracy of the generated code. This might lead to some amount of deskilling. Yet I think our skills will shift to testing, quality control, and refining requirements. These all take practice; scrutinizing LLM-generated code takes time and patience. 
</p>]]></content:encoded></item><item><title><![CDATA[Careers are a jungle gym, not a ladder]]></title><description><![CDATA[&#8220;As we climb that jungle gym of our work, we need to look for opportunities to re-package our skills and experiences in ways that bring us meaning and engagement, as well as solve problems for our potential employers.&#8221;]]></description><link>https://genaiatwork.substack.com/p/careers-are-a-jungle-gym-not-a-ladder</link><guid isPermaLink="false">https://genaiatwork.substack.com/p/careers-are-a-jungle-gym-not-a-ladder</guid><dc:creator><![CDATA[Nga Than]]></dc:creator><pubDate>Thu, 02 May 2024 19:11:10 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!hyvl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faba1a391-c388-4148-8b7f-7be9c009a72e_4620x3337.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hyvl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faba1a391-c388-4148-8b7f-7be9c009a72e_4620x3337.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hyvl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faba1a391-c388-4148-8b7f-7be9c009a72e_4620x3337.jpeg 424w, https://substackcdn.com/image/fetch/$s_!hyvl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faba1a391-c388-4148-8b7f-7be9c009a72e_4620x3337.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!hyvl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faba1a391-c388-4148-8b7f-7be9c009a72e_4620x3337.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!hyvl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faba1a391-c388-4148-8b7f-7be9c009a72e_4620x3337.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hyvl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faba1a391-c388-4148-8b7f-7be9c009a72e_4620x3337.jpeg" width="1456" height="1052" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/aba1a391-c388-4148-8b7f-7be9c009a72e_4620x3337.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1052,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2730435,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!hyvl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faba1a391-c388-4148-8b7f-7be9c009a72e_4620x3337.jpeg 424w, https://substackcdn.com/image/fetch/$s_!hyvl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faba1a391-c388-4148-8b7f-7be9c009a72e_4620x3337.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!hyvl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faba1a391-c388-4148-8b7f-7be9c009a72e_4620x3337.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!hyvl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faba1a391-c388-4148-8b7f-7be9c009a72e_4620x3337.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Photo by <a 
href="https://unsplash.com/@huchenme?utm_content=creditCopyText&amp;utm_medium=referral&amp;utm_source=unsplash">Hu Chen</a> on <a href="https://unsplash.com/photos/man-climbing-cliff-beside-beach-FZ0qzjVF_-c?utm_content=creditCopyText&amp;utm_medium=referral&amp;utm_source=unsplash">Unsplash</a></figcaption></figure></div><p>&#8220;As we climb that jungle gym of our work, we need to look for opportunities to re-package our skills and experiences in ways that bring us meaning and engagement, as well as solve problems for our potential employers.&#8221;</p><p>&#8220;Build your skills, not your resume.&#8221;</p><p>&#8213; Sheryl Sandberg, Lean In: Women, Work, and the Will to Lead</p><p>While grappling with intricacies and complexities at my first job at a finance company after graduate school, I stumbled upon Sheryl Sandberg&#8217;s <em>Lean In, </em>where she presented a compelling metaphor: envisioning one&#8217;s career as a jungle gym rather than a straightforward ladder. Gradually I realized that this metaphor often manifests quite literally in the real world. In the competitive tech ecosystem of Silicon Valley, progress demands that one literally &#8220;climbs the jungle gym,&#8221; be it through acquiring a rock-climbing gym membership or navigating the intricate corporate web. Rock climbing is a very popular sport in the Bay area. The essence of this metaphor lies in the contemporary career landscape, where loyalty to a single job or career path rarely spans beyond a few years, rendering the traditional climb up a linear corporate ladder obsolete. Oftentimes, one has to go up and down, move sideways to truly advance.</p><p>Sandberg argues that for women navigating the workforce at different points in their life, the jungle gym metaphor is incredibly liberating. It&#8217;s especially potent for women, who often juggle more diverse career trajectories. 
Whether they are starting anew, pivoting between fields, facing barriers, or stepping back into the workforce after caring for a newborn child or an elderly family member, women can find that this metaphor speaks to their experience. It acknowledges that climbing to the top is a struggle, with each path uniquely challenging. Yet it reassures us that moving up, down, or sideways is all part of the process&#8212;it&#8217;s expected and acceptable. It is rarely smooth sailing, but perseverance is what keeps us climbing. Unlike the career ladder, which suggests a constant, unyielding ascent, the jungle gym presents a more realistic, winding path. And no step is wasted in the process.</p><p>After I read the book, a thought dawned on me&#8212;maybe this is true for me too. The notion of a career as a jungle gym might explain how I ended up where I am now. My academic journey began in the practical world of economics, where I pursued my undergraduate major. Later, in graduate school, my curiosity shifted towards the intricate dance of social phenomena, leading me to explore sociology. Eventually, I realized academia wasn&#8217;t my calling, prompting me to venture into data science bootcamps and computational social science summer schools to acquire the technical prowess my graduate program lacked. With newly honed technical skills, I secured a tech position at a financial company post-graduation. Soon after, I transitioned to a role in machine learning at a tech firm. Transitioning from economics and sociology to machine learning engineering was indeed a winding journey. Unlike many who streamline their career path into machine learning with a straightforward computer science degree, accumulating the requisite technical and statistical skills, my route was more serpentine.&nbsp; Sometimes it feels like I am trying too hard to fit into a tech environment given my training in the humanities and social sciences. 
&nbsp;</p><p>Early in April 2024, I gave a Zoom talk on a technical topic to a group of machine learning engineers in Vietnam. During the talk, the engineers peppered me with insightful and practical questions about scaling generative AI solutions to a larger user base. They wanted to know how to move beyond demos towards technical solutions that can benefit millions of users. Given the state of generative AI adoption in corporations, especially in Vietnam, I found the questions exciting and was happy to engage with the group. Each question struck a chord, resonating with my daily struggles at work. Merely sifting through their questions was an education in itself. This stirred the familiar awe of a professor&#8217;s life, where unknown questions bloom and, often, the audience holds keys you might not. Here lies the true test of humility. This longing pulled at me, stirring a desire to teach and speak publicly more often, to prepare longer, yet to embrace the confidence to share and interact even when not fully armored with preparation or expertise.&nbsp;</p><p>Returning to the questions at hand, one curious participant slid a private message my way via Zoom&#8217;s private message feature, inquiring about my journey from sociology to machine learning&#8212;what was the path like? Given the lack of time and the personal nature of the query, I opted to hold off on responding, prioritizing the questions aired openly by the group instead. Yet once the talk was over, the question lingered in my mind, compelling me to consider a more thorough and detailed response.</p><p>The short answer to such a question is that it&#8217;s a process. The long-winded one is that the process was like a jungle gym, and many a time fortune was on my side while I zigzagged through its challenging, rugged terrain. It took a few years for this transition to happen.</p><p>Let&#8217;s rewind to that pivotal moment in 2019 when I realized academia wasn&#8217;t my forever path. 
That summer, I attended a computational social science summer school, picking up methodologies that would redefine how I approached my research. The short two-week intensive course resulted in a collaborative research project, and I relished every moment of the research and creative writing process. This collaborative endeavor wove together my collaborators&#8217; and my research interests in psychology, social work, and sociology. I soon discovered my knack for the technical side was sharper than most, prompting me to spearhead the technical challenges of our project. Throughout the writing phase, I poured my heart into the teamwork, contributing wherever I could. This experience crystallized two fundamental aspects of research for me: I thrive on technical challenges, and I flourish in collaborative settings.</p><p>Two pressing questions emerged. First, if I ventured from academia into the bustling tech industry, what kinds of roles would welcome me? Second, among those options, which one would ignite my passion the most? I spent the following two years seeking answers to those two crucial questions.</p><p>As an empirical scientist, I approached my queries through a method of trial and error, bolstered by gathering my own data. Initially, armed with an academic CV rich in research skills and project management, spanning AI ethics and traditional sociology&#8212;including causal inference and experimental research design&#8212;I opportunistically targeted industry roles that might value such a diverse background.</p><p>Through conversations with graduate school friends who had recently landed jobs in industry or academia, I gathered that securing a job often requires multiple attempts, especially for someone with a social science background like mine. Several paths lay before me. One was to secure a cushy internship that could be converted into a full-time offer. 
The challenge, however, was pinpointing the career track I desired, as internships are often specific to departments and career trajectories. Alternatively, I could apply for a few full-time positions over the next couple of years, while narrowing down the right career choices, identifying skill gaps, and sharpening those skills.</p><p>When the Covid-19 pandemic suddenly descended upon us and everyone in New York City had to follow the shelter-in-place policy, I seized this bizarre turn of events to explore research projects that had previously been beyond my reach due to local constraints. I joined forces with colleagues at the Montreal AI Ethics Institute to work on AI ethics papers. I attended the University of Michigan&#8217;s summer school virtually, immersing myself in as many machine learning courses as I could manage. I earned an A- in the machine learning course, satisfying the advanced statistics certificate requirement for my graduate studies. I had planned to attend this summer school before, yet the various costs of a summer in Michigan had always held me back. With the advent of online learning and Zoom classes, concerns over air tickets and summer housing evaporated; fortuitously, the extra scholarships I secured that year covered the summer school tuition. Suddenly, the shelter-in-place policy shattered numerous barriers, opening up possibilities for me.</p><p>At the same time, I initiated a new project with my former research team. The year 2020 was a period of remarkable productivity for me, both in research and in acquiring new technical skills. In the spring and fall semesters of 2020, I took on teaching roles in data mining and statistics at the graduate level for the master&#8217;s program in social science research at Hunter College. Throughout my graduate studies, I had discovered a valuable lesson: to truly master a subject, I needed to teach it. 
This teaching forced me to investigate the subject matter more, read more, and confront questions to which I could only respond, &#8220;That&#8217;s a great question; I&#8217;ll get back to you on that.&#8221; Often, as soon as I got back home, I would dig into research papers and try to craft a partial answer to send to my students.</p><p>Forced to be indoors, I realized that I had much extra time on hand that used to be commuting time on the train. This led me to engage more with various academic online communities, and I could spend more time on academic Twitter, scouring for insights and discussions about job opportunities that would suit me after graduation. From Twitter, I stumbled upon a burgeoning academic sociology group whose research gathered around digital phenomena such as digital inequality, gaming communities, and the use of Instagram to capture gentrification. The group met informally on Slack, formed out of necessity to maintain scholarly connections amidst the disruption of COVID-19. I even co-led a computational social science reading group with several new colleagues from this forum over the course of a year. Facilitating this reading group enabled me to meticulously sift through seminal research papers in my chosen field of computational social science. It also compelled me to familiarize myself with the leading scholars with whom I aspired to engage in scholarly dialogue. &nbsp;</p><p>In late 2020, I opportunistically applied for a new residential fellowship for PhD holders at Apple. This was the inaugural year of the program at Apple Inc., likely initiated by Ian Goodfellow, a famous computer scientist who had recently joined Apple from Google, whose famous residency fellowships have helped many land coveted full-time jobs at Google. Essentially, this fellowship mirrored a year-long internship, or a post-doc, with a much better salary than a post-doc in academia. 
I threw my hat in the ring without thinking much about my chances of getting a callback. My reasoning was simple: it couldn&#8217;t hurt to try. To my surprise, a recruiter reached out a couple of months later, and I commenced my first FAANG interview process in early February of 2021. &nbsp;</p><p>The previous summer, on my friend Evelyn Le&#8217;s blog, I had read about her journey to landing a software engineering position at Microsoft, even though her background was in political science, public policy, and NGO work. While preparing for the job interviews, she was also taking care of her two young children. Evelyn described her journey of teaching herself to become a software engineer on her<a href="https://selftaughtsoftwareengineer.com/how-i-got-3-software-engineer-offers-from-microsoft-17-months-self-teaching/"> Self Taught Software Engineer blog</a>. It was one of her earliest posts after she got the job offer at Microsoft. I was intrigued by the idea that one could become a self-taught engineer by utilizing resources available online. Through Evelyn&#8217;s reflections, I discovered that tech companies prioritize technical skills and knowledge such as algorithms and data structures, and expect candidates to solve problems within one hour using their preferred programming language. Shortly after finishing the blog post, I signed up for LeetCode and started learning algorithms and data structures in my spare time.</p><p>I began my preparation in earnest. Signing up for a LeetCode subscription, I committed to solving 5-10 problems daily from the moment I learned of my upcoming interviews until the day of the interviews themselves. &nbsp;</p><p>The recruiter explained that the process would start with a phone screening; should I succeed there, I would advance to a virtual onsite interview. This stage meant engaging directly with several full-time data scientists and engineers from the team. 
The team considering my application specialized in misinformation within the Siri Knowledge division.</p><p>I tackled the preparation head-on, despite the process feeling entirely alien to me. I was surprised that my profile was even considered, given I had not coded in Python. Until then, my go-to programming language had been R, widely favored in statistics and social science research. Moreover, my confidence in my statistical knowledge was tenuous at best. Aside from three statistics courses during my graduate studies and ongoing causal inference research, my actual machine learning experience was minimal, hardly seeming sufficient for the interview. Yet because my resume had been considered, I resolved to give it my all.&nbsp; If I passed the interview and landed a job at Apple, it would force me to finish my PhD and transition into the tech industry. If not, I aimed to advance through as many interview rounds as possible, to gain a comprehensive onsite experience at a tech company and better understand what it takes to secure a position on a misinformation team as a technical person. With determination, I soldiered on.</p><p>Every day, I diligently practiced LeetCode problems. Each practice exercise brought new insights into algorithms and data structures like linked lists, arrays, and depth-first and breadth-first searches, amongst other topics. Those concepts fascinated me. Back in high school, I had learned some fundamental algorithmic concepts, including stacks, queues, recursion, and dynamic programming. However, that was more than 15 years earlier, and most of those concepts had faded from my memory. Besides, the programming language I used back in high school was Pascal. 
Although Pascal is seldom used today and its syntax is not particularly transferable, the foundational knowledge of algorithms lay dormant in my mind, waiting to be unearthed and reactivated through fresh learning, review, and deeper understanding.</p><p>I had about a month to prepare for the interview amidst a whirlwind of other activities: writing papers, teaching a course, attending seminars. My spirits were high as I told myself: I had secured an interview with a FAANG company on my first attempt. So, embracing the Vietnamese proverb &#8220;kh&#244;ng th&#224;nh c&#244;ng c&#361;ng th&#224;nh nh&#226;n&#8221; (even if you don&#8217;t succeed, you grow as a person), I was prepared for any outcome. Reflecting on my LeetCode commit history, I had devoted a considerable amount of time to practice in July 2020, right in the throes of the pandemic. This was after I had read Evelyn Le&#8217;s blog, while learning Python on my own at the Michigan summer school and striving to grasp what it truly meant to code in Python.</p><p>As the February 2021 interview approached, I intensified my practice regimen, diligently working through problems every day from mid-January until the interview on February 9, 2021. Once the interview concluded, I took a break from practicing until August 2021, when I began gearing up to officially enter the job market.</p><p>The initial phone screening, a technical interview, was conducted by the team lead. He did not call for the first 30 minutes, which made me nervous enough to email the recruiter about the delay; he had probably been caught up resolving some technical issue. Upon joining the call, he apologized profusely. I had cleared my schedule before the interview and for at least an hour after it, knowing these sessions could be both psychologically and physically draining. 
His being late might have inadvertently helped me past the first round; rather than pose technical or LeetCode-style questions, he simply asked if I was familiar with the tic-tac-toe problem, to which I responded affirmatively. He then asked if I knew how to solve it. I confirmed, explaining that I could solve it by hand and could likely code a programmatic solution as well. I knew in the back of my mind that I should not say no, whether or not I actually knew how to solve the problem; otherwise the interview would have ended right there.</p><p>My answering yes to the tic-tac-toe problem concluded the technical portion of the phone call. For the next thirty minutes, we talked about the Siri Knowledge Team&#8217;s current focus, their expectations for a residency candidate, and their hope to transition someone from the one-year opportunity into a full-time role. He then asked about the status of my PhD and whether I would complete it by the residency&#8217;s start. I affirmed that I would, despite knowing that my PhD progress was at a standstill and that I couldn&#8217;t possibly complete it on schedule. My priority was to secure the job first, then address the completion of my PhD later. The interviewer, a PhD in physics himself, informed me that holding a PhD was not a requirement for the role on Apple&#8217;s Siri Knowledge Team. Nevertheless, from one professional to another, he advised me to finish my PhD if it was a personal educational goal of mine.</p><p>I valued his advice and appreciated his candor. That concluded my first phone screening. The interaction left me feeling positive; it felt more like a conversation between two professionals than a formal interview. There was considerable back-and-forth, with discussions centering on my career aspirations and the team&#8217;s requirements. 
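(An aside for curious readers: the tic-tac-toe winner check the team lead alluded to really can be coded in a few lines. The sketch below is my own illustrative Python version, not the solution the interviewer had in mind.)

```python
def tic_tac_toe_winner(board):
    """Return 'X' or 'O' if that player has three in a row, else None.

    board is a 3x3 list of lists containing 'X', 'O', or None.
    """
    lines = []
    lines.extend(board)                                                # the 3 rows
    lines.extend([[board[r][c] for r in range(3)] for c in range(3)])  # the 3 columns
    lines.append([board[i][i] for i in range(3)])                      # main diagonal
    lines.append([board[i][2 - i] for i in range(3)])                  # anti-diagonal
    for line in lines:
        if line[0] is not None and line[0] == line[1] == line[2]:
            return line[0]
    return None
```

The whole problem reduces to checking eight candidate lines, which is exactly the kind of by-hand reasoning one can translate directly into code.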
This interaction starkly contrasted with my usual experiences in the academic world, where hierarchies were sharply defined. In academia, I was often treated as a trainee, an in-training scholar rather than an established one, whose opinions were listened to but not truly valued given my limited track record of published scholarly work. This professional exchange immediately made me feel like an integral part of the team, valued as an adult and a working professional whose contributions could impact the team&#8217;s workflow. Being treated as an equal during the interview process further solidified my decision to transition from academia to industry.</p><p>Shortly after I put the phone down, I received word that I had advanced to the second round. At this point, the stress began to mount. I was unsure what to expect from the 4-5 interview rounds at Apple. The prospect seemed daunting, particularly as my understanding of machine learning and related methodologies, which I had only begun to explore over the previous two summers, was still quite rudimentary. I had by then published an article using natural language processing methods and was writing the next, but I still felt inadequate.</p><p>The interviews with the Apple Siri Knowledge team proved insightful in numerous ways. The on-site interviews spanned an entire day, running from 2pm to 8pm, interspersed with occasional breaks. The day included four comprehensive technical interviews and one behavioral interview with the skip-level manager. I approached these interviews with the mindset of learning as much as possible, assuming the role of a social scientist eager to understand the work of technical professionals and the rigor involved in securing such positions.</p><p>By the end of the on-site interviews, I was completely exhausted. The experience had its highs and lows, and I left feeling uncertain about my chances of securing the job. I later realized that this feeling was quite prophetic. 
Within a week, I learned that I had not been selected for the job. So, adopting the spirit of &#8220;thua keo n&#224;y ta b&#224;y keo kh&#225;c&#8221; (lose this round, set up the next), I moved on. Honestly, I was not selected because I was thoroughly underprepared for the interviews.</p><p>At the same time, I applied for a couple of internships in UX research, a field many of my sociology and social science friends had pursued post-graduation. One response came from Wealthfront, a San Francisco-based fintech startup offering robo-advisor services. One day, the hiring manager called me to discuss the position, my goals for the internship, and the projects I had outlined in the application letter. After we had chatted for half an hour, he noted that since the UX research team was part of the design department and I hadn&#8217;t mentioned any design experience, the role might not be a good fit. The feedback was incredibly revealing. I appreciated his promptness and asked if I could stay in contact, given his publicly known willingness to mentor newcomers to the field of UX research.</p><p>There I was at the end of the internship interview season, having gotten only two sets of interviews at two distinct companies, each for a different career path: data science and UX research. By this point, I had gathered some insight into potential career paths following my graduate studies.</p><p>I decided to give the UX career path one more shot before giving up on it completely, applying for a co-op opportunity at an ecommerce company. I only made it to the first interview, which was conducted by a sociologist turned UX researcher at the company. During the interview, I could hear the lack of passion in my own voice as I spoke about UX research, and presumably so could the sociologist interviewing me. 
Clearly, I lacked enthusiasm for daily research tasks such as collecting user data for reports, understanding the micro-journeys of users during purchases, or analyzing survey data to learn why consumers prefer certain products over others. By the end of the interview, I felt certain that I would not advance to the next round. My intuition was correct; sometimes a gut feeling can reveal a great deal. The co-op opportunity was awarded to another sociology PhD student I knew. That attempt marked my second and final effort to secure a position in UX research. I resolved to pursue only data science and machine learning engineering from that point onward, since my background had at least enabled me to progress to the on-site interview round at Apple, unlike my experiences with UX positions at both startups and established tech companies. That marked the end of my pursuit of a career as a UX researcher.</p><p>Having decided to pursue a career in data science with full dedication, I realized further training was essential. Early in 2019, I had applied to the Data Science Fellowship at the Data Incubator, aiming to join their summer 2019 cohort. Ultimately, I could not attend because I lacked the funds to pay the pricey bootcamp&#8217;s full price tag, and I was not chosen as one of the few fellows awarded the full scholarship. When 2021 arrived, I re-applied to the Data Incubator, hopeful for an offer that would cover the full tuition. Once again, I didn&#8217;t receive a scholarship covering all expenses. But as luck would have it, due to Covid the entire program had transitioned to an online format, reducing the tuition to a third of its original cost: approximately $8,000 for the entire 10-week program. This became affordable when I successfully negotiated with a fellowship program at my graduate school to fund the tuition in exchange for summer work. To my surprise, the program director agreed! 
So it worked out for me in the end. It just goes to show that it&#8217;s always worthwhile to ask for help when you need it. I consider myself fortunate that there has always been someone willing to sponsor part of a conference or a program whenever I truly needed it.</p><p>So there it was: I had enrolled in an esteemed data science bootcamp alongside some accomplished PhDs and postdocs in mathematics, computer science, materials science, and neuroscience. The program also admitted social science and humanities PhD candidates like me; some held PhDs in English literature or anthropology. I was surprised at first, but when I learned that they all had STEM undergraduate degrees from well-known universities, I realized that their admission was not random. The cohort mate who amazed me the most had recently earned a PhD in English literature, having written a dissertation on mathematicians. I gained extensive knowledge of statistics, SQL, and knowledge graphs, and honed my coding skills in Python. During previous summers, I had learned simplified versions of these tools tailored for social scientists engaged in research; this summer, I acquired the data science and machine learning skills essential for undertaking projects in corporate contexts. The program was markedly more practical, featuring numerous hands-on exercises aimed at scaling up and refining project scopes.</p><p>More importantly, the program maintained partnerships with a diverse group of hiring entities, including both startups and established companies. Every week, the program distributed job descriptions from employers seeking data scientists. I attended presentations by hiring partners, who showcased their companies, teams, and projects, and I applied selectively for a few of the jobs circulated to our summer cohort.</p><p>By summer&#8217;s end, I had applied to several positions and scored one or two interviews with different hiring partners. 
However, I was not actively pursuing job opportunities just yet, for two reasons. First, my dissertation was not yet in good enough shape for me to leave my PhD. Second, had I secured a position, the company would likely have wanted me to start as soon as possible rather than wait an entire year for me to finish my education. I gave myself a few more months to enhance my skills while writing up my PhD dissertation. Over the following months, I improved my skills by reading deep learning textbooks, started applying for additional jobs, and refined my resume. I also undertook more challenging research projects that utilized the advanced natural language processing techniques I had recently acquired at the bootcamp. I sharpened my identity as a data scientist capable of both general tasks, like product data science, and specialized work in natural language processing.</p><p>Strategically, I narrowed my focus by applying for jobs that explicitly mentioned NLP. I read the influential research papers of the time, including those introducing BERT, RoBERTa, and GPT-2, and began fine-tuning BERT and RoBERTa for my own research projects.</p><p>Meanwhile, I discovered that most companies increase their headcount in the early months of the year: January, February, and March. As a result, I kept a low profile with my applications in fall 2021, planning to apply en masse in early 2022. During the winter break of 2021/2022, I intended to further enhance my technical skills and theoretical knowledge, as well as complete LeetCode exercises.</p><p>As 2022 commenced, I received invitations for several onsite interviews. My previous interview experience at Apple proved immensely beneficial. Data science interviews at non-tech companies often include a take-home exercise, which I completed in Jupyter Notebook with ease; having finished the bootcamp, I felt more confident with the tool than ever before. 
I enjoyed completing those exercises.</p><p>The onsite interviews typically consisted of a panel with anywhere from 2 to 20 team members. I needed to present my solutions and answer all questions posed during the session, which usually lasted between 30 minutes and an hour.</p><p>Ultimately, the set of interviews at Prudential Financial stood out from the rest. I navigated the different rounds with ease, and everyone who interviewed me left a good impression. The response was swift; just two weeks after the interview, I received a call saying I had gotten the job offer.</p><p>Salary negotiations began shortly after, and within a week we had agreed on the terms of my contract. I signed, planning to start after completing my PhD dissertation, which I anticipated finishing soon. Signing the offer also allowed me to stop all other job applications, relax, and direct my energy towards the last mile of my academic work. I began my new position in late May 2022 and remained with the company until March 2024, when I transitioned to the machine learning engineering track, which is quite close to data science roles at tech companies.</p><p>That is the very long answer to the question of how I transitioned from sociology to machine learning. It was a lengthy process: it began with an inkling of curiosity in 2019, followed by more deliberate learning in 2020, intense learning and application efforts in 2021, and then on-the-job learning upon starting a full-time position. In total, the journey spanned approximately five years, pursued intermittently.</p><p>Time flew by; during those five years, I wasn&#8217;t constantly aware of my evolving role as a machine learning engineer. I wrote a dissertation in sociology, published articles in the social sciences and AI ethics, and wrote op-eds and popular articles. 
I pursued hobbies in urban gardening and skiing, and I read numerous novels and social science books. In theory, those five years could be condensed into just a couple of years by someone solely pursuing a master&#8217;s degree in AI or computer science. I took the long route, but it brought me great joy and fulfillment, especially because I was able to discover the pathway one step at a time; I did not follow any specific curriculum or a compressed timeline.</p><p>I am happy where I ultimately ended up. Careers are like a jungle gym: one should enjoy the journey of zigzagging from one role to another until finding a place where one feels truly comfortable.</p>]]></content:encoded></item><item><title><![CDATA[Lean Startup Model: Centering Customer Experience ]]></title><description><![CDATA[I have just finished reading the book The Lean Startup by Eric Ries. It was recommended to me almost a decade ago by food startup founders whom I interviewed for a food startup research project while in grad school.]]></description><link>https://genaiatwork.substack.com/p/lean-startup-model-centering-customer</link><guid isPermaLink="false">https://genaiatwork.substack.com/p/lean-startup-model-centering-customer</guid><dc:creator><![CDATA[Nga Than]]></dc:creator><pubDate>Sun, 07 Jan 2024 03:53:21 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!U1jw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4e9d301-f943-49bf-bf2d-2cd7d06b3a32_421x325.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I have just finished reading the book <a href="https://www.amazon.com/Lean-Startup-Entrepreneurs-Continuous-Innovation/dp/0307887898/ref=sr_1_1?crid=3K117SRNWQWMB&amp;keywords=the+lean+startup&amp;qid=1704590495&amp;sprefix=the+lean+startup%2Caps%2C448&amp;sr=8-1">The Lean Startup by Eric Ries</a>. 
It was recommended to me almost a decade ago by food startup founders whom I interviewed for a<a href="https://metropolitics.org/Between-Two-Crises-New-York-s-Artisanal-Food-Startup-Founders.html"> food startup research project</a> while in grad school. Several times, the book was mentioned as if it were the bible of the startup world. I had delayed a deep dive into the book until now. Over this holiday season, I picked it up and read it from cover to cover. The book is about how teams make decisions in the face of extreme uncertainty. Some of its insights are worth examining here, in the era of breakneck-speed generative AI developments.</p><p><strong>The Vision-Strategy-Product Pyramid</strong> </p><p>According to Ries, the company&#8217;s vision is the foundation that holds all activities at a startup together. Product, sales, and growth strategies are tactics built on top of the vision. &#8220;Pivoting&#8221; is an important concept in the startup world, but Ries doesn&#8217;t suggest that a pivot means changing without direction; it has to be a change conditioned on lessons learned and grounded in the broad vision the company holds. These lessons are discovered using the &#8220;lean startup&#8221; methodology, which places an emphasis on &#8220;validated learning.&#8221; Validated learning means that for everything one does, there must be a clear hypothesis, a way to test that hypothesis, and a decision made based on the results of that test. In other words, Ries is advocating scientific management and product development. Finally, once a product or feature is created, the rest of the work is optimization. 
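To make the idea of validated learning concrete: the hypothesis test at the end of an experiment can be as simple as a two-proportion z-test on conversion rates. The sketch below, with made-up numbers, is my own illustration rather than anything from the book:

```python
import math

def ab_test_decision(conv_a, n_a, conv_b, n_b, z_crit=1.96):
    """Two-proportion z-test: does variant B convert differently from A?

    conv_* are conversion counts, n_* are sample sizes. Returns (z, decision),
    where decision is True if the difference is significant at the ~5% level.
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)  # pooled rate under the null hypothesis
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return z, abs(z) > z_crit

# Hypothetical experiment: 120/1000 conversions on A vs 150/1000 on B
z, significant = ab_test_decision(120, 1000, 150, 1000)
```

If the difference is significant, the team has validated (or invalidated) the feature hypothesis; if not, the lesson is that no effect was detected, and the next build-measure-learn iteration begins.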
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!U1jw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4e9d301-f943-49bf-bf2d-2cd7d06b3a32_421x325.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!U1jw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4e9d301-f943-49bf-bf2d-2cd7d06b3a32_421x325.png 424w, https://substackcdn.com/image/fetch/$s_!U1jw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4e9d301-f943-49bf-bf2d-2cd7d06b3a32_421x325.png 848w, https://substackcdn.com/image/fetch/$s_!U1jw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4e9d301-f943-49bf-bf2d-2cd7d06b3a32_421x325.png 1272w, https://substackcdn.com/image/fetch/$s_!U1jw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4e9d301-f943-49bf-bf2d-2cd7d06b3a32_421x325.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!U1jw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4e9d301-f943-49bf-bf2d-2cd7d06b3a32_421x325.png" width="421" height="325" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c4e9d301-f943-49bf-bf2d-2cd7d06b3a32_421x325.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:325,&quot;width&quot;:421,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:36854,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!U1jw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4e9d301-f943-49bf-bf2d-2cd7d06b3a32_421x325.png 424w, https://substackcdn.com/image/fetch/$s_!U1jw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4e9d301-f943-49bf-bf2d-2cd7d06b3a32_421x325.png 848w, https://substackcdn.com/image/fetch/$s_!U1jw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4e9d301-f943-49bf-bf2d-2cd7d06b3a32_421x325.png 1272w, https://substackcdn.com/image/fetch/$s_!U1jw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4e9d301-f943-49bf-bf2d-2cd7d06b3a32_421x325.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>The Build-Measure-Learn Feedback Loop</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!bNTk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d2b79ae-eaa9-4b6f-99a8-0f5cdd7fa1b0_283x301.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!bNTk!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d2b79ae-eaa9-4b6f-99a8-0f5cdd7fa1b0_283x301.png 424w, https://substackcdn.com/image/fetch/$s_!bNTk!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d2b79ae-eaa9-4b6f-99a8-0f5cdd7fa1b0_283x301.png 848w, 
https://substackcdn.com/image/fetch/$s_!bNTk!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d2b79ae-eaa9-4b6f-99a8-0f5cdd7fa1b0_283x301.png 1272w, https://substackcdn.com/image/fetch/$s_!bNTk!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d2b79ae-eaa9-4b6f-99a8-0f5cdd7fa1b0_283x301.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!bNTk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d2b79ae-eaa9-4b6f-99a8-0f5cdd7fa1b0_283x301.png" width="283" height="301" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3d2b79ae-eaa9-4b6f-99a8-0f5cdd7fa1b0_283x301.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:301,&quot;width&quot;:283,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:18558,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!bNTk!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d2b79ae-eaa9-4b6f-99a8-0f5cdd7fa1b0_283x301.png 424w, https://substackcdn.com/image/fetch/$s_!bNTk!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d2b79ae-eaa9-4b6f-99a8-0f5cdd7fa1b0_283x301.png 848w, 
https://substackcdn.com/image/fetch/$s_!bNTk!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d2b79ae-eaa9-4b6f-99a8-0f5cdd7fa1b0_283x301.png 1272w, https://substackcdn.com/image/fetch/$s_!bNTk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d2b79ae-eaa9-4b6f-99a8-0f5cdd7fa1b0_283x301.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><strong>Build-Measure-Learn Feedback Loop</strong></figcaption></figure></div><p>The build-measure-learn feedback loop is simply a 
product development cycle. A team goes through product ideation, builds something quickly, measures the effects of the product on the target population, collects the necessary data, and learns from the analysis. Then the cycle repeats. The main emphasis is on reducing the time it takes to complete this cycle: the earlier an idea can be turned into a product, the faster the team can learn from target consumers whether there&#8217;s a market fit. </p><p>Minimizing the time it takes to complete a build-measure-learn loop is beneficial in many areas, not just product development. One might think of it as a fractal framework, where within each circle (ideas, build, product, measure, data, learn) there&#8217;s a mini build-measure-learn loop. For example, as a data scientist whose products are internal analytical reports, I work with internal stakeholders. When I build something quickly, I can ship my &#8220;product,&#8221; which is often an analysis. Then I can quickly measure whether my &#8220;consumers,&#8221; i.e. other departments, appreciate or agree with the report. Finally, I can collect their feedback, in the form of quantitative analysis or verbal comments, and learn from that data. When the cycle is complete, I can go back to refine the idea, or discard it altogether and move on to the next project. </p><p><strong>Critique of the agile framework</strong></p><p>One important insight from the book is the concept of &#8220;waste.&#8221; According to Ries, any building or ideating effort that doesn&#8217;t directly contribute to shipping a product early is waste. What Ries means is that most people want to ship the most polished products to customers. 
However, in order to polish a product, a team must make many engineering and design decisions, some of which go into features or aspects of the product that customers don&#8217;t even want. So instead of aiming for perfection, teams should aim for completion. Customers should have the &#8220;first crappy product&#8221; in hand so they can provide grounded feedback, critique, and suggestions that the design and engineering teams can then act upon. This idea resonates with me both as a data scientist and as a writer. </p><p>During my graduate school years, the idea I loved most was to &#8220;write the first crappy draft&#8221; of any paper I was trying to produce: from a simple blog post, to a book chapter, a research article, and finally the dissertation. At the beginning I was so afraid of producing bad work that I never produced anything at all, until I took a class on writing for publication, where I learned that most good writers have written bad sentences, and that the first draft is the most important draft. Once it is out, there&#8217;s something to work on. One can start editing it, send it around for feedback, even print it out to tear it apart, burn it, or whatever. But there must be something concrete to criticize; otherwise, everything is still in one&#8217;s head. If it&#8217;s not on paper, it&#8217;s not an idea. So I got over my fear of producing bad writing. I just wrote one sentence after another; that&#8217;s how most writing gets done. Then I started sending my writing to others. Most feedback I got at first was not that exciting, but people reacted to my writing. At least I had an audience, albeit an uninterested one. Gradually I felt comfortable writing at greater length. 
In other words, I agree with Ries&#8217;s approach of not wasting any time debating, negotiating, or attending endless meetings that don&#8217;t lead to putting a product, or another version of the product, into the customer&#8217;s hands. </p><p>One pitfall that engineering teams often fall into is getting stuck in the Build phase. This is where the agile development framework comes in. It is a very popular framework in software development, and it privileges the build process: it was created for engineers to optimize their productivity without taking into consideration what customers actually want at the end. Ries argues that the agile framework, with stories being accepted constantly to move production along, is too rigid for a process he calls validated learning, in which each story has to have an end goal in mind: does this story contribute to building a feature that increases customer engagement, revenue, etc.? In other words, procedural thinking (agile development) privileges productivity without asking whether that productivity is needed in the first place. Who, then, calls it out when engineers create stories that turn out to be wasteful? The answer is not any one person, but rather a system of processes: validated learning. That means each story, once completed, has to be validated as actually helping a customer, most of the time through hypothesis testing or A/B testing. </p><p>The idea that engineering effort might be wasteful if its productivity is acknowledged only by managers, and not necessarily by consumers, is very powerful. One contemporary example is the engagement of engineers with customers on social media about <a href="https://en.wikipedia.org/wiki/Starfield_(video_game)">Starfield</a>, a video game produced by Bethesda. The game has received very negative reviews left and right, especially from Bethesda&#8217;s long-term fans. 
Most criticism is along the lines that the game is boring, the world-building lacks depth, and the characters are not memorable. Fans have gone on to post their feelings about the game on different sites, write game reviews, make YouTube videos, etc. Interestingly, the game&#8217;s developers have come out to defend how hard it was to create the game, build its features, design a game engine, and so on. </p><p>Of course, players won&#8217;t understand how hard it is to make each piece of the entire game. But what matters is that the consumers, i.e. gamers, don&#8217;t think highly of the game even after all the effort that engineers put in. We are seeing a few issues at play: (1) the developers worked diligently for five years to release a beautiful product that they hoped customers would enjoy; (2) customers (mostly Bethesda fans) expected a wow experience that would draw them back in; (3) customer feedback has been so negative, and the cultural phenomenon so big, that everyone wants to make clear their side of why the game turned out the complete opposite of what everyone had hoped for. From a Lean Startup point of view, everything that doesn&#8217;t go into how a customer positively experiences the game is waste, meaning that much of the time engineers spent figuring out how to create a feature was wasted. What puzzles me even further is why it takes so long to build a game. Five years without proper mainstream customer feedback baked into the process seems like a very risky business. Is there a way of building a game that takes a lot less time? Or, if it has to take that long, can user testing (with mainstream users) be done more often during the building process? </p><p>Now, what are the lessons for creating products in the Generative AI era? I would argue that following the lean startup philosophy is the way to go for startups as well as large corporations. 
Since the field is moving so quickly, it is important to get products to consumers quickly. Only when customers can experience what generative AI can do for them can a team decide whether their product has any market fit. </p>]]></content:encoded></item><item><title><![CDATA[Why Generating Infographics is so Hard for GenAI?]]></title><description><![CDATA[I encountered some surprising behavior of DALLE when tinkering with the prompts proposed by Ethan Mollick in his post about constructing expert prompts that get the most out of chatGPT.]]></description><link>https://genaiatwork.substack.com/p/why-generating-infographics-so-hard</link><guid isPermaLink="false">https://genaiatwork.substack.com/p/why-generating-infographics-so-hard</guid><dc:creator><![CDATA[Nga Than]]></dc:creator><pubDate>Thu, 21 Dec 2023 14:01:39 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!YYlq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff731d037-0787-47ed-8b90-abae6a28dabe_1792x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I encountered some surprising behavior of DALLE when tinkering with the prompts proposed by Ethan Mollick in his post about constructing expert prompts that get the most out of chatGPT. He humorously calls a repository of <a href="https://www.oneusefulthing.org/p/now-is-the-time-for-grimoires">these prompts grimoires</a>. The idea of having expert prompts to create expert GPTs is brilliant. With some prompting, one can turn the default, boring chatGPT into a professional expert in some field. I realized that the prompt he created for<a href="https://chat.openai.com/share/a873704a-7ca2-48a6-90e8-9a0eb2d6ea4d"> a tutorGPT</a> is HUGE. It&#8217;s almost 400 words long. No one could have come up with it without a lot of tinkering and hacking one&#8217;s way through many experiments.  
Having seen how long this prompt is, I started to wonder whether I have been prompting the right way. Most of my prompts are relatively short: tell me this, summarize this, how do I do this? Sometimes I get tasks done right. Sometimes GPT sounds so dumb that it takes Google to solve the problem. The observation that <a href="https://hbr.org/2023/12/to-work-well-with-genai-you-need-to-learn-how-to-talk-to-it#:~:text=Jaime%20Teevan%20is%20chief%20scientist,backed%20innovation%20in%20Microsoft%20products.">a detailed prompt is more effective than a shorter one</a> is also shared by Jaime Teevan, Chief Scientist at Microsoft. She argues that we humans know a lot of context, which we need to explicitly give chatGPT; otherwise it doesn&#8217;t know what we&#8217;re trying to do. </p><p>Writing effective prompts is an art that grade school, or my doctoral program for that matter, didn&#8217;t prepare me for. Effective prompt writing is a completely new art, a new form of writing, that takes &#8220;the AI imagination&#8221; to do right. The AI imagination is the ability of a person to understand what the limits of AI are, and thus to construct appropriate prompts that harness the power of genAI simply through words. This imagination is a skill, not an innate ability. It is a disciplinary skill akin to what an undergraduate student might gain from method classes such as &#8220;Historical Imagination,&#8221; &#8220;Sociological Imagination,&#8221; or &#8220;The Art of Mathematical Reasoning.&#8221; </p><p>What does it take to write prompts effectively, then? It&#8217;s about giving context. It&#8217;s about describing the role of the GPT. It has to assume a role. It has to know the boundaries of its character: what it can do and what it cannot do. The more detail we provide, the better the character will behave. </p><p>So I have been doing things pretty much the wrong way. I haven&#8217;t packed in enough detail. 
I have been a pretty lazy prompt writer. I only want an answer, instead of happily writing a long prompt with a lot of imagination packed into it. The more you put into an AI, the more you get out of it. AI engineering is hard work! </p><p>Long story short, while playing with the tutorGPT that Mollick designed, I found that it doesn&#8217;t really do a good job of teaching me a new concept. I asked it to teach me about matrix decomposition, or the chain rule in probability. It would generate text fine; though I didn&#8217;t fact-check the answers with a mathematician, I assume they were correct. </p><p>As a visual learner, and as I remember from my teaching days in graduate school, I know that a picture, an infographic, or a diagram is worth a thousand words. A concept lodges in the learner&#8217;s mind in very different ways. So after each explanation, I asked chatGPT to generate a figure, a diagram, or an image to encapsulate the idea. 
Here are what it generated for matrix decomposition:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!YYlq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff731d037-0787-47ed-8b90-abae6a28dabe_1792x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!YYlq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff731d037-0787-47ed-8b90-abae6a28dabe_1792x1024.png 424w, https://substackcdn.com/image/fetch/$s_!YYlq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff731d037-0787-47ed-8b90-abae6a28dabe_1792x1024.png 848w, https://substackcdn.com/image/fetch/$s_!YYlq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff731d037-0787-47ed-8b90-abae6a28dabe_1792x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!YYlq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff731d037-0787-47ed-8b90-abae6a28dabe_1792x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!YYlq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff731d037-0787-47ed-8b90-abae6a28dabe_1792x1024.png" width="1456" height="832" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f731d037-0787-47ed-8b90-abae6a28dabe_1792x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:832,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1848546,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!YYlq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff731d037-0787-47ed-8b90-abae6a28dabe_1792x1024.png 424w, https://substackcdn.com/image/fetch/$s_!YYlq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff731d037-0787-47ed-8b90-abae6a28dabe_1792x1024.png 848w, https://substackcdn.com/image/fetch/$s_!YYlq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff731d037-0787-47ed-8b90-abae6a28dabe_1792x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!YYlq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff731d037-0787-47ed-8b90-abae6a28dabe_1792x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image Generated by DALLE-2 about matrix decomposition</figcaption></figure></div><p></p><p>The image is pretty impressive. I love the idea of representing two types of matrix decomposition (LU decomposition &amp; Singular Value Decomposition) by dividing a stack of pancakes. But when examining the illustration further, I found it&#8217;s odd: the words seem very very wrong. &#8220;Simigmag,&#8221; &#8220;Sigmma&#8221; are supposed to be sigma. SMUA is SVD I think. 
</p><p>I kept iterating through a few examples, here is the illustration for the workflow of a  UX researcher: </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!yJfo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6f20505-f310-4643-ab13-fbcdeca01f51_1792x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yJfo!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6f20505-f310-4643-ab13-fbcdeca01f51_1792x1024.png 424w, https://substackcdn.com/image/fetch/$s_!yJfo!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6f20505-f310-4643-ab13-fbcdeca01f51_1792x1024.png 848w, https://substackcdn.com/image/fetch/$s_!yJfo!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6f20505-f310-4643-ab13-fbcdeca01f51_1792x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!yJfo!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6f20505-f310-4643-ab13-fbcdeca01f51_1792x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!yJfo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6f20505-f310-4643-ab13-fbcdeca01f51_1792x1024.png" width="1456" height="832" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c6f20505-f310-4643-ab13-fbcdeca01f51_1792x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:832,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2643640,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!yJfo!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6f20505-f310-4643-ab13-fbcdeca01f51_1792x1024.png 424w, https://substackcdn.com/image/fetch/$s_!yJfo!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6f20505-f310-4643-ab13-fbcdeca01f51_1792x1024.png 848w, https://substackcdn.com/image/fetch/$s_!yJfo!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6f20505-f310-4643-ab13-fbcdeca01f51_1792x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!yJfo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6f20505-f310-4643-ab13-fbcdeca01f51_1792x1024.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image generated by DALLE-2 to describe the workflow of a UX researcher</figcaption></figure></div><p>All words on this image are misspelled: &#8220;data colleriction,&#8221; &#8220;Anaysis,&#8220; &#8220;IIInterwiees,&#8220; etc. This gave me a good laugh.  
</p><p>Here&#8217;s another mathy one, an illustration of the chain rule in probability: </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Z6s3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38aabaaf-0a34-49fb-835f-399601901b71_1024x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Z6s3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38aabaaf-0a34-49fb-835f-399601901b71_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!Z6s3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38aabaaf-0a34-49fb-835f-399601901b71_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!Z6s3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38aabaaf-0a34-49fb-835f-399601901b71_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!Z6s3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38aabaaf-0a34-49fb-835f-399601901b71_1024x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Z6s3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38aabaaf-0a34-49fb-835f-399601901b71_1024x1024.png" width="1024" height="1024" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/38aabaaf-0a34-49fb-835f-399601901b71_1024x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1709606,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Z6s3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38aabaaf-0a34-49fb-835f-399601901b71_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!Z6s3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38aabaaf-0a34-49fb-835f-399601901b71_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!Z6s3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38aabaaf-0a34-49fb-835f-399601901b71_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!Z6s3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38aabaaf-0a34-49fb-835f-399601901b71_1024x1024.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image generated by DALLE-2: chain rule in probability</figcaption></figure></div><p></p><p>Same problem: the words are misspelled everywhere: &#8220;decommpiision&#8221;, &#8220;prooibiity,&#8221; &#8220;proviity,&#8221; &#8220;probiity&#8221;, &#8220;proiity.&#8221;</p><p>I reiterated the problem, and ask chatGPT to fix the chain rule figure, and this is what it got the second time around:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!niKG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb545cb2f-513a-48a1-a05a-bd551125c17a_1024x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!niKG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb545cb2f-513a-48a1-a05a-bd551125c17a_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!niKG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb545cb2f-513a-48a1-a05a-bd551125c17a_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!niKG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb545cb2f-513a-48a1-a05a-bd551125c17a_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!niKG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb545cb2f-513a-48a1-a05a-bd551125c17a_1024x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!niKG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb545cb2f-513a-48a1-a05a-bd551125c17a_1024x1024.png" width="1024" height="1024" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b545cb2f-513a-48a1-a05a-bd551125c17a_1024x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2664601,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!niKG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb545cb2f-513a-48a1-a05a-bd551125c17a_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!niKG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb545cb2f-513a-48a1-a05a-bd551125c17a_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!niKG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb545cb2f-513a-48a1-a05a-bd551125c17a_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!niKG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb545cb2f-513a-48a1-a05a-bd551125c17a_1024x1024.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p>The text accompanying the regenerated illustration is as follows:</p><blockquote><p> The revised illustration, ensuring correct spelling and accurate depiction of the chain rule in probability decomposition, is now available. This image visually represents the sequence of events A, B, and C, with clear and correct labeling, and demonstrates the multiplication of their probabilities in the correct sequence, adhering to the chain rule.</p></blockquote><p>So instead of revising the same digital illustration, which is already pretty cool, it generated a completely new illustration. Fine. I get it. I do the same thing: instead of fixing the bugs in my production code, I&#8217;d rather generate something completely new. I can sympathize. But upon closer inspection, words are still misspelled everywhere: &#8220;prtitery,&#8221; &#8220;chaclulabte,&#8221; &#8220;probility.&#8221; Regeneration doesn&#8217;t really fix the important details. </p><p>Overall, the generated infographics are pretty impressive at first glance. The &#8220;negative space,&#8221; the white space, the area left empty, is utilized well. The core concepts might be well illustrated, but the words are SO WRONG!!</p><p><strong>Why is getting words right in an infographic so difficult?</strong></p><p>This seems to be a hard problem that image generation models are still trying to solve. There is even <a href="https://www.reddit.com/r/dalle2/comments/vnxo10/do_you_think_dall_e_2_cant_spell_on_purpose/">a reddit post</a> about this issue. 
Redditors joked that DALLE doesn&#8217;t have a spell check because it&#8217;s using a diffusion model. I wish spell check and diffusion models could be merged! What seems so easy for human eyes becomes hard when the underlying generation model is a diffusion model. One observation is that all text-to-image models have encountered the same no-spell-check issue, but models with more parameters seem to do better than smaller ones. Another reason I found is that DALLE is likely using <a href="https://medium.com/@martinomics/why-dalle-2-doesnt-print-text-correctly-9c1f21e9038a#:~:text=The%20reason%20this%20occurs%20is,the%20words%20as%20a%20whole.">a character-based model instead of a word-based model</a>. The model only considers individual characters within a word rather than the word as a whole. So when generating a picture, only individual characters are considered instead of the entire word (the context). Others have pointed out that<a href="https://community.openai.com/t/does-anyone-experience-issues-with-dall-e3-generating-typos-in-text-within-images/472966"> DALLE 3 seems to have done better than DALLE 2, but</a> that doesn&#8217;t mean the problem no longer exists with a better image generation model.</p><p>The next question is more of a hypothesis: in this particular exercise, I am asking chatGPT to generate something akin to an illustration in a textbook or lecture notes. It fails to generate a complete image with correct words. Does that mean this is where humans should be involved? This reminds me of the research <em><a href="https://datasociety.net/library/repairing-innovation/">Repairing Innovation</a></em> by Elizabeth Anne Watkins and Madeleine Clare Elish. They found that when AI algorithms fail in hospitals, human workers such as nurses and doctors are the ones in the trenches who carry out the necessary labor to fix what goes wrong. 
Are we seeing a rise of &#8220;repairing innovation&#8221; work done by humans in the age of generative AI, stepping in when AI fails to deliver? Some users of DALL-E humorously <a href="https://community.openai.com/t/does-anyone-experience-issues-with-dall-e3-generating-typos-in-text-within-images/472966">suggested that we can simply use Photoshop</a> to fix the words that DALL-E couldn&#8217;t get right. That must be fun for those who are Photoshop experts. At least they don&#8217;t have to come up with the design from zero; they only have to fix certain aspects of it. If the total number of hours they put into fixing it is smaller than the number of hours it would take to create something from scratch, maybe it&#8217;s a productivity win (loosely defined). </p><p>The final set of questions I have: When can this problem be fixed? How would it be fixed? What would it take? It seems to be a hard problem: generating text, but in the form of pixels. I bet someone is working on a dissertation that tries to solve this issue. 
If anyone knows of such a research, please let me know.</p>]]></content:encoded></item><item><title><![CDATA[Everyone is a Prompt Engineer in the Age of GenAI ]]></title><description><![CDATA[Every worker in the year of 2024 will be a prompt engineer.]]></description><link>https://genaiatwork.substack.com/p/everyone-is-a-prompt-engineer-in</link><guid isPermaLink="false">https://genaiatwork.substack.com/p/everyone-is-a-prompt-engineer-in</guid><dc:creator><![CDATA[Nga Than]]></dc:creator><pubDate>Sat, 16 Dec 2023 22:54:27 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!fYf3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdb0db5e-ee5e-4e6e-8491-f2fc9d2c69a3_1024x1024.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fYf3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdb0db5e-ee5e-4e6e-8491-f2fc9d2c69a3_1024x1024.webp" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fYf3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdb0db5e-ee5e-4e6e-8491-f2fc9d2c69a3_1024x1024.webp 424w, https://substackcdn.com/image/fetch/$s_!fYf3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdb0db5e-ee5e-4e6e-8491-f2fc9d2c69a3_1024x1024.webp 848w, https://substackcdn.com/image/fetch/$s_!fYf3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdb0db5e-ee5e-4e6e-8491-f2fc9d2c69a3_1024x1024.webp 1272w, 
https://substackcdn.com/image/fetch/$s_!fYf3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdb0db5e-ee5e-4e6e-8491-f2fc9d2c69a3_1024x1024.webp 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!fYf3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdb0db5e-ee5e-4e6e-8491-f2fc9d2c69a3_1024x1024.webp" width="1024" height="1024" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fdb0db5e-ee5e-4e6e-8491-f2fc9d2c69a3_1024x1024.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:348022,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/webp&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!fYf3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdb0db5e-ee5e-4e6e-8491-f2fc9d2c69a3_1024x1024.webp 424w, https://substackcdn.com/image/fetch/$s_!fYf3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdb0db5e-ee5e-4e6e-8491-f2fc9d2c69a3_1024x1024.webp 848w, https://substackcdn.com/image/fetch/$s_!fYf3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdb0db5e-ee5e-4e6e-8491-f2fc9d2c69a3_1024x1024.webp 1272w, 
https://substackcdn.com/image/fetch/$s_!fYf3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdb0db5e-ee5e-4e6e-8491-f2fc9d2c69a3_1024x1024.webp 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image Generated by DALLE with prompt &#8220;Generate an image of a serious knowledge worker in the form of a cat that tries to edit some documents on a screen with lots of documents around with tasks to do like powerpoint creation, and production development.&#8221;</figcaption></figure></div><p>Every worker in the year of 2024 will be a 
prompt engineer. Since the inception of ChatGPT in 2022, there has been tremendous interest, with financial and human resources pouring into generative AI research and app development. <a href="https://open.substack.com/pub/genaiatwork/p/coming-soon?r=1770e&amp;utm_campaign=post&amp;utm_medium=web">My prediction</a> is that all companies will try to make generative AI work for them in 2024. Innovation and entrepreneurship scholar Ethan Mollick argues that with generative AI, organizations need to reimagine <a href="https://www.oneusefulthing.org/p/reshaping-the-tree-rebuilding-organizations?r=1770e&amp;utm_campaign=post&amp;utm_medium=web">how organizations organize</a> and create different work processes. Business leaders should take responsibility for actively participating in redesigning the future of work, with AI as an essential part of organizational processes. The main argument hinges on the idea that with generative AI, a team&#8217;s work process becomes something akin to &#8220;<a href="https://en.wikipedia.org/wiki/Collective_intelligence#:~:text=Collective%20intelligence%20(CI)%20is%20shared,peer%20review%20and%20crowdsourcing%20applications.">collective intelligence</a>&#8221; between humans and AIs, where a team&#8217;s productivity is a synergy between human workers and their AI colleagues. Each AI team member can do a job or perform subtasks and sub-processes. The figure below demonstrates how a typical process can incorporate AI workers to speed up the delivery of a particular assignment. In essence, we&#8217;re entering a work culture where each person brings a suite of AI agents to augment, automate, and enhance their work. 
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!tFVs!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99263512-f895-4704-ad38-dc53b80cfb86_1686x567.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!tFVs!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99263512-f895-4704-ad38-dc53b80cfb86_1686x567.png 424w, https://substackcdn.com/image/fetch/$s_!tFVs!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99263512-f895-4704-ad38-dc53b80cfb86_1686x567.png 848w, https://substackcdn.com/image/fetch/$s_!tFVs!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99263512-f895-4704-ad38-dc53b80cfb86_1686x567.png 1272w, https://substackcdn.com/image/fetch/$s_!tFVs!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99263512-f895-4704-ad38-dc53b80cfb86_1686x567.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!tFVs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99263512-f895-4704-ad38-dc53b80cfb86_1686x567.png" width="1456" height="490" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/99263512-f895-4704-ad38-dc53b80cfb86_1686x567.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:490,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!tFVs!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99263512-f895-4704-ad38-dc53b80cfb86_1686x567.png 424w, https://substackcdn.com/image/fetch/$s_!tFVs!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99263512-f895-4704-ad38-dc53b80cfb86_1686x567.png 848w, https://substackcdn.com/image/fetch/$s_!tFVs!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99263512-f895-4704-ad38-dc53b80cfb86_1686x567.png 1272w, https://substackcdn.com/image/fetch/$s_!tFVs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99263512-f895-4704-ad38-dc53b80cfb86_1686x567.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 
7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Work Process as a collaboration between human workers and AI tools: credit Ethan Mollick</figcaption></figure></div><p>The ideal process that Ethan Mollick paints is very intriguing. An open question for business leaders and workers alike is how to actually implement this ideal work process in practice. I would argue that in 2024 every worker should learn the art of prompt engineering and, in one way or another, perform the job of a prompt engineer. While prompt engineering emerged as a new job title after the introduction of ChatGPT, it should instead be conceptualized as a set of skills for every worker to master in this GenAI age. Any organization that creates a structure and a work culture of learning and upskilling, one that promotes prompt engineering as an essential skill for its workers, will get ahead in the race to harness generative AI&#8217;s power. Prompt engineering should no longer be a specialized skill relegated to a group of specialists called prompt engineers. 
It is similar to the introduction of personal computers: typing is no longer a specialized job, and everyone now learns how to type in grade school.</p><p>Microsoft Chief Scientist Jaime Teevan recently pointed out that to unlock the tremendous power of large language models, we <em>(yes, the collective we)</em> should not focus on fine-tuning a large language model or building a new one; rather, the main value to organizations lies in mastering prompt engineering and coming up with creative prompts that unlock the productivity promises of large language models. She further observes that <a href="https://hbr.org/2023/12/to-work-well-with-genai-you-need-to-learn-how-to-talk-to-it">prompt innovation has mainly come from the research community</a>, which should not be the case. Prompt engineering innovation should come from the billions of people who use large language models daily. In other words, as organizations incorporate GenAI into their workflows, any employee can innovate and add to the collective power of prompt engineering. </p><p><a href="https://www.iab.cl/wp-content/uploads/2023/11/SSRN-id4573321.pdf">A recent study</a> by Harvard Business School and the Boston Consulting Group, examining professional consultants&#8217; productivity gains when using generative AI, found that &#8220;professionals who <em><strong>skillfully </strong></em>navigate this frontier gain large productivity benefits when working with the AI, while AI can actually decrease performance when used for work outside of the frontier&#8221;<strong> (emphasis mine). 
</strong>The tasks the authors examined were non-routine, which makes automation less straightforward; they note that the introduction of ChatGPT ushered in &#8220;an entirely new category of automation, one whose abilities overlapped with the most creative, most educated, and most highly paid workers.&#8221; Workers who could resolve tasks outside the frontier exhibited &#8220;centaur behavior.&#8221; These users &#8220;switch between AI and human tasks, allocating responsibilities based on the strengths and capabilities of each entity. They discern which tasks are best suited for human intervention and which can be efficiently managed by AI.&#8221; Don&#8217;t we all want to have this &#8220;centaur behavior&#8221; now? </p><p>The authors also found that workers who were given access to GPT-4 along with an overview of how to use prompts effectively performed slightly better than workers given access to the tool alone. This finding is very important: even when the tool is made available, workers with prompt engineering training will, on average, outperform workers without it. This insight has profound implications for workforce training and educational curriculum design for wider AI adoption.</p><p>In 2023, we observed that businesses and educational institutions initially took a risk-averse stance toward ChatGPT. Now they have put up guardrails and created internal systems that let employees take advantage of it. ChatGPT is now conceptualized as another essential business tool, akin to Google. The question is no longer whether it should be allowed, but how to unlock ChatGPT and other chat-based large language model tools at full scale, based on what we have learned about how these tools are being used at work.</p><p>To take advantage of these tools, I suggest that prompt engineering become an essential skill for every worker. 
This has a few implications for the educational system and workforce training programs: </p><p><strong>Educational institutions</strong></p><ol><li><p>Prompt engineering should be introduced early in one&#8217;s educational journey. Students will use ChatGPT; the question is how they can use it to get the most out of it.</p></li><li><p>I recently helped a college student critique their scholarship and internship essays. I read the essays with an editorial eye and suggested places to cut and word choices to use. Yet instead of taking my suggestions and micro-editing each sentence, the student fed the entire essay into ChatGPT along with my suggestions and had the machine rewrite whole paragraphs, sometimes the entire essay. What I would have done instead: rather than asking ChatGPT to write an entire paragraph, the student should have asked it to critique every sentence, suggest small phrasing changes, and make the differences explicit with a proper rationale. This is a process ChatGPT can do relatively well. In this case, the student was too focused on churning out a well-formatted essay in a short period of time instead of paying attention to word choices, personal voice, sentence structure, and transitions, namely, the many things that make an essay unique and personal. This student would benefit greatly from personalized guidance on what it takes to write a great motivation essay, what good writing principles are, and how to use ChatGPT to achieve those goals. In a sense, ChatGPT is a personal assistant: the more detailed the instructions one gives it, the more one gets out of it. This is akin to an artist creating a piece of art: using a predetermined template is different from exploring an idea and driving it to its fullest conclusion with a tool such as an easel or a camera. It all starts with an idea; the process can be learned and figured out as the artist exercises their agency to master the tool.</p></li><li><p>Prompt engineering should not be conceptualized as a skill for engineering students only. Every student should learn it. The specific style of prompt engineering may depend on the subject matter at hand: a writing class, for example, would use ChatGPT very differently than a math class.</p></li><li><p>Agreeing on a standard for which prompt engineering skills belong at which levels, and for what makes a good prompt, builds an important critical thinking skill that prepares students to succeed in future jobs.</p></li></ol><p><strong>Organizations </strong></p><ol><li><p>Many organizations now have an internal ChatGPT-equivalent tool. Prompt engineering should not be a course that only engineers and technical people take; everyone should have the skills to write a good, functional prompt. Each organization should also develop training programs akin to those I described above for educational institutions.</p></li><li><p>Each organization should proactively upskill its workforce. There are many training courses on different platforms nowadays, and businesses should encourage their use.</p></li></ol><p>What I am most afraid of is that every business will introduce tools such as ChatGPT, but if the tools are not incorporated into each worker&#8217;s daily workflow, they will become irrelevant. This is where HCI scholars and UX researchers have a role to play. The difference between introducing a tool and realizing its productivity gain is tremendous. The question is not how to introduce the tool, but how to empower the workforce to use it effectively and unlock the most productivity gains. 
This is the question of &#8220;how.&#8221; Any organization able to solve this question, upskilling its workforce and encouraging them to use ChatGPT and equivalent tools, will unlock that potential faster and will surely be better off in the age of GenAI.</p>]]></content:encoded></item><item><title><![CDATA[2024 - Year of Generative AI at Work]]></title><description><![CDATA[Implementing Generative AI Solutions at scale]]></description><link>https://genaiatwork.substack.com/p/coming-soon</link><guid isPermaLink="false">https://genaiatwork.substack.com/p/coming-soon</guid><pubDate>Sat, 02 Dec 2023 16:48:44 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!cY-u!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F028b72a5-6da6-4cdb-b2eb-6ed69c1b2cee_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This substack draws inspiration from the paper &#8220;<a href="https://www.nber.org/system/files/working_papers/w31161/w31161.pdf">Generative AI at work</a>&#8221; by Erik Brynjolfsson, Danielle Li, and Lindsey R. Raymond (2023). This research is so far my favorite social science study that empirically examines what generative AI can actually do for businesses. They studied how a conversational assistant transforms the work of more than 5,000 customer support agents (think call center representatives, or agents who chat live with customers). The authors found that using the tool increases productivity by 14% on average, measured by the number of issues resolved per hour. More importantly, they found that the tool affects novice workers more than experienced workers: a 34% improvement for beginners, while highly skilled workers see little effect. They argue that generative AI disseminates the best practices of more able workers and assists newcomers. 
Furthermore, they found that AI assistants improve customer sentiment, increase employee retention, and may support workers&#8217; on-the-job learning. They suggest that access to generative AI can increase productivity, with larger gains for inexperienced workers.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cY-u!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F028b72a5-6da6-4cdb-b2eb-6ed69c1b2cee_1024x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cY-u!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F028b72a5-6da6-4cdb-b2eb-6ed69c1b2cee_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!cY-u!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F028b72a5-6da6-4cdb-b2eb-6ed69c1b2cee_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!cY-u!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F028b72a5-6da6-4cdb-b2eb-6ed69c1b2cee_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!cY-u!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F028b72a5-6da6-4cdb-b2eb-6ed69c1b2cee_1024x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cY-u!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F028b72a5-6da6-4cdb-b2eb-6ed69c1b2cee_1024x1024.png" width="1024" height="1024" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/028b72a5-6da6-4cdb-b2eb-6ed69c1b2cee_1024x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2201187,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!cY-u!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F028b72a5-6da6-4cdb-b2eb-6ed69c1b2cee_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!cY-u!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F028b72a5-6da6-4cdb-b2eb-6ed69c1b2cee_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!cY-u!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F028b72a5-6da6-4cdb-b2eb-6ed69c1b2cee_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!cY-u!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F028b72a5-6da6-4cdb-b2eb-6ed69c1b2cee_1024x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Image generated by DALL-E using prompt: &#8220;Generate an image for an AI summit. Remove people, only keep a logo.&#8221;</strong></p><p>The research setup takes place at a Fortune 500 tech company, where the customer support agents conduct live chats with customers. This setup is a typical human-in-the-loop experiment: agents can ignore the chatbot&#8217;s suggestions and proceed based on their own experience and knowledge if they choose to. This allows the agents to exercise autonomy over their work, their advice, and the quality of service they provide to customers. Regarding the quality of work, the researchers found that off-shore customer service reps were less tired when working overnight shifts to serve US-based customers, and customers escalated to managers less often. This leads to lower worker attrition, which is often driven by newer workers leaving. 
Overall, this research suggests positive changes to both employee satisfaction and customer satisfaction after the AI assistant was introduced into customer-service work. This optimistic outlook on the potential of generative AI for workers&#8217; productivity and business gains propelled me to start this substack. </p><p>This past week, I attended the <a href="https://newyork.theaisummit.com/conference-program">AI Summit</a> in New York City. The conference gathered technologists, journalists, business people, and many established companies as well as startups working on different AI applications. Generative AI was clearly the main theme of the conference. Meta announced Purple Llama at the summit. A Google representative talked about Gemini. OpenAI talked about the one-year anniversary of ChatGPT. Everyone else was busy trying to make generative AI work at different types of organizations, from small &amp; medium businesses to large corporations. Someone even said during a panel that a year ago we had never heard of RAG, retrieval-augmented generation; now it&#8217;s basically the industry standard for implementing an information retrieval system that exploits the promises of large language models. What I gathered from these conversations at the Summit was that 2024 will be the year when everyone tries to make generative AI work.</p><p>The concept of generative AI at work is front and center, and has become more important than ever. Many open questions await technologists, AI ethicists, legal and compliance officers, and social scientists. For technologists, the main question is how to move from toy demos to scalable, self-sustaining solutions, and how to avoid <a href="https://www.mckinsey.com/capabilities/people-and-organizational-performance/our-insights/the-organization-blog/avoid-pilot-purgatory-in-7-steps">pilot purgatory</a>. 
For system designers and AI solutions engineers, the question is how to combine different methods to reliably engineer solutions for the problem at hand. We can now combine LLMs with traditional software engineering, so how can we creatively, cheaply, and quickly deploy our solutions and bring them to market, and thus gain more from GenAI? For compliance officers, the question is how to avoid the pitfalls of generative AI, such as hallucinations and copyright issues. For AI ethicists, what does it mean to create company policy around the use of generative AI such that we develop, deploy, and use these systems responsibly given financial and time constraints? For social scientists, what does it mean to have a culture of generative AI at work? What does this culture look like? Whom does it privilege? Who will lose out at work?</p><p>All of these questions require a lot of brain power and human resources to figure out. As for me, working at the intersection of technological development, business development, and AI ethics, I often ask: how can I develop and deploy a solution that goes beyond a toy, while staying engaged with AI ethics folx and social scientists to think about the downstream effects on different groups of end users?</p><p>All of these are the motivating factors for the inception of this substack. I hope it will be an intellectually exciting space to explore practical as well as theoretical questions about implementing generative AI solutions in the workplace, not only for increased productivity and improved customer experience, but also for greater work satisfaction and a more just society.</p>]]></content:encoded></item></channel></rss>