Yury Kashnitsky's view on machine learning, popular science, career development, programming, soft skills, etc. (https://yorko.github.io)

Google Antigravity: first impressions (2025-11-25, https://yorko.github.io/2025/google-antigravity-first-steps)

First of all, it’s sci-fi stuff, of course. Just imagine showing this to the version of yourself from 5 years ago.

There are at least three differences between Google Antigravity and a standard IDE with a coding agent hovering over it:

  • You can run multiple agents in parallel. Assign frontend to one, backend to another, and research to a third. Then you just come back to check the result;

  • Antigravity launches the application in a sandbox itself, tests it just like a human would, reports on what worked and what didn’t, and then writes a report in a “Walkthrough” file. That is exactly what is happening in the video below: when the browser is highlighted with a blue border, that means Antigravity has taken control;

  • Nice little touches like separate files for Task, Implementation Plan, and Walkthrough; the agents’ thoughts and actions are even more transparent.

You can also watch this video and try to replicate it.

Below we’ll run Antigravity on one cool task that I’ve used for quite a while when teaching Python.

Prey and Predators

As an example, let’s take “Prey and Predators”, a.k.a. “Law of the Jungle”, which is a simplified version of Conway’s Game of Life. This used to be a great take-home assignment in an interview series or a university course, and now it can be vibe-coded in minutes.

Here is a starter repository.

Rules and Task

Rules:

  • A predator can eat an adjacent prey or simply move to an adjacent cell.

  • Prey can also move to an adjacent cell.

  • If a predator does not eat anything for a certain amount of time, it dies.

  • After specific time intervals, predators and prey reproduce if there is a neighboring empty cell. The offspring occupies the free cell.

The current state of the ocean should be displayed on the screen, ideally using a graphical user interface (GUI). The simulation ends either after a certain number of iterations or when all predators or all prey have died.

Use this model to test the hypothesis of cyclical populations of predators and prey.
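The rules above map naturally onto a grid world with a per-tick update loop. Here is a minimal sketch of the predator part of one tick (all names and parameters are my own, not from the starter repository; prey movement and reproduction are omitted for brevity):

```python
import random

EMPTY, PREY, PREDATOR = ".", "p", "P"

def neighbors(grid, r, c):
    """In-bounds 4-neighborhood of cell (r, c)."""
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < len(grid) and 0 <= nc < len(grid[0]):
            yield nr, nc

def predator_step(grid, hunger, max_hunger=3, rng=random):
    """One tick for every predator: eat an adjacent prey (moving onto its cell
    and resetting hunger), otherwise move to a random empty neighbor; a
    predator that hasn't eaten for max_hunger ticks dies."""
    snapshot = [(r, c) for r in range(len(grid))
                for c in range(len(grid[0])) if grid[r][c] == PREDATOR]
    for r, c in snapshot:
        prey = [pos for pos in neighbors(grid, r, c) if grid[pos[0]][pos[1]] == PREY]
        empty = [pos for pos in neighbors(grid, r, c) if grid[pos[0]][pos[1]] == EMPTY]
        h = hunger.pop((r, c), 0) + 1        # one more tick without food
        if prey:
            nr, nc = rng.choice(prey)
            grid[r][c], grid[nr][nc] = EMPTY, PREDATOR   # eat and take the cell
            hunger[(nr, nc)] = 0
        elif h >= max_hunger:
            grid[r][c] = EMPTY                           # starved
        else:
            nr, nc = rng.choice(empty) if empty else (r, c)
            grid[r][c], grid[nr][nc] = EMPTY, PREDATOR
            hunger[(nr, nc)] = h

# a 1x3 ocean: the predator eats the adjacent prey on the first tick
grid = [[PREDATOR, PREY, EMPTY]]
hunger = {}
predator_step(grid, hunger)
```

Running this loop for many ticks while logging population counts is exactly what lets you check the cyclical-population hypothesis.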

Vibe-coding Prey and Predators with Antigravity

For a prompt, I simply gave:

Implement a solution for this task https://github.com/Yorko/prey_predators_cool_programming_task in a modern tech stack, with Python backend and Next.js frontend

Antigravity opened the browser to read through the repository (that’s already pretty cool), then came up with a Task and an Implementation plan.

Then the app prompts the user to review the implementation and proceed (it’s also possible to provide comments to both, in a code review style). After a couple more approvals, it generates the whole codebase and launches the frontend and backend services (it also updates the Task tab on the go).

Finally, Antigravity opens the browser to test the application.

First impressions are fascinating. It’s magical to see the agent live testing the application.

Whether Antigravity is of much help in real projects (as compared to tools like Gemini CLI or similar) – we’ll see.

Tutorial: YouTube summarization with Gemini and Google Cloud Run (2025-05-19, https://yorko.github.io/2025/youtube-summarizer-cloud-run)

Gemini is pretty good at YouTube analysis. You can just drop a YouTube link into gemini.google.com and ask for a summary. Here is an example for https://youtu.be/jCTvblRXnzg – a brief, meme-heavy overview of Google’s AlphaEvolve:

Other LLMs, not equipped with YouTube tools, might not be able to fetch the video content when given only a link:

(funny enough, this particular LLM brags about GPT-4o instead of Google’s AlphaEvolve).

Notes:

  • While Gemini is a multimodal AI capable of understanding images and audio, its YouTube summarization feature primarily focuses on the textual content. It doesn’t typically analyze visual cues or nuances in the speaker’s tone to generate the summary;
  • Summarizing long videos might be a challenge; for such tasks it’s better to first cut the video into chunks.
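One simple way to handle long videos is to split the timeline into fixed-length, slightly overlapping windows and summarize each one separately. A generic sketch (the chunk length and overlap values here are arbitrary choices, not from the tutorial code):

```python
def chunk_intervals(duration_s, chunk_s=600, overlap_s=30):
    """Split a video of duration_s seconds into (start, end) windows of
    chunk_s seconds, each overlapping the previous one by overlap_s seconds."""
    if duration_s <= 0:
        return []
    intervals, start = [], 0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        intervals.append((start, end))
        if end == duration_s:
            break
        start = end - overlap_s  # overlap so nothing is lost at the seams
    return intervals

# a 25-minute (1500 s) video -> three ~10-minute windows with 30 s overlaps
windows = chunk_intervals(1500)
```

Each window’s summary can then be fed into a final “summary of summaries” prompt.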

Goal of the tutorial

Let’s build a web application that summarizes YouTube videos using Google’s Gemini, and deploy it with Google Cloud Run.

This tutorial is a modernized version of the Build a YouTube Summarizer codelab.

Prerequisites

  • Python 3.x;
  • Google Cloud SDK (gcloud) installed and configured;
  • A Google Cloud Project with billing enabled;
  • Required APIs enabled in your Google Cloud Project (Cloud Build, Cloud Run, IAM, Vertex AI, etc. - refer to deployment scripts for specifics);
  • Docker (for local building if not using Cloud Build).

Installation & Setup

  1. Open terminal and clone the repository with code for this tutorial;

     git clone [email protected]:Yorko/youtube-summarizer-cloud-run.git
    
  2. Install uv – the modern Python dependency manager:

     pip install uv
    
  3. Create a virtual env:

     uv venv
    
  4. Install Python dependencies with uv:

    uv sync
    
  5. Configure Google Cloud:

    • Set your project ID: gcloud config set project YOUR_PROJECT_ID
    • Ensure your user account or service account has the necessary permissions (e.g., Cloud Run Admin, Cloud Build Editor, Service Account User, Vertex AI User).

Local usage

  1. Run the FastAPI backend:
     make run_backend
    
  2. Run the Streamlit frontend (in a different terminal tab/window):
     make run_frontend
    
  3. Access the application in your web browser: http://localhost:8501
  4. Enter a YouTube video link (e.g. this one https://youtu.be/jCTvblRXnzg on AlphaEvolve by DeepMind), optionally add a prompt, and click “Generate Summary”.

Deployment to Google Cloud Run

The project includes a script to automate the build and deployment process.

  1. Review and modify build_n_deploy.sh: Update PROJECT_ID, SERVICE_NAME, DEPLOY_REGION, and SERVICE_ACCOUNT variables as needed for your environment.
  2. Build the container image using Google Cloud Build:

    make build_cloud_image
    
  3. Deploy the service to Cloud Run:

    make deploy_cloud_run_service
    

    This command deploys the container image built in the previous step, configuring the service account, minimum instances, memory, and allowing unauthenticated access by default. The script will output the URL of your deployed Cloud Run service. Visit this URL to test your deployed YouTube Summarizer!

  4. In case your organization doesn’t allow unauthenticated access, you can proxy the service to localhost:

     gcloud run services proxy ${SERVICE_NAME}
    

Just like with the locally run application, this will open the app at http://localhost:8080.

Bonus challenges (optional):

  • Explore the scripts/enable_oauth_for_cloud_run.sh script. Understand how it sets up a Load Balancer and Identity-Aware Proxy (IAP) to restrict access to authenticated users. Try implementing it for your service;
  • Modernize the app: split the frontend & backend into different services and use cloudbuild.yaml to specify different Dockerfiles for them;
  • Experiment with prompting.
48 interviews away from a Google offer (2024-09-30, https://yorko.github.io/2024/interviews-23-24)

I’ve been looking for Applied ML Scientist positions for quite a while; I got rejections from 16 companies (mostly big tech and startups) and 2 offers. In an earlier post, I shared some helpful resources; here I’ll share some high-level stats.

Breakdown by interview type:

  • Behavioral - 13.5
  • Coding - 8.5
  • ML breadth - 6
  • ML depth - 5
  • ML coding - 4
  • Research presentation - 4
  • ML system design - 3.5
  • Take-home assignment - 3
  • System design - 0.5

I was positively surprised that coding tasks were generally easy (think LeetCode Easy), and I looked for senior+ positions, hence so many behaviorals. Also, for Applied ML Scientist roles you see a skew towards ML interviews and almost no system design interviews.

Interview leads (how did I get to the 1st interviews):

  • Referral - 7
  • Cold application - 4
  • Reached out directly to the Hiring Manager - 4
  • Recruiter/HM reaching out - 3
  • Data Science communities - 2

Most common questions

I didn’t record each and every one, of course, but these stood out:

  • p-value. Learn it by heart but also understand what it means
  • in NLP-style ML breadth interviews, they all want you to explain the transformer or attention architecture. I made a post on how I take these questions
  • Amazon is notorious for puzzling behavioral questions like “tell me about a time you used data to adapt your strategy”. Other companies and startups would rather ask you to describe successes and failures/conflicts, i.e. more expected stuff
  • I was never (exactly 0 times) challenged to describe my weaknesses. Not sure if the importance of this question is overrated, just my experience
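Since “explain attention” comes up so often, it helps to be able to write scaled dot-product attention from memory. A minimal pure-Python sketch for a single query vector (toy numbers, no batching, no learned projections):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for one query: score each key against
    the query, softmax the scores, return the weighted sum of values."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    dim_v = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(dim_v)]

# toy example: the query matches the first key, so the output leans
# towards the first value vector
out = attention([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [[10.0, 0.0], [0.0, 10.0]])
```

In a real transformer the queries, keys, and values come from learned linear projections of the token embeddings, and this runs per head, but the core computation is exactly these three lines.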

Interviews

Here are some highlights.

Apple Music, ML Researcher for Recommender Systems, London

Got rejected after the first interview. They ended up hiring my former colleague who already lives in London and has way more experience with recommendation systems. Fair enough.

Aiforia, ML Team lead – founded by folks from Yandex and Sber Devices, now in Cyprus

Passed the HM interview (mix of behavioral/technical). Then a Kaggle grandmaster grilled me on “ML fundamentals.” Felt like light trolling tbh, haven’t seen such tricky questions in ages. Some even started with “I don’t know the answer to this one myself” lol. Anyway, they were looking for experience with voice tech, which I don’t have at all. So no hard feelings, didn’t make the cut.

Replika

Saw their post in Vas3k club that they’re looking for frontend devs, but you can reach out anyway. I wrote to the CTO, we chatted. Not exactly a perfect match, they needed researchers with a strong engineering focus.

Lesson learned: no matter how much I want to emphasize Applied Science, I shouldn’t use phrases like “messing around with configs” 🙂

Nvidia, Senior Applied Scientist

Pain and humiliation, they absolutely smashed me. I didn’t just fail, but failed miserably.

Right off the bat, it was like “So, you think you’re hot stuff?” What do you use for DPO? How have you done distributed model parallel? No? Only DDP? Ah, haven’t touched 70b models? The interviewer was very polite, but that was the vibe.

Then it was okay. Transformers, NLP, all that jazz. Almost everyone asks about the transformer architecture. Dude was dropping references to papers like Retro left and right, clearly well-read. But I think I held my own in the conversation.

I crumbled at the very first engineering question. What’s the difference between storing variables on the stack and the heap? And how is that related to local/global variables? I didn’t just forget, I don’t think I ever even learned this. It took me two tries to even understand the question. The most I could mumble was something about the stack appearing during recursion. The peak of the interview was question #24.

And algorithms: the “8 queens” problem. Classic, 101, according to the interviewer. Didn’t have to write code, just describe the solution. But I started rambling about dynamic programming, then backtracking. At least I correctly estimated the factorial complexity, but I still couldn’t clearly describe the solution with DFS. Thought it was a simple problem, basic stuff, but it’s actually a hard one.
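For reference, the backtracking (DFS) solution the interviewer had in mind looks roughly like this: place queens row by row, pruning occupied columns and diagonals (my own sketch, not the code discussed in the interview):

```python
def solve_n_queens(n):
    """Count placements of n non-attacking queens via row-by-row backtracking."""
    solutions = 0

    def dfs(row, cols, diag1, diag2):
        nonlocal solutions
        if row == n:                      # all rows filled -> one valid placement
            solutions += 1
            return
        for col in range(n):
            # a queen at (row, col) occupies column col and the two diagonals
            # identified by row - col and row + col
            if col in cols or (row - col) in diag1 or (row + col) in diag2:
                continue
            dfs(row + 1, cols | {col}, diag1 | {row - col}, diag2 | {row + col})

    dfs(0, frozenset(), frozenset(), frozenset())
    return solutions

# the classic 8-queens board has 92 solutions
count = solve_n_queens(8)
```

Note the brute-force search space is indeed factorial (place one queen per row, n! column orderings), while the diagonal pruning makes the search fast in practice.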

NVIDIA is looking for rock stars, strong in both research and engineering. They can afford it, the job description for Senior Applied Scientist has a salary range for the US of 180-333k, and that’s just the base. And their stock is through the roof. So it’s all good.

Cohere, Member of Technical Staff

Really liked these guys, the interviews were super reasonable. Had a small coding test to optimize some Python code (got the same problem that a Kaggle grandmaster gave me before, so that interview wasn’t a waste after all, heh). Instead of LeetCode, they had ML coding - I had to implement sampling from a simplified LLM decoder (greedy, top-k, top-p). ML system design was literally about an LLM evaluation system that Cohere is currently working on. I bet they get a lot of ideas from candidates :) And a paper review of my choice, also about LLM evaluation. In the final round, I had a behavioral interview with the big boss, and there was no spark. Didn’t get real feedback, just some excuse like “lacked the level of adaptability and speedy execution.”
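For the curious, greedy, top-k, and top-p (nucleus) sampling over a toy next-token distribution fit in a few lines of pure Python (my own reconstruction of the exercise, not Cohere’s actual task):

```python
import random

def greedy(probs):
    """Pick the most likely token id."""
    return max(range(len(probs)), key=lambda i: probs[i])

def top_k_sample(probs, k, rng=random):
    """Sample among the k most likely tokens, renormalizing their mass."""
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in top)
    return rng.choices(top, weights=[probs[i] / total for i in top])[0]

def top_p_sample(probs, p, rng=random):
    """Nucleus sampling: keep the smallest prefix of most-likely tokens
    whose cumulative mass reaches p, then sample within it."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, mass = [], 0.0
    for i in order:
        nucleus.append(i)
        mass += probs[i]
        if mass >= p:
            break
    total = sum(probs[i] for i in nucleus)
    return rng.choices(nucleus, weights=[probs[i] / total for i in nucleus])[0]

probs = [0.5, 0.3, 0.1, 0.07, 0.03]  # toy distribution over 5 token ids
```

In a real decoder the probabilities come from a softmax over the logits of the last layer; the selection logic is the same.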

Snorkel AI, Staff Applied NLP Scientist

I got really emotionally attached to this option (don’t do that before you have an offer). It’s a startup founded by 5 PhDs from Stanford, growing rapidly, a unicorn. They have a pretty coherent vision: companies don’t need monsters with 1.8T parameters, they need specific models for their domain, fine-tuned on their data (YC agrees, “Request for startups”: small fine-tuned models as an alternative to giant generic ones). Snorkel is all about programmatic data labeling, which saves experts time on labeling, and also uses LLMs for soft labels. Plus distillation/quantization - you get small and powerful models, Snorkel’s blog is full of such stories (example).

The interviews here were also very reasonable: behavioral first, then ML coding, ML system design, and a presentation about a research project. In that one I didn’t show sufficient tech expertise but rather focused on leadership. I should have clarified with the recruiter which staff archetype they need.

LLM Startup, VP AI (offer)

No-name startup, pre-seed, but the position is VP, I could have joined as the 6th employee, with a whole 1% of the company. Tempting, but I’m a coward and not doing a startup.

The interview was quite unusual - they sent me a doc in advance with ~20 questions about everything under the sun:

  • “We’re an LLM startup, why won’t the next OpenAI update kill us?”
  • “Hallucinations are critical in our business, what are we going to do about them?”
  • “How will we integrate the research department into the company structure?”
  • And even this: “Everyone is chasing GPUs now; maybe we’ll be smarter and look at new chips?”

For all the questions, I was asked to share my vision. Perhaps the most interesting interview. Initially, the CTO wanted to spend half an hour on LeetCode, but during the conversation, he was like, “okay, let’s not waste time, the conversation is going well. High-bandwidth conversation.”

Google Cloud, GenAI Field Solutions Architect (offer)

I got into Google through a referral, finally they didn’t ignore me. Ironically, I was referred by a girl I helped leave Google to join Mistral.

Google has gradually converged to a format of four interviews (it used to be 15-20), plus a conversation with the manager. I had the following rounds:

  • Leetcode + system design
  • Role-related knowledge
  • Leadership & googleyness
  • General Cognitive Ability
  • “casual” conversation with the manager

In the first round, the leetcode seemed easy and the design seemed difficult. I studied the design thoroughly (speaking of interview success being 50% effort and 50% luck, I haven’t prepared this long for any other company). With big tech, you can ask for a couple of weeks to prepare, usually they are okay with this. And the mocks were very helpful, especially considering that I had never done a design interview before.

Role-related knowledge was about LLMs and consulting, there were a lot of questions about how to describe LLMs to clients, top managers, engineers. The technical questions didn’t seem difficult (the “Generative AI with LLMs” course and my own experience with LLMs were enough), but for the business acumen and consulting questions, practice with business cases, like they do in big3, wouldn’t hurt.

Leadership & googleyness is basically behavioral. Even though I mentor myself, I did 4 mocks, learning what exactly they want to hear in interviews for staff positions at Google. This was incredibly helpful. As a result, I pretty thoroughly reworked my story bank. Thankfully, there were no tricky, Amazon-style questions like “tell me how you used data to modify your strategy”. Instead, it was more or less clear from the question what leadership qualities were being discussed and which of my stories to tell.

General Cognitive Ability is open-ended questions like “a friend opened a chocolate store, advise him on a business plan.” There is a clear framework here, easy to practice. The YouTube channel of former Google HR specialist Jeff Sipe helped me a lot (there’s a whole playlist about negotiations as well). I also took a consultation with a small mock interview, where I was advised to speak more slowly.

And the “casual” conversation with the manager - not casual at all, should be treated as behavioral. You can chat about life later, once they hire you, in the interview they look at signals. I prepared my strongest stories treating this as a behavioral interview.

Overall, I estimate the contribution of behavioral to be about 80%. Yeah, I didn’t expect that could happen with Google. But this is a position in the Sales track, not SWE, I will have to communicate a lot with clients and top managers, so that’s the focus.

Conclusions

Now the most important part. A long job search is mostly about your mental health and psychological stamina. I’d highlight the following considerations:

  • it’s a marathon. Reserve 1 year for your job hunt. If done faster, great; otherwise, keep going
  • it’s a controlled lottery. Not just lottery (otherwise, you’ve got no lever to change anything) but a controlled one: you can improve your odds of winning by working hard but still treat this as a lottery
  • respect your mental health. Just do what helps you: sports/yoga/art/etc. Don’t do LeetCode at 3am if you get up at 7am
  • don’t get emotionally attached to any position before you get an offer. I did that a couple of times, and this only makes getting a rejection harder
  • don’t overthink it. Many people tend to think something is wrong with them, that they suck. No. The actual reason for rejection might be just location or another, better candidate. One big tech company ditched me after the 1st tech round, and it turned out they hired my ex-colleague who’s based in London and has more experience in recommender systems. And that’s fine, I wouldn’t have benefited from hours of self-reflection and wondering “why the heck did they ditch me?”. So make constructive conclusions from your interview experience and move on.

Good luck!

[EN/RU] To Ph.D. or not to Ph.D. (2024-04-12, https://yorko.github.io/2024/to-phd-or-not-to-phd)

In this post I collected my own subjective pros and cons of pursuing a Ph.D. degree. If you scroll down, you’ll find a richer version of this post, originally written in Russian, with all the jokes and puns.


That’s an eternal debate, and there are quite a few “Ph.D. survival guides” already published (e.g., the one by Andrej Karpathy or “10 Tips for Research and a PhD” by Sebastian Ruder; beware of the survivorship bias btw). Still, I wanted to share my view, even though I did a Ph.D. in Russia and the European/US/etc experience might be somewhat different.

As a prelude, a photo of one of the widest avenues in the world, in Buenos Aires, where I attended IJCAI 2015 and watched DeepMind present Atari RL for the first time. And of course, a genuine Argentinian steak is a once-in-a-lifetime gastronomic experience.

Pros

1. Career boost

Even though, as the saying goes, “Ph.D. is for people who love ideas more than money,” career-wise it can still be a great time investment. I personally enjoy R&D; for me it’s a sweet spot between industry and academia. Some of the most tempting positions I’ve seen require a Ph.D. as a minimum (though it can often be substituted with a master’s plus 5-7 years of industry experience). Another consideration for me is that a Ph.D. degree might help me when I’m 70+. Teaching calculus at a university is not a bad plan for me, a nerd.

“PhD is for people who love ideas more than money”.

2. Freedom of creation

After I finished graduate school but had not yet defended, I went to work as a DS for an IT giant. For the dissertation, I still had to prove one small theorem, and then I realized how blinkered my brain was and how much more difficult it was to just sit down and think calmly. But when you are in graduate school full-time, ideas come to your mind all the time despite the chaos.

3. More free time (kinda)

Source: https://phdcomics.com.

Don’t get me wrong, academia ain’t all rainbows and beers during lunch breaks. Some research institutes might operate like that, but my Ph.D. was a serious full-time endeavor. My advisor would challenge me to aim at top-tier ML conferences. Add grant proposals, conference deadlines, and teaching – and your Ph.D. turns into a hectic activity, just like an industry position. Despite all that, grad school felt like more free time compared to industry, especially for networking and side projects. It was during grad school that I started teaching ML/DS at universities and corporations, and eventually kicked off mlcourse.ai. You could argue I could’ve spent that time solely on research and aimed to be part of that elusive 0.1% of people who made a difference with their Ph.D. thesis, but my excuse is that I’m too dumb for that.

4. Constantly learning new stuff

One day you’re studying GANs and game theory, then you ace advanced statistics; in a week you realize that you need to refresh some concepts from graph theory, then maybe group theory. It’s just constant learning; you are always catching up with all these smart folks around you. This might be stressful, but in my case, that only motivated me.

5. Developing industry-relevant skills

With CS/DS/ML, we are lucky. As a Ph.D. candidate, I learned a lot about all these things relevant to IT roles in the industry: algorithms, databases, social network analysis, etc. Actually, I spent the whole 1st year taking 4 Coursera courses at a time to catch up and fill the gaps. Then, you’d see yesterday’s Ph.D. candidates landing jobs at Amazon/Meta/etc.

Cons

Certainly, there are many downsides to pursuing a Ph.D. degree, even apart from “spending these 4-6 years at Google instead”.

1. Feeling like a loser

Source: https://phdcomics.com.

Doing a Ph.D., you constantly feel that you’re stupid, your research sucks, and the thesis itself is worth nothing (the latter is true: only your papers matter). Even at grad seminars, where Ph.D. candidates informally shared what they were working on, I couldn’t shake off the feeling that we were all smart losers. Moreover, I felt that everyone else felt the same. It’s probably even worse for postdocs, with the never-ending need to beg for money. As one of my colleagues put it, “You are never happy doing your Ph.D.”

2. Bad working culture

Source: https://phdcomics.com.

The downside of a flexible schedule in your grad school is that you often work on the weekends and even holidays. But the biggest issue isn’t even time management. In academia, I saw many smart people, but not a single organized one that I could learn from. It’s not only about the infamous messy “academic” style of coding. It’s rather this “publish or perish” pressure, deadlines, multi-processing, overlapping projects, etc, that all eventually lead to sloppiness and bad working culture.

3. Bureaucracy

That’s probably more relevant for Russia and my old-fashioned Ph.D. defense process, so I won’t be elaborating. tl;dr: in my case, that was a damned shitload of bureaucracy.

4. Grants

That’s the sole reason I’ll never go back to academia (well, apart from me being too dumb to produce revolutionary ideas). All these grant applications and reports just drain you and make you feel miserable (especially grant reports, in the pre-GPT era). Searching for grants, writing applications and reports, bluffing about the results – all of this makes you (or at least made me) really unhappy. Rumor has it that one of the pitches luring professors and postdocs into industry goes: “Think about it! You won’t have to write grant applications. Ever. E-ver.”

5. The Gestalt monster

Source: https://phdcomics.com.

Your Ph.D. thesis is a huge gestalt. Anyone who’s been through this knows: the period from pre-defense to the long-awaited defense is incredibly stressful. Wherever you are, whatever you’re doing, part of your brain is occupied by the thought that you should be writing your thesis. This “tumor” eats away at part of your brain. And even though the defense brings a tremendous feeling of relief, I wouldn’t go down the same route again.

Conclusion

Despite all the downsides, I want to end on a positive note: Ph.D. is a cool experience. While I wouldn’t call it the happiest period of my life, I don’t regret it. It’s like an adventure and an extension of your childhood: when else would you go for it if not at 23? And a bit of ageism as a wrap-up: my advice is to go to grad school when you’re young; once you’re loaded with work, family duties, and other responsibilities, it’s too hard to break out of that local minimum.


A richer version of this post (originally written in Russian)

The corresponding Telegram post and Vastrik post.

Subjective pros and cons of grad school

I’ve long wanted to muse on whether chasing a Candidate of Sciences / Ph.D. degree is worth it at all. Many spears have been broken in arguments about why you’d even need it and whether you’d be better off spending all that time working. I won’t chase objective arguments here; I’ll describe my purely subjective experience and the pros and cons as I saw them.

I should note that I chose the doctoral school of HSE, where in 2013 you could earn almost as much as at junior/middle positions in industry, fly to conferences around Europe, and even once to Argentina. I also received my Candidate of Technical Sciences degree in the old format, with dissertation councils and secretaries, so a Ph.D. experience in Europe or the US may differ from mine.

Possibly the widest avenue in the world, in Buenos Aires. A pity they don’t let you travel far from the capital – everything is pinned to the conference dates. But a genuine Argentinian steak is quite an experience!

I don’t regret going to grad school at all, even though academia is for that 0.1% of people (roughly) capable of generating truly worthwhile ideas.

Pros

Pro 1. Career

In general, “Ph.D. is for people who love ideas more than money,” but I’ll still put career first. I spent a long time figuring out what I actually love doing in life, and at the moment it’s R&D (applied scientist-type positions). On the one hand, it’s boring to do cookie-cutter projects in pure industry; on the other, pure research is hard to pull off – you have to be smart and persistent. Applied research in a corporation, with no obligation to publish, is exactly it, the sweet spot: you check what the state of the art is, read the papers, try to apply them, and make sure there’s business impact too. And maybe partner up with someone so you don’t suffer alone and folks from academia help as well. Great stuff. And here I see the payoff of the degree: to people in academia it says “this guy is one of us,” while to management in corporations it says “this guy can take a fresh look at things.” No wonder some R&D positions list a Ph.D. as an entry requirement.

“PhD is for people who love ideas more than money”.

I also have a rather definite nerdy vision of my old age, far clearer than of my next 20 years. I’ll be a university teacher running calculus seminars: having had my fill of this corporate world of yours, I’ll return to the eternally beautiful. Yes, I’ll be an old-timer, and what’s wrong with that; I hope to work until the very end without going senile. I’ll live and breathe calculus, talk with students, instill a love of math in them, and enjoy it. And for that kind of retirement, a Ph.D. comes in very handy.

Pro 2. Freedom of creation

After I had finished grad school but not yet defended, I went to work as a research engineer at Mail.ru. For the dissertation I still had to prove one small theorem, and that’s when I realized how blinkered my brain had become and how much harder it was to just sit down and think calmly. But when you’re in grad school full-time, ideas come to you constantly, even though there’s plenty of hustle too.

Pro 3. More free time

Don’t think that academia is cushy and you can drink vodka during the lunch break. Somewhere in an old research institute that may be exactly what they do, but the HSE doctoral school (like a normal Western Ph.D.) is precisely a full-time occupation. My advisor challenged me to publish at top conferences, kept pushing me and wouldn’t let me stop at two papers in a VAK-list journal of a provincial technical institute. And academia certainly has plenty of its own fun with grant applications and conference deadlines, which sometimes turn your weekends into workdays (in my wife’s case, the black hole devouring all time and mental fuel was teaching).

Source: https://phdcomics.com.

With all that, it still seemed to me that grad school left more free time than industry, judging by my time at Mail.ru. More time for networking and side projects. It was during grad school that I started actively teaching ML/DS at a university and in corporations and eventually laid the foundations of http://mlcourse.ai (you could object that I could have skipped that and invested more time into research proper to get into that 0.1% of people who generate worthwhile ideas, but my excuse is that I’m too dumb for that).

Pro 4. The urge to constantly learn something new

Today you read about GANs and their connection to game theory, so you cram game theory; tomorrow you take on statistics properly; the day after – graphs; next week – group theory. A month later you start teaching algorithms to understand them better yourself. And so it goes, constantly. It feels like everyone around is smarter than you and you have to catch up. It surely drove some people into stress, but I found it exciting.

Pro 5. Industry-relevant skills

Every time I want to add that all of this is subjective. Some Ph.D. students burrow deep into narrow fields of science and aren’t so lucky with transferring their skills to industry, but with Data Science and Machine Learning we got much luckier. For almost the whole first year I crammed 3-4 Coursera courses at a time, closing the gaps left by a weak master’s program – algorithms and databases, discrete math, social network analysis – none of it wasted, all of it relevant if not to the abstract “industry,” then at least to interviews. Later I taught students Python and machine learning and, without waiting for the defense, got hired without much trouble as a research engineer at the aforementioned Mail.ru (though before grad school I rowed on such a galley that I’d rather not remember it; I didn’t gain many valuable hard skills there). Though one should always allow for a hidden variable – grit: whoever is driven in the good sense will both write the dissertation and grind LeetCode for the interviews.

Минусы

И конечно, минусов тоже море. Иначе про Ph.D. не было бы столько срачей и мемов. Опущу очевидные (”лучше за 4 года в Яндексе бы вырос!”) и опишу те, что сам испытал.

Минус 1. Ощущение лузерства

Источник: https://phdcomics.com.

Постоянное ощущение, что ты тупой и рисеч твой ничего не стоит, а диссер пишется в стол (а если защищаться по классической схеме, он реально пишется в стол, важны только крутые статьи, но не диссер). Даже в аспирантуре модной (на 2013 год) Вышки все равно не избежишь “атмосферы НИИ”, когда дойдет до защиты – чай “Корона российской империи”, деды в свитерах, гранты РФФИ. Какие-то никому не нужные проекты тащатся и все делают вид, что они реально что-то значат. Даже на аспирантских семинарах, где мы делились, кто над чем работает, меня не покидало ощущение, что собрались умные лузеры. Более того, мне казалось, что у других ощущение ровно то же было. Предположу, что на позициях postdoc и выше ощущение лузерства еще и усугубляется необходимостью “стоять с протянутой рукой”, то есть выбивать деньги на свои исследования. В-общем, мне запомнились слова бодренького француза, коллеги в Вышке: ”You are never happy doing your Ph.D”.

Con 2. Poor work culture

I've already mentioned that overall I had more free time in grad school, but that flexible schedule often implied deadlines on weekends and even holidays, and a working Saturday in academia is essentially an accepted norm.

Source: https://phdcomics.com.

But time management isn't even the main problem. In academia I saw many smart people, but not a single organized one – no one to learn from in that respect. And I don't even mean the proverbial commits straight to master and the code quality of academics. You just look at your colleagues and see that everyone is scrambling, projects overlap, and work organization leaves much to be desired. Naturally, this also breeds a fair amount of sloppy work. The "publish or perish" paradigm and the race for the h-index don't help at all.

Con 3. Bureaucracy

I defended in the old format, with dissertation councils and a sea of redundant paperwork – a pile of time was simply wasted on bureaucracy. And that word covers not only getting papers signed by the secretary who is "just a minute, oh sorry, having tea with a colleague", but also satisfying all the whims of reviewers and opponents, which can take months or even years. If you've ever felt the frustration of a rejected paper with toxic reviewer comments – with a dissertation, multiply that by 10. I'll never forget a story with one lecturer. I attended his class on the complexity of algorithms and was simply delighted both by the delivery (it was about matrix multiplication, the marvelous revelation of Strassen, who figured out how to speed up this basic operation, and modern algorithms) and by the lecturer himself; it even seemed we had some chemistry: I wanted to share my algorithms with him, discuss how to speed them up, and in general have such a lecturer as a senior colleague or even a mentor. And then a fellow PhD student of mine goes through her pre-defense, the officials from the state attestation committee show up and start nitpicking literally every word and every formulation. The most active among them – that very lecturer. And he behaved in a way I can't describe politely. Sure, you could argue the lecturer switched sides with the best intentions – to help the student with her dissertation – but you still think "who needs to live like this". And in the moment when you yourself have to pass all these attestation quests, the bureaucracy nearly breaks you; you have to summon all your willpower not to abandon the dissertation after the pre-defense.

To be fair, at HSE the defense procedure has since become much more similar to a Western Ph.D. defense, with less redundant bureaucracy. But I still got my share of the Soviet legacy.

Con 4. Grants

This was already mentioned here and there above, but it deserves to be a separate con. I'm sure I'll never return to academia, and not even because I'm too dim for research. Grant applications and grant reports… mmm… that alone is enough. Kilometers of loosely connected paragraphs (a pity there was no chatGPT back then), bragging, spinning things in the right light – all to secure 700k rubles a year for the whole team, taxed at 35% on top. In Europe, of course, there are genuinely large grants that can fund an entire lab for many years. But still, hunting for grants, writing applications and reports will make anyone miserable, even with large grants. I've heard that this is exactly how respectable professors get lured into industry at conferences: "Think about it! You'll never have to write grant applications again. Ne-ver."

Con 5. The gestalt of all gestalts

Source: https://phdcomics.com.

The sheer requirement to finally finish the dissertation is a colossal open gestalt. Anyone who has walked this path knows it: the period from pre-defenses to the long-awaited defense is very oppressive. Wherever you are, whatever you do, part of your brain is occupied by the thought that you must be writing the dissertation. The feeling that without a successful defense and the degree you've simply lost 3-4 years is very depressing. And even despite the amazing weight-off-your-shoulders feeling after a successful defense, I wouldn't want to repeat this experience.

Conclusion

Despite all the cons, I'd like to end on a positive note (my wife prompts from the side: I absolutely must mention that if not for grad school, we wouldn't have met). A Ph.D. is a great experience. Even though I wouldn't call it the happiest period of my life, I still don't regret it. A kind of adventure and an "extension of childhood" – but when, if not at 23? If I were 23 again, I'd make the same decision to go to grad school, only rather for a Ph.D. at some cool place like Singapore, working on a more relevant ML topic. And one should definitely go to grad school while young; later, once you accumulate jobs, families, and other obligations, that's it – you won't escape the local minimum.

]]>
Yury Kashnitsky[email protected]
Math in a real project: scaling laws for near-duplicate papers2023-10-20T00:00:00+00:002023-10-20T00:00:00+00:00https://yorko.github.io/2023/scaling-laws-near-dupsIn this post, I describe how graph theory popped up out of the blue in a real project.


In one of the latest posts I described near-duplicate detection with LSH and how to run it in Python with Datasketch.

When you apply it to scientific papers or submitted manuscripts, you can spot various types of fraudulent behavior:

  • simultaneous submissions – when the authors send 2 or more manuscripts to different journals at the same time;
  • duplicate publications – cases when there is a pair of almost identical published papers;
  • “salami slicing” – when a long paper is split into several smaller ones, each published independently;
  • “paper mills” – the research misconduct of selling fraudulent papers, some of which can be spotted with near-dup detection algorithms.

Example of a spotted potential retraction.

With my prototype, we first developed a near-duplicate detection solution at Elsevier and then collaborated with STM to roll it out for all publishers.

Internally, at Elsevier, we measured that around 4% of manuscripts have a near-duplicate submitted earlier. Prior to scaling the algorithm to all journals of all major publishers, a reasonable question we asked was: As we increase the set of papers under consideration, how does the percentage of papers having at least one near-duplicate change? Obviously, if we expect it to stay at 4%, that's one story; if we expect 20% of near-dups, that's a completely different story.

Mathematical problem formulation

Let a graph represent the relation “to be a near-duplicate”.

\(N\) is the number of nodes, \(E\) is the number of edges.

Interpretation:

  • node \(\leftrightarrow\) a title/abstract/full text of a manuscript/paper
  • edge \(\leftrightarrow\) 2 titles/abstracts/full texts are near-duplicates
  • connected node (with at least one edge) \(\leftrightarrow\) a title/abstract/full text has at least one near-duplicate
  • isolated node (without edges) \(\leftrightarrow\) a title/abstract/full text is “unique”

Let’s say Publisher 1 ran LSH and found \(\Large \alpha_1\) % of near-dups (connected nodes):

Now, Publisher 2 also ran LSH and found \(\Large \alpha_2\) % of near-dups (connected nodes):

Question: What’s the new \(\Large \alpha\) % of connected nodes in the combined graph?

Note:

  • we are unlikely to ever check this experimentally;
  • we are unlikely to estimate the share of edges between the 2 sets (near-dups across Publisher 1 and Publisher 2) as the parties do not share data.

Looking from another perspective:

  • What’s the % of isolated nodes (“unique” papers) in a graph?
  • How does this % grow in a growing graph?

Assumptions

We assume a random graph model:

  • each pair of nodes is connected by an edge with a fixed probability \(\large \mathbb{p} \ll 1\)
  • the fact that two nodes are connected is independent of other nodes/edges
  • \(\large \mathbb{p}\) is independent of the number of nodes

“All models are wrong but some are useful”
George Box, British statistician

Deriving the formula for # isolated nodes

Let’s consider a graph with \(n-1\) nodes (black), and separately, the \(n\)-th node (green) which is “added” to the graph.

Interpretation: a growing set of papers.

The \(n\)-th node is connected to each one of the other nodes with a fixed probability \(\large \mathbb{p} \ll 1\).

The probability that it’s not connected to any of them (i.e. isolated):

\[\large P_{iso} = {(1-\mathbb{p})}^{(n-1)}\]

With \(\large \mathbb{p} \ll 1\) and \(\large n \gg 1\) we have:

\[P_{iso} = {(1-\mathbb{p})}^{(n-1)} \approx {(1-\mathbb{p})}^n = \sum_{k=0}^n {n \choose k} (-\mathbb{p})^k\] \[= 1 - n\mathbb{p} + \frac{n(n-1)}{2}\mathbb{p}^2 - \frac{n(n-1)(n-2)}{6}\mathbb{p}^3 + \dots\] \[\approx 1 - n\mathbb{p} + \frac{n^2}{2}\mathbb{p}^2 - \frac{n^3}{6}\mathbb{p}^3 + \dots\] \[\Large \approx e^{-n\mathbb{p}}\]

We arrived at an exponential dependence of \(P_{iso}\) on \(n\).

\(P_{iso}\) is also the expected percentage of isolated nodes in the graph.

Interpretation: as the set of papers is growing, there’s a higher chance that any of the papers will see a near-duplicate, and thus the percentage of “unique” papers is dropping.
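The approximation above is easy to verify numerically: for small \(\mathbb{p}\) and large \(n\), the exact probability \((1-\mathbb{p})^{n-1}\) is nearly indistinguishable from \(e^{-n\mathbb{p}}\). A quick sketch (using the values of \(n\) and \(\mathbb{p}\) from the experiments later in the post):

```python
import math

n = 700_000  # number of papers (nodes)
p = 7e-8     # probability that a random pair is a near-duplicate

# Exact probability that a given node is isolated
exact = (1 - p) ** (n - 1)
# Exponential approximation derived above
approx = math.exp(-n * p)

print(exact, approx)  # both are ~0.952; the approximation error is negligible
```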

Experimental results

700k ScienceDirect abstracts

Percent of unique abstracts (blue) and the corresponding prediction (orange).

For 700k ScienceDirect abstracts, we see ~95% of unique ones, 5% have \(\geq1\) near-duplicate.

Estimated \(\large \mathbb{p} = 7 \cdot 10^{-8}\)

Same + projected to 10 mln. papers

Same as before but with the projection to a set of 10 mln. abstracts (orange)

Based on the estimated \(\large \mathbb{p} = 7 \cdot 10^{-8}\), the projection to 10 mln. papers gives \(\approx\) 50% near-dups (crazy!).
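The projection follows directly from the formula: the expected share of papers with at least one near-duplicate is \(1 - e^{-n\mathbb{p}}\). A quick check with the estimated \(\mathbb{p}\) (a sketch reusing the numbers from the text):

```python
import math

p = 7e-8  # estimated pairwise near-duplicate probability

for n in (700_000, 10_000_000):
    share_near_dups = 1 - math.exp(-n * p)
    print(f"n = {n:>10,}: {share_near_dups:.1%} have at least one near-duplicate")
# ~4.8% for 700k papers, ~50.3% for 10 mln. papers
```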

4.6 mln. manuscript titles from the Editorial Manager

Percent of unique manuscript titles (blue) and the corresponding prediction (orange). The red line shows the share of manuscript titles with at least one near-duplicate.

For 4.6 mln. manuscripts from Editorial Manager, we see ~62.8% of unique ones, 37.2% have \(\geq1\) near-duplicate.

Estimated \(\large \mathbb{p} = 10^{-7}\). Very close! Now the previous projection doesn’t look crazy anymore.

One more check of the model validity

According to the model, the number of edges is quadratic in the number of nodes:

\[E = \mathbb{p} \cdot {n \choose 2} = \mathbb{p} \cdot \frac{n(n-1)}{2}\]

Imagine a clique (fully-connected graph), each edge is then kept with probability \(\mathbb{p}\), hence this formula.

The scaling law of #edges (near-dups) vs. #nodes (titles/abstracts) is predicted well, though the coefficients are a bit off. The model can be adjusted for that.
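The quadratic law can be sanity-checked with a tiny Erdős–Rényi simulation (the values of `n` and `p` below are illustrative, not the ones estimated from the real data):

```python
import random

random.seed(0)
n, p = 2_000, 1e-3

# Sample each of the n*(n-1)/2 possible edges independently with probability p
num_edges = sum(
    1
    for i in range(n)
    for j in range(i + 1, n)
    if random.random() < p
)

expected = p * n * (n - 1) / 2  # = 1999.0 for these values
print(num_edges, expected)  # the simulated count is close to the expectation
```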

Limitations of the model/analysis

  • we ran LSH with titles & abstracts, not with full texts
  • LSH is probabilistic, its recall is <100%, i.e. it does not find all of the near-dups (the actual recall is ~90% for titles, intractable to assess for full texts)
  • model predictions are good qualitatively (the model explains the effects well) but a bit off quantitatively (a discrepancy for #edges vs. #nodes)

Conclusion

The mathematical model shows that the number of papers with at least one near-duplicate increases with the size of the collection. Hence, in a combined dataset of papers from multiple publishers, we’d expect to see a higher percentage of duplicated papers and, therefore, more cases of research misconduct.

]]>
Yury Kashnitsky[email protected]
[RU] Inburgering in the Netherlands2023-07-26T00:00:00+00:002023-07-26T00:00:00+00:00https://yorko.github.io/2023/inburgeringIn this post, I describe my experience in passing Dutch naturalization exams.


Since both this blog and my Telegram channel are called New Yorko Times, I'll be writing more about my "new times". I recently passed all the exams – a Dutchman minus five minutes (and one year) – so let me describe the process. In Dutch it's called inburgering.

A bit of context: the Netherlands has a tax-break program for highly skilled migrants (kennismigrant): 30% of income is tax-free during the first 5 years. This is called the 30% ruling, and there's a corresponding checkbox in income calculators, e.g. here: https://thetax.nl. For example, with a gross salary of 100k, you net 6k euros a month with the ruling and 4800 without it (the difference is 20% rather than 30%, since everything is complicated and the tax scale is progressive). When the ruling ends, you have to sort out your finances and renew your residence permit at the same time. There are at least 5 options here – from "leave everything as is" to changing citizenship. For permanent residency and citizenship you have to pass language exams.

There are 5 or 6 exams: 4 standard ones, as in IELTS/TOEFL (writing, listening, speaking, reading), and an exam on knowledge of Dutch society (+ history, politics, taxes, insurance, and all sorts of everyday matters). Those who don't work also have to pass an exam on orientation in the Dutch labor market: prepare a portfolio and pass an interview. The hardest part is that the interview is in Dutch (what a catch!).

I also got in just in time with level A2, which is fairly light and, accordingly, insufficient for a normal conversation with the Dutch. They've long been promising to raise the requirement to B1, which is twice as hard. While it's all fresh in my memory, let me tell the story. What follows is probably interesting only to those who have all this ahead of them.

In principle, the exams are not that unrealistic to pass cold, Phystech-style ("Japanese? Let's finish this smoke and go take the exam"). The passing score is 6 out of 10 for each exam. But of course, it's better to learn at least a bit of the language overall, so you can get by in simple situations. My wife and I signed up for courses and literally went to school for 4 months, sat at desks and raised our hands (if anyone needs a school recommendation in The Hague – write in the comments). That's the pre-training. After that, we fine-tuned purely for the exams; 5 sessions with a tutor should be enough.

Now let's briefly go over the exams.

Lezen (reading)

The easiest exam; for some questions you don't even need to know the language – it's enough to be able to do a Ctrl+F with your eyes. 25 questions in total, with over an hour of time, which is more than enough.

Here's a typical example of a question:

The recommended strategy: first read the question carefully, then the answer options, and only then look for the answer in the text. Reading the text from start to finish makes no sense.

For preparation, you can simply go through the exam samples at inburgeren.nl; more examples can be found in the Ad Appel book.

I got 10/10 here without any trouble.

Luisteren (listening)

Everything is similar, except you listen to a recording or watch a video and choose answers. Also 25 questions, 45 minutes, which is plenty. I got 10/10 here as well.

The Ad Appel YouTube channel threw me off a bit with listening – it's genuinely hard there, almost like real speech; the exam itself is easier.

Schrijven (writing)

Schrijfvaardigheid is a true classic: writing with a pen on paper, which may feel unusual after many years of hitting keys. You get 40 minutes for 4 exercises: write 2 letters, fill in one form, and write something like a note for a newspaper. This is already harder than reading or listening – here you at least need to know grammar.

A typical example of a task; source: inburgeren.nl

Writing is what my tutor and I practiced most diligently, and almost every time there were quite a few corrections. There are also many examples in the Ad Appel book.

I scribbled my way to a 9 out of 10 (the result arrived 4 weeks later).

Spreken (speaking)

Here you get 35 minutes for 24 tasks – 12 of them are open questions, and the other 12 are essentially listening comprehension (I don't understand why that's part of the speaking exam at all).

A typical open question: "On weekends I like to go to the seaside. How do you spend your weekends? And with whom?". You get only 15 seconds to answer, and it's important to answer both questions.

Speaking has to be practiced – winging it won't go well. The Ad Appel YouTube channel has quite a few examples. My tutor and I also focused on speaking.

For spreken I got 7 out of 10 (the result arrived 6 weeks later). There is no feedback, so I don't know why it's not the maximum. Well, the passing score is 6 out of 10, so all is fine. But keep in mind that speaking is the exam people often fail.

KNM (knowledge of Dutch society)

KNM is a test on Dutch customs, history, everyday life (healthcare, taxes, insurance), and so on. Here it's enough to read the book "Welkom in Nederland" and take the practice exams at https://inburgeren.nl. I also dug up some harder tests, since rumor had it that the real exam questions differ a lot from the practice ones. In my case that turned out to be unnecessary – the exam was almost exactly like the practice tests.

The book was a pleasant read; it's written at A2 level, so you can just pick it up and read. Of course, you do have to peek into a dictionary: topics like utilities, health, or insurance bring quite a few unfamiliar words.

40 questions in total, 45 minutes, so the pace is brisk. To my surprise I aced KNM, though people commonly do have problems with it.

ONA (orientation on the labor market)

I didn't have to take this one – the 6th exam is for those who don't work. My wife will be taking it, so I may update this post afterwards. For now, you can search for the keyword ONA in one of the many Russian-speaking chats in NL, for example, this one.

The best place to discuss this post is the New Yorko Times Telegram channel, namely here.

]]>
Yury Kashnitsky[email protected]
A short prompt engineering (chatGPT ‘cooking’) course by Andrew Ng and OpenAI2023-07-19T00:00:00+00:002023-07-19T00:00:00+00:00https://yorko.github.io/2023/prompt-engineering-course-ng-openaiIn this post, I review a short course by Andrew Ng and Isa Fulford on ChatGPT Prompt Engineering for Developers.


I found “ChatGPT Prompt Engineering for Developers” great and would like to give a short overview.

It’s our favorite Andrew Ng in collaboration with Isa Fulford from OpenAI.

Hi, Andrew! Long time no see! image credit

Pros

  • the course is (yet) free
  • the course is very short, just ~10 lectures, 5-10 min. each
  • very practical, it’s all about examples of using OpenAI APIs
  • the platform is great: video on the right, and interactive Jupyter running on the left; thus you can play around with code while watching the video

Some tips covered

  • tiny ones like putting the part of the text that you need to process between triple backticks
  • making chatGPT respond in a structured way, e.g. JSON so that you don’t have to parse the output with regexp (if you are solving a problem with a regexp, you have two problems)
  • all the way through typical downstream tasks (sentiment classification, translation, etc.) up to writing a small pizza order bot with chatGPT backend where basically the whole operation of the bot is described with one long prompt

What I missed

  • examples of few-shot learning, how to best provide examples right there in the prompt to improve downstream performance as compared to the zero-shot setup
  • how to debug such solutions. Debugging a pizza order bot that follows your long-written prompt with instructions sounds close to impossible

Despite the cons, the course is definitely worth 2-3 hours of your time and 0 euro/dollars. I recommend taking a couple of your own tasks (either from pet-projects or real business tasks) and playing with them as you progress through the course.

]]>
Yury Kashnitsky[email protected]
Near-duplicate Detection with Locality-Sensitive Hashing and Datasketch2023-06-27T00:00:00+00:002023-06-27T00:00:00+00:00https://yorko.github.io/2023/practical-near-dup-detectionIn this post, I review Locality-Sensitive Hashing for near-duplicate detection. I demonstrate the principle and provide a quick intro to Datasketch which is a convenient library to run near-duplicate detection at scale.


The problem being solved

Once I needed to find near-duplicates in a (relatively) large collection of texts, ~5 mln. docs. I wanted the solution to be:

  • easy-to-use
  • scalable
  • exact, i.e. when a pair of near-duplicate texts is flagged, we can be confident that those are indeed near-duplicates.

I somehow struggled for quite a while to find a solution that would satisfy all conditions. Until I found Locality-Sensitive Hashing (LSH) and its implementation – Datasketch.

MinHash LSH – the principle

1. When we need to deduplicate a single dataset

image credit

In a nutshell, this works as follows. For a most typical scenario where we need to identify near-duplicates in a single collection of texts, we perform the following steps:

  • the text is processed and cut into shingles (overlapping substrings of a fixed size);
  • then the set of shingles is minhashed: this involves computing multiple hashes for the shingle set, so that we end up with a single vector of integers for each piece of text, a.k.a. a signature;
  • the dimension of the hash vector is further reduced via Locality-Sensitive Hashing, which creates a single hash from a number (band) of nearby elements in the hash vector. The resulting vector is also called a bandhash signature or bandhash vector;
  • all pairs of signatures whose elements match in at least one position generate candidate pairs;
  • (optionally) we can measure the true similarity between the corresponding pieces of text to account for errors (False Positives) of the LSH algorithm.

I know there are quite a few terms here. Instead of explaining all of them (and thus re-writing something similar to this nice blog post) I’d refer to a classical book “Mining massive datasets”, ch. 3 for an intro into Locality-Sensitive Hashing and finding similar items. In this blog post, we’ll focus on a practical use case of finding near-dups in a large collection of texts.

2. When we have incoming “query” data that we want to compare to a large “index” dataset

Here “historical” data can be a large dataset, e.g. 5 mln. documents.

The “query” dataset is much smaller, e.g. 10K documents that we receive daily, say via some API, and would like to deduplicate.

We can do without LSH at all by just comparing 10K fresh documents to 5 mln. historical documents. But that'd require 50 bln. comparisons each day, which might be computationally prohibitive (a brute-force idea that, above all, leads to a considerable carbon footprint). LSH is a technique that approximates the exact similarity function.

The essence of the algorithm is to create signatures for each piece of text that is identified here by a DocID. Signatures are just numeric vectors of some fixed dimension, e.g. 128.

For two pieces of text to be considered as candidates for near-duplicates, it suffices for their hash signatures to match in at least one component. In the picture above, a pair highlighted in green is a candidate, and a pair highlighted in orange is another one. The matching hash values are shown in bold.
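The candidate test described above boils down to a positional comparison of the two signatures. A minimal sketch with made-up signature values:

```python
def is_candidate_pair(sig_a, sig_b):
    """Two docs are candidate near-duplicates if their hash
    signatures match in at least one position."""
    return any(a == b for a, b in zip(sig_a, sig_b))


# Toy 4-dimensional signatures (hypothetical numbers)
sig_1 = [17, 42, 93, 5]
sig_2 = [11, 42, 70, 8]  # matches sig_1 in one position -> candidate pair
sig_3 = [1, 2, 3, 4]     # no positional match with sig_1 -> not a candidate

print(is_candidate_pair(sig_1, sig_2))  # True
print(is_candidate_pair(sig_1, sig_3))  # False
```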

Limitations

  • The method only captures lexical similarity, not semantic similarity. Thus, with LSH, we won’t identify near-duplicates that differ due to paraphrasing, synonym replacement, etc.
  • The method is probabilistic, i.e. some errors are allowed. Not all candidates will actually be near-duplicates. One can check this by calculating the Jaccard similarity of the candidates. Thus, the algorithm is characterized by precision (out of all pairs of candidates found by the algorithm, the proportion of real near-duplicates, i.e. those with Jaccard similarity exceeding the predefined threshold) and recall (out of all near-duplicate pairs, the proportion found by the algorithm).
  • In practice, for a large enough dataset and long pieces of text (e.g. full documents, not just titles), LSH tends to work worse in terms of precision, while recall cannot be known without a crazy carbon footprint. Finding true near-duplicate pairs even in a relatively small collection of 50K texts requires >1.2B calls to a Jaccard similarity subroutine.
# imports
import json
import pickle
import re
from pathlib import Path

import numpy as np
import pandas as pd
from datasketch import MinHash, MinHashLSH
from matplotlib import pyplot as plt
from num2words import num2words
from tqdm import tqdm

Preprocessing and hashing

Essentially, MinHashLSH operates with shingle sets, where shingles are overlapping substrings of a fixed size. The following 4 code cells show how MinHashLSH builds hash vectors (a.k.a. signatures) for input texts.

Further, as described in the picture above, for two pieces of text to be considered as candidates for near-duplicates, it suffices for their hash signatures to match in at least one component.

s = "this is a piece of text"
shingle_size = 4

shingle_set = {s[i : i + shingle_size] 
               for i in range(len(s) - shingle_size + 1)}
shingle_set
{' a p',
 ' is ',
 ' of ',
 ' pie',
 ' tex',
 'a pi',
 'ce o',
 'e of',
 'ece ',
 'f te',
 'his ',
 'iece',
 'is a',
 'is i',
 'of t',
 'piec',
 's a ',
 's is',
 'text',
 'this'}
def hash_func(a_string, salt: int = 1):
    return hash(a_string + str(salt))

These are the 5 components of a toy 5-dimensional hash signature. Each one of them is created by hashing all shingles and taking a min. value of the hashes.

for i, salt in enumerate(range(5)):
    print(i, min([hash_func(el, salt=salt) for el in shingle_set]))
0 -7220920153181112185
1 -9127360350460247126
2 -8803612098918371157
3 -8027849914885749588
4 -9069105076530742277

Datasketch LSH – a toy example

from datasketch import MinHash, MinHashLSH
SIMILARITY_THRESHOLD = 0.6
NUM_PERMS = 96
SHINGLE_SIZE = 4

Three similar strings. We’ll index the first two, and then look for near-duplicates of the 3rd one.

s1 = "This is a piece of text"
s2 = "This is a similar piece of text"
s3 = "This is also a similar piece of text"

Inserting strings split by whitespace into MinHash objects.

minhash1 = MinHash(num_perm=NUM_PERMS)
minhash2 = MinHash(num_perm=NUM_PERMS)
minhash3 = MinHash(num_perm=NUM_PERMS)

for d in set(s1.split()):
    minhash1.update(d.encode("utf8"))
for d in set(s2.split()):
    minhash2.update(d.encode("utf8"))
for d in set(s3.split()):
    minhash3.update(d.encode("utf8"))

Create the LSH index and insert the first 2 MinHash objects into it.

lsh = MinHashLSH(threshold=SIMILARITY_THRESHOLD, num_perm=NUM_PERMS)
lsh.insert("text1", minhash1)
lsh.insert("text2", minhash2)

Querying near-duplicates for the 3rd piece of text.

lsh.query(minhash3)
['text2']

Same with Redis storage as a backend, not Python dictionaries

See MinHashLSH docs to configure the algo to run with the Redis backend. The idea is that to query LSH for near-duplicates, we only need to make lookups to get signatures. Redis is an in-memory database that allows for very fast lookups, also, it scales much better than Python dictionaries.

lsh_redis = MinHashLSH(
    threshold=SIMILARITY_THRESHOLD,
    num_perm=NUM_PERMS,
    storage_config={"type": "redis", 
                    "redis": {"host": "localhost", 
                              "port": 6379}},
)
lsh_redis.insert("text1", minhash1)
lsh_redis.insert("text2", minhash2)
lsh_redis.query(minhash3)
['text2']

Running LSH near-duplicate detection with a real-world dataset

Next, we run the algorithm with a realistic dataset – news about cryptocurrencies (a Kaggle dataset).

SIMILARITY_THRESHOLD = 0.8
NUM_PERMS = 128
SHINGLE_SIZE = 4
lsh = MinHashLSH(threshold=SIMILARITY_THRESHOLD, num_perm=NUM_PERMS)

Reading data

# You can download the dataset and customize this path
PATH_TO_DATA = Path("crypto_news")

The following two parts of the dataset imitate the historical part (index_df) and the query part (query_df). For each title in the query part, we’d like to find near-duplicate titles in the historical part.

index_df = pd.read_csv(PATH_TO_DATA / 
                       "crypto_news_parsed_2013-2017_train.csv")
query_df = pd.read_csv(PATH_TO_DATA / 
                       "crypto_news_parsed_2018_validation.csv")

We’ll identify each title by an id, hence the reindexing. Also, there are quite a few fields in the dataset; we’ll only use the title field.

index_df.index = [f'train_{i}' for i in range(len(index_df))]
query_df.index = [f'val_{i}' for i in range(len(query_df))]
index_df[['title']].head(2)
title
train_0 Bitcoin Price Update: Will China Lead us Down?
train_1 Key Bitcoin Price Levels for Week 51 (15 – 22 ...
query_df[['title']].head(2)
title
val_0 Paris Hilton’s Hotel Mogul Father to Sell $38 ...
val_1 Playboy Sues Cryptocurrency Company for Breach...
def preprocess(string, maxlen=500):
    tmp_string = string[:maxlen]
    tmp_string = re.sub(r"(\d+)", 
                        lambda x: num2words(int(x.group(0))), 
                        tmp_string)
    res = re.sub(r"[\W]+", "", tmp_string).lower()
    return res


def _shingle(string, shingle_size=4):
    shings = {
        string[i : i + shingle_size] 
        for i in range(len(string) - shingle_size + 1)
    }
    return set(shings)

LSH from Datasketch

lsh = MinHashLSH(threshold=SIMILARITY_THRESHOLD, num_perm=NUM_PERMS)

Populating the index

for id_, title in tqdm(index_df['title'].items()):  # .iteritems() was removed in pandas 2.0
    
    title_shingles = _shingle(preprocess(title), 
                              shingle_size=SHINGLE_SIZE)

    title_minhash = MinHash(num_perm=NUM_PERMS)

    for shing in title_shingles:
        title_minhash.update(shing.encode("utf8"))

    lsh.insert(id_, title_minhash, check_duplication=False)

Here’s how many titles we’ve indexed:

len(lsh.get_counts()[0])
27462

If needed, we can serialize the LSH object

with open("lsh.pkl", "wb") as f:
    pickle.dump(lsh, f)
!du -hc lsh.pkl
 35M	lsh.pkl
 35M	total

Get near-duplicates for the query data

dup_dict = {}

for id_, title in tqdm(query_df['title'].items()):  # .iteritems() was removed in pandas 2.0

    title_shingles = _shingle(preprocess(title), 
                              shingle_size=SHINGLE_SIZE)

    title_minhash = MinHash(num_perm=NUM_PERMS)

    for shing in title_shingles:
        title_minhash.update(shing.encode("utf8"))

    dups = lsh.query(title_minhash)
    dup_dict[id_] = dups

(Optional step) Analyze true Jaccard similarity

def jaccard_similarity(list1, list2):
    s1 = set(list1)
    s2 = set(list2)
    return len(s1.intersection(s2)) / len(s1.union(s2))

To assess precision, we calculate the actual Jaccard similarity for the candidates identified by LSH.

jaccard_sims = []

for id_, dups in tqdm(dup_dict.items()):
    if dups:
        shingle_query_title = _shingle(
                              preprocess(
                              query_df.loc[id_, "title"]))
        for dup_id in dups:
            shingle_indexed_title = _shingle(
                                     preprocess(
                                     index_df.loc[dup_id, "title"]))
            sim = jaccard_similarity(
            	shingle_query_title,
            	shingle_indexed_title)
            jaccard_sims.append(sim)
plt.hist(jaccard_sims, bins=20);

The distribution looks good: mostly, LSH indeed captures similar pairs.

Precision

(pd.Series(jaccard_sims) >= SIMILARITY_THRESHOLD).sum() / len(jaccard_sims)
0.8334

NOTE: That’s the precision of the LSH algorithm itself. In practice, it’s easy to reach 100% precision with the additional effort of calculating the actual Jaccard similarity for the candidate pairs (as done above) and filtering out false positives, i.e. the candidate pairs with similarity below the predefined threshold.

Recall

This is a very computationally intensive step (which we speed up with multiprocessing): we calculate all pairwise Jaccard similarities between 11k query titles and 27k indexed titles and see how many true near-duplicates the LSH algo missed.

shingled_query_text = [
    _shingle(preprocess(el)) for el in tqdm(query_df["title"])
]
shingled_index_texts = [
    _shingle(preprocess(el)) for el in tqdm(index_df["title"])
]

Building pairwise Jaccard similarity matrix with multiprocessing

from multiprocessing import Pool

class JaccardPool:
    def __init__(self, archive_shingles):
        self.archive_shingles = archive_shingles

    def __call__(self, val_shing):
        """
        :param val_shing: a shingle set to compare with each one in
                          `archive_shingles` and to calculate Jaccard similarity
        """
        return [
            jaccard_similarity(val_shing, arch_shing)
            for arch_shing in self.archive_shingles
        ]

try:
    pool = Pool(8)  # 8 worker processes
    engine = JaccardPool(archive_shingles=shingled_index_texts)
    sims = pool.map(engine, shingled_query_text)
finally:  # make sure worker processes are closed in the end, even if errors happen
    pool.close()
    pool.join()

Now we have a similarity matrix of size [11k, 27k] which we can use to compute recall, i.e. how many pairs with Jaccard similarity over the given threshold we managed to find.

sim_matrix = np.vstack(sims)
print((pd.Series(jaccard_sims) >= SIMILARITY_THRESHOLD).sum() / 
      (sim_matrix >= SIMILARITY_THRESHOLD).sum())
0.925

NOTE: We see that with short titles recall is pretty high. In reality though, for large datasets, recall is unknown (without a crazy carbon footprint from computing all pairwise text similarities).


Yury Kashnitsky
Is the 99% accuracy claim in detecting chatGPT-generated content really trustworthy?

2023-06-09 · https://yorko.github.io/2023/chatgpt-detectors

In this post I reason about the claimed accuracy of chatGPT detectors and why the task is far from being solved, in spite of those “99% accuracy” pitches that you hear.


Self-claimed metrics for chatGPT detectors

A colleague pointed me to yet another chatGPT detector with 99% accuracy (on top of GPTZero, DetectGPT, etc.). This Forbes article overviews many such detectors.

Let’s summarize the claimed self-metrics of the presented detectors:

  • TurnitIn: 98% accuracy
  • Copyleaks: 99% accuracy
  • Winston AI: 99% accuracy
  • AI Writing check: 80-90% accuracy
  • OpenAI classifier: 26% recall, 91% specificity; a small math exercise (assuming balanced classes) gives 58.5% accuracy
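The “small math exercise”: assuming a balanced test set, accuracy is the average of the per-class recalls:

```python
recall_ai = 0.26     # fraction of AI-generated texts caught (TPR)
specificity = 0.91   # fraction of human-written texts correctly passed (TNR)

# With a 50/50 class balance, accuracy = (TPR + TNR) / 2
accuracy = (recall_ai + specificity) / 2
print(round(accuracy, 3))  # 0.585
```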

Wait. OpenAI hires some of the best talent in the world; its people allegedly work 60-90 hours a week. Is OpenAI really lagging behind the University of Kansas, Winston AI, etc.?

Why is the task hard?

Of course, OpenAI is not lagging behind the University of Kansas and others. The task is actually much harder. While it’s easy to overfit a particular dataset and report 99% scores, it’s much harder to build a generalizable detector that would work for any domain, any language, and any generator (if we allow other models and do not focus solely on chatGPT).

This is a type of task that is hugely susceptible to data drift and model drift (if you’d like the detector to spot content produced by any LLM).

How do I get 99% accuracy and raise money?

Easy. Take a handful of papers, create their chatGPT versions (e.g. with paraphrasing), and train any BERT-type model. Bingo! ~90% scores even in a fair cross-validation setup.
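As a toy illustration (entirely synthetic data, not the actual Kansas setup, and a plain TF-IDF classifier instead of BERT), a model can ace cross-validation on a narrow dataset simply by latching onto superficial phrasing cues that won’t generalize:

```python
# Fully synthetic demo: "human" and "AI" texts share topics but differ in
# boilerplate phrasing, so cross-validation scores look near-perfect even
# though the classifier has learned nothing transferable.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

topics = ["protein folding", "galaxy surveys", "graph neural nets", "soil erosion"]
human = [f"we study {t} using field data and report mixed results" for t in topics] * 20
ai = [f"in this comprehensive study we delve into {t} and provide valuable insights"
      for t in topics] * 20

texts = human + ai
labels = [0] * len(human) + [1] * len(ai)

clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
scores = cross_val_score(clf, texts, labels, cv=5)
print(round(scores.mean(), 3))  # near-perfect "accuracy" on this narrow dataset
```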

That’s what the University of Kansas did, according to the mentioned Forbes article.

The team of researchers selected 64 perspectives (a type of article) and used them to make 128 articles using ChatGPT, which was then used to train the AI detector. The model had a 100% accuracy rate of identifying human-created articles from AI-generated ones, and a 92% accuracy rate for identifying specific paragraphs within the text.

How detectors failed in the COLING 2022 competition

The argument above is supported by my experience with setting up a COLING 2022 competition track on the detection of AI-generated scientific papers. Here is the full blog post but the gist is the following:

  • all the models trained by contestants easily overfitted to the competition dataset, hence 99% F1 scores seen on the leaderboard;
  • as one of the winners, Domenic Rosati, showed with his follow-up paper, the models trained with the provided competition data (DAGPap22 in the table below) generalize very poorly to a new similar dataset that Domenic had generated (SynSciPass in the table below): 31.4% in a binary classification task is worse than random (actually, if we flip the detector’s predictions, that’d yield 100% - 31.4% = 68.6%).

So where are we with chatGPT text detection?

This needs a whole new post but for now, I’d say: be critical, don’t buy these claims of “99% accuracy” and live with the fact that we’ll probably never know for sure if a piece of text is human-written or completely machine-generated. That’s the new reality.

chatGPT would almost pass my take-home assignment for the Machine Learning Engineer role

2023-01-30 · https://yorko.github.io/2023/chatgpt-mle-take-home

In this post I describe how chatGPT can cover around 90% of the steps needed to successfully crack a take-home assignment for the Machine Learning Engineer role.


I know everyone is fed up with chatGPT by this point in time. But I find this story especially peculiar and worth sharing.

For those living in a cave or sitting through 50 hours of meetings per week: chatGPT is a new chatbot by OpenAI, based on GPT-3, and it is unprecedentedly good.

I’ve just hired a Machine Learning engineer to join my team, and I used a take-home assignment in the selection process. I love giving take-home assignments for several reasons, one of them being that they are a good proxy for the candidate’s ability to write clear code and communicate findings. Some folks criticise such assignments because they make candidates spend their free time on job-related tasks. Fair enough, although I give back by providing thorough written feedback on every candidate’s work (including code). I also enjoy this particular take-home assignment, as it’s based on a real pet project of mine: sentiment analysis of crypto news, which I’ll describe in a separate post (Russian speakers can enjoy the description on Habr.com).

Having played a bit with chatGPT, I noticed that it does fairly well with machine learning course assignments. In particular, for one of the mlcourse.ai assignments, chatGPT produced a working implementation of a stochastic gradient descent regressor and then explained how to use the new Python class.

I also heard from one math professor that chatGPT cracks most of the exams in calculus that he gives to students. So he has to review his assignments and probably redo most of them.

Naturally, I decided to check how well chatGPT would do on the MLE take-home assignment.

So, I simply fed the whole assignment description into chatGPT, and got back a long and reasonable description of the steps to be taken. And look what happens when I go on asking the model to generate Python code for the above.

So, we get separate code snippets for model training and evaluation, for scraping the website with test data and getting model predictions, and for the Flask API prediction endpoint. All snippets are followed by reasonable descriptions in plain English. What was especially pleasing: the task description (in English) included two lines of Python code to read the data, and the model understood and reused those two lines.

You can see that the code won’t work as is – fitting Logistic Regression with raw texts is not correct, but we’ll get there.
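The fix (my own sketch with made-up toy data, not chatGPT’s actual output) is to vectorize the texts before the classifier, e.g. with a TF-IDF pipeline:

```python
# Raw strings can't be fed to LogisticRegression directly -- vectorize first.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# toy crypto-news sentiment data (entirely made up for illustration)
texts = ["bitcoin rallies on ETF approval news",
         "exchange hacked, user funds lost",
         "new all-time high for BTC",
         "regulator sues major crypto firm"]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)
preds = model.predict(["BTC hits a new record high"])
```

A pipeline object also keeps the vectorizer and classifier together, which makes serving the model from a single Flask endpoint straightforward.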

Creating a Docker image is a no-brainer!

Then I asked to structure the code a bit better:

Next: “What would be the content of the Readme file?” – and there it is, a clear Readme with all the instructions needed to run the API. It’s actually better than some 70% of the Readmes I saw while hiring an MLE.

chatGPT understands the context, keeps the generated code in “memory” and can debug it! The error with skipped text representation step was fixed by the model itself.

One more example:

In the end, I made it work, having fixed 2-3 annoying bugs: the lack of permissions to install packages in the Dockerfile, a missing app.run() call in the Flask app code, etc.

But if it were a live coding interview, chatGPT would’ve covered 90% of the way to a fully working ML application. An average candidate wouldn’t do better.

Of course, some would say that chatGPT is generally stupid, that it cannot solve even a quadratic equation, and that in the case of this take-home assignment it simply reproduced something seen on GitHub (as if that were not a miracle on its own!). But is it really so different from how a human would approach the assignment?

Here we are. I’ll finish the onboarding process for this MLE position and will get rid of this particular assignment. It’s actually a very good question what to replace it with.

Exciting times!
