The Data Blog

The Future of Software Engineering: How AI Is Reshaping Roles and Skills (2025)

igor — Tue, 29 Apr 2025 21:50:04 +0000

Artificial intelligence is no longer a futuristic concept discussed in conference keynotes; it’s rapidly becoming the default collaborator for software engineers worldwide. Whether you’re bootstrapping a startup’s backend, crafting pixel-perfect frontends, or managing complex legacy systems, understanding how to effectively partner with Large Language Models (LLMs) is shifting from a novelty to a necessity.

Just a year or two ago, many of us viewed tools like GitHub Copilot as autocomplete on steroids – helpful, but supplementary. Fast forward to 2025, and it’s increasingly common to initiate entire projects by brainstorming with models like Claude 3.7 Sonnet or GPT-4o, often receiving functional code scaffolds within minutes. This dramatic acceleration changes the very nature of engineering work. We’re spending less time writing boilerplate loops and more time defining system architecture, validating AI-generated outputs, and making critical product decisions.

But harnessing this speed requires more than just access to an API key. Without structured workflows and a critical mindset, it’s easy to waste hours wrestling with an LLM stuck in a loop of incorrect fixes. This post dives into the practical realities of AI-assisted software engineering based on firsthand experience: the workflows that boost productivity, the pitfalls to avoid, the skills that matter most now, and a glimpse into where this rapid evolution is heading.

The Shift: Why AI Collaboration Matters Now More Than Ever

The transition is palpable. Internal benchmarks and industry studies alike point towards significant productivity gains. GitHub’s research indicated developers using Copilot completed tasks 55% faster. An independent academic replication confirmed this, finding a 55.8% speedup. While these numbers resonate with our own experience, particularly on greenfield projects, the true impact goes beyond mere speed.

AI is forcing a re-evaluation of where engineers deliver the most value. If an AI can generate boilerplate code, unit tests, or even simple UI components reliably, the engineer’s role elevates towards:

Strategic Definition: Clearly articulating the what and why of a feature or system.
Architectural Design: Making high-level decisions about structure, dependencies, and scalability.
Rigorous Validation: Ensuring the AI’s output is not just functional but also correct, secure, efficient, and maintainable.
Complex Problem Solving: Tackling the non-standard, ambiguous challenges that require human ingenuity and domain expertise.

Mastering this new dynamic requires adapting our workflows.

Evolving Workflows: From Design Docs to Dual-Model Reviews

Integrating AI effectively means rethinking traditional development steps. Here’s how our process has evolved:

Traditional Step (circa 2023)	AI-Assisted Step (2025)	Rationale
Sketch requirements in a doc (often skipped)	Co-author a detailed Design Doc in Markdown with an LLM	Provides clear “contract” for AI, version controllable, aligns team.
Hand-code boilerplate/use basic templates	Use LLMs to generate scaffolding, tests, and basic logic	Frees up human engineers for higher-level tasks, faster iteration.
Manual code review by peers	Dual-Model AI Review followed by focused human oversight	Catches more issues, reduces bias, speeds up initial review cycles.
Team discussion/alignment in Slack/Meetings	Blend human chat with LLM-generated summaries & analysis	Efficiently surfaces key points, tracks decisions, aids asynchronous work.

Design Docs as AI Contracts

We now mandate a Markdown design document for every project, regardless of size. This document, co-authored with an LLM to ensure clarity and completeness, typically covers:

Goal: What problem does this solve?
Interfaces: APIs, CLIs, data formats.
Tech Stack: Languages, frameworks, key libraries, versions (e.g., Python 3.11, Rust 1.78).
Dependencies: External services, databases.
Data Models: Schema definitions.
User Stories/Acceptance Criteria: How do we know it’s done?
Milestones: Phased rollout plan.

This doc lives in the project repository. IDE extensions (like Cursor’s or custom setups) can automatically include it in the context for every prompt. This simple step drastically reduces the chances of the LLM “guessing” the wrong framework, inventing file structures, or ignoring project constraints. It acts as a clear contract for the AI collaborator.

A diagram illustrating the workflow: A Markdown design document (design.md) feeding into an AI model (like GPT-4o or Claude) which then generates code (main.py, tests.py). Arrows show the design doc providing context.

The Power of the Dual-Model Review Loop

Code review is fundamental to quality. We’ve augmented our human review process with a two-stage AI review, leveraging different models to catch a wider range of issues:

Generate: Use your primary LLM (e.g., Claude 3.5 Sonnet, often strong for frontend/TypeScript) to generate code based on the design doc and specific prompts.
Commit & PR: Follow standard Git practices – create a pull request with clean, scoped commits.
Extract Diff: Copy the changes (git diff main… | pbcopy or similar).
AI Review 1 (Generator Model): Sometimes, asking the generating model to review its own work can catch basic errors or prompt misunderstandings. “Review this diff for adherence to the design doc and best practices.”
AI Review 2 (Different Vendor): Paste the diff and key design doc sections into a different model (e.g., Gemini 2.5 Pro if you generated with Claude/GPT). Ask specific questions: “Review this diff for potential security vulnerabilities, missing edge case tests, performance bottlenecks, and consistency with Python 3.11 best practices, considering our design doc [link/paste relevant section].”
Human Oversight: A human engineer now reviews the PR and the AI review outputs. The focus shifts from spotting typos or basic logic errors (often caught by AI) to verifying architectural soundness, complex logic, and subtle requirement mismatches.

Using models from different vendors (e.g., Anthropic vs. Google vs. OpenAI) is crucial. They have different training data, architectures, and inherent biases. One model’s blind spot is often another’s strength. We’ve seen cases where one model confidently hallucinated a non-existent library function, only for the second model to immediately flag it as incorrect, saving significant debugging time.

A flowchart visualizing the Dual-Model Review

Navigating the Pitfalls: Common AI Time Sinks and How to Escape Them

While AI accelerates development, it introduces new ways to lose time if you’re not vigilant. Understanding these common failure modes is key to maintaining productivity:

API Hallucinations & Version Skew: LLMs trained on older data might confidently use deprecated functions, non-existent API endpoints, or incorrect library versions. Mitigation: Always provide version context (e.g., “using React 18”) and cross-reference generated code with official documentation, especially for newer or rapidly changing libraries. Use tools that feed current docs into the context.
Superficially Correct Tests: AI can generate tests that pass but don’t actually validate the intended logic or cover meaningful edge cases. Mitigation: Treat AI-generated tests as a starting point. Critically evaluate what they assert. Manually add tests for complex logic, failure conditions, and known edge cases. Demand tests that fail when logic is broken.
The Infinite Fix-Loop: The model gets stuck oscillating between two slightly flawed solutions (Variant A -> Variant B -> Variant A…) without converging. This often happens with complex or poorly defined problems. Mitigation: Use the “Summarize for an Expert” trick: Ask the stuck model: “Summarize the core problem we’re trying to solve, the approaches attempted so far, and why they failed. Explain it like you’re briefing a senior human expert.” Then, paste this structured summary into a fresh chat session or a different LLM. This resets the context and often breaks the loop by reframing the problem.
Over-Reliance & Skill Atrophy: Relying too heavily on AI for fundamental tasks without understanding the underlying principles can hinder learning and problem-solving skills. Mitigation: Consciously interleave AI generation with manual coding. Take time to read and understand AI-generated code. Debug issues manually sometimes to reinforce fundamentals. Treat AI as a powerful pair programmer, not a replacement for thinking.

An illustration depicting the “Infinite Fix Loop”: An arrow cycling between “Code Version A (Buggy)” and “Code Version B (Also Buggy)” with an AI icon in the middle looking confused. A separate box shows the “Break-out Trick”: An AI icon summarizing the problem, with an arrow pointing to a fresh chat window/different AI icon, leading to “Correct Solution C”.

The Skills That Matter Most in the AI-First Era

The rise of AI isn’t eliminating engineering jobs, but it is reshaping the skills required to excel. Technical proficiency remains crucial, but the emphasis is shifting towards skills that maximize collaboration with AI:

Precise Asynchronous Communication: The ability to write exceptionally clear prompts, design documents, bug reports, and code reviews is paramount. Your instructions need to be unambiguous for both human colleagues and AI models to act upon the correct intent. This includes iteratively refining prompts to get desired outputs.
Adaptability and Continuous Learning: The AI landscape and associated tooling are evolving at breakneck speed. Today’s best-in-class model or workflow might be superseded in six months. Engineers need a mindset geared towards quickly learning, evaluating, and integrating new tools and techniques without friction or dogma.
Critical Thinking & Engineering Rigor: This is perhaps the most important skill. It involves:
- Validation: Not blindly trusting AI output. Questioning assumptions, verifying logic, and testing thoroughly.
- Debugging: Knowing how to diagnose problems, whether they originate in human or AI-generated code. This might involve reading deeper than before – sometimes even looking at lower-level outputs if performance is critical.
- Systemic Understanding: Seeing the bigger picture beyond the generated function – how does this piece fit into the larger architecture? What are the security, scalability, and maintainability implications?
- Knowing When Not to Use AI: Recognizing tasks where human intuition, creativity, or deep domain knowledge are currently irreplaceable.

We’ve adjusted our hiring process accordingly. We don’t use “prompt-engineering” trivia. Instead, during live coding sessions, candidates can choose their preferred tools, including LLMs. The valuable signal isn’t whether they can get the AI to write code, but how they interact with it. Do they critically evaluate suggestions? Can they explain why an AI suggestion is wrong? Do they refine their prompts thoughtfully? This conversational approach reveals far more about their AI fluency and engineering rigor than any canned test.

A person working together with AI.

Our Current AI-Assisted Toolchain (Snapshot: Q2 2025)

The effectiveness of AI-assisted development hinges on the entire toolchain. Here’s a snapshot of what our team finds effective currently. Note: This stack evolves rapidly; experimentation is key.

Layer	Tool(s) Chosen	Key Reason
IDE	VS Code + Copilot / Cursor	VS Code offers robust core features & Copilot integration. Cursor excels at multi-file context awareness & inline chat/editing. We use both.
Generators	Claude 3.5 Sonnet (Frontend/TS) GPT-4o / GPT-4o-mini (Python/Rust)	Models selected based on internal benchmarks for specific language/task performance and pass rates. Sonnet often yields better structured UI code; GPT variants strong on backend logic.
Reviewer	Gemini 2.5 Pro	Acts as a crucial “second opinion” due to different training data, helping catch biases or errors missed by the primary generator model.
Context Injection	.cursor-rules, github/llm-rules.md, Custom Scripts	Automatically inject project-specific context (language versions, style guides, design doc snippets) into prompts, reducing repetitive setup.
Doc Search	IDE Plugins (e.g., Copilot Chat Web Search), Phind, Perplexity	Provide models with access to up-to-date documentation for rapidly evolving frameworks (e.g., latest React, Rust nightly features).

Beyond Anecdotes: Evidence of AI’s Impact

While personal experiences are illustrative, broader data confirms the trend. Beyond the GitHub Copilot studies, other research highlights the nuances:

Stack Overflow’s 2023 Developer Survey: Showed rapid adoption, with 44% of developers using AI tools in their workflow and another 26% planning to. This indicates a massive shift in under a year.
Code Quality Concerns: An early 2024 white paper cautioned that while AI speeds up development, relying solely on AI without rigorous review can lead to code with subtle bugs, security flaws, or maintainability issues. This reinforces the need for strong validation processes like the dual-model review.
Task Variability: Productivity gains aren’t uniform. Studies suggest AI provides the biggest boost for repetitive tasks, boilerplate generation, and unfamiliar domains, while complex debugging or novel algorithm design still heavily relies on human expertise.

The data paints a clear picture: AI significantly enhances productivity, but quality control and strategic application remain critical human responsibilities.

Peering Towards 2030: What AI Will Own (and What Remains Human)

Predicting the future is fraught, but current trajectories suggest a likely division of labor:

Likely AI Domain by 2030:
- Boilerplate & Scaffolding: Generating standard code structures, setup scripts, basic CRUD operations.
- Exhaustive Unit & Integration Testing: Writing comprehensive tests based on schemas and specifications.
- Code Translation & Refactoring: Migrating codebases between languages or frameworks based on defined rules.
- Simple UI Component Generation: Creating standard UI elements based on design systems.
- Documentation Generation: Drafting initial documentation from code and specifications.
Likely Human Domain by 2030:
- Problem Framing & Strategic Alignment: Defining why a piece of software should exist, who it serves, and how it fits into the larger business or product strategy.
- Complex System Architecture & Design: Making high-level decisions about interactions, trade-offs, and long-term maintainability for novel or intricate systems.
- Novel Algorithm Development: Creating fundamentally new approaches to solve problems.
- Ethical Considerations & User Trust: Ensuring software is fair, unbiased, secure, and respects user privacy.
- Ambiguous Problem Solving: Navigating requirements that are unclear, conflicting, or require deep domain expertise and creative thinking.
- Mentorship & Team Leadership: Guiding and developing other engineers, fostering collaboration, and setting technical vision.

The engineer’s role becomes less about typing code and more about orchestrating AI agents, defining problems, validating solutions, and making strategic decisions.

Hiring in the AI Era: Seeking Fluency, Not Just Prompters

Job titles likely won’t change drastically overnight – we’ll still hire “Software Engineers.” However, the expectations within that role are evolving. Our hiring process now emphasizes:

Live AI Collaboration: Allowing candidates to use LLMs during coding interviews.
Probing for Reflection: Asking questions like, “The AI suggested X, why did you choose to implement Y instead?” or “Walk me through how you validated that AI-generated test.”
Rewarding Curiosity & Adaptability: Valuing candidates who experiment with prompts, check alternative models, or question AI suggestions over those who rely purely on recall or blindly accept the first output.

We don’t anticipate a widespread, distinct role of “Prompt Engineer” within software teams. Instead, we expect a higher demand for “AI-Fluent” Software Engineers – those who seamlessly integrate AI tools into their workflow, understand their capabilities and limitations, and maintain strong core engineering principles.

Open Challenges and the Road Ahead

Despite the rapid progress, significant challenges remain:

Knowledge Cutoff & Real-time Information: Models are often trained on data that’s months old. New library releases, framework updates, or security vulnerabilities emerge faster than models are retrained. Workaround: Develop robust context injection pipelines (feeding current docs, using web search plugins) and maintain vigilance.
Vendor Lock-in & Tooling Fragmentation: Specific IDEs or tools might be tightly coupled to certain models, making it hard to switch. Proprietary prompt extensions can reduce portability. Workaround: Keep core prompts relatively generic and portable; prefer open standards where possible.
Governance, Security & Licensing: AI can inadvertently introduce code with problematic licenses, leak secrets if trained on sensitive data, or introduce subtle security flaws. Workaround: Human oversight remains critical. Implement automated code scanning tools (SAST, DAST, license checkers) but don’t rely on them exclusively. Clear data governance policies for using internal code with external models are essential.
Cost Management: Frequent calls to powerful AI models can become expensive, especially for large teams or extensive automated workflows. Workaround: Monitor API usage, use smaller/cheaper models for simpler tasks, implement caching strategies, and optimize prompting.
Evaluation & Benchmarking: Objectively measuring the quality of AI-generated code and the effectiveness of different models/prompts remains challenging. Workaround: Develop internal benchmarks relevant to your specific domain and tech stack; combine quantitative metrics with qualitative human review.

Practical Tips You Can Implement Tomorrow

Ready to improve your AI-assisted workflow? Try these actionable tips:

Keep Design Docs Concise: Aim for key details to fit within the model’s context window (often ~10k-100k+ tokens, check your model’s limits). Use summaries for very large docs.
Use Folder-Specific Rules: Configure your tools (like .cursor-rules or similar) to provide different context based on the directory (e.g., frontend/ gets React context, backend/ gets Rust/Python context).
Alias Common Commands: Create shell aliases for frequent tasks, like alias gdiff=’git diff main… | pbcopy’ to quickly copy diffs for review prompts.
Leverage the “Summarize for Expert” Trick: When stuck in a loop, ask the model to summarize the problem and failed attempts. Paste this into a fresh chat or different model.
Prefer Small, Focused PRs: AI agents (and human reviewers) handle smaller, well-defined changes much more effectively than large, monolithic ones. Break down tasks accordingly.
Version Your Prompts: Store effective prompts in a shared team resource or version control; treat them like code.
Share Learnings: Encourage team members to share successful prompting techniques, tool discoveries, and common pitfalls.

Where to Explore Next

This field is moving incredibly fast. Staying curious and experimenting is key.

The post The Future of Software Engineering: How AI Is Reshaping Roles and Skills (2025) appeared first on The Data Blog.

Time Series Annotation Tools

igor — Sat, 10 Dec 2022 19:29:17 +0000

Time series annotation requires handling of sequential information as present in signals recorded over time.

In contrast to NLP (text) or image annotation time series annotation requires the analysis of a signal over time. Think about listening to a small voice message and having the task of annotating it. Depending of the complexity of the singal the annotation process can take a multiple of the actual signal length. This means, that annotating a 10 second signal could take up to several minutes as the annotator needs to replay the signal several times at different speeds to properly annotate it.

In this post, I focus mostly in temporal or time series data other than audio as there are great tools dedicated for audio annotation.

Open Source Time Series Annotation Tools

There are several good open source tools for time-series annotation.

Baidu Curve (no longer maintained)

A very popular open source project was Curve from Baidu (the chinese Google). Unfortunately, the team decided to archive their GitHub repository due to lack of resources. You can still use the tool but the docs are not longer available and the tool is no longer maintained as of November 2022. Since the repo is read only external contributors can’t anymore contribute.

TagAnomaly by Microsoft

TagAnomaly is a anomaly detection tool written in R by Microsoft. It allows you to spot anomalies and export them into a csv file.

Another open source tool for time series annotation is TagAnomaly by software giant Microsoft. In contrast to other tools which typically use a web framework such as react, vue or angular, TagAnomaly is built around R. R is very popular among statisticians and has a UI framework called shiny that is used here. Since not everyone is familiar with R Microsoft also provides a Dockerfile to easily run the software in your browser.

Note that this tool is more used as an anomaly detection tool. You can find anomalies (manually or algorithmically) and then zoom in and have a closer look. It allows you to export anomalies to a csv file.

Wearables Development Kit (WDK)

This open source tool has been developed with activity recognition in mind for wearables. Think of the task of annotating whether a person is walking or running based on the accelerometer and gyroscope of your phone. Unfortunately, also WDK seems no longer to be maintained. But most functionality seems still working fine. A neat little feature is that you can playback videos in sync with the time series data. This of course only helps if you’re dealing with time series data where you have a video footage in sync.

The wearables development kit has been built to annotate time series data from wearable sensors for activity recognition.

LabelStudio by Heartex

A more recent time series annotation tool that is still actively maintained and gaining popularity is LabelStudio by Heartex. The tool also runs in your browser and besides times series supports all other major annotation types such as NLP (text) and images.

LabelStudio by Heartex supports time series data since 2020. You can annotate events and export them

If you’re more interested in outsourcing the labeling work rather than doing it yourself using a data annotation tool I recommend to head over to the curated List of Data Annotation Companies.

Unfortunately, there are not too many time series annotation tools that are still maintained. If you know one that you can recommend but is not yet in the list please don’t hesitate to reach out or leave a comment!

The post Time Series Annotation Tools appeared first on The Data Blog.

Data Annotation and What Data Annotation Companies do

igor — Mon, 07 Feb 2022 22:53:52 +0000

Data annotation is one of the core functions of machine learning. The more data an ML model is trained with, the more accurate it will become.

Just like humans learn through training and practice, machine learning models are also trained by feeding them with huge volumes of data.

One of the reasons Google is still the best search engine is because it has a lot of data compared to its competitors, including Yahoo and Bing (Microsoft’s search engine). With this data, Google is able to give users the best search results that match their search queries. Several other web apps also rely on data annotation to improve their algorithms in order to enhance their users’ experience.

An autonomous robot learns to navigate and understand its surrounding after learning from annotated data.

So, what is data annotation?

Data annotation refers to the process of categorizing and labeling information or data so that machine learning models can use it. The data used to train machine learning models has to be accurately labeled and categorized for specific use cases. For instance, the categorization and labeling of data to be used by a search engine ML model is different from a speech recognition ML model.

Data annotation involves assessing four primary types of data; text, audio, video, and image. This article will focus mainly on images and texts annotation since they are the most popular types of data used to train machine learning models.

Text annotation

A 2020 State of AI and Machine Learning report shows that over 70% of companies relied on text to train their AI and machine learning models. The common types of annotations used with text include; sentiment, intent, and query. Let’s discuss each of these in detail.

Sentiment Annotation
Sentiment annotation involves assessing emotions, attitudes, and opinions, making it crucial to have the proper training data for machine learning models. Sentiment annotation is done by humans because it involves moderating content and sentiments on platforms such as social media and eCommerce sites.

Query annotation
This type of text annotation involves training search algorithms by tagging the various components within product titles and search queries to improve the relevance of search results. Algorithms that use query annotation are usually found in search engines for eCommerce platforms.

Intent annotation
This type of text annotation involves training machine learning models to identify intention in a particular text. Intent annotations help ML models to differentiate various inputs into categories, including requests, commands, bookings, recommendations, and confirmations. This type of text annotation is mainly used to train search engine Machine Learning models.

Image annotation

Image annotation involves training machine learning models with several images to help them learn about the features in those images. Some of the applications that use such algorithms include; computer vision, robotic vision, and apps that have facial recognition functionalities.

For effective training of ML models with image annotation, metadata has to be attached to all the images used. This metadata usually includes identifiers, captions, and keywords. Some of the popular use cases that take advantage of image annotation include; health apps that auto-identify medical conditions, computer vision systems in self-driving cars, machines used for sorting things, and many more.

Image annotation is more intense and requires more computation power than text annotation. This is simply because images carry way more data than texts. Training ML models with images involves learning about all the pixels in the various images fed into the ML model.

Images annotation has five main types, and these include;

Bounding boxes annotation
With bounding boxes, human annotators are tasked to draw boxes around specific subjects within the image. This type of annotation is mainly used to train autonomous vehicle algorithms to detect objects such as road labels, traffic, potholes, etc.

3D cuboids annotation
This type of image annotation involves drawing 3D boxes around specific objects in an image. Unlike bounding boxes that only consider length and width, 3D cuboids include the height or depth of the object.

Polygons
At times some objects may not fit well in a bounding box or 3D cuboid because not all things are rectangular. Objects such as cars, humans, and buildings are usually not perfectly rectangular, so they can’t fit in a rectangle or cuboid. In this case, human annotators have to draw polygons around the non-rectangular objects before feeding this data to an ML model.

Lines and spines
These are used to train machine learning models to identify lanes and boundaries. So, annotators are required to draw lanes between certain boundaries that you would wish your ML model to learn.

Semantic segmentation
This is a much more precise and deeper type of annotation that involves associating every pixel in a given image with a tag. This annotation type is mainly used in machine learning models for autonomous vehicles and medical image diagnostics.

What do data annotation companies do?

One of the major challenges involved in training machine learning models is finding the right quality and quantity of data to feed them. Remember, the quality and amount of data you provide these models determine the overall outcome of the tasks these models will be finally be deployed to do.

To help fix these issues, data annotation companies avail the appropriate amount of data that can be used to train various types of AI and ML models. These companies use the human-assisted approach and machine-learning assistance to provide high-quality data to train AI and ML models.

Besides providing training data for AI and ML models, data annotation companies also offer deploying and maintaining services for AI and ML projects. These are follow-up services meant to ensure the provided data provides the desirable results wherever the ML algorithm trained using this data is deployed.

For instance, if it is a search algorithm deployed in an eCommerce site, the data annotation company has to ensure the algorithm provides the best search results for the various user queries.

Data Entry and Data Annotation Jobs

The requirements are very different depending on the task that has to be done. Some of them are also data entry jobs that are not used to train any AI system at all but just to feed a software system with the right data.

Data Annotation Specialist

As already described earlier the actual labeling or annotation task is performed by humans. There are different job titles such as data annotation specialists, data annotator or data labeler. All of them refer to the human annotating the data. Depending on the industry and the requirements these jobs can pay anything from minimum salary of a few dollars per hour to higher amounts for annotating medical images or other difficult data.

For some of these jobs such as data entry there is often also no prior work experience required.

Data Entry

Data Entry are among the most popular jobs in this domain. The tasks can vary between digitizing documents, adding new products to a catalogue to manually copying information between two software systems. There is also a high chance that you find a data entry job with no job experience. Depending on how sensitive the data is you might also work from home to do the job. If the task does not require an internet connection and the employer is fine you might also work completely offline. Some of these jobs can also be seen as a side hustle to make some extra money in the evening.

If you’re a US citizen you can find data entry jobs at some of the large companies. Most of them allow you to work from anything within the US such as Utah, NYC, Las Vegas, Houston and many more.

amazon data entry jobs

Check out our list of data annotation companies to learn more!

The post Data Annotation and What Data Annotation Companies do appeared first on The Data Blog.

Rotoscoping: Hollywood’s video data segmentation?

igor — Thu, 23 Apr 2020 10:05:00 +0000

In Hollywood, video data segmentation has been done for decades. Simple tricks such as color keying with green screens can reduce work significantly.

In late 2018 we worked on a video segmentation toolbox. One of the common problems in video editing is oversaturated or too bright sky when shooting a scene. Most skies in movies have been replaced by VFX specialists. The task is called “sky replacement”. We thought this is the perfect starting point for introducing automatic segmentation to mask the sky for further replacement. Based on the gathered experience I will explain similarities in VFX and data annotation.

You find a comparison of our solution we built compared to Deeplab v3+ which was at the time considered the best image segmentation model. Our method (left) produced better details around the buildings as well as reduced the flickering significantly between the frames.

Comparison of our sky segmentation model and Deeplab v3+

Video segmentation techniques of Hollywood

In this section, we will have a closer look at color keying with for example green screens and rotoscoping.

What is color keying?

I’m pretty sure you heard about color keying or green screens. Maybe you even used such tricks yourself when editing a video using a tool such as Adobe After Effects, Nuke, Final Cut, or any other software.

I did a lot of video editing myself in my childhood. Making videos for fun with friends and adding cool effects using tools such as after-effects. Watching tutorials from videocopilot.com and creativecow.com day and night. I remember playing with a friend and wooden sticks in the backyard of my family’s house just to replace them with lightsabers hours later.

In case you don’t know how a green screen works you find a video below giving you a better explanation than I could do with words.

Video explaining how a green screen works

Essentially, a greenscreen is using color keying. The color “green” from the footage gets masked. This mask can be used to blend-in another background. And the beauty is, we don’t need a fancy image segmentation model burning your GPU but a rather simple algorithm looking for neighboring pixels with the desired color to mask.

What is rotoscoping?

As you can imagine in many Hollywood movies special effects require more complex scenes than the ones where you can simply use a colored background to mask elements. Imagine a scene with animals that might be shy of strong color or a scene with lots of hair flowing in the wind. A simple color keying approach isn’t enough.

But also for this problem, Hollywood found a technique many years ago: Rotoscoping.
To give you a better idea of what rotoscoping is I embedded a video below. The video is a tutorial on how to do rotoscoping using after effects. Using a special toolbox you can draw splines and polygons around objects throughout a video. The toolbox allows for automatic interpolation between the frames saving you lots of time.

After effects tutorial on rotoscoping

This technology, introduced in After Effects, 2003 has been out there for almost two decades and has since then been used by many VFX specialists and freelancers.

Silhouette is in contrast to After Effects one tool focusing solely on rotoscoping. You get an idea of their latest product updates in this video.

I picked one example for you to show how detailed the result of rotoscoping can be. The three elements in the following video from MPC Academy blowing my mind are motion blur, fine-grained details for hairs, and the frame consistency. When we worked on a product for VFX editors we learned that in this industry the quality requirement is beyond what we have in image segmentation. There is simply neither a dataset nor a model in computer vision fulfilling the Hollywood standard.

Rotoscoping demo reel from MPC Academy.
Search for “roto showreel” on YouTube and you will find many more examples.

How is VFX rotoscoping different from semantic segmentation?

There are differences in both quality and how the quality assurance/ inspection works.

The tools and workflow in VFX, as well as data annotation, are surprisingly similar to each other.

Tools and workflow comparison

The tools and workflow in VFX, as well as data annotation, are surprisingly similar to each other. Since both serve a similar goal. Rotoscoping, as well as professional annotation tools, support tracking of objects, working with polygons and splines. Both allow for changing brightness and contrast to help you finding edges. One of the key differences is that in rotoscoping you work with transparency for motion blur or hair. In segmentation, we usually have a defined number of classes and no interpolation between them.

Quality inspection comparison

In data annotation quality inspection is usually automated using a simple trick. We let multiple people do the annotation and can compare their results. If all annotators agree the confidence is high and therefore the annotation is considered good. In case they only partially agree and the agreement is below a certain threshold an additional round of annotation or manual inspection takes place.
In VFX however, an annotation is usually done by a single person. The person has been trained on the task and has to deliver very high quality. The customer or supervisor lets the annotator redo the work if the quality is not good enough. There is no automatic obtained metric. All inspection is done manually using the trained eye of VFX experts. There is a term called “pixel fucking” illustrating the required perfectionism on a pixel level.

How we trained our model for sky segmentation

Let’s get back to our model. In the beginning, you saw a comparison between our result and Deeplab v3+, 2018. You will notice that the quality of our video data segmentation is higher and has less flickering. For high-quality segmentation, we had to create our own dataset. We used Full HD cameras mounted on tripods to record footage of the sky. This way a detailed segmentation around buildings and static objects can be reused throughout the whole shot. We used Nuke for creating the annotated data.

Image showing the soft contours using for rotoscoping.
We blurred the edges around the skyline.

Additionally, we used publicly available and license-free videos of trees, people, and other moving elements in front of simple backgrounds. To obtain ground truth information we simply used color keying. It worked like charm and we had pixel-accurate segmentation of 5 min shots within a few hours. For additional diversity within the samples, we used our video editing tool to crop out parts of the videos while moving the camera around. A 4k original video had a Full HD frame moving around with smooth motion. For some shots, we even broke out of the typical binary classification and used smooth edges, interpolated between full black and white, for our masks. Usually, segmentation is always binary, black or white. We had 255 colors in between when the scene was blurry.

Color keying allowed us to get ground truth data for complicated scenes such as leaves or hair. The following picture of a palm tree has been masked/ labeled using simple color keying.

For simple scenes, color keying was more than good enough to get detailed results. One could also replace now the background with a new one to augment the data.

This worked for all kinds of trees. And even helped us obtain good results for a whole video. We were able to simply adapt the color keying parameters during the clip.

Also, this frame has been taken using simple color keying methods

To give you an idea of the temporal results of our color keying experiments have a look at the gif below. Note there is a little bittering. We added this on purpose to “simulate” recording with a camera in your hand. The movement of the camera itself is a simple linear interpolation of the crop on the whole scene. So what you see below is just a crop of the full view.

This mask has been obtained using color keying for the first frame. The subsequent frames might only need a small modification of the color keying parameters. We did such adaptation every 30-50 frames and let the tool interpolate the parameters between them.

Training the model

To train the model we added an additional loss on the pixels close to the borders. This helped a lot to improve the fine-details. We played around with various parameters and changing the architecture. The simple U-Net model worked well enough. We trained the model not on the full images but on crops of around 512×512 pixels. We also read up on Kaggle competitions such as the caravan image masking challenge from 2017 for additional inspiration.

Adversarial training for temporal consistency

Now that we had our dataset we started training the segmentation model. For the model, we used a U-Net architecture, since the sky can span the whole image and we don’t need to consider various sizes as we would need to for objects.
In order to improve the temporal consistency of the model (e.g. removing the flickering) we co-trained a discriminator which always saw three sequential frames. The discriminator had to distinguish three frames coming from our model or the dataset. The training procedure was otherwise quite simple. The model trained for only a day on an Nvidia GTX 1080Ti.

So for your next video data segmentation project, you might want to have a look at whether you can use any of these tricks to collect data and save lots of time. In my other posts, you will find a list of data annotation tools. In case you don’t want to spend any time on manual annotation there is also a list of data annotation companies available.

I’d like to thank Momo and Heiki who worked on the project with me. An additional thank goes to all the VFX artists and studios for their feedback and fruitful discussions.

The post Rotoscoping: Hollywood’s video data segmentation? appeared first on The Data Blog.

Thank you for reading this blog

igor — Sat, 04 Apr 2020 10:24:42 +0000

Since I started this blog at the beginning of February over 200 interested people visited the website. I’ll use this post to express my thanks to everyone.

Since this is a data-driven blog and I like to be transparent let’s have a look from where the interested readers are coming from. Among the top 5 countries, we have the US, Switzerland (where I’m from), Germany, India, and France.

Visitor statistics based on the country

Now, let’s have a look at how they found out about this blog. A big chunk of visitors came from social networks or other news sites such as Hacker News, Reddit, Twitter. More than 10% of the visitors came from feedspot.com, another news aggregator. Special thanks to Anuj, who added this blog to the list of top 40 Machine Learning Blogs.

Finally, the probably most interesting question is which post did you like the most? By far the most visitors had a summary of data annotation companies, followed by the summary of tools and frameworks and the work I did during Hack Zurich 2016 with the random forest image classifier.

What’s next?

A few of you reached out to me and asked whether they could write guest posts. Of course, that’s something I would very much support and appreciate. Just reach out to me and we can have a look at your post.

I set my personal goal of posting at least once a month. This month I spent quite some time making a Medium post for my company about data redundancy. I had a hard time deciding on whether I should post it here or on Medium. We decided to go with Medium due it’s already existing user base and being able to reach more people through publications such as towards data science. The reach was very high with more than 1’000 visitors in the first two weeks.

The post Thank you for reading this blog appeared first on The Data Blog.

List of Data Annotation Companies

igor — Fri, 28 Feb 2020 21:49:35 +0000

An up to date and manually curated list of top data annotation companies from all over the world. Grouped by annotation type.

I’ve been looking for data labeling for computer vision data. There are a lot of good companies offering services. Most of them follow similar principles such as outsourcing to countries with a cheaper labor cost. From my experience, there are mostly differences in the data type they focus on (images vs audio vs text) as well as the way they work.

Data annotation companies add annotations such as bounding boxes to images such that AI algorithms can learn.

Unfortunately, the pricing isn’t always transparent. Usually, they charge per hour. For computer vision data this is usually around $1-$5 per hour per worker.

If you want to set up your own data annotation pipeline don’t forget to have a look at my blog about tools and frameworks.

List of Data Annotation Companies

Company	Vision	Text/ NLP	Audio	Offer API
AI Data Innovations			yes
Alegion	yes	yes		yes
Appen		yes	yes
Awakening Vector	yes
Basic AI	yes	yes	yes
CapeStart	yes	yes	yes
Cloud Factory	yes		yes
Cogito	yes
DataPure	yes
DAITA	yes
Dbrain		yes
DefinedCrowd	yes	yes	yes	yes
Diffgram	yes	yes		yes
DJAOO	yes	yes		yes
edgecase.ai	yes
Figure Eight	yes	yes	yes
hCaptcha	yes			yes
Humans in the Loop	yes
iMerit	yes
InfoSearch BPO	yes	yes	yes
Konverge.AI	yes	yes
Labelbox	yes			yes
Label Your Data	yes	yes
Manthano AI	yes
Mighty AI	yes
Playment	yes
Precise BPO Solution	yes
Scale AI	yes	yes	yes	yes
Segments AI	yes			yes
Supahands	yes
Hive AI	yes
Trainingset AI	yes			yes
Understand AI	yes

Do you know another company, not on this list of data annotation companies or additional information I should add? Please, don’t hesitate to add a comment or drop me a message.

The post List of Data Annotation Companies appeared first on The Data Blog.

Humans Powering the Machines

igor — Mon, 17 Feb 2020 15:56:17 +0000

There is a hype around AI built on top of recent success with deep learning. But there is one unsolved piece in the equation. AI needs to learn from humans.

Small robots learning from humans.
(Matan Segev, pexels.com)

When I heard about machine learning for the first time, I thought it would just simply work like this. Let’s say I want to classify pictures of dogs or cats. I would show the model a picture of a dog and a picture of a cat and it would learn to separate the two classes. Unfortunately, that’s not the case. To get high accuracy with deep learning models trained from scratch we need thousands of images for each class.

For some, this might not sound like a big deal. But if you assume it takes you 5 seconds per image to determine its a dog or cat and add it to a dataset, you can label 30 images per minute. Or 1,800 images per hour. For a dataset consisting of 10,000 images (5,000 per class), we end up spending around 5h30min. We could outsource this task to an annotation company. They would annotate each image multiple times by multiple annotators and have an additional quality control person looking over it. So 5h turn into 20h. With a cost of 1$/h, this is still moderate. But how about labeling 1 Million images? And what about image segmentation, where you can spend up to 1h per image?

And then came Transfer Learning

Probably the solution with the widest adoption to solve this absurd problem is called transfer learning. The idea is, instead of training a deep neural network from scratch we use a pre-trained model from another dataset. The low-level features from the pre-trained model generalize very well so only have to fine-tune the last layer. We essentially use a deep learning model as a feature extractor. The classifier on top can be anything ranging from an SVM to a linear layer.

The beauty of this approach is, that the number of samples needed to train a model drops significantly. In case you didn’t know, in object detection or segmentation we already use transfer learning. All those models from Deeplab, Mask-RCNN to the latest ones from the research are typically built on a backbone model pre-trained on a dataset like ImageNet. That’s why object detection works with just a few thousand samples.

Our human annotators now have to annotate less. But a lot of their work is still inefficient. Are all samples equally important? Can we automate human annotation? Can machines annotate data for other machines?

Active Learning

One of the questions I often get asked is whether we can teach machines to annotate data for other machines. The answer is no. Data annotation is all about providing ground-truth data for the machine learning model. The goal is to provide as accurate data as possible. If we allow one machine to annotate for another machine the latter one’s accuracy kind of bounded by the first one. Flaws learned from the first model would propagate or become worse.

The train-predict-label loop

So we can’t use machines to do the annotation but can they help humans to speed up the task? Yes, indeed they can. Nowadays under the umbrella of active learning, we understand a model that predicts newly unseen samples and gives confidence. Is the confidence low we assume the sample is hard and the human should annotate it.
Additionally, the prediction from the model can be used as an initial guess for the human annotator. This process is now used by almost every company working with large datasets and requiring data annotation. But also this method has its drawbacks. When I exchange with ML engineers they often complain that they used active learning from the beginning and after a while figured out it didn’t bring any value compared to randomly selected samples.
I think it’s really important to try out and properly verify the various methods available before investing heavily in the pipeline. Furthermore, just because a method works on one academic dataset it doesn’t mean it works on yours.

What’s next? Semi-Supervised Learning

Last year, I spent a lot of time reading up with the latest research on semi-supervised and self-supervised learning. Recent advancements in the field make them the most promising approach. The idea is that we train a model on the raw data using contrastive losses. For example, we take an image, perform various image augmentations such as random crop, color changes and then train a model to embed augmentations from the same image close to each other whereas different images are far apart. I highly recommend having a look at recent publications such as Contrastive Multiview Coding, Contrastive Predictive Coding or SimCLR.

Plot showing the ImageNet accuracy of recent self-supervision methods.
(From SimCLR paper, 2020)

I imagine in the future we can build a model that learns from the just collected and raw dataset using self-supervision. A few samples get selected which will be sent for annotation. The model gets fine-tuned using the annotations for a specific task. However, there will always be one step where a human needs to annotate a sample even if it might only be a fraction of the work we have today.

The post Humans Powering the Machines appeared first on The Data Blog.

Tools and Frameworks

igor — Fri, 14 Feb 2020 13:59:55 +0000

Find the best tool for you out of a list of fast tools and frameworks for data annotation or labeling for images, videos, text (NLP) or audio.

I had trouble getting a good overview of all the tools and frameworks around for data annotation so I created this list. I will try to keep it up to date. There are many tools and each one has it’s advantages and disadvantages.

	Computer Vision	NLP	Audio	Others
Open Source tools and frameworks	Images Video LiDAR 3D	Text	Audio	Time Series MultiDomain

Open Source Annotation and Labeling Tools and Frameworks

Data annotation can be a very tedious task. Luckily there are many free to use tools available in the web. Some of them are are open-source and allow for modifications.

Here you find a list of open-source projects grouped by datatypes!

Computer Vision

Images

Alturos.ImageAnnotation – A collaborative tool for labeling image data
Anno-Mage – A Semi-Automatic Image Annotation Tool which helps you in annotating images by suggesting you annotations for 80 object classes using a pre-trained model
CATMAID – Collaborative Annotation Toolkit for Massive Amounts of Image Data
CVAT – Powerful and efficient Computer Vision Annotation Tool
deeplabel – A cross-platform image annotation tool for machine learning
imagetagger – An open-source online platform for collaborative image labeling
imglab – A web-based tool to label images for objects that can be used to train dlib or other object detectors
Labelbox – Labelbox is the fastest way to annotate data to build and ship computer vision applications
labelImg – LabelImg is a graphical image annotation tool and label object bounding boxes in images
labelme – Image Polygonal Annotation with Python
LOST – Design your own smart Image Annotation process in a web-based environment
make-sense – makesense.ai is free to use online tool for labeling photos
MedTagger – A collaborative framework for annotating medical datasets using crowdsourcing.
OpenLabeler – OpenLabeler is an open-source desktop application for annotating objects for AI applications
OpenLabeling – Label images and video for Computer Vision applications
PixelAnnotationTool – Software that allows you to manually and quickly annotate images in directories
Pixie – Pixie is a GUI annotation tool which provides the bounding box, polygon, free drawing, and semantic segmentation object labeling
turktool – A modern React app for scalable bounding box annotation of images
VoTT – An open-source annotation and labeling tool for image and video assets
Yolo_mark – GUI for marking bounded boxes of objects in images for training neural network Yolo v3 and v2

Video

Find a video labeling tool that suits your needs. Some of these tools have also great Python APIs to interface with them.

Diffgram – Training Data Software for Teams Shipping Deep Learning AI Systems. Track objects through time.
UltimateLabeling – A multi-purpose Video Labeling GUI in Python with integrated SOTA detector and tracker
VATIC – VATIC is an online video annotation tool for computer vision research that crowdsources work to Amazon’s Mechanical Turk.

Lidar

Here we provide a curated list of lidar annotation tools.

semantic-segmentation-editor – Web labeling tool for the camera and LIDAR data

3D

KNOSSOS – KNOSSOS is a software tool for the visualization and annotation of 3D image data and was developed for the rapid reconstruction of neural morphology and connectivity.

NLP

Text

ML-Annotate – Label text data for machine learning purposes. ML-Annotate supports binary, multi-label and multi-class labeling.
SMART – Smarter Manual Annotation for Resource-constrained collection of Training data
TagEditor – Annotation tool for spaCy
YEDDA – A Lightweight Collaborative Text Span Annotation Tool (Chunking, NER, etc.). ACL best demo nomination.

Audio

It’s a good idea to use a special audio annotation tool. They allow for easy playback and help you mark timestamps efficiently.

audio-annotator – A JavaScript interface for annotating and labeling audio files.
audio-labeler – An in-browser app for labeling audio clips at random, using Docker and Flask.
EchoML – Play, visualize and annotate your audio files
peak.js – Browser-based audio waveform visualization and UI component for interacting with audio waveforms, developed by BBC UK.
wavesurfer.js – Simple annotations tool, check the example.

Others

Time Series

Here we list a few popular time series annotation tools. These can be use for annotating anomalies or interesting sections in a data stream.

Curve – Curve is an open-source tool to help label anomalies on time-series data
TagAnomaly – Anomaly detection analysis and labeling tool, specifically for multiple time series (one time series per category)
time-series-annotator – The CrowdCurio Time Series Annotation Library implements classification tasks for time series.
WDK – The Wearables Development Toolkit (WDK) is a set of tools to facilitate the development of activity recognition applications with wearable devices.

MultiDomain

Dataturks – Dataturks support E2E tagging of data items like video, images (classification, segmentation, and labeling) and text (full-length document annotations for PDF, Doc, Text, etc) for ML projects.
Label Studio – Label Studio is a configurable data annotation tool that works with different data types

If your looking for data-labeling service providers check out my other blog.

Do you know a tool or framework and would like me to add it to the list? Just comment below or drop me an mail at isusmelj at gmail.com!

Source:
github.com/heartexlabs/awesome-data-labeling
github.com/taivop/awesome-data-annotation
github.com/jsbroks/awesome-dataset-tools

The post Tools and Frameworks appeared first on The Data Blog.

A random forest image classifier in a day

igor — Mon, 10 Feb 2020 14:24:00 +0000

Learn about how we did collect data and trained a random forest image classifier within a single day for Hack Zurich 2016.

One of my first projects using my newly gathered know-how of machine learning was during HackZurich 2016. We built a sign digitizer which turned handwritten signs into word-like masterpieces. The final result can be seen below. The text detection used a cloud API, the sign recognition, however, used a custom model built using a random forest image classifier.

Demo of our Hack Zurich 2016 project

Data Collection

First, we needed a dataset of signs. A quick Google search made us realize that there wasn’t any back then. So we had to decide on either skipping the functionality of sign recognition or creating our dataset. The team votes went for the custom dataset and we all spent the next 15 minutes drawing shapes on pieces of paper. We scanned our art pieces with the Samsung printer. There were roughly 10 pages like the one shown below.

Data Preparation

Then we used a simple algorithm for finding connected components using https://opencv.org/ that fit into rectangles to crop the shapes into single instances of jpg images. This worked quite well, no further tuning was needed. To group, label or annotate the various shapes we sorted all instances multiple times based on the percentage of black pixels and compressed file size. For each label, we put the images into newly created folders. A final train/ test split and the dataset was ready.

Our custom dataset we used to train the sign classification model

The Random Forest Image Classifier

To classify the images we used https://scikit-learn.org/, back at that time my most favorite ML library. The images were resized to about 16×16 pixels before we fed them into our random forest image classifier. No further feature extraction has been performed. The simplicity of the shapes and well-aligned crops were good enough to yield satisfaction among the team.

Coupled with a flask based rest API the masterpiece was complete. A sign classifier using random forest, trained within a single day during Hack Zurich 2016.

The post A random forest image classifier in a day appeared first on The Data Blog.

Welcome to my blog about ML and data

igor — Sun, 09 Feb 2020 11:43:11 +0000

My background and motivation for starting a blog about my personal experience and projects around data annotation and machine learning.

Around 2014 my interests in machine learning started to grow. Just the idea of teaching a machine to do certain things instead of hard coding it fascinated me. To feed my hunger for more information I took the ever-growing (the number of students was almost doubling every second year) lecture “machine learning” at ETH Zurich held by Prof. Buhmann. We started with statistics, regression, moved on to support vector machines, bagging, boosting, random forest and finally ended up talking about neural networks and their implications on the whole field.

During my professional work in the innovation lab at SIX Group as well as my spare time I continued to foster my skills and learn more about the mysteries of deep learning. You can read more about the various milestones in my other blog posts.

I’m currently working at Mirage, an ETH spin-off with the goal of making machine learning more accessible and reduce the headaches we face with data annotation. In this blog I will talk a lot about different products we built using deep learning. The various projects cover many aspects from detecting deep fakes to using image segmentation for Hollywood VFX studios.

After multiple less successful attempts in finding product market fit in ML products we succeeded with a filter to help companies find the most relevant data we call WhatToLabel.

That’s me, Igor

You find more information about my coding projects on my GitHub. If you’re in Zurich and want to exchange about startups or machine learning don’t hesitate to contact me!

The post Welcome to my blog about ML and data appeared first on The Data Blog.