<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://goddoe.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://goddoe.github.io/" rel="alternate" type="text/html" /><updated>2026-02-16T11:07:31+00:00</updated><id>https://goddoe.github.io/feed.xml</id><title type="html">Sungju Kim</title><subtitle>AI Research Engineer. Excited with AI, Code LLM and NLP.
</subtitle><entry><title type="html">Agent As a Function</title><link href="https://goddoe.github.io/research/engineering/llm/agent/2025/01/13/Agent-As-a-Function.html" rel="alternate" type="text/html" title="Agent As a Function" /><published>2025-01-13T10:00:00+00:00</published><updated>2025-01-13T10:00:00+00:00</updated><id>https://goddoe.github.io/research/engineering/llm/agent/2025/01/13/Agent-As-a-Function</id><content type="html" xml:base="https://goddoe.github.io/research/engineering/llm/agent/2025/01/13/Agent-As-a-Function.html"><![CDATA[<style>
.lang-switch {
  font-size: 0.85em;
  color: #999;
  margin-bottom: 20px;
}
.lang-switch a {
  color: #999;
  text-decoration: none;
  cursor: pointer;
}
.lang-switch a:hover {
  color: #666;
}
.lang-switch a.active {
  color: #333;
  font-weight: 600;
}
.lang-en, .lang-ko { display: none; }
.lang-en.active, .lang-ko.active { display: block; }
</style>

<div class="lang-switch">
  <a id="btn-en" onclick="setLang('en')">EN</a> | <a id="btn-ko" onclick="setLang('ko')">KO</a>
</div>

<div class="lang-en">

  <p>Agents can work like functions too.</p>

  <p><strong>TL;DR</strong></p>

  <ol>
    <li>Traditional functions follow a fixed trajectory, executing mechanically along predetermined paths.</li>
    <li>An Autonomous Function performs intelligent computation within itself, following a non-fixed trajectory until the goal is achieved.</li>
    <li>Define the goal via system prompt, provide validation rules and tools, and let the agent iterate until completion.</li>
  </ol>

  <blockquote>
    <p>Strictly speaking, every input-output mapping is a function. But the “function” here refers to the functions we write in everyday code—the function or method in programming languages. Specifically, I’m referring to classical deterministic functions.</p>
  </blockquote>

  <p>Traditional functions execute along fixed trajectories. Given the same input, they follow predetermined paths mechanically and produce the same output. This is deterministic and predictable.</p>

  <p>But what if a function could think? What if it could adapt its approach based on intermediate results, try different strategies when one fails, and decide for itself when the task is truly complete?</p>

  <p>I call this “Agent as a Function” or “Autonomous Function.” Unlike traditional functions that follow fixed trajectories, an Autonomous Function performs intelligent computation within itself. It takes a goal, reasons about how to achieve it, validates its own work, and iterates along non-fixed trajectories until the goal is reached.</p>

  <p>Let me show this with a concrete example. Imagine we need a function that downloads a dataset from HuggingFace and normalizes it to OpenAI message format.</p>

  <p><strong>Approach 1: Single LLM Call</strong></p>

  <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">normalize_dataset</span><span class="p">(</span><span class="n">dataset_name</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">list</span><span class="p">:</span>
    <span class="k">return</span> <span class="n">llm</span><span class="p">.</span><span class="n">complete</span><span class="p">(</span><span class="sa">f</span><span class="s">"Convert </span><span class="si">{</span><span class="n">dataset_name</span><span class="si">}</span><span class="s"> to OpenAI format"</span><span class="p">)</span>
</code></pre></div>  </div>

  <p>A single LLM call cannot solve this problem: in one shot, the model cannot download the data, execute conversion code, or check its own output.</p>

  <p><strong>Approach 2: LLM Workflow</strong></p>

  <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">normalize_dataset</span><span class="p">(</span><span class="n">dataset_name</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">list</span><span class="p">:</span>
    <span class="n">dataset_url</span> <span class="o">=</span> <span class="n">web_search</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="n">dataset_name</span><span class="si">}</span><span class="s"> huggingface download url"</span><span class="p">)</span>
    <span class="n">raw_data</span> <span class="o">=</span> <span class="n">download</span><span class="p">(</span><span class="n">dataset_url</span><span class="p">)</span>

    <span class="k">for</span> <span class="n">attempt</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">3</span><span class="p">):</span>
        <span class="n">code</span> <span class="o">=</span> <span class="n">llm</span><span class="p">.</span><span class="n">complete</span><span class="p">(</span><span class="sa">f</span><span class="s">"Write code to convert to OpenAI format: </span><span class="si">{</span><span class="n">raw_data</span><span class="p">[</span><span class="si">:</span><span class="mi">1000</span><span class="p">]</span><span class="si">}</span><span class="s">..."</span><span class="p">)</span>
        <span class="c1"># ... validation and execution logic ...
</span>
    <span class="k">raise</span> <span class="nb">RuntimeError</span><span class="p">(</span><span class="s">"Failed after 3 attempts"</span><span class="p">)</span>
</code></pre></div>  </div>

  <p>You can add retries, but everything is hardcoded: the number of attempts, what to do on failure, when to give up. The LLM has no say in any of this; it just generates code when asked. Every decision is made by the developer at write time, not by the model at runtime.</p>

  <p><strong>Approach 3: Agent as a Function</strong></p>

  <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">system_prompt</span> <span class="o">=</span> <span class="s">"""You are a data normalization agent.
Your goal is to download a HuggingFace dataset and convert it to OpenAI message format.

## Validation Rules
- Output must be valid JSON
- Each message must have 'role' and 'content' fields
- 'role' must be one of: 'system', 'user', 'assistant'
- All conversations must be properly structured

## Termination
- Call task_complete(result) when validation passes
- Call task_give_up(reason) if you've tried multiple approaches and none work
- Call task_impossible(reason) if the task is fundamentally impossible
"""</span>

<span class="c1"># BASE_TOOLS: capabilities to do the work
</span><span class="n">BASE_TOOLS</span> <span class="o">=</span> <span class="p">[</span>
    <span class="n">web_search</span><span class="p">,</span>       <span class="c1"># search for dataset documentation
</span>    <span class="n">read_file</span><span class="p">,</span>        <span class="c1"># read downloaded data
</span>    <span class="n">write_file</span><span class="p">,</span>       <span class="c1"># write conversion code and output
</span>    <span class="n">run_python</span><span class="p">,</span>       <span class="c1"># execute conversion code
</span><span class="p">]</span>

<span class="c1"># VALIDATION_TOOLS: verify the output
</span><span class="n">VALIDATION_TOOLS</span> <span class="o">=</span> <span class="p">[</span>
    <span class="n">validate_json_schema</span><span class="p">,</span>  <span class="c1"># check OpenAI message format
</span>    <span class="n">run_tests</span><span class="p">,</span>             <span class="c1"># run format validation tests
</span><span class="p">]</span>

<span class="c1"># TERMINAL_TOOLS: explicit task completion
</span><span class="n">TERMINAL_TOOLS</span> <span class="o">=</span> <span class="p">[</span>
    <span class="n">task_complete</span><span class="p">,</span>    <span class="c1"># success with result
</span>    <span class="n">task_give_up</span><span class="p">,</span>     <span class="c1"># tried but failed
</span>    <span class="n">task_impossible</span><span class="p">,</span>  <span class="c1"># fundamentally can't be done
</span><span class="p">]</span>

<span class="n">normalize_hf_to_openai</span> <span class="o">=</span> <span class="n">create_agent_function</span><span class="p">(</span>
    <span class="n">name</span><span class="o">=</span><span class="s">"normalize_hf_to_openai"</span><span class="p">,</span>
    <span class="n">system_prompt</span><span class="o">=</span><span class="n">system_prompt</span><span class="p">,</span>
    <span class="n">tools</span><span class="o">=</span><span class="n">BASE_TOOLS</span> <span class="o">+</span> <span class="n">VALIDATION_TOOLS</span> <span class="o">+</span> <span class="n">TERMINAL_TOOLS</span><span class="p">,</span>
    <span class="n">max_iterations</span><span class="o">=</span><span class="mi">15</span>
<span class="p">)</span>

<span class="c1"># Call it like any other function
</span><span class="n">result</span> <span class="o">=</span> <span class="n">normalize_hf_to_openai</span><span class="p">(</span><span class="n">dataset</span><span class="o">=</span><span class="s">"squad_v2"</span><span class="p">)</span>
</code></pre></div>  </div>

  <p>The agent searches for the dataset schema, writes conversion code, executes it, validates the output format, and if validation fails, it debugs and retries. It explicitly signals completion status via terminal tools.</p>
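  <p>The <code>create_agent_function</code> factory above is not a library API; its internals are left open. Here is a minimal sketch of one way it could work, assuming a hypothetical chat client <code>llm</code> that returns one tool call per turn. The client interface and all helper names are illustrative, not a real SDK:</p>

```python
import json
from dataclasses import dataclass, field

# map each terminal tool to the status the caller will see
TERMINAL_STATUS = {
    "task_complete": "complete",
    "task_give_up": "give_up",
    "task_impossible": "impossible",
}

@dataclass
class AgentResult:
    status: str                                # "complete" | "give_up" | "impossible"
    data: list = field(default_factory=list)   # payload from task_complete
    reason: str = ""                           # explanation from task_give_up / task_impossible

def create_agent_function(name, system_prompt, tools, llm, max_iterations=15):
    """Wrap an agent loop so it can be called like a plain function."""
    registry = {tool.__name__: tool for tool in tools}

    def agent_fn(**kwargs):
        messages = [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": json.dumps(kwargs)},
        ]
        for _ in range(max_iterations):
            # hypothetical client: given the transcript and the tool names,
            # it returns one tool call with .name and .arguments
            call = llm.chat(messages, tools=sorted(registry))
            result = registry[call.name](**call.arguments)
            if call.name in TERMINAL_STATUS:
                # the agent decided it is done, one way or another
                return AgentResult(status=TERMINAL_STATUS[call.name], **result)
            # feed the tool result back so the agent can pick the next step
            messages.append({"role": "tool", "name": call.name,
                             "content": json.dumps(result)})
        return AgentResult(status="give_up", reason="max_iterations reached")

    agent_fn.__name__ = name
    return agent_fn
```

  <p>The terminal tools are what turn an open-ended loop into a function: whichever trajectory the agent takes, the caller always gets back a single <code>AgentResult</code>.</p>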

  <p>The resulting Autonomous Function works like any other function inside a larger system.</p>

  <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">build_training_dataset</span><span class="p">(</span><span class="n">sources</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">str</span><span class="p">])</span> <span class="o">-&gt;</span> <span class="n">Dataset</span><span class="p">:</span>
    <span class="n">normalized</span> <span class="o">=</span> <span class="p">[]</span>

    <span class="k">for</span> <span class="n">source</span> <span class="ow">in</span> <span class="n">sources</span><span class="p">:</span>
        <span class="c1"># Autonomous Function: data normalization
</span>        <span class="n">result</span> <span class="o">=</span> <span class="n">normalize_hf_to_openai</span><span class="p">(</span><span class="n">dataset</span><span class="o">=</span><span class="n">source</span><span class="p">)</span>

        <span class="k">if</span> <span class="n">result</span><span class="p">.</span><span class="n">status</span> <span class="o">==</span> <span class="s">"complete"</span><span class="p">:</span>
            <span class="n">normalized</span><span class="p">.</span><span class="n">extend</span><span class="p">(</span><span class="n">result</span><span class="p">.</span><span class="n">data</span><span class="p">)</span>
        <span class="k">elif</span> <span class="n">result</span><span class="p">.</span><span class="n">status</span> <span class="o">==</span> <span class="s">"impossible"</span><span class="p">:</span>
            <span class="n">log</span><span class="p">.</span><span class="n">warning</span><span class="p">(</span><span class="sa">f</span><span class="s">"Skipping </span><span class="si">{</span><span class="n">source</span><span class="si">}</span><span class="s">: </span><span class="si">{</span><span class="n">result</span><span class="p">.</span><span class="n">reason</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>

    <span class="c1"># Autonomous Function: deduplication
</span>    <span class="n">deduped</span> <span class="o">=</span> <span class="n">deduplicate_conversations</span><span class="p">(</span><span class="n">normalized</span><span class="p">)</span>

    <span class="c1"># Autonomous Function: quality filtering
</span>    <span class="n">filtered</span> <span class="o">=</span> <span class="n">filter_low_quality</span><span class="p">(</span><span class="n">deduped</span><span class="p">,</span> <span class="n">threshold</span><span class="o">=</span><span class="mf">0.8</span><span class="p">)</span>

    <span class="c1"># Note: All three Autonomous Functions above can be combined into one
</span>    <span class="k">return</span> <span class="n">Dataset</span><span class="p">(</span><span class="n">filtered</span><span class="p">)</span>
</code></pre></div>  </div>

  <p>Each Autonomous Function explicitly signals success, failure, or impossibility. The caller handles each case appropriately.</p>

  <p>The key insight here is that we’re moving from deterministic functions to goal-oriented functions. Traditional functions ask “what steps should I execute?” while Autonomous Functions ask “what goal should I achieve?”</p>

  <p>When designing these, clarity of the goal definition matters most. Vague goals lead to vague outputs. Terminal tools let the agent signal completion explicitly. And boundaries like max iterations and timeouts provide safety rails.</p>

  <p>This is a fundamental shift in how we think about computation. Instead of writing code that specifies every step, we define goals and let intelligent agents figure out the trajectory. The function becomes a container for intelligent problem-solving rather than a fixed sequence of operations.</p>

</div>

<div class="lang-ko">

  <p>Agent도 하나의 함수처럼 동작할 수 있다.</p>

  <p><strong>TL;DR</strong></p>

  <ol>
    <li>기존 함수는 고정된 trajectory를 따라 기계적으로 실행된다.</li>
    <li>Autonomous Function은 함수 내에서 지적 연산을 수행하여 고정되지 않은 trajectory를 따라 목적을 달성할 때까지 실행한다.</li>
    <li>시스템 프롬프트로 목표를 정의하고, 검증 규칙과 도구를 줘서 Agent가 스스로 완료할 때까지 반복하게 한다.</li>
  </ol>

  <blockquote>
    <p>엄밀히 말하면 모든 입출력 매핑은 함수다. 하지만 여기서 말하는 “함수”는 우리가 일상적으로 작성하는 코드 속 함수, 즉 프로그래밍 언어의 function이나 method를 의미한다. 특히 고전적인 결정론적인 함수를 지칭한다.</p>
  </blockquote>

  <p>기존 함수는 고정된 trajectory를 따라 실행된다. 같은 입력이 들어오면 정해진 경로를 기계적으로 따라가고, 같은 출력을 낸다. 결정적이고 예측 가능하다.</p>

  <p>그런데 함수가 생각할 수 있다면 어떨까? 중간 결과를 보고 접근 방식을 바꾸고, 하나가 실패하면 다른 전략을 시도하고, 작업이 정말 끝났는지 스스로 판단할 수 있다면?</p>

  <p>이는 LLM Agent로 구현할 수 있고, 나는 이를 “Agent as a Function” 또는 “Autonomous Function”이라고 부른다. 고정된 trajectory를 따르는 기존 함수와 달리, Autonomous Function은 함수 내에서 지적 연산을 수행한다. 목표를 받아서 어떻게 달성할지 추론하고, 자기 작업을 검증하고, 목표에 도달할 때까지 고정되지 않은 trajectory를 따라 반복한다.</p>

  <p>예시로 보자. HuggingFace에서 데이터셋 받아서 OpenAI 메시지 포맷으로 변환하는 함수가 필요하다고 해보자.</p>

  <p><strong>Approach 1: Single LLM Call</strong></p>

  <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">normalize_dataset</span><span class="p">(</span><span class="n">dataset_name</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">list</span><span class="p">:</span>
    <span class="k">return</span> <span class="n">llm</span><span class="p">.</span><span class="n">complete</span><span class="p">(</span><span class="sa">f</span><span class="s">"Convert </span><span class="si">{</span><span class="n">dataset_name</span><span class="si">}</span><span class="s"> to OpenAI format"</span><span class="p">)</span>
</code></pre></div>  </div>

  <p>한 번의 LLM 호출로는 이 문제를 해결할 수 없다.</p>

  <p><strong>Approach 2: LLM Workflow</strong></p>

  <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">normalize_dataset</span><span class="p">(</span><span class="n">dataset_name</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">list</span><span class="p">:</span>
    <span class="n">dataset_url</span> <span class="o">=</span> <span class="n">web_search</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="n">dataset_name</span><span class="si">}</span><span class="s"> huggingface download url"</span><span class="p">)</span>
    <span class="n">raw_data</span> <span class="o">=</span> <span class="n">download</span><span class="p">(</span><span class="n">dataset_url</span><span class="p">)</span>

    <span class="k">for</span> <span class="n">attempt</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">3</span><span class="p">):</span>
        <span class="n">code</span> <span class="o">=</span> <span class="n">llm</span><span class="p">.</span><span class="n">complete</span><span class="p">(</span><span class="sa">f</span><span class="s">"Write code to convert to OpenAI format: </span><span class="si">{</span><span class="n">raw_data</span><span class="p">[</span><span class="si">:</span><span class="mi">1000</span><span class="p">]</span><span class="si">}</span><span class="s">..."</span><span class="p">)</span>
        <span class="c1"># ... validation and execution logic ...
</span>
    <span class="k">raise</span> <span class="nb">RuntimeError</span><span class="p">(</span><span class="s">"Failed after 3 attempts"</span><span class="p">)</span>
</code></pre></div>  </div>

  <p>재시도를 넣을 수 있지만, 다 하드코딩이다. 몇 번 시도할지, 실패하면 뭘 할지, 언제 포기할지. LLM한테 결정권이 없다. 시키면 코드 생성할 뿐이다. 결정은 전부 개발자가 코드 짤 때 내린다. 런타임에 모델이 판단하는 게 아니다.</p>

  <p><strong>Approach 3: Agent as a Function</strong></p>

  <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">system_prompt</span> <span class="o">=</span> <span class="s">"""You are a data normalization agent.
Your goal is to download a HuggingFace dataset and convert it to OpenAI message format.

## Validation Rules
- Output must be valid JSON
- Each message must have 'role' and 'content' fields
- 'role' must be one of: 'system', 'user', 'assistant'
- All conversations must be properly structured

## Termination
- Call task_complete(result) when validation passes
- Call task_give_up(reason) if you've tried multiple approaches and none work
- Call task_impossible(reason) if the task is fundamentally impossible
"""</span>

<span class="c1"># BASE_TOOLS: 작업용 도구
</span><span class="n">BASE_TOOLS</span> <span class="o">=</span> <span class="p">[</span>
    <span class="n">web_search</span><span class="p">,</span>       <span class="c1"># 문서 검색
</span>    <span class="n">read_file</span><span class="p">,</span>        <span class="c1"># 데이터 읽기
</span>    <span class="n">write_file</span><span class="p">,</span>       <span class="c1"># 코드/결과 쓰기
</span>    <span class="n">run_python</span><span class="p">,</span>       <span class="c1"># 코드 실행
</span><span class="p">]</span>

<span class="c1"># VALIDATION_TOOLS: 검증용 도구
</span><span class="n">VALIDATION_TOOLS</span> <span class="o">=</span> <span class="p">[</span>
    <span class="n">validate_json_schema</span><span class="p">,</span>  <span class="c1"># 포맷 검증
</span>    <span class="n">run_tests</span><span class="p">,</span>             <span class="c1"># 테스트 실행
</span><span class="p">]</span>

<span class="c1"># TERMINAL_TOOLS: 종료 신호
</span><span class="n">TERMINAL_TOOLS</span> <span class="o">=</span> <span class="p">[</span>
    <span class="n">task_complete</span><span class="p">,</span>    <span class="c1"># 성공
</span>    <span class="n">task_give_up</span><span class="p">,</span>     <span class="c1"># 포기
</span>    <span class="n">task_impossible</span><span class="p">,</span>  <span class="c1"># 불가능
</span><span class="p">]</span>

<span class="n">normalize_hf_to_openai</span> <span class="o">=</span> <span class="n">create_agent_function</span><span class="p">(</span>
    <span class="n">name</span><span class="o">=</span><span class="s">"normalize_hf_to_openai"</span><span class="p">,</span>
    <span class="n">system_prompt</span><span class="o">=</span><span class="n">system_prompt</span><span class="p">,</span>
    <span class="n">tools</span><span class="o">=</span><span class="n">BASE_TOOLS</span> <span class="o">+</span> <span class="n">VALIDATION_TOOLS</span> <span class="o">+</span> <span class="n">TERMINAL_TOOLS</span><span class="p">,</span>
    <span class="n">max_iterations</span><span class="o">=</span><span class="mi">15</span>
<span class="p">)</span>

<span class="c1"># 다른 함수처럼 호출
</span><span class="n">result</span> <span class="o">=</span> <span class="n">normalize_hf_to_openai</span><span class="p">(</span><span class="n">dataset</span><span class="o">=</span><span class="s">"squad_v2"</span><span class="p">)</span>
</code></pre></div>  </div>

  <p>Agent가 스키마 찾고, 변환 코드 짜고, 실행하고, 결과 포맷 검증한다. 검증 실패하면 디버깅하고 다시 시도한다. 끝나면 종료 도구로 상태를 알린다.</p>
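  <p>위의 <code>create_agent_function</code>은 특정 라이브러리 API가 아니며, 내부 구현은 열려 있다. 턴마다 도구 호출 하나를 돌려주는 가상의 챗 클라이언트 <code>llm</code>을 가정한 최소 구현 스케치는 다음과 같다(클라이언트 인터페이스와 헬퍼 이름은 모두 예시다):</p>

```python
import json
from dataclasses import dataclass, field

# 각 종료 도구를 호출자가 보게 될 status로 매핑
TERMINAL_STATUS = {
    "task_complete": "complete",
    "task_give_up": "give_up",
    "task_impossible": "impossible",
}

@dataclass
class AgentResult:
    status: str                                # "complete" | "give_up" | "impossible"
    data: list = field(default_factory=list)   # task_complete가 준 결과
    reason: str = ""                           # task_give_up / task_impossible의 사유

def create_agent_function(name, system_prompt, tools, llm, max_iterations=15):
    """Agent 루프를 감싸서 일반 함수처럼 호출할 수 있게 만든다."""
    registry = {tool.__name__: tool for tool in tools}

    def agent_fn(**kwargs):
        messages = [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": json.dumps(kwargs)},
        ]
        for _ in range(max_iterations):
            # 가상의 클라이언트: 대화 기록과 도구 이름을 받아
            # .name과 .arguments를 가진 도구 호출 하나를 돌려준다
            call = llm.chat(messages, tools=sorted(registry))
            result = registry[call.name](**call.arguments)
            if call.name in TERMINAL_STATUS:
                # Agent가 스스로 종료를 결정한 경우
                return AgentResult(status=TERMINAL_STATUS[call.name], **result)
            # 도구 결과를 다시 넣어 Agent가 다음 단계를 고르게 한다
            messages.append({"role": "tool", "name": call.name,
                             "content": json.dumps(result)})
        return AgentResult(status="give_up", reason="max_iterations reached")

    agent_fn.__name__ = name
    return agent_fn
```

  <p>종료 도구가 있기에 열린 루프가 하나의 함수가 된다. Agent가 어떤 trajectory를 택하든, 호출자는 항상 하나의 <code>AgentResult</code>를 받는다.</p>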

  <p>제작된 Autonomous Function은 시스템 내에 하나의 함수처럼 동작한다.</p>

  <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">build_training_dataset</span><span class="p">(</span><span class="n">sources</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">str</span><span class="p">])</span> <span class="o">-&gt;</span> <span class="n">Dataset</span><span class="p">:</span>
    <span class="n">normalized</span> <span class="o">=</span> <span class="p">[]</span>

    <span class="k">for</span> <span class="n">source</span> <span class="ow">in</span> <span class="n">sources</span><span class="p">:</span>
        <span class="c1"># Autonomous Function: 데이터 정규화
</span>        <span class="n">result</span> <span class="o">=</span> <span class="n">normalize_hf_to_openai</span><span class="p">(</span><span class="n">dataset</span><span class="o">=</span><span class="n">source</span><span class="p">)</span>

        <span class="k">if</span> <span class="n">result</span><span class="p">.</span><span class="n">status</span> <span class="o">==</span> <span class="s">"complete"</span><span class="p">:</span>
            <span class="n">normalized</span><span class="p">.</span><span class="n">extend</span><span class="p">(</span><span class="n">result</span><span class="p">.</span><span class="n">data</span><span class="p">)</span>
        <span class="k">elif</span> <span class="n">result</span><span class="p">.</span><span class="n">status</span> <span class="o">==</span> <span class="s">"impossible"</span><span class="p">:</span>
            <span class="n">log</span><span class="p">.</span><span class="n">warning</span><span class="p">(</span><span class="sa">f</span><span class="s">"Skipping </span><span class="si">{</span><span class="n">source</span><span class="si">}</span><span class="s">: </span><span class="si">{</span><span class="n">result</span><span class="p">.</span><span class="n">reason</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>

    <span class="c1"># Autonomous Function: 중복 제거
</span>    <span class="n">deduped</span> <span class="o">=</span> <span class="n">deduplicate_conversations</span><span class="p">(</span><span class="n">normalized</span><span class="p">)</span>

    <span class="c1"># Autonomous Function: 품질 필터링
</span>    <span class="n">filtered</span> <span class="o">=</span> <span class="n">filter_low_quality</span><span class="p">(</span><span class="n">deduped</span><span class="p">,</span> <span class="n">threshold</span><span class="o">=</span><span class="mf">0.8</span><span class="p">)</span>

    <span class="c1"># 참고: 위 세 개의 Autonomous Function은 하나로 합칠 수도 있다
</span>    <span class="k">return</span> <span class="n">Dataset</span><span class="p">(</span><span class="n">filtered</span><span class="p">)</span>
</code></pre></div>  </div>

  <p>각 Autonomous Function이 성공, 실패, 불가능을 명시적으로 알려주니까, 호출하는 쪽에서 각 경우를 적절히 처리할 수 있다.</p>

  <p>핵심은 결정적 함수에서 목표 지향 함수로 이동하는 것이다. 기존 함수는 “어떤 단계를 실행할까?”를 묻지만, Autonomous Function은 “어떤 목표를 달성할까?”를 묻는다.</p>

  <p>설계할 때는 목표 정의의 명확성이 가장 중요하다. 모호한 목표는 모호한 결과로 이어진다. 종료 도구는 Agent가 완료를 명시적으로 알릴 수 있게 한다. 최대 반복 횟수나 타임아웃 같은 경계는 안전장치 역할을 한다.</p>

  <p>이건 연산에 대한 사고방식의 근본적인 변화다. 모든 단계를 명시하는 코드를 작성하는 대신, 목표를 정의하고 지적 Agent가 trajectory를 알아내게 한다. 함수가 고정된 연산 순서가 아니라 지적 문제 해결을 담는 컨테이너가 된다.</p>

</div>

<script>
function setLang(lang) {
  document.querySelector('.lang-en').classList.remove('active');
  document.querySelector('.lang-ko').classList.remove('active');
  document.getElementById('btn-en').classList.remove('active');
  document.getElementById('btn-ko').classList.remove('active');

  document.querySelector('.lang-' + lang).classList.add('active');
  document.getElementById('btn-' + lang).classList.add('active');
  localStorage.setItem('lang', lang);
}

(function() {
  var saved = localStorage.getItem('lang');
  if (saved) {
    setLang(saved);
  } else {
    var browserLang = navigator.language || navigator.userLanguage;
    setLang(browserLang.startsWith('ko') ? 'ko' : 'en');
  }
})();
</script>]]></content><author><name></name></author><category term="Research" /><category term="Engineering" /><category term="LLM" /><category term="Agent" /><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">Scaling</title><link href="https://goddoe.github.io/research/engineering/2024/06/01/Scaling.html" rel="alternate" type="text/html" title="Scaling" /><published>2024-06-01T14:44:00+00:00</published><updated>2024-06-01T14:44:00+00:00</updated><id>https://goddoe.github.io/research/engineering/2024/06/01/Scaling</id><content type="html" xml:base="https://goddoe.github.io/research/engineering/2024/06/01/Scaling.html"><![CDATA[<style>
.lang-switch {
  font-size: 0.85em;
  color: #999;
  margin-bottom: 20px;
}
.lang-switch a {
  color: #999;
  text-decoration: none;
  cursor: pointer;
}
.lang-switch a:hover {
  color: #666;
}
.lang-switch a.active {
  color: #333;
  font-weight: 600;
}
.lang-en, .lang-ko { display: none; }
.lang-en.active, .lang-ko.active { display: block; }
</style>

<div class="lang-switch">
  <a id="btn-en" onclick="setLang('en')">EN</a> | <a id="btn-ko" onclick="setLang('ko')">KO</a>
</div>

<div class="lang-en">

  <p>As an AI Research Engineer, I have designed and run a wide range of ML/DL and LLM experiments, and I have learned a lot from them.</p>

  <p>I want to share my experiences and insights to help others who are solving similar AI problems. Rather than writing everything down at once, I plan to share it piece by piece.</p>

  <p>The first topic I want to share is Scaling.</p>

  <p><strong>TL;DR</strong></p>

  <ol>
    <li>Find a methodology that shows monotonically increasing performance when scaled.</li>
    <li>Through scaling experiments, build a model “f(model size or dataset size or …) = score” that predicts performance as scale increases.</li>
    <li>Use the model “f(model size or dataset size or …) = score” to decide whether to push this methodology further or move on to another one.</li>
  </ol>

  <p>As an engineer, to solve a problem you need to find a technique that can solve it well and push that technique to its limit to improve performance. Ideally, we find one whose performance increases monotonically as we scale it up.</p>

  <p>In many cases, performance increases linearly or like a log function. We push a performance-enhancing methodology to its limit through scaling until we hit a plateau, and solve the problem with the performance gained along the way.</p>

  <p>Scaling as a way to solve problems shows up outside AI/ML as well. In semiconductors, circuit density is scaled by shrinking the circuit linewidth with finer process nodes each year. GPUs are scaled by increasing the number of CUDA cores.</p>

  <p>In LLMs, the scaling law, where performance continues to improve as the model size is scaled up, is very well-known.</p>

  <p>In AI/ML problems, aside from model size, another aspect that can be scaled is the dataset. Rather than changing the model architecture or training techniques, scaling the amount of data and improving its quality is more cost-effective for enhancing model performance.</p>

  <p>And in many cases, by running scaling experiments on the dataset and measuring performance at several dataset sizes, we can build a model that predicts performance from the number of data samples.</p>

  <p>Generally, performance increases on a log scale as the number of data samples grows, so plotting log(dataset size) vs. score reveals a linear relationship; a linear fit then gives a function “f(dataset size) = score” that predicts performance from the number of samples.</p>
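  <p>As a concrete sketch, this fit is an ordinary least-squares line in log space. The measurements below are made up purely for illustration:</p>

```python
import math

# made-up scaling measurements: (dataset size, benchmark score)
points = [(1_000, 0.52), (5_000, 0.61), (20_000, 0.68), (80_000, 0.76)]

# least-squares fit of: score = a * log(dataset size) + b
xs = [math.log(n) for n, _ in points]
ys = [score for _, score in points]
mx = sum(xs) / len(xs)
my = sum(ys) / len(ys)
a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
b = my - a * mx

def predict_score(n: int) -> float:
    """Predicted benchmark score at dataset size n."""
    return a * math.log(n) + b

# extrapolate: is scaling to 300k samples worth the collection cost?
print(round(predict_score(300_000), 2))  # roughly 0.83 with these made-up numbers
```

  <p>The extrapolated score is what you weigh against the cost of collecting the next batch of data.</p>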

  <p>By using this function, we can predict how much performance gain we can expect when scaling this methodology and decide whether to continue scaling the current data collection methodology, switch to collecting different data, or move on to another aspect besides data.</p>

  <p>This is, in fact, the most basic concept taught in ML 101. It is just as fundamental in LLM experiments, and not only I but also OpenAI researchers emphasize scaling research and invest in it with many experiments.</p>

  <p>Many new AI and LLM technologies are emerging today, and these basics often seem to be overlooked in recent research and experiments. But LLMs are still ML, so I believe it is important to always keep the fundamentals of ML in mind and test against them. Putting the basics first is the faster way to solve problems.</p>

</div>

<div class="lang-ko">

  <p>AI 리서치 엔지니어로서 다양한 ML/DL, LLM 실험을 수행하고 설계하면서 많은 것을 배웠다.</p>

  <p>비슷한 AI 관련 문제를 해결하고 있는 분들에게 도움이 되고자 내 경험과 인사이트를 공유하고자 한다. 모든 경험과 노하우를 한 번에 쓰기보다는 조금씩 나눠서 공유할 계획이다.</p>

  <p>첫 번째로 공유하고 싶은 주제는 스케일링이다.</p>

  <p><strong>TL;DR</strong></p>

  <ol>
    <li>스케일을 키웠을 때 성능이 단조 증가하는 방법론을 찾는다.</li>
    <li>스케일링 실험을 통해 스케일이 증가함에 따라 성능을 예측할 수 있는 모델 “f(모델 크기 또는 데이터셋 크기 또는 …) = 점수”를 만든다.</li>
    <li>이 모델 “f(모델 크기 또는 데이터셋 크기 또는 …) = 점수”를 사용하여 이 방법론을 더 밀고 나갈지, 다른 것으로 넘어갈지 결정한다.</li>
  </ol>

  <p>엔지니어로서 문제를 해결하려면 그 문제를 잘 해결할 수 있는 기술을 찾아 극한까지 밀어붙여 성능을 향상시켜야 한다. 이상적으로는 스케일을 키울수록 성능이 단조 증가하는 기술을 찾는다.</p>

  <p>많은 경우 성능은 선형적으로 또는 로그 함수처럼 증가한다. 고원(plateau)에 도달할 때까지 스케일링을 통해 성능 향상 방법론을 한계까지 밀어붙이고, 성능을 개선하여 문제를 해결한다.</p>

  <p>문제 해결을 위한 이러한 스케일링은 AI/ML 외의 다른 분야에서도 관찰된다. 반도체에서는 매년 나노 공정을 통해 회로 선폭을 더 얇게 만들어 회로 집적도를 높이는 방식으로 스케일링한다. GPU는 CUDA 코어 수를 늘리는 방식으로 스케일링된다.</p>

  <p>LLM에서는 모델 크기가 커질수록 성능이 계속 향상되는 스케일링 법칙이 매우 잘 알려져 있다.</p>

  <p>AI/ML 문제에서 모델 크기 외에 스케일링할 수 있는 또 다른 측면은 데이터셋이다. 모델 아키텍처나 학습 기법을 바꾸는 것보다 데이터의 양을 늘리고 품질을 개선하는 것이 모델 성능 향상에 더 비용 효율적이다.</p>

  <p>그리고 많은 경우, 데이터셋 크기를 바꿔 가며 성능을 측정하는 스케일링 실험을 통해 데이터 샘플 수로부터 성능을 예측하는 모델을 만들 수 있다.</p>

  <p>일반적으로 데이터 샘플 수가 증가함에 따라 성능은 로그 스케일로 증가하므로, log(데이터셋 크기) vs 점수를 그래프로 그리면 선형 관계를 볼 수 있고, 선형 피팅을 통해 데이터 샘플 수에 기반한 성능을 예측하는 함수 “f(데이터셋 크기) = 점수”를 찾을 수 있다.</p>
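  <p>구체적인 스케치로, 이 피팅은 로그 공간에서의 최소제곱 직선이다. 아래 측정값은 설명을 위해 지어낸 숫자다:</p>

```python
import math

# 지어낸 스케일링 측정값: (데이터셋 크기, 벤치마크 점수)
points = [(1_000, 0.52), (5_000, 0.61), (20_000, 0.68), (80_000, 0.76)]

# 최소제곱 피팅: score = a * log(dataset size) + b
xs = [math.log(n) for n, _ in points]
ys = [score for _, score in points]
mx = sum(xs) / len(xs)
my = sum(ys) / len(ys)
a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
b = my - a * mx

def predict_score(n: int) -> float:
    """데이터셋 크기 n에서의 예측 벤치마크 점수."""
    return a * math.log(n) + b

# 외삽: 30만 샘플까지 스케일링할 가치가 있는가?
print(round(predict_score(300_000), 2))  # 위의 지어낸 숫자 기준 대략 0.83
```

  <p>이렇게 외삽한 점수를 다음 데이터 수집 비용과 저울질해 결정하면 된다.</p>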

  <p>이 함수를 사용하여 이 방법론을 스케일링할 때 얼마나 많은 성능 향상을 기대할 수 있는지 예측하고, 현재 데이터 수집 방법론을 계속 스케일링할지, 다른 데이터를 수집하는 것으로 전환할지, 또는 데이터 외의 다른 측면으로 넘어갈지 결정할 수 있다.</p>

  <p>그리고 이것은 사실 ML 101에서 가르치는 가장 기본적인 개념이다. 이것은 LLM 실험에서도 기본이며, 나뿐만 아니라 OpenAI 연구원들도 많은 실험을 통해 스케일링 연구를 강조하고 투자한다.</p>

  <p>현재 많은 새로운 AI, LLM 기술이 등장하고 있고, 최근 연구와 실험에서 이러한 기본적인 측면이 종종 간과되는 것 같다. 하지만 LLM도 여전히 ML이므로, ML의 기본을 항상 기억하고 실험하는 것이 중요하다고 생각한다. 기본을 먼저 지키는 것이 문제를 해결하는 더 빠른 방법이라고 믿는다.</p>

</div>

<script>
function setLang(lang) {
  document.querySelector('.lang-en').classList.remove('active');
  document.querySelector('.lang-ko').classList.remove('active');
  document.getElementById('btn-en').classList.remove('active');
  document.getElementById('btn-ko').classList.remove('active');

  document.querySelector('.lang-' + lang).classList.add('active');
  document.getElementById('btn-' + lang).classList.add('active');
  localStorage.setItem('lang', lang);
}

(function() {
  var saved = localStorage.getItem('lang');
  if (saved) {
    setLang(saved);
  } else {
    var browserLang = navigator.language || navigator.userLanguage;
    setLang(browserLang.startsWith('ko') ? 'ko' : 'en');
  }
})();
</script>]]></content><author><name></name></author><category term="Research" /><category term="Engineering" /><summary type="html"><![CDATA[]]></summary></entry></feed>