Allen Chang

Actively Understanding the Dynamics and Risks of the Threat Intelligence Ecosystem

2026-03-09T00:00:00+00:00

Following the recent acceptance of our paper “Actively Understanding the Dynamics and Risks of the Threat Intelligence Ecosystem” to the Network and Distributed System Security (NDSS) Symposium 2026, this blog serves to summarize my contribution to this research project, and describes at a high level our methodology and results. This represents two years of work with collaborators at Georgia Tech: Tillson Galloway, Omar Alrawi, Thanos Avgetidis, Manos Antonakakis, and my advisor Fabian Monrose.

First, I summarize the methodology, findings, and implications for the threat intelligence ecosystem. Then, I discuss my contributions to the project, as well as the design choices in more detail than the publication.

Our paper and slides can be found on the NDSS website.

Project Summary

Background and Motivation

The cybersecurity industry and community relies heavily on collecting and sharing threat intelligence (TI). Security vendors and other defenders analyze artifacts and produce indicators such as IP addresses, domains, and file signatures to detect and respond to threats. Yet despite the scale of the TI ecosystem, it remains largely a black box; it is inherently difficult to investigate the sharing patterns and tendencies of security vendors. We know little about how security vendors triage artifacts, collect indicators of compromise (IoCs), and disseminate this information. Vendors themselves know little about how their intelligence is being used and what threat vectors may exist in their analysis.

In our work, we set out to measure the TI ecosystem end-to-end, from initial artifact submission to disruption of threats. Our approach reveals not only what vendors participate, but also how quickly vendors act, the depth of their analysis, and the threat vectors of this ecosystem.

Research Questions

The end goal of our research is threefold, and is centered around answering the following research questions (RQs):

RQ1 (Propagation): How do security vendors differ in their ability to analyze malware and share extracted indicators of compromise (IoCs) across the ecosystem?

RQ2 (Disruption): How do differences in analysis and sharing affect the speed and effectiveness with which vendors block IoCs or take down (or suspend) infrastructure?

RQ3 (Evasion): How are adversaries exploiting gaps in analysis, sharing, or disruption, and what strategies can improve the ecosystem’s resilience against such evasions?

Measurement Pipeline

To penetrate this black box, our key idea was to devise malware as probes in order to map the TI ecosystem supply chain. To do this, we devised a measurement pipeline to answer these questions. The goal of this pipeline is to track the binaries as they traverse the ecosystem from submission to execution(s) to disruption; this is done by deploying a set of observers monitoring for watermarked emissions. The pipeline is shown below:

The Generator produces a binary (“Rocket”) which is submitted to a set of TI platforms, sandboxes, and antivirus engines; each binary is unique to each submission. The binary is a defanged malware that is intended to trigger a malicious verdict and subsequent execution. Upon execution the Rocket collects a set of information about the sandbox environment, encodes the information in an HTTP request to a controlled domain, and drops a modified copy of itself with an updated provenance trail.

The Observer consists of emission sensors, a DNS authority and HTTP server, that monitors for requests produced by the Generator, which indicates execution. Identifiers in the request are uniquely mapped to submissions to track emissions with high confidence. The Observer also consists of IoC probes which actively track whether domains, IPs, and artifact hashes appear in blocklists (Google Safe Browsing, Quad9, VirusTotal, AlienVault OTX, commercial feeds).

This design allows us to observe each stage of the IoC propagation chain, which we can intuitively consider as an IoC lifecycle. We can either derive information or receive indicators at 1) submission; 2) first execution (sandbox); 3) sharing and each subsequent execution, and location of execution; 4) when IoCs were put into action to disrupt threats (blocking); 5) domain suspension.

Methodology and Scale

We submitted unique Rockets to 30 vendors across three categories:

10 Antivirus vendors (e.g., Kaspersky, Sophos, Avira)
10 Malware sandbox services (e.g., Hybrid Analysis, Any.Run, Intezer)
10 TI platforms (e.g., VirusTotal, MalwareBazaar, AlienVault OTX)

We registered 9 domains, assigning each domain to a category type and carrying out multiple experiments. The domains were chosen such that no prior registrations were found in DNS zonefiles to avoid experiment contamination. Each Rocket was unique to the vendor and manually submitted to each vendor for an experiment.

To label vendors, we use data exfiltrated from the sandbox. The first execution of a Rocket is attributed to the submission vendor, while later executions can be attributed by clustering similar sandbox environments with relatively high certainty. This resulted in 62 labeled clusters, 19 of which were known from first submission and 43 generated via clustering.

Key Findings

Despite extraction being common, sharing of TI and action is rare. 20/30 submitted Rockets were executed, but only 5 shared the extracted TI. Even more, only 2 vendors contributed to downstream domain takedowns. Hence, there is a big gap between IoC extraction and actionable intelligence dissemination.
“Nexus” vendors may create single points of failure. We label 4 vendors as “nexus” vendors, which have high in- and out-degree in the TI graph; they both consume and share TI.
Adversaries actively exploit sandbox fingerprints. Using VirusTotal Retrohunt, we identified 874 malware samples uploaded within a 90-day window (March-June 2025) containing sandbox-specific IPs from our experiments. Two popular open-source stealer families, such as this one (GitHub) dynamically download IP blocklists from GitHub to detect and evade sandboxes. Simulating evasion using public blocklists showed a 25% reduction in vendors receiving extracted TI.
Network IoCs are reshared far more frequently than binaries. While this is naturally more efficient, it means that some vendors lose out on information that may otherwise be contextually relevant. In some cases, domain lookups were up to 20x more prevalent than binary executions.
Sharing delays propagate downstream. Although IoCs are typically extracted within minutes, sharing delays of hours to days propagate across the supply chain and lifecycle. In case studies, domains were blocked by DNS firewalls within 1-13 hours but takedowns took 8-11 days. This leaves a wide exploitation window for adversaries.

Recommendations

In our paper, we provide recommendations to improve ecosystem resilience.

For vendors:

Diversify sandbox fingerprints across IP ranges, system configurations, and ASNs. While operationally expensive, adversaries actively take advantage of patterns across sandboxes.
Implement recursive analysis of dropped files and packed files, which are increasingly common.
Reduce sharing delays to reduce delay in TI lifecycle.
Monitor for sandbox fingerprint abuse using detection signatures.

For operators:

Assess TI feed sources to understand upstream TI ecosystem dependencies and avoid illusion of consensus.
Prioritize feeds that provide full binaries and behavioral context, not just decontextualized IoCs.
Consider the risks of uploading binaries to public scanning services. IoCs can be used as honeytokens to reveal breaches of confidentiality or incident response auditing.

For researchers:

Consider ethical implications of your research. Avoid human deception, implement opt-out, etc.
Account for aggressive IoC scanning during experiments and anticipate large-scale data cleanup.
Account for vendor dependencies when evaluating TI feeds or training ML models.

My Contributions

While this research was only recently published, I began work on this project in mid-2023. Most of my work focused on the design decisions of the Rocket malware, while my co-authors did much of the data analysis.

Malware Design

The core purpose of the Rocket design is to build a probing binary that vendors will dynamically analyze and run, and have it leave a provenance trail to trace where the binary propagates across the TI ecosystem. This is easier said than done, because competing constraints may have to be simultaneously satisfied. For instance, the binary must be malicious enough to trigger dynamic execution consistently, while being defanged for ethical considerations. It must exfiltrate environment information from the sandbox, but must do so in a privacy-preserving manner.

Inducing Dynamic Execution and a Malicious Verdict

Not all binaries submitted to a vendor will be executed; some will be processed statically. However, the experiments require that Rockets consistently trigger dynamic analysis, so the binary needs to exhibit behaviors that static analysis deems suspicious enough to proceed to dynamic analysis.

To do this, we use known rules to induce a malicious verdict during static triaging. We decided to use a combination of malicious YARA signatures (byte sequences, strings) and other techniques to consistently produce malicious verdicts in a subscription-based private sandbox.

Assuming the binary proceeds to dynamic analysis, we also need to ensure that the sandbox produces a malicious verdict to encourage downstream sharing of IoCs. Verdicts are decided in sandboxes through a variety of rules and behaviors: creating processes or files, network activity, API call sequences (often done through hooking user-mode APIs) and monitoring syscalls, or evasion detection. Conveniently, much of the behavior needed for our measurement (environment fingerprinting, file writes/dropping Satellites, and network emissions) already contributes to suspicion. Importantly, we use a defanged keylogging behavior (which makes desired API calls) in our construction to reinforce a malicious verdict.

Tracking Binary History and Creating a Provenance Trail

During execution, the binary collects and exfiltrates environment information by querying a lab-controlled HTTP server (“emission”). However, this profiling process is not enough to track where the IoCs have been spread, nor who has executed the binary; we can only associate the submission to a vendor to an initial execution. Hence, the key idea is to use a type of provenance tracking in our profiling approach. During profiling, the binary embeds information about the current environment into itself (the “provenance trail”), and during the emission phase, the entire provenance trail is exfiltrated.

To implement this, we must first discuss data exfiltration to understand the restrictions. As mentioned earlier, we use a DNS authority and HTTP server that monitors for requests produced. To share the fingerprint with the Observer, the Generator uses a simple HTTP request and encodes the provenance trail in the fully qualified domain name (FQDN).

This leads to an interesting problem. FQDNs can be 253 characters long, but a given subdomain label can be just 63 characters. Domain names are also case-insensitive, so our encoding strategy limits labels to [a-z0-9], 36 characters. Furthermore, a DNS label must be separated by a ., so 4 characters are lost on label separation. Hence we are upper bounded at $249\log_{2}36\approx 1287$ bits of information to encode the provenance trail. Certainly, this is not enough to encode long strings or IoCs to perfectly uniquely identify each submission. Note that the provenance trail must also be encoded in these 1300 bits of information, so the FQDN becomes even more restrictive.

Fingerprinting Sandboxes

The fingerprints generated by the binary during execution were based on a pilot experiment I first ran in late 2023 to validate prior work, such as SandPrint: Fingerprinting Malware Sandboxes to Provide Intelligence for Sandbox Evasion. For this pilot, we profiled sandboxes using 16 features, and performed single-link hierarchical clustering to group sandboxes by vendor. Then, we chose a subset of features that maximizes mutual information $I(F_1;C)=H(F_1)+H(C)-H(F_1,C)$.

We can visualize the clustering using a dendrogram below; each color shows a different cluster.

The table below displays information gain for each field, which helps with intuition, but is not perfectly representative of the information gain onto a joint variable.

I computed all $2^{16}$ subsets and ultimately settled on install_date, ram, and sys_manufac as our set of exfiltrated features to maximize mutual information while being privacy preserving. We can validate by comparing the clustering with the feature subset to the ground truth (clustering with all 16 features); in over 99% of pairs of executions, the same labeling is achieved. Notably, no significant increase in similarity was found when using a larger subset of features.

This algorithm describing single-link hierarchical clustering can be thought of as an implementation of a union-find data structure, in which we form clusters by combining observations until a threshold of features is reached.

Thus, we can generate a hashed fingerprint based on these three features, which we call $H_i$. It turns out this is not yet sufficient, as a single vendor may execute the same binary multiple times across identical sandbox environments, producing indistinguishable fingerprints. To disambiguate, each execution also generates a random execution ID ($\epsilon_i$) that uniquely identifies each individual run. The combination of the system fingerprint and execution ID allows us to distinguish repeat executions within the same vendor from executions across different vendors, which is important for accurately reconstructing the provenance trail.

Furthermore, we generate a unique binary ID $b$ for each of 30 submissions, grouped in 10 by vendor type. Each group is assigned a domain, and each binary ID is assigned a letter between A-I.

Checksums can also be implemented in the provenance trail. Background noise/fuzzing can be detected by our observers, and it is important to separate real observations with fuzzers. I initially implemented a CRC error detection algorithm in our data analysis pipeline, but this turns out to be largely unnecessary since errors can be detected by malformed provenance trails.

Finally, we create the provenance trail with the following construction:

\[\langle b\mid\mid\epsilon_nH_i\mid\mid\dots\mid\mid\epsilon_1H_1\rangle\]

Each execution appends then exfiltrates the entire provenance trail, allowing us to see the entire history of where the binary has been shared. Importantly, if history diverges at a point, we can observe this fact when the provenance trail is forked. If the dropped Satellite binary is also executed, we can see this when two executions are made from a similar sandbox.

Consider these log entries for examples:

There are many deductions we can automatically make from this.

In chain 1, the Rocket is first executed by INT-TI-3 ($t_1$). The resulting Satellite A1 is then executed by EXT-DE-1 ($t_2$), whose provenance trail contains INT-TI-3’s execution ID, meaning EXT-DE-1 received the Satellite from INT-TI-3. Next, Satellite A2 is executed by INT-TI-3 again ($t_3$), with a provenance trail containing both prior execution IDs. This implies EXT-DE-1 shared the Satellite back to INT-TI-3, revealing a cyclic sharing relationship between the two vendors. Note furthermore that the fingerprint (not the execution ID) is the same in $t_1$ and $t_3$.

In Chain 2, the Rocket is first executed by INT-SB-1 ($t_4$). Its Satellite B1 is then executed by INT-TI-3 ($t_5$) via HTTP, and later the same Satellite’s domain is probed by EXT-NZ-1 ($t_6$) via DNS only. The DNS-only probe at $t_6$ indicates that INT-TI-3 shared the domain IoC with EXT-NZ-1, rather than the binary itself; notably this differentiates binary sharing and IoC sharing.

Rocket/Satellite Implementation

The binary itself was written in Go, for a couple of reasons:

Cross-compilation ability
Single binary deployment, so no DLLs needed and simplifies Rocket/Satellite dropping mechanism

The main implementation challenge is the Satellite dropping mechanism. Recall that upon execution of a Rocket, it must drop a modified copy of itself (Satellite) with an updated provenance trail.

To do this, we reserve a fixed region in the compiled binary as effectively a buffer for the provenance trail, which has a max length. During execution, the Rocket drops a Satellite, finds the buffer, and overwrites the buffer with a new provenance trail. Thus, the in-place modification only changes a small portion of memory. See this portion of the early version of the code:

srcFile, err := os.Open(os.Args[0])
if err != nil {
  fmt.Println(err)
}
defer srcFile.Close()

destFile, err := os.Create(newFilename)
if err != nil {
  fmt.Println(err)
}
defer destFile.Close()

_, err = io.Copy(destFile, srcFile)
if err != nil {
  fmt.Println(err)
}

content, err := ioutil.ReadFile(newFilename)
if err != nil {
  fmt.Println(err)
}

r := regexp.MustCompile("<<<<.{256}>>>>")
match := r.FindStringIndex(string(content))
matchStr := content[match[0]:match[1]]
replacedContent := bytes.Replace(content, matchStr, []byte(newHistory), 1)

err = ioutil.WriteFile(newFilename, replacedContent, 0644)
if err != nil {
  fmt.Println(err)
}
destFile.Close()

// delete itself when done
cmd := exec.Command("cmd.exe", "/c", "del " + os.Args[0])
cmd.Start()

// if copy flag is true, replace itself
if copy == "true" {
  err = os.Rename(newFilename, "temp.exe")
  if err != nil {
    fmt.Printf("Error renaming the file: %v\n", err)
    return
  }
  cmd2 := exec.Command("cmd.exe", "/c", "move temp.exe", os.Args[0])
  cmd2.Start()
}

In one experiment to look at sandbox evasion, we pack the rocket with the UPX packing algorithm. We also encrypt the main sandbox profiler with AES, where an “outer profiler” decrypts the Rocket:

//go:embed innerprofiler.enc
var encFile string
var basename string

func main() {
    key := []byte{0x2b, 0x32, 0x2c, 0xad, 0x83, 0xeb, 0xc4, 0x31, 0xd1, 0xee, 0xe3, 0x86, 0x8e, 0x48, 0xbc, 0x4f}
    iv := []byte{0xfe, 0xe2, 0xb9, 0x2c, 0xf5, 0xb8, 0xb2, 0x60, 0xe2, 0x92, 0x96, 0x68, 0xc0, 0x99, 0xf1, 0x9d}

	encFile2, err := hex.DecodeString(encFile)
	if err != nil {
		fmt.Println("Error decoding hex string:", err)
		return
	}
	
  block, err := aes.NewCipher(key)
  if err != nil {
      fmt.Println("Error creating cipher block:", err)
      return
  }

  mode := cipher.NewCBCDecrypter(block, iv)
  decryptedData := make([]byte, len(encFile2))
  mode.CryptBlocks(decryptedData, encFile2)

  paddingLength := int(decryptedData[len(decryptedData)-1])
  decryptedData = decryptedData[:len(decryptedData)-paddingLength]

  outputFilename := basename
  err = ioutil.WriteFile(outputFilename, decryptedData, 0644)
  if err != nil {
      fmt.Println("Error writing decrypted data to file:", err)
      return
  }
}

This experiment (Experiment II in the paper) tests whether vendors recursively analyze dropped files, which has consequences in ecosystem extraction and analysis depth. Our results show that packed Rockets were executed 35% less often than an unpacked Rocket (Experiment I), despite having more AV detections on VirusTotal. This suggests that vendors are not properly unpacking and analyzing dropped payloads.

In another unpublished experiment, we hypothesized that fuzzy hashing (such as ssdeep) would indicate that many similar samples of our Rocket have been previously submitted, impacting detection rates and sharing behavior. To combat this, we tried two things:

Compile-time mutations. We experimented with introducing natural variation in the binary. This can be done through inserting dead code, reordering functions, or substituting equivalent code. However, this alone was insufficient to defeat fuzzy hashing, since ssdeep operates on rolling hashes over byte sequences rather than the full file hash, and the similarity score is still high.
Embedding random byte regions. Similarity is computed over fixed-size blocks of the binary. We can embed large strings of random bytes into each Rocket at compile time to dilute the proportion of shared content between builds. This worked, but has the side effect of bloating the binary.

Observer Design

I also contributed to the Observer’s IoC probing mechanism. Recall that to answer RQ2 (disruption), we need to track when and how vendors act on the intelligence they extract, such as when domains are blocked by DNS resolvers, OSINT blocklists, or suspended by registrars. To do this, we implement a set of cronjobs every couple hours that probe external services for the presence of our IoCs. These probes check:

DNS blocklists, such as Quad9, Google DNS, Cloudflare, or Palo Alto DNS Security. This is done by issuing DNS queries through each resolver and checking for responses that indicate blocking.
OSINT feeds: Whether our domains, IPs, or file hashes appear in Google Safe Browsing, FireHOL, VirusTotal, or AlienVault OTX. For VirusTotal and OTX, we query their APIs for detection counts and AV labels.
Domain suspension: Whether the registrar has suspended or taken down our domains, detected by checking DNS resolution.

We can log the timestamp and the result to reconstruct a timeline to see when each IoC transitioned from active to blocked to suspended. Combined with the emission logs from the Generator, this gives us the full lifecycle of the IoC.

We can observe results in the table from our paper below:

Notably, we see that while commercial DNS blocking could happen within an hour of sharing, domain suspension took 8-11 days on average, and in many cases never happened at all within our 90-day observation window.

Miscellaneous

Ethical Considerations and Response

Actively measuring security infrastructure naturally raises ethical concerns. Since our binaries are submitted to real vendor pipelines, it consumes real compute resources and generates real IoCs that could affect vendors downstream. We adopted certain safeguards to minimize harm:

Satellite dropping was limited to a depth of 5. This avoids accidentally DoSing vendors in a cyclic sharing relationship.
We implement opt-out. A disclaimer and opt-out link are in the binary itself and printed upon execution.
We implement privacy-preserving fingerprinting. All sandbox features are hashed before transmission, and features that could expose sensitive information about a vendor’s environment were deliberately excluded.

We reached out to all 30 vendors we studied to inform them of our studies and coordinate public disclosure. Despite this, only 17 vendors responded, many appreciative of the detailed analysis. Interestingly, some did not take the study as a vulnerability at all, instead opting to label their behavior as a design decision.

Being Found in Another Paper lol

In a weird coincidence, we see that our experiments were detected by Palo Alto Networks in their 2025 IEEE paper “Resolution Without Dissent: In-Path Per-Query Sanitization to Defeat Surreptitious Communication Over DNS”:

There are 104 cases where we cannot decide the purpose of queries at a confident determination. The FQDNs of these cases look like tunneling and there is no useful information on the Internet. For example, the domain 9mn[.]lat has a number of tunneling-like FQDNs, e.g., dpdf3d[…redacted…]skpx.9mn[.]lat and dc81[…redacted…]cof.9mn[.]lat. The only useful information we found is that 9mn[.]lat was registered on 2023-12-23 and will expire on 2024-12-23. Actually, 70 of the 104 domains are registered within one year and the expiration dates are also within one year. We posit that blocking these low-profile new domains should have trivial business impact on enterprise networks.

It seemed like they were studying specifically DNS rather than actively looking up the IoCs, so they did not come across the disclaimer. This is nice validation though that our binary behaved realistically enough to trigger production security systems.

Acknowledgements

This project has been in the works since mid-2023, and I’m very grateful to many people who made it possible. First, I’d like to thank my advisor Fabian Monrose, whose intuition for the right research questions greatly guided the direction of this work. I’d also like to thank Tillson Galloway and Omar Alrawi, whose expertise in TI research and network security was invaluable throughout this project.

Squid Agent - A Multi-Agent CTF Auto Solver

2025-11-18T00:00:00+00:00

Squid Agent - A Multi-Agent CTF Auto Solver

This blog post is a repost from the blog post I co-authored that is posted at the Squid Proxy Lovers Blog.

Abstract

Capture the Flag (CTF) competitions present complex challenges that require a diverse set of techniques to solve cybersecurity challenges. In recent years, agentic systems and LLMs have risen in popularity when solving CTFs, and we have observed powerful agentic systems deployed by other CTF teams in the past to reduce the solve time of challenges significantly. Following in these footsteps, our team has created an advanced agentic framework to automatically solve CTF challenges, which we call Squid Agent. Using the CTFTiny dataset by NYU, we benchmarked Squid Agent and solved 92%, or 46/50 of the challenges in the dataset. In this blog, we describe the construction of the multi-agent framework as well as lessons learned during the process of developing Squid Agent.

The Birth of Squid Agent

The initial development of Squid Agent was motivated by observations from DEFCON CTF Finals. Here, both Perfect Blue and Maple Mallard Magistrates (Theori AIxCC) had powerful agentic systems that were able to help solve challenges and find bugs. For instance, Perfect Blue’s system, designed specifically for CTFing in mind, was able to solve speedpwn challenges in significantly less time (7-10 minutes) than even the best human pwn player (30-40 minutes). This was a wake-up call for us; we are firmly in the age of agents, and we have to adapt or be left behind.

When we began developing the system, we discovered the CSAW Agentic Automated CTF, which provided a means of quantitatively benchmarking our code base and its ability to solve challenges in comparison to other research groups. CSAW gave us a nice baseline to start off with, providing us with a multi-agent system based on their research paper “D-CIPHER Multi-Agent System”. While it served as a good starting point, a fundamental limitation of their framework is that a uniform strategy was applied across all challenge categories, which fails to account for the different approaches needed across different CTF domains. Intuitively, one should approach different challenges from different categories in vastly different manners.

Here, we summarize D-CIPHER’s framework:

1. Challenge Loading
   ↓
2. Environment Setup (Docker container)
   ↓
3. AutoPrompt Agent (optional)
   ├─ Generates custom initial prompt
   └─ Passes to Planner
   ↓
4. Planner Agent
   ├─ Receives challenge + (optional) custom prompt
   ├─ Creates high-level plan
   ├─ Delegates tasks to Executor
   └─ Receives summaries from Executor
   ↓
5. Executor Agent(s)
   ├─ Receives delegated task
   ├─ Executes actions (commands, reverse engineering, etc.)
   ├─ Returns summary to Planner
   └─ Can be instantiated multiple times
   ↓
6. Loop: Planner → Executor → Planner
   ↓
7. Termination Conditions:
   - Challenge solved (flag found)
   - Give up
   - Max cost exceeded
   - Max rounds exceeded
   ↓
8. Logging & Teardown

Naturally, a defining characteristic of good CTF teams is specialization through having category experts. This allows for an individual to master specific workflows that repeatedly pop up in CTF challenges and build strong intuition while accumulating specialized domain knowledge. Thus, our approach to Squid Agent reflects this principle. We create a set of complex multi-agentic systems, each specializing in one challenge category. Despite requiring a lot more time in integrating tool calls, we observe that this “specialization” framework improves significantly on the approach taken by D-CIPHER.

Design and Methodology

The core technology behind Squid Agent is built on Docker containerization combined with a custom agentic framework. The system’s effectiveness is derived from the number of integrated tool calls and an agentic architecture tailored to specialized agents. We initially implemented Squid Agent with the smolagents library, but after extensive testing and encountering a myriad of bugs, we migrated to a proprietary barebones framework developed for the US Cyber Team. That being said, our team is currently working on creating our own framework that is feature-intensive with a novel twist to the current agentic development model.

Upon challenge ingestion, an orchestration agent triages challenges and selects a specific agent system to use based on the category. Each subagent can instantiate child agents, maintain RAG systems for category-specific knowledge, and access tool calls specific to the challenge category. This system allows us to create very powerful agentic workflows that can deal with a large number of challenges and exploits but are specific enough that our agent system doesn’t break down at complexity of trying to solve “every challenge”.

Currently, our agentic systems are broken down to the following categories:

Reverse Engineering
Binary Exploitation
Web Exploitation
Cryptography
Forensics
Miscellaneous

JSON API

For each challenge, Squid Agent ingests a JSON file to instantiate a challenge environment. A sample JSON is described below:

{
  "challenge-id": {
    "path": "relative/path/to/challenge/directory",
    "category": "rev|pwn|crypto|web|forensics|misc",
    "year": "20xx", // optional
    "challenge":"badnamehere", // optional
    "event": "CSAWXXX" // optional
  }
}

Furthermore, we have the JSON file describing the challenge itself, for which we use the JSON format required by the CSAW Agentic Automated CTF:

{
	"name" : "badnamehere",
	"category": "rev|pwn|crypto|web|forensics|misc",
	"description": "Challenge Description",
	"files": "expected flag",
	"box": "service host",
	"port/iternal_prt": "service port" // optional
}

For Squid Agent to run on a challenge, we create a JSON object following the format above to input challenge information.

Agent Code

        # Rev manager agent
        system_prompt = self.get_prompt("rev_manager_prompt")
        self.agents['manager_agent'] = ToolCallingAgent(
            name="rev_manager_agent",
            description="",
            tools=[
            toolcalls1,
            toolcalls2
            ],
            model=LiteLLMModel(model_id="", api_key=self.api_key),
            managed_agents=[self.agents['binary_analysis_agent'], self.agents['script_dev_agent']],
            max_steps=30,
            planning_interval=None,
            final_answer_checks=[_reject_instructional_final],
        )
        # Here, we side load in the prompts and use our own definitions instead of using the default smolagents prompt.
        self.agents['manager_agent'].prompt_templates["system_prompt"] = system_prompt 

The code shown above represents how the agents are structured in Python. Each agent initializes a class with a variety of parameters that set up the environment for proper functionality. At the top of the code, we use a helper function called get_prompt to retrieve a file from its folder and return its buffer. After that, we initialize the main agent object, defined as ToolCallingAgent.

This initialization performs some basic setup, but the most important part is the tool call configuration. If you are too liberal with your tool calls, it can lead to issues. Maintaining a specific and well-curated tool call list has resulted in substantial success for our team, especially when combined with prompts that reference tool calls during specific stages of an agent’s execution.

Next, we define the model being used. Generally speaking, we use gpt-mini for sub-agents that handle trivial or simple tasks, while manager and complex verification agents are assigned gpt-5. You also need to define which sub-agents the system has access to; how they are used and executed is up to you, but their definition occurs here.

The max_steps variable is quite important in our codebase, as it defines the maximum number of steps an agent can take before termination. The next major aspect is final_answer_checks, where we can define specific validation loops for the code to use, ensuring that the agent verifies its outputs instead of returning them blindly.

Docker Environment

After Squid Agent is run, it creates a segregated docker network with DNS routing to allow for remote testing of challenges if you provide a docker container if the challenge has a remote submission field. If EMC mode is enabled, a create custom docker network is created per challenge.

IDA Pro Integration

One notable architectural constraint involves IDA Pro integration, as used for reversing/pwn challenges; as shown above, there is only a single IDA instance. Unfortunately, IDA Pro requires EULA acceptance before operation, which would require manual intervention. We attempted to spoof the EULA by pre-loading .ida config files in each Docker container, but this did not work after extensive debugging. Hence, our solution was to create a persistent IDA Docker environment, which segregates challenge files which may require decompilation which allows for concurrent access to IDA Pro between agents.

To run headless IDA, we use idat.exe , the text interface of IDA, coupled with custom IDA tooling scripts that accept arguments from the agentic system. During challenge initialization, the IDA container is never shut down. To ensure scalability, we tested this architecture with 200+ challenges in parallel without issue.

Subagent Design

Reverse Engineering

The Reverse Engineering (rev) agent uses a combination of static and dynamic analysis to solve challenges. We give it access to a debugging agent, a set of tool calls to pwndbg , which allows it to debug the binary live. If necessary, the agent can also clean the code via a combination of data flow techniques and AI cleaning. Furthermore, we allow the agent access to IDA Pro to decompile the binary, so that the agent can read C code instead of raw assembly.

After binary analysis is completed and the manager agent is satisfied, it uses a script development agent as necessary to solve the challenge.

Binary Exploitation

The Binary Exploitation (pwn) agent is built on the principle that successful exploitation requires a tight feedback loop between vulnerability identification, exploit development, and runtime validation. Unlike pure reverse engineering where understanding is the goal, pwn challenges demand working exploits that capture flags from live services. The system architecture reflects this by positioning the debugger under script_dev_agent rather than binary_analysis_agent, creating an iterative exploit refinement workflow. A dedicated code_review_agent validates exploit primitives before expensive remote testing, catching common mistakes like incorrect offsets or endianness issues. The system is designed for diverse pwn vectors including traditional binary exploitation and Python sandbox escapes through pyjail techniques.

Rather than comprehensive static analysis, the focus is on identifying exploitable vulnerabilities and writing minimal working exploits under 100 lines. The validation tools run_exploit_and_capture_flag and test_exploit_locallyare central to the workflow, enabling a test-diagnose-fix cycle with real feedback from target services. The IDA agent still provides decompilation to avoid raw assembly parsing, but the analysis is targeted at exploitation vectors rather than complete code understanding. The system handles practical CTF scenarios including Docker-based challenges and archive extraction, addressing varied challenge formats. The hierarchical agent structure with specialized roles (vulnerability analysis, exploit development, code review, debugging) creates a division of labor optimized for the exploit development lifecycle rather than general code comprehension.

Cryptography

The Cryptography (crypto) agent is designed for mathematical precision and adaptive problem-solving in CTF crypto challenges. Unlike binary exploitation or reverse engineering, crypto tasks often require symbolic computation and provable hypothesis testing before exploitation.

The architecture uses a dual-path model: complex challenges go through vulnerability analysis and validation via the criticize_agent, while simple encoding problems route directly to the guessy_agent for brute-force decoding. The criticize_agent also reformulates failed attacks (e.g., proposing custom lattice setups when LLL preconditions fail).

All scripts follow a strict four-stage cycle—write, run, interpret, review—to eliminate untested submissions. Agents are explicitly guided to adapt classic attacks to CTF cases with altered assumptions. OCR support enables handling of image-based or steganographic problems. The vulnerability agent models multi-stage attack chains and filters decoys, which tend to be more common compared to other categories. Sage is used for symbolic math, arbitrary precision arithmetic, and crypto primitives. In general, the system is designed to prioritize mathematical accuracy and attack creativity over execution speed, using GPT-5 for reasoning and GPT-5-mini for orchestration and validation.

Web Exploitation

The Web Exploitation (web) agent is based on a multi-stage vulnerability analysis approach that starts broad and narrows down to exploitable issues. The system first uses a CWE analysis agent to identify potential vulnerabilities based on the technologies and frameworks used in the application, creating a broad checklist without deep verification.

This list is then passed to a “vulnerability researcher” agent that performs detailed code analysis to determine which vulnerabilities are actually present and exploitable. When the agent finds potential issues, it delegates to a triage agent that confirms exploitability through actual testing and validation. Once vulnerabilities are confirmed, an exploit development agent creates theoretical exploit chains, and finally a script development agent implements working exploits with validation tools that provide critical feedback loops. The system emphasizes validation at multiple stages, using tools like run_exploit_and_capture_flag and test_exploit_locally to ensure exploits actually work before submission. Additionally, the script development agent has access to webhook tools for testing interactive web challenges that require callback mechanisms.

Forensics and Miscellaneous

Both agents for the forensics and miscellaneous (misc) category operate within a relatively simple system, each featuring a manager node with two layers. Because of the nature of these challenges, we design the agent to have a flat, straightforward structure with access to a wide range of useful tools needed across forensics, steganography, and guesswork. These two categories probably have the most growth potential in terms of optimization, but the random and open-ended nature of both makes that somewhat difficult.

One idea we wanted to experiment with was giving the misc agent access to the other agentic systems. Naturally, this significantly increases the cost of its setup, but it could significantly improve its solving capabilities.

Results

Using the CTFTiny benchmarking dataset, we were able to solve 92%, or 46/50 of the challenges. Of those, we were able to solve all of the given web and misc, as well as almost all of the pwn, rev, and crypto.

Squid Agent struggles mostly with challenges that one would consider to be “guessy”. For instance, the challenge rev/rox required arbitrarily brute forcing random hard coded data values to XOR a chunk of data—Squid Agent was only able to solve the challenge after the 3rd overhaul on the reversing agent with an improved RAG.

Four challenges were unable to be solved: one crypto, forensics, pwn, and reversing challenge. In general, these challenges are largely novel, for better or for worse. For instance, one interesting reversing challenge rev/maze required the solver to find a path through a self-modifying binary, where the maze is addresses from the binary itself. In comparison, one cryptography challenge required one to realize that the given code was actually secure, and the vulnerability lies in a brute-forcable key on the server—but there was no reason to believe this at all.

Our final submission to the CSAW Agentic Automated CTF challenge on October 15th solved 44/50 challenges in the dataset, in comparison to NYU Craken’s 35/50. After the submission, we continued to work on Squid Agent in preparation for the on-site finals but we pivoted our focus from the Agentic Automated CTF challenge to general CTFs which we hope can be used to aid in solving “easy” challenges quicker—for instance, we reworked the web agent to include a better white-box framework. Regardless, with the changes that was made after final submission, we were able to solve two more reversing challenges that were previously unsolved.

Here is a full breakdown of solves between Squid Agent (logs) vs Craken (logs) :

Category	Challenge	Squid Agent	Kraken
Crypto	Beyond-Quantum	Pass	Fail
Crypto	Collision-Course	Pass	Pass
Crypto	DescribeMe	Pass	Pass
Crypto	ECXOR	Pass	Fail
Crypto	Lupin	Pass	Pass
Crypto	open-ELLIPTI-PH!	Fail	Fail
Crypto	perfect_secrecy	Pass	Fail
Crypto	polly-crack-this	Pass	Pass
Crypto	super_curve	Pass	Pass
Crypto	The Lengths we Extend Ourselves	Pass	Pass
Crypto	hybrid2	Pass	Pass
Crypto	babycrypto	Pass	Pass
Forensics	1black0white	Pass	Pass
Forensics	whyOS	Fail	Pass
Misc	algebra	Pass	Fail
Misc	android-dropper	Pass	Pass
Misc	ezMaze	Pass	Pass
Misc	quantum-leap	Pass	Pass
Misc	showdown	Pass	Pass
Misc	Weak-Password	Pass	Pass
Pwn	baby_boi	Pass	Fail
Pwn	bigboy	Pass	Pass
Pwn	get_it?	Pass	Pass
Pwn	got_milk	Fail	Fail
Pwn	Password-Checker	Pass	Pass
Pwn	pilot	Pass	Pass
Pwn	puffin	Pass	Pass
Pwn	roppity	Pass	Fail
Pwn	slithery	Pass	Fail
Pwn	target practice	Pass	Pass
Pwn	unlimited_subway	Pass	Fail
Rev	A-Walk-Through-x86-Part-2	Pass	Fail
Rev	baby_mult	Pass	Pass
Rev	beleaf	Pass	Pass
Rev	checker	Pass	Pass
Rev	dockREleakage	Pass	Pass
Rev	ezbreezy	Pass	Pass
Rev	gibberish_check	Pass	Fail
Rev	maze	Fail	Fail
Rev	rap	Pass	Pass
Rev	rebug 2	Pass	Pass
Rev	rox	Pass	Fail
Rev	sourcery	Pass	Pass
Rev	tablez	Pass	Pass
Rev	the_big_bang	Pass	Fail
Rev	unVirtualization	Pass	Pass
Rev	whataxor	Pass	Pass
Web	poem-collection	Pass	Pass
Web	ShreeRamQuest	Pass	Pass
Web	smug-dino	Pass	Pass

While pulling an all-nighter the day before CSAW CTF Finals, we decided to fully revamp the rev agent by rewriting the prompts and revising the RAG database. Dudcom, Zia, Uvuvue, and Toasty got Squid Agent to solve almost all of the reversing challenges in the CTFTiny dataset with no challenge resets and got a pretty cool screenshot out of it, which showcases the dashboard for Squid Agent’s multi-challenge mode:

The Future of Squid Agent and CTFs?

We intend to benchmark Squid Agent against the complete 200-challenge dataset NYU_CTF_Bench to showcase our framework in reference to other systems. However, we note that CSAW challenges tend to have a difficulty distribution that require guess-based approaches rather than more traditional and principled problem solving challenges found in other CTFs. Hence, we have found it hard to optimize for this benchmark without lowering the overall performance of the system for traditionally complex challenges.

Since CSAW, we have used Squid Agent in live CTFs with reasonable success. At m0leCon CTF 2026 Qualifiers, a CTF with a 100 point CTFTime weight, Squid Agent solved crypto/Talor 2.1 , a 10-solve crypto challenge, as well as a VM reversing challenge. While we believe it has the potential to become a powerful system that can be run in parallel with our human players, it is a fact that a fully autonomous system will inevitable suffer from several issues.

For one, the larger and more complex a challenge is, the harder it is for Squid Agent to even begin starting it—simply finding the entry point of code or where to begin reversing in a large library becomes rather challenging for an agentic system to do. Storing long-term context is also a challenge, as we are limited by a context window that is often insufficient for larger CTF challenges. We also intend to implement a RAG system, which may be useful a more universal tool for Squid Agent.

Another limitation of Squid Agent in its current form is its limited information in more domain-specific techniques. For instance, something as simple as creating a FAISS vector RAG of how2heap and a RAG of solve templates could easily improve performance in heap challenges by quite a bit. At the end of the day, many CTF challenges tend to become a competition of knowing previous bugs/issues, and being able to use that to your advantage.

What we believe is truly interesting is trying to push the system past this limit by solving novel challenges, such as those which require finding zero-days. This would require a system to be able to identify that there are no configuration or usage bugs, and have the ability to crawl through public code bases to find vulnerabilities. We believe this may be possible, but this begins to bleed into the realm of autonomous vulnerability research systems, and will suffer from a myriad of challenges similar to those discussed by Theori’s AIxCC team when creating an AI-based Cyber Reasoning System. For now, while we believe that more difficult CTFs will be spared, beginner and medium difficulty CTFs will inevitably suffer from “AI one-shot” challenges.

Future Work

We plan to publish a comprehensive white paper upon completion of Squid Agent’s benchmarking with the complete 200-challenge dataset NYU_CTF_Bench, as well as a custom dataset that is more in line with the CTF standard seen in modern competitive CTFs. The paper will provide more detailed technical documentation of our agent architectures, tool calls, and insights from our development journey.

Credit/Team:

Dev team: dudcom, braydenpikachu, uvuvue, ziarashid, toasty3302, moron, appalachian.mounta1n Topic experts: ac01010, corg0, clovismint, quasar0147, vy

osu! CTF 2025

2025-10-30T00:00:00+00:00

Last weekend, my CTF team participated in osu!gaming CTF hosted by Project Sekai et al. and won first place. Given that I’ve played many rhythm games (and particularly osu!) in my undergrad, this CTF was particularly interesting to me. The challenges in this CTF were generally normal with the exception of many being rhythm game themed, and the addition of rhythm game challenge categories. We ended up solving all challenges except the one with the fewest solves, misc/barcade, written by my friend BrownieInMotion.

As a side note, it turns out that there is a fairly large overlap between the rhythm game community and the cybersecurity community, which is not particularly surprising but an interesting observation nonetheless.

For this blog, I’ll be discussing a postmortem writeup of misc/barcade. Although we were very close, we did not solve this challenge before the end of the CTF.

misc/barcade (2 solves),

Look at this ITG cabinet—it’s even running the latest itgmania version, 1.1.0. Custom songs are enabled too! Oh, but this barcade charges $2 for just one stage… my favorite chart doesn’t even appear in song selection because it’s too long. Can you put the machine into Event Mode so I can play it? https://instancer.sekai.team/challenges/barcade

First, we’re told that we’re running ITGmania 1.1.0. This is an open source fork of StepMania 5.1 with networking and quality of life improvements, mostly intended for arcade operators and hobbyists who want to mod the game.

We’re also told that custom songs are enabled. Importantly, this allows us to upload custom songs into the machine through a virtual USB drive.

Finally, we are (presumably) given that the flag is in his “favorite chart”, which you cannot select. The goal is then explicitly given; we wish to enter “Event Mode”, which is the equivalent of a free play in an arcade cabinet. Event Mode is typically enabled for special events (conventions, tournaments, etc.) hence its name.

For the challenge itself, we are met with an instancer (with a very annoying CAPTCHA) that gives us 15 minutes on the machine:

We have a standard set of controls, with the same buttons one would see on a physical cabinet. To play the game, DFJK keys are used, similar to 4-key mania. Also available is a virtual USB drive, already with some files, which we can upload to an arbitrary file location within the USB drive. This is the function we can use to upload custom songs.

Furthermore, in the welcome screen, we can see the current high scores scroll by. While there appears to be four songs, only the first three is selectable, which reflects the challenge description; we assume the fourth hides the flag:

There are many different ways to trigger Event Mode. What should have been the easiest is to simply change the EventMode setting. Hence, one of the first things we tried was to try to read the files already in the USB drive; these init files are described nicely at this blog by mmatt.net. Knowing this, we can change preferences arbitrarily by overwriting the desired file. For instance, we can change the default player name by overwriting ITGmania/Editable.ini with:

[Editable]
DisplayName=newname

This change is reflected upon finishing a song, with the score being attached to the user. Knowing this, we can attempt to overwrite the preference file to enable event mode. Looking at source code, we see:

class PrefsManager
{
public:
...
	Preference<bool>	m_bEventMode;
}

with the file being loaded from Preferences.ini. Hence, we should be able to overwrite ITGmania/Preferences.ini with

[Options]
EventMode=1

but this does not work. It turns out that this file is only read on startup, as shown in Stepmania.cpp:

int sm_main(int argc, char* argv[]) {
  ...
  PREFSMAN	= new PrefsManager;
  ...
  PREFSMAN->ReadPrefsFromDisk();
  ...
}

This is bad; while we can clearly change the preference, ReadPrefsFromDisk() is only run at startup, and we cannot restart the machine in the instancer without a complete reset.

Next, we realized that you could also enable Event Mode through the debug menu. There’s a couple ways to do this; you can also change it in Preferences.ini (suffering the same problem as before), or you can click F3 to enable debug mode. This isn’t possible, as we don’t have such a button on the machine. We briefly considered remapping keybinds, but this also requires entering the debug menu, so this is not possible.

Our final idea was to enable Event Mode by triggering a Lua function that would do this, by writing a custom Lua file that would be triggered to update the preference. This is a natural choice, as we already have a place to upload any file we wish, and we just have to figure out a way to trigger the file. The file would just be one line that looked like this:

PREFSMAN:SetPreference("EventMode", true)

To help with this, Lloyd vibe coded an uploader which uses the /api/upload endpoint in the instancer to upload a directory of files:

"""Upload local files to a remote endpoint with base64-encoded payloads."""

from __future__ import annotations

import argparse
import base64
import json
import sys
from pathlib import Path
from urllib import error, request


def build_endpoint(base_url: str) -> str:
    """Return the upload endpoint derived from the provided base URL."""

    return base_url.rstrip("/") + "/api/upload"


def iter_files(root: Path):
    """Yield all files under the root directory, traversing recursively."""

    for path in root.rglob("*"):
        if path.is_file():
            yield path


def encode_file(path: Path) -> str:
    """Return the base64 representation of the file contents."""

    data = path.read_bytes()
    return base64.b64encode(data).decode("ascii")


def post_json(endpoint: str, payload: dict[str, str]) -> bytes:
    """Send JSON payload to the upload endpoint and return the raw response."""

    body = json.dumps(payload).encode("utf-8")
    req = request.Request(
        endpoint, data=body, headers={"Content-Type": "application/json"}
    )
    with request.urlopen(req) as resp:
        return resp.read()


def upload_directory(directory: Path, base_url: str) -> None:
    """Iterate through files and upload each one individually."""

    endpoint = build_endpoint(base_url)
    for file_path in iter_files(directory):
        relative_path = file_path.relative_to(directory).as_posix()
        payload = {
            "filename": relative_path,
            "content": encode_file(file_path),
        }

        try:
            post_json(endpoint, payload)
        except error.HTTPError as exc:
            print(f"Failed to upload {relative_path}: HTTP {exc.code}", file=sys.stderr)
            continue
        except error.URLError as exc:
            print(f"Failed to upload {relative_path}: {exc.reason}", file=sys.stderr)
            continue

        print(f"Uploaded {relative_path}")


def parse_args(argv: list[str]) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument(
        "directory", type=Path, help="Path to the directory containing files to upload"
    )
    parser.add_argument("base_url", help="Base URL of the remote service")
    return parser.parse_args(argv)


def main(argv: list[str] | None = None) -> int:
    args = parse_args(argv or sys.argv[1:])
    directory = args.directory.expanduser().resolve()

    if not directory.is_dir():
        print(f"Error: {directory} is not a directory", file=sys.stderr)
        return 1

    upload_directory(directory, args.base_url)
    return 0


if __name__ == "__main__":
    sys.exit(main())

We can thus upload a whole directory of files by running python3 uploader.py ./upload https://barcade-......-instancer.sekai.team/.

To inject a custom Lua file, we first tried to overwrite files in the ITGmania codebase. Background animations and other scripts are implemented as Lua scripts and are run very often when playing the game, so the idea is to overwrite one of these scripts so that the Event Mode preference is updated when it is run. It turns out that this doesn’t work, as the codebase itself is separate from the filesystem of the virtual USB drive we are uploading to.

Thus, we are limited to writing custom Lua files within the USB drive; the path must also be a custom string that is referenced somewhere in a file we upload.

Reading the codebase, I came across just one place where we can reference arbitrary Lua code. In a song’s chart file (.sm), Lua files are referenced to change the foreground/background color of the game when it is played over time. Documentation is described here:

The BGCHANGES line in a simfile is used to control what backgrounds are loaded by the simfile and when they appear.

The data is between the colon and the semicolon.
Each entry is separated from the next by a comma.
Each entry is composed of 1 to 11 values separated by equals.
The meanings of the values are as follows:
1. start beat 
2. file or folder name // <- important
...

Thus, we should be able to upload a custom Lua file and refer to it in a custom song, so that it is run when it is loaded. To upload a custom song, we have to upload it under ITGmania/Songs/name; to test this, we steal a song and chart from a database of StepMania songs online:

Our custom song is Torn (under the PlayerOne custom songs folder), and there are three songs loaded by default under ITG. We can verify by playing the song, which loads successfully. The following files are uploaded:

ITGmania/Songs/Torn/Torn-bg.png
ITGmania/Songs/Torn/Torn.ogg
ITGmania/Songs/Torn/Torn.sm
ITGmania/Songs/Torn/torn-bn.png

To execute arbitrary Lua code, we wish to change the .sm chart file. We see the following:

#TITLE:Torn;
#SUBTITLE:;
#ARTIST:Natalie Browne;
...
#SELECTABLE:YES;
#BPMS:0.000=128.019;
#STOPS:;
#BGCHANGES:;

During the competition we tried different ways to update BGCHANGES, but we could not get any code to run, even though it worked locally. A couple minutes before the end of the CTF, Brayden ended up realizing that the songs were cached, so even if we modify the new song, the Lua file would not be found, but we did not have time to figure out the mount point:

After the competition, we learned that the mount point is described in MemoryCardManager.cpp (duh):

static const RString MEM_CARD_MOUNT_POINT_INTERNAL[NUM_PLAYERS] =
{
	// @ is important; see RageFileManager LoadedDriver::GetPath
	"/@mc1int/",
	"/@mc2int/",
};

Hence, we just had to traverse to this mount point to access the virtual USB drive’s custom Lua file. Doing this is actually very easy; we have to modify the song chart file above, following the format as described in bgchanges_format.txt. We use one entry, with the second value in the entry pointing to the Lua script. Other values are delimited by a =.

#TITLE:Torn;
#SUBTITLE:;
#ARTIST:Natalie Browne;
...
#SELECTABLE:YES;
#BPMS:0.000=128.019;
#STOPS:;
#BGCHANGES:0.000=../../@mc1int/ITGmania/Songs/Torn/script.lua=1.000=0=0=1=====;

and include script.lua in the song upload:

PREFSMAN:SetPreference("EventMode", true)

After uploading the song, we can trigger the script by simply playing the song Torn. When the song completes, we are not met with a game over screen, and instead see the song select screen:

Finally, we can play the song:

Thus the flag is osu{6d86d59bb9d27121}, and we are done.

CyberBay CTF Finals 2025

2025-10-17T00:00:00+00:00

This week I attended CyberBay 2025 in Tampa, Florida for the CTF finals! We qualified two teams through the various qualifier CTFs in September and ended up sending two more members through a merger with the University of Toronto’s CTF team Go Go Squid. My team won the grand prize of $20,000 and Go Go Squid placed 4th, winning $2500!

Tuesday, October 14th

After most of the team arrived on Monday afternoon, we spent most of Tuesday attending the first day of the conference. I spent a lot of time in the hacker village at the vendor pavillion, which featured some “practice” CTF challenges that they claimed would be similar to the CTF challenges during the main event. Most of these were pretty interesting; one company had physical exploitation challenges which required lockpicking and cracking a safe. MetaCTF also brought with them a challenge that demonstrated a blue boxing attack on an old telephone switch.

It was also announced late on Tuesday the format and rules of the King of the Hill (KotH) component of the CTF, which would be worth 20% of the score (with the other 80% coming from the jeopardy CTF). Andrew and Brayden, two members of our team, ended up staying up quite late to prepare automation for the next day. None of us particularly enjoyed the format of the KotH, which felt flawed. For instance, each VM would reset after every 20 minutes, which not only discourages defence, but also encourages timing attacks for right after resets.

That night, we spent a while discussing strategy and decided that Nathan and I would stay downstairs to do the jeopardy portion while Andrew and Brayden would do the KotH portion.

The conference also held a cool drone show on a river cruise that evening.

Wednesday, October 15th

The CTF started at 9:30am and ended at 2:30pm, which made the competition very short. We ended up mostly abandoning the KotH at the end and rushing the CTF. We had assumed the jeopardy portion would be easier than it was (implying the KotH would be the tiebreaker) but realized late that we needed a bigger lead on the jeopardy, since there were a handful of unsolved challenges left at the end.

Here’s a photo from the awards ceremony which includes both of our teams and the UofT team:

After the awards ceremony we went to Ybor City and observed chickens crossing the road:

We also made a reservation at the Michelin starred restaurant Rocca to celebrate the win:

The Tampa riverwalk is also quite nice at night:

Other remarks

As a side note, none of us really know how to split the prize money so that taxes are minimized. We’re trying to set up a 501(c)(3) under the team name which would have allowed us to receive the prize money through the nonprofit, but the government is currently shutdown, so we can’t do that.

The next on-site we’ll be attending is CSAW CTF Finals, hosted by NYU’s OSIRIS Lab. We qualified a CTF team and also qualified for the Agentic Automated CTF, which involves writing an LLM agent to ingest and solve CTF challenges.

I learned how to solve the Rubik’s Cube blind

2025-10-08T00:00:00+00:00

I learned how to solve the Rubik’s cube blindfolded this week! The method I used is the Old Pochmann technique, which is the simplest to learn.

With this method and all 3BLD techniques, all of the work is done beforehand during the memorization phase. The idea is relatively straightforward; instead of memorizing permutations of edges and corners, one memorizes the set of cycles that make up the permutation. Each edge and corner sticker is assigned a letter A-X, so the memorization consists of only ~20 letters.

During execution, you perform swaps between edges/corners to perform the cycles. Two algorithms are needed (a T-perm and a Y-perm) for swapping edges and corners respectively.

This took me about 6 hours to learn over the course of 2 days to get my first success.

In other news

Last week I went to San Diego to visit my brother; here’s some pictures from the trip.

Penasquitos Creek Waterfall

La Jolla Cove

On Trivalue Oddagons

2025-09-29T00:00:00+00:00

One of my friend groups has been getting into Sudoku recently. Sudoku sort of reminds me of Minesweeper in the sense that a lot of the work comes from discovering logical implication chains to prove the existence/nonexistence of a number. I’ve been going through some strategies on this Sudoku campaign which is nice for formalizing the strategies that one typically derives when playing Sudoku (i.e. “Hidden Singles”, “Naked Pairs”, etc.).

Yesterday I was made aware of this one pattern called the trivalue oddagon (also known as Thor’s Hammer, shortened as tridagon) which supposedly is this sort of gimmick pattern that artifically inflates the computer-analyzed difficulty of puzzles. I would imagine it has to do with how this pattern cannot be directly solved on a computer with anything short of bruteforce searches to analyze this pattern. Interestingly, because of its distinctiveness, it’s not too difficult to spot by a human, though my understanding is that it doesn’t come up in puzzles very often at all. Here’s how it looks, courtesy of my friend Karo:

The next step in this puzzle is to deduce that the cell in Row 8 Col 9 (candidates marked $\{1,3,5,7\}$) must be a $1$, because if it were not, this structure (i.e., assigning 3, 5, 7 in these four boxes) would not be possible.

This logic is not very clear to me, so I’ve attempted to prove it below. To do this, we’ll first describe a set of patterns on the Sudoku board that can be represented as a subgraph $G$. Then, we shall show that $G$ is not $4$-colorable.

Formalizing the Trivalue Oddagon Board State

Consider four boxes, WLOG contiguous. Name these boxes $A,B,C,D$ in row-major order. We’ll also assume that the three numbers remaining are $\{1,2,3\}$. Clearly, each box contains a permutation of $\{1,2,3\}$.

Here, the number represents the row (from bottom to top, i.e. with the intuition that the numbers are “ascending”) within its box. We can read each box left-to-right. Hence, the permutations above, described in row-major order, would be $A=(1,2,3), B=(2,3,1), C=(3,1,2), D=(1,3,2)$.

Furthermore note the parity of the permutation, $\sigma$, which is the parity of the number of inversions of each element. There are $0,2,2,1$ inversions respectively, so the parities are $0,0,0,1$.

Note that the parity of each element corresponds to exactly whether or not the box is increasing left-to-right (i.e. wraps up and to the right). If it is even, it is ascending; if it is odd, the element is descending.

Hence, the claim is this: if exactly three of $\{A,B,C,D\}$ are ascending or exactly three are descending, there exists no possible numbering/coloring of this structure.

First, we wish to show a lemma; every odd permutation in $S_3$ has exactly one fixed point. To see this, the odd permutations in $S_3$ are precisely the three transpositions: $(1,2)$, $(1,3)$, and $(2,3)$. Each transposition swaps two elements while leaving the third fixed. Thus, every odd permutation in $S_3$ has exactly one fixed point.

Now, consider the parity of $A\circ B\circ C\circ D$. If it is odd, there is exactly one fixed point.

If exactly three of $\{A,B,C,D\}$ are ascending or exactly three are descending, we see that $A\circ B\circ C\circ D$ is also odd. Thus, there exists exactly one fixed point.

Here, observe that the permutations of three even and one odd element produces exactly one fixed point, highlighted as the red rectangle. Furthermore, note that the other two points are not fixed. Hence, they cannot form an square, so it must be an $8$-cycle, highlighted in the purple dotted lines.

This turns out to be exactly what we need to represent it as a graph.

Graph Representation

The Sudoku is a famous instance of coloring problem in graph theory. Vertices are the cells of the Sudoku, and edges occur between two vertices if and only if the two cells cannot be the same number. Hence, we wish to represent the trivalue oddagon as a graph.

In each box, the three cells may never be the same number, so they form a complete graph with $3$ vertices (i.e. a $K_3=C_3$).

Consider this graph as an example. This graph is messy, but we can represent it more nicely be describing it as the isomorphism of a more well-defined graph. Specifically, consider how it is exactly one edge swap away from $C_3\square C_4$, i.e. the cartesian product. To see the edge swap, swap the edges as shown below, deleting the dotted lines highlighted in red and adding the lines in black. The vertices affected in the swap are always the two vertices that are not fixed.

This amounts to $C_3\square C_4$ exactly; it’s left as an exercise to the reader to double check the isomorphism if needed.

Let $G=C_3\square C_4$ and $G'$ be an edge-swapped $G$, which we will rigorously define later. While the chromatic number $\chi(G)=3$, it turns out that $\chi(G')=4$, which is the reason why the trivalue oddagon structure cannot be possible with three numbers.

To show this, we wish to label the vertices of $G'$. Let $(a,b)$ be the vertex in $G'$ such that it is the $a$th vertex in $C_3$ and the $b$th vertex in $C_4$. Hence, $a\in[0,2], b\in[0,3]$. Thus, by definition of a Cartesian product,

\[e(G)=\{((a_0,b_0),(a_1,b_1))\mid a_0=a_1 \text{ or }b_0=b_1\}\]

We also define $G'$ by swapping edges $((0,0),(3,1))$ and $((0,1),(3,0))$, such that the new edges are $((0,0),(3,0)),((0,1),(3,1))$. Then we wish to show $\chi(G')> 3$.

Lower-bounding $\chi(G')$

To do this, we introduce a graph theoretic intuition of “coloring parity” on each $C_3$. Each $C_3$ must be colored with three numbers (say, $\{1,2,3\}$).

Then for a 3-coloring $f: \{(a,0), (a,1), (a,2)\} \to \{1,2,3\}$ of the $a$-th copy of $C_3$ in $G'$, we say the coloring is ascending if $f(a,i) \equiv f(a,0) + i \pmod{3}$ for all $i \in \{0,1,2\}$. Otherwise, we say it is descending.

Note that any valid 3-coloring of $C_3$ must be either ascending or descending, which forces the coloring to be a cyclic permutation in one direction or the other.

Now, consider the $0$th and $1$st $C_3$. We claim they must have the same parity. Suppose for contradiction that the $0$th is ascending with $f(0,i) = c + i$ (mod 3) and the $1$st is descending.

In $G'$, the adjacencies between these copies are (unchanged from $G$):

$((0,0), (1,0))$, so $f(1,0) \neq c$
$((0,1), (1,1))$, so $f(1,1) \neq c + 1$
$((0,2), (1,2))$, so $f(1,2) \neq c + 2$

For the $1$st $C_3$ to be descending with $f(1,i) = f(1,0) - i$ (mod 3):

If $f(1,0) = c + 1$: then $f(1,1) = c$, contradicting $f(1,1) \neq c + 1$
If $f(1,0) = c + 2$: then $f(1,1) = c + 1$, contradicting $f(1,1) \neq c + 1$

Therefore, adjacent $C_3$ copies must have the same parity. By the same argument, all four $C_3$ copies in the unmodified portion of $G$ must have the same parity.

Now, consider the edges between the $0$th and the $3$rd $C_3$, which has been edge swapped. With the swapped edges $((0,0),(3,1))$ and $((0,1), (3,0))$, if both are ascending, we see $f(3,0) \neq f(0,1) = c+1$, so $f(3,0) \in \{c, c+2\}$.

Consider the case $f(3,0)=c$, $f(3,2)=c+2=f(0,2)$, a contradiction. Furthermore consider the case $f(3,0)=c+2$; then $f(3,1)=c+0=f(0,0)$, a contradiction. WLOG we can see that this coloring is not possible if both are descending. Hence, the $0$th and the $3$rd $C_3$ must not share the same parity.

This is a contradiction, as we have shown $0$th and the $3$rd $C_3$ have the same parity above. Thus, $\chi(G')\not\leq 3$. This shows that any trivalue oddagon cannot be filled with $3$ numbers, and we are done.

Applying the Trivalue Oddagon

Consider the example described earlier:

Here, consider the cell in Row 8 Col 9. Suppose for contradiction it isn’t $1$. Since we have shown this trivalue oddagon not to be three-colorable, we have a contradiction immediately, and we can conclude that cell must be a $1$.

CTFing updates, and a writeup

2025-09-22T00:00:00+00:00

Hello!

I’ve been meaning to start a blog for a while but never got around to it; this will probably be a place where I just write about personal updates and random things inconsistently.

Recently I’ve been getting back into CTFs with Squid Proxy Lovers. I used to play a lot in high school and occassionally with GreyHat, Georgia Tech’s CTF team, but it’s been a while since I was seriously active, and I wanted to find a more competitive team. One of my cointerns at Trail of Bits this summer invited me to the CTF team that he captains, so I figured it was a good opportunity to get back into it.

Last week we played in CSAW Qualifiers (2nd) and CyberBayCTF Qualifiers (2nd), both for on-site finals in October. This week we casually played CrewCTF. CrewCTF had a lot of high-quality challenges, so I figured I would do a writeup on a challenge I did here.

misc/Bytecode Bonanza

misc/Bytecode Bonanza was a two-part challenge revolving around writing Python bytecode in a limited instruction set. We’re given a set of three challenges which go through this filter function:

def create_function(parameters, prompt):
  bytecode = bytes.fromhex(input(prompt))
  
  if len(bytecode) > 512:
    print("Too long")
    exit()
  
  opcodes = [bytecode[i*2] + bytecode[i*2+1]*256 for i in range((len(bytecode)+1) // 2)]
  
  allowlist = [ 0x0001, 0x0004, 0x0006, 0x000f, 0x0017, 0x0190 ] + [0x0073 + i * 512 for i in range(128)]
  
  if any([op not in allowlist for op in opcodes]):
    print("Illegal opcode")
    exit()
  
  preamble = b"".join([bytes([0x7c, i]) for i in range(parameters)])
  
  code = preamble + bytecode + bytes([0x53, 0])
  
  dummy = dummies[parameters]
  
  dummy.__code__ = dummy.__code__.replace(co_code=code,co_stacksize=1000000000)
  
  return dummy

The goal is to input bytecode as hex, which is injected into a dummy function’s __code__ attribute. We can observe that the main restriction is a 512 length upper bound, and that the bytecodes we can use are in a very limited instruction set in allowlist. After doing some research we can observe the following behaviors:

0x01: POP_TOP, pops the top of stack (TOS)
0x04: DUP_TOP, duplicates the TOS
0x06: ROT_FOUR, rotates TOS down into 4th-to-top position and 2nd-4th members up one position
0x0f: UNARY_INVERT, bitwise inverts TOS as -(x+1)
0x17: BINARY_ADD, pops top 2 elements off stack and pushes their sum
0x190: EXTENDED_ARG, only used with POP_JUMP_IF_TRUE to add 256 to the address jumped to, since we can write 512 bytes of bytecode
0x73, 0x273, …, 0xfe73, POP_JUMP_IF_TRUE, limited to even offsets 0-255 since each opcode is 2 bytes long

We also note that Python bytecode is stack-based, and our arguments are fed in order dummy(a,b,c) -> [a,b,c]. For sake of notation we’ll assume the square arrays shown above grow left to right.

Tooling and Setup

The first thing to do is to make a debugger for testing bytecode. I vibe coded a quick debugger using dis to parse my bytecode:

============================================================
BYTECODE ANALYSIS
============================================================
Hex: 040004000f0017001700040006000600060004000600060006...
Length: 470 bytes

Opcodes (reading as 16-bit little-endian):
  Offset   0: 0x0004 (    4) DUP_TOP
  Offset   2: 0x0004 (    4) DUP_TOP
  Offset   4: 0x000f (   15) UNARY_POSITIVE
  Offset   6: 0x0017 (   23) BINARY_ADD
  Offset   8: 0x0017 (   23) BINARY_ADD
  Offset  10: 0x0004 (    4) DUP_TOP
  Offset  12: 0x0006 (    6) ROT_FOUR
  Offset  14: 0x0006 (    6) ROT_FOUR
  Offset  16: 0x0006 (    6) ROT_FOUR
  Offset  18: 0x0004 (    4) DUP_TOP
  ... (225 more opcodes)

Full function bytecode (with preamble and return):
  7c 00 7c 01 7c 02 04 00 04 00 0f 00 17 00 17 00 04 00 06 00 06 00 06 00 04 ... 04 00 73 26 01 00 01 00 53 00

============================================================
DISASSEMBLY (first 20 lines)
============================================================
 12           0 LOAD_FAST                0 (a)
              2 LOAD_FAST                1 (b)
              4 LOAD_FAST                2 (c)
              6 DUP_TOP
              8 DUP_TOP
             10 UNARY_INVERT
             12 BINARY_ADD

Since I intended to be writing raw bytecode, this is pretty useful for debugging typoes.

Making Gadgets

The CTF challenge was split into two parts; the first asks you to implement (1) subtraction, (2) return constant, and (3) multiply, and the second asks you to implement RSA/modular exponentiation.

Here are some miscellaneous observations about the opcode set:

Stack movement is very annoying. Only the top element is easily accessed, everything else can only be accessed with ROT_FOUR, which destroys the stack below it (needing to be restored)
We can only jump if true (nonzero)

My intuition about this challenge is that we need to very neatly describe a stack invariant and preserve it meticuously. This is very necessary for jumps. To do this, I write a set of well-defined gadgets that have specific functions.

04000f001700040017000f00: Replaces TOS with 1
04000f0017000f00: Replaces TOS with 0 (admittedly very nice solution)
040004000f001700040017000f001700: Add 1 to TOS
0f0017000f0004000f0004000f00170017000f00: Pop off [A,B] from TOS and pushes A-B
040004000f00170017000f00: Negate TOS
04000400060006000600040006000400040006000100010001000600 Duplicate [A,B] from TOS, resulting in [...A,B,A,B]. This is very useful in certain applications, despite the bytecode being long.

Hence, subproblem 1 (subtraction) can be solved with a gadget above.

Return 1337 as Constant

Subproblem 2 (return 1337) is slightly more complicated, but not particularly hard. We observe bin(1337)=10100111001, so we build up the number from MSB to LSG. We can double the TOS by writing 04001700 (duplicate, then add), so it becomes trivial to return any constant. We write:

04000f001700040017000f00 # Replace TOS with 1
04001700 # Double TOS
04001700 # Double TOS
040004000f001700040017000f001700 # Add 1 to TOS
04001700 # etc ...
04001700
04001700
040004000f001700040017000f001700
04001700
040004000f001700040017000f001700
04001700
040004000f001700040017000f001700
040017000400170004001700
040004000f001700040017000f001700

to complete part 2.

Multiplication

This is where things get interesting, since we have to manipulate the instruction pointer to loop now. We can consider many multiplication algorithms, but since the numbers are small, we can naively add integer A to itself B times. Then the pseudocode is as follows:

def mult(a,b):
  sum = a
  while 1:
    b -= 1
    if b = 0:
      return sum
    sum += a

In implementation, the while loop is one jump instruction. Since our jumps can only be if true/nonzero, the if statement can only check whether b is nonzero. If it is nonzero, we wish to jump to sum += a; if it is not, the next instruction should jump to a return statement at the end.

Furthermore, we must consider setting up a stack invariant. After some experimentation I end up with [X, sum, a, b], where X is garbage. X is included since rotating the stack occurs four at a time. Hence, this is the code I came up with:

040004000600060006000400060006000600 # Setup stack invariant
040004000f0017001700 # b -= 1
04007334 # If b is nonzero, skip the following line and jump to (A)
040004000f00170004001700 0f00734e # If it is zero, we return sum and jump to end
0600040006001700040006000600010006000600060004007316 # (A). Add S += a
010006000600010001000100 # Destroy stack  

RSA and Modular Exponentiation

This gets very messy, and there’s a couple of more gadgets we need to implement. Exponentiation without modular arithmetic is easy, since it is just the next hyperoperation after multiplication (which we already implemented), so the algorithm would be exactly the same. However, taking a modulo is surprisingly difficult with this instruction set. Consider the classical algorithm; a % b is computed by subtracting b until a < 0 (and you take the least nonnegative value). However, given the only jump statement we have, we have nowhere to derive sign of an integer from.

The solution that we end up with abuses the limited input size. p,q <= 100, so we can simply test equality of all positive integers to see if a number is positive. Thus, we implement a comparator as follows:

040004000f001700040017000f000400170004001700040017000400170004001700040017000400170004001700040017000400170004001700040017000400170004001700040017000400170004001700040017000400170004001700
040004000f0017001700 04000400060006000600040006000400040006000100010001000600
0f0017000f0004000f0004000f00170017000f00 73 b0
04000f001700040017000f00 0400 73 bc
0400 73 b8
0400 73 bc
0400 73 64

We upper bound the integer at $2^{14}$, hence the long first line. It pushes 1 to the TOS if nonnegative and 0 to the TOS if negative. This is slightly suboptimal but easy to work with as a primitive.

Next, we implement the modulus operation. This takes the comparator and repeatedly subtracts until it is less than 0, then takes the least nonnegative value:

04000400060006000600040006000400040006000100010001000600
0f0017000f0004000f0004000f00170017000f00

04000400060006000600040006000400040006000100010001000600010004001700040017000400170004001700040017000400170004001700
040004000f0017001700 04000400060006000600040006000400040006000100010001000600
0f0017000f0004000f0004000f00170017000f00 73 bc
04000f001700040017000f00 0400 73 c8
0400 73 c4
0400 73 c8
0400 73 70 
73 ce

0100 73 e2
04000400060001000600060006000100 0f00 73 06

Here we also introduce a stack invariant in the first line of the code. I also change the upper bound from $2^{14}$ to $128pq$, which introduces a small speedup to the code and saves many bytes from the line that pushes $2^{14}$.

Then the final payload is as follows:

040004000f0017001700
04000600060006000400060006000600010006000600
040004000f0017001700 
0400 73 3a
0400 9001 73 d8

06000600 040004000600060006000400060004000400060001000100010006000100 # Multiplication
040004000f0017001700 040004000600060006000400060006000100 
040004000f0017001700 040073 8a
0400 73 a6
0600040006001700040006000100 040004000f0017001700 0400 73 78
0600 0600 0600 0100 0100 0100
060006000600
04000400060006000600040006000400040006000100010001000600 

04000400060006000600040006000400040006000100010001000600 # Modulus
0f0017000f0004000f0004000f00170017000f00
04000400060006000600040006000400040006000100010001000600010004001700040017000400170004001700040017000400170004001700

040004000f0017001700 04000400060006000600040006000400040006000100010001000600 # Comparator
0f0017000f0004000f0004000f00170017000f00 9001 73 8e
04000f001700040017000f00 0400 9001 73 a0
0400 9001 73 9a
0400 9001 73 a0
0400 9001 73 3e
9001 73 aa
0100 9001 73 be
04000400060001000600060006000100 0f00 73 d4

04000600060004000600010001000100060006000600
0400  73 26
01000100

and we are done.

Allen Chang

Actively Understanding the Dynamics and Risks of the Threat Intelligence Ecosystem

Project Summary

Background and Motivation

Research Questions

Measurement Pipeline

Methodology and Scale

Key Findings

Recommendations

My Contributions

Malware Design

Inducing Dynamic Execution and a Malicious Verdict

Tracking Binary History and Creating a Provenance Trail

Fingerprinting Sandboxes

Rocket/Satellite Implementation

Observer Design

Miscellaneous

Ethical Considerations and Response

Being Found in Another Paper lol

Acknowledgements

Squid Agent - A Multi-Agent CTF Auto Solver

Squid Agent - A Multi-Agent CTF Auto Solver

This blog post is a repost from the blog post I co-authored that is posted at the Squid Proxy Lovers Blog.

Abstract

The Birth of Squid Agent

Design and Methodology

JSON API

Agent Code

Docker Environment

IDA Pro Integration

Subagent Design

Reverse Engineering

Binary Exploitation

Cryptography

Web Exploitation

Forensics and Miscellaneous

Results

The Future of Squid Agent and CTFs?

Future Work

Credit/Team:

osu! CTF 2025

misc/barcade (2 solves),

CyberBay CTF Finals 2025

Tuesday, October 14th

Wednesday, October 15th

Other remarks

I learned how to solve the Rubik’s Cube blind

In other news

On Trivalue Oddagons

Formalizing the Trivalue Oddagon Board State

Graph Representation

Lower-bounding \(\chi(G')\)

Applying the Trivalue Oddagon

CTFing updates, and a writeup

misc/Bytecode Bonanza

Tooling and Setup

Making Gadgets

Return 1337 as Constant

Multiplication

RSA and Modular Exponentiation