
[Bugfix] Use Thread Context ClassLoader for user class loading #514

Merged
xintongsong merged 2 commits into apache:main from klaudworks:fix-classloader-tccl
Feb 4, 2026
Conversation

@klaudworks
Contributor

@klaudworks klaudworks commented Feb 3, 2026

Linked issue: #515

Summary

This PR fixes a ClassNotFoundException thrown when loading user-defined resource classes (e.g., custom ChatModel implementations, in my case AzureOpenAIChatModelSetup) from user JARs uploaded via the REST API.

Problem

When flink-agents-dist.jar is deployed in /opt/flink/lib (which is required), it is loaded by the System ClassLoader. User JARs uploaded at runtime are loaded by Flink's User ClassLoader.

The existing code calls Class.forName(className), which resolves the class through the caller's classloader (here, the System ClassLoader). The System ClassLoader cannot see classes loaded by its child classloaders, so the lookup fails with ClassNotFoundException.

Solution

Use the Thread Context ClassLoader (TCCL) instead:

Class.forName(className, true, Thread.currentThread().getContextClassLoader())

Flink sets the TCCL to the User ClassLoader before executing user code, making user-defined classes accessible to framework code.
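
As a rough illustration of the lookup described above, here is a minimal sketch. The names `UserClassLoading`, `loadUserClass`, and `RecordingClassLoader` are hypothetical and not taken from the flink-agents code base. The recording loader only stands in for Flink's User ClassLoader so that the example can show the TCCL being consulted, and the fallback to the helper's own classloader when no TCCL is set is an assumption rather than something this PR does.

```java
import java.util.ArrayList;
import java.util.List;

final class UserClassLoading {

    /** Hypothetical helper: resolves a class by name through the thread context classloader. */
    static Class<?> loadUserClass(String className) throws ClassNotFoundException {
        ClassLoader tccl = Thread.currentThread().getContextClassLoader();
        // Assumption (not part of the PR): fall back to this class's own loader if no TCCL is set.
        ClassLoader loader = (tccl != null) ? tccl : UserClassLoading.class.getClassLoader();
        return Class.forName(className, true, loader);
    }

    /** Stand-in for a user classloader: records every class name it is asked to load. */
    static final class RecordingClassLoader extends ClassLoader {
        final List<String> requested = new ArrayList<>();

        RecordingClassLoader(ClassLoader parent) {
            super(parent);
        }

        @Override
        protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
            synchronized (requested) {
                requested.add(name);
            }
            // Delegate to the parent chain, as a normal child classloader would.
            return super.loadClass(name, resolve);
        }
    }

    public static void main(String[] args) throws Exception {
        Thread thread = Thread.currentThread();
        ClassLoader original = thread.getContextClassLoader();
        RecordingClassLoader userLoader = new RecordingClassLoader(original);
        thread.setContextClassLoader(userLoader);
        try {
            Class<?> clazz = loadUserClass("java.util.ArrayList");
            System.out.println(clazz.getName() + " requested via TCCL: "
                    + userLoader.requested.contains("java.util.ArrayList"));
        } finally {
            // Always restore the previous TCCL so later code on this thread is unaffected.
            thread.setContextClassLoader(original);
        }
    }
}
```

In a real deployment the TCCL would be Flink's User ClassLoader, so a class that exists only in an uploaded JAR would resolve through the same call.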

Testing

Validated in a Flink 1.20 cluster with flink-agents, successfully loading custom AzureOpenAIChatModelSetup classes from user JARs uploaded via the REST API.

Docs

  • doc-needed
  • doc-not-needed
  • doc-included

This fixes ClassNotFoundException when loading user-defined resource
classes (e.g., custom ChatModel implementations) from user JARs uploaded
via the REST API.

The issue occurs because framework code in /opt/flink/lib is loaded by
the System ClassLoader, which cannot see classes in the User ClassLoader
(child classloader for uploaded JARs). By using the Thread Context
ClassLoader (TCCL), which Flink sets to the User ClassLoader before
executing user code, the framework can now properly load user-defined
classes.

Files updated:
- JavaResourceProvider: main resource instantiation
- JavaSerializableResourceProvider: serializable resource deserialization
- AgentPlan: PythonResourceWrapper class checks
- ActionJsonDeserializer: parameter type and config deserialization
- FunctionToolJsonDeserializer: parameter type deserialization
- EventLogRecordJsonDeserializer: event class deserialization
@github-actions github-actions bot added the priority/major, fixVersion/0.2.0, and doc-label-missing labels Feb 3, 2026
@github-actions

github-actions bot commented Feb 3, 2026

@klaudworks Please add the following content to your PR description and select a checkbox:

- [ ] `doc-needed` 
- [ ] `doc-not-needed` <!-- Your PR changes do not impact docs -->
- [ ] `doc-included` 

@github-actions github-actions bot added the doc-not-needed label and removed the doc-label-missing label Feb 3, 2026
@klaudworks klaudworks closed this Feb 3, 2026
@klaudworks klaudworks reopened this Feb 3, 2026
@klaudworks
Contributor Author

The failed test is flaky. It relies on Ollama to perform a calculation, which is non-deterministic. Rerunning it should make it pass.

@xintongsong
Contributor

xintongsong commented Feb 4, 2026

@klaudworks, thank you for reporting and fixing this bug. This is indeed a valid problem, and the fix sounds good. @wenjin272, could you please help check the details?

The 0.2.0 version is already in RC vote and will soon be released. Given that there's a workaround for this issue (placing user-defined resources in flink/lib), I'd suggest not withdrawing the RC and fixing this in the next patch release. WDYT?

Additionally, I think we should:

  • Create an issue describing the problem and the workaround, and tracking the fix version. This helps other people who run into the same problem, because people tend to search issues rather than PRs. @klaudworks, would you like to do that?
  • We are planning to add some e2e tests that actually submit jobs to a Flink standalone cluster, instead of running in a mini-cluster as they do currently. I think user-defined resources should be included as one of the test cases. @wenjin272

@xintongsong xintongsong added the fixVersion/0.3.0 and fixVersion/0.2.1 labels and removed the fixVersion/0.2.0 label Feb 4, 2026
@wenjin272
Collaborator

Hi, @klaudworks, thanks for your detailed bug report and fix. Since you only verified with a custom resource, I verified the fix with a custom event. It works well. The fix looks good to me.

Additionally, I investigated when we set the TCCL to the UserCodeClassLoader.

  1. For agent plan deserialization, the UserCodeClassLoader is set by UserDefinedObjectsHolder in Flink's StreamGraph.
  2. For resource creation, the UserCodeClassLoader is set in JavaActionTask before executing the action.
  3. For event log deserialization, there is no place where the UserCodeClassLoader is set, because deserialization does not actually occur.
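
The general pattern behind the places listed above, setting the TCCL before running user code and restoring it afterwards, can be sketched roughly as follows. This is not the actual JavaActionTask or UserDefinedObjectsHolder code. `ContextClassLoaderScope` and `callWith` are hypothetical names, and the anonymous `ClassLoader` only stands in for a real user-code classloader.

```java
import java.util.function.Supplier;

final class ContextClassLoaderScope {

    /** Runs the action with the given loader as TCCL, then restores the previous TCCL. */
    static <T> T callWith(ClassLoader userCodeClassLoader, Supplier<T> action) {
        Thread thread = Thread.currentThread();
        ClassLoader previous = thread.getContextClassLoader();
        thread.setContextClassLoader(userCodeClassLoader);
        try {
            return action.get();
        } finally {
            // Restore even if the action throws, so the thread is left as it was found.
            thread.setContextClassLoader(previous);
        }
    }

    public static void main(String[] args) {
        ClassLoader before = Thread.currentThread().getContextClassLoader();
        // Stand-in for a user-code classloader; a real job would obtain this from Flink.
        ClassLoader userLoader = new ClassLoader(before) {};
        ClassLoader seenInside =
                callWith(userLoader, () -> Thread.currentThread().getContextClassLoader());
        System.out.println("TCCL inside action is user loader: " + (seenInside == userLoader));
        System.out.println("TCCL restored afterwards: "
                + (Thread.currentThread().getContextClassLoader() == before));
    }
}
```

Any code that resolves classes through the TCCL while the action runs will see the user loader. Code running outside the action sees whatever loader was set before.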

@klaudworks
Contributor Author

klaudworks commented Feb 4, 2026

Thanks for looking into this!

@xintongsong I'll write a proper issue for this in the next few minutes.

@xintongsong Depending on the deployment scenario, the current workaround might be quite inconvenient. In our case, for example, I run the platform that provides Flink and can't anticipate what kind of jobs users will try to run. I'd have to rebuild the Flink Docker image whenever someone modifies a custom resource.

@klaudworks
Contributor Author

@wenjin272 thanks for further validating the fix!

@xintongsong
Contributor

@klaudworks,

The workaround is admittedly not perfect. There will soon be a patch release, likely in late Feb or early Mar, which should also carry fixes for other issues discovered during this period. Does that sound good to you?

The reason I don't want to withdraw the RC is that we are trying to finalize the release before the Chinese New Year vacation. If we can make it, people who don't take the vacation can start trying out the 0.2.0 release and report issues during this time, so that when we are back we can resolve the issues immediately and ship 0.2.1 around the time mentioned above. Otherwise, we'll have to postpone the release until after the vacation, which means everything is paused during this time, and we'll probably have a stable patch release around the end of March. And if we withdraw the RC now, it's very likely we won't make it before the vacation. Building another RC, testing, voting, finalizing the artifacts, etc. takes more than one week.

@xintongsong xintongsong changed the title from "[hotfix] Use Thread Context ClassLoader for user class loading" to "[Bugfix] Use Thread Context ClassLoader for user class loading" Feb 4, 2026
@klaudworks
Contributor Author

@xintongsong that sounds good. Thank you for considering my situation, but don't worry about it. I am running a custom version of flink-agents with the patch included.

@xintongsong xintongsong merged commit ddbff41 into apache:main Feb 4, 2026
41 of 42 checks passed
@xintongsong
Contributor

Ported to release-0.2 in b8559a3

Labels

doc-not-needed, fixVersion/0.2.1, fixVersion/0.3.0, priority/major

3 participants