[Bugfix] Use Thread Context ClassLoader for user class loading #514
xintongsong merged 2 commits into apache:main from
This fixes ClassNotFoundException when loading user-defined resource classes (e.g., custom ChatModel implementations) from user JARs uploaded via the REST API.
The issue occurs because framework code in /opt/flink/lib is loaded by the System ClassLoader, which cannot see classes in the User ClassLoader (child classloader for uploaded JARs). By using the Thread Context ClassLoader (TCCL), which Flink sets to the User ClassLoader before executing user code, the framework can now properly load user-defined classes.
Files updated:
- JavaResourceProvider: main resource instantiation
- JavaSerializableResourceProvider: serializable resource deserialization
- AgentPlan: PythonResourceWrapper class checks
- ActionJsonDeserializer: parameter type and config deserialization
- FunctionToolJsonDeserializer: parameter type deserialization
- EventLogRecordJsonDeserializer: event class deserialization
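For readers unfamiliar with how a runtime hands a classloader to the code it invokes, the set-and-restore pattern around the TCCL can be sketched as follows. This is a generic illustration only, not Flink's actual implementation: the class name ContextClassLoaderSwap and the method callWithContextClassLoader are hypothetical, and only standard JDK calls are used.

```java
import java.util.concurrent.Callable;

// Hypothetical helper for illustration; not part of Flink or flink-agents.
final class ContextClassLoaderSwap {

    private ContextClassLoaderSwap() {}

    // Runs userCode with userLoader installed as the Thread Context ClassLoader,
    // then restores whatever loader was installed before the call.
    static <T> T callWithContextClassLoader(ClassLoader userLoader, Callable<T> userCode)
            throws Exception {
        Thread current = Thread.currentThread();
        ClassLoader previous = current.getContextClassLoader();
        current.setContextClassLoader(userLoader);
        try {
            return userCode.call();
        } finally {
            // Always restore, even if userCode throws.
            current.setContextClassLoader(previous);
        }
    }
}
```

Framework code that runs inside userCode can then reach the user's classes through Thread.currentThread().getContextClassLoader(), regardless of which classloader loaded the framework itself.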
|
@klaudworks Please add the following content to your PR description and select a checkbox: |
|
The failed test is a flaky test. It relies on Ollama to do some calculation which is non-deterministic. You can rerun it and it should be fine. |
|
@klaudworks, thank you for reporting and fixing this bug. This is indeed a valid problem, and the fix sounds good. @wenjin272, could you please help check the details? The 0.2.0 version is already in RC vote, and will soon be released. Given that there's a workaround to this issue (placing user-defined resource in Additionally, I think we should:
|
|
Hi, @klaudworks, thanks for your detailed bug report and fix. Since you only verified with a custom resource, I verified the fix with a custom event. It works well. The fix looks good to me. Additionally, I investigated when we set the TCCL to the UserCodeClassLoader.
|
|
Thanks for looking into this! @xintongsong I'll write a proper issue for this in the next few minutes. @xintongsong Depending on the deployment scenario, the current workaround might be quite inconvenient. In our case, for example, I am running the platform providing Flink and can't anticipate what kind of jobs users will try to run. I'd have to rebuild the Flink Docker image whenever someone modifies a custom resource. |
|
@wenjin272 thanks for further validating the fix! |
|
The workaround is admittedly not perfect. There will soon be a patch release, likely in late Feb or early Mar, which should also carry fixes for other issues discovered during this period. Does it sound good to you? The reason I don't want to withdraw the RC is that we are trying to finalize the release before the Chinese New Year vacation. If we can make it, people who don't take the vacation can start trying out the 0.2.0 release and report issues during this time, so that when we are back we can resolve the issues immediately and ship 0.2.1 around the time mentioned above. Otherwise, we'll have to postpone the release to after the vacation, which means everything is paused during this time, and we'll probably have a stable patch release around end of March. And if we withdraw the RC now, it's very likely we won't make it before the vacation. It takes more than one week: building another RC, testing, voting, finalizing the artifacts, etc. |
|
@xintongsong that sounds good. Thank you for considering me but don't worry about it. I am running a custom version of flink-agents with the patch included. |
|
Ported to release-0.2 in b8559a3 |
Linked issue: #515
Summary
This PR fixes ClassNotFoundException when loading user-defined resource classes (e.g., custom ChatModel implementations, such as AzureOpenAIChatModelSetup in my case) from user JARs uploaded via the REST API.
Problem
When flink-agents-dist.jar is deployed in /opt/flink/lib (which is required), it is loaded by the System ClassLoader. User JARs uploaded at runtime are loaded by Flink's User ClassLoader.
The existing code uses Class.forName(className), which resolves classes through the caller's classloader (here, the System ClassLoader). The System ClassLoader cannot see classes in its child classloaders, resulting in ClassNotFoundException.
Solution
Use the Thread Context ClassLoader (TCCL) instead:
Flink sets the TCCL to the User ClassLoader before executing user code, making user-defined classes accessible to framework code.
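A minimal sketch of the kind of change this describes, assuming a small helper that prefers the TCCL and falls back to the helper's own classloader when no TCCL is set. The class name UserClassLoading and the method loadUserClass are hypothetical names for illustration and are not taken from the diff; the fallback is likewise an assumption, not something the PR is stated to do.

```java
// Hypothetical helper for illustration; names are not from the flink-agents code base.
final class UserClassLoading {

    private UserClassLoading() {}

    // Resolves className through the Thread Context ClassLoader when one is set,
    // instead of the caller's classloader used by Class.forName(className).
    static Class<?> loadUserClass(String className) throws ClassNotFoundException {
        ClassLoader tccl = Thread.currentThread().getContextClassLoader();
        // Assumed fallback: use this class's own loader if no TCCL is installed.
        ClassLoader loader = (tccl != null) ? tccl : UserClassLoading.class.getClassLoader();
        // The three-argument overload lets the caller choose the loader explicitly.
        return Class.forName(className, true, loader);
    }
}
```

Because a child classloader normally delegates to its parent first, framework classes still resolve as before; only classes the System ClassLoader cannot find are additionally looked up in the user JARs.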
Testing
Validated in a Flink 1.20 cluster with flink-agents, successfully loading custom AzureOpenAIChatModelSetup classes from user JARs uploaded via the REST API.
Docs
- doc-needed
- doc-not-needed
- doc-included