
Added lambda that integrates Forklift #411

Draft
JaimeZepeda08 wants to merge 16 commits into open-lambda:main from JaimeZepeda08:main

Conversation

@JaimeZepeda08
Contributor

No description provided.

Member

@tylerharter tylerharter left a comment

Good work! Haven't read everything super closely yet, but leaving some initial feedback.

Let's not have big data files here, like deps.json and wl.json. In the docs, we can describe how a user would use your lambda: wget those files (from wherever they are, probably the forklift repo), and then curl to your function.

event = json.loads(event)

workload_data = event.get("workload")
if not workload_data:
Member

Do we need these checks? It seems it's all in a big try/except, and you'll return a similar error below anyway.


Expected event format:
{
"workload": { ... workload data ... },
Member

Can you show the next level of detail (below) about how workload data is structured?

}
"""
try:
if isinstance(event, str):
Member

Why do we need to check? Are there different possible inputs?


result = generate_tree(workload_data, num_nodes, multi_package)

return {
Member

Can we write a flask lambda instead? Then, we could return a status code, like 500 or something. The benefit is that somebody could do a "curl ... > tree.json" to get a tree, without needing to extract the result separately.
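A minimal sketch of that flask-lambda idea, returning real HTTP status codes so `curl ... > tree.json` gets the tree directly. The route name, key names, and `generate_tree` stub below are placeholders, not this PR's actual code:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def generate_tree(workload):
    # stand-in for the PR's real tree-building logic
    return {"tree": workload}

@app.route("/run", methods=["POST"])
def run():
    event = request.get_json(silent=True)
    if not event or "workload" not in event:
        return jsonify({"error": "missing workload"}), 400  # bad request body
    try:
        return jsonify(generate_tree(event["workload"])), 200
    except Exception as e:
        return jsonify({"error": str(e)}), 500              # server-side failure
```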


# Testing code
if __name__ == "__main__":
with open("wl.json", "r") as file:
Member

Why not sys.argv[1]?
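For reference, a sketch of that change; the helper name and the fallback default are illustrative:

```python
import json
import sys

def load_workload(argv):
    # read the workload path from the command line instead of hardcoding it
    path = argv[1] if len(argv) > 1 else "wl.json"  # keep the old default as a fallback
    with open(path, "r") as f:
        return json.load(f)

# usage: python3 lambda.py my_workload.json  ->  load_workload(sys.argv)
```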


def generate_tree(workload_data, num_nodes, multi_package=True):
calls = parse_workload(workload_data)
deps_json = load_deps_json()
Member

Let's not load from a file. Can we include this in the POST body passed to our function?
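A sketch of that suggestion: deps arrive in the POST body next to the workload instead of being read from deps.json on disk. The key names and the `generate_tree` signature here are illustrative:

```python
def handle_event(event):
    workload_data = event["workload"]
    deps_json = event["deps"]        # was: load_deps_json() reading from disk
    return generate_tree(workload_data, deps_json)

def generate_tree(workload_data, deps_json, num_nodes=1, multi_package=True):
    # placeholder body; the point is that deps is now passed in by the caller
    return {"workload": workload_data, "deps": deps_json, "nodes": num_nodes}
```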

candidate_queue = [] # priority queue for candidate nodes


def enqueue_top_child_candidate(parent, deps, multi_package=True):
Member

It seems we pass around deps and multi_package to all our functions. Perhaps put this all in a class, and these are attributes?
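A sketch of that refactor, with deps and multi_package as attributes so helpers stop threading them through every call. The class and method names are illustrative:

```python
class TreeBuilder:
    def __init__(self, deps, multi_package=True):
        self.deps = deps
        self.multi_package = multi_package
        self.candidate_queue = []   # priority queue for candidate nodes

    def enqueue_top_child_candidate(self, parent):
        # reads self.deps / self.multi_package instead of taking parameters
        children = self.deps.get(parent, [])
        if children:
            self.candidate_queue.append(children[0])
```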

continue

pkg_name = child_pkgV.split("==")[0]
version = child_pkgV.split("==")[1] if "==" in child_pkgV else None
Member

Do we allow pkgs without a version? If our input is based on pip compile output, won't we have versions? I may misunderstand, though.

continue

# keep track of packages that would be loaded by this candidate
packages_to_load = set([child_pkgV])
Member

You can create a set without converting:

{child_pkgV}

from collections import defaultdict


# TODO: instead of loading deps from a file, use pip-compile on the fly
Member

I'm thinking let's require callers to pass deps in. Then, however we figure out those dependencies, it will be outside the nice, clean tree-building logic here.

Member

@tylerharter tylerharter left a comment

Getting closer!

@@ -0,0 +1,290 @@
'''
Member

just call "forklift", not "forklift-service"

self._enqueue_top_child_candidate(child)

def build_tree(self, desired_nodes):
self.candidate_queue = []
Member

Best to see all class attrs in init even if None at that point.

For the queue, let's use namedtuples so it's easier to see that the code is correct. Things like the following need careful checking:

_, _, best_candidate
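With named fields, unpacking like `_, _, best_candidate` becomes `entry.node`, which is much easier to check. A sketch with illustrative field names (a namedtuple still orders like a plain tuple, so heapq works unchanged):

```python
from collections import namedtuple
import heapq

Candidate = namedtuple("Candidate", ["priority", "order", "node"])

queue = []
heapq.heappush(queue, Candidate(priority=-10, order=0, node="numpy==1.26.0"))
heapq.heappush(queue, Candidate(priority=-3, order=1, node="flask==3.0.0"))

best = heapq.heappop(queue)   # lowest priority first; fields stay readable
```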

found := false
for _, p := range packages {
if p == nodePkg {
// compare only using package name
Member

separate PR

call_counts[name] += 1

# build call matrix
calls = {}
Member

Is a dict the best type for a calls matrix? We want something that is (a) sparse and (b) has column names.

Need to check, but Claude says Pandas can have a pd.SparseDtype(int, fill_value=0) type to make it sparse. Maybe Pandas DF is the best matrix format then?
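An untested sketch of that pandas idea (pandas 1.0+; worth double-checking the behavior before relying on it):

```python
import pandas as pd

# dense call matrix first: rows are functions, columns are packages
df = pd.DataFrame(0, index=["fn_a", "fn_b"], columns=["numpy", "flask"])
df.loc["fn_a", "numpy"] = 3

# convert to a sparse dtype so the zero cells are stored implicitly
sdf = df.astype(pd.SparseDtype(int, fill_value=0))
```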

Contributor Author

I believe in this case a dict would work better because the algorithm repeatedly partitions calls by rows and we only operate on rows.

From AI:
"Pandas sparse data structures (specifically SparseDtype and SparseArray introduced in version 1.0+) are designed to handle sparse data efficiently, but they do not deal well with frequent, row-wise, or random mutations."

What do you think?

Member

@tylerharter tylerharter left a comment

Nice! Getting close.

@@ -0,0 +1 @@
flask (no newline at end of file)
Member

Let's do the pattern of a requirements.in and a pip-compiled requirements.txt, as in the other example lambdas.

app = Flask(__name__)


class CallMatrix:
Member

I think a generic SparseMatrix would allow us to better mirror the logic in the paper. The class has package-specific awareness, and therefore takes on some responsibilities handled by other functions in the pseudocode.
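A sketch of what a generic SparseMatrix could look like, with no package awareness, closer to the paper's pseudocode (dict-of-dicts storage; all names illustrative). Package-specific parsing would live elsewhere:

```python
from collections import defaultdict

class SparseMatrix:
    def __init__(self):
        self._rows = defaultdict(dict)        # row label -> {col label: value}

    def incr(self, row, col, by=1):
        self._rows[row][col] = self._rows[row].get(col, 0) + by

    def get(self, row, col):
        return self._rows[row].get(col, 0)    # missing cells are implicit zeros

    def row(self, row):
        # the algorithm partitions by rows, so row access is the hot path
        return dict(self._rows[row])
```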

self.root = None
self.candidate_queue = []

def _enqueue_top_child_candidate(self, parent):
Member

This function is quite a bit longer and more complicated than the pseudocode for enqueue_top_child_candidate in the paper. I think it is because this function is taking on more concerns, not just the core logic.

E.g., you have separate parse functions. Splitting pkg name from version feels like it should be handled there, not in the core logic of this function.

Member

Maybe a helper function or two would help as well?

return jsonify(result), 200

except Exception as e:
import traceback
Member

Imports should be at the top of the file.


try:
event = request.get_json()
if event is None:
Member

We have a couple of special cases where we return different errors for specific kinds of invalid data. But given that it's all in a try/except, do we need special code for these?

'''

try:
event = request.get_json()
Member

Perhaps the body of this try should be its own function, to cleanly separate logic: one function for the core logic, another (which calls the former) that adds error handling.
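A sketch of that split, with illustrative names: the core function raises on bad input, and a thin wrapper turns exceptions into an error response:

```python
def build_tree_from_event(event):
    # core logic only; a KeyError on bad input is fine, the wrapper handles it
    workload = event["workload"]
    return {"tree": workload}

def handle_request(event):
    # wrapper owns error handling and status codes
    try:
        return build_tree_from_event(event), 200
    except Exception as e:
        return {"error": str(e)}, 500
```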
