add infer operator#6704

Merged
mccanne merged 4 commits into main from infer on Mar 9, 2026

mccanne (Collaborator) commented Mar 6, 2026

No description provided.

philrz (Contributor) left a comment

On this branch at commit ac39eb7 I tried to run this on attached sample data bench2.csv and it panicked.

$ super -version
Version: v0.2.0-21-gac39eb7a2

$ super -c "infer" bench2.csv
panic: runtime error: invalid memory address or nil pointer dereference
goroutine 1 [running]:
runtime/debug.Stack()
	/usr/local/opt/go/libexec/src/runtime/debug/stack.go:26 +0x5e
github.com/brimdata/super/runtime/sam/op.(*Catcher).Pull.func1()
	/Users/phil/work/super/runtime/sam/op/catcher.go:25 +0x3d
panic({0xecbb400?, 0xf114d00?})
	/usr/local/opt/go/libexec/src/runtime/panic.go:860 +0x13a
github.com/brimdata/super/runtime/sam/op/infer.(*inferNode).load(0x3532d1f74f90, {0xef991b0?, 0x3532d21fc4b0}, {0x3532d2080000?, 0xbe8e80f?, 0x3532d22876d8?})
	/Users/phil/work/super/runtime/sam/op/infer/infer.go:71 +0x23a
github.com/brimdata/super/runtime/sam/op/infer.(*converter).infer(0x3532d21f3ec0, {0x3532d1df4c08, 0x64, 0x97})
	/Users/phil/work/super/runtime/sam/op/infer/op.go:176 +0xef
github.com/brimdata/super/runtime/sam/op/infer.(*converter).drain(0x3532d21f3ec0, 0x3532d1f744b0, 0x0)
	/Users/phil/work/super/runtime/sam/op/infer/op.go:115 +0x12a
github.com/brimdata/super/runtime/sam/op/infer.(*converter).process(0x3532d21f3ec0, {0xefa6530, 0x3532d203c0c0})
	/Users/phil/work/super/runtime/sam/op/infer/op.go:99 +0x1a6
github.com/brimdata/super/runtime/sam/op/infer.(*Op).Pull(0x3532d21f3ce0, 0x59?)
	/Users/phil/work/super/runtime/sam/op/infer/op.go:57 +0x1d0
github.com/brimdata/super/runtime/sam/op.(*Single).Pull(0x3532d21f3d40, 0x40?)
	/Users/phil/work/super/runtime/sam/op/mux.go:159 +0x33
github.com/brimdata/super/runtime/sam/op.(*Catcher).Pull(0x3532d211b9e8?, 0x25?)
	/Users/phil/work/super/runtime/sam/op/catcher.go:28 +0x5c
github.com/brimdata/super/runtime/exec.(*Query).Pull(0xbe22a7f?, 0x40?)
	/Users/phil/work/super/runtime/exec/query.go:49 +0x3c
github.com/brimdata/super/sbuf.CopyMux(0x3532d2287d08, {0xef901e0, 0x3532d21f3d70})
	/Users/phil/work/super/sbuf/mux.go:39 +0x38
github.com/brimdata/super/cmd/super/root.(*Command).Run(0x3532d1f1a488, {0x3532d1c140f0, 0x1, 0x1})
	/Users/phil/work/super/cmd/super/root/command.go:109 +0x9f9
github.com/brimdata/super/pkg/charm.path.run({0x3532d1bc6a78, 0x1, 0x1}, {0x3532d1c140f0, 0x1, 0x0?})
	/Users/phil/work/super/pkg/charm/path.go:11 +0x7b
github.com/brimdata/super/pkg/charm.(*Spec).Exec(0xf1986c0, {0x3532d1c140d0, 0x3, 0x3})
	/Users/phil/work/super/pkg/charm/charm.go:74 +0x1fa
main.main()
	/Users/phil/work/super/cmd/super/main.go:39 +0x5b

* [time](../types/time.md), or
* [bool](../types/bool.md).

`int64` inference takes precedence over `float64. All of the other candidate types

Suggested change
`int64` inference takes precedence over `float64. All of the other candidate types
`int64` inference takes precedence over `float64`. All of the other candidate types

then computes an inferred type for the sample, where the inferred type is identical
to the input type except for any embedded string types inferred to be of a candidate type.
Such inference occurs when all of the values contained by that string type
are uniformly coercable to the candidate type, which may be one of:

Suggested change
are uniformly coercable to the candidate type, which may be one of:
are uniformly coercible to the candidate type, which may be one of:

are unambiguous with one another.

If end of input is reached before collecting the desired sample size, then
the inferences is conducted on the available values.

Suggested change
the inferences is conducted on the available values.
the inference is conducted on the available values.

the inferences is conducted on the available values.

Once a type is inferred for a given sample, the values are cast to that type
and output by the operator. If the inferred type is unchanged, then then the values

Suggested change
and output by the operator. If the inferred type is unchanged, then then the values
and output by the operator. If the inferred type is unchanged, then the values

and output by the operator. If the inferred type is unchanged, then then the values
are output unmodified.

The operator may be reorder values as they are collected into a sample and analyzed.

Suggested change
The operator may be reorder values as they are collected into a sample and analyzed.
The operator may reorder values as they are collected into a sample and analyzed.
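The inference rule described in the documentation hunks above (a string field is retyped only when every value in the sample coerces to one candidate type, with `int64` tried before `float64`) can be sketched roughly as follows. This is an illustrative sketch, not the code in this PR: `inferKind` and `candidates` are hypothetical names, only three candidate types are modeled, and using `strconv` as the coercion test is an assumption.

```go
package main

import (
	"fmt"
	"strconv"
)

// candidate pairs a type name with a test for whether a string coerces to it.
// Hypothetical helper type for this sketch only.
type candidate struct {
	name   string
	parses func(string) bool
}

// candidates is ordered so that int64 is tried before float64.
// The operator's real candidate list is longer than this.
var candidates = []candidate{
	{"int64", func(s string) bool { _, err := strconv.ParseInt(s, 10, 64); return err == nil }},
	{"float64", func(s string) bool { _, err := strconv.ParseFloat(s, 64); return err == nil }},
	{"bool", func(s string) bool { return s == "true" || s == "false" }},
}

// inferKind returns the first candidate that every string in the sample
// coerces to, or "string" when no candidate fits all of the values.
func inferKind(sample []string) string {
	if len(sample) == 0 {
		return "string"
	}
	for _, c := range candidates {
		uniform := true
		for _, s := range sample {
			if !c.parses(s) {
				uniform = false
				break
			}
		}
		if uniform {
			return c.name
		}
	}
	return "string"
}

func main() {
	fmt.Println(inferKind([]string{"1", "2"}))   // int64
	fmt.Println(inferKind([]string{"1", "2.5"})) // float64
	fmt.Println(inferKind([]string{"1", "x"}))   // string
}
```

A single value that fails to coerce leaves the whole sample as `string`, which is the "uniformly coercible" requirement in the text.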

@philrz philrz requested a review from a team March 7, 2026 00:54
philrz (Contributor) commented Mar 8, 2026

I noticed something else interesting when testing this branch again now at commit 7a05d0f.

While right now infer only targets strings, we've had discussions offline about whether it would also make sense to infer when numeric values could all be represented as integers, rather than the current conservative behavior where reading raw numbers from CSV causes them to always be stored as floats. As an interim hack, I thought to try turning such numbers into strings first.

I've found that this works as expected:

$ super -version
Version: v0.2.0-23-g7a05d0f0a

$ echo '{object_size:923.} {object_size:469.}' | super -c "cast(this, <{object_size:string}>) | infer" -
{object_size:923}
{object_size:469}

But when one of the numbers is very large, infer leaves them as floats.

$ echo '{object_size:92310104.} {object_size:469.}' | super -c "cast(this, <{object_size:string}>) | infer" -
{object_size:92310104.}
{object_size:469.}

I suspect this may be down to the way the larger value gets formatted as scientific notation when rendered.

$ echo '{object_size:923.} {object_size:92310104.} {object_size:469.}' | super -c "cast(this, <{object_size:string}>)" -
{object_size:"923"}
{object_size:"9.2310104e+07"}
{object_size:"469"}

Though a straight cast to int64 doesn't seem to care.

$ echo '{object_size:923.} {object_size:92310104.} {object_size:469.}' | super -c "cast(this, <{object_size:int64}>)" -
{object_size:923}
{object_size:92310104}
{object_size:469}
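The suspicion above is consistent with how Go's standard library behaves. The sketch below shows that the shortest `'g'` rendering of a float64 switches to exponent form for the larger value, and that the resulting string is not a base-10 integer literal. Whether super formats and parses with these exact `strconv` calls is an assumption; `formatShortest` and `parsesAsInt64` are hypothetical helper names.

```go
package main

import (
	"fmt"
	"strconv"
)

// formatShortest renders f using Go's shortest round-trip 'g' format.
func formatShortest(f float64) string {
	return strconv.FormatFloat(f, 'g', -1, 64)
}

// parsesAsInt64 reports whether s is a plain base-10 integer literal.
func parsesAsInt64(s string) bool {
	_, err := strconv.ParseInt(s, 10, 64)
	return err == nil
}

func main() {
	for _, f := range []float64{923, 92310104, 469} {
		s := formatShortest(f)
		fmt.Printf("%q parses as int64: %t\n", s, parsesAsInt64(s))
	}
}
```

With this formatting, `923` and `469` render as `"923"` and `"469"`, while `92310104` renders as `"9.2310104e+07"`, which an integer parser rejects but a float parser accepts.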

mccanne (Collaborator, Author) commented Mar 9, 2026

I don't think infer should infer "9.2310104e+07" as an integer. We can solve the problem a different way, by inferring float64s (which typically come from JSON numbers) as int64 when the values in the inference sample all round to integers. My intention was to do this in a subsequent PR.
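One possible shape for the float64-to-int64 check proposed here is sketched below. It is not taken from this PR or any follow-up: `allIntegral` is a hypothetical name, and reading "rounds to integers" as "every value is a whole number that fits in int64" is an assumption.

```go
package main

import (
	"fmt"
	"math"
)

// two63 is 2^63, the first float64 value too large for int64.
const two63 = 1 << 63

// allIntegral reports whether every value in the sample is a whole number
// within the int64 range, so that converting it to int64 loses nothing.
func allIntegral(sample []float64) bool {
	if len(sample) == 0 {
		return false
	}
	for _, v := range sample {
		// NaN fails this test because NaN != NaN.
		if math.Trunc(v) != v {
			return false
		}
		// Rejects out-of-range values, including +Inf and -Inf.
		if v < -two63 || v >= two63 {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(allIntegral([]float64{923, 92310104, 469})) // true
	fmt.Println(allIntegral([]float64{1, 2.5}))             // false
}
```

Because this check works on the numeric values rather than their string renderings, the exponent-form rendering of large floats does not affect it.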

@@ -0,0 +1,112 @@
spq: values [1,"2"] | infer


Add vector: true to these tests

@mccanne mccanne merged commit 801c2d4 into main Mar 9, 2026
4 checks passed
@mccanne mccanne deleted the infer branch March 9, 2026 17:36