Conversation
| Features: row.Text.TextFeaturizer()));
|
| var pipeline = preprocess.
|     Appand(row => row.Label.TrainLinearClassification(row.Features));

Appand [](start = 16, length = 6)
Append #Resolved
| Features: row.Subject.Concat(row.Content).TextFeaturizer()));
|
| var pipeline = preprocess.
|     Appand(row => row.Label.TrainSDCAClassifier(row.Features)).

Appand [](start = 16, length = 6)
one more #Resolved
| string dataPath = "iris-data.txt";
| // Create reader with specific schema.
| var dataReader = TextLoader.CreateReader(env, ctx => (
|     Label: ctx.LoadText(0),
L [](start = 15, length = 1)
Tom is using lowercase for those, in his StaticPipeTest. Shall we adhere to that? #Closed
I guess I'd probably prefer the way it is, but let's see what Tom says.

In reply to: 215033893 [](ancestors = 215033893)
Thanks @Zruty0 and @sfilipi . I do not insist either way, except to point this out:
We are composing a value-tuple. And for better or worse, the value-tuple idiom that seems to have developed, as promoted by the .NET team in every example I see everywhere, both from them and from non-MS users of value-tuples, is that value-tuple names tend to be camelCased. (This flies in the face of prior .NET naming conventions, under which public member names are PascalCased, but, whatever.)
So if I search the web for value-tuple examples, these are the relevant results I see on the first page:
You will note, every last one of these is camelCased.
Given all this, I might prefer that if we're using value-tuples we make them look like other instances of value-tuples, rather than being different because we think we're cool enough to not resemble other usages of .NET idioms.
The casing has nothing to do with us; the casing is just about what looks most natural in code. And if what people consider most natural in value-tuples is camelCasing, we ought to have a really great reason to deviate from this in our PiGSTy examples. I'd be willing to listen to such an argument for sure, but in the absence of one we really should change the names here. #Closed
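For context, a minimal illustration of the camelCasing convention being argued for; the reader snippet in the comment below the plain tuple is hypothetical, only restating the schema under discussion:

```csharp
// Typical value-tuple usage in .NET examples: the item names are camelCased,
// even though public members of a class would conventionally be PascalCased.
var point = (x: 1.0, y: 2.0);
Console.WriteLine(point.x);  // access uses the camelCased name

// Applied to the reader schema under discussion, the convention would read:
// var dataReader = TextLoader.CreateReader(env, ctx => (
//     label: ctx.LoadText(0),
//     features: ctx.LoadFloat(1)));
```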
| var dataPath = "wikipedia-detox-250-line-data.tsv";
| // Load the data into the system.
| var data = TextLoader.CreateReader(env, ctx => (
|     Label: ctx.ReadFloat(0),
ReadFloat [](start = 30, length = 9)
I think they are LoadFloat, LoadText #Resolved
| // Load the data into the system.
| var data = dataReader.Read(dataPath);
|
| var preprocess = data.Schema.MakeEstimator(row => (
data.Schema.MakeEstimator [](start = 29, length = 25)
this should instead be Estimator.Create(data).Append(row => ... I guess. Take a look at Estimator.Create from Tom #Closed
While I wrote things that way at your request, I actually am not fond of how the code will look, and would prefer something closer to what @Ivanidzo4ka has done here. I think that if I have any of these schema-bearing objects, I should be able to do something like .StartPipe on it or something to get an empty estimator. Perhaps we should open an issue or something to discuss this @Zruty0? #Closed
In the discussion, we decided to keep it as data.MakeEstimator().Append(row => ...). I'm not terribly happy with this, but oh well.
In reply to: 215063741 [](ancestors = 215063741,215056489)
| Features: row.SepalWidth.ConcatWith(row.SepalLength, row.PetalWidth, row.PetalLength)));
|
| var pipeline = preprocess
var pipeline = preprocess [](start = 11, length = 26)
I would couple it all into one pipeline:

    var pipeline = Estimator.Create(...)
        .Append(...) // preprocessing
        .Append(...) // learner
        .Append(...) // post-processing

#Resolved
| var predictions = model.Transform(dataReader.Read(testDataPath));
| var evaluator = new MultiClassEvaluator(env);
| var metrics = evaluator.Evaluate(predictions);
maybe fold these 2 lines into one? #Closed
| var pipeline = preprocess.
|     Append(row => row.Label.TrainLinearClassification(row.Features));
|
| var (trainData, testData) = CrossValidator.TrainTestSplit(env, data: data, trainFraction: 0.7);
var (trainData, testData) = CrossValidator.TrainTestSplit(env, data: data, trainFraction: 0.7); [](start = 12, length = 95)
Let's have Tom comment: does this sound like it'll break us out of pigsty? So pipeline.Fit(trainData) is not possible? Or should we have CrossValidator.TrainTestSplit be pigsty-capable? #Closed
Hi @Zruty0 , if we view it as desirable to have a method with signature (DataView<TShape> split1, DataView<TShape> split2) Split(IDataView<TShape> source, double ratio), assuredly nothing prevents us from doing so. But that's fairly obviously the case, so I wonder if I've missed the point of the question somehow? #Closed
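To make the shape of that proposal concrete, here is a sketch of such a signature; every name here is hypothetical and only illustrates the typed-split idea, not any actual API:

```csharp
// Hypothetical: a split that stays in pigsty-land by accepting and returning
// statically-typed data views, so pipeline.Fit(train) still type-checks.
public static (DataView<TShape> train, DataView<TShape> test) TrainTestSplit<TShape>(
    DataView<TShape> source, double trainFraction)
{
    // A real implementation would partition the rows (possibly stratified)
    // and wrap each partition back into a DataView<TShape>.
    throw new NotImplementedException();
}
```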
|     dataPath);
|
| // Load the data into the system.
| var data = dataReader.Read(dataPath);
dataPath [](start = 39, length = 8)
this implies that our text reader will be an IDataReader&lt;string&gt;, right? Or that string will be implicitly converted to MultiFileSource? I'm OK with the second idea, but is Tom?
Hi @Zruty0 , I am OK with it, I think. On multiple occasions I would have found an implicit conversion to IMultiStreamSource a useful simplifying instrument.
I tried and failed. The implicit ctor is not called if you pass dataPath to a method: I guess it's only for assignments
In reply to: 215094774 [](ancestors = 215094774)
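A clarifying note, since "implicit ctor" conflates two things: in C# a constructor is never applied implicitly anywhere, while a user-defined implicit conversion operator is applied to method arguments as well as assignments. What most likely failed here is generic type inference, which does not look through user-defined conversions. A minimal sketch (the type and operator below are illustrative stand-ins, not the real ML.NET ones):

```csharp
class MultiFileSource
{
    public string Path { get; }
    public MultiFileSource(string path) { Path = path; }

    // A user-defined implicit conversion IS considered at call sites,
    // unlike a constructor, which must always be invoked explicitly.
    public static implicit operator MultiFileSource(string path)
        => new MultiFileSource(path);
}

static void Read(MultiFileSource source) { }

// Read("data.tsv");  // compiles: the string converts via the implicit operator.
// However, generic type inference does not consider such conversions, which is
// the likely reason passing a raw string to a generic CreateReader did not work.
```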
| [Fact]
| public void SimpleIrisDescisionTrees()
| {
|     var env = new TlcEnvironment(new SysRandom(0), verbose: true);
new SysRandom(0), verbose: true [](start = 41, length = 31)
no need for these params, if they are not required #Resolved
| var data = TextLoader.CreateReader(env, ctx => (
|     Label: ctx.LoadFloat(0),
|     Text: ctx.LoadText(1)),
|     dataPath, hasHeader: true).Read(dataPath);
This is something that is slightly annoying me as I'm reading across the examples, and I wonder if anyone else feels similarly.
This .CreateReader(..., dataPath).Read(dataPath) seems vaguely tautological -- it probably isn't, but it seems like it is. While surely we must have DataReader be able to handle the case of different data, would there be benefit to having a DataReader somehow retain a reference to the TIn instance that it was initialized on, and so have a parameterless Read that just implicitly read over whatever it was initialized as?
Nonetheless, you could imagine this being something we would not want to do all the time -- if the initialization structure was some huge in-memory structure (e.g., the case where we are initializing and training a pipeline based on data in memory). But then we have to have a separate API to allow both cases. And if we don't want to do that always, perhaps it's best if we don't do it ever, so a user does not have to learn additional special cased idioms for what amounts to a fairly minor enhancement.
With that in mind, I think the code in its current form is probably fine (since there are some strong architectural reasons to disfavor caching, as we see above), but at the very least we will need to make sure this "idea" comes across somehow perhaps better than it does now, since at first glance this looks, frankly, kind of silly. I wonder at least if there's something we can do to make it seem, somehow, less silly.
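To make the "tautology" concrete, the two shapes being compared might look like this; the parameterless Read overload is hypothetical, merely illustrating the alternative being weighed (and, per the above, probably rejected):

```csharp
// As written today: the path appears twice, once to initialize the reader
// and once to actually read.
var data = TextLoader.CreateReader(env, ctx => (
        Label: ctx.LoadFloat(0),
        Text: ctx.LoadText(1)),
    dataPath, hasHeader: true).Read(dataPath);

// The alternative: the reader retains the source it was initialized on.
// var reader = TextLoader.CreateReader(env, ctx => (...), dataPath, hasHeader: true);
// var data = reader.Read();   // hypothetical parameterless overload
```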
| var pipeline = preprocess.
|     Append(row => row.Label.TrainLinearClassification(row.Features));
|
| var (trainData, testData) = CrossValidator.TrainTestSplit(env, data: data, trainFraction: 0.7);
I'm afraid I don't understand what is meant to be promoted here. Cross-validation suggests to me that there's some sort of loop, with independent featurization happening somewhere. And in cross-validation specifically, that helper function -- assuming we agree that it is necessary to have one -- would handle the splitting, according to the number of folds, not by some sort of trainFraction parameter. #Closed
If the idea is just some sort of utility to split the dataset in two, then whatever we name that utility, it cannot have anything to do with cross-validation, since that term already refers to a very specific way of splitting up data. #Closed
Okay, okay, so not CrossValidator (I was just trying to follow the example in sklearn, but I'm not too fond of it). But still, the data instances will not be DataView<TPig>, they will be IDataView, unless we make TrainTestSplit work in pigsty-land, right? I suppose we should actually do this, so line 121 is possible
In reply to: 215061185 [](ancestors = 215061185)
Sorry, why wouldn't the instances be DataView<TPig>? So you feed in DataView<TPig>, and you get two (or even more, if we're ambitious) possibly stratified instances of DataView<TPig>. I see no barrier to that, but do you?
In any case, I think we should make it general policy that dropping out of pigsty is something the user must explicitly do... that is, without an .AsDynamic somewhere explicitly in the code the users themselves write, there should never be a situation where they can unknowingly drop out of pigsty. #Closed
| var predictions = model.Transform(testData);
| var evaluator = new MultiClassEvaluator(env);
| var metrics = evaluator.Evaluate(predictions);
This is insufficient. You will have to tell it what you want to evaluate -- that is, you will need some sort of delegate to select both the label and the prediction that you have computed here. In fact, the attempt to do without this has hidden a nasty flaw: your predictions were the result of a model whose last estimator was this:

    Append(row => row.PredictedLabel.KeyToValue());

You have only the predicted label, since you got rid of everything else with this simple projection.
One of the consequences of having these tuple-shape items is that we will need to use delegates, I think, to index into them. So there will be a delegate in evaluate to select the label, the predicted label, and probably the probabilities as well. #Resolved
As discussed in person, let's do

    MultiClassEvaluator.Evaluate(predictions, ctx => predictions.label, ctx => predictions.prediction)

and make sure 'label' and 'prediction' are available in line 147
In reply to: 215062579 [](ancestors = 215062579)
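Assembled, the agreed shape might look as follows. This is a sketch of the API being negotiated, not a finished signature; note also that the delegates here select off their own parameter rather than closing over predictions, which is presumably the intent:

```csharp
// Hypothetical: the evaluator takes delegates that index into the tuple
// shape, so it can locate the label and the prediction without relying on
// magic column names.
var metrics = MultiClassEvaluator.Evaluate(
    predictions,
    r => r.label,        // ground-truth label
    r => r.prediction);  // predicted label
```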
|     Content: ctx.LoadText(2)),
|     dataPath, hasHeader: true).Read(dataPath);
|
| var preprocess = data.Schema.MakeEstimator(row => (
I'm not sure what the purpose of these calls to Schema are. Surely if we had an extension method on any schema-shape bearing object, a MakeEstimator call would suffice, would it not? #Resolved
| public void SimpleIrisDescisionTrees()
| {
|     var env = new TlcEnvironment(new SysRandom(0), verbose: true);
|     string dataPath = "iris-data.txt";
iris-data.txt [](start = 31, length = 13)
think it is iris.txt #Resolved
| var model = pipeline.Fit(data);
|
| var predictions = model.Transform(dataReader.Read(testDataPath));
| var evaluator = new MultiClassEvaluator(env);
var evaluator = new MultiClassEvaluator(env); [](start = 9, length = 48)
remove #Resolved
| // Convert string label to key.
| label: row.label.Dictionarize(),
| // Concatenate all features into a vector.
| features: row.subject.Concat(row.content).TextFeaturizer()))
features: row.subject.Concat(row.content).TextFeaturizer())) [](start = 16, length = 60)
indentation #Resolved
|     dataPath);
|
| var pipeline = data.MakeEstimator()
data is not defined at this point. Perhaps line 91 should be moved up? #Resolved
|     row.label.TrainLinearClassification(row.features)));
|
| var (trainData, testData) = CrossValidator.TrainTestSplit(env, data: dataReader.Read(dataPath), trainFraction: 0.7);
Why is TrainTestSplit coming from CrossValidator? This is a bit confusing. #Closed
This is just a generic name which Pete suggested I use.
I'm totally open to anything more reasonable; do you have a suggestion?
In reply to: 215109742 [](ancestors = 215109742)
Perhaps something like "ModelValidation"?
Scikit-Learn has train_test_split() in model_selection #Resolved
|     content: ctx.LoadText(2)),
|     dataPath, hasHeader: true);
|
| var preprocess = data.MakeEstimator().
I think preprocess should likely be renamed to pipeline based on line 148. #Resolved
|     sepalLength: ctx.LoadFloat(2),
|     petalWidth: ctx.LoadFloat(3),
|     petalLength: ctx.LoadFloat(4)),
|     dataPath);
It would be nice if we can make dataPath the second parameter here. It is easily missed given the ctx is quite involved. @Zruty0 mentioned this is because it is an optional parameter but just thought I'd share in case there are any ideas. #Resolved
I think it's OK to actually have a version of the ctor, where dataPath is a required 2nd argument...
In reply to: 215140669 [](ancestors = 215140669)
| // Convert string label to key.
| label: row.label.Dictionarize(),
| // Concatenate all features into a vector.
| features: row.sepalWidth.ConcatWith(row.sepalLength, row.petalWidth, row.petalLength)))
There was likely a separate discussion on this, but why was ConcatWith chosen instead of Concat (used in LINQ)? This applies to all the estimators.
If you have a vector column foo, then foo.Concat() can be interpreted as 'concatenate all slots inside foo'. foo.ConcatWith doesn't allow for this confusion
In reply to: 215141225 [](ancestors = 215141225)
Good point. I think given the similarities with LINQ there is a benefit to staying consistent, but agree that removing confusion is better.
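The ambiguity being avoided can be illustrated against LINQ, where Concat is strictly binary over sequences (the sequence code below is standard LINQ; the trailing comment restates the column-naming point from this thread):

```csharp
using System.Linq;

int[] first = { 1, 2 };
int[] second = { 3, 4 };
// LINQ's Concat appends one sequence after another: 1, 2, 3, 4.
var all = first.Concat(second).ToArray();

// For a vector-valued column, a plain "Concat" could instead be read as
// "flatten the slots inside this one column", so ConcatWith forces the
// binary, append-these-other-columns reading:
// features: row.sepalWidth.ConcatWith(row.sepalLength, row.petalWidth, row.petalLength)
```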
| // Load the data into the system.
| var data = dataReader.Read(dataPath);
| var model = pipeline.Fit(data);
This example doesn't gain much from having dataReader.Read(dataPath), right? Would it make sense for pipeline.Fit() to take a dataPath? This would be more similar to Scikit-Learn, but perhaps is not in sync with other concepts here.
if your model is fitted using a data path, then it can only work from another data path (and not individual examples, for example)
In reply to: 215141509 [](ancestors = 215141509)
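The design point can be made concrete; the in-memory source below is hypothetical, illustrating only why Fit consumes a data view rather than a path:

```csharp
// Because Fit takes a data view, the same pipeline can be trained from any
// source that yields the right schema shape:
var model = pipeline.Fit(dataReader.Read(dataPath));  // from a file

// var model = pipeline.Fit(inMemoryData);  // hypothetical in-memory source

// A Fit(string dataPath) overload would tie the fitted model to file input,
// which is exactly the limitation the reply above points at.
```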
| [Fact]
| public void TwitterSentimentAnalysis()
| {
|     var env = new ConsoleEnvironment();
I know we generally need env, but does it add any value in simple scenarios where it doesn't do anything?
It is assuredly doing something. You're using it to create the text loader, and from there all subsequent components spawned from that object, which is to say, all of them.
@TomFinley - The name ConsoleEnvironment is not clear to me. If we need the TlcEnvironment (ConsoleEnvironment) in most cases in order to create a DataReader:
var dataReader = TextLoader.CreateReader(env, dataPath,...
Does it make sense to call it ConsoleEnvironment if I run the build/training code from a different type of application like an ASP.NET Core Web app (or a WinForms, WPF if using .NET Framework)?
Why don't we simply call it "Environment" or "MLNetEnvironment"?
Or is the TlcEnvironment (ConsoleEnvironment) only used in Console apps and if running code to build/train a model in a non-console app I don't need the "env" object in order to create a DataReader?
| var pipeline = dataReader.MakeEstimator()
|     .Append(row => (
|         label: row.label,
Is there a way to keep all columns without writing them out individually? For example, say I have 1000 categorical columns and 1 text column. I would like to apply TextFeaturizer to the text column but keep the 1000 categorical columns.
For various reasons a valid schema shape is not only a value-tuple of PipelineColumn objects, but also recursively defined structures of value-tuples... that is, you could have a value-tuple of value-tuple of value-tuple, etc. This means that in this expression it would have been totally legal to say, prev: row, and then you just bring them along for free, at the cost of having to work through that prev identifier. This is discussed in #632, in particular, see the section "ValueTuples of PipelineColumns," and in particular the sentence "it provides a convenient way to just bring along all the inputs, if we wanted to do so, by just assigning the input tuple itself as an item in the output tuple."
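A sketch of that recursive-tuple trick applied to the pipeline above (hypothetical, following the shape of the surrounding examples rather than any finalized API):

```csharp
var pipeline = dataReader.MakeEstimator()
    .Append(row => (
        prev: row,                              // carry every input column along
        features: row.text.TextFeaturizer()))
    .Append(row => (
        label: row.prev.label,                  // reach back through 'prev'
        row.prev.label.TrainLinearClassification(row.features)));
```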
|         features: row.text.TextFeaturizer()))
|     .Append(row => (
|         label: row.label,
|         row.label.TrainLinearClassification(row.features)));
Should TextFeaturizer be FeaturizeText for consistency? TrainLinearClassification -> PredictWithLinearClassifier?
TomFinley left a comment
Thanks @Ivanidzo4ka . I'm not sure if @GalOshri has further blocking comments, or if we might resolve them later?
GalOshri left a comment
LGTM, don't block on my existing comments. Let's consider them later.