Skip to content

Some pigsty examples#809

Merged
Ivanidzo4ka merged 10 commits intodotnet:masterfrom
Ivanidzo4ka:ivanidze/pigstyexamples
Sep 6, 2018
Merged

Some pigsty examples#809
Ivanidzo4ka merged 10 commits intodotnet:masterfrom
Ivanidzo4ka:ivanidze/pigstyexamples

Conversation

@Ivanidzo4ka
Copy link
Contributor

No description provided.

Features: row.Text.TextFeaturizer()));

var pipeline = preprocess.
Appand(row => row.Label.TrainLinearClassification(row.Features));
Copy link
Member

@sfilipi sfilipi Sep 4, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Appand [](start = 16, length = 6)

Append #Resolved

Features: row.Subject.Concat(row.Content).TextFeaturizer()));

var pipeline = preprocess.
Appand(row => row.Label.TrainSDCAClassifier(row.Features)).
Copy link
Member

@sfilipi sfilipi Sep 4, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Appand [](start = 16, length = 6)

one more #Resolved

string dataPath = "iris-data.txt";
// Create reader with specific schema.
var dataReader = TextLoader.CreateReader(env, ctx => (
Label: ctx.LoadText(0),
Copy link
Member

@sfilipi sfilipi Sep 4, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

L [](start = 15, length = 1)

Tom is using lowercase for those, in his StaticPipeTest. Shall we adhere to that? #Closed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I'd probably prefer the way it is, but let's see what Tom says


In reply to: 215033893 [](ancestors = 215033893)

Copy link
Contributor

@TomFinley TomFinley Sep 4, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I'd probably prefer the way it is, but let's see what Tom says

Thanks @Zruty0 and @sfilipi . I do not insist either way, except to point this out:

We are composing a value-tuple. And for better or worse, the value-tuple idiom that seems to have been developed as promoted by the .NET team in every example I see everywhere, both from them and non-MS users of value-tuples, is that value-tuple names tend to be camelCased. (This flies in the face of prior .NET naming conventions for public member names being PascalCased, but, whatever.)

So if I search the web for value-tuple example the relevant results I see on the first page are this:

You will note, every last one of these is camelCased.

Given all this, I might prefer that if we're using value-tuples we make them look like other instances of value-tuples, rather than being different because we think we're cool enough to not resemble other usages of .NET idioms.

The casing is nothing relating to us, the casing is just related to look most natural in code. And if what people consider most natural in value-tuples is camelCasing, we ought to have a really great reason to deviate from this in our PiGSTy examples. I'd be willing to listen to such an argument for sure, but in the absence of that we really should change the names here. #Closed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, let's rename to camelCase


In reply to: 215054683 [](ancestors = 215054683)

var dataPath = "wikipedia-detox-250-line-data.tsv";
// Load the data into the system.
var data = TextLoader.CreateReader(env, ctx => (
Label: ctx.ReadFloat(0),
Copy link
Member

@sfilipi sfilipi Sep 4, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ReadFloat [](start = 30, length = 9)

I think they are LoadFloat, LoadText #Resolved

@Ivanidzo4ka Ivanidzo4ka added the API Issues pertaining the friendly API label Sep 4, 2018
@Ivanidzo4ka Ivanidzo4ka added this to the 0918 milestone Sep 4, 2018
// Load the data into the system.
var data = dataReader.Read(dataPath);

var preprocess = data.Schema.MakeEstimator(row => (
Copy link
Contributor

@Zruty0 Zruty0 Sep 4, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

data.Schema.MakeEstimator [](start = 29, length = 25)

this should instead be Estimator.Create(data).Append(row => ... I guess. Take a look at Estimator.Create from Tom #Closed

Copy link
Contributor

@TomFinley TomFinley Sep 4, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I wrote things that way at your request, I actually am not fond of how the code will look, and would prefer something closer to what @Ivanidzo4ka has done here. I think that if I have any of these schema-bearing objects, I should be able to do something like .StartPipe on it or something to get an empty estimator. Perhaps we should open an issue or something to discuss this @Zruty0? #Closed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's go over this at today's meeting


In reply to: 215056489 [](ancestors = 215056489)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the discussion, we decided to keep it as data.MakeEstimator().Append(row=>...). I'm not terribly happy with this, but oh well.


In reply to: 215063741 [](ancestors = 215063741,215056489)

Features: row.SepalWidth.ConcatWith(row.SepalLength, row.PetalWidth, row.PetalLength)));


var pipeline = preprocess
Copy link
Contributor

@Zruty0 Zruty0 Sep 4, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

var pipeline = preprocess [](start = 11, length = 26)

I would couple it all into one pipeline:

var pipeline = Estimator.Create(...)
    .Append(...) // preprocessing
    .Append(...) // learner
    .Append(...) // post-processing
``` #Resolved


var predictions = model.Transform(dataReader.Read(testDataPath));
var evaluator = new MultiClassEvaluator(env);
var metrics = evaluator.Evaluate(predictions);
Copy link
Contributor

@Zruty0 Zruty0 Sep 4, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe fold these 2 lines into one? #Closed

var pipeline = preprocess.
Append(row => row.Label.TrainLinearClassification(row.Features));

var (trainData, testData) = CrossValidator.TrainTestSplit(env, data: data, trainFraction: 0.7);
Copy link
Contributor

@Zruty0 Zruty0 Sep 4, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

var (trainData, testData) = CrossValidator.TrainTestSplit(env, data: data, trainFraction: 0.7); [](start = 12, length = 95)

Let's have Tom comment: does this sound like it'll break us out of pigsty? So pipeline.Fit(trainData) is not possible? Or should we have CrossValidator.TrainTestSplit be pigsty-capable? #Closed

Copy link
Contributor

@TomFinley TomFinley Sep 4, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @Zruty0 , if we view it as desirable to have a method with signature (DataView<TShape> split1, DataView<TShape> split2) Split(IDataView<TShape> source, double ratio), assuredly nothing prevents us from doing so. But that's fairly obviously the case, so I wonder if I've missed the point of the question somehow? #Closed

dataPath);

// Load the data into the system.
var data = dataReader.Read(dataPath);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dataPath [](start = 39, length = 8)

this implies that our text reader will be an IDataReader<string>, right? Or string to be implicitly converted to MultiFileSource. I'm OK with the second idea, but is Tom?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @Zruty0 , I am OK with it I think. On multiple occassions I would have found an implicit conversion to IMultiStreamSource a useful simplifying instrument.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried and failed. The implicit ctor is not called if you pass dataPath to a method: I guess it's only for assignments


In reply to: 215094774 [](ancestors = 215094774)

[Fact]
public void SimpleIrisDescisionTrees()
{
var env = new TlcEnvironment(new SysRandom(0), verbose: true);
Copy link
Contributor

@Zruty0 Zruty0 Sep 4, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

new SysRandom(0), verbose: true [](start = 41, length = 31)

no need for these params, if they are not required #Resolved

var data = TextLoader.CreateReader(env, ctx => (
Label: ctx.LoadFloat(0),
Text: ctx.LoadText(1)),
dataPath, hasHeader: true).Read(dataPath);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is something that is slightly annoying me as I'm reading across the examples, and I wonder if anyone else feels similarly.

This .CreateReader(..., dataPath).Read(dataPath) seems vaguely tautological -- it probably isn't, but it seems like it is. While surely we must have DataReader be able to handle the case of different data, would there be benefit to having a DataReader somehow retain a reference to the TIn instance that it was initialized on, and so have a parameterless Read that just implicitly read over whatever it was initialized as?

Nonetheless, you could imagine this being something we would not want to do all the time -- if the initialization structure was some huge in-memory structure (e.g., the case where we are initializing and training a pipeline based on data in memory). But then we have to have a separate API to allow both cases. And if we don't want to do that always, perhaps it's best if we don't do it ever, so a user does not have to learn additional special cased idioms for what amounts to a fairly minor enhancement.

With that in mind, I think the code in its current form is probably fine (since there are some strong architectural reasons to disfavor caching, as we see above), but at the very least we will need to make sure this "idea" comes across somehow perhaps better than it does now, since at first glance this looks, frankly, kind of silly. I wonder at least if there's something we can do to make it seem, somehow, less silly.

var pipeline = preprocess.
Append(row => row.Label.TrainLinearClassification(row.Features));

var (trainData, testData) = CrossValidator.TrainTestSplit(env, data: data, trainFraction: 0.7);
Copy link
Contributor

@TomFinley TomFinley Sep 4, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm afraid I don't understand what is meant to be promoted here. Cross-validation suggests to me that there's some sort of loop, with independent featurization happening somewhere. And in cross-validation specifically, that helper function -- assuming we agree that it is necessary to have one -- would handle the splitting, according to the number of folds, not by some sort of trainFraction parameter. #Closed

Copy link
Contributor

@TomFinley TomFinley Sep 4, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the idea is just some sort of utility to split the dataset in two, we cannot name whatever that utility is, cannot have anything to do with cross-validation, since that already refers to a very specific way of splitting up data. #Closed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, okay, so not CrossValidator (I was just trying to follow the example in sklearn, but I'm not too fond of it). But still, the data instances will not be DataView<TPig>, they will be IDataView, unless we make TrainTestSplit work in pigsty-land, right? I suppose we should actually do this, so line 121 is possible


In reply to: 215061185 [](ancestors = 215061185)

Copy link
Contributor

@TomFinley TomFinley Sep 4, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, why wouldn't the instances be DataView<TPig>? So you feed in DataView<TPig>, and you get two (or even more, if we're ambitious) possibly stratified instances of DataView<TPig>. I see no barrier to that, but do you?

In any case, I think we should make it general policy that dropping out of pigsty is something the user must explicitly do... that is, without an .AsDynamic somewhere explicitly in the code the user themselves write, there should never be a situation where the user can unknowingly drop out of pigsty. #Closed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds good to me


In reply to: 215095002 [](ancestors = 215095002)


var predictions = model.Transform(testData);
var evaluator = new MultiClassEvaluator(env);
var metrics = evaluator.Evaluate(predictions);
Copy link
Contributor

@TomFinley TomFinley Sep 4, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is insufficient. You will have to tell it what you want to evaluate -- that is, you will need to have some sort of delegate to select both the label and the prediction that you have done here. In fact the fact that you've attempted to do without this has hidden a nasty flaw: your preidctions was the result of a model whose last estimator was this:

Append(row => row.PredictedLabel.KeyToValue());

You have only the predicted label, since you got rid of everything else with this simple projection.

One of the consequences of having these tuple-shape items is that we will need to use delegates, I think, to index into them. So there will be a delegate in evaluate to select the label, the predicted label, and probably the probabilities as well. #Resolved

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed in person, let's do
MultiClassEvaluator.Evaluate(predictions, ctx=>predictions.label, ctx => predictions.prediction), and make sure 'label' and 'prediction' are available in line 147


In reply to: 215062579 [](ancestors = 215062579)

Content: ctx.LoadText(2)),
dataPath, hasHeader: true).Read(dataPath);

var preprocess = data.Schema.MakeEstimator(row => (
Copy link
Contributor

@TomFinley TomFinley Sep 4, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what the purpose of these calls to Schema are. Surely if we had an extension method on any schema-shape bearing object, a MakeEstimator call would suffice, would it not? #Resolved

public void SimpleIrisDescisionTrees()
{
var env = new TlcEnvironment(new SysRandom(0), verbose: true);
string dataPath = "iris-data.txt";
Copy link
Member

@sfilipi sfilipi Sep 4, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

iris-data.txt [](start = 31, length = 13)

think it is iris.txt #Resolved

@sfilipi
Copy link
Member

sfilipi commented Sep 4, 2018

        string dataPath = "iris-data.txt";

i see iris.txt in the data folder #Closed


Refers to: test/Microsoft.ML.Tests/Scenarios/Api/AspirationalExamples.cs:32 in 575fe51. [](commit_id = 575fe51, deletion_comment = False)

var model = pipeline.Fit(data);

var predictions = model.Transform(dataReader.Read(testDataPath));
var evaluator = new MultiClassEvaluator(env);
Copy link
Contributor

@Zruty0 Zruty0 Sep 5, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

var evaluator = new MultiClassEvaluator(env); [](start = 9, length = 48)

remove #Resolved

// Convert string label to key.
label: row.label.Dictionarize(),
// Concatenate all features into a vector.
features: row.subject.Concat(row.content).TextFeaturizer()))
Copy link
Contributor

@Zruty0 Zruty0 Sep 5, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

features: row.subject.Concat(row.content).TextFeaturizer())) [](start = 16, length = 60)

indentation #Resolved

Copy link
Contributor

@Zruty0 Zruty0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@Ivanidzo4ka Ivanidzo4ka changed the title WIP Some pigsty examples Some pigsty examples Sep 5, 2018
@Ivanidzo4ka
Copy link
Contributor Author

        string dataPath = "iris-data.txt";

I saw you change this example in your code, so I leave it for you :)


In reply to: 418548540 [](ancestors = 418548540)


Refers to: test/Microsoft.ML.Tests/Scenarios/Api/AspirationalExamples.cs:32 in 575fe51. [](commit_id = 575fe51, deletion_comment = False)

dataPath);


var pipeline = data.MakeEstimator()
Copy link
Contributor

@GalOshri GalOshri Sep 5, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

data is not defined at this point. Perhaps line 91 should be moved up? #Resolved

row.label.TrainLinearClassification(row.features)));


var (trainData, testData) = CrossValidator.TrainTestSplit(env, data: dataReader.Read(dataPath), trainFraction: 0.7);
Copy link
Contributor

@GalOshri GalOshri Sep 5, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is TrainTestSplit coming from CrossValidator? This is a bit confusing. #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just a generic name which Pete suggest me to use.
I'm totally up to anything more reasonable, do you have a suggestion?


In reply to: 215109742 [](ancestors = 215109742)

Copy link
Contributor

@GalOshri GalOshri Sep 5, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps something like "ModelValidation"?
Scikit-Learn has train_test_split() in model_selection #Resolved

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we settled on extension to DataView<TPig>


In reply to: 215140323 [](ancestors = 215140323)

content: ctx.LoadText(2)),
dataPath, hasHeader: true);

var preprocess = data.MakeEstimator().
Copy link
Contributor

@GalOshri GalOshri Sep 5, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think preprocess should likely be renamed to pipeline based on line 148. #Resolved

sepalLength: ctx.LoadFloat(2),
petalWidth: ctx.LoadFloat(3),
petalLength: ctx.LoadFloat(4)),
dataPath);
Copy link
Contributor

@GalOshri GalOshri Sep 5, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice if we can make dataPath the second parameter here. It is easily missed given the ctx is quite involved. @Zruty0 mentioned this is because it is an optional parameter but just thought I'd share in case there are any ideas. #Resolved

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's OK to actually have a version of the ctor, where dataPath is a required 2nd argument...


In reply to: 215140669 [](ancestors = 215140669)

// Convert string label to key.
label: row.label.Dictionarize(),
// Concatenate all features into a vector.
features: row.sepalWidth.ConcatWith(row.sepalLength, row.petalWidth, row.petalLength)))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was likely a separate discussion on this, but why was ConcatWith chosen instead of Concat (used in LINQ)? This applies to all the estimators.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you have a vector column foo, then foo.Concat() can be interpreted as 'concatenate all slots inside foo'. foo.ConcatWith doesn't allow for this confusion


In reply to: 215141225 [](ancestors = 215141225)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I think given the similarities with LINQ there is a benefit to staying consistent, but agree that removing confusion is better.


// Load the data into the system.
var data = dataReader.Read(dataPath);
var model = pipeline.Fit(data);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This example doesn't gain much from having dataReader.Read(dataPath), right? Would it make sense for pipeline.Fit() to take a dataPath? This would be more similar to Scikit-Learn, but perhaps is not in sync with other concepts here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if your model is fitted using a data path, then it can only work from another data path (and not individual examples, for example)


In reply to: 215141509 [](ancestors = 215141509)

[Fact]
public void TwitterSentimentAnalysis()
{
var env = new ConsoleEnvironment();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know we generally need env, but does it add any value in simple scenarios where it doesn't do anything?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is assuredly doing something. You're using it to create the text loader, and from there all subsequent components spawned from that object, which is to say, all of them.

Copy link
Contributor

@CESARDELATORRE CESARDELATORRE Sep 5, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TomFinley - The ConsoleEnvironment looks not clear to me. If we need the TlcEnvironment (ConsoleEnvironment) in most cases, in order to create a DataReader:

var dataReader = TextLoader.CreateReader(env, dataPath,...

Does it make sense to call it ConsoleEnvironment if I run the build/training code from a different type of application like an ASP.NET Core Web app (or a WinForms, WPF if using .NET Framework)?

Why don't we simply call it "Environment" or "MLNetEnvironment"?

Or is the TlcEnvironment (ConsoleEnvironment) only used in Console apps and if running code to build/train a model in a non-console app I don't need the "env" object in order to create a DataReader?


var pipeline = dataReader.MakeEstimator()
.Append(row => (
label: row.label,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way to keep all columns without writing them out individually? For example, I have 1000 categorical columns and 1 text column. I would to apply TextFeaturizer to the text column but keep the 1000 categorical columns.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For various reasons a valid schema shape is not only a value-tuple of PipelineColumn objects, but also recursively defined structures of value-tuples... that is, you could have a value-tuple of value-tuple of value-tuple, etc. This means that in this expression it would have been totally legal to say, prev: row, and then you just bring them along for free, at the cost of having to work through that prev identifier. This is discussed in #632, in particular, see the section "ValueTuples of PipelineColumns," and in particular the sentence "it provides a convenient way to just bring along all the inputs, if we wanted to do so, by just assigning the input tuple itself as an item in the output tuple."

features: row.text.TextFeaturizer()))
.Append(row => (
label: row.label,
row.label.TrainLinearClassification(row.features)));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should TextFeaturizer be FeaturizeText for consistency? TrainLinearClassification -> PredictWithLinearClassifier?

Copy link
Contributor

@TomFinley TomFinley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Ivanidzo4ka . I'm not sure if @GalOshri has further blocking comments, or if we might resolve them later?

Copy link
Contributor

@GalOshri GalOshri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, don't block on my existing comments. Let's consider them later.

@Ivanidzo4ka Ivanidzo4ka merged commit 7835cd2 into dotnet:master Sep 6, 2018
@ghost ghost locked as resolved and limited conversation to collaborators Mar 29, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

API Issues pertaining the friendly API

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants