Exposed ngram extraction options in TextFeaturizer#2911
Exposed ngram extraction options in TextFeaturizer#2911zeahmed merged 11 commits intodotnet:masterfrom
Conversation
Codecov Report
@@ Coverage Diff @@
## master #2911 +/- ##
==========================================
+ Coverage 71.82% 72.19% +0.36%
==========================================
Files 812 796 -16
Lines 142719 142020 -699
Branches 16092 16044 -48
==========================================
+ Hits 102513 102526 +13
+ Misses 35827 35115 -712
Partials 4379 4379
|
| /// The maximum number of grams to store in the dictionary, for each level of ngrams, | ||
| /// from 1 (in position 0) up to ngramLength (in position ngramLength-1) | ||
| /// </summary> | ||
| public int[] MaxNumTerms; |
There was a problem hiding this comment.
MaxNumTerms [](start = 25, length = 11)
@wschin calls it maximumNgramsCount would be nice to be consistent. #Resolved
So originally we had Factory for StopWordRemover, which I think I change to this boolean field. Refers to: src/Microsoft.ML.Transforms/Text/TextFeaturizingEstimator.cs:100 in d3e2841. [](commit_id = d3e2841, deletion_comment = False) |
| internal Column Columns; | ||
|
|
||
| [Argument(ArgumentType.AtMostOnce, HelpText = "Dataset language or 'AutoDetect' to detect language per row.", ShortName = "lang", SortOrder = 3)] | ||
| public Language Language = DefaultLanguage; |
There was a problem hiding this comment.
Language [](start = 28, length = 8)
Not related to your PR, but whole point of this thing is to set or autodetect language, we left autodetect outside of ML.NET so it's looks completely unnecessary here, but as I said in other comment, this is out of scope of this PR. #Resolved
| public Options() | ||
| { | ||
| WordFeatureExtractor = new WordBagEstimator.Options(); | ||
| CharFeatureExtractor = new WordBagEstimator.Options() { NgramLength = 3, AllLengths = false }; |
There was a problem hiding this comment.
Shouldn't these fields initialized to null by default? And then fallback to WordFeatureExtractorFactory / CharFeatureExtractorFactory whenever you try to use them and they are not defined. This way code flow from entrypoint / maml will be same as with API. #Resolved
There was a problem hiding this comment.
WordFeatureExtractor and CharFeatureExtractor both can be null. When they are null its means do not apply the respective transforms. So, there is no way to detect if null was set by user or its a default value.
Right now, WordFeatureExtractorFactory / CharFeatureExtractorFactory is preferred if the maml is used.
In reply to: 264465314 [](ancestors = 264465314)
| [Argument(ArgumentType.Multiple, HelpText = "Ngram feature extractor to use for characters (WordBag/WordHashBag).", ShortName = "charExtractor", NullName = "<None>", SortOrder = 12)] | ||
| public INgramExtractorFactoryFactory CharFeatureExtractor = new NgramExtractorTransform.NgramExtractorArguments() { NgramLength = 3, AllLengths = false }; | ||
| [Argument(ArgumentType.Multiple, Name = "CharFeatureExtractor", HelpText = "Ngram feature extractor to use for characters (WordBag/WordHashBag).", ShortName = "charExtractor", NullName = "<None>", SortOrder = 12)] | ||
| internal INgramExtractorFactoryFactory CharFeatureExtractorFactory = new NgramExtractorTransform.NgramExtractorArguments() { NgramLength = 3, AllLengths = false }; |
There was a problem hiding this comment.
INgramExtractorFactoryFactory [](start = 21, length = 29)
FactoryFactory :), is it possible just have name INgramExtractorFactory
#Resolved
There was a problem hiding this comment.
Good point but we already have a class name INgramExtractorFactory
and there is a relation between
INgramExtractorFactory and INgramExtractorFactoryFactory. I leave it for now. Maybe you can create an issue to track it if you think this needs to be renamed?
In reply to: 264465575 [](ancestors = 264465575)
| [Argument(ArgumentType.Multiple, HelpText = "Ngram feature extractor to use for words (WordBag/WordHashBag).", ShortName = "wordExtractor", NullName = "<None>", SortOrder = 11)] | ||
| public INgramExtractorFactoryFactory WordFeatureExtractor = new NgramExtractorTransform.NgramExtractorArguments(); | ||
| [Argument(ArgumentType.Multiple, Name = "WordFeatureExtractor", HelpText = "Ngram feature extractor to use for words (WordBag/WordHashBag).", ShortName = "wordExtractor", NullName = "<None>", SortOrder = 11)] | ||
| internal INgramExtractorFactoryFactory WordFeatureExtractorFactory = new NgramExtractorTransform.NgramExtractorArguments(); |
There was a problem hiding this comment.
INgramExtractorFactoryFactory [](start = 21, length = 29)
same here FactoryFactory :) #Resolved
| internal INgramExtractorFactoryFactory WordFeatureExtractorFactory = new NgramExtractorTransform.NgramExtractorArguments(); | ||
|
|
||
| public WordBagEstimator.Options WordFeatureExtractor; | ||
|
|
There was a problem hiding this comment.
Public field needs document. In addition, to internalize IFactory please follow the pattern Tom and I have built in this PR --- PR description of #2851 #Resolved
| internal INgramExtractorFactoryFactory CharFeatureExtractorFactory = new NgramExtractorTransform.NgramExtractorArguments() { NgramLength = 3, AllLengths = false }; | ||
|
|
||
| public WordBagEstimator.Options CharFeatureExtractor; | ||
|
|
There was a problem hiding this comment.
Public field needs document. In addition, to internalize IFactory please follow the pattern Tom and I have built in this PR --- PR description of #2851 #Resolved
|
I don't know how Refers to: src/Microsoft.ML.Transforms/Text/TextFeaturizingEstimator.cs:434 in e03104b. [](commit_id = e03104b, deletion_comment = False) |
|
|
||
| public Options() | ||
| { | ||
| WordFeatureExtractor = new WordBagEstimator.Options(); |
There was a problem hiding this comment.
WordFeatureExtractor [](start = 16, length = 20)
This will kill the initial value of WordFeatureExtractorFactory. How about remove new NgramExtractorTransform.NgramExtractorArguments() in internal INgramExtractorFactoryFactory WordFeatureExtractorFactory = new NgramExtractorTransform.NgramExtractorArguments();? A similar comment applies to CharFeatureExtractor.
#Resolved
This PR fixes #2802, fixes #2838 and partially addresses #838,.