Add delimiter option for reading CSV files for Feathr#307
Add delimiter option for reading CSV files for Feathr#307xiaoyongzhu merged 22 commits intofeathr-ai:mainfrom
Conversation
Signed-off-by: Chang Yong Lik <[email protected]>
2b1c0a6 to
ddb5643
Compare
c764d1e to
36fe0b3
Compare
Signed-off-by: Chang Yong Lik <[email protected]> (cherry picked from commit bc71fad93c08f6d06e40f7e289456c6a1b4d45e0)
Signed-off-by: Chang Yong Lik <[email protected]>
Signed-off-by: Chang Yong Lik <[email protected]>
Signed-off-by: Chang Yong Lik <[email protected]>
Signed-off-by: Chang Yong Lik <[email protected]>
Signed-off-by: Chang Yong Lik <[email protected]>
|
@ahlag thanks for the contribution! This PR looks good to me, but I'm not sure why the test fails. Spent a bit time to investigate and feel it might be caused by the newly added tests? |
|
@xiaoyongzhu |
|
Talked with @ahlag offline and asked him to run |
Signed-off-by: changyonglik <[email protected]>
|
@xiaoyongzhu TestFileFormat.scala val sqlContext = ss.sqlContext
sqlContext.setConf("spark.feathr.inputFormat.csvOptions.sep", "\t")FileFormat.scala val csvDelimiterOption = ss.sparkContext.getConf.get("spark.feathr.inputFormat.csvOptions.sep", ",") |
Hmm it's a bit weird. Is it possible to force set the delimiters? |
|
Look like there can only be one SparkContext |
|
I think I will try a new approach. Since an existing SparkContext cannot be edited nor can another one be created, I will test it with e2e by passing the config from the client. |
I did some research and found this answer: Maybe you can try sqlContext.sql? |
|
Ok, I'll give this a shot |
|
@xiaoyongzhu |
hangfei
left a comment
There was a problem hiding this comment.
Also update wiki that we support tsv.
src/main/scala/com/linkedin/feathr/offline/source/dataloader/BatchDataLoader.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/linkedin/feathr/offline/source/dataloader/BatchDataLoader.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/linkedin/feathr/offline/source/dataloader/BatchDataLoader.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/linkedin/feathr/offline/source/dataloader/BatchDataLoader.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/linkedin/feathr/offline/source/dataloader/hdfs/FileFormat.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/linkedin/feathr/offline/source/dataloader/hdfs/FileFormat.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/linkedin/feathr/offline/source/dataloader/hdfs/FileFormat.scala
Show resolved
Hide resolved
src/test/scala/com/linkedin/feathr/offline/util/TestSourceUtils.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/linkedin/feathr/offline/util/SourceUtils.scala
Outdated
Show resolved
Hide resolved
Signed-off-by: changyonglik <[email protected]>
Signed-off-by: changyonglik <[email protected]>
Signed-off-by: changyonglik <[email protected]>
880ea90 to
3ec3fdf
Compare
Signed-off-by: changyonglik <[email protected]>
3842194 to
7711253
Compare
Signed-off-by: changyonglik <[email protected]>
062d916 to
2d41211
Compare
Signed-off-by: changyonglik <[email protected]>
f2ecf06 to
0b01cc6
Compare
Signed-off-by: changyonglik <[email protected]>
78b3cbd to
b4e55fb
Compare
|
@xiaoyongzhu @hangfei |
|
@ahlag would you mind merge latest main and resolve the conflicts so that we can get this merged, thanks for your time! |
|
@xiaoyongzhu @blrchen |
* Added documentation Signed-off-by: Chang Yong Lik <[email protected]> * Added delimiter to CSVLoader Signed-off-by: Chang Yong Lik <[email protected]> (cherry picked from commit bc71fad93c08f6d06e40f7e289456c6a1b4d45e0) * Added delimiter to BatchDataLoader, FileFormat and SourceUtils Signed-off-by: Chang Yong Lik <[email protected]> * Added test case for BatchDataLoader Signed-off-by: Chang Yong Lik <[email protected]> * Added test case for FileFormat Signed-off-by: Chang Yong Lik <[email protected]> * Added test case for BatchDataLoader Signed-off-by: Chang Yong Lik <[email protected]> * Added test case and fixed indent Signed-off-by: Chang Yong Lik <[email protected]> * Passing failure Signed-off-by: changyonglik <[email protected]> * Removed unused imports from BatchDataLoader Signed-off-by: changyonglik <[email protected]> * Fixed test failures Signed-off-by: changyonglik <[email protected]> * Added release version Signed-off-by: changyonglik <[email protected]> * Removed tailing space Signed-off-by: changyonglik <[email protected]> * Removed wildcard imports Signed-off-by: changyonglik <[email protected]> * Paraphrased comments and docstring Signed-off-by: changyonglik <[email protected]> * Added DelimiterUtils Signed-off-by: changyonglik <[email protected]> * Refactored utils Signed-off-by: changyonglik <[email protected]> * Updated wiki to support both tsv and csv Signed-off-by: changyonglik <[email protected]> * Fixed spelling error Signed-off-by: changyonglik <[email protected]> * trigger GitHub actions Signed-off-by: Chang Yong Lik <[email protected]> Signed-off-by: changyonglik <[email protected]>
Signed-off-by: Chang Yong Lik [email protected]
Description
Resolves #241
How was this patch tested?
Tested locally with
sbtProgress Tracker
Does this PR introduce any user-facing changes?
Allows users to specify delimiters