GitHub - jcwoods/csvprof: utility to profile CSV files

This utility generates statistics for simple delimited data files. It does not (yet) handle quoted data or data fields which themselves contain delimiters (eg, quoted CSV data).

To build

you must have libcsv3 available. Download it here: https://github.com/robertpostill/libcsv
In the Makefile, update the location of libcsv (libraries, headers) to reflect where the library was installed on your host.
run "make".

To run: csvprof [-d ] [-h ] [-v] [inputFile]

    delimiter - the field delimiter (a single character).  If not given,
                default is a pipe character ('|').

    hasHeader - must be [0|1].  If 1, the first row will be used as a
                header (contains field names).  If 0, fields will
                simply be enumerated.  Default is 0.

    verbose   - if given, verbose output will be generated.  This
                consists of some additional run time statistics, but
                does not generate additional analytical information.

    inputFile - the name of the file to be analyzed.  If not given,
               standard input will be used.

Output:

Given data consisting of:

C|JULIE|LEW|19360222||F|2591 MCCRACKEN STREET|||MUSKEGON|MI|49441
C|JULIE|LEW|19360222||F|144 CROSS ST.|||LAKEWOOD|NJ|87010
TONIA|SHERYL|DEKKER|19950218||F|914 GRUBB DRIVE|||MESQUITE|TX|75149
KENNETH|F|MARTIN|19781008||M|1 PETIT PL|||ISLAND PARK|NY|11558
KENNETH|F|MARTIN|19781008||M|7100 CESSNA DR|||GREENSBORO|NC|27409
J|DALE|YOUNG|19840306||M|1913 HWY 163|||RICEVILLE|TN|37370
J|DALE|YOUNG|19840306||M|FM ROAD 990|||SIMMS|TX|75574
REBECCA||NIX|19430125||F|3485 S POINT PRAIRIE RD|% JOYCE ZOLLMANN||WENTZVILLE|MO|63385
REBECCA||NIX|19430125||F|1919 E 13TH ST|LABLE & CO||CLEVELAND|OH|44114
REBECCA||NIX|19430125||F|1 ADA PKWY STE 100|||IRVINE|CA|92618
REBECCA||NIX|19430125||F|1391 170TH AVE NW|WILLIAM ZIEGLER GEN PTR||GEORGETOWN|MN|56546
C|B|BATLEY|19990606||M|26564 WESTGATE RD|||NEOLA|IA|51559
C|B|BATLEY|19990606||M|415 W. 6TH STREET|||VANCOUVER|WA|98660
SHONA||ABUGHAZALEH|19890110||F|2204 FAIRWAY DR|||DODGE CITY|KS|67801
SHONA||ABUGHAZALEH|19890110||F|805 W AIRLINE AVE|||GASTONIA|NC|28052
PAMELA|E|SNOW|19470428||F|PO BOX 847|||BERNALILLO|NM|87004
J|TERA|BRITTON|19921103||F|3343 PAUL DAVIS DRIVE|||MARINA|CA|93933
COLLEEN|S|STRANGE|19600411||F|PO BOX 84|||HAGUE|VA|22469
COLLEEN|S|STRANGE|19600411||F|2318 N COVINGTON ST|||WICHITA|KS|67205

The output would look like this:

jwoods@pecan:trunk$ ./csvprof sample.csv 
Total of 19 records read
    Record Length: 42 (min), 76 (max), 53 (avg) bytes:
    Field[0]:     1 (min),     7 (max),     4 (avg),    19 (count),     0 (empty)
        HAS_ALPHA HAS_UPPER 
    Field[1]:     0 (min),     6 (max),     1 (avg),    19 (count),     6 (empty)
        HAS_ALPHA HAS_UPPER 
    Field[2]:     3 (min),    11 (max),     5 (avg),    19 (count),     0 (empty)
        HAS_ALPHA HAS_UPPER 
    Field[3]:     8 (min),     8 (max),     8 (avg),    19 (count),     0 (empty)
        HAS_DIGIT 
    Field[4]:     0 (min),     0 (max),     0 (avg),    19 (count),    19 (empty)
    Field[5]:     1 (min),     1 (max),     1 (avg),    19 (count),     0 (empty)
        HAS_ALPHA HAS_UPPER 
    Field[6]:     9 (min),    23 (max),    15 (avg),    19 (count),     0 (empty)
        HAS_ALPHA HAS_UPPER HAS_DIGIT HAS_SPACE HAS_PUNCT 
    Field[7]:     0 (min),    23 (max),     2 (avg),    19 (count),    16 (empty)
        HAS_ALPHA HAS_UPPER HAS_SPACE HAS_PUNCT 
    Field[8]:     0 (min),     0 (max),     0 (avg),    19 (count),    19 (empty)
    Field[9]:     5 (min),    11 (max),     8 (avg),    19 (count),     0 (empty)
        HAS_ALPHA HAS_UPPER HAS_SPACE 
    Field[10]:     2 (min),     2 (max),     2 (avg),    19 (count),     0 (empty)
        HAS_ALPHA HAS_UPPER 
    Field[11]:     5 (min),     5 (max),     5 (avg),    19 (count),     0 (empty)
        HAS_DIGIT

From this output, we can tell:

fields 0, 2, 3, 5, 6, 9, 10, and 11 are never empty
fields 4 and 8 are always empty
There are no lower case alphabetic characters in the data in any field (may be important if checking that data is properly standardized).
fields 3 and 11 are always numeric
fields 3, 5, and 10 are fixed length.
fields 6, 7, and 9 contain whitespace (representing multiple tokens or words).
All other fields contain a single token

If you know what your input data is supposed to contain, you can very quickly confirm that the content matches your specification. If you're doing discovery (perhaps for ETL), this will tell you how to configure your input fields.

ToDo: this program captures a lot more information than is currently being output. At some point, it will produce graphical representations of byte values for each offset within each field. These graphical representations should allow differences in fields, record structure, or files to be detected quickly.

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
csvprof.c		csvprof.c

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages