IBM Synthetic Data Generator for Itemsets and Sequences
Type make, which will create the executable file 'gen'.
Type ./gen -help for general help.
For itemsets, type ./gen lit -help
For sequences, type ./gen seq -help
These datasets mimic the transactions in a retail environment, where people tend to buy sets of items together. Each such set is a so-called potential maximal frequent itemset. The sizes of these maximal itemsets are clustered around a mean, with a few long itemsets. A transaction may contain one or more of these frequent sets. Transaction sizes are also clustered around a mean, but a few transactions may contain many items.
Let D denote the number of transactions, T the average transaction size, I the average size of a maximal potentially frequent itemset, L the number of maximal potentially frequent itemsets, and N the number of items.
The data is generated using the following procedure. We first generate L maximal itemsets of average size I by choosing from the N items. We then generate D transactions of average size T by choosing from the L maximal itemsets.
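The two-step procedure above can be sketched in Python as follows. This is only an illustration under simplifying assumptions, not the code used by 'gen': sizes are drawn from a Poisson distribution, maximal itemsets are picked uniformly at random, and the names poisson and generate_itemsets are made up for this sketch.

```python
import math
import random


def poisson(rng, mean):
    """Draw a Poisson-distributed integer (Knuth's method; fine for small means)."""
    limit, k, p = math.exp(-mean), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def generate_itemsets(D, T, I, L, N, seed=0):
    """Return D transactions built from L maximal itemsets over N items."""
    rng = random.Random(seed)

    # Step 1: L maximal itemsets whose sizes cluster around I.
    patterns = []
    for _ in range(L):
        size = min(N, max(1, poisson(rng, I)))
        patterns.append(rng.sample(range(N), size))

    # No transaction can hold more items than the patterns cover in total.
    universe = set().union(*patterns)

    # Step 2: D transactions whose sizes cluster around T.
    transactions = []
    for _ in range(D):
        target = min(len(universe), max(1, poisson(rng, T)))
        items = set()
        # Assumption: whole itemsets are added until the target size is reached.
        while len(items) < target:
            items.update(rng.choice(patterns))
        transactions.append(sorted(items))
    return transactions


if __name__ == "__main__":
    for txn in generate_itemsets(D=5, T=10, I=4, L=50, N=1000, seed=1):
        print(txn)
```

Because whole itemsets are added, a transaction in this sketch can overshoot its target size by a few items.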
Type: ./gen lit -help
for all the parameters to generate itemset datasets:
Command Line Options:
-ncust number_of_customers (in 1000's) (default: 100)
-slen avg_trans_per_customer (default: 10)
-tlen avg_items_per_transaction (default: 2.5)
-nitems number_of_different_items (in '000s) (default: 10000)
-rept repetition-level (default: 0)
-seq.npats number_of_seq_patterns (default: 5000)
-seq.patlen avg_length_of_maximal_pattern (default: 4)
-seq.corr correlation_between_patterns (default: 0.25)
-seq.conf avg_confidence_in_a_rule (default: 0.75)
-lit.npats number_of_patterns (default: 25000)
-lit.patlen avg_length_of_maximal_pattern (default: 1.25)
-lit.corr correlation_between_patterns (default: 0.25)
-lit.conf avg_confidence_in_a_rule (default: 0.75)
-fname (write to filename.data and filename.pat)
-ascii (Write data in ASCII format; default: False)
-version (to print out version info)
An example run can be:
./gen lit -ntrans 100 -tlen 10 -nitems 1 -npats 1000 -patlen 4 -fname T10I4D100K -ascii
This will generate a data file named "T10I4D100K.data".
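The file name in the example encodes the parameters T (average transaction size), I (average maximal itemset size) and D (number of transactions, in thousands). The Python helpers below rebuild that name and the matching command line. They are a convenience sketch: the function names are made up here, and only the flags that appear in the example run are used.

```python
def itemset_dataset_name(tlen, patlen, ntrans_thousands):
    """Build a name such as T10I4D100K from the generator parameters."""
    return f"T{tlen:g}I{patlen:g}D{ntrans_thousands:g}K"


def itemset_command(ntrans_thousands, tlen, nitems_thousands, npats, patlen):
    """Return the argument list for an ASCII itemset run like the example above."""
    name = itemset_dataset_name(tlen, patlen, ntrans_thousands)
    return [
        "./gen", "lit",
        "-ntrans", f"{ntrans_thousands:g}",   # number of transactions, in thousands
        "-tlen", f"{tlen:g}",                 # average items per transaction
        "-nitems", f"{nitems_thousands:g}",   # number of different items, in thousands
        "-npats", f"{npats:g}",               # number of patterns
        "-patlen", f"{patlen:g}",             # average length of a maximal pattern
        "-fname", name,
        "-ascii",
    ]


if __name__ == "__main__":
    print(" ".join(itemset_command(100, 10, 1, 1000, 4)))
```

The returned list can be passed to subprocess.run once 'gen' has been built in the current directory.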
The sequence generator produces datasets that mimic real-world transactions, where people buy a sequence of sets of items. Some customers may buy only some of the items from a sequence, or they may buy items from multiple sequences. The input-sequence size and the event size are each clustered around a mean, with a few input-sequences and events having many elements.
The datasets are generated using the following process. First, NI maximal events of average size I are generated by choosing from N items. Then, NS maximal sequences of average size S are created by assigning some of the NI events to each sequence. Next, a customer (or input-sequence) with an average of C transactions (or events) is created, and some of the NS sequences are assigned to different customer elements, respecting the average transaction size T. The generation stops when D input-sequences have been generated. Default values are NS = 5000, NI = 25000 and N = 10000.
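The process above can be sketched in Python as follows. This is a simplified illustration, not the code used by 'gen': sizes are drawn from a Poisson distribution, events and sequences are picked uniformly at random, and a customer is filled until it holds roughly C * T items. The names poisson and generate_sequences are made up for this sketch.

```python
import math
import random


def poisson(rng, mean):
    """Draw a Poisson-distributed integer (Knuth's method; fine for small means)."""
    limit, k, p = math.exp(-mean), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def generate_sequences(D, C, T, S, I, NS, NI, N, seed=0):
    """Return D customers, each a list of transactions (sorted item lists)."""
    rng = random.Random(seed)

    # NI maximal events of average size I, chosen from N items.
    events = [
        sorted(rng.sample(range(N), min(N, max(1, poisson(rng, I)))))
        for _ in range(NI)
    ]
    # NS maximal sequences of average length S, each an ordered list of events.
    sequences = [
        [rng.choice(events) for _ in range(max(1, poisson(rng, S)))]
        for _ in range(NS)
    ]

    customers = []
    for _ in range(D):
        n_trans = max(1, poisson(rng, C))
        transactions = [set() for _ in range(n_trans)]
        attempts = 0
        # Assumption: embed sequences until the customer has about n_trans * T items.
        while sum(len(t) for t in transactions) < n_trans * T and attempts < 100:
            attempts += 1
            seq = rng.choice(sequences)
            k = min(len(seq), n_trans)  # a long sequence is only partly embedded
            # Increasing transaction slots keep the events in sequence order.
            slots = sorted(rng.sample(range(n_trans), k))
            for slot, event in zip(slots, seq):
                transactions[slot].update(event)
        customers.append([sorted(t) for t in transactions if t])
    return customers


if __name__ == "__main__":
    data = generate_sequences(D=3, C=10, T=2.5, S=4, I=1.25, NS=20, NI=50, N=100)
    for cid, customer in enumerate(data, start=1):
        print(cid, customer)
```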
Type: ./gen seq -help
for all the parameters to generate sequence datasets:
Command Line Options:
-ncust number_of_customers (in 1000's) (default: 100)
-slen avg_trans_per_customer (default: 10)
-tlen avg_items_per_transaction (default: 2.5)
-nitems number_of_different_items (in '000s) (default: 10000)
-rept repetition-level (default: 0)
-seq.npats number_of_seq_patterns (default: 5000)
-seq.patlen avg_length_of_maximal_pattern (default: 4)
-seq.corr correlation_between_patterns (default: 0.25)
-seq.conf avg_confidence_in_a_rule (default: 0.75)
-lit.npats number_of_patterns (default: 25000)
-lit.patlen avg_length_of_maximal_pattern (default: 1.25)
-lit.corr correlation_between_patterns (default: 0.25)
-lit.conf avg_confidence_in_a_rule (default: 0.75)
-fname (write to filename.data and filename.pat)
-ascii (Write data in ASCII format; default: False)
-version (to print out version info)
An example run can be:
./gen seq -ncust 200 -fname C10T2.5S4I1.25D200K -ascii
This will generate four files:
[fname].data -- the actual data file
[fname].conf -- configuration info
[fname].pat -- the embedded patterns
[fname].ntpc -- info on number of trans per customer (ignore this file)
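The file name in the example appears to encode the parameters C, T, S, I and D (in thousands) from the description above. The Python helpers below rebuild such a name and list the four files to expect for it. The function names are made up for this sketch, and the letter-to-parameter mapping is inferred from the example and the default values.

```python
def sequence_dataset_name(slen, tlen, seq_patlen, lit_patlen, ncust_thousands):
    """Build a name such as C10T2.5S4I1.25D200K from the generator parameters."""
    return (
        f"C{slen:g}T{tlen:g}S{seq_patlen:g}"
        f"I{lit_patlen:g}D{ncust_thousands:g}K"
    )


def output_files(fname):
    """Map each output kind to the file name listed above for a given -fname."""
    return {
        "data": f"{fname}.data",   # the actual data file
        "conf": f"{fname}.conf",   # configuration info
        "pat": f"{fname}.pat",     # the embedded patterns
        "ntpc": f"{fname}.ntpc",   # transactions per customer (can be ignored)
    }


if __name__ == "__main__":
    name = sequence_dataset_name(10, 2.5, 4, 1.25, 200)
    for kind, path in output_files(name).items():
        print(kind, path)
```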