Skip to content

AON-PRISMA/feaas

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Private Record Linkage

This repository contains software for computing record linkage among multiple clients while preserving user data privacy.

  • Use pre-trained large language models with contrastive fine-tuning to encode client records into fixed-length embeddings.
  • Clients encrypt and send the embeddings to the server.
  • Server applies Berlekamp-Welch algorithm to efficiently compute the matching records without learning additional information.
  • Client encoding and server matching using Berlekamp-Welch can be done using 16 or 32 bit field arithmetic. Switching is done using a flag.
  • Server side field arithmetic uses lookup tables, which are auto-generated when running the server code (a one-time process). Tables can be generated beforehand to skip the auto-generation delay using a generation command.
  • Server returns the results back to clients.

Requirement

  • Python=3.9
  • Go=1.19
  • System requirements:
    • Server:
      • RAM: The server requires 64 GB memory for the 32-bit version.
      • CPU: To take advantage of parallel computation, machines with multicore processors are encouraged to be used.
    • Client:
      • GPU: GPUs can be used to speed up the encoding process, but are not required.

Install and compile

  • Clone the repo:
git clone https://github.com/AON-PRISMA/feaas.git
cd feaas

Client

  • Create a Python environment and install Python 3.9

  • Install required libraries:

pip install -r requirements.txt
  • Build the Go component on the client-side:
cd client/share_gen/
# for 32-bit version
go build -buildmode=c-shared -o _share_gen.so

# for 16-bit version
go build -tags 16 -buildmode=c-shared -o _share_gen.so

Server

  • Install required libraries:
cd ../../server/
go get
  • Build look-up tables:
# for 32-bit version
go run tablegen.go -num_bits=32

# for 16-bit version
go run tablegen.go -num_bits=16
  • Build server code:
# for 32-bit version
go build -o server server.go

# for 16-bit version
go build -tags 16 -o server server.go
  • To build the client and server using one command (first cd into feaas/server/):
sh build.sh [16|32]

Usage

Server

cd feaas/server/
./server -num_c=[NUM_C] -rep=[REP] -lsh_rep=[LSH_REP] -tls=[TLS] -addr=[ADDR] -output=[OUTPUT]

arguments:
    NUM_C:      total number of participating clients
    REP:        number of repetitions for the encryption to improve security
    LSH_REP:    repetition param for lsh; negative num means no use of lsh
    TLS:        whether to enable tls; 1 or 0
    ADDR:       ip address the server is listening (default value: 127.0.0.1:8000) 
    OUTPUT:     the output csv path for storing matching results

Client

1. Determine the threshold and generate data encodings

cd feaas/encoding
python encode.py --load_config [CONFIG_FILE] -d [DATA_PATH] --data_name [DATA_NAME] -r [RATIO] -n [MODEL_NAME] -p [CKPT_PATH] -i [NUM_INTERVAL] --bs [BS] -s [SAVE_PATH]

arguments:
    CONFIG_FILE:   the config file containing default command-line arguments; arguments are overridden if provided explicitly
    DATA_PATH:     path of the data file; expect a .csv file
    DATA_NAME:     dataset name; one of ['ag', 'febrl4', 'abt_buy']  
    RATIO:         ratio (num of negatives /num of positives) of the dataset for selecting the threshold
    MODEL_NAME:    model architecture name for inference
    CKPT_PATH:     path of the model checkpoint 
    NUM_INTERVAL:  number of intervals for quantization (converting embeddings to integer vectors)
    BS:            batch size used during inference
    SAVE_PATH:     path of the encodings output

output:
    Threshold for decision-making in later steps

2. Encrypt and send

cd feaas/client/
python main.py --load_config [CONFIG_FILE] --cid [CID] --server_addr [SERVER_ADDR] --client_addr [CLIENT_ADDR] --tls_file [TLS_FILE] --data_path [DATA_PATH]  --th [TH] --rep [REP] --lsh_rep [LSH_REP] --lsh_num_ind [LSH_NUM_IND] --lsh_num_bin [LSH_NUM_BIN]
--num_bits [NUM_BITS]

arguments:
    CONFIG_FILE:   the config file containing default command-line arguments; arguments are overridden if provided explicitly
    CID:           current client id, has to be non-repetitive and starts from 1
    SERVER_ADDR:   ip address to connect to the server (default value: 127.0.0.1:8000)
    CLIENT_ADDR:   ip address used between clients (default value: 127.0.0.1:7000)
    TLS_FILE:      path to the tls file for verification; if not provided, non-tls is used
    DATA_PATH:     dataset path, expect a .npy file
    TH:            threshold for decision-making (output from Step 1)
    REP:           number of repetitions for encryption to improve security
    LSH_REP:       repetition param for lsh; negative num means no use of lsh
    LSH_NUM_IND:   number of selected indices for lsh
    LSH_NUM_BIN:   number of bins for lsh
    NUM_BITS:      number of bits of the protocol; the clients and server should use the same NUM_BITS version
  • Note: Client 1 with (cid=1) serves as the coordinator among clients and needs to be started before any other clients.

Examples

The commands below perform an end-to-end run of the 32-bit version on the Amazon-Google dataset. For demonstration purposes, two clients communicate with a server locally. The matching output is stored in server/result.csv.

  • Compile the code:
cd feaas/server/
sh build.sh 32

Server

  • Start the server and wait for clients' connection. Note: the initialization of the server might take a few minutes to complete for the 32-bit version.
./server -num_c=2 -rep=1 -lsh_rep=10 -tls=0 -output=result.csv

Client 1

  • Determine the threshold from a given dataset and generate data embeddings from querying the pre-trained model:
cd feaas/encoding/
python encode.py --load_config ../data/amazon_google/encoding_config.json -d ../data/amazon_google/test_df_1.csv -s ../client/embed_1.npy
  • Start the client application:
    • Note: Clients may need to explicitly set the num_bits arguments or modify the config.json file to match the server's version.
cd ../client/
python main.py --cid 1 --load_config ../data/amazon_google/config.json --data_path embed_1.npy

Client 2

  • Open up a new terminal. This is similar to client 1, but we use client 2's data records.
cd feaas/encoding/
python encode.py --load_config ../data/amazon_google/encoding_config.json -d ../data/amazon_google/test_df_2.csv -s ../client/embed_2.npy
cd ../client/
python main.py --cid 2 --load_config ../data/amazon_google/config.json --data_path embed_2.npy

Datasets

The example dataset is downloaded and adapted from: https://dbs.uni-leipzig.de/research/projects/object_matching/benchmark_datasets_for_entity_resolution

License

See the LICENSE file for license rights and limitations.

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors