rcrupp/search

CSV to OpenSearch Indexer

A Go application that reads CSV files, converts each row to XML format, optionally parses them with Apache Tika, and indexes the documents to OpenSearch for searchable storage.

Features

  • CSV Parsing: Reads CSV files with header rows and creates a map for each data row
  • XML Generation: Converts each CSV row into a separate XML file
  • Tika Integration: Uses Apache Tika server to parse and extract content from XML files
  • OpenSearch Indexing: Indexes parsed documents to OpenSearch for full-text search
  • Bulk Operations: Supports bulk indexing for better performance
  • Docker Support: Includes Docker Compose setup for OpenSearch and Tika services

Prerequisites

  • Go 1.21 or later
  • Docker and Docker Compose
  • Make (optional, for using Makefile commands)

Project Structure

search/
├── cmd/
│   └── main.go              # Main application entry point
├── internal/
│   ├── csvparser/           # CSV parsing module
│   │   └── parser.go
│   ├── xmlconverter/        # XML conversion module
│   │   └── converter.go
│   ├── tika/                # Tika client module
│   │   └── client.go
│   └── opensearch/          # OpenSearch indexer module
│       └── indexer.go
├── data/                    # Sample data directory
│   └── sample.csv
├── output/                  # Generated XML files directory
├── docker-compose.yml       # Docker services configuration
├── go.mod                   # Go module definition
├── Makefile                 # Build and run commands
└── README.md               # This file

Quick Start

1. Clone and Setup

# Navigate to the project directory
cd search

# Download dependencies
go mod download

2. Start Services

Start OpenSearch and Tika using Docker Compose:

# Using Make
make docker-up

# Or using Docker Compose directly
docker-compose up -d

Wait about 10-15 seconds for services to be ready. You can verify:

# Check OpenSearch
curl http://localhost:9200

# Check Tika
curl http://localhost:9998/version

3. Run the Application

Process the sample CSV file:

# Using Make
make run

# Or using Go directly
go run cmd/main.go -csv data/sample.csv

# Or build and run
go build -o bin/csv-indexer cmd/main.go
./bin/csv-indexer -csv data/sample.csv

4. Access Your Data

Once the run completes, the documents live in the csv-documents index (by default) and can be queried through the OpenSearch API at http://localhost:9200 (see API Endpoints below).
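Indexed rows can then be queried through OpenSearch's _search API. A minimal Go sketch of building a simple match query; the `content` field name is an assumption about the indexed document shape, so adjust it to your mapping:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// searchRequest builds a POST to the index's _search endpoint with a
// simple match query. The "content" field is an assumption about how
// documents are indexed, not something the project guarantees.
func searchRequest(baseURL, index, query string) (*http.Request, error) {
	body, err := json.Marshal(map[string]any{
		"query": map[string]any{
			"match": map[string]string{"content": query},
		},
	})
	if err != nil {
		return nil, err
	}
	url := fmt.Sprintf("%s/%s/_search", baseURL, index)
	req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	req, err := searchRequest("http://localhost:9200", "csv-documents", "New York")
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL) // PUT together: POST http://localhost:9200/csv-documents/_search
	// http.DefaultClient.Do(req) would execute it against a running cluster.
}
```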

Usage

Command Line Options

go run cmd/main.go [options]
Option       Description                                   Default
-csv         Path to the CSV file to process (required)    -
-output      Directory to store XML files                  output
-index       OpenSearch index name                         csv-documents
-opensearch  OpenSearch URL                                http://localhost:9200
-tika        Tika server URL                               http://localhost:9998
-bulk        Use bulk indexing                             true
-skip-tika   Skip Tika parsing and index XML directly      false
-help        Show help message                             -

Examples

# Process a CSV file with default settings
go run cmd/main.go -csv data/sample.csv

# Use a custom index name
go run cmd/main.go -csv data/sample.csv -index my-custom-index

# Skip Tika parsing (faster but less content extraction)
go run cmd/main.go -csv data/sample.csv -skip-tika

# Process without bulk indexing (slower but shows individual progress)
go run cmd/main.go -csv data/sample.csv -bulk=false

Makefile Commands

The project includes a Makefile for common operations:

make help             # Show all available commands
make docker-up        # Start Docker services
make docker-down      # Stop Docker services
make docker-clean     # Stop services and remove volumes
make build            # Build the application
make run              # Run with sample data
make run-skip-tika    # Run without Tika parsing
make test-opensearch  # Test OpenSearch connection
make test-tika        # Test Tika connection
make search           # Interactive search in indexed documents
make list-indices     # List all OpenSearch indices
make delete-index     # Delete the csv-documents index
make clean            # Clean generated files
make full-run         # Complete setup and run

How It Works

  1. CSV Parsing: The application reads your CSV file and creates a map for each row using the headers as keys.

  2. XML Generation: Each row is converted to an XML document with sanitized field names:

    <?xml version="1.0" encoding="UTF-8"?>
    <record>
      <name>John Doe</name>
      <age>28</age>
      <city>New York</city>
      ...
    </record>
  3. Tika Parsing (optional): XML files are sent to the Tika server for content extraction and metadata parsing.

  4. OpenSearch Indexing: Documents are indexed with:

    • Full text content
    • Original CSV data as metadata
    • File paths
    • Timestamps
  5. Search: Documents become searchable through OpenSearch's full-text search capabilities.

CSV File Format

Your CSV file should:

  • Have a header row as the first line
  • Use comma separators
  • Follow standard CSV escaping rules

Example:

name,age,city,occupation,email
John Doe,28,New York,Software Engineer,[email protected]
Jane Smith,32,San Francisco,Product Manager,[email protected]
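A file in this format can be read with Go's encoding/csv package, using the header row as map keys. A minimal sketch of the parsing step (not the project's actual parser):

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// parseCSV reads CSV data with a header row and returns one map per data row.
func parseCSV(data string) ([]map[string]string, error) {
	r := csv.NewReader(strings.NewReader(data))
	records, err := r.ReadAll() // enforces standard CSV quoting/escaping
	if err != nil {
		return nil, err
	}
	if len(records) == 0 {
		return nil, fmt.Errorf("empty CSV")
	}
	headers := records[0]
	rows := make([]map[string]string, 0, len(records)-1)
	for _, rec := range records[1:] {
		row := make(map[string]string, len(headers))
		for i, h := range headers {
			row[h] = rec[i]
		}
		rows = append(rows, row)
	}
	return rows, nil
}

func main() {
	data := "name,age,city\nJohn Doe,28,New York\nJane Smith,32,San Francisco\n"
	rows, err := parseCSV(data)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(rows), rows[0]["name"]) // 2 John Doe
}
```

encoding/csv rejects rows whose field count differs from the header, which is one way a malformed file surfaces as an error rather than bad data.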

Docker Services

OpenSearch

  • URL: http://localhost:9200
  • Single-node setup with security disabled (for development)
  • Data persisted in Docker volume

OpenSearch Dashboards

  • URL: http://localhost:5601
  • Web interface for browsing and visualizing the indexed documents

Tika Server

  • URL: http://localhost:9998
  • Extracts text content and metadata from the generated XML files

Troubleshooting

Services Won't Start

# Check if ports are already in use
netstat -an | grep -E "9200|9998|5601"

# Check Docker logs
docker-compose logs

OpenSearch Out of Memory

Increase memory in docker-compose.yml:

environment:
  - "OPENSEARCH_JAVA_OPTS=-Xms1g -Xmx1g"
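In docker-compose.yml this setting belongs under the OpenSearch service's environment block; a sketch of the placement (the service name `opensearch` is an assumption about your compose file):

```yaml
services:
  opensearch:
    environment:
      # Raise the JVM heap from the default; keep -Xms and -Xmx equal.
      - "OPENSEARCH_JAVA_OPTS=-Xms1g -Xmx1g"
```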

Tika Connection Failed

  • Ensure Tika container is running: docker-compose ps
  • Wait longer for service startup
  • Use -skip-tika flag to bypass Tika parsing

No Documents Indexed

  • Check CSV file format and path
  • Verify OpenSearch is running
  • Check application output for error messages

Development

Running Tests

go test ./...

Code Formatting

go fmt ./...
go vet ./...

Adding Dependencies

go get <package>
go mod tidy

API Endpoints

OpenSearch Endpoints

  • Health check: GET http://localhost:9200
  • Index stats: GET http://localhost:9200/csv-documents/_stats
  • Search: POST http://localhost:9200/csv-documents/_search
  • Delete index: DELETE http://localhost:9200/csv-documents

Tika Endpoints

  • Version: GET http://localhost:9998/version
  • Parse: PUT http://localhost:9998/tika
  • Metadata: PUT http://localhost:9998/meta
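The Tika parse endpoint above can be called from Go with a plain PUT; the Accept header asks Tika to return extracted plain text. A minimal sketch:

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// tikaParseRequest builds a PUT to Tika's /tika endpoint with the raw
// document as the body; Accept: text/plain requests extracted text.
func tikaParseRequest(tikaURL, document string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodPut, tikaURL+"/tika", strings.NewReader(document))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/xml")
	req.Header.Set("Accept", "text/plain")
	return req, nil
}

func main() {
	doc := "<?xml version=\"1.0\"?><record><name>John Doe</name></record>"
	req, err := tikaParseRequest("http://localhost:9998", doc)
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL) // PUT http://localhost:9998/tika
	// http.DefaultClient.Do(req) would return the extracted text when
	// the Tika container is running.
}
```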

License

MIT

Contributing

  1. Fork the repository
  2. Create your feature branch
  3. Commit your changes
  4. Push to the branch
  5. Create a Pull Request

Support

For issues, questions, or suggestions, please open an issue on the project repository.
