rcrupp/search

CSV to OpenSearch Indexer

A Go application that reads CSV files, converts each row to XML format, optionally parses them with Apache Tika, and indexes the documents to OpenSearch for searchable storage.

Features

  • CSV Parsing: Reads CSV files with header rows and creates a map for each data row
  • XML Generation: Converts each CSV row into a separate XML file
  • Tika Integration: Uses Apache Tika server to parse and extract content from XML files
  • OpenSearch Indexing: Indexes parsed documents to OpenSearch for full-text search
  • Bulk Operations: Supports bulk indexing for better performance
  • Docker Support: Includes Docker Compose setup for OpenSearch and Tika services

Prerequisites

  • Go 1.21 or later
  • Docker and Docker Compose
  • Make (optional, for using Makefile commands)

Project Structure

search/
├── cmd/
│   └── main.go              # Main application entry point
├── internal/
│   ├── csvparser/           # CSV parsing module
│   │   └── parser.go
│   ├── xmlconverter/        # XML conversion module
│   │   └── converter.go
│   ├── tika/                # Tika client module
│   │   └── client.go
│   └── opensearch/          # OpenSearch indexer module
│       └── indexer.go
├── data/                    # Sample data directory
│   └── sample.csv
├── output/                  # Generated XML files directory
├── docker-compose.yml       # Docker services configuration
├── go.mod                   # Go module definition
├── Makefile                 # Build and run commands
└── README.md               # This file

Quick Start

1. Clone and Setup

# Navigate to the project directory
cd search

# Download dependencies
go mod download

2. Start Services

Start OpenSearch and Tika using Docker Compose:

# Using Make
make docker-up

# Or using Docker Compose directly
docker-compose up -d

Wait about 10-15 seconds for services to be ready. You can verify:

# Check OpenSearch
curl http://localhost:9200

# Check Tika
curl http://localhost:9998/version

3. Run the Application

Process the sample CSV file:

# Using Make
make run

# Or using Go directly
go run cmd/main.go -csv data/sample.csv

# Or build and run
go build -o bin/csv-indexer cmd/main.go
./bin/csv-indexer -csv data/sample.csv

4. Access Your Data

Once the run completes, the documents live in the csv-documents index (by default) and can be queried through the OpenSearch API at http://localhost:9200 (see API Endpoints below).
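Indexed rows can then be queried through OpenSearch's _search API. A minimal Go sketch of building a simple match query; the `content` field name is an assumption about the indexed document shape, so adjust it to your mapping:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// searchRequest builds a POST to the index's _search endpoint with a
// simple match query. The "content" field is an assumption about how
// documents are indexed, not something the project guarantees.
func searchRequest(baseURL, index, query string) (*http.Request, error) {
	body, err := json.Marshal(map[string]any{
		"query": map[string]any{
			"match": map[string]string{"content": query},
		},
	})
	if err != nil {
		return nil, err
	}
	url := fmt.Sprintf("%s/%s/_search", baseURL, index)
	req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	req, err := searchRequest("http://localhost:9200", "csv-documents", "New York")
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL) // PUT together: POST http://localhost:9200/csv-documents/_search
	// http.DefaultClient.Do(req) would execute it against a running cluster.
}
```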

Usage

Command Line Options

go run cmd/main.go [options]
Option       Description                                   Default
-csv         Path to the CSV file to process (required)    -
-output      Directory to store XML files                  output
-index       OpenSearch index name                         csv-documents
-opensearch  OpenSearch URL                                http://localhost:9200
-tika        Tika server URL                               http://localhost:9998
-bulk        Use bulk indexing                             true
-skip-tika   Skip Tika parsing and index XML directly      false
-help        Show help message                             -

Examples

# Process a CSV file with default settings
go run cmd/main.go -csv data/sample.csv

# Use a custom index name
go run cmd/main.go -csv data/sample.csv -index my-custom-index

# Skip Tika parsing (faster but less content extraction)
go run cmd/main.go -csv data/sample.csv -skip-tika

# Process without bulk indexing (slower but shows individual progress)
go run cmd/main.go -csv data/sample.csv -bulk=false

Makefile Commands

The project includes a Makefile for common operations:

make help             # Show all available commands
make docker-up        # Start Docker services
make docker-down      # Stop Docker services
make docker-clean     # Stop services and remove volumes
make build            # Build the application
make run              # Run with sample data
make run-skip-tika    # Run without Tika parsing
make test-opensearch  # Test OpenSearch connection
make test-tika        # Test Tika connection
make search           # Interactive search in indexed documents
make list-indices     # List all OpenSearch indices
make delete-index     # Delete the csv-documents index
make clean            # Clean generated files
make full-run         # Complete setup and run

How It Works

  1. CSV Parsing: The application reads your CSV file and creates a map for each row using the headers as keys.

  2. XML Generation: Each row is converted to an XML document with sanitized field names:

    <?xml version="1.0" encoding="UTF-8"?>
    <record>
      <name>John Doe</name>
      <age>28</age>
      <city>New York</city>
      ...
    </record>
  3. Tika Parsing (optional): XML files are sent to the Tika server for content extraction and metadata parsing.

  4. OpenSearch Indexing: Documents are indexed with:

    • Full text content
    • Original CSV data as metadata
    • File paths
    • Timestamps
  5. Search: Documents become searchable through OpenSearch's full-text search capabilities.

CSV File Format

Your CSV file should:

  • Have a header row as the first line
  • Use comma separators
  • Follow standard CSV escaping rules

Example:

name,age,city,occupation,email
John Doe,28,New York,Software Engineer,[email protected]
Jane Smith,32,San Francisco,Product Manager,[email protected]
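A file in this format can be read with Go's encoding/csv package, using the header row as map keys. A minimal sketch of the parsing step (not the project's actual parser):

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// parseCSV reads CSV data with a header row and returns one map per data row.
func parseCSV(data string) ([]map[string]string, error) {
	r := csv.NewReader(strings.NewReader(data))
	records, err := r.ReadAll() // enforces standard CSV quoting/escaping
	if err != nil {
		return nil, err
	}
	if len(records) == 0 {
		return nil, fmt.Errorf("empty CSV")
	}
	headers := records[0]
	rows := make([]map[string]string, 0, len(records)-1)
	for _, rec := range records[1:] {
		row := make(map[string]string, len(headers))
		for i, h := range headers {
			row[h] = rec[i]
		}
		rows = append(rows, row)
	}
	return rows, nil
}

func main() {
	data := "name,age,city\nJohn Doe,28,New York\nJane Smith,32,San Francisco\n"
	rows, err := parseCSV(data)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(rows), rows[0]["name"]) // 2 John Doe
}
```

encoding/csv rejects rows whose field count differs from the header, which is one way a malformed file surfaces as an error rather than bad data.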

Docker Services

OpenSearch

  • URL: http://localhost:9200
  • Single-node setup with security disabled (for development)
  • Data persisted in Docker volume

OpenSearch Dashboards

  • URL: http://localhost:5601
  • Web interface for browsing and visualizing the indexed documents

Tika Server

  • URL: http://localhost:9998
  • Extracts text content and metadata from the generated XML files

Troubleshooting

Services Won't Start

# Check if ports are already in use
netstat -an | grep -E "9200|9998|5601"

# Check Docker logs
docker-compose logs

OpenSearch Out of Memory

Increase memory in docker-compose.yml:

environment:
  - "OPENSEARCH_JAVA_OPTS=-Xms1g -Xmx1g"
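In docker-compose.yml this setting belongs under the OpenSearch service's environment block; a sketch of the placement (the service name `opensearch` is an assumption about your compose file):

```yaml
services:
  opensearch:
    environment:
      # Raise the JVM heap from the default; keep -Xms and -Xmx equal.
      - "OPENSEARCH_JAVA_OPTS=-Xms1g -Xmx1g"
```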

Tika Connection Failed

  • Ensure Tika container is running: docker-compose ps
  • Wait longer for service startup
  • Use -skip-tika flag to bypass Tika parsing

No Documents Indexed

  • Check CSV file format and path
  • Verify OpenSearch is running
  • Check application output for error messages

Development

Running Tests

go test ./...

Code Formatting

go fmt ./...
go vet ./...

Adding Dependencies

go get <package>
go mod tidy

API Endpoints

OpenSearch Endpoints

  • Health check: GET http://localhost:9200
  • Index stats: GET http://localhost:9200/csv-documents/_stats
  • Search: POST http://localhost:9200/csv-documents/_search
  • Delete index: DELETE http://localhost:9200/csv-documents

Tika Endpoints

  • Version: GET http://localhost:9998/version
  • Parse: PUT http://localhost:9998/tika
  • Metadata: PUT http://localhost:9998/meta
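The Tika parse endpoint above can be called from Go with a plain PUT; the Accept header asks Tika to return extracted plain text. A minimal sketch:

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// tikaParseRequest builds a PUT to Tika's /tika endpoint with the raw
// document as the body; Accept: text/plain requests extracted text.
func tikaParseRequest(tikaURL, document string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodPut, tikaURL+"/tika", strings.NewReader(document))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/xml")
	req.Header.Set("Accept", "text/plain")
	return req, nil
}

func main() {
	doc := "<?xml version=\"1.0\"?><record><name>John Doe</name></record>"
	req, err := tikaParseRequest("http://localhost:9998", doc)
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL) // PUT http://localhost:9998/tika
	// http.DefaultClient.Do(req) would return the extracted text when
	// the Tika container is running.
}
```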

License

MIT

Contributing

  1. Fork the repository
  2. Create your feature branch
  3. Commit your changes
  4. Push to the branch
  5. Create a Pull Request

Support

For issues, questions, or suggestions, please open an issue on the project repository.
