This is an incomplete, work-in-progress 5.5.1-to-7.0 converter. Some parts are ported directly from the C converter (such as the ANSEL Charset and date and age parsing) while others are built from the ground up. The hope is that having two somewhat-separate implementations will allow me to use the two to test one another, a hope that has already resulted in a few bug fixes in the C version.
The file edu/virginia/ged5to7/GedcomDefinitions.java contains preprocessed copies of the TSV files from https://github.com/FamilySearch/GEDCOM/tree/main/extracted-files. When a new (minor or major) version of the spec is released, updates to those files will need to be incorporated by running

    javac DownloadDefinitions.java
    java DownloadDefinitions

The above will overwrite the file edu/virginia/ged5to7/GedcomDefinitions.java with an updated version.
DownloadDefinitions.java is otherwise unneeded, and should not be included in distributions of the ged5to7 package.
- Detect character encodings, as documented in ELF Serialisation.
- Convert to UTF-8
- Normalize line whitespace, including stripping leading spaces
- Remove CONC
- Fix @ usage
- Limit character set of cross-reference identifiers
- Normalize case of tags
- Convert DATE
    - replace date_phrase with PHRASE structure
    - replace calendar escapes with calendar tags
    - change BC and B.C. to BCE and remove if found in unsupported calendars
    - replace dual years with single years and PHRASEs
    - replace just-year dual years in unqualified date with BET/AND
- Convert AGE
    - change age words to canonical forms (stillborn as 0y, child as < 8y, infant as < 1y) with PHRASEs
    - Normalize spacing in AGE payloads
    - add missing y
- Change any illegal tag XYZ into _EXT_XYZ
    - or to _XYZ and add a SCHMA entry for it
    - leave unchanged under extensions
- change SOUR with text payload into pointer to SOUR with NOTE
- change OBJE with no payload to pointer to new OBJE record
- change NOTE record or with pointer payload into SNOTE
    - use heuristic to change some pointer-NOTE to nested-NOTE instead of SNOTE
- Convert LANG payloads to BCP 47 tags, using FHISO's mapping
- Convert MEDI.FORM payloads to media types
- Enumerated values
    - Normalize case
    - Convert user-text to PHRASEs
- Convert FONE and ROMN to TRAN and their TYPEs to BCP-47 LANGs
- tag renaming, including
    - EMAI, _EMAIL → EMAIL
    - FORM.TYPE → FORM.MEDI
    - (deferred) _SDATE → SDATE -- _SDATE is also used as "accessed at" date for web resources by some applications so this change is not universally correct
    - _UID → UID
    - _ASSO → ASSO
    - _CRE, _CREAT → CREA
    - _DATE → DATE
    - other?
- ASSO.RELA → ASSO.ROLE (changing payload OTHER + PHRASE)
- change RFN, RIN, and AFN to EXID
- change _FSFTID, _APID to EXID
- remove SUBN, HEAD.FILE, HEAD.CHAR
    - (deferred) HEAD.PLAC was originally on this list, but has been deferred to a later version
- change FILE payloads into URLs
    - Windows-style \ becomes /
    - Windows drive letter C:\WINDOWS becomes file:///c:/WINDOWS
    - POSIX-style /User/foo becomes file:///User/foo
- update the GEDC.VERS to 7.0
- (extra) change string-valued INDI.ALIA into NAME with TYPE AKA
- (5.5) change base64-encoded OBJE into GEDZIP
- add SCHMA for all used known extensions
    - add URIs (or standard tags) for all extensions from https://wiki-de.genealogy.net/GEDCOM/_Nutzerdef-Tag and http://www.gencom.org.nz/GEDCOM_tags.html
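
As an illustration of the "Remove CONC" step in the list above, here is a minimal sketch that folds each CONC line into the line before it. It is not the code used in this package: the class name ConcSketch and method mergeConc are hypothetical, and the sketch assumes lines have already been whitespace-normalized and tags upper-cased.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the ged5to7 package.
final class ConcSketch {
    // A CONC line: level, the tag CONC, and an optional payload.
    private static final Pattern CONC =
        Pattern.compile("^\\s*\\d+ CONC(?: (.*))?$");
    // A line with no payload: level, optional @XREF@, tag, and nothing else.
    private static final Pattern NO_PAYLOAD =
        Pattern.compile("^\\s*\\d+ (?:@[^@]+@ )?\\S+$");

    static List<String> mergeConc(List<String> lines) {
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            Matcher m = CONC.matcher(line);
            if (m.matches() && !out.isEmpty()) {
                String tail = m.group(1) == null ? "" : m.group(1);
                int last = out.size() - 1;
                String prev = out.get(last);
                // If the previous line has no payload yet, a single space
                // must separate its tag from the appended text.
                boolean needsSpace =
                    NO_PAYLOAD.matcher(prev).matches() && !tail.isEmpty();
                out.set(last, prev + (needsSpace ? " " : "") + tail);
            } else {
                out.add(line); // anything else (including CONT) is kept
            }
        }
        return out;
    }
}
```

The sketch appends to the immediately preceding output line, which relies on CONC always following the line it continues (either the structure itself or one of its CONT lines).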
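
The AGE items in the list above (age words, spacing, missing y) can be sketched roughly as follows. This is only an approximation under stated assumptions: AgeSketch and convert are hypothetical names, the result is returned as a two-element array of payload and PHRASE text (null when no PHRASE is needed), and input the sketch does not recognize is returned unchanged rather than handled the way the real converter might.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the ged5to7 package.
final class AgeSketch {
    // One duration component: a number followed by a unit letter.
    private static final Pattern PART =
        Pattern.compile("(\\d+)\\s*([ymwd])", Pattern.CASE_INSENSITIVE);

    /** Returns {payload, phrase}; phrase is null when none is needed. */
    static String[] convert(String raw) {
        String s = raw.trim();
        // Age words become canonical payloads; the original word is kept
        // so that it can be stored in a PHRASE.
        switch (s.toUpperCase(Locale.ROOT)) {
            case "STILLBORN": return new String[] {"0y", s};
            case "INFANT":    return new String[] {"< 1y", s};
            case "CHILD":     return new String[] {"< 8y", s};
            default: break;
        }
        String bound = "";
        if (s.startsWith("<") || s.startsWith(">")) {
            bound = s.charAt(0) + " ";      // exactly one space after the bound
            s = s.substring(1).trim();
        }
        if (s.matches("\\d+")) {
            return new String[] {bound + s + "y", null}; // add missing y
        }
        StringBuilder sb = new StringBuilder();
        Matcher m = PART.matcher(s);
        while (m.find()) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(m.group(1)).append(m.group(2).toLowerCase(Locale.ROOT));
        }
        if (sb.length() == 0) {
            return new String[] {raw, null}; // unrecognized: left untouched
        }
        return new String[] {bound + sb, null};
    }
}
```

For example, convert("Child") yields the payload "< 8y" with the phrase "Child", and convert("<3Y  2m") yields the payload "< 3y 2m" with no phrase.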
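
The three FILE payload rules in the list above can be sketched as a single function. The names FilePayloadSketch and toUrl are hypothetical, and the sketch deliberately omits percent-encoding of special characters, which a complete conversion would presumably need; it only reproduces the three examples given.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the ged5to7 package.
final class FilePayloadSketch {
    // A Windows drive path after backslashes have been turned into slashes.
    private static final Pattern DRIVE = Pattern.compile("^([A-Za-z]):/(.*)$");

    static String toUrl(String payload) {
        // Windows-style \ becomes /
        String s = payload.replace('\\', '/');
        Matcher m = DRIVE.matcher(s);
        if (m.matches()) {
            // C:/WINDOWS becomes file:///c:/WINDOWS (drive letter lower-cased,
            // matching the example in the list)
            return "file:///" + m.group(1).toLowerCase(Locale.ROOT)
                + ":/" + m.group(2);
        }
        if (s.startsWith("/")) {
            // POSIX absolute path: /User/foo becomes file:///User/foo
            return "file://" + s;
        }
        return s; // relative paths and existing URLs pass through
    }
}
```

With this sketch, toUrl("C:\\WINDOWS") returns "file:///c:/WINDOWS" and toUrl("/User/foo") returns "file:///User/foo".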