Skip to content

add support of grapheme clusters#520

Merged
raphlinus merged 17 commits intoxi-editor:masterfrom
jmuk:grapheme
May 1, 2018
Merged

add support of grapheme clusters#520
raphlinus merged 17 commits intoxi-editor:masterfrom
jmuk:grapheme

Conversation

@jmuk
Copy link
Contributor

@jmuk jmuk commented Feb 13, 2018

now next/prev_grapheme_offset works correctly rather than
falling back to codepoints.

fixes #298

now next/prev_grapheme_offset works correctly rather than
falling back to codepoints.

fixes xi-editor#298
Copy link
Member

@cmyr cmyr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, thanks for this! A few little nits but otherwise looks good.

}

fn is_boundary(s: &String, offset: usize) -> bool {
if let Ok(r) = GraphemeCursor::new(offset, s.len(), true).is_boundary(s, 0) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can call .ok() on any Result<T> to convert it to an Option<T>, discarding the error.

cursor.is_boundary::<GraphemeMetric>()
}

// graphemes should probably be developed as a cursor-based interface
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can get rid of this comment now. :)

}

#[derive(Clone, Copy)]
pub struct GraphemeMetric(());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can declare a unit struct without any braces: pub struct GraphemeMetric;.

@raphlinus
Copy link
Member

I'm thrilled to see progress on this, and hate to throw cold water on the idea, but unfortunately trying to treat grapheme clusters as a metric doesn't quite work. The underlying reason is that grapheme cluster counting is fundamentally not a monoid homomorphism, as awesome as it would be if that were the case. Here's how to write a test case that will fail:

Create a string of RIS code points, let's say \u{1F1E6} that doesn't fit in a single leaf node (ie > 1kb). Now test whether the offset at len(first leaf) + 4 bytes is a boundary. The GraphemeMetric will always say no, but that would be wrong if the number of code points in the first leaf is odd. If the number of code points in the first leaf is even, edit the rope (usually taking the substring from offset 4 will work, unless the first leaf is small), so that one or the other would fail.

Because the size of graphemes is not bounded, and because boundaries can in the worst case depend on unbounded amounts of context, the boundary detection must be able to handle bringing in an arbitrary number of leaves as pre-context.

Bottom line, this needs to be implemented directly in {prev,next}_grapheme_offset with some attention to leaf boundaries. It is, sadly, not trivial to test - this is one reason I'd like to see some infrastructure for property testing developed.

Thanks for the attempt, and I'm happy to provide guidance to see it through.

@mqzry
Copy link
Member

mqzry commented Mar 13, 2018

I am taking a stab at this. Did I understand it correctly that what is needed is:

  • Instantiate a GraphemeCursor
  • create a chunk iterators that walk in reverse and normal order to handle GraphemeIncomplete::PreContext or GraphemeIncomplete::NextChunk results
  • Call the right function on the grapheme cursor for example is_boundary
  • Handle any Err(GraphemeIncomplete) return value properly

@jmuk
Copy link
Contributor Author

jmuk commented Mar 13, 2018

Yup, actually I was working on that. Some prototype is working on my local machine although it needs some refinements (sorry for my slow progress).

However, I've found a few issues on unicode-segmentation side, which cause panic. I'll send a PR to them first. Stay tuned.

@jmuk
Copy link
Contributor Author

jmuk commented Mar 13, 2018

unicode-rs/unicode-segmentation#37 here is the PR.

@jmuk
Copy link
Contributor Author

jmuk commented Mar 17, 2018

So I pushed my version of the code. Please take a look, if that's okay. I'm fine to hold this until the problem I've seen in unicode-segmentation are fixed.

@jmuk
Copy link
Contributor Author

jmuk commented Mar 20, 2018

This PR actually relies on unicode-rs/unicode-segmentation#40 too. I can create workarounds if that turned out unwanted.

@mqzry
Copy link
Member

mqzry commented Apr 29, 2018

What is the status for this PR? All relevant PRs seem to be merged.

@raphlinus
Copy link
Member

Good question. I think unicode-segmentation needs to be published to crates.io. @Manishearth?

@Manishearth
Copy link

Manishearth commented Apr 29, 2018 via email

@raphlinus
Copy link
Member

Thanks @Manishearth! Would you like to resolve the conflicts, @jmuk? If not, we can take care of it.

Incidentally, this will help a lot in dealing with CRLF line endings, right now the cursor can get between the two which causes all kinds of unexpected behavior.

@jmuk
Copy link
Contributor Author

jmuk commented Apr 30, 2018

@raphlinus done!

Copy link
Member

@raphlinus raphlinus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thanks, and thanks for your patience.

@@ -363,13 +366,13 @@ impl Rope {

// graphemes should probably be developed as a cursor-based interface
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment can be removed, as you've actually done it :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed!

}
// GraphemeIncomplete::NextChunk error will not happen, since this cursor thinks the end of
// the current leaf as the end of the string. Can a cursor know the total length of
// the tree?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To answer this question, this should be available at cursor.root.len(). In my first reading, I'm not sure whether this would allow you to simplify the following logic.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay -- yet cursor.root isn't visible here in rope.rs. I added total_len() to tree.rs, and it helps to simplify the code; at least I think so. PTAL.

rust/Cargo.lock Outdated
[[package]]
name = "unicode-segmentation"
version = "1.2.0"
source = "git+https://github.com/jmuk/unicode-segmentation.git?branch=fix_for_xi#1d7d062b38e6e5299ed0e5750f6890230c6cd0fa"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'd like to wait for unicode-segmentation to release the fixed version, as (among other things), this would prevent releasing xi-rope as a crate. I know it's been a long time since we've done that, but we're discussing doing a release before long.

let mut c = GraphemeCursor::new(pos, l.len() + leaf_offset, true);
let mut next_boundary = c.next_boundary(&l, leaf_offset);
while let Err(GraphemeIncomplete::PreContext(_)) = next_boundary {
let (pl, poffset) = self.prev_leaf()?;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should update the minimum rust version in README, as ? for Option is not supported in 1.20. A quick test shows that 1.22 should still work.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems the README had been updated already to require Rust 1.22; as 3c53388

assert_eq!(None, a.next_grapheme_offset(17));
}

#[test]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent, thanks!

Copy link
Member

@cmyr cmyr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Played around with this, feels good.

@raphlinus raphlinus merged commit 6fd9768 into xi-editor:master May 1, 2018
@raphlinus
Copy link
Member

Many thanks!

@jmuk
Copy link
Contributor Author

jmuk commented May 1, 2018

Yay!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Use unicode-segmentation 1.2 for grapheme cluster boundaries

6 participants