birchwood-abbey.net Git - mit-scheme.git/commit

author	Taylor R Campbell <campbell@mumble.net>
	Mon, 27 May 2019 16:21:08 +0000 (16:21 +0000)
committer	Taylor R Campbell <campbell@mumble.net>
	Tue, 28 May 2019 13:40:58 +0000 (13:40 +0000)
commit	d7cf9e8a8b14fe56e746dd7ddec03b3e67d6d4a9
tree	9c921299fe4599f1365eb91dcf973afd2a474705
parent	7a371df58258c30895cefb960a1c2c1d1a233987

Implement character replacement on ill-formed octet sequences.

- (utf8->string bv start end #t) now replaces ill-formed octet
  sequences by U+FFFD.

  Existing behaviour of (utf8->string bv [start end]) is unchanged so
  that utf8->string will fail noisily rather than quietly fail to be
  invertible by string->utf8 on certain inputs.

- Generic I/O input now replaces ill-formed octet sequences by U+FFFD.

  TODO: Add (port/set-coding-error port <action>) for <action> =
  replace or <action> = error, perhaps.
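
The invertibility concern in the first item can be illustrated with a
small analogy in Python. This is not MIT Scheme code and says nothing
about the internals of utf8->string or string->utf8; it only assumes
that Python's strict and "replace" error handlers behave like the two
modes described above.

```python
# Python analogy only; this is not the MIT Scheme implementation.
data = bytes([0x61, 0xFF, 0x62])  # 'a', one ill-formed octet, 'b'

# Strict decoding fails noisily, as the unchanged three-argument form
# of utf8->string is described as doing.
try:
    data.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# Replacing decoding succeeds but substitutes U+FFFD for the bad octet.
replaced = data.decode("utf-8", errors="replace")

# Re-encoding the result does not give back the original octets, so
# the replacing decoder is quietly non-invertible on this input.
round_trip = replaced.encode("utf-8")

print(strict_failed, replaced == "a\ufffdb", round_trip == data)
```

Keeping the strict mode as the default means a caller gets the lossy
behaviour only by asking for it explicitly.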

TODO: This does not exactly implement the replacement algorithm
recommended as a best practice by Unicode 9, §3.9, pp. 127-129.  That
algorithm is inconvenient because our decoder is factored into (a)
claiming a length based on the first code unit, and then (b)
consuming exactly that many bytes; the algorithm requires us to
refactor it so that part (b) can say `never mind' and consume fewer
bytes than (a) requested.
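
A minimal Python sketch of a decoder with the (a)/(b) shape described
above shows one way such a factoring can differ from the recommended
practice. The names claimed_length and decode_fixed are hypothetical,
and the choice to emit a single U+FFFD for the whole claimed span is
an assumption made for illustration; the runtime's actual decoder may
behave differently. Python's built-in "replace" handler is used as
the point of comparison.

```python
REPLACEMENT = "\uFFFD"

def claimed_length(first):
    """Step (a): claim a sequence length from the first code unit alone."""
    if first < 0x80:
        return 1
    if 0xC2 <= first <= 0xDF:
        return 2
    if 0xE0 <= first <= 0xEF:
        return 3
    if 0xF0 <= first <= 0xF4:
        return 4
    return 1  # not a valid leading byte; treat it as a one-byte error

def decode_fixed(data):
    """Step (b) always consumes exactly the claimed number of bytes."""
    out = []
    i = 0
    while i < len(data):
        n = claimed_length(data[i])
        chunk = data[i:i + n]
        try:
            out.append(chunk.decode("utf-8"))  # well-formed: one character
        except UnicodeDecodeError:
            out.append(REPLACEMENT)            # ill-formed: one U+FFFD
        i += n                                 # never "never mind"
    return "".join(out)

bad = b"\xe2AB"  # a three-byte lead followed by two ASCII bytes
print(repr(decode_fixed(bad)))                     # 'A' and 'B' are swallowed
print(repr(bad.decode("utf-8", errors="replace")))  # 'A' and 'B' survive
```

In the second call the decoder gives up on the lead byte alone and
resumes at 'A', which is the `never mind' behaviour that step (b)
would need in order to consume fewer bytes than step (a) claimed.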

src/runtime/bytevector.scm
src/runtime/char.scm
src/runtime/generic-io.scm
src/runtime/runtime.pkg
tests/runtime/test-char.scm