I also added decodeFront which operates on the first code point in the
range (unlike stride and strideBack, it requires a different name, since
the signatures of decode and decodeFront are almost identical - the only
difference being that decode takes the index by ref, and decodeFront
takes it as out).
It's easier to define aliases this way. I created two templates though
so that you can still pass both template arguments if you want to, and
it'll avoid breaking any code.
- Use fast path tests for non-complex unicode sequences that can be
inlined. These rely on the built-in array bounds check.
- Factor out complex cases into separate functions that do exception
based validity checks. The char[] and wchar[] versions use
pointers to avoid redundant array bounds checks, thus they can only
be trusted.
- Complete rewrite of decode for char[] to use less branching and
unrolled loops. This requires less registers AND less instructions.
The overlong check is done much cheaper on the code point.
- The decode functions were made templates to short circuit the very
restricted function inlining possibilities.
The pragmas have not been as effective as we might have liked, since
they only work with templates and can't tell you where in your code you
need to make changes, and they seemed to have been more annoying to
programmers than helpful, so we're going to discontinue them. We'll
leave them in for stuff that's actually been deprecated until deprecated
has been improved enough to take a message, but we'll leave "scheduled
for deprecation" messages to the documentation and changelog.
It's pretty clear from discussions on the newsgroup that we want to keep
toUTF16z, so it shouldn't be scheduled for deprecation anymore. That
change is part of https://github.com/D-Programming-Language/phobos/pull/279 ,
but for whatever reason that pull request hasn't been reviewed yet, let
alone merged in, and this change shouldn't need review, _and_ it should
be in before the next release, so I'm just making it and checking it in.
It's simply removing the "scheduled for deprecation" note in the
documentation and its associated pragma.
The main purpose of these changes was to make as much as possible in
std.utf pure (other than toUTFx, which I'll be replacing with toUTF in a
future pull request), but I also ended up doing a fair bit of
documentation cleanup. Almost everything in std.utf is pure now though,
which should help considerably in making it possible to make
string functions pure.
I also put @system on the two overloads of toUTFZ which do pointer
arithmetic. They're obviously @system anyway, but tagging them with it
makes it clearer.
toUTFz no longer guarantees that the string will remain zero-terminated.
If the string can be zero-terminated but isn't immutable and doesn't
need to be copied to have the requested character pointer type, then it
no longer copies. This means that it's possible to have a string which
is zero-terminated and then stops being zero-terminated if you alter the
character one passed its end, but that's not likely to be an issue in
most cases, and a note in the documentation points it out so that
programmers can know about it and deal with it appropriately.
I haven't made std.conv.to use it yet, and I haven't touched toUTF16z or
toStringz at all, but here's an implementation for toUTFz. After this is
in, we can make std.conv.to use it when converting to character
pointers, and we should probably make it so that we have toStringz,
toWstringz, and toDstringz which use it and return immutable character
pointers and get rid of toUTF16z.
indexOf and lastIndexOf should not work properly with unicode for all
string types (unlike before). As part of that, I also ended up fixing a
bug in std.array.back for strings (wstrings in particular were broken).
I also improved various, related unit tests.