Fix default escape settings.

* wordsplit.c (wordsplit_escape): New global.
(wordsplit_init): Backslash interpretation is disabled if
not explicitly configured.
(wsnode_quoteremoval): Unquote unless _WSNF_NOEXPAND is set.
(scan_word): Fix backslash handling if WRDSF_QUOTE flags are set.
* wsp.c: Fix option handling.
* wordsplit.at: Test handling of C-style escapes.

* README: Document changes.
* wordsplit.3: Likewise.
Sergey Poznyakoff 2023-06-21 16:01:32 +03:00
parent 403b1c769f
commit e2f0c64db9
6 changed files with 146 additions and 78 deletions

README

@@ -1,7 +1,10 @@
 README file for the wordsplit library
 See the end of file for copying conditions.
 * Overview
 This package provides a set of C functions for parsing input strings.
-Default parsing rules are are similar to those used in Bourne shell.
+Default parsing rules are similar to those used in Bourne shell.
 This includes tilde expansion, variable expansion, quote removal, word
 splitting, command substitution, and path expansion. Parsing is
 controlled by a number of settings which allow the caller to alter
@@ -46,7 +49,7 @@ programs. It consists of the following files:
 wordsplit.c - Main source file.
 wordsplit.3 - Documentation.
-For most uses, you will need only these three. The rest of files
+For most uses, you will need only these three. The remaining files
 are for building the autotest-based testsuite:
 wsp.c - Auxiliary test program.
@@ -54,7 +57,7 @@ are for building the autotest-based testsuite:
 * Incorporating wordsplit into your project
-The project is designed to be used as a git submodule. To incorporate
+Wordsplit is designed to be used as a git submodule. To incorporate
 it into your project, first select the location for the wordsplit
 directory within your project. Then add the submodule at this
 location. The rest is quite straightforward: you need to add
@@ -117,7 +120,7 @@ Add wordsplit.c to the nodist_program_SOURCES variable:
 The nodist_ prefix is necessary to prevent Make from trying to
 distribute this file from the current directory (where it doesn't
-exist of course). During compilation it will be located using VPATH.
+exist, of course). During compilation it will be located using VPATH.
 Finally, add both wordsplit/wordsplit.c and wordsplit/wordsplit.h to
 the EXTRA_DIST variable and modify AM_CPPFLAGS as shown in the
@@ -213,18 +216,18 @@ Then, add the following fragment to build the auxiliary files:
 * History
-First version of wordsplit appeared in March 2009 as a part of the
+First version of wordsplit appeared in March 2009 as part of the
 Wydawca[1] project. Its main usage was to assist in configuration
 file parsing. The parser subsystem proved to be quite useful and
 soon evolved into a separate project - Grecs[2]. This package had been
 since used (as a git submodule) in a number of other projects, such as
 GNU Dico[3] and Direvent[4], to name a few.
-In 2010 the wordsplit sources were incorporated to the GNU
-Mailutils[5] package, where they replaced the obsolete argcv module.
-Mailutils uses its own configuration package, which meant that using
-Grecs was not expedient. Therefore the sources had been exported from
-Grecs. Since then both Mailutils and Grecs versions are periodically
+In 2010 wordsplit sources were incorporated to the GNU Mailutils[5]
+package, where they replaced the obsolete argcv module. Mailutils
+uses its own configuration package, which meant that using Grecs was
+not expedient. Therefore the sources had been exported from
+Grecs. Since then both Mailutils and Grecs versions were periodically
 synchronized.
 Several other projects, such as GNU Rush[6] and fileserv[7], followed
@@ -275,7 +278,7 @@ the following information:
 * Copying
-Copyright (C) 2009-2021 Sergey Poznyakoff
+Copyright (C) 2009-2023 Sergey Poznyakoff
 Permission is granted to anyone to make or distribute verbatim copies
 of this document as received, in any medium, provided that the

wordsplit.3

@@ -14,7 +14,7 @@
 .\" You should have received a copy of the GNU General Public License
 .\" along with wordsplit. If not, see <http://www.gnu.org/licenses/>.
 .\"
-.TH WORDSPLIT 3 "July 24, 2019" "WORDSPLIT" "Wordsplit User Reference"
+.TH WORDSPLIT 3 "June 22, 2023" "WORDSPLIT" "Wordsplit User Reference"
 .SH NAME
 wordsplit \- split string into words
 .SH SYNOPSIS
@@ -299,16 +299,15 @@ the \fBWRDSF_DQUOTE\fR flag. The macro \fBWRDSF_QUOTE\fR enables both.
 Backslash interpretation translates unquoted
 .I escape sequences
 into corresponding characters. An escape sequence is a backslash followed
-by one or more characters. By default, each sequence \fB\\\fIC\fR
-appearing in unquoted words is replaced with the character \fIC\fR. In
-doubly-quoted strings, two backslash sequences are recognized:
-\fB\\\\\fR translates to a single backslash, and \fB\\\(dq\fR
-translates to a double-quote.
+by one or more characters. By default, that is if no flags are
+supplied, no escape sequences are defined, and each sequence
+\fB\\\fIC\fR is reproduced verbatim.
+.PP
+There are several ways to enable backslash interpretation and to
+define escape sequences. The simplest one is to use the
+\fBWRDSF_CESCAPES\fR flag. This flag defines the C-like escape
+sequences:
 .PP
-Two flags are provided to modify this behavior. If
-.I WRDSF_CESCAPES
-flag is set, the following escape sequences are recognized:
-.sp
 .nf
 .ta 8n 18n 42n
 .ul
@@ -329,19 +328,59 @@ for a two-digit hex number is replaced with ASCII character \fINN\fR.
 The sequence \fB\\0\fINNN\fR, where \fINNN\fR stands for a three-digit
 octal number is replaced with ASCII character whose code is \fINNN\fR.
 .PP
-The \fBWRDSF_ESCAPE\fR flag allows the caller to customize escape
-sequences. If it is set, the \fBws_escape\fR member must be
-initialized. This member provides escape tables for unquoted words
-(\fBws_escape[0]\fR) and quoted strings (\fBws_escape[1]\fR). Each
-table is a string consisting of an even number of characters. In each
-pair of characters, the first one is a character that can appear after
-backslash, and the following one is its translation. For example, the
-above table of C escapes is represented as
-\fB\(dq\\\\\\\\"\\"a\\ab\\bf\\fn\\nr\\rt\\tv\\v\(dq\fR.
+Additionally, outside of quoted strings (if these are enabled by the
+use of \fBWRDSF_DQUOTE\fR flag) backslash character can be used to
+escape horizontal whitespace: horizontal space (ASCII 32) and
+tab (ASCII 9) characters.
 .PP
-It is valid to initialize \fBws_escape\fR elements to zero. In this
+The \fBWRDSF_CESCAPES\fR bit is included in the default flag
+set \fBWRDSF_DEFFLAGS\fR.
+.PP
+The \fBWRDSF_ESCAPE\fR flag provides a more elaborate way of defining
+escape sequences. If it is set, the \fBws_escape\fR member must be
+initialized. This member provides escape tables for unquoted words
+(\fBws_escape[WRDSX_WORD]\fR) and quoted strings
+(\fBws_escape[WRDSX_QUOTE]\fR). Each table is a string consisting of
+an even number of characters. In each pair of characters, the first
+one is a character that can appear after backslash, and the following
+one is its translation. For example, the table of C escapes is
+represented as follows:
+.TP
+\fB\(dq\\\\\\\\"\\"a\\ab\\bf\\fn\\nr\\rt\\tv\\v\(dq\fR
+.PP
+It is valid to initialize \fBws_escape\fR elements to NULL. In this
 case, no backslash translation occurs.
 .PP
+For convenience, the global variable
+.B wordsplit_escape
+defines several most often used escape translation tables:
+.PP
+.EX
+extern char const *wordsplit_escape[];
+.EE
+.PP
+It is indexed by the following constants:
+.TP
+.B WS_ESC_C
+C-style escapes, the definition of which is shown above. This is the
+translation table that is used within quoted strings when
+.B WRDSF_CESCAPES
+is in effect.
+.TP
+.B WS_ESC_C_WS
+The \fBWS_ESC_C\fR table augmented by two entries: for horizontal tab
+character and whitespace. This is the table that is used for unquoted
+words when
+.B WRDSF_CESCAPES
+is in effect.
+.TP
+.B WS_ESC_DQ
+Backslash character escapes double-quote and itself. Useful for
+handling doubly-quoted strings in various Internet protocols.
+.TP
+.B WS_ESC_DQ_WS
+Escape double-quote, backslash, horizontal tab and whitespace characters.
+.PP
 Interpretation of octal and hex escapes is controlled by the following
 bits in \fBws_options\fR:
 .TP

wordsplit.at

@@ -1,5 +1,5 @@
 # Test suite for wordsplit -*- Autotest -*-
-# Copyright (C) 2014-2021 Sergey Poznyakoff
+# Copyright (C) 2014-2023 Sergey Poznyakoff
 #
 # Wordsplit is free software; you can redistribute it and/or modify
 # it under the terms of the GNU General Public License as published by
@@ -523,8 +523,8 @@ TOTAL: 3
 WSPGROUP()
-TESTWSP([C escapes on],[wcp-c-escape],[-cescapes],
-[a\ttab form\ffeed and new\nline],
+TESTWSP([C escapes on],[wcp-c-escape],[-nodefault -dquote -cescapes],
+["a\ttab" "form\ffeed" and "new\nline"],
 [NF: 4
 0: a\ttab
 1: form\ffeed
@@ -533,8 +533,8 @@ TESTWSP([C escapes on],[wcp-c-escape],[-cescapes],
 TOTAL: 4
 ])
-TESTWSP([C escapes off],[wcp-c-escape-off],[-nocescapes],
-[a\ttab form\ffeed and new\nline],
+TESTWSP([C escapes off],[wcp-c-escape-off],[-nodefault -dquote -nocescapes],
+["a\ttab" "form\ffeed" and "new\nline"],
 [NF: 4
 0: attab
 1: formffeed
@@ -543,6 +543,15 @@ TESTWSP([C escapes off],[wcp-c-escape-off],[-nocescapes],
 TOTAL: 4
 ])
+TESTWSP([C escapes on (unquoted)],[wcp-c-escape],[-nodefault -cescapes],
+[a\ttab \"form\ffeed\" and\ new\\nline],
+[NF: 3
+0: a\ttab
+1: "\"form\ffeed\""
+2: "and new\\nline"
+TOTAL: 3
+])
 TESTWSP([ws elimination],[wsp-ws-elim],[-delim ' ()' -ws -return_delims],
 [( list items )],
 [NF: 4

wordsplit.c

@@ -1,5 +1,5 @@
 /* wordsplit - a word splitter
-   Copyright (C) 2009-2021 Sergey Poznyakoff
+   Copyright (C) 2009-2023 Sergey Poznyakoff
    This program is free software; you can redistribute it and/or modify it
    under the terms of the GNU General Public License as published by the
@@ -67,6 +67,8 @@ is_name_char (struct wordsplit *wsp, int c)
 #define to_num(c) \
   (ISDIGIT(c) ? c - '0' : (ISXDIGIT(c) ? toupper(c) - 'A' + 10 : 255 ))
+static int wsplt_unquote_char (const char *transtab, int c);
 #define ALLOC_INIT 128
 #define ALLOC_INCR 128
@@ -247,7 +249,16 @@ wordsplit_init0 (struct wordsplit *wsp)
   wsp->ws_errno = 0;
 }
-char wordsplit_c_escape_tab[] = "\\\\\"\"a\ab\bf\fn\nr\rt\tv\v";
+char const *wordsplit_escape[] = {
+  /* C-style escapes, for quoted strings */
+  [WS_ESC_C] = "\\\\\"\"a\ab\bf\fn\nr\rt\tv\v",
+  /* C-style escapes, outsize of quoted strings */
+  [WS_ESC_C_WS] = "\\\\\"\"a\ab\bf\fn\nr\rt\tv\v  \t\t",
+  /* Escape double-quote and backslash. */
+  [WS_ESC_DQ] = "\\\\\"\"",
+  /* Escape double-quote, backslash, and whitespace. */
+  [WS_ESC_DQ_WS] = "\\\\\"\"  \t\t"
+};
 static int
 wordsplit_init (struct wordsplit *wsp, const char *input, size_t len,
@@ -314,21 +325,17 @@ wordsplit_init (struct wordsplit *wsp, const char *input, size_t len,
       if (!wsp->ws_escape[WRDSX_QUOTE])
         wsp->ws_escape[WRDSX_QUOTE] = "";
     }
+  else if (wsp->ws_flags & WRDSF_CESCAPES)
+    {
+      wsp->ws_escape[WRDSX_WORD] = wordsplit_escape[WS_ESC_C_WS];
+      wsp->ws_escape[WRDSX_QUOTE] = wordsplit_escape[WS_ESC_C];
+      wsp->ws_options |= WRDSO_OESC_QUOTE | WRDSO_OESC_WORD
+        | WRDSO_XESC_QUOTE | WRDSO_XESC_WORD;
+    }
   else
     {
-      if (wsp->ws_flags & WRDSF_CESCAPES)
-        {
-          wsp->ws_escape[WRDSX_WORD] = wordsplit_c_escape_tab;
-          wsp->ws_escape[WRDSX_QUOTE] = wordsplit_c_escape_tab;
-          wsp->ws_options |= WRDSO_OESC_QUOTE | WRDSO_OESC_WORD
-            | WRDSO_XESC_QUOTE | WRDSO_XESC_WORD;
-        }
-      else
-        {
-          wsp->ws_escape[WRDSX_WORD] = "";
-          wsp->ws_escape[WRDSX_QUOTE] = "\\\\\"\"";
-          wsp->ws_options |= WRDSO_BSKEEP_QUOTE;
-        }
+      wsp->ws_escape[WRDSX_WORD] = "";
+      wsp->ws_escape[WRDSX_QUOTE] = "";
     }
   if (!(wsp->ws_options & WRDSO_PARAMV))
@@ -700,14 +707,8 @@ wsnode_quoteremoval (struct wordsplit *wsp)
     {
       const char *str = wsnode_ptr (wsp, p);
       size_t slen = wsnode_len (p);
-      int unquote;
-      if (wsp->ws_flags & WRDSF_QUOTE)
-        unquote = !(p->flags & _WSNF_NOEXPAND);
-      else
-        unquote = 0;
-      if (unquote)
+      if (!(p->flags & _WSNF_NOEXPAND))
         {
           if (!(p->flags & _WSNF_WORD))
             {
@@ -2303,28 +2304,32 @@ scan_word (struct wordsplit *wsp, size_t start, int consume_all)
           return _WRDS_OK;
         }
-      if (wsp->ws_flags & WRDSF_QUOTE)
+      if (command[i] == '\\')
         {
-          if (command[i] == '\\')
+          if (i + 1 == len)
             {
-              if (++i == len)
-                break;
               i++;
+              break;
+            }
+          if (wsplt_unquote_char (wsp->ws_escape[WRDSX_WORD], command[i+1]))
+            {
+              i += 2;
               continue;
             }
+        }
-          if (((wsp->ws_flags & WRDSF_SQUOTE) && command[i] == '\'') ||
-              ((wsp->ws_flags & WRDSF_DQUOTE) && command[i] == '"'))
-            {
-              if (join && wsp->ws_tail)
-                wsp->ws_tail->flags |= _WSNF_JOIN;
-              if (wordsplit_add_segm (wsp, start, i, _WSNF_JOIN))
-                return _WRDS_ERR;
-              if (scan_qstring (wsp, i, &i))
-                return _WRDS_ERR;
-              start = i + 1;
-              join = 1;
-            }
+      if ((wsp->ws_flags & WRDSF_QUOTE) &&
+          (((wsp->ws_flags & WRDSF_SQUOTE) && command[i] == '\'') ||
+           ((wsp->ws_flags & WRDSF_DQUOTE) && command[i] == '"')))
+        {
+          if (join && wsp->ws_tail)
+            wsp->ws_tail->flags |= _WSNF_JOIN;
+          if (wordsplit_add_segm (wsp, start, i, _WSNF_JOIN))
+            return _WRDS_ERR;
+          if (scan_qstring (wsp, i, &i))
+            return _WRDS_ERR;
+          start = i + 1;
+          join = 1;
         }
       if (command[i] == '$')
@@ -2449,13 +2454,13 @@ wsplt_quote_char (const char *transtab, int c)
 int
 wordsplit_c_unquote_char (int c)
 {
-  return wsplt_unquote_char (wordsplit_c_escape_tab, c);
+  return wsplt_unquote_char (wordsplit_escape[WS_ESC_C], c);
 }
 int
 wordsplit_c_quote_char (int c)
 {
-  return wsplt_quote_char (wordsplit_c_escape_tab, c);
+  return wsplt_quote_char (wordsplit_escape[WS_ESC_C], c);
 }
 void

wordsplit.h

@@ -1,5 +1,5 @@
 /* wordsplit - a word splitter
-   Copyright (C) 2009-2021 Sergey Poznyakoff
+   Copyright (C) 2009-2023 Sergey Poznyakoff
    This program is free software; you can redistribute it and/or modify it
    under the terms of the GNU General Public License as published by the
@@ -309,4 +309,14 @@ const char *wordsplit_strerror (wordsplit_t *ws);
 void wordsplit_clearerr (wordsplit_t *ws);
+enum
+  {
+    WS_ESC_C,      /* C-style escapes, for quoted strings */
+    WS_ESC_C_WS,   /* C-style escapes plus whitespace. For unquoted words */
+    WS_ESC_DQ,     /* Escape double-quote and backslash. */
+    WS_ESC_DQ_WS,  /* Escape double-quote, backslash, and whitespace. */
+  };
+extern char const *wordsplit_escape[];
 #endif

wsp.c

@@ -1,5 +1,5 @@
 /* wsp - test program for wordsplit
-   Copyright (C) 2014-2021 Sergey Poznyakoff
+   Copyright (C) 2014-2023 Sergey Poznyakoff
    Wordsplit is free software; you can redistribute it and/or modify it
    under the terms of the GNU General Public License as published by the
@@ -221,6 +221,8 @@ getwsopt (int argc, char **argv, struct wsopt *wso, struct wsclosure *wsc)
             }
           arg = argv[wsoptind++];
         }
+      else
+        arg = NULL;
       wso->setfn (wso->tok, negate, arg, wsc);
     }
   return 0;