mirror of
git://git.gnu.org.ua/wordsplit.git
synced 2025-04-25 16:19:54 +03:00
Fix default escape settings.
* wordsplit.c (wordsplit_escape): New global.
(wordsplit_init): Backslash interpretation is disabled if not explicitly configured.
(wsnode_quoteremoval): Unquote unless _WSNF_NOEXPAND is set.
(scan_word): Fix backslash handling if WRDSF_QUOTE flags are set.
* wsp.c: Fix option handling.
* wordsplit.at: Test handling of C-style escapes.
* README: Document changes.
* wordsplit.3: Likewise.
parent 403b1c769f
commit e2f0c64db9
6 changed files with 146 additions and 78 deletions
25
README
@@ -1,7 +1,10 @@
README file for the wordsplit library
See the end of file for copying conditions.

* Overview

This package provides a set of C functions for parsing input strings.
Default parsing rules are are similar to those used in Bourne shell.
Default parsing rules are similar to those used in Bourne shell.
This includes tilde expansion, variable expansion, quote removal, word
splitting, command substitution, and path expansion. Parsing is
controlled by a number of settings which allow the caller to alter
@@ -46,7 +49,7 @@ programs. It consists of the following files:
wordsplit.c - Main source file.
wordsplit.3 - Documentation.

For most uses, you will need only these three. The rest of files
For most uses, you will need only these three. The remaining files
are for building the autotest-based testsuite:

wsp.c - Auxiliary test program.
@@ -54,7 +57,7 @@ are for building the autotest-based testsuite:

* Incorporating wordsplit into your project

The project is designed to be used as a git submodule. To incorporate
Wordsplit is designed to be used as a git submodule. To incorporate
it into your project, first select the location for the wordsplit
directory within your project. Then add the submodule at this
location. The rest is quite straightforward: you need to add
@@ -117,7 +120,7 @@ Add wordsplit.c to the nodist_program_SOURCES variable:

The nodist_ prefix is necessary to prevent Make from trying to
distribute this file from the current directory (where it doesn't
exist of course). During compilation it will be located using VPATH.
exist, of course). During compilation it will be located using VPATH.

Finally, add both wordsplit/wordsplit.c and wordsplit/wordsplit.h to
the EXTRA_DIST variable and modify AM_CPPFLAGS as shown in the
@@ -213,18 +216,18 @@ Then, add the following fragment to build the auxiliary files:

* History

First version of wordsplit appeared in March 2009 as a part of the
First version of wordsplit appeared in March 2009 as part of the
Wydawca[1] project. Its main usage was to assist in configuration
file parsing. The parser subsystem proved to be quite useful and
soon evolved into a separate project - Grecs[2]. This package had been
since used (as a git submodule) in a number of other projects, such as
GNU Dico[3] and Direvent[4], to name a few.

In 2010 the wordsplit sources were incorporated to the GNU
Mailutils[5] package, where they replaced the obsolete argcv module.
Mailutils uses its own configuration package, which meant that using
Grecs was not expedient. Therefore the sources had been exported from
Grecs. Since then both Mailutils and Grecs versions are periodically
In 2010 wordsplit sources were incorporated to the GNU Mailutils[5]
package, where they replaced the obsolete argcv module. Mailutils
uses its own configuration package, which meant that using Grecs was
not expedient. Therefore the sources had been exported from
Grecs. Since then both Mailutils and Grecs versions were periodically
synchronized.

Several other projects, such as GNU Rush[6] and fileserv[7], followed
@@ -275,7 +278,7 @@ the following information:

* Copying

Copyright (C) 2009-2021 Sergey Poznyakoff
Copyright (C) 2009-2023 Sergey Poznyakoff

Permission is granted to anyone to make or distribute verbatim copies
of this document as received, in any medium, provided that the
79
wordsplit.3
@@ -14,7 +14,7 @@
.\" You should have received a copy of the GNU General Public License
.\" along with wordsplit. If not, see <http://www.gnu.org/licenses/>.
.\"
.TH WORDSPLIT 3 "July 24, 2019" "WORDSPLIT" "Wordsplit User Reference"
.TH WORDSPLIT 3 "June 22, 2023" "WORDSPLIT" "Wordsplit User Reference"
.SH NAME
wordsplit \- split string into words
.SH SYNOPSIS
@@ -299,16 +299,15 @@ the \fBWRDSF_DQUOTE\fR flag. The macro \fBWRDSF_QUOTE\fR enables both.
Backslash interpretation translates unquoted
.I escape sequences
into corresponding characters. An escape sequence is a backslash followed
by one or more characters. By default, each sequence \fB\\\fIC\fR
appearing in unquoted words is replaced with the character \fIC\fR. In
doubly-quoted strings, two backslash sequences are recognized:
\fB\\\\\fR translates to a single backslash, and \fB\\\(dq\fR
translates to a double-quote.
by one or more characters. By default, that is if no flags are
supplied, no escape sequences are defined, and each sequence
\fB\\\fIC\fR is reproduced verbatim.
.PP
There are several ways to enable backslash interpretation and to
define escape sequences. The simplest one is to use the
\fBWRDSF_CESCAPES\fR flag. This flag defines the C-like escape
sequences:
.PP
Two flags are provided to modify this behavior. If
.I WRDSF_CESCAPES
flag is set, the following escape sequences are recognized:
.sp
.nf
.ta 8n 18n 42n
.ul
@@ -329,19 +328,59 @@ for a two-digit hex number is replaced with ASCII character \fINN\fR.
The sequence \fB\\0\fINNN\fR, where \fINNN\fR stands for a three-digit
octal number is replaced with ASCII character whose code is \fINNN\fR.
.PP
The \fBWRDSF_ESCAPE\fR flag allows the caller to customize escape
sequences. If it is set, the \fBws_escape\fR member must be
initialized. This member provides escape tables for unquoted words
(\fBws_escape[0]\fR) and quoted strings (\fBws_escape[1]\fR). Each
table is a string consisting of an even number of characters. In each
pair of characters, the first one is a character that can appear after
backslash, and the following one is its translation. For example, the
above table of C escapes is represented as
\fB\(dq\\\\\\\\"\\"a\\ab\\bf\\fn\\nr\\rt\\tv\\v\(dq\fR.
Additionally, outside of quoted strings (if these are enabled by the
use of \fBWRDSF_DQUOTE\fR flag) backslash character can be used to
escape horizontal whitespace: horizontal space (ASCII 32) and
tab (ASCII 9) characters.
.PP
It is valid to initialize \fBws_escape\fR elements to zero. In this
The \fBWRDSF_CESCAPES\fR bit is included in the default flag
set \fBWRDSF_DEFFLAGS\fR.
.PP
The \fBWRDSF_ESCAPE\fR flag provides a more elaborate way of defining
escape sequences. If it is set, the \fBws_escape\fR member must be
initialized. This member provides escape tables for unquoted words
(\fBws_escape[WRDSX_WORD]\fR) and quoted strings
(\fBws_escape[WRDSX_QUOTE]\fR). Each table is a string consisting of
an even number of characters. In each pair of characters, the first
one is a character that can appear after backslash, and the following
one is its translation. For example, the table of C escapes is
represented as follows:
.TP
\fB\(dq\\\\\\\\"\\"a\\ab\\bf\\fn\\nr\\rt\\tv\\v\(dq\fR
.PP
It is valid to initialize \fBws_escape\fR elements to NULL. In this
case, no backslash translation occurs.
.PP
For convenience, the global variable
.B wordsplit_escape
defines several most often used escape translation tables:
.PP
.EX
extern char const *wordsplit_escape[];
.EE
.PP
It is indexed by the following constants:
.TP
.B WS_ESC_C
C-style escapes, the definition of which is shown above. This is the
translation table that is used within quoted strings when
.B WRDSF_CESCAPES
is in effect.
.TP
.B WS_ESC_C_WS
The \fBWS_ESC_C\fR table augmented by two entries: for horizontal tab
character and whitespace. This is the table that is used for unquoted
words when
.B WRDSF_CESCAPES
is in effect.
.TP
.B WS_ESC_DQ
Backslash character escapes double-quote and itself. Useful for
handling doubly-quoted strings in various Internet protocols.
.TP
.B WS_ESC_DQ_WS
Escape double-quote, backslash, horizontal tab and whitespace characters.
.PP
Interpretation of octal and hex escapes is controlled by the following
bits in \fBws_options\fR:
.TP
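The escape tables described in the wordsplit.3 changes above are plain strings of character pairs: the character that may follow a backslash, then its translation. As a rough illustration only, a lookup over such a table might be sketched in C as follows. The names lookup_escape, c_escapes, c_escapes_ws and lookup_demo are hypothetical and are not part of the wordsplit API; the table contents are assumed to match the C-escape tables shown in this commit, and treating a NULL table as "no escapes" follows the manual's wording rather than the library's internal code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of wordsplit): look up character C in an
   escape table made of (escaped character, translation) pairs.
   Returns the translation, or 0 if the table has no entry for C. */
static int
lookup_escape (const char *tab, int c)
{
  if (!tab)                     /* assumption: NULL table defines no escapes */
    return 0;
  for (; tab[0] && tab[1]; tab += 2)
    if ((unsigned char) tab[0] == c)
      return (unsigned char) tab[1];
  return 0;
}

/* Assumed to match the C-style table documented for quoted strings. */
static const char c_escapes[] = "\\\\\"\"a\ab\bf\fn\nr\rt\tv\v";

/* The same table with two extra pairs: space -> space and tab -> tab. */
static const char c_escapes_ws[] = "\\\\\"\"a\ab\bf\fn\nr\rt\tv\v  \t\t";

/* Usage sketch: a few lookups and the results they give. */
static void
lookup_demo (void)
{
  assert (lookup_escape (c_escapes, 'n') == '\n');    /* \n is translated */
  assert (lookup_escape (c_escapes, 'x') == 0);       /* \x has no entry */
  assert (lookup_escape (c_escapes, ' ') == 0);       /* no whitespace pair */
  assert (lookup_escape (c_escapes_ws, ' ') == ' ');  /* whitespace pair */
  assert (lookup_escape (c_escapes_ws, '\t') == '\t');
  assert (lookup_escape ("", 'n') == 0);              /* empty table */
  assert (lookup_escape (NULL, 'n') == 0);            /* NULL table */
}
```

Calling lookup_demo() exercises each case. The only difference between the two tables is the trailing whitespace pairs, which is why a backslash-escaped space is meaningful in unquoted words but not inside quoted strings.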
19
wordsplit.at
@@ -1,5 +1,5 @@
# Test suite for wordsplit -*- Autotest -*-
# Copyright (C) 2014-2021 Sergey Poznyakoff
# Copyright (C) 2014-2023 Sergey Poznyakoff
#
# Wordsplit is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
@@ -523,8 +523,8 @@ TOTAL: 3

WSPGROUP()

TESTWSP([C escapes on],[wcp-c-escape],[-cescapes],
[a\ttab form\ffeed and new\nline],
TESTWSP([C escapes on],[wcp-c-escape],[-nodefault -dquote -cescapes],
["a\ttab" "form\ffeed" and "new\nline"],
[NF: 4
0: a\ttab
1: form\ffeed
@@ -533,8 +533,8 @@ TESTWSP([C escapes on],[wcp-c-escape],[-cescapes],
TOTAL: 4
])

TESTWSP([C escapes off],[wcp-c-escape-off],[-nocescapes],
[a\ttab form\ffeed and new\nline],
TESTWSP([C escapes off],[wcp-c-escape-off],[-nodefault -dquote -nocescapes],
["a\ttab" "form\ffeed" and "new\nline"],
[NF: 4
0: attab
1: formffeed
@@ -543,6 +543,15 @@ TESTWSP([C escapes off],[wcp-c-escape-off],[-nocescapes],
TOTAL: 4
])

TESTWSP([C escapes on (unquoted)],[wcp-c-escape],[-nodefault -cescapes],
[a\ttab \"form\ffeed\" and\ new\\nline],
[NF: 3
0: a\ttab
1: "\"form\ffeed\""
2: "and new\\nline"
TOTAL: 3
])

TESTWSP([ws elimination],[wsp-ws-elim],[-delim ' ()' -ws -return_delims],
[( list items )],
[NF: 4
85
wordsplit.c
@@ -1,5 +1,5 @@
/* wordsplit - a word splitter
Copyright (C) 2009-2021 Sergey Poznyakoff
Copyright (C) 2009-2023 Sergey Poznyakoff

This program is free software; you can redistribute it and/or modify it
under the terms of the GNU General Public License as published by the
@@ -67,6 +67,8 @@ is_name_char (struct wordsplit *wsp, int c)
#define to_num(c) \
(ISDIGIT(c) ? c - '0' : (ISXDIGIT(c) ? toupper(c) - 'A' + 10 : 255 ))

static int wsplt_unquote_char (const char *transtab, int c);

#define ALLOC_INIT 128
#define ALLOC_INCR 128

@@ -247,7 +249,16 @@ wordsplit_init0 (struct wordsplit *wsp)
wsp->ws_errno = 0;
}

char wordsplit_c_escape_tab[] = "\\\\\"\"a\ab\bf\fn\nr\rt\tv\v";
char const *wordsplit_escape[] = {
/* C-style escapes, for quoted strings */
[WS_ESC_C] = "\\\\\"\"a\ab\bf\fn\nr\rt\tv\v",
/* C-style escapes, outsize of quoted strings */
[WS_ESC_C_WS] = "\\\\\"\"a\ab\bf\fn\nr\rt\tv\v  \t\t",
/* Escape double-quote and backslash. */
[WS_ESC_DQ] = "\\\\\"\"",
/* Escape double-quote, backslash, and whitespace. */
[WS_ESC_DQ_WS] = "\\\\\"\"  \t\t"
};

static int
wordsplit_init (struct wordsplit *wsp, const char *input, size_t len,
@@ -314,21 +325,17 @@ wordsplit_init (struct wordsplit *wsp, const char *input, size_t len,
if (!wsp->ws_escape[WRDSX_QUOTE])
wsp->ws_escape[WRDSX_QUOTE] = "";
}
else if (wsp->ws_flags & WRDSF_CESCAPES)
{
wsp->ws_escape[WRDSX_WORD] = wordsplit_escape[WS_ESC_C_WS];
wsp->ws_escape[WRDSX_QUOTE] = wordsplit_escape[WS_ESC_C];
wsp->ws_options |= WRDSO_OESC_QUOTE | WRDSO_OESC_WORD
| WRDSO_XESC_QUOTE | WRDSO_XESC_WORD;
}
else
{
if (wsp->ws_flags & WRDSF_CESCAPES)
{
wsp->ws_escape[WRDSX_WORD] = wordsplit_c_escape_tab;
wsp->ws_escape[WRDSX_QUOTE] = wordsplit_c_escape_tab;
wsp->ws_options |= WRDSO_OESC_QUOTE | WRDSO_OESC_WORD
| WRDSO_XESC_QUOTE | WRDSO_XESC_WORD;
}
else
{
wsp->ws_escape[WRDSX_WORD] = "";
wsp->ws_escape[WRDSX_QUOTE] = "\\\\\"\"";
wsp->ws_options |= WRDSO_BSKEEP_QUOTE;
}
wsp->ws_escape[WRDSX_WORD] = "";
wsp->ws_escape[WRDSX_QUOTE] = "";
}

if (!(wsp->ws_options & WRDSO_PARAMV))
@@ -700,14 +707,8 @@ wsnode_quoteremoval (struct wordsplit *wsp)
{
const char *str = wsnode_ptr (wsp, p);
size_t slen = wsnode_len (p);
int unquote;

if (wsp->ws_flags & WRDSF_QUOTE)
unquote = !(p->flags & _WSNF_NOEXPAND);
else
unquote = 0;

if (unquote)
if (!(p->flags & _WSNF_NOEXPAND))
{
if (!(p->flags & _WSNF_WORD))
{
@@ -2303,28 +2304,32 @@ scan_word (struct wordsplit *wsp, size_t start, int consume_all)
return _WRDS_OK;
}

if (wsp->ws_flags & WRDSF_QUOTE)
if (command[i] == '\\')
{
if (command[i] == '\\')
if (i + 1 == len)
{
if (++i == len)
break;
i++;
break;
}
if (wsplt_unquote_char (wsp->ws_escape[WRDSX_WORD], command[i+1]))
{
i += 2;
continue;
}
}

if (((wsp->ws_flags & WRDSF_SQUOTE) && command[i] == '\'') ||
((wsp->ws_flags & WRDSF_DQUOTE) && command[i] == '"'))
{
if (join && wsp->ws_tail)
wsp->ws_tail->flags |= _WSNF_JOIN;
if (wordsplit_add_segm (wsp, start, i, _WSNF_JOIN))
return _WRDS_ERR;
if (scan_qstring (wsp, i, &i))
return _WRDS_ERR;
start = i + 1;
join = 1;
}
if ((wsp->ws_flags & WRDSF_QUOTE) &&
(((wsp->ws_flags & WRDSF_SQUOTE) && command[i] == '\'') ||
((wsp->ws_flags & WRDSF_DQUOTE) && command[i] == '"')))
{
if (join && wsp->ws_tail)
wsp->ws_tail->flags |= _WSNF_JOIN;
if (wordsplit_add_segm (wsp, start, i, _WSNF_JOIN))
return _WRDS_ERR;
if (scan_qstring (wsp, i, &i))
return _WRDS_ERR;
start = i + 1;
join = 1;
}

if (command[i] == '$')
@@ -2449,13 +2454,13 @@ wsplt_quote_char (const char *transtab, int c)
int
wordsplit_c_unquote_char (int c)
{
return wsplt_unquote_char (wordsplit_c_escape_tab, c);
return wsplt_unquote_char (wordsplit_escape[WS_ESC_C], c);
}

int
wordsplit_c_quote_char (int c)
{
return wsplt_quote_char (wordsplit_c_escape_tab, c);
return wsplt_quote_char (wordsplit_escape[WS_ESC_C], c);
}

void
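The scan_word change above makes the scanner step over a backslash together with the following character only when that character has an entry in the escape table for unquoted words. The effect on word boundaries can be sketched with a much simplified scanner. This is an illustration under stated assumptions, not the library's code: is_escapable, first_word_len and scan_demo are hypothetical names, words are assumed to be delimited by space and tab only, and quoting and variable expansion are ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: nonzero if C has an entry in escape table TAB,
   where TAB is a string of (escaped character, translation) pairs. */
static int
is_escapable (const char *tab, int c)
{
  for (; tab[0] && tab[1]; tab += 2)
    if ((unsigned char) tab[0] == c)
      return 1;
  return 0;
}

/* Return the length of the first word in S, assuming space and tab are
   the only delimiters.  A backslash followed by a character listed in
   TAB protects that character, so it cannot end the word. */
static size_t
first_word_len (const char *s, const char *tab)
{
  size_t len = strlen (s);
  size_t i = 0;

  while (i < len && s[i] != ' ' && s[i] != '\t')
    {
      if (s[i] == '\\' && i + 1 < len
          && is_escapable (tab, (unsigned char) s[i + 1]))
        i += 2;                 /* skip the backslash and the escaped char */
      else
        i++;                    /* ordinary character or unknown escape */
    }
  return i;
}

/* Usage sketch: the input is the 8-character string  and\ new  */
static void
scan_demo (void)
{
  /* Table with whitespace pairs: the escaped space stays in the word. */
  assert (first_word_len ("and\\ new", "  \t\t") == 8);
  /* Empty table: the backslash protects nothing, the space ends the word. */
  assert (first_word_len ("and\\ new", "") == 4);
  /* A trailing backslash is simply the last character of the word. */
  assert (first_word_len ("a\\", "  \t\t") == 2);
}
```

This mirrors the "C escapes on (unquoted)" test case in wordsplit.at, where the input and\ new\\nline is returned as a single word.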
12
wordsplit.h
@@ -1,5 +1,5 @@
/* wordsplit - a word splitter
Copyright (C) 2009-2021 Sergey Poznyakoff
Copyright (C) 2009-2023 Sergey Poznyakoff

This program is free software; you can redistribute it and/or modify it
under the terms of the GNU General Public License as published by the
@@ -309,4 +309,14 @@ const char *wordsplit_strerror (wordsplit_t *ws);

void wordsplit_clearerr (wordsplit_t *ws);

enum
{
WS_ESC_C, /* C-style escapes, for quoted strings */
WS_ESC_C_WS, /* C-style escapes plus whitespace. For unquoted words */
WS_ESC_DQ, /* Escape double-quote and backslash. */
WS_ESC_DQ_WS, /* Escape double-quote, backslash, and whitespace. */
};

extern char const *wordsplit_escape[];

#endif
4
wsp.c
@@ -1,5 +1,5 @@
/* wsp - test program for wordsplit
Copyright (C) 2014-2021 Sergey Poznyakoff
Copyright (C) 2014-2023 Sergey Poznyakoff

Wordsplit is free software; you can redistribute it and/or modify it
under the terms of the GNU General Public License as published by the
@@ -221,6 +221,8 @@ getwsopt (int argc, char **argv, struct wsopt *wso, struct wsclosure *wsc)
}
arg = argv[wsoptind++];
}
else
arg = NULL;
wso->setfn (wso->tok, negate, arg, wsc);
}
return 0;