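Is it invalid to put a colon (:) in an HTTP query parameter?

On the unexpected behavior of PHP's $_SERVER['REQUEST_URI'] and parse_url() - こせきの技術日記

This is a follow-up to that post.

PHP's parse_url()

can parse "/abc?a=x&time=09:00&x=y",
but fails on "/abc?a=x&time=09:00".

Apparently this is because parse_url() is specified as "not working" with relative URIs. Setting that aside, I wanted to know whether percent-encoding the colon is actually required, so I looked into it.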

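The URI specification: RFC 3986

First, there is RFC 3986, the URI specification that everything else builds on.

RFC 3986 - Uniform Resource Identifier (URI): Generic Syntax
Uniform Resource Identifier (URI): Generic Syntax (Japanese translation)
- RFC 1738 - Uniform Resource Locators (URL): the old URL specification (updated by RFC 3986)
- RFC 1738 Japanese translation
- RFC 2396 - Uniform Resource Identifiers (URI): Generic Syntax: the old URI specification (obsoleted by RFC 3986)
- RFC2396J: RFC 2396 Japanese translation
Information-technology Promotion Agency (IPA): Information Security: Research Reports: Survey of Information Security Technology Trends (second half of 2009), "URI escaping"
- An IPA article. Very detailed and helpful.

The ABNF in RFC 3986 that defines which characters may appear in a query is shown below. The query is defined as running from the ? up to the # or the end of the URI.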

query         = *( pchar / "/" / "?" )
pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"
unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
pct-encoded   = "%" HEXDIG HEXDIG
sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
              / "*" / "+" / "," / ";" / "="

Quite a lot of characters are allowed. That does not mean all of them can be used freely, however.
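To check a concrete string against this grammar, the ABNF can be transcribed into a regular expression. This is only a rough sketch: `QUERY_RE` and `valid_rfc3986_query?` are names I made up, and the pattern covers just the query rule quoted above, not a whole URI.

```ruby
# Rough transcription of the RFC 3986 query ABNF into a regular expression.
# QUERY_RE and valid_rfc3986_query? are illustrative names, not a library API.
QUERY_RE = %r{
  \A
  (?:
    [A-Za-z0-9\-._~]      # unreserved
  | %[0-9A-Fa-f]{2}       # pct-encoded
  | [!$&'()*+,;=]         # sub-delims
  | [:@]                  # extra characters allowed by pchar
  | [/?]                  # extra characters allowed by query
  )*
  \z
}x

# True when the string (without the leading ?) matches the query grammar.
def valid_rfc3986_query?(query)
  QUERY_RE.match?(query)
end

puts valid_rfc3986_query?("a=x&time=09:00")    # raw colon
puts valid_rfc3986_query?("a=x&time=09%3A00")  # encoded colon
puts valid_rfc3986_query?("a=b c")             # raw space
puts valid_rfc3986_query?("a=[1]")             # raw brackets
```

As far as the grammar goes, the first two strings are accepted and the last two are not. So a raw colon is at least syntactically fine in a query.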

Separately from this, section 2.2 defines the reserved characters.

reserved    = gen-delims / sub-delims
gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"
sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
            / "*" / "+" / "," / ";" / "="

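query is permissive, yet reserved is strict.

Below I quote the passage that explains reserved and add my own reading after each piece (a paraphrase of how I understand it, not a faithful translation). A "component" is one of the parts that make up a URI, such as the scheme, path, or query.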

A component's ABNF syntax rule will not use the reserved or gen-delims rule names directly;
The reserved rule is not used directly in a component's ABNF syntax rule.

each syntax rule lists the characters allowed within that component (i.e., not delimiting it),
Each syntax rule lists the characters that are allowed in that component.

and any of those characters that are also in the reserved set are "reserved" for use as subcomponent delimiters within the component.
Characters that are allowed in a component and are also in reserved are set aside for use as subcomponent delimiters.

Only the most common subcomponents are defined by this specification;
Only the most common subcomponents are defined in this spec.

other subcomponents may be defined by a URI scheme's specification, or
Other subcomponents may be defined by the URI scheme's spec, or

by the implementation-specific syntax of a URI's dereferencing algorithm,
by the syntax specific to an implementation of the URI dereferencing algorithm,

provided that such subcomponents are delimited by characters in the reserved set allowed within that component.
as long as those subcomponents are delimited by reserved characters that are allowed in the component.
RFC 3986 - Uniform Resource Identifier (URI): Generic Syntax

In short:

Reserved characters are used to split a component into subcomponents.
How the subcomponents are laid out is decided by each URI scheme's spec or by the application.

There is a further explanation, this time about applications that build URIs.

URI producing applications should percent-encode data octets that correspond to characters in the reserved set
Applications that build URIs should percent-encode characters in reserved.

unless these characters are specifically allowed by the URI scheme to represent data in that component.
But they may be used as-is if the URI scheme specifically allows it.
RFC 3986 - Uniform Resource Identifier (URI): Generic Syntax

So, as a rule, reserved characters in data should be percent-encoded. If the http scheme's spec says that colons and slashes may appear in a query, though, they can be used raw.
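The producer-side rule might be sketched like this. `percent_encode` and its `allowed_raw` parameter are names I invented for illustration; `allowed_raw` stands in for "characters the scheme specifically allows", which is exactly the part I have not been able to pin down for http.

```ruby
# Minimal sketch of the RFC 3986 producer rule: leave unreserved characters
# alone, percent-encode everything else unless the caller says a character
# is allowed raw. percent_encode and allowed_raw are hypothetical names.
def percent_encode(data, allowed_raw = "")
  data.each_char.map { |ch|
    if ch.match?(/\A[A-Za-z0-9\-._~]\z/) || allowed_raw.include?(ch)
      ch
    else
      # Encode each byte of the character as %HH (uppercase hex).
      ch.bytes.map { |b| format("%%%02X", b) }.join
    end
  }.join
end

puts percent_encode("09:00")       # → 09%3A00
puts percent_encode("09:00", ":")  # → 09:00
```

With an empty `allowed_raw` this should behave much like the strict "encode everything that is not unreserved" reading. Passing ":" models a scheme that lets the colon through as data.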

And about applications that parse URIs:

If a reserved character is found in a URI component and no delimiting role is known for that character,
If a reserved character with no known delimiter role is found in a component,

then it must be interpreted as representing the data octet corresponding to that character's encoding in US-ASCII.
it must be interpreted as the corresponding ASCII character.
RFC 3986 - Uniform Resource Identifier (URI): Generic Syntax

Going by this, I think PHP's parse_url() should not fail to parse because of a colon (:) (not that anyone is claiming it should).
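A parser that follows this rule could look roughly like the sketch below. It assumes the only delimiters the application knows about inside the query are & and =, so a colon has no delimiting role and is simply data. `percent_decode` and `split_query` are made-up helper names, and + is deliberately not turned into a space because that is a form-encoding convention rather than part of RFC 3986.

```ruby
# Decode %HH sequences byte by byte; the result is assumed to be UTF-8 text.
# percent_decode and split_query are illustrative helpers, not a library API.
def percent_decode(text)
  text.b.gsub(/%(\h\h)/) { $1.hex.chr }.force_encoding(Encoding::UTF_8)
end

# Split a query on the delimiters this application knows (& and =).
# Any other reserved character, such as ":", is kept as plain data.
def split_query(query)
  query.split("&").map do |pair|
    name, value = pair.split("=", 2)
    [percent_decode(name.to_s), percent_decode(value.to_s)]
  end
end

p split_query("a=x&time=09:00")    # → [["a", "x"], ["time", "09:00"]]
p split_query("a=x&time=09%3A00")  # → [["a", "x"], ["time", "09:00"]]
```

Both spellings come out as the same data, which is how I read the quoted paragraph.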

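So the next thing to read would be the spec for the http URI scheme, I thought, but when I went looking I could not find one. Is there no standalone spec for the http scheme?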

HTML 4.01

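I am building a web service right now, so HTML has nothing to do with what I am doing.

It has nothing to do with it, but since I cannot find any other spec that applies, I will look at the parts that seem useful.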

If the method is "get" and the action is an HTTP URI,
If the method is GET and the action points to an HTTP URI,

the user agent takes the value of action,
take the action URI,

appends a `?' to it,
stick a ? on the end,

then appends the form data set, encoded using the "application/x-www-form-urlencoded" content type.
then append the form data encoded as application/x-www-form-urlencoded.

The user agent then traverses the link to this URI.
Then go and access that URI.

In this scenario, form data are restricted to ASCII codes.
Only ASCII data can be handled in this scenario, though. (Thud.)
Forms in HTML documents

Leaving the last line aside: for GET requests, HTML 4.01 says to use x-www-form-urlencoded for the URI query. Given the name "urlencoded" that seems almost obvious, but does the same hold for the http scheme under RFC 3986?

The HTML 4.01 specification of x-www-form-urlencoded is as follows. It refers to RFC 1738, the URL spec from 1994.

Control names and values are escaped.
Names and values are escaped.

Space characters are replaced by `+',
Spaces become +.

and then reserved characters are escaped as described in [RFC1738], section 2.2:
Reserved characters are escaped according to RFC 1738 section 2.2. That is,

Non-alphanumeric characters are replaced by `%HH',
non-alphanumeric characters are replaced by %HH. (All of them?)

a percent sign and two hexadecimal digits representing the ASCII code of the character.
A % followed by the character's ASCII code in hexadecimal, and so on.

Line breaks are represented as "CR LF" pairs (i.e., `%0D%0A').
Line breaks are CRLF, %0D%0A.

The control names/values are listed in the order they appear in the document.
Names and values are listed in document order.

The name is separated from the value by `=' and
Name and value are separated by =, and

name/value pairs are separated from each other by `&'.
pairs are separated from each other by &.
Forms in HTML documents
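For comparison, Ruby's standard library has an encoder for this format, URI.encode_www_form. The sketch below builds a GET request URI the way the quoted text describes. The variable names and the /abc path are just for illustration, and Ruby's exact escaping set is its own implementation choice, so it may not match the HTML 4.01 wording character for character.

```ruby
require "uri"

# Form data as name/value pairs, listed in document order.
fields = [["a", "x"], ["time", "09:00"], ["note", "a b"]]

# Encode as application/x-www-form-urlencoded: pairs joined with "&",
# name and value joined with "=", space turned into "+".
form_query = URI.encode_www_form(fields)

# Append "?" and the encoded data to the action URI (hypothetical path).
request_uri = "/abc" + "?" + form_query

puts request_uri  # → /abc?a=x&time=09%3A00&note=a+b

# Decoding gives back the original pairs.
p URI.decode_www_form(form_query)
```

Here the colon does get encoded as %3A, which matches the "escape the reserved characters" reading of the form-encoding rules.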

The well-known business about separating values with semicolons comes up in a different place.

We recommend that HTTP server implementors, and in particular, CGI implementors support the use of ";" in place of "&" to save authors the trouble of escaping "&" characters in this manner.
Performance, Implementation, and Design Notes

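But if you separate with semicolons, doesn't the result stop conforming to the x-www-form-urlencoded spec?

RFC 1738, the URL specification (1994, old)

What does section 2.2 of RFC 1738, which HTML 4.01 refers to, actually say?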

Octets must be encoded
Octets must be encoded in the following cases:

if they have no corresponding graphic character within the US-ASCII coded character set,
when they are not a printable ASCII character,

if the use of the corresponding character is unsafe, or
when the character is unsafe, or

if the corresponding character is reserved for some other interpretation within the particular URL scheme.
when the character is reserved in the particular URL scheme.
RFC 1738 - Uniform Resource Locators (URL)

The characters listed as unsafe are the space and

<>"#%{}|\^~[]`

and the spec says:

All unsafe characters must always be encoded within a URL.
Unsafe characters must always be encoded.
RFC 1738 - Uniform Resource Locators (URL)

The characters listed as reserved are

;/?:@=&

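and the spec says:

Thus, only alphanumerics,
Alphanumerics,

the special characters "$-_.+!*'(),",
the non-reserved special characters ($-_.+!*'(),),

and reserved characters used for their reserved purposes
and reserved characters used for their reserved purposes

may be used unencoded within a URL.
are the only things that may be used unencoded.
RFC 1738 - Uniform Resource Locators (URL)

So if the data contains reserved characters, I think encoding them is mandatory.

The RFC 3986 idea that "a scheme can permit the use of reserved characters" does not appear in RFC 1738.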

application/x-www-form-urlencoded

Specifications of application/x-www-form-urlencoded other than HTML 4.01.

No specification yet exists that defines application/x-www-form-urlencoded on its own.
application/x-www-form-urlencoded

application/x-www-form-urlencoded - 通信用語の基礎知識

It has been in use from RFC 1866 (HTML 2.0) through the HTML5 drafts.
Trackback pings also use this Content-Type name.
However, the x- prefix is a problem. To fix this, registering application/www-form-urlencoded with IANA had been proposed for some time, and the draft was revived for HTML5 (I-D [hoehrmann-urlencoded-01]).
The draft specification of application/www-form-urlencoded is 8-bit with the encoding fixed to UTF-8, and it therefore treats the charset parameter as invalid.
http://www.wdic.org/w/WDIC/application/x-www-form-urlencoded

I had a look at the March 2011 draft,

draft-hoehrmann-urlencoded-01 - The application/www-form-urlencoded format

but it does not look usable in a URI query component. It escapes far too little.

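A summary of what I have found so far.

Reserved characters (: and / and so on) are basically things that should be encoded.

Under RFC 1738, reserved characters that appear as data are always encoded.
Under RFC 3986, raw reserved characters may be used to represent data if the scheme specifically allows it.

HTML 4.01 refers to RFC 1738, so it always encodes.

If you go with RFC 3986, it is unclear how the http scheme defines the query.

But still...

I cannot see what harm comes from an http query containing raw colons or slashes.

For the sake of URI readability, it seems to me the rules could be relaxed, at least for the http scheme. RFC 3986 makes that possible, after all.

Incidentally, Google seemed to have logic that leaves the colon (:) unencoded. Searching Google for a:b shows q=a:b in the browser's URL bar. In image search it becomes a%3Ab.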

Which characters get escaped

A script that takes every ASCII symbol and prints only the ones that get escaped under each rule set.

#! /usr/bin/env ruby
# -*- coding: utf-8 -*-

ascii = []

# Every printable ASCII character. Space (32) is left out.
(33..126).each do |i|
  ascii << i.chr
end

puts "* Characters that get encoded"
puts

# Keep only the symbols (drop letters and digits).
ascii.reject!{|c| c =~ /[a-zA-Z0-9]/ }
puts "           all: " +  ascii.join

# RFC 3986 unreserved characters
unreserved = %q{-._~}

# RFC 2396 unreserved characters
unrsvd2396 = unreserved + %q{!*'()}

# RFC 1738 characters that may appear unencoded (neither reserved nor unsafe)
unrsvd1738 = %q{-._!*'()$,+}

# RFC 3986 query characters. % is left out because it can only appear as part of a percent-encoding.
query    = %q{/?:@-._~!$&'()*+,;=}

# Characters that ECMAScript encodeURI() leaves alone
encodeuri = %q{-._~:/?#@!$&'()*+,;=}

# Everything that is not unreserved in RFC 3986
puts "       RFC3986: " + ascii.map {|c| unreserved.index(c).nil? ? c : ' ' }.join

# Everything that is not unreserved in RFC 2396. ECMAScript encodeURIComponent() escapes these.
puts "       RFC2396: " + ascii.map {|c| unrsvd2396.index(c).nil? ? c : ' ' }.join

# Everything that must be encoded under RFC 1738
puts "       RFC1738: " + ascii.map {|c| unrsvd1738.index(c).nil? ? c : ' ' }.join

# Characters that ECMAScript encodeURI() encodes
puts "ECMA encodeURI: " + ascii.map {|c| encodeuri.index(c).nil? ? c : ' ' }.join

# Characters that cannot appear in a query
puts "     not query: " + ascii.map {|c| query.index(c).nil? ? c : ' ' }.join

# Taken from Ruby URI::UNSAFE /[^-_.!~*'()a-zA-Z\d;\/?:@&=+$,\[\]]/n
rubysafe = %q{-_.!~*'();/?:@&=+$,[]}
puts "rubyURI.escape: " + ascii.map {|c| rubysafe.index(c).nil? ? c : ' ' }.join

The result:

* Characters that get encoded

           all: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
       RFC3986: !"#$%&'()*+,  /:;<=>?@[\]^ `{|}
       RFC2396:  "#$%&    +,  /:;<=>?@[\]^ `{|}
       RFC1738:  "# %&        /:;<=>?@[\]^ `{|}~
ECMA encodeURI:  "  %            < >  [\]^ `{|}
     not query:  "# %            < >  [\]^ `{|}
rubyURI.escape:  "# %            < >   \ ^ `{|}

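From top to bottom:

All ASCII symbols.
Characters that RFC 3986 says should be encoded unless the scheme permits them, that is, everything other than the unreserved characters. PHP's rawurlencode() implements this.
RFC 2396 is obsolete and shown only for reference. ECMAScript's encodeURIComponent() implements this.
Characters that should be encoded under RFC 1738.
Characters that ECMAScript's encodeURI() encodes.
Characters that cannot be used in an RFC 3986 query. These must be encoded in a query.
Characters that Ruby's URI.escape() encodes by default: URI::UNSAFE.

I think Ruby's URI.escape(), like ECMAScript's encodeURI(), is meant for encoding a whole URI, and yet it does not encode []. Where did that spec come from? Being able to set the character set freely is nice, but the default does not seem to have much use.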