Reddit Dialogue Feedback Dataset

A dataset for learning which dialogue response gets better human feedback.


Dataset

Downloading and processing

Training

The training set uses Reddit data from 2011 to 2012. It can be built with the data.sh script shown below, which downloads raw data from a third-party dump and extracts comparable pairs of comments for the classification tasks.

git clone https://github.com/golsun/DialogRPT
cd DialogRPT
sh data.sh

Testing

The testing set uses Reddit data from 2013. It can be downloaded here, or with the commands below.

wget https://xiagnlp2.blob.core.windows.net/dialogrpt/test.zip
unzip test.zip

Data Statistics

Task     Training size   Testing size
updown   40.7 M          > 5k
width    22.3 M          > 5k
depth    25.1 M          > 5k

Data example

The table below shows one line from a test tsv file (test/human_feedback/updown.tsv). Each line of the tsv file contains the following 11 columns, separated by tabs; a minimal parsing sketch in Python follows the table.

Index  Column                      Example
0      Context                     Dear Redditors. What do you consider the most important invention in human history?
1      Response_A                  Still the Wheel…
2      Response_B                  every invention has been the most important until the next one. One of the recent important ones was a clock. That’s what allowed accurage navigation across Longitutde.
3      Context_IDs                 t3_1tvflj
4      Response_A_ID               t1_cebtqav
5      Response_B_ID               t1_cebtpsw
6      Hour_Gap                    0.03
7      Response_A_Feedback         31
8      Response_B_Feedback         -3
9      Response_A_Normalized_Rank  0.8261
10     Response_B_Normalized_Rank  0.0000
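To make the column layout concrete, here is a minimal Python sketch of reading such a file into named fields. Only the column order comes from the table above; the helper name read_pairs and the field names are hypothetical, and the path assumes the test set was unzipped into test/.

import csv

# Column order as documented above (11 tab-separated fields per line).
COLUMNS = [
    "context", "response_a", "response_b",
    "context_ids", "response_a_id", "response_b_id",
    "hour_gap",
    "response_a_feedback", "response_b_feedback",
    "response_a_norm_rank", "response_b_norm_rank",
]

def read_pairs(path="test/human_feedback/updown.tsv"):
    """Yield each comparable pair as a dict keyed by the names in COLUMNS."""
    with open(path, encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE):
            pair = dict(zip(COLUMNS, row))
            # Cast the numeric columns.
            pair["hour_gap"] = float(pair["hour_gap"])
            for k in ("response_a_feedback", "response_b_feedback"):
                pair[k] = int(pair[k])
            for k in ("response_a_norm_rank", "response_b_norm_rank"):
                pair[k] = float(pair[k])
            yield pair

# Usage: print the context and both feedback scores of the first pair.
first = next(read_pairs())
print(first["context"], first["response_a_feedback"], first["response_b_feedback"])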

Header Description

The table below describes each column of the tsv file; a short sketch of the normalized-rank computation follows it.

Column                    Type   Description
Context                   str    if multi-turn, the turns are delimited by _EOS_
Response                  str    a response to the given Context
Context_IDs               str    Reddit IDs of the turns in Context, separated by spaces if multi-turn
Response_ID               str    Reddit ID of the Response
Hour_Gap                  float  the difference (in hours) between the times when Response_A and Response_B were created
Response_Feedback         int    depends on the task:
                                 - updown: number of upvotes;
                                 - width: number of direct replies;
                                 - depth: number of replies in its longest follow-up thread
Response_Normalized_Rank  float  if all m responses to a Context are ranked by Response_Feedback in ascending order, the i-th response gets a normalized rank of i/(m-1); i starts from 0, and i == 0 indicates the response with the lowest feedback
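As a concrete illustration of the formula above, the sketch below (a hypothetical helper, not code from the repository) ranks the m feedback values of one Context in ascending order and assigns i/(m-1) to the i-th response. The feedback values 31 and -3 are taken from the example row; the other two are invented for illustration, and tie-breaking may differ from how the released data was built.

def normalized_ranks(feedbacks):
    """Map each feedback value to its normalized rank i/(m-1).

    Responses are sorted by feedback in ascending order, so the response
    with the lowest feedback gets rank 0.0 and the highest gets 1.0.
    """
    m = len(feedbacks)
    order = sorted(range(m), key=lambda j: feedbacks[j])  # indices, lowest feedback first
    ranks = [0.0] * m
    for i, j in enumerate(order):
        ranks[j] = i / (m - 1) if m > 1 else 0.0
    return ranks

# Example: four responses to the same Context.
print(normalized_ranks([31, -3, 5, 0]))  # -> [1.0, 0.0, 0.667, 0.333] (approximately)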

Data file structure

The tree below illustrates the expected file structure of the training dataset.

├── data
│   ├── bz2
│   │   ├── RC_2011-01.bz2          # downloaded
│   │   ├── RS_2011-01.bz2
│   │   └── ...
│   ├── jsonl
│   │   ├── 2011-01_edges.tsv       # generated by `python src/data.py bz2`
│   │   ├── 2011-01_nodes.jsonl
│   │   ├── 2011-01_roots.jsonl
│   │   └── ...
│   ├── subs
│   │   ├── AskReddit
│   │   │   ├── 2011_feedback.tsv   # generated by `python src/data.py basic`
│   │   │   ├── 2011_time.tsv
│   │   │   ├── 2011_trees.pkl
│   │   │   ├── 2011_txt.tsv
│   │   │   ├── 2011_updown.tsv     # generated by `python src/data.py updown`
│   │   │   ├── 2011_updown_ids.tsv
│   │   │   ├── 2011_depth.tsv      # generated by `python src/data.py depth`
│   │   │   ├── 2011_depth_ids.tsv
│   │   │   ├── 2011_width.tsv      # generated by `python src/data.py width`
│   │   │   ├── 2011_width_ids.tsv
│   │   │   └── ...
│   │   └── ...
│   └── out
│       ├── updown                  # generated by `python src/data.py updown`
│       │   ├── raw.tsv
│       │   ├── raw.tsv.train
│       │   ├── raw.tsv.vali
│       │   ├── train.tsv
│       │   └── vali.tsv
│       ├── depth                   # generated by `python src/data.py depth`
│       │   └── ...
│       └── width                   # generated by `python src/data.py width`
│           └── ...