Downloading and processing


The training dataset uses Reddit data from 2011 to 2012. It can be built with this script, which downloads raw data from a third-party dump and extracts comparable pairs of comments for classification tasks.

git clone
cd DialogRPT


The testing dataset uses Reddit data from 2013. It can be downloaded here or with the command below.


Data Statistics

| Task | Training size | Testing size |
|---|---|---|
| updown | 40.7 M | > 5k |
| width | 22.3 M | > 5k |
| depth | 25.1 M | > 5k |

Data example

The table below shows one line from a test tsv file (test/human_feedback/updown.tsv). Each line of the tsv file contains the following 11 tab-delimited columns.

| Index | Column | Example |
|---|---|---|
| 0 | Context | Dear Redditors. What do you consider the most important invention in human history? |
| 1 | Response_A | Still the Wheel… |
| 2 | Response_B | every invention has been the most important until the next one. One of the recent important ones was a clock. That’s what allowed accurage navigation across Longitutde. |
| 3 | Context_IDs | t3_1tvflj |
| 4 | Response_A_ID | t1_cebtqav |
| 5 | Response_B_ID | t1_cebtpsw |
| 6 | Hour_Gap | 0.03 |
| 7 | Response_A_Feedback | 31 |
| 8 | Response_B_Feedback | -3 |
| 9 | Response_A_Normalized_Rank | 0.8261 |
| 10 | Response_B_Normalized_Rank | 0.0000 |
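
As a minimal sketch of how such a line could be read, the snippet below splits one tab-delimited line into the 11 columns listed above and converts the numeric fields. The function name `parse_line` is hypothetical and not part of the repository. The sketch also assumes the file has no header row and that multi-turn contexts use `_EOS_` as the turn delimiter.

```python
# Column names in file order, taken from the example table above.
COLUMNS = [
    "Context", "Response_A", "Response_B",
    "Context_IDs", "Response_A_ID", "Response_B_ID",
    "Hour_Gap",
    "Response_A_Feedback", "Response_B_Feedback",
    "Response_A_Normalized_Rank", "Response_B_Normalized_Rank",
]


def parse_line(line):
    """Split one tab-delimited line into a dict with typed values.

    Hypothetical helper for illustration; not part of the repository.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} columns, got {len(fields)}")
    row = dict(zip(COLUMNS, fields))
    # A multi-turn context is stored as one string with _EOS_ between turns.
    row["Context"] = [turn.strip() for turn in row["Context"].split("_EOS_")]
    # Context IDs are separated by spaces when there are several turns.
    row["Context_IDs"] = row["Context_IDs"].split()
    row["Hour_Gap"] = float(row["Hour_Gap"])
    for side in ("A", "B"):
        row[f"Response_{side}_Feedback"] = int(row[f"Response_{side}_Feedback"])
        row[f"Response_{side}_Normalized_Rank"] = float(
            row[f"Response_{side}_Normalized_Rank"]
        )
    return row


# The example line from the table above, rebuilt with tab separators.
example = "\t".join([
    "Dear Redditors. What do you consider the most important invention in human history?",
    "Still the Wheel…",
    "every invention has been the most important until the next one. "
    "One of the recent important ones was a clock. "
    "That’s what allowed accurage navigation across Longitutde.",
    "t3_1tvflj", "t1_cebtqav", "t1_cebtpsw",
    "0.03", "31", "-3", "0.8261", "0.0000",
])

row = parse_line(example)
print(row["Response_A_Feedback"], row["Response_B_Feedback"])
```

To read a whole file, one could call `parse_line` on each line of the opened tsv file (with `encoding="utf-8"`).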

Header Description

The table below describes each column of the tsv file.

| Column | Type | Description |
|---|---|---|
| Context | str | Context text; if multi-turn, turns are delimited by _EOS_ |
| Response | str | A response to the given Context |
| Context_IDs | str | Reddit IDs of the turns in Context, separated by spaces if multi-turn |
| Response_ID | str | Reddit ID of Response |
| Hour_Gap | float | The difference (in hours) between the creation times of Response_A and Response_B |
| Response_Feedback | int | Depends on the task. updown: number of upvotes; width: number of direct replies; depth: number of replies in its longest follow-up thread |
| Response_Normalized_Rank | float | If all m responses of Context are ranked by Response_Feedback, the i-th response gets a normalized rank of i/(m-1). i starts from 0, and i==0 indicates the lowest |
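
The normalized rank can be illustrated with a short sketch. The function name `normalized_ranks` and the feedback values are made up for illustration. Tie handling is not specified in this document, so the sketch simply keeps tied responses in input order, which is an assumption.

```python
def normalized_ranks(feedback):
    """Return i/(m-1) for each response, where i is the response's 0-based
    position when all m responses are sorted by feedback, lowest first."""
    m = len(feedback)
    if m < 2:
        raise ValueError("need at least two responses to normalize ranks")
    # sorted() is stable, so tied responses keep their input order
    # (an assumption; tie handling is not described in this document).
    order = sorted(range(m), key=lambda k: feedback[k])
    ranks = [0.0] * m
    for i, k in enumerate(order):
        ranks[k] = i / (m - 1)
    return ranks


# Hypothetical feedback values for three responses to the same context.
print(normalized_ranks([10, -2, 3]))  # [1.0, 0.0, 0.5]
```

The response with the lowest feedback gets 0 and the one with the highest gets 1, matching the convention that i==0 indicates the lowest.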

Data file structure

This illustrates the expected file structure of the training dataset:

├── data
   └── bz2
       ├── RC_2011-01.bz2          # downloaded
       ├── RS_2011-01.bz2
       ├── ...
   └── jsonl
       ├── 2011-01_edges.tsv       # generated by `python src/ bz2`
       ├── 2011-01_nodes.jsonl
       ├── 2011-01_roots.jsonl
       ├── ...
   └── subs
       ├── AskReddit
           ├── 2011_feedback.tsv   # generated by `python src/ basic`
           ├── 2011_time.tsv
           ├── 2011_trees.pkl
           ├── 2011_txt.tsv
           ├── 2011_updown.tsv     # generated by `python src/ updown`
           ├── 2011_updown_ids.tsv
           ├── 2011_depth.tsv      # generated by `python src/ depth`
           ├── 2011_depth_ids.tsv
           ├── 2011_width.tsv      # generated by `python src/ width`
           ├── 2011_width_ids.tsv
           └── ...
       └── ...
   └── out
       ├── updown     # generated by `python src/ updown`
           ├── raw.tsv
           ├── raw.tsv.train
           ├── raw.tsv.vali
           ├── train.tsv
           ├── vali.tsv
       ├── depth      # generated by `python src/ depth`
           └── ...
       └── width      # generated by `python src/ width`
           └── ...
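
To check a local copy against this layout, a small sketch like the one below may help. It only builds the per-subreddit and per-task paths shown in the tree above and reports which ones are missing. The helper name `expected_files` is hypothetical, and the sketch assumes the `data` folder sits in the current working directory.

```python
from pathlib import Path


def expected_files(root, sub, year, task):
    """List the files the tree above shows for one subreddit, year and task.

    Hypothetical helper for illustration; not part of the repository.
    """
    root = Path(root)
    sub_dir = root / "subs" / sub
    out_dir = root / "out" / task
    return [
        # files shown for the `basic` step
        sub_dir / f"{year}_feedback.tsv",
        sub_dir / f"{year}_time.tsv",
        sub_dir / f"{year}_trees.pkl",
        sub_dir / f"{year}_txt.tsv",
        # files shown for the task step (updown, depth or width)
        sub_dir / f"{year}_{task}.tsv",
        sub_dir / f"{year}_{task}_ids.tsv",
        # merged output for the task
        out_dir / "raw.tsv",
        out_dir / "raw.tsv.train",
        out_dir / "raw.tsv.vali",
        out_dir / "train.tsv",
        out_dir / "vali.tsv",
    ]


paths = expected_files("data", "AskReddit", 2011, "updown")
missing = [p for p in paths if not p.exists()]
print(f"{len(missing)} of {len(paths)} expected files are missing")
```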
