Go back Home | View on Github | EMNLP Paper |
Dataset
Downloading and processing
Training
The training dataset uses Reddit data from 2011 to 2012. It can be built with this script, which downloads raw data from a third-party dump and extracts comparable pairs of comments for classification tasks.
git clone https://github.com/golsun/DialogRPT
cd DialogRPT
sh data.sh
Testing
The testing dataset uses Reddit data from 2013. It can be downloaded here or with the commands below:
wget https://xiagnlp2.blob.core.windows.net/dialogrpt/test.zip
unzip test.zip
Data Statistics
Task | Training size | Testing size |
---|---|---|
updown | 40.7 M | > 5k |
width | 22.3 M | > 5k |
depth | 25.1 M | > 5k |
Data example
The table below shows one line from a test tsv file (test/human_feedback/updown.tsv). Each line of the tsv file contains the following 11 columns delimited by tab.
Index | Column | Example |
---|---|---|
0 | Context | Dear Redditors. What do you consider the most important invention in human history? |
1 | Response_A | Still the Wheel… |
2 | Response_B | every invention has been the most important until the next one. One of the recent important ones was a clock. That’s what allowed accurage navigation across Longitutde. |
3 | Context_IDs | t3_1tvflj |
4 | Response_A_ID | t1_cebtqav |
5 | Response_B_ID | t1_cebtpsw |
6 | Hour_Gap | 0.03 |
7 | Response_A_Feedback | 31 |
8 | Response_B_Feedback | -3 |
9 | Response_A_Normalized_Rank | 0.8261 |
10 | Response_B_Normalized_Rank | 0.0000 |
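As a rough illustration of the 11-column layout, the sketch below parses one tsv line into named fields. The names `FeedbackPair` and `parse_line` are hypothetical and not part of the repository. It assumes the file has no header row, that feedback values are stored as integer literals, and that multi-turn contexts and their IDs are split on `_EOS_` and spaces as stated in the header description.

```python
from typing import List, NamedTuple

NUM_COLUMNS = 11  # column count stated for the test tsv files


class FeedbackPair(NamedTuple):
    """One tsv line: a context and two responses with their feedback (hypothetical container)."""
    context_turns: List[str]
    response_a: str
    response_b: str
    context_ids: List[str]
    response_a_id: str
    response_b_id: str
    hour_gap: float
    feedback_a: int
    feedback_b: int
    rank_a: float
    rank_b: float


def parse_line(line: str) -> FeedbackPair:
    """Split a tab-delimited line into the 11 columns and convert the numeric ones."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != NUM_COLUMNS:
        raise ValueError(f"expected {NUM_COLUMNS} columns, got {len(fields)}")
    return FeedbackPair(
        context_turns=[turn.strip() for turn in fields[0].split("_EOS_")],
        response_a=fields[1],
        response_b=fields[2],
        context_ids=fields[3].split(),  # space-separated if multi-turn
        response_a_id=fields[4],
        response_b_id=fields[5],
        hour_gap=float(fields[6]),
        feedback_a=int(fields[7]),  # assumption: integer literal in the file
        feedback_b=int(fields[8]),
        rank_a=float(fields[9]),
        rank_b=float(fields[10]),
    )


# Usage with the example row from the table above.
example_line = "\t".join([
    "Dear Redditors. What do you consider the most important invention in human history?",
    "Still the Wheel…",
    "every invention has been the most important until the next one. One of the recent "
    "important ones was a clock. That’s what allowed accurage navigation across Longitutde.",
    "t3_1tvflj",
    "t1_cebtqav",
    "t1_cebtpsw",
    "0.03",
    "31",
    "-3",
    "0.8261",
    "0.0000",
])
pair = parse_line(example_line)
print(pair.response_a_id, pair.feedback_a - pair.feedback_b)  # t1_cebtqav 34
```

To read a whole file, the same function could be applied to each line of an open file handle; whether any line needs to be skipped depends on the actual file contents.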
Header Description
The table below describes each column of the tsv file.
Column | Type | Description |
---|---|---|
Context | str | if multi-turn, turns are delimited by _EOS_ |
Response | str | a response to the given Context |
Context_IDs | str | Reddit IDs of the turns in Context, separated by spaces if multi-turn |
Response_ID | str | Reddit ID of Response |
Hour_Gap | float | the difference (in hours) between the creation times of Response_A and Response_B |
Response_Feedback | int | depends on the task: updown: number of upvotes; width: number of direct replies; depth: number of replies in its longest follow-up thread |
Response_Normalized_Rank | float | if all m responses of Context are ranked by Response_Feedback, the i-th response gets a normalized rank of i/(m-1). i starts from 0, and i==0 indicates the lowest |
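The normalized rank formula i/(m-1) can be sketched as follows. This is a minimal illustration, not the repository's implementation: the function name `normalized_ranks` is hypothetical, and the tie-breaking rule (equal feedback keeps input order) is an assumption, since the description above does not say how ties are handled.

```python
from typing import Dict


def normalized_ranks(feedback: Dict[str, int]) -> Dict[str, float]:
    """Map each response ID to i/(m-1), where i is its 0-based position
    after sorting all m responses of one context by feedback, lowest first."""
    m = len(feedback)
    if m < 2:
        raise ValueError("need at least two responses to compute i/(m-1)")
    # sorted() is stable, so ties keep insertion order (assumed tie-breaking).
    ordered = sorted(feedback, key=lambda response_id: feedback[response_id])
    return {response_id: i / (m - 1) for i, response_id in enumerate(ordered)}


# Usage with made-up response IDs and feedback values.
ranks = normalized_ranks({"r1": -3, "r2": 31, "r3": 5})
print(ranks)  # {'r1': 0.0, 'r3': 0.5, 'r2': 1.0}
```

Under this definition the lowest-feedback response always gets 0.0 and the highest gets 1.0, which matches the 0.0000 rank of Response_B in the data example.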
Data file structure
This illustrates the expected file structure of the training dataset
├── data
    └── bz2
        ├── RC_2011-01.bz2              # downloaded
        ├── RS_2011-01.bz2
        ├── ...
    └── jsonl
        ├── 2011-01_edges.tsv           # generated by `python src/data.py bz2`
        ├── 2011-01_nodes.jsonl
        ├── 2011-01_roots.jsonl
        ├── ...
    └── subs
        ├── AskReddit
            ├── 2011_feedback.tsv       # generated by `python src/data.py basic`
            ├── 2011_time.tsv
            ├── 2011_trees.pkl
            ├── 2011_txt.tsv
            ├── 2011_updown.tsv         # generated by `python src/data.py updown`
            ├── 2011_updown_ids.tsv
            ├── 2011_depth.tsv          # generated by `python src/data.py depth`
            ├── 2011_depth_ids.tsv
            ├── 2011_width.tsv          # generated by `python src/data.py width`
            ├── 2011_width_ids.tsv
            └── ...
        └── ...
    └── out
        ├── updown                      # generated by `python src/data.py updown`
            ├── raw.tsv
            ├── raw.tsv.train
            ├── raw.tsv.vali
            ├── train.tsv
            ├── vali.tsv
        ├── depth                       # generated by `python src/data.py depth`
            └── ...
        └── width                       # generated by `python src/data.py width`
            └── ...
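To check a local build against this layout, a small helper along the following lines may be useful. It is only a sketch: the function names are hypothetical, the file names are copied from the `subs` part of the tree above, and it assumes the data root is a folder named `data` relative to the working directory.

```python
from pathlib import Path
from typing import List

TASKS = ("updown", "depth", "width")


def expected_sub_files(root: Path, sub: str, year: int) -> List[Path]:
    """List the per-year files the tree above shows for one subreddit folder."""
    folder = root / "subs" / sub
    names = ["feedback.tsv", "time.tsv", "trees.pkl", "txt.tsv"]
    for task in TASKS:
        names += [f"{task}.tsv", f"{task}_ids.tsv"]
    return [folder / f"{year}_{name}" for name in names]


def missing_files(root: Path, sub: str, year: int) -> List[Path]:
    """Return the expected files that do not exist on disk."""
    return [path for path in expected_sub_files(root, sub, year) if not path.exists()]


if __name__ == "__main__":
    # Assumed data root; adjust to wherever the dataset was built.
    for path in missing_files(Path("data"), "AskReddit", 2011):
        print("missing:", path)
```

An empty output means all ten per-year files listed in the tree are present for that subreddit and year; it does not verify their contents.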