Dataset Details | CSQA dataset

Question Nomenclature

Question type | Question sub-type | Question sub-sub-type
ques_type_id=1
Simple Question (subject-based)
ques_type_id=2
Secondary question
sec_ques_type=1
Subject based question
sec_ques_sub_type=1
Direct (Singular)

sec_ques_sub_type=2
Indirect (Singular)

sec_ques_sub_type=3
Indirect (Plural)

sec_ques_sub_type=4
Direct (Plural)
sec_ques_type=2
Object based question
ques_type_id=3
Clarification (for secondary) question
ques_type_id=4
Set-based question
set_op_choice=1
OR

set_op_choice=2
AND

set_op_choice=3
Difference
is_inc=1
Incomplete version of a set-based question
ques_type_id=5
Boolean (Factual Verification) question
bool_ques_type = 1
Verification | 2 entities, both direct

bool_ques_type = 2
Verification | 2 entities, one direct and one indirect, subject is indirect

bool_ques_type = 3
Verification | 2 entities, one direct and one indirect, object is indirect

bool_ques_type = 4
Verification | 3 entities, all direct, 2 are query entities

bool_ques_type = 5
Verification | 3 entities, 2 direct, 2 (direct) are query entities, subject is indirect

bool_ques_type = 6
Verification | one entity, multiple entities (as object) referred to indirectly
ques_type_id=6
Incomplete question (for secondary)
inc_ques_type=1
Incomplete | object parent is changed, subject and predicate remain the same

inc_ques_type=2
Only subject is changed, parent and predicate remain the same

inc_ques_type=3
Incomplete count-based question
ques_type_id=7
Comparative and Quantitative questions (involving single entity)
count_ques_sub_type=1
Quantitative (count) single entity

count_ques_sub_type=2
Quantitative (min/max) single entity

count_ques_sub_type=3
Quantitative (at least/at most) single entity (which)

count_ques_sub_type=5
Quantitative (at least/at most) single entity (count)

count_ques_sub_type=7
Quantitative Indirect (count) single entity

is_incomplete=1
Incomplete form (of the category of question)
count_ques_sub_type=4
Comparative (more/less) single entity (count)

count_ques_sub_type=6
Comparative (more/less) single entity (which)

count_ques_sub_type=8
Comparative Indirect (more/less) single entity (count)

count_ques_sub_type=9
Comparative Indirect (more/less) single entity (which)
ques_type_id=8
Comparative and Quantitative questions (involving multiple (2) entities)
count_ques_sub_type=1
Quantitative with Logical Operators

count_ques_sub_type=2
Quantitative (count) multiple entity

count_ques_sub_type=3
Quantitative (min/max) multiple entity

count_ques_sub_type=4
Quantitative (at least/at most) multiple entity (which)

count_ques_sub_type=6
Quantitative (at least/at most) multiple entity (count)

count_ques_sub_type=8
Quantitative Indirect (count) multiple entity

is_incomplete=1
Incomplete form (of the category of question)
count_ques_sub_type=5
Comparative (more/less) multiple entity (count)

count_ques_sub_type=7
Comparative (more/less) multiple entity (which)

count_ques_sub_type=9
Comparative Indirect (more/less) multiple entity (count)

count_ques_sub_type=10
Comparative Indirect (more/less) multiple entity (which)
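
The nomenclature above can be turned into a small lookup for labelling records. The following is a minimal sketch, not part of the dataset's own tooling: it assumes each question is available as a dictionary whose keys match the field names in the table (`ques_type_id`, `set_op_choice`, `is_inc`), and the helper name `describe_question` is hypothetical. Only the set-based sub-types are decoded here; the other sub-type fields could be added in the same way.

```python
# Top-level question types, keyed by ques_type_id (labels taken from the table).
QUES_TYPE_NAMES = {
    1: "Simple Question (subject-based)",
    2: "Secondary question",
    3: "Clarification (for secondary) question",
    4: "Set-based question",
    5: "Boolean (Factual Verification) question",
    6: "Incomplete question (for secondary)",
    7: "Comparative and Quantitative questions (involving single entity)",
    8: "Comparative and Quantitative questions (involving multiple (2) entities)",
}

# Sub-types of set-based questions, keyed by set_op_choice.
SET_OP_NAMES = {1: "OR", 2: "AND", 3: "Difference"}


def describe_question(record):
    """Return a readable label for a record (assumed to be a dict of fields)."""
    type_id = record.get("ques_type_id")
    parts = [QUES_TYPE_NAMES.get(type_id, f"unknown type {type_id}")]
    if type_id == 4:
        op = record.get("set_op_choice")
        if op in SET_OP_NAMES:
            parts.append(SET_OP_NAMES[op])
        if record.get("is_inc") == 1:
            parts.append("incomplete")
    return " | ".join(parts)


print(describe_question({"ques_type_id": 4, "set_op_choice": 2, "is_inc": 1}))
```

Unknown identifiers are reported instead of raising an error, so a pass over the data does not stop at an unexpected value.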

Dataset statistics (Number of QA pairs for each question type)

Question Type Train Valid Test
Simple|Direct 465184 52189 81994
Simple|Indirect 293692 32877 54854
Simple|Incomplete 58627 6658 10045
Comparative|Count over More/Less|Mult. entity type|Direct 36658 3791 7711
Comparative|Count over More/Less|Mult. entity type|Indirect 7783 808 1177
Comparative|Count over More/Less|Mult. entity type|Incomplete 15137 1564 3249
Comparative|Count over More/Less|Single entity type|Direct 47682 4738 5224
Comparative|Count over More/Less|Single entity type|Indirect 9100 932 922
Comparative|Count over More/Less|Single entity type|Incomplete 19324 1929 1972
Comparative|More/Less|Mult. entity type|Direct 36538 3711 7655
Comparative|More/Less|Mult. entity type|Indirect 6797 645 1184
Comparative|More/Less|Mult. entity type|Incomplete 15086 1546 3209
Comparative|More/Less|Single entity type|Direct 47149 4725 5520
Comparative|More/Less|Single entity type|Indirect 7087 736 925
Comparative|More/Less|Single entity type|Incomplete 19107 1910 2064
Logical|Union|Direct 70694 7345 14418
Logical|Intersection|Direct 31205 3278 5708
Logical|Difference|Direct 3726 373 661
Logical|Incomplete 6372 765 1679
Quantitative|At least/At most/Approx. the same/Equal|Mult. entity type|Direct 21110 2161 3910
Quantitative|At least/At most/Approx. the same/Equal|Single entity type|Direct 27613 2790 2306
Quantitative|Count over At least/At most/Approx. the same/Equal|Mult. entity type|Direct 21257 2272 3850
Quantitative|Count over At least/At most/Approx. the same/Equal|Single entity type|Direct 27507 2801 2288
Quantitative|Count|Logical operators|Direct 21734 2089 3753
Quantitative|Count|Logical operators|Indirect 10802 991 2035
Quantitative|Count|Mult. entity type|Direct 24561 2472 4329
Quantitative|Count|Single entity type|Direct 51584 5125 4477
Quantitative|Count|Single entity type|Indirect 15995 1519 2547
Quantitative|Count|Single entity type|Incomplete 20050 1990 -
Verification|Single/Multiple Entity|Direct 47505 5376 10150
Verification|Single/Multiple Entity|Indirect 83325 9218 16578
Clarification (All) 77835 8164 12121
Indirect (All) 407784 45216 75640
Incomplete (All) 172957 18341 23220
Logical|Multiple Relations|Direct 49970 5164 9598
Quantitative|Min/Max|Single entity type 29409 2942 342
Quantitative|Min/Max|Mult. entity type 21098 2133 2695
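
Each row of the table above is a `|`-separated category label followed by the Train, Valid and Test counts, with `-` marking a missing value. A minimal sketch of reading such a row is given below. It assumes the rows are available as plain-text lines in exactly this layout; the names `StatsRow`, `parse_count` and `parse_stats_row` are hypothetical helpers, not part of any released CSQA code.

```python
from typing import NamedTuple, Optional


class StatsRow(NamedTuple):
    categories: list          # label split on "|", e.g. ["Logical", "Union", "Direct"]
    train: Optional[int]
    valid: Optional[int]
    test: Optional[int]


def parse_count(cell):
    """Convert one count cell to int; "-" means the value is not reported."""
    return None if cell == "-" else int(cell)


def parse_stats_row(line):
    """Split a row into its label and its three trailing count columns."""
    # The label may contain spaces, so split from the right at most 3 times.
    label, train, valid, test = line.rsplit(maxsplit=3)
    return StatsRow(label.split("|"), parse_count(train),
                    parse_count(valid), parse_count(test))


row = parse_stats_row("Logical|Difference|Direct 3726 373 661")
print(row.categories, row.train, row.valid, row.test)
```

Splitting from the right keeps labels such as `Clarification (All)` intact, since only the last three whitespace-separated fields are treated as counts.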

Overall Dataset Statistics

Dataset Statistics Train Valid Test
Total No. of Dialogs (chat sessions) 152391 16813 27797
Avg. No. of Utterances per dialog 15.9 15.65 19.44
Total No. of Utterances having Question/Answer 1.2M 0.13M 0.27M
Avg. length of user’s question (in words) 9.7 9.68 10.28
Avg. length of system’s response (in words) 4.74 4.67 4.37
Avg. No. of Dialog states per dialog 3.89 3.84 4.53
Vocab size (freq>=10) 0.1M - -
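
The figures above can be cross-checked with simple arithmetic. The sketch below multiplies the number of dialogs by the average number of utterances per dialog, then halves the result. The halving rests on an assumption that is not stated in the table: that utterances alternate between a user question and a system answer, so that about half of them fall in each role. Under that assumption the results land close to the reported 1.2M, 0.13M and 0.27M.

```python
# (dialogs, avg. utterances per dialog) for each split, copied from the table.
OVERALL = {
    "train": (152391, 15.9),
    "valid": (16813, 15.65),
    "test": (27797, 19.44),
}


def estimated_half_utterances(dialogs, avg_utterances):
    """Estimate total utterances and return half of that figure.

    Assumption (not from the table): questions and answers alternate,
    so each role accounts for roughly half of all utterances.
    """
    return dialogs * avg_utterances / 2


for split, (dialogs, avg) in OVERALL.items():
    estimate = estimated_half_utterances(dialogs, avg)
    print(f"{split}: about {estimate / 1e6:.2f}M")
```

This is only a consistency check on the published figures, not a definition of how the reported totals were computed.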