Scalable SPARQL querying using path partitioning

Buwen Wu, Yongluan Zhou, Pingpeng Yuan, Ling Liu, Hai Jin

Publication: Contribution to book/anthology/report/conference proceeding › Conference contribution in proceedings › Research › peer-reviewed

Abstract

The emerging need for conducting complex analysis over big RDF datasets calls for scale-out solutions that can harness a computing cluster to process them. Queries over RDF data often involve complex self-joins, which are very expensive to run if the data are not carefully partitioned across the cluster, because distributed joins over massive amounts of data then become necessary. Existing RDF data partitioning methods can nicely localize simple queries but still need to resort to expensive distributed joins for more complex queries. In this paper, we propose a new data partitioning approach that makes use of the rich structural information in RDF datasets and minimizes the amount of data that has to be joined across different computing nodes. We conduct an extensive experimental study using two popular RDF benchmark datasets and one real RDF dataset that contain up to billions of RDF triples. The results indicate that our approach can produce a balanced, low-redundancy data partitioning scheme that can avoid or largely reduce the cost of distributed joins even for very complicated queries. In terms of query execution time, our approach can outperform the state-of-the-art methods by orders of magnitude.
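To make the self-join problem in the abstract concrete, here is a minimal Python sketch. It is not the paper's algorithm: the toy triples, the `checksum` placement function and the `make_by_path` heuristic (store each triple with the sink vertex its object eventually leads to) are all assumptions made purely for illustration. It evaluates a two-hop pattern as a self-join of the triple table and counts how many joining triple pairs end up on different nodes under a subject-based placement versus a placement that keeps each path together.

```python
from collections import defaultdict

# Toy RDF graph as (subject, predicate, object) triples. All names are invented.
TRIPLES = [
    ("alice", "advisor", "bob"),
    ("bob", "worksFor", "uni1"),
    ("carol", "advisor", "dave"),
    ("dave", "worksFor", "uni2"),
    ("erin", "advisor", "bob"),
]
NUM_NODES = 2  # hypothetical cluster size


def checksum(term):
    """Deterministic stand-in for a hash function (Python's str hash is salted)."""
    return sum(ord(ch) for ch in term)


def two_hop(triples, p1, p2):
    """Evaluate (?x p1 ?y) . (?y p2 ?z) as a self-join of the triple table on ?y."""
    objects_by_subject = defaultdict(list)
    for s, p, o in triples:
        if p == p2:
            objects_by_subject[s].append(o)
    return [(s, o, z) for s, p, o in triples if p == p1
            for z in objects_by_subject.get(o, [])]


def by_subject(triple):
    """Place a triple on a node chosen from its subject alone."""
    return checksum(triple[0]) % NUM_NODES


def make_by_path(triples):
    """Place a triple with the sink vertex that its object eventually leads to."""
    # Simplification: the toy graph has at most one outgoing edge per vertex.
    next_hop = {s: o for s, _, o in triples}

    def sink(vertex):
        seen = set()  # guards against cycles
        while vertex in next_hop and vertex not in seen:
            seen.add(vertex)
            vertex = next_hop[vertex]
        return vertex

    return lambda triple: checksum(sink(triple[2])) % NUM_NODES


def cross_node_joins(triples, p1, p2, node_of):
    """Count joining triple pairs that are stored on different nodes."""
    return sum(
        1
        for t1 in triples if t1[1] == p1
        for t2 in triples if t2[1] == p2
        if t1[2] == t2[0] and node_of(t1) != node_of(t2)
    )


if __name__ == "__main__":
    print(two_hop(TRIPLES, "advisor", "worksFor"))
    print(cross_node_joins(TRIPLES, "advisor", "worksFor", by_subject))
    print(cross_node_joins(TRIPLES, "advisor", "worksFor", make_by_path(TRIPLES)))
```

On this toy data the subject-based placement separates all three joining pairs, so every join result needs data from two nodes, while the path-aware placement separates none. Real datasets have branching and overlapping paths, which is where balance and redundancy become the hard part.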
Original language: English
Title: Proceedings of the 31st IEEE International Conference on Data Engineering
Number of pages: 12
Publisher: IEEE
Publication date: 2015
Pages: 795-806
DOI: https://doi.org/10.1109/ICDE.2015.7113334
Status: Published - 2015
Event: 31st IEEE International Conference on Data Engineering - Seoul, South Korea
Duration: 13 Apr 2015 to 17 Apr 2015

Conference

Conference: 31st IEEE International Conference on Data Engineering
Country: South Korea
City: Seoul
Period: 13/04/2015 to 17/04/2015

Fingerprint

Cluster computing
Costs

Cite this

Wu, B., Zhou, Y., Yuan, P., Liu, L., & Jin, H. (2015). Scalable SPARQL querying using path partitioning. In Proceedings of the 31st IEEE International Conference on Data Engineering (pp. 795-806). IEEE. https://doi.org/10.1109/ICDE.2015.7113334
Wu, Buwen ; Zhou, Yongluan ; Yuan, Pingpeng ; Liu, Ling ; Jin, Hai. / Scalable SPARQL querying using path partitioning. Proceedings of the 31st IEEE International Conference on Data Engineering. IEEE, 2015. pp. 795-806.
@inproceedings{c26ba6659f0c486585f6667c3c418674,
title = "Scalable SPARQL querying using path partitioning",
author = "Buwen Wu and Yongluan Zhou and Pingpeng Yuan and Ling Liu and Hai Jin",
year = "2015",
doi = "10.1109/ICDE.2015.7113334",
language = "English",
pages = "795--806",
booktitle = "Proceedings of the 31st IEEE International Conference on Data Engineering",
publisher = "IEEE",
address = "United States",

}

Wu, B, Zhou, Y, Yuan, P, Liu, L & Jin, H 2015, Scalable SPARQL querying using path partitioning. in Proceedings of the 31st IEEE International Conference on Data Engineering. IEEE, pp. 795-806, 31st IEEE International Conference on Data Engineering, Seoul, South Korea, 13/04/2015. https://doi.org/10.1109/ICDE.2015.7113334

Scalable SPARQL querying using path partitioning. / Wu, Buwen; Zhou, Yongluan; Yuan, Pingpeng; Liu, Ling; Jin, Hai.

Proceedings of the 31st IEEE International Conference on Data Engineering. IEEE, 2015. pp. 795-806.


TY - GEN
T1 - Scalable SPARQL querying using path partitioning
AU - Wu, Buwen
AU - Zhou, Yongluan
AU - Yuan, Pingpeng
AU - Liu, Ling
AU - Jin, Hai
PY - 2015
Y1 - 2015
U2 - 10.1109/ICDE.2015.7113334
DO - 10.1109/ICDE.2015.7113334
M3 - Article in proceedings
SP - 795
EP - 806
BT - Proceedings of the 31st IEEE International Conference on Data Engineering
PB - IEEE
ER -

Wu B, Zhou Y, Yuan P, Liu L, Jin H. Scalable SPARQL querying using path partitioning. In Proceedings of the 31st IEEE International Conference on Data Engineering. IEEE. 2015. p. 795-806. https://doi.org/10.1109/ICDE.2015.7113334